Summarisation Procedure
At the pre-processing stage, the keyword extraction algorithm selects the document’s
keywords according to a pure TFIDF procedure. We propose computing the probability
of each word appearing in the given text, which is simply the ratio of the number
of times the word appears in the text to the total number of words in the document.
In our algorithm, we select the words that fall in the top N percent of the keyword
ranking array, where N is given parametrically.
This effectively gives the user the choice to eliminate common words from the summarisation
process. #Calchas uses the term’s TFIDF value as the term weight. This value is
usually less than one (or, in extraordinary cases, equal to one), which means that
when raised to a power it is much less responsive than the term probability, which
is usually below 0.05. We will see where this comes into play below.
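The probability computation and top-N-percent cut-off described above can be sketched as follows (the function names and the rounding of the cut-off are illustrative choices of this sketch, not part of #Calchas):

```python
from collections import Counter

def keyword_probabilities(words):
    """Probability of each word: occurrences over total words in the text."""
    total = len(words)
    counts = Counter(words)
    return {w: c / total for w, c in counts.items()}

def top_n_percent(probs, n_percent):
    """Keep only the top N percent of the keyword ranking array."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    cutoff = max(1, int(len(ranked) * n_percent / 100))
    return ranked[:cutoff]
```

Raising the cut-off parameter admits more (and rarer) words into the keyword ranking array, while lowering it effectively strips common words from the summarisation process.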
After the probabilities have been computed, we assign a weight coefficient to each
sentence in the text, which is nothing more than the average probability of words
in that sentence. The selection mechanism is quite interesting even though it proved
inadequate in its simple form for #Calchas. We propose the selection of the best
scoring sentence that contains the highest probability term. This means that we
start at the top of the sentence ranking array and find the first sentence that
contains the topmost term in the keyword ranking array. Then we recompute the
keyword ranking array by squaring the probability of each term that was present
in the selected sentence. Since probabilities are less than one, this recomputation
results in a new keyword ranking array, which in turn is used to recalculate the
sentence ranking array. The authors argue that intuitively we can think of the squared
probability of the included terms as “the probability of the term appearing twice
in the summary text”. If the term’s probability is high enough, squaring it will
not necessarily block the term from being included in the summary again. Essentially,
the summary is made more sensitive to context, since the importance of terms is
updated continuously as sentences are selected in the summary text. Additionally,
words that ranked low in the initial keyword ranking array are given the opportunity
to affect the subsequent choice of summary sentences. Note that the order in which
the sentences are selected does not affect the order they are displayed in the summary;
the summary sentences are ordered according to how they appear in the original document.
The selection and recalculation process is repeated until the parametrically given
summary length is reached.
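A minimal sketch of this selection-and-recalculation loop, in the probability setting that Nenkova and Vanderwende describe (the representation of sentences as word lists and the tie-breaking behaviour are assumptions of the sketch):

```python
def summarise(sentences, probs, summary_len):
    """Greedy selection: pick the best-scoring sentence that contains the
    top-ranked keyword, square the weights of its words, and rescore."""
    weights = dict(probs)

    def score(i):
        ws = [weights.get(w, 0.0) for w in sentences[i]]
        return sum(ws) / len(ws) if ws else 0.0

    chosen = []
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < summary_len:
        top_word = max(weights, key=weights.get)
        # best-scoring sentence that contains the topmost keyword
        candidates = [i for i in remaining if top_word in sentences[i]]
        best = max(candidates or remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
        for w in set(sentences[best]):
            if w in weights:
                weights[w] **= 2   # squaring demotes already-covered terms
    return sorted(chosen)          # summary keeps original document order
```

Note the final sort: as stated above, the order in which sentences are selected does not affect the order in which they appear in the summary.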
In our implementation, however, we do not use the term’s probability as its weight;
we use the term’s TFIDF value instead. As mentioned earlier, this value is typically
much higher than the probability and is consequently much less responsive when raised
to a power.
On the other hand, even in the probability case that Nenkova and Vanderwende describe,
the probability values are all very low and thus squaring them does not necessarily
mean that e.g. the high probability word moves further down the list; it could well
still be in the top-five. Hence, we have introduced a new parameter that handles
summary sensitivity to context. Its value signifies the power to which we will raise
the weights of the words in the selected sentences and as that power increases,
so does the summary’s sensitivity to context. Additionally, we nullify (that is,
set to zero) the TFIDF of terms whose initial TFIDF equals one. This is because
we assert that terms with a TFIDF value of one (i.e. terms that appear only in
the current document) are either bogus results (i.e. nonexistent terms mistakenly
produced by the stemmer) or terms particular to the specific document (e.g. author
names, product names, etc.). Either way, including these terms
would adversely affect the quality of the summary sentences selected and it would
eliminate the effect of the sensitivity-to-context parameter that we have implemented.
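These two modifications, the parametric power and the nullification of terms with TFIDF equal to one, might be sketched as follows (the function and parameter names are illustrative):

```python
def prepare_weights(tfidf):
    """Nullify terms whose TFIDF equals one: we treat these as stemmer
    artefacts or document-specific names (authors, products, etc.)."""
    return {t: 0.0 if v == 1.0 else v for t, v in tfidf.items()}

def dampen(weights, selected_words, power):
    """Raise the weights of the words in a selected sentence to `power`.
    The higher the power, the more context-sensitive the summary."""
    for w in set(selected_words):
        if w in weights:
            weights[w] **= power
    return weights
```

With `power` set to 2 this reduces to the squaring scheme described earlier; larger values demote the words of already-selected sentences more aggressively.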
Unfortunately, the scoring algorithm did not work as planned. The most important
problem was that section headings and sentence fragments (e.g. bullet points) were
selected as summary sentences. Additionally, short sentences with little or no useful
meaning were selected in the summary text, since these short sentences did not include
many terms to lower their average probability weights. In order to correct these
shortcomings we implemented a series of measures. First, we chose not to select
sentences that begin with a non-alphabetic character (e.g. a numeral, a hyphen,
etc.). This eliminated
headings and bullet points from the summary sentences. We also incorporated two
penalty coefficients, the values of which are given parametrically by the user.
The first coefficient is used to penalise short sentences and thus negate the effects
of the short length on the average weights. The definition of “short” is given as
a parameter that states the sentence length under which a sentence is considered
“short”. The second coefficient is the result of an algorithm that aims to improve
the chance of selecting sentences with a high concentration of high probability
words. By selecting only the top-N percent terms with high TFIDF values, the scoring
algorithm sets the TFIDF for other terms to zero. Hence, if the sentence has many
high-probability words (i.e. words included in the top-N percent of the TFIDF table
for that document) then its average weight increases. If however many terms have
zero TFIDF value (i.e. are not included in the table) then the sentence is penalised
and its average weight drops. Note that implementing this coefficient increased
the tendency of the selection mechanism to pick short sentences, and we were forced
to decrease the short-sentence penalty multiplier mentioned above from 0.675 to 0.475
in the default parameter set. The rationalisation behind the implementation of these two coefficients
comes from economics and, more particularly, from contract theory. Since pure probability
(or TFIDF) analysis of the sentence terms cannot provide us with definitive information
about the importance of the sentence to the document summary, we are forced to make
use of the fundamentals of signalling to infer the information that the selection
algorithm requires. Signalling is the notion that, in a negotiation situation, one
party (termed the principal) uses openly available information about the other
party (the agent), information that is useless in its raw form, to make conjectures
about hidden but useful information about that party. In this case, the agent has some information that
is useful to the principal, but cannot pass it on to the principal; the agent can
only pass on signals that help the principal deduce the useful information about
the agent. This is known as asymmetric information. A typical example of signalling
comes from recruiting. Recruiters usually want to know the quality of candidates
with respect to the work environment, e.g. if they are hard-working, dedicated,
intelligent, etc. However, this information is not directly available to them, so
they rely on signals such as the candidates’ level of education, course grades, recommendations, etc.
In our implementation, the sentence length and the concentration of high probability
words are signals that provide us with information as to whether we should include
the sentences in the summary or not.
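The two penalty coefficients and the non-alphabetic filter could be combined into a single scoring function along these lines (the multiplicative form of the concentration penalty and all parameter names are assumptions of this sketch; only the 0.475 default short-sentence multiplier is taken from the text):

```python
def sentence_score(words, weights, short_len=8, short_penalty=0.475,
                   zero_penalty=0.5):
    """Average term weight with two parametric penalties: one for short
    sentences and one for a low concentration of top-ranked terms."""
    if not words or not words[0][0].isalpha():
        return 0.0                       # skip headings / bullet fragments
    ws = [weights.get(w, 0.0) for w in words]
    score = sum(ws) / len(ws)
    if len(words) < short_len:
        score *= short_penalty           # penalise short sentences
    zero_frac = sum(1 for x in ws if x == 0.0) / len(ws)
    score *= 1.0 - zero_penalty * zero_frac  # penalise off-list terms
    return score
```

The two multipliers act as the signals discussed above: neither sentence length nor the fraction of top-N terms proves a sentence is summary-worthy, but both let the selection algorithm infer it.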