Summarisation Procedure

At the pre-processing stage, the keyword extraction algorithm selects the document’s keywords according to a pure TFIDF procedure. We propose computing the probability of each word appearing in the given text: the ratio of the number of times the word appears in the text to the total number of words in the document. In our algorithm, we select the words that fall in the top N percent of the keyword ranking array, where N is given parametrically. This effectively gives the user the choice to eliminate common words from the summarisation process. #Calchas uses the term’s TFIDF value as the term weight. This value is usually less than one (or, in extraordinary cases, equal to one), which means that, when raised to a power, it is much less responsive than the term probability, which is usually less than 0.05. We will see where this comes into play below.
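
A minimal sketch of this pre-processing stage might look as follows; the function and parameter names are ours, not #Calchas’s, and the sketch uses the raw probability where #Calchas would substitute the TFIDF value:

    from collections import Counter

    def keyword_probabilities(words, top_percent):
        """Compute each word's probability (count / total words) and keep
        only the words in the top-N percent of the keyword ranking array."""
        counts = Counter(words)
        total = len(words)
        probs = {w: c / total for w, c in counts.items()}
        ranked = sorted(probs, key=probs.get, reverse=True)
        keep = max(1, int(len(ranked) * top_percent / 100))
        # Words outside the top-N percent are dropped from the keyword set.
        return {w: probs[w] for w in ranked[:keep]}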

After the probabilities have been computed, we assign a weight coefficient to each sentence in the text, which is simply the average probability of the words in that sentence. The selection mechanism is quite interesting, even though it proved inadequate in its simple form for #Calchas. We propose selecting the best-scoring sentence that contains the highest-probability term. This means that we start at the top of the sentence ranking array and find the first sentence that contains the topmost term in the keyword ranking array. Then we recompute the keyword ranking array by squaring the probability of each term that was present in the selected sentence. Since probabilities are less than one, this recomputation yields a new keyword ranking array, which in turn is used to recalculate the sentence ranking array. The authors argue that, intuitively, we can think of the squared probability of the included terms as “the probability of the term appearing twice in the summary text”. If the term’s probability is high enough, squaring it will not necessarily block the term from being included in the summary again. Essentially, the summary is made more sensitive to context, since the importance of terms is updated continuously as sentences are selected for the summary text. Additionally, words that ranked low in the initial keyword ranking array are given the opportunity to affect the subsequent choice of summary sentences. Note that the order in which the sentences are selected does not affect the order in which they are displayed; the summary sentences are ordered as they appear in the original document. The selection and recalculation process is repeated until the parametrically given summary length is reached.
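
The following Python sketch shows this greedy select-and-square loop as we understand it from Nenkova and Vanderwende’s description; the names and data shapes (a list of token lists, a word-to-weight dictionary) are illustrative:

    def select_summary(sentences, weights, summary_len):
        """Greedy selection: repeatedly pick the best-scoring sentence that
        contains the current top keyword, then square the weights of the
        words that sentence contains."""
        chosen = []
        while len(chosen) < summary_len and weights:
            top_word = max(weights, key=weights.get)
            # Score each unselected sentence by its average word weight.
            candidates = [
                (sum(weights.get(w, 0.0) for w in s) / len(s), i)
                for i, s in enumerate(sentences)
                if s and i not in chosen and top_word in s
            ]
            if not candidates:
                break
            _, best = max(candidates)
            chosen.append(best)
            # Squaring lowers the weight of words already in the summary.
            for w in set(sentences[best]):
                if w in weights:
                    weights[w] **= 2
        # Summary sentences keep their original document order.
        return sorted(chosen)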

In our implementation, however, we do not use the term’s probability as its weight; we use the term’s TFIDF value instead. As mentioned earlier, this value is typically much larger than the probability and is consequently much less responsive when raised to a power. On the other hand, even in the probability case that Nenkova and Vanderwende describe, the probability values are all very low, so squaring them does not necessarily mean that, say, the highest-probability word moves far down the list; it could well remain in the top five. Hence, we have introduced a new parameter that controls the summary’s sensitivity to context. Its value is the power to which we raise the weights of the words in the selected sentences; as that power increases, so does the summary’s sensitivity to context. Additionally, we nullify (i.e. set to zero) the TFIDF of terms whose initial TFIDF equals one. We assert that terms with a TFIDF value of one (i.e. terms that appear only in the current document) are either bogus results (nonexistent terms mistakenly produced by the stemmer) or terms particular to the specific document (e.g. author names, product names, etc.). Either way, including these terms would adversely affect the quality of the selected summary sentences and would eliminate the effect of the sensitivity-to-context parameter that we have implemented.
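
The two modifications described here could be sketched as follows; the function names and the exact placement of these steps in the pipeline are assumptions on our part:

    def drop_document_only_terms(tfidf):
        """Zero out terms whose initial TFIDF is exactly one: these appear
        only in the current document and are likely stemmer artefacts or
        document-specific names (a sketch of the rule described above)."""
        return {t: (0.0 if v == 1.0 else v) for t, v in tfidf.items()}

    def reweight(weights, selected_sentence, context_power):
        """After a sentence is selected, raise the weight of each of its
        terms to a parametric power instead of simply squaring it; larger
        powers make the summary more sensitive to context."""
        for t in set(selected_sentence):
            if t in weights:
                weights[t] **= context_power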

Unfortunately, the scoring algorithm did not work as planned. The most important problem was that section headings and sentence fragments (e.g. bullet points) were selected as summary sentences. Additionally, short sentences with little or no useful meaning were selected, since they did not contain enough terms to lower their average probability weights. To correct these deficiencies we implemented a series of measures. First, we chose not to select sentences that begin with a non-alphabetic character (e.g. a digit or a hyphen). This eliminated headings and bullet points from the summary sentences. We also incorporated two penalty coefficients, whose values are given parametrically by the user. The first coefficient penalises short sentences and thus negates the effect of short length on the average weights; the definition of “short” is itself a parameter stating the sentence length below which a sentence is considered short. The second coefficient is the result of an algorithm that aims to improve the chance of selecting sentences with a high concentration of high-probability words. By keeping only the top N percent of terms with high TFIDF values, the scoring algorithm sets the TFIDF of all other terms to zero. Hence, if a sentence contains many high-probability words (i.e. words included in the top N percent of the TFIDF table for that document), its average weight increases; if many of its terms have a zero TFIDF value (i.e. are not included in the table), the sentence is penalised and its average weight drops. Note that implementing this coefficient increased the tendency to select short sentences, and we were forced to decrease the short-sentence penalty multiplier from 0.675 to 0.475 in the default parameter set.

The rationale behind these two coefficients comes from economics and, more particularly, from contract theory. Since pure probability (or TFIDF) analysis of the sentence terms cannot provide definitive information about the importance of a sentence to the document summary, we make use of the fundamentals of signalling to infer the information that the selection algorithm requires. Signalling is the notion that, in a negotiation, one party (the principal) uses openly available, yet useless in its pure form, information about the other party (the agent) to make conjectures about obscure but useful information concerning that party. The agent has information that is useful to the principal but cannot pass it on directly; it can only emit signals that help the principal deduce that information. This is known as asymmetric information. A typical example of signalling comes from recruiting: recruiters want to know the quality of candidates with respect to the work environment, e.g. whether they are hard-working, dedicated, or intelligent. Since this information is not directly available, they rely on signals such as the level of education, course grades, and recommendations. In our implementation, sentence length and the concentration of high-probability words are the signals that tell us whether a sentence should be included in the summary.
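
Putting the pieces together, one plausible way the heading filter and the two penalty coefficients could enter the sentence score is sketched below; the parameter names, the default values other than 0.475, and the exact composition of the density penalty are our assumptions, not #Calchas’s:

    def sentence_score(sentence, weights, short_len=8,
                       short_penalty=0.475, density_penalty=0.5):
        """Average-weight score with the heading filter and the two
        penalty coefficients described above (sketch only)."""
        if not sentence or not sentence[0][:1].isalpha():
            return 0.0  # skip headings, bullet points, numbered fragments
        avg = sum(weights.get(w, 0.0) for w in sentence) / len(sentence)
        if len(sentence) < short_len:
            avg *= short_penalty  # 0.475 in the default parameter set
        # Fraction of terms outside the top-N percent (their TFIDF is zero).
        zero_ratio = sum(weights.get(w, 0.0) == 0.0 for w in sentence) / len(sentence)
        return avg * (1.0 - density_penalty * zero_ratio)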