\documentclass[12pt]{article}
\usepackage{pslatex}
\usepackage{amsfonts}
\usepackage{newexam+shield}
\usepackage{epsf}
\usepackage{myfancyverb}
\usepackage{examAnswers}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% TOGGLES
%\showAnswers
\hideAnswers
%\mscPaper
\ugPaper
%\input{toggle}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pressmark{COM3110}
\mscOnly{\pressmark{COM6150}}
\pressmark{COM3110/COM6150}
\department{\bf {DEPARTMENT OF COMPUTER SCIENCE}}
\examtitle{{\bf TEXT PROCESSING}}
\examdate{{\bf Autumn Semester 2005-06}}
\examtime{{\bf 2 hours}}
\setter{Mark Hepple \\ Mark Stevenson}
\rubric{{\bf Answer THREE questions. All questions carry equal weight.
Figures in square brackets indicate the percentage of available marks
allocated to each part of a question.}}
\mscOnly{\rubric{{\bf Answer the Question in Section A and TWO further
questions from Section B. All questions carry equal weight. Figures in
square brackets indicate the percentage of available marks allocated to
each part of a question.}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\begin{exam}
\turnover
\mscOnly{\begin{center}\bf SECTION A \end{center}}

\begin{question}
\begin{qupart}
Explain what is meant by the ``bag of words'' model which is used for
various text processing tasks.
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The bag of words model assumes that a text can be represented as a
simple list of the words it contains and, possibly, their frequencies
in the text. Information about the relations between these words and
their positions in the text is ignored.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Explain how the bag of words model can be used in Information Retrieval
and word sense disambiguation.
\mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The standard approach to IR models each document as a bag of words, and
identifies relevant documents by comparing queries to documents using
similarity measures that work over such bag-of-words representations.
There are various applications in word sense disambiguation: naive
Bayesian classifiers treat the context of an ambiguous word as a bag of
words, and the Lesk algorithm for disambiguation using dictionary
definition overlap treats each definition as a bag of words.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Write down Bayes Rule and explain how it is used in the naive Bayesian
classifier for word sense disambiguation.
\mypercent{20}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Bayes rule is: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
The naive Bayesian classifier aims to estimate the probability of a
sense given the context, i.e.\ $P(s|c)$, where $s$ is a particular
sense and $c$ is the context used by the algorithm. However, this
cannot be estimated directly, so Bayes rule is applied and the formula
re-written as: $P(s|c) = \frac{P(c|s)P(s)}{P(c)}$. $P(c)$ is constant
for all senses and can be ignored. We approximate $P(c|s)$ as
$P(a_1|s)\times\ldots\times{}P(a_n|s)$, where $a_1\ldots a_n$ are the
multiple features that make up the ``context'', i.e.\ making the naive
Bayesian assumption that these features are conditionally independent.
The probabilities $P(s)$ and $P(a_i|s)$ can be estimated from training
text.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\answersonly{\newpage}
\begin{qupart}
What are the similarities between part of speech tagging and word sense
disambiguation? Why are techniques for part of speech tagging not
suitable for word sense disambiguation?
\mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Both part of speech tagging and WSD aim to add extra information to
each token in a text (or possibly just each content word).
Part of speech tagging adds syntactic information while WSD selects
amongst word meanings. Part of speech tagging approaches generally
examine a narrow context around the word being annotated, for example
the previous one or two words, whereas WSD requires a wider context.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Write a short Perl program that will read from a file on {\tt STDIN}
and that does nothing with the lines that are read {\it until\/} it
encounters a line containing the string \verb+<BODY>+. Thereafter, each
subsequent line will be printed (to {\tt STDOUT}) {\it until\/} it
encounters a line containing the string \verb+</BODY>+. This line and
any that follow are not printed. You may assume that the \verb+<BODY>+
and \verb+</BODY>+ strings will occur only once in a file.
\mypercent{20}
\end{qupart}
\begin{SaveVerbatim}{vrb1}
$print = 0;
while (<>) {
    if (/<BODY>/) {
        $print = 1;
    } elsif (/<\/BODY>/) {
        $print = 0;
    } elsif ($print) {
        print;
    }
}
\end{SaveVerbatim}
\begin{SaveVerbatim}{vrb2}
while (<>) {
    last if /<BODY>/;
}
while (<>) {
    last if /<\/BODY>/;
    print;
}
\end{SaveVerbatim}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Could be done in more than one way: e.g. as:~~
\fbox{~\useVerbStretchFNS{.9}{vrb1}~}
~~ or as:~~
\fbox{~\useVerbStretchFNS{.9}{vrb2}~}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Assume that, in reading text from an HTML file, we can identify a URL
(`web address') expression {\it within\/} a line of text provided that
the string meets the following requirements:
\begin{itemize}
\item[i.] the string should fall between double quotes ({\tt "})
\item[ii.] it should start with {\tt http://}
\item[iii.] it should end with {\tt .htm} or {\tt .html}
\item[iv.] between this start and end, there may be any sequence of
characters except that {\tt "}, {\tt <} and {\tt >} may not appear.
\end{itemize}
Write a Perl regular expression that will match against strings that
contain a URL expression, with the URL string found being assigned to
the variable \verb+$1+.
\mypercent{10}
\end{qupart}
\SaveVerb{vrb1}+/\"(http:\/\/[^\"<>]*\.htm(l)?)\"/+
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
\hspace*{40mm} \UseVerb{vrb1}
\medskip
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{SaveVerbatim}{vrb1}
$_ = 'abcde';
s/(\w)(\w)/$2$1/g;
print "$_\n";
\end{SaveVerbatim}
\begin{SaveVerbatim}{vrb2}
$_ = 'abcde';
while (s/(a)(\w)/$2$1/) { }
print "$_\n";
\end{SaveVerbatim}
\begin{qupart}
Specify what will be printed by each of the following pieces of Perl
code:
\useVerbLineCentering{t}
\begin{itemize}
\item[i.] \fbox{~\useVerbStretchFNS{.9}{vrb1}~} \mypercent{5}
\medskip
\item[ii.] \fbox{~\useVerbStretchFNS{.9}{vrb2}~} \mypercent{5}
\bigskip
\end{itemize}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
\begin{itemize}
\item[i.] gives: {\tt 'badce'}
\medskip
\item[ii.] gives: {\tt 'bcdea'} -- iteratively transposes the {\tt a}
down to the end of the string.
\end{itemize}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\end{question}
%\ugOnly{\medskip}%
\mscOnly{\continued
\begin{center}\bf%
SECTION B%
\end{center}}%
\begin{question}
\begin{qupart}
Describe Lesk's algorithm for word sense disambiguation using
dictionary definition overlap and explain its limitations.
\mypercent{25}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Lesk's algorithm relies on the textual definitions of words in a
dictionary to disambiguate words. The pairwise combinations of senses
are considered and the number of words they share in common is counted.
The pair of senses with the largest number of words in common is
chosen. To work well the algorithm requires the definitions to be
pre-processed in a number of ways. Empty heads and stop words are
removed. (Empty heads are short expressions generally found at the
start of dictionary definitions which indicate the hypernym of the
definition, such as ``a type of''.) The remaining words are stemmed.
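As an illustration, along the lines of Lesk's original ``pine cone''
example: a sense of {\it pine\/} defined as ``kinds of evergreen tree
with needle-shaped leaves'' and a sense of {\it cone\/} defined as
``fruit of certain evergreen trees'' share the (stemmed) words
``evergreen'' and ``tree'', more than any other pairing of their
senses, so this pair of senses would be selected.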
The limitations are:
\begin{enumerate}
\item The approach depends on the correct senses sharing words in their
definitions. This means that the approach depends on the particular
dictionary used and will prefer senses with longer definitions over
shorter ones.
\item The approach may not be tractable for long sentences; the number
of sense combinations which have to be checked could be prohibitive.
\end{enumerate}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Describe two techniques which can be used to annotate text with word
sense information automatically, without the need for manual
annotation. These techniques are used to provide data which can be used
to train and/or test a word sense disambiguation system. Describe the
disadvantages of each.
\mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Two techniques are:
\begin{enumerate}
\item Make use of parallel corpora, where the sense is defined as the
translation of a word into another language.
\item Use pseudowords to introduce ambiguity into text automatically.
(Pseudowords are created by choosing two or more words, e.g. {\it
car\/} and {\it bottle\/}, and replacing each occurrence of each with
their concatenation, {\it car-bottle\/}. The task of the disambiguation
algorithm is to identify the original word.)
\end{enumerate}
Their disadvantages are:
\begin{enumerate}
\item Parallel text may be hard to obtain.
\item Sense distinctions in pseudowords may not be appropriate for
actual applications (e.g. translation). The sense distinctions are
artificial, and so an algorithm may learn to disambiguate between
contexts in which each of the words is likely to occur rather than
between meanings of the words.
\end{enumerate}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
What assumption is made by a naive Bayes classifier when it is used for
word sense disambiguation?
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The naive Bayes classifier assumes that the probabilities of the words
in the context of the ambiguous word are conditionally independent.
That is, the presence, or otherwise, of a word in the context has no
influence on the other words.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\answersonly{\newpage}
\begin{qupart}
Explain the differences between direct, transfer and interlingua
approaches to Machine Translation.
\mypercent{30}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The key difference between the three approaches is the level of
analysis which is applied to the source text.
Direct approaches apply very little analysis to the source text and
rely on simple translation of each word in the source text. Statistical
MT could be considered to be a direct approach.
Transfer approaches attempt to analyse the structure of the source text
to produce an intermediate representation. The intermediate
representation of a sentence from the input text is then used in
generating the translated sentence in the target language. Transfer
approaches can employ syntactic or semantic representations.
Interlingua approaches rely on a representation of the meaning which is
independent of both the source and target languages. The source
sentence goes through syntactic and semantic analysis to be translated
into the interlingua. This representation is then used to generate a
sentence in the target language.
The difference between transfer approaches which use semantic
representations and interlingua approaches rests on the independence of
the system used to represent meaning: interlinguas are completely
independent of the source and target languages, while the
representation used in semantic transfer simply aims to capture enough
information to allow translation.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Describe the approach used by the BLEU system for evaluation of Machine
Translation systems.
\mypercent{20}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The BLEU system relies on multiple reference translations, each of
which represents a possible way in which a text could be translated
into a target language. The translation being evaluated (the candidate)
is compared against the reference translations by counting the number
of its n-grams (strings of words) which also occur in them. BLEU
assumes that there are several possible ways to translate a text, each
of which is equally valid, and uses multiple reference translations to
provide these alternatives.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\end{question}
\ugOnly{\continued}
\mscOnly{\turnover}
\begin{question}
\begin{qupart}
What are stop words, in the context of an Information Retrieval system?
Why are they generally not included as index terms?
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Stop words are generally words belonging to closed grammatical classes,
such as determiners, conjunctions and prepositions. They are not
included as index terms because they occur so frequently in documents
that they are not good discriminators between relevant and irrelevant
documents.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\answersonly{\newpage}
\begin{qupart}
Consider the following collection of three short documents:\\
\\
\hspace{10ex}Document 1: Tyger, tyger, burning bright\\
\hspace{10ex}Document 2: Tyger sat on the mat\\
\hspace{10ex}Document 3: The bright mat\\
\\
Show how the documents would be represented in a vector space model for
Information Retrieval, as vectors in which term weights correspond to
term frequencies. Do not remove stop words, and do not use stemming in
creating these representations.
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The documents would be represented as follows:
\begin{center}
\begin{tabular}{|cccccccc|}
\hline
 & bright & burning & mat & on & sat & the & tyger\\
Document 1 & 1 & 1 & 0 & 0 & 0 & 0 & 2\\
Document 2 & 0 & 0 & 1 & 1 & 1 & 1 & 1\\
Document 3 & 1 & 0 & 1 & 0 & 0 & 1 & 0\\
\hline
\end{tabular}
\end{center}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
The cosine coefficient can be used by Information Retrieval systems to
rank the relevance of documents in relation to a query. Compute the
similarity that would be produced between Document 1 and the query
``burning tyger'' using this measure.
\mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
\begin{math}
\cos(\textrm{Document 1}, \textrm{``burning tyger''}) =
\frac{1\cdot 0 + 1\cdot 1 + 0\cdot 0 + 0\cdot 0 + 0\cdot 0 + 0\cdot 0 +
2\cdot 1}{\sqrt{1^{2} + 1^{2} + 0^{2} + 0^{2} + 0^{2} + 0^{2} +
2^{2}}\sqrt{1^{2} + 1^{2}}} = \frac{3}{\sqrt{6}\sqrt{2}} = 0.866
\end{math}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Explain what is meant by term weighting in Information Retrieval
systems and why it is used.
\mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Term weighting is the process of deciding on the importance of each
term and is normally carried out by assigning a numerical score to each
term. Term weighting is used in IR systems because not all terms are
equally useful for retrieval; some occur in many of the documents in
the collection (and are therefore bad discriminators) while others
occur infrequently (and will return few documents). Term weighting aims
to assign high scores to terms which are likely to discriminate between
relevant and irrelevant documents. The term weights are taken into
account by the ranking function, with the aim of improving retrieval
performance.
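For instance, a term occurring in every one of the $N$ documents in a
collection would, under an inverse-document-frequency style of
weighting, receive a weight proportional to $\log_{10}(N/N) = 0$ and so
contribute nothing to the ranking, whereas a term confined to a single
document would receive the maximum weight, $\log_{10}(N)$.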
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Explain how tf.idf term weighting is used to assign weights to terms in
Information Retrieval. Include the formula for computing tf.idf.
\mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
tf.idf assigns weights to terms by taking into account two elements:
the frequency of the term in a particular document and the proportion
of documents in the corpus in which it occurs. The tf part prefers
terms which occur frequently in a document, and the idf part gives
extra weight to terms which do not occur in many documents in the
collection. The tf.idf weight is the product of these two terms. The
formula for computing tf.idf is:
$tf.idf = tf_{ik} \times \log_{10}\left(\frac{N}{n_{k}}\right)$
where $tf_{ik}$ is the frequency of term $k$ in document $i$, $N$ is
the total number of documents in the collection and $n_{k}$ is the
number of documents which contain term $k$.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\answersonly{\newpage}
\begin{qupart}
Show the weights which would be generated if tf.idf weighting were
applied to Document 1 in the collection of three documents shown above.
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Using tf.idf, the weights in Document 1 would become:
\begin{center}
\begin{tabular}{|cccccccc|}
\hline
 & bright & burning & mat & on & sat & the & tyger\\
Document 1 & 0.176 & 0.477 & 0 & 0 & 0 & 0 & 0.352\\
\hline
\end{tabular}
\end{center}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Consider the following ranking of ten documents produced by an
Information Retrieval system. The symbol $\checkmark$ indicates that a
retrieved document is relevant and $\times$ that it is not.
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
Document & Ranking & Relevant\\
\hline
d6 & 1 & $\checkmark$\\
d1 & 2 & $\times$\\
d2 & 3 & $\checkmark$\\
d10 & 4 & $\times$\\
d9 & 5 & $\checkmark$\\
d3 & 6 & $\checkmark$\\
d5 & 7 & $\times$\\
d4 & 8 & $\times$\\
d7 & 9 & $\checkmark$\\
d8 & 10 & $\times$\\
\hline
\end{tabular}
\end{center}
\vspace{4ex}
Compute the (uninterpolated) average precision and interpolated average
precision for this ranked set of documents. Explain your working.
\mypercent{25}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
Document & Ranking & Relevant & Recall & Precision & Interpolated
Precision\\
\hline
d6 & 1 & $\checkmark$ & 0.2 & 1 & 1\\
d1 & 2 & $\times$ & 0.2 & 0.5 & 0.67\\