\documentclass[12pt]{book} \usepackage{tuos_exam} \usepackage{amsfonts} \usepackage{amsmath} %\showanswers % *** TOGGLE IN/OUT TO SHOW/HIDE ANSWERS *** \pressmark{COM3110/COM6150} %\pressmark{COM3110} %\pressmark{COM6150} \department{{\bf DEPARTMENT OF COMPUTER SCIENCE}} \examtitle{{\bf TEXT PROCESSING}} \examdate{{\bf Autumn Semester 2008-09}} \examtime{{\bf 2 hours}} % \dataProvided{Deduction Rules} \setter{Mark Hepple} \rubric{{\bf Answer THREE questions. All questions carry equal weight. Figures in square brackets indicate the percentage of available marks allocated to each part of a question.}} %\doNotRemove %% MODIFIES FRONT PAGE TO "DO NOT REMOVE FROM HALL" FORMAT %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newcommand{\stretchit}[1]{\addtolength{\itemsep}{#1mm}} \newcommand{\argmax}{\operatornamewithlimits{argmax}} \newcommand{\myargmax}[1]{\begin{array}[t]{c}\argmax\\{^#1}\end{array}} \newenvironment{myfbox}[2]{% \setlength{\unitlength}{1mm}% \begin{picture}(0,0)(0,-4) \put(0,-#2){\framebox(#1,#2){}} \end{picture}~~\begin{minipage}[t]{#1mm}}% {\end{minipage}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} \begin{exam} \begin{question} \begin{qupart} What is {\it stemming\/}? Briefly indicate the different kinds of approach taken to this task. Suggest a text processing context where stemming might be used. \mypercent{20} \begin{answer} {\it Stemming\/} refers to the process of reducing words that are morphological variants to their common root or stem, e.g. so that variants {\it computer, computes, computed, computing}, etc, are reduced to the stem {\it compute\/} (or some surrogate for the real root, such as {\it comput\/}). This is a form of {\it term conflation}. Approaches to this task range from linguistically-based lemmatizers to fast/simple suffix-stripping methods. 
Lemmatizers embody significant linguistic knowledge, and reduce words to their linguistically-correct {\it root\/} form, using lexical look-up to deal with irregular forms (e.g. {\it goose/geese\/}), and otherwise using morphological rules to ``undo'' processes of morphological derivation. Simple stemmers encode little or no linguistic knowledge, do not use lexical look-up (allowing them to be much faster), and use simple string-matching rules to strip (or sometimes substitute) suffixes. Their output forms are often not real roots. A well-known example of a simple stemmer is the {\it Porter stemmer}. One text processing context where stemming may be used is IR, where its use allows a query containing one morphological variant of a word to retrieve documents not containing that term, but rather other variants of the same stem (e.g. a query with {\it computing\/} retrieving a document containing {\it computer\/}). The usefulness of this move, however, is debated. Stemming will also produce some reduction in the size of document indexes. Other cases where stemming might be used include many applications where document similarity is computed, e.g. document clustering, or where similarity is computed at other levels, e.g. for sentences, in a sentence-extraction approach to summarisation. \end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} What is {\it part-of-speech tagging\/}? Explain why this task is difficult, i.e. why it requires something more than just simple dictionary look-up. \mypercent{20} \begin{answer} This is the task of assigning to each word (or more generally, token) in a text a tag or label for the word's part-of-speech class, such as noun, verb or adjective. These classes group together words that exhibit similar distributional behaviour, i.e. which play similar roles in the syntax of the language. The task is non-trivial due to (i)~ambiguity, i.e.
many words have more than one POS, and the correct one must be selected (on the basis of local context), and because (ii) it is common to encounter unknown words (i.e. ones not present in the dictionary, even if this has been compiled from a large amount of text), for which a POS tag must be ``guessed'' (using context and morphological information). \end{answer} \end{qupart} \begin{qupart} What is the noisy channel model? Give a diagram of the model as part of your answer. Suggest a text processing context where the noisy channel model has been used. \mypercent{20} \begin{answer} Using information theory, Shannon modelled the goal of communicating down a telephone line -- or in general across any channel -- in the following way. Imagine communication along a low-quality telephone line, where the sender creates a message which is transmitted over a channel with limited capacity; the receiver has to guess what the original message was. It is assumed that the output of the channel depends probabilistically on the input. The goal is to encode the message in such a way that it occupies minimal space while still containing enough redundancy to be able to detect and correct errors. On receipt, the message is decoded to give what was most likely the original message. $\stackrel{W}{\longrightarrow}\fbox{Encoder} \stackrel{X}{\longrightarrow}\fbox{\parbox{1in}{\centering Noisy Channel\\ $p(y|x)$}} \stackrel{Y}{\longrightarrow}\fbox{Decoder} \stackrel{\hat{W}}{\longrightarrow}$ A simplified version of the noisy channel model has been used to model the machine translation process. As an example, suppose we want to translate a text from English to French. The channel model for translation assumes that the true text is French, but that unfortunately when it was transmitted to us, it went through a noisy communication channel and came out as English. All we need to do in order to translate is to recover the original French, i.e., to decode the English to get the French.
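In probabilistic terms, the decoder's task in the diagram is to recover the most probable original message given the received output (writing $P(Y|W)$ for the behaviour of the channel under a fixed encoding of $W$): $\begin{array}{rcl} \hat{W} &=& \myargmax{W} P(W|Y) \;=\; \myargmax{W} P(W) \cdot P(Y|W) \end{array}$ where Bayes' Rule has been applied and the constant denominator $P(Y)$ dropped.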
\end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Write a short program, in python or perl, which reads text from a file and prints out any lines of the file that contain, within a single line, a URL (``web address'') expression. Assume the following requirements for identifying a URL string: \begin{exlist} \item[i.] the string should fall between double quotes ({\tt "}) \vspace*{-2mm} \item[ii.] it should start with {\tt http://} \vspace*{-2mm} \item[iii.] it should end with {\tt .htm} or {\tt .html} \vspace*{-2mm} \item[iv.] between this start and end, there may be any sequence of characters except that {\tt "}, {\tt <} and {\tt >} may not appear. \end{exlist} \vspace*{-5mm} \mypercent{20} \begin{answer} A solution in Python (note that the {\tt .} before {\tt htm} must be escaped in the regular expression, so that it matches only a literal full stop): \medskip \hspace*{20mm}
\begin{myfbox}{100}{35}
\begin{small}
\begin{verbatim}
import sys, re
f = open(sys.argv[1],'r')
urlRE = re.compile(r'"http://[^"<>]*\.html?"')
for line in f:
    if urlRE.search(line):
        print line,
f.close()
\end{verbatim}
\end{small}
\end{myfbox}
\end{answer} \end{qupart} \begin{qupart} Write a short program, in python or perl, that reads a file containing an unsorted list of words plus count values (one word and count value per line, separated by whitespace). The program should then print out the words in {\it descending order\/} of their associated count values. Partial credit (up to 15\%) will be given for an answer that instead prints the words in {\it alphabetically sorted order}. \mypercent{20} \begin{answer} A solution in Python: \medskip \hspace*{20mm}
\begin{myfbox}{100}{59}
\begin{small}
\begin{verbatim}
import sys
f = open(sys.argv[1],'r')
counts={}
for line in f:
    toks = line.split()
    counts[toks[0]]=int(toks[1])
sortwords = counts.keys()
sortwords.sort(lambda x,y:counts[y]-counts[x])
for w in sortwords:
    print w, counts[w]
\end{verbatim}
\end{small}
\end{myfbox}
\end{answer} \end{qupart} \end{question} \smallskip \begin{question} \begin{qupart} What is summarisation?
Does summarisation always involve text? Identify some of the ways in which different types of summary have been classified. Discuss some of the characteristics of a good textual summary. \mypercent{25} \begin{answer} In its most general sense, summarisation is the art of abstracting key content from one or more information sources. This has most commonly been done for textual sources (e.g. newspaper articles, journal papers), but this is not necessarily so. In fact, either, or both, of the input to/output from summarisation might be non-textual. For example, the story told by a movie might be summarised in a critical review (video$>$text), a news report on troop movements might be summarised by a short animation (text$>$graphic), or a full-length video of a football match might be summarised by a short video of edited highlights (video$>$video). Standard ways in which summaries have been distinguished are as follows. Answers should explain the terms used. \begin{enumerate} \item abstract vs.\ extract \item indicative vs.\ informative vs.\ evaluative (critical) summaries \item single vs.\ multi-document summaries \item generic vs.\ query-based vs.\ user-profile-based summaries \item summary for expert vs.\ novice users \end{enumerate} Properties of a good text summary include the following. It should be considerably shorter than the input text, whilst covering the main points of the original (or, for non-generic summaries, cover the main points relevant to the user's needs, etc). It should be {\it truth-preserving}, i.e. should not carry false implications. It should be a good text in its own right, i.e. read well/be coherent. \end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Explain what is meant by {\it sentence selection\/} as an approach to summarisation. Discuss some of the common problems for summaries produced in this way. \mypercent{15} \begin{answer} Sentence selection is a shallow approach to text summarisation (i.e.
not based on deep analysis of meaning) which is the most common version of {\it extraction\/}-based summarisation, i.e. where summaries consist of fragments of text taken from the original document, these fragments specifically being sentences. The approach typically involves using shallow criteria to score sentences, with the top-scoring sentences being selected for inclusion in the summary, in accordance with some compression requirement. The selected sentences are then concatenated, perhaps in the order in which they appear in the original document, or in their rank order, to produce the final summary. Summaries produced in this way commonly exhibit a number of problems, most notably in terms of {\it cohesion\/} (whether text hangs together), and {\it coherence\/} (whether text makes sense). Cohesion is often signalled by linguistic devices such as co-reference, lexical repetition, etc, which serve to {\it link\/} adjacent/nearby sentences together. A concatenation of extracted sentences will fail to exhibit such linking, and, furthermore, individual extracted sentences will contain uses of the above linguistic devices that served to link them into their original context but for which those links are ``broken'' in the summary, e.g. anaphors whose antecedent is absent. A coherent text is rhetorically well-structured, with the sequencing of sentences realising rhetorical relations such as elaboration, exemplification, etc. The process of extracting/concatenating sentences in general makes no reference to the rhetorical structure of the original text, and cannot be expected to produce a result that is itself rhetorically well-structured, and so extracted summaries often seem to be highly disjointed. In the worst case, the juxtaposition of extracted sentences in a summary may carry false implications, e.g.
a summary comprising the first and third, but not the second, of the following sentences will carry the false implication that I think Tom is a fool: \begin{quote} [Full text]~~ {\it I met Tom in town. We ran into Bill. I think he's a fool!} [Summary]~ {\it I met Tom in town. I think he's a fool!} \end{quote} \end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Discuss the criteria that have been used by {\it shallow\/} approaches to automated summarisation for identifying text segments containing significant information in a document. \mypercent{20} \begin{answer} Methods used include \begin{itemize} \item {\bf keywords}: \begin{itemize} \item assume frequent content words indicate topic of document \item identify set of keywords by (e.g.) TF.IDF criterion \item score sentences w.r.t. occurrence of keywords within them \end{itemize} \item {\bf title method}: \begin{itemize} \item assume title/subheadings indicative of important content \item use words from them to identify useful sentences \end{itemize} \item {\bf position}: \begin{itemize} \item use position in document as indicator of importance of text segment \item notable positions include: (i) being close to start of text, (ii) in certain sections (e.g. intro/conclusion), (iii) first sentence of paragraph, etc. \end{itemize} \item {\bf cue method}: \begin{itemize} \item certain phrases may signal important content (e.g. ``this paper presents \dots'') \item may distinguish both positive and negative indicators, as ``bonus'' and ``stigma'' words/phrases \end{itemize} \end{itemize} \end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Consider a sentence extraction-based approach to summarisation which uses more than one of the criteria you have listed in part (c) as features in the process of sentence selection. Discuss alternative ways of combining such features in the selection process.
\mypercent{20} \begin{answer} We have seen two approaches to combining multiple criteria in the sentence selection process. Edmundson (1969) used a linear combination method, with weights to vary the contributions of different features. This approach requires that the features each individually produce a {\it numeric score\/} for each sentence; these scores are then combined using weights ($\alpha,\beta,\ldots$) to give a combined score, e.g. as in: \hspace*{10mm} {\it Score(S) = $\alpha\cdot$Title(S) $+$ $\beta\cdot$Cue(S) $+$ $\gamma\cdot$Keyword(S) $+$ $\delta\cdot$Position(S)} These combined score values can be used to rank sentences, with the top-ranked sentences being selected (according to some compression requirement). The weight values in the equation can be determined using training data and a suitable minimisation technique. \medskip Kupiec et al.\ (1995) combine the different criteria using a modified Naive Bayesian approach. A number of features are computed for each sentence, all of which are discrete valued, e.g. a binary sentence length feature (true if sentence longer than a certain threshold value), a key phrase feature (true if any of a list of phrases appears in sentence), etc. The system determines the probability that a sentence $s$ should be included in the summary $S$ (i.e. $s\in S$), given the features $F_{1}, \ldots,F_{k}$. Using Bayes' Rule and the Naive Bayes assumption of conditional independence, this probability can be expressed as: $P(s \in S | F_{1}, \ldots,F_{k}) = \frac{P(s \in S)\prod_{j=1}^{k}{P(F_{j}| s \in S)}}{\prod_{j=1}^{k}{P(F_{j})}}$ for which $P(s \in S)$ is constant, and the probabilities $P(F_j |s \in S)$ and $P(F_j)$ can be estimated from training data by counting occurrences. This conditional probability is then used as a score to rank sentences, and the top-ranked sentences are selected (to meet some compression requirement).
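As an illustration, the linear-combination scoring described above can be sketched in a few lines of Python. The feature names, weights and per-sentence values here are invented for illustration only; they are not Edmundson's actual parameters.

```python
# Sketch of linear-combination sentence scoring
# (illustrative weights and feature values, not Edmundson's).

def score_sentence(features, weights):
    # Score(S) = alpha*Title(S) + beta*Cue(S) + gamma*Keyword(S) + ...
    return sum(weights[name] * value for name, value in features.items())

weights = {"title": 1.0, "cue": 0.8, "keyword": 0.5, "position": 0.3}

sentences = [
    ("s1", {"title": 2, "cue": 0, "keyword": 3, "position": 1}),
    ("s2", {"title": 0, "cue": 1, "keyword": 1, "position": 0}),
    ("s3", {"title": 1, "cue": 1, "keyword": 0, "position": 2}),
]

# Rank by combined score, highest first; a summariser would then keep
# the top-ranked sentences up to the compression requirement.
ranked = sorted(sentences,
                key=lambda s: score_sentence(s[1], weights),
                reverse=True)
```
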
\end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Discuss alternative approaches to evaluating the performance of automatic summarisation systems. \mypercent{20} \begin{answer} Evaluation may be intrinsic (i.e. addressing the quality of the summary itself) or extrinsic (assessing the impact that using the summary has on subjects' performance at some task). Tasks used in extrinsic evaluations include the subject's ability to answer questions about the topic of the original text, or to assign a topic category to the original text, when given the automatic summary, as opposed to using the original text or a human summary. Intrinsic evaluation may be done by human judgement (e.g. of whether key topics of the original text are present, whether the summary is coherent) or be done automatically. Human judgements are subjective, and often vary between judges, and the approach is time-consuming (and therefore costly). Automatic evaluation requires the creation of gold standard materials, i.e. collections of documents with associated reference summaries (which are costly to create). It primarily addresses whether the content of reference summaries is represented in automated summaries, not whether the latter are coherent. For sentence extraction approaches, evaluation against extraction-based reference summaries may be done using measures such as precision and recall. This is not possible for abstraction approaches. Alternative approaches measure textual overlap between automatic and reference summaries, as an indicator of conceptual overlap. A key approach is the ROUGE system (Recall-Oriented Understudy for Gisting Evaluation), which measures (a version of) word n-gram overlap between automatic and reference summaries. \end{answer} \end{qupart} \end{question} \newpage \begin{question} \begin{qupart} Explain what an {\it inverted index\/} is, and describe the information that such indices typically encode. Why are inverted indices important in information retrieval systems?
\mypercent{15} \begin{answer} An inverted index is a data structure that lists, for each word in a document collection, all documents that contain it and (usually) also the frequency of its occurrence in each document. A more sophisticated version of an inverted index might also record position information for occurrences (encoded as a byte offset relative to the beginning of the document), to allow for efficient retrieval of phrases. An inverted index makes it easy to search for ``hits'' of a query word, i.e. one just goes to the part of the inverted index that corresponds to the query word to get a list of document identifiers for documents containing the word. This move is crucial to efficient implementations of retrieval systems for various different models. \end{answer} \end{qupart} \begin{qupart} In the context of an information retrieval system that uses term frequencies in the retrieval process, show how the following two documents would be stored in an inverted index. Assume an approach which {\it does not\/} use stemming, but which {\it does\/} use a {\it stoplist\/} that includes the following terms: {\tt \{at, can, my, the, you\}}. \smallskip \begin{quote}\small {\bf Document 1}:~ Sea shells, buy my sea shells! \medskip {\bf Document 2}:~ You can buy lovely sea shells at the sea produce market. \end{quote} \vspace*{-5mm} \mypercent{20} \begin{answer} Given the specified stoplist, and assuming that there are no other documents (so that only terms in these documents need be considered), we get the following inverted index: \begin{table}[h!] \begin{center} \begin{tabular}{|c|l|l|}\hline \it term-id & word & docs\\ \hline 1 & buy & d1:1, d2:1 \\ 2 & lovely & d2:1 \\ 3 & market & d2:1 \\ 4 & produce & d2:1 \\ 5 & sea & d1:2, d2:2 \\ 6 & shells & d1:2, d2:1 \\ \hline \end{tabular} \end{center} \end{table} Here, entries record frequency counts, e.g. the entry for {\it shells\/} tells us that the term appears twice in Document 1 (d1:2) and once in Document 2 (d2:1).
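The index above can be reproduced with a short Python sketch, using the stoplist and documents from the question. The dictionary-of-dictionaries layout and the deliberately crude tokenisation are illustrative choices, not the only possible representation (the term-id column is omitted).

```python
# Build a simple inverted index: term -> {doc_id: frequency},
# using the stoplist and documents given in the question.

stoplist = {"at", "can", "my", "the", "you"}

docs = {
    "d1": "Sea shells, buy my sea shells!",
    "d2": "You can buy lovely sea shells at the sea produce market.",
}

index = {}
for doc_id, text in docs.items():
    # crude tokenisation: lowercase, treat non-letters as separators
    tokens = "".join(c if c.isalpha() else " " for c in text.lower()).split()
    for tok in tokens:
        if tok not in stoplist:
            index.setdefault(tok, {})
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
```

A retrieval system would then answer a one-word query simply by looking up `index[word]` to obtain the posting list of matching documents and frequencies.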
\end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} What is the similarity measure used by the {\it vector space model\/} of information retrieval? Compute the similarity score between the following query and the two documents given in part (b), to determine which (if either) of them would be ranked more highly as relevant to the query. Assume that term frequency values are used for the term weights in document vectors. \smallskip \begin{quote}\small {\bf Query}:~ buy lovely sea shells \end{quote} \vspace*{-5mm} \mypercent{25} \begin{answer} The vector space model treats documents, and queries, as vectors of term weights, and uses the cosine of the angle between two vectors as the measure of their similarity. The cosine between two vectors $\vec{x}$ and $\vec{y}$ is computed as: $\cos(\vec{x},\vec{y}\,) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}| |\vec{y}|} = \frac{\sum^{n}_{i=1}x_iy_i}{\sqrt{\sum^{n}_{i=1}x_i^2} \sqrt{\sum^{n}_{i=1}y_i^2}}$ \medskip Using the term order taken from the inverted index above, we can represent the two documents and the query as vectors as follows: \medskip \begin{tabular}{rrl} ~~~ & d1: & $\left<\, 1,0,0,0,2,2 \,\right>$ \\ & d2: & $\left<\, 1,1,1,1,2,1 \,\right>$ \\ & q: & $\left<\, 1,1,0,0,1,1 \,\right>$ \\ \end{tabular} \medskip The vector magnitudes and cosine values are then: $|{\rm d1}| = \sqrt{1^2+0^2+0^2+0^2+2^2+2^2} = \sqrt{9} = 3$ $|{\rm d2}| = \sqrt{1^2+1^2+1^2+1^2+2^2+1^2} = \sqrt{9} = 3$ $|{\rm q}| = \sqrt{1^2+1^2+0^2+0^2+1^2+1^2} = \sqrt{4} = 2$ $\cos({\rm d1,q}) = \frac{(1\times 1)+(0\times 1)+(0\times 0)+(0\times 0)+(2\times 1)+(2\times 1)}{3\times 2} = 5/6$ $\cos({\rm d2,q}) = \frac{(1\times 1)+(1\times 1)+(1\times 0)+(1\times 0)+(2\times 1)+(1\times 1)}{3\times 2} = 5/6$ Thus, the cosine values computed for the two documents are the same, and so the two are equally rated as relevant to the query. \end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Describe and explain the TF.IDF term weighting scheme. Include the formula (or formulae) for computing TF.IDF values as part of your answer.
\mypercent{20} \begin{answer} TF.IDF assigns weights to terms by taking into account two elements: the frequency of that term in a particular document and the proportion of documents in the corpus in which it occurs. The TF part prefers terms which occur frequently in a document and the IDF part gives extra weight to terms which do not occur in many documents in the collection. The IDF value for a term $w$ is computed as follows, where $D$ is the set of documents in the document collection and ${\it df}_w$ is the document frequency of $w$ (the count of documents in which $w$ appears): ${\it idf}_{w,D} = \log \frac{|D|}{{\it df}_{w}}$ The TF.IDF value for a given term ($w$) in a given document ($d$) is the product of the term's frequency in $d$ (${\it tf}_{w,d}$) and its IDF value, i.e.: ${\it tf.idf}_{w,d,D} = {\it tf}_{w,d} \cdot {\it idf}_{w,D}$ \end{answer} \end{qupart} \begin{qupart} We want to calculate the similarity of two documents using the following measure (known as the {\it Jaccard coefficient\/}): the proportion of terms appearing in {\it either\/} document that appear in {\it both\/} documents. (Note that this measure makes no use of the {\it counts\/} of terms in the document files, only whether they are present or absent.) \vspace*{2mm} Assume that we have code that identifies the terms that appear in a file, and stores this information in a dictionary (python) or hash (perl) data structure, with the terms being the keys (where the associated values are not very important, but might be either the counts of the terms in the file, or just the value 1). Write some code, in either python or perl, that takes the term information computed for {\it two\/} different files, and uses it to compute their similarity score. \mypercent{20} \begin{answer} A solution in Python, in the form of a function which would be supplied with the two dictionaries of counts as arguments.
\bigskip \hspace*{10mm}
\begin{myfbox}{98}{35}
\begin{small}
\begin{verbatim}
def compute_jaccard(a,b):
    overlap=0.0
    for term in a:
        if term in b:
            overlap=overlap+1
    total=len(a.keys())+len(b.keys())-overlap
    return overlap/total
\end{verbatim}
\end{small}
\end{myfbox}
\end{answer} \end{qupart} \end{question} \newpage \begin{question} \begin{qupart} Explain what is meant by an {\it interlingual\/} approach to Machine Translation. What is the key advantage of using an interlingual approach for translation amongst multiple languages, as compared to alternative approaches? Identify at least one problem that arises for the creation of a genuinely interlingual representation. \mypercent{20} \begin{answer} Interlingual MT approaches use an intermediate representation in the translation process that is an {\it interlingua}, which is a meaning representation language whose characteristics are independent of both the source and target languages of translation. A source language sentence goes through syntactic and semantic analysis to be translated into the interlingua. This representation is then used to generate a sentence in the target language. The fact that the interlingua is independent of the specific source/target languages of translation means that the same representation can be used for translation between {\it any\/} pair of languages. This means that we can achieve full translation coverage amongst $N$ different languages by providing only $N$ pairs of analysis/generation components, i.e. for each language, a component to analyse its sentences into the interlingua and one to generate its sentences from the interlingua. Then we can translate from any one of the $N$ languages to any other by chaining the appropriate analysis/generation components, mapping content via the interlingua. Other MT approaches require that a specific translation system be provided for each language pair {\it and direction}, giving rise to the $O(N^2)$ problem, i.e.
that to achieve translation amongst $N$ languages, we need $N^2-N$ translation systems, as there are $(N^2-N)/2$ language pairs (and twice as many pairs/directions). A key obstacle to formulating a genuine interlingua, i.e. one capable of representing meaning for all languages, is that the lexical concepts of different languages carve up conceptual space in different ways. For example, German has two words for {\it wall}, one for {\it internal\/} walls, the other for {\it external\/} walls. Japanese has different words for {\it elder\/} vs.\ {\it younger\/} brother. Spanish has different words for {\it human\/} vs.\ {\it other\/} legs. Thus, different languages offer competing choices for the basic conceptual units, and the best choice is not obvious. Furthermore, the conceptual inconsistencies between languages suggest the need for complex semantic processing for decomposing/reassembling conceptual units. \end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Explain what is meant by {\it direct\/} and {\it transfer\/} approaches to Machine Translation, identifying the differences between them and an interlingual approach. \mypercent{20} \begin{answer} Different MT approaches can be distinguished in terms of the level of analysis which is applied to the source text. Interlingual approaches perform a {\it deep\/} syntactic/semantic analysis of the text to create the interlingual meaning representation. In contrast, direct approaches apply very little analysis to the source text and rely on doing fairly simple translation of each word in the source text (which may be dependent on the context in which the word appears). Furthermore, direct MT makes no use of intermediate representations --- distinguishing it both from transfer and interlingual approaches. Although classic direct MT approaches differ considerably from modern statistical MT methods, the latter can be viewed as a version of a direct MT approach.
Transfer approaches attempt to analyse the structure of source language sentences to produce an {\it intermediate representation\/} (IR). A {\it transfer\/} process then applies to convert this source-language IR to a target-language IR --- which is the key ``translation'' step. The target-language IR is then used to generate target-language sentences. These IRs may be of either a syntactic, semantic or {\it hybrid\/} character, and they may have characteristics that are specific to the task of achieving translation between a particular language pair. The transfer process is performed by a set of transfer rules that are specific to the given source/target language translation task. These features of a transfer approach distinguish it from an interlingual approach, where there is only a single language-independent IR, and no need for a transfer step. \end{answer} \end{qupart} \begin{qupart} When applied to translating from French to English, the IBM approach to statistical machine translation might be expressed by the following equation: $\begin{array}{rcll} E^\ast &=& \myargmax{E} P(E) \cdot P(F|E) & ~ \end{array}$ Explain what this equation means, and indicate the role played by the components $P(E)$ and $P(F|E)$ in the process of translation. \mypercent{20} \begin{answer} The equation tells us that the optimal English translation $E^\ast$ of a French sentence $F$ is found by identifying the English sentence $E$ that maximises the product of two components, which serve different purposes, as follows. The (monolingual) Language Model (LM) $P(E)$ favours candidate $E$s that are good strings of English, i.e. this component provides for the {\it fluency\/} of translations. The Translation Model (TM) $P(F|E)$ tests whether candidate $E$s are likely translations of $F$, i.e. it provides for {\it faithfulness\/} in translation. \end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Show how the equation given in part (c) is derived using Bayes' Rule.
What is the benefit of this approach as compared to one attempting to use the probability $P(E|F)$ directly? \mypercent{20} \begin{answer} The starting point is the idea that we want to find the English sentence $E$ that is most probable {\it given\/} the $F$ we want to translate, which is expressed by the following equation: $\begin{array}{rcl} E^\ast &=& \myargmax{E} P(E|F) \end{array}$ Using Bayes' Rule, we can restate this as: $\begin{array}{rcll} E^\ast &=& \myargmax{E} \frac{P(E) \cdot P(F|E)}{P(F)} \end{array}$ Here, $P(F)$ is constant across different $E$, so it doesn't affect the maximisation. Hence, it can be dropped to give: $\begin{array}{rcll} E^\ast &=& \myargmax{E} P(E) \cdot P(F|E) \end{array}$ It would be very difficult to model $P(E|F)$ directly, due to data sparseness problems, i.e. it would be difficult to get good probability estimates from even a very large amount of parallel text. The decomposition into $P(E) \cdot P(F|E)$ allows us to be ``sloppy'': good translations can be achieved with probabilities that are easier to acquire. The TM $P(F|E)$ can be used to provide a collection of English {\it words\/} that are good candidates for translating $F$, which is much easier than determining complete ordered English sentences for translating $F$. The LM $P(E)$ then deals with finding an ordering of candidate words that constitutes good English. The TM and LM can be trained independently. The LM requires a large amount of monolingual text for training. Training the TM requires an amount of sentence-aligned parallel text (a bilingual corpus) that is large, but not infeasibly so. \end{answer} \end{qupart} \answersonly{\newpage} \begin{qupart} Discuss alternative approaches that may be used for evaluating Machine Translation systems, indicating their advantages and disadvantages. Describe in some detail the approach used by the BLEU scheme for {\it automatic\/} evaluation. \mypercent{20} \begin{answer} Various approaches are possible.
One way is to rely on human judges, who compare system outputs to manually-created reference translations, and attempt to assess the rate of incorrect sentences produced by the system, producing a `subjective sentence error rate' (SSER) score. Alternatively, subjects might read the system outputs and then complete a comprehension test, to measure the effectiveness of the translations in communicating the content of the source document. These methods have the advantage of involving direct human evaluation of translations, or direct human use of them, which seems most convincing as a means of evaluating their quality as translations. However, such human-based approaches are very costly in time and money, and can only occasionally be repeated (if at all), i.e. they cannot frequently be redone to aid the system development process. Another possibility is to use the MT systems being evaluated as a component in some larger system, e.g. for cross-language IR or QA. Whatever means is available for evaluating/benchmarking the performance of that larger system can then be applied to determine whether the system works better or worse with one MT system than another, and hence, by implication, which of the MT systems is more effective. This is, however, a rather indirect means of evaluating how good the MT system is at producing translations, so its results have unclear status, and the approach will also carry with it whatever costs are associated with evaluation of the larger system. A further possibility is to use automatic methods of evaluation, with benchmarking materials and suitable metrics. The predominant approach of this kind is BLEU: Bilingual Evaluation Understudy. The BLEU system relies on multiple reference translations, each of which represents a possible way in which a text could be translated into a target language.
The translation being evaluated (the candidate) is compared against the reference translations by counting the n-grams (short sequences of words) of the candidate that also occur in them. BLEU assumes that there are several possible ways to translate a text, each of which is equally valid, and uses multiple reference translations to provide these alternatives. A key question is whether such an approach is genuinely effective at evaluating the quality of translations, but there is some evidence to show that the ranking of MT systems produced using BLEU correlates quite well with that produced via a human evaluation approach. The key disadvantage of this approach is the cost of creating the benchmarking materials, which is considerable, whilst its key advantage is that these materials, once created, can be endlessly re-used to evaluate different systems, and to aid the development of individual systems. \end{answer} \end{qupart} \end{question} \questionsend \end{exam} \end{document}