% Question_mt.tex
% Loïc Barrault committed Mar 08, 2021
% Question on SMT
\begin{question}

\begin{qupart}
Explain the differences between \textit{direct}, \textit{transfer-based} and
\textit{interlingual} approaches to machine translation. Give the main
advantage and disadvantage of each of these approaches.
\mypercent{15}

\begin{answer}
The key difference between the three approaches is the level of analysis
which is applied to the source text.

[P1] (5\%) Direct approaches apply very little analysis to the source text
and rely on a simple translation of each word in the source text.
% Statistical MT could be considered to be a direct approach.
Chief advantage is simplicity; chief weakness is inability to deal with
different word ordering in different languages.
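The word-ordering weakness can be illustrated with a minimal word-for-word translator in Python (the three-word lexicon is a hypothetical toy, not from any real system):

```python
# Hypothetical toy lexicon mapping English words to French words.
lexicon = {"the": "la", "red": "rouge", "car": "voiture"}

def direct_translate(sentence):
    # A direct system translates each word independently and keeps the
    # source word order -- it has no mechanism for reordering.
    return " ".join(lexicon.get(word, word) for word in sentence.split())

# English adjective-noun order is preserved, but French places this
# adjective after the noun: the correct output is "la voiture rouge".
print(direct_translate("the red car"))  # prints "la rouge voiture"
```

A transfer-based system could instead parse the noun phrase and apply a reordering rule during generation.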
[P2] (5\%) Transfer approaches attempt to analyse the structure of the source
text to produce an intermediate representation. The intermediate
representation of a sentence from the input text is then used to generate the
translated sentence in the target language. Transfer approaches can employ
syntactic and/or semantic representations. Chief advantage is the ability to
deal with different word ordering in different languages; chief weakness is
the requirement for a parser and generator for each language pair.

[P3] (5\%) Interlingual approaches rely on a representation of the meaning
which is independent of both the source and target language. The source
sentence goes through syntactic and semantic analysis to be translated into
the interlingua. This representation is then used to generate a sentence in
the target language. The difference between transfer approaches which use
semantic representations and interlingual approaches rests on the
independence of the system used to represent meaning: interlinguas are
completely independent of the source and target languages, while the
representation used in semantic transfer simply aims to capture enough
information to allow translation. Chief advantage is that for each new
language only one analyser and one generator are needed (rather than one per
language pair as in transfer approaches); main disadvantage is that
formulating an interlingua to represent meaning for all languages has proved
impossible to date.
\end{answer}
\end{qupart}

\begin{qupart}
%Describe the two main models of a standard {\em phrase-based} approach to
%statistical machine translation.
%Explain how these models are combined and how they are applied to generate a
%translation for a new segment.
%\mypercent{30}
%
%\begin{answer}
%\bigskip
%
%% The idea is to model $P(E|F)$, but this is difficult to do directly.
%% Instead, we break things apart to get good translations even if the
%% probability numbers are not that accurate.
%The two main models of a PBSMT system are $P(F|E)$ (translation model) and
%$P(E)$ (language model).
%
%[P1](7\%) $P(F|E)$: faithfulness model -- will ensure that a translation $E$
%will have words that generally translate to words in $F$.
%
%\medskip
%
%[P2](7\%) $P(E)$: fluency model -- will ensure that $E$ reads well and is
%grammatical in the target language.
%
%\medskip
%
%% Other models include a reordering model, which accounts for the fact that
%% the order of words or phrases in the target language might be different
%% from that of the source language, and a penalty on the number of words
%% and/or phrases to bias translations that are closer in length to the
%% source and that use fewer (but longer) phrases.
%
%[P3](8\%) These as well as the other models are computed as independent
%functions $h$ and combined using a log-linear model of the type
%$\sum_{i=1}^{m}\lambda_i h_i$, where $\lambda_i$ is the weight of each model.
%
%\medskip
%
%[P4](8\%) To translate a new source segment, a {\it decoder} applies this
%combined model to find the translation that covers all source words and
%maximises the weighted joint translation score $\sum_{i=1}^{m}\lambda_i h_i$
%for that source segment.
%% to segment the new source sentence in different possible ways and build a
%% search graph with possible translation options from the phrase dictionary.
%% The paths in this graph are weighted according to the model components.
%\end{answer}

\begin{exlist}

\exitem What is the noisy channel model and how can it be applied to machine
translation?
\mypercent{15}

\begin{answer}
[P1] (7\%) The noisy channel model is a general model of communication
developed by Claude Shannon. In the model a message is sent from a source via
a noisy channel to a recipient. The noisy channel distorts the message to
some degree, but the distorted signal arriving at the recipient is assumed to
depend probabilistically on the source message.
The challenge for the recipient is to decode the distorted signal and recover
the original message by learning a model of the noise introduced by the
channel.

[P2] (8\%) The model is applied to translation as follows. Suppose we want to
translate French to English. We assume the source message is in English but
has been distorted by the transmission channel and arrives at the recipient
in French. The recipient must decode the signal (French) and recover the
original English source.
\end{answer}

\exitem State the fundamental probabilistic equation formalising the noisy
channel model for machine translation and explain how it relates to that
model. Show how the equation can be rewritten using Bayes' Theorem and then
simplified. Be sure to state in words what each of the terms in the equation
is.
\mypercent{15}

\begin{answer}
[P1] (5\%) The fundamental probabilistic equation formalising the noisy
channel model for machine translation is:
$E^\ast = \myargmax{E} P(E|F)$

[P2] (5\%) Here $E^\ast$ is the best estimate of a translation into English
of the French sentence $F$. $E^\ast$ is that English sentence $E$ that is the
most probable English message (source) to have given rise to the observed
French sentence $F$ (signal) that is the output of the noisy channel.

[P3] (5\%) The right-hand side of this equation can be rewritten using
Bayes' Theorem as follows:
$\begin{array}{rcll}
E^\ast &=& \myargmax{E} P(E|F) \\
&=& \myargmax{E} \frac{P(F|E) \cdot P(E)}{P(F)} &
\end{array}$
This can then be simplified to
$\begin{array}{rcll}
E^\ast &=& \myargmax{E} P(F|E) \cdot P(E) &
\end{array}$
since the term in the denominator, $P(F)$, is the same for all $E$ and hence
has no effect on which $E$ maximises the right-hand side of the equation.
\end{answer}

%\exitem Show how the equation of 2(b)(ii) can be rewritten using Bayes'
%Theorem and then simplified.
%\mypercent{10}

\exitem The simplified equation of 2(b)(ii) has three components that need to
be implemented to build a working machine translation system. Name each of
these components and describe briefly what its role in the translation system
is.
\mypercent{15}

\begin{answer}
The three components are:
\begin{description}
\item[P1 (5\%)] $P(F|E)$: the translation model. $P(F|E)$ ensures that $E$ is
a \textit{faithful translation} of $F$, since it is higher the more likely it
is that $E$ is the original message that gave rise to the observed, distorted
signal $F$.
\item[P2 (5\%)] $P(E)$: the language model. $P(E)$ is the prior probability
of a candidate translation $E$. It ensures that a \textit{fluent} candidate
$E$ is chosen, as it is larger the more probable $E$ is to occur as a string
of English.
\item[P3 (5\%)] $\myargmax{E}$: the decoder. Given translation and language
models, we additionally need an effective way to search the space of possible
$E$'s to arrive at an $E$ that maximises the product of $P(F|E)$ and $P(E)$.
The decoder is the component that carries out this search.
\end{description}
\end{answer}

\end{exlist}
\end{qupart}

\begin{qupart}
%Explain the intuition behind IBM Model 1 in the context of Statistical
%Machine Translation (SMT).
%Give the most important outcome of this model for an SMT system.
%Give one direction in which this model can be improved.
%\mypercent{30}
%
%\begin{answer}
%[P1](15\%)
%IBM Model 1 is a model for word alignment between bilingual segments, i.e.,
%it identifies likely correspondences between two languages at word level.
%The model is trained in an unsupervised way on a (large) parallel training
%corpus where sentences have been aligned, i.e., a corpus of sentence pairs
%which are translations of each other.
%The algorithm is a direct application of the Expectation Maximization (EM)
%algorithm:
%
%1) Initially consider all alternative word alignments as equally likely
%
%2) Observe across sentences that (e.g.)
%source word 'x' often links to target word 'y'
%
%3) Increase the probability of this word pair aligning
%
%4) Iteratively redistribute probabilities until they stop changing
%(convergence)
%
%\medskip
%
%[P2](5\%) Once the corpus is aligned, a dictionary of source--target word
%pairs with translation probabilities can be extracted by simply taking the
%output of the final iteration.
%This dictionary can then be used as the most basic component in SMT.
%
%\medskip
%
%[P3](10\%)
%Possible directions (only one necessary):
%1) Add a 'fertility' component, i.e., words in the source language can
%generate (align to) multiple words in the target language
%2) Add a component to model the absolute position of words in the two
%sentences, i.e., words in certain positions in the source segment tend to
%align to words in certain positions in the target segment; for example, to
%learn that the first word in the source segment may be more likely to align
%to the first or one of the first words in the target segment than to one of
%the last words in the target segment
%3) Add a component to model word order dependencies, i.e., the fact that the
%alignment of one source word to a given target word may depend on the
%neighbouring words in one or both languages
%\end{answer}

Explain in a general way how word alignments are learnt from a parallel
corpus in IBM Model 1. Full mathematical details are not necessary.
\mypercent{20}

\begin{answer}
IBM Model 1 is a model for word alignment between bilingual segments, i.e.,
it identifies likely correspondences between two languages at word level.
The model is trained in an unsupervised way on a (large) parallel training
corpus where sentences have been aligned, i.e., a corpus of sentence pairs
which are translations of each other.
The algorithm is a direct application of the Expectation Maximization (EM)
algorithm:

1) Initially consider all alternative word alignments as equally likely

2) Observe across sentences that (e.g.)
source word 'x' often links to target word 'y'

3) Increase the probability of this word pair aligning

4) Iteratively redistribute probabilities until they stop changing
(convergence)

\medskip

Once the corpus is aligned, a dictionary of source--target word pairs with
translation probabilities can be extracted by simply taking the output of the
final iteration.
%This dictionary can then be used as the most basic component in SMT.
\end{answer}
\end{qupart}

\begin{qupart}
% Explain two alternative approaches to BLEU -- one based on human judgements
% and one based on automatic metrics -- that can be used for evaluating
% Machine Translation systems, indicating their advantages and disadvantages.
% \mypercent{20}
% \begin{answer}
% Various approaches are possible for both manual and automatic evaluation --
% some possible answers (only 2 are necessary).
% 1) One way is to rely on human judges, who compare system outputs to source
% sentences or manually-created reference translations, and attempt to assess
% the rate of incorrect sentences produced by the system by assessing
% criteria such as fluency and adequacy.
% 2) Alternatively, subjects might read the system outputs and then complete
% a comprehension test, to measure the effectiveness of the translations in
% communicating the content of the source document.
% These two methods have the advantage of being more reliable, since they
% involve humans. However, such human-based approaches are very costly in
% time/money, and can only occasionally be repeated (if at all), i.e. they
% cannot frequently be redone to aid the system development process.
%In addition, human assessment of translation quality is a very subjective
%task, prone to disagreement.
% 3) Another possibility is to use the MT systems being evaluated as a
% component in some larger system, e.g. for cross-language IR or QA.
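The four EM steps in the IBM Model 1 answer above can be sketched in Python (the three-sentence parallel corpus is a hypothetical toy; the NULL word and other IBM Model 1 details are omitted for clarity):

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: (French words, English words) pairs.
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
    (["une", "maison"], ["a", "house"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# 1) Initially consider all alignments equally likely: uniform t(f|e).
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(20):  # 4) iterate until probabilities (roughly) converge
    pair_count = defaultdict(float)  # expected count of f aligning to e
    e_total = defaultdict(float)     # expected count of e aligning to anything
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                # 2)-3) pairs that co-occur across sentences accumulate weight
                pair_count[(f, e)] += t[(f, e)] / norm
                e_total[e] += t[(f, e)] / norm
    # Redistribute: re-estimate t(f|e) from the expected counts.
    for (f, e) in t:
        t[(f, e)] = pair_count[(f, e)] / e_total[e] if e_total[e] else 0.0

# The converged table is the probabilistic dictionary mentioned above:
# 'maison' ends up as the most likely translation of 'house'.
best = max(f_vocab, key=lambda f: t[(f, "house")])
print(best)  # prints "maison"
```

Note how 'la' co-occurs with both 'the' and 'flower', so EM gradually shifts its probability mass towards 'the', freeing 'maison' to claim 'house'.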
% Whatever means is available for evaluating/benchmarking the performance of
% that larger system can then be applied to determine whether the system
% works better or worse with one MT system than another, and hence, by
% implication, which of the MT systems is more effective. This is, however, a
% rather indirect means of evaluating how good the MT system is at producing
% translations, so its results have unclear status, and the approach will
% also carry with it whatever costs are associated with evaluation of the
% larger system.
% 4) The back-translation or round-trip translation evaluation approach
% translates the original text, written in L1, into another language L2 and
% then back into L1. The quality of the translation is evaluated by checking
% how close the text produced is to the original text.
% The advantage of this approach is that it does not require knowledge of
% another language (L2) or significant resources (such as reference
% translations). The disadvantage is that two Machine Translation systems are
% involved (L1 to L2 and L2 to L1). The approach is cheap, but it cannot
% distinguish between them and cannot tell which MT system introduced any
% errors. In addition, one MT system may fix errors made by the other, but
% this will not be detected.
%In addition, the basic premise behind this approach, that the text produced
%by round-trip translation should be the same as the original text, is
%flawed, since human translators would not necessarily produce the same text.
% Other metrics for automatic evaluation include Translation Error Rate (TER)
% and Meteor.
% \end{answer}

Explain briefly how the BLEU measure, which is used to automatically evaluate
the quality of machine translated texts, is calculated.
\mypercent{20}

\begin{answer}
BLEU assumes that the closer a machine translation is to a professional human
translation, the better it is. The approach therefore compares a candidate
translation to one or more human-created reference translations.
%The human reference translations are costly to produce, but endlessly
%reusable.

Comparison approach:
\begin{itemize}
\item \textit{Similarity} between MT output and reference measured as:
  \begin{itemize}
  \stretchit{2}
% \item count word n-grams MT and reference(s) have in common
%   \begin{itemize}
%   \item n-grams here are just sequences of words
%   \end{itemize}
  \item compute word n-gram overlap between candidate translation and
    reference translation for each $n=1$--$4$, and then combine these into a
    single score
  \item for $n=1$ the measure assesses word choice
  \item for $n>1$ the measure assesses both word choice and word order
  \item this gives an n-gram precision measure:
    \hspace*{20mm}
    \begin{tabular}{c}
    \it correct n-grams in candidate \\ \hline
    \it total n-grams in candidate
    \end{tabular}
  \end{itemize}
\item Count clipping
  \begin{itemize}
  \stretchit{2}
  \item Counts are clipped so that no n-gram counts more than the maximum
    number of times it appears in the reference translations. Otherwise a
    candidate translation that consisted of multiple occurrences of one
    correctly translated word could obtain a perfect score.
  \item Scores are computed over a {\it multi-sentence test set}
    \smallskip
    \begin{itemize}
    \stretchit{2}
    \item system produces a set of candidate translations ($\{Cand\}$)
    \item then, compute corpus precision for n-grams of length $n$ as:
      $p_n = \frac{\sum_{C\in\{Cand\}}\sum_{ngram\in C}
      Count_{clip}(ngram)}{\sum_{C'\in\{Cand\}}\sum_{ngram'\in C'}
      Count(ngram')} \rule{20mm}{0mm}$
    \end{itemize}
  \end{itemize}
\item Precision scores for the different n-gram lengths 1--4 are combined as
  a geometric mean, i.e.:
  \vspace*{-5mm}
  $(p_1 \cdot p_2 \cdot p_3 \cdot p_4)^\frac{1}{4} \rule{10mm}{0mm}$
  % harsh: if any $p_n = 0$, the average = 0
\item BLEU also contains a brevity penalty, since otherwise ``sentences''
  that correctly translate a few words but are very much shorter than the
  required length can score highly (since the measure above is a precision
  measure only).
  The brevity penalty is: if len($C$) $\geq$ len($R$), then $BP=1$, else
  $BP < 1$, where $C$ is the candidate translation and $R$ is the reference
  translation.
  The final BLEU score is:
  ${\it BLEU} = BP \times (p_1 \cdot p_2 \cdot p_3 \cdot p_4)^\frac{1}{4}$
\end{itemize}
% \item reference to key paper at end of these slides
\end{answer}
\end{qupart}

\end{question}
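The calculation described in the answer above can be pulled together in a short Python sketch. Simplifying assumptions: a single reference, whitespace tokenisation, and the standard exponential form $e^{1 - \mathrm{len}(R)/\mathrm{len}(C)}$ for the $BP < 1$ case, which the answer leaves unspecified:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-word sequences in the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, for illustration."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip: an n-gram counts at most as often as it occurs in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    # Geometric mean of p_1..p_4 -- harsh: zero if any p_n is zero.
    if min(precisions) == 0:
        geo_mean = 0.0
    else:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: BP = 1 if len(C) >= len(R), else BP < 1.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
print(bleu(ref, ref))  # identical candidate scores 1.0
```

A degenerate candidate such as "the the the the" scores 0.0 despite its unigrams matching, because clipping caps the unigram count at 2 and no bigram matches.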