% Question on SMT
\begin{question}



\begin{qupart}
Explain the differences between \textit{direct}, \textit{transfer-based} and \textit{interlingual} approaches to machine translation.  Give the main advantage and disadvantage of each of these approaches.
\mypercent{15}

\begin{answer}

The key difference between the three approaches is the level of
analysis which is applied to the source text. 

[P1] (5\%) Direct approaches apply very little analysis to the source text
and rely on simple word-for-word translation of the source
text. % Statistical MT could be considered to be a direct approach.
The chief advantage is simplicity; the chief weakness is the inability to deal with differences in word order between languages.

[P2] (5\%) Transfer approaches attempt to analyse the structure of the source
text to produce an intermediate representation. The intermediate
representation of a sentence from the input text is then used in
generating the translated sentence in the target language. Transfer
approaches can employ syntactic and/or semantic representations.
The chief advantage is the ability to deal with differences in word order between languages; the chief weakness is the requirement for a parser and generator for each language pair.

[P3] (5\%) Interlingual approaches rely on a representation of the meaning which
is independent of both the source and target language. The source
sentence goes through syntactic and semantic analysis to be translated
into the interlingua. This representation is then used to generate a
sentence in the target language. The difference between transfer
approaches which use semantic representations and interlingua
approaches rests on the independence of the system used to represent
meaning; interlinguas are completely independent of the source and
target languages while the representation used in semantic transfer
simply aims to capture enough information to allow translation.
The chief advantage is that each new language requires only one analyser and one generator (rather than one per language pair, as in transfer approaches); the main disadvantage is that formulating an interlingua capable of representing meaning for all languages has so far proved impossible.
\end{answer}
\end{qupart}


\begin{qupart}
%Describe the two main models of a standard {\em phrase-based} approach to statistical machine translation. 
%Explain how these models are combined and how they are applied to generate a translation for a new segment. 
%\mypercent{30}
%
%\begin{answer}
%\bigskip
%
%% The idea is to model $P(E|F)$, but this is difficult to be done directly. Instead, we break things apart to get good translations even if the probability numbers are not that accurate. 
%The two main models of a PBSMT system are $P(F|E)$ (translation model) and $P(E)$ (language model). 
%
%[P1](7\%) $P(F|E)$: faithfulness model - will ensure that a translation $E$ will have words that generally translate to words in $F$. 
%
%\medskip
%
%[P2](7\%) $P(E)$: fluency model - will ensure that $E$ reads well and is grammatical in the target language. 
%
%\medskip
%
%% Other models include a reordering model, which accounts for the fact that the order of words or phrases in the target language might be different from that of the source language, a penalty on the number of words and/or phrases to bias translations that are closer in length to the source and that use fewer (but longer) phrases. 
%
%[P3](8\%) These as well as the other models are computed as independent functions $h$ and combined using a log-linear model of the type $\sum_{i=1}^{m}\lambda_ih_i$, where $\lambda_i$ is the weight of each model. 
%
%\medskip
%
%[P4](8\%) To translate a new source segment, a {\it decoder} applies this combined model to find the translation that covers all source words and maximises the weighted joint translation probability $\sum_{i=1}^{m}\lambda_ih_i$ 
%for that source segment.
%% to segments the new source sentence in different possible ways and builds a search graph with possible translation 
%% options from the phrase dictionary. %The paths in this graph are weighted according to the model components and 
%\end{answer}

\begin{exlist}
\exitem What is the noisy channel model and how can it be applied to machine translation? \mypercent{15}

\begin{answer}
[P1] (7\%) The noisy channel model is a general  model of communication developed by Claude Shannon. In the model a message is sent from a source via a noisy channel to a recipient. The noisy channel distorts the message to some degree, but the distorted signal  arriving at the recipient is assumed to depend probabilistically on the source message. The challenge for the recipient is to decode the distorted signal and recover the original message by learning a model of the noise introduced by the channel.

[P2] (8\%) The model is applied to translation as follows. Suppose we want to translate French to English. We assume the source message is in English but has been distorted by the transmission channel and arrives at the recipient in French. The recipient must decode the signal (French) and recover the original English source.
\end{answer}

\exitem State the fundamental probabilistic equation formalising the noisy channel model for machine translation and explain how it relates to that model. Show how the equation can be rewritten using Bayes' Theorem and then simplified. Be sure to state in words what each of the terms in the equation is. \mypercent{15}

\begin{answer}

[P1] (5\%) The fundamental probabilistic equation formalising the noisy channel model for machine translation is:

\[ E^\ast = \myargmax{E} P(E|F) \]

[P2] (5\%) Here $E^\ast$ is the best estimate of a translation into English of the French sentence $F$: it is the English sentence $E$ that is the most probable English message (source) to have given rise to the observed French sentence $F$ (signal) output by the noisy channel.

[P3] (5\%) The right-hand side of this equation can be rewritten using Bayes' Theorem as follows:

\[
\begin{array}{rcll}
E^\ast &=& \myargmax{E} P(E|F) \\ %
&  = & \myargmax{E} \frac{P(F|E) \cdot P(E)}{P(F)} &
\end{array}
\]

This can then be simplified to 
\[ 
\begin{array}{rcll}
E^\ast &=&  \myargmax{E} P(F|E)  \cdot P(E)&
\end{array}
\]
since the term in the denominator, $P(F)$, is the same for all $E$ and hence has no effect on which $E$ maximises the right-hand side of the equation.

\end{answer}

%\exitem Show how the equation of of 2. b) (ii) can be rewritten using Bayes Theorem and then simplified. \mypercent{10}

\exitem  The simplified equation of 2(b)(ii) has three components that need to be implemented to build a working machine translation system. Name each of these components and describe briefly what its role in the translation system is. \mypercent{15}

\begin{answer}
The three components are:
\begin{description}
\item[P1(5\%)] $P(F|E)$: the translation model. $P(F|E)$ ensures that $E$ is a \textit{faithful} translation of $F$, since it is higher the more likely it is that $E$ is the original message that gave rise to the observed, distorted signal $F$.

\item[P2 (5\%)]  $P(E)$: the language model. $P(E)$ is the prior probability of a candidate translation $E$. It ensures that a \textit{fluent} candidate $E$ is chosen, as it is larger the more probable $E$ is to occur as a string of English.

\item[P3 (5\%)]  $\myargmax{E}$: the decoder. Given translation and language models, we additionally need an effective way to search the space of possible $E$'s to arrive at an $E$ that maximises the product of $P(F|E)$ and $P(E)$. The decoder is the component that carries out this search (a toy sketch is given below).
\end{description}
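
\medskip

For illustration only (not required for full marks), the following Python sketch shows how the three components interact on a toy example. The translation table \texttt{tm}, the unigram stand-in for the language model \texttt{lm} and all probabilities are invented; a real decoder uses a sentence-level language model and searches a vastly larger space with beam search rather than brute force.

\begin{verbatim}
import itertools

# Toy translation model P(f|e): probability of a French word given an
# English word. All entries are invented, for illustration only.
tm = {
    ("maison", "house"): 0.8, ("maison", "home"): 0.2,
    ("la", "the"): 0.9, ("la", "it"): 0.1,
}

# Toy unigram stand-in for the language model P(E) (invented numbers).
lm = {"the": 0.4, "house": 0.3, "home": 0.2, "it": 0.1}

def decode(french):
    """Brute-force decoder: argmax over E of P(F|E) * P(E)."""
    options = [[e for (f, e) in tm if f == fw] for fw in french]
    best, best_score = None, 0.0
    for english in itertools.product(*options):
        score = 1.0
        for fw, ew in zip(french, english):
            score *= tm[(fw, ew)] * lm[ew]  # translation model * LM
        if score > best_score:
            best, best_score = english, score
    return best

print(decode(["la", "maison"]))  # -> ('the', 'house')
\end{verbatim}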

\end{answer}

\end{exlist}
\end{qupart} 


\begin{qupart}
%Explain the intuition behind the IBM Model 1 in the context of Statistical Machine Translation (SMT). 
%Give the most important outcome of this model for an SMT system.
%Give one direction in which this model can be improved. 
%\mypercent{30}
%
%\begin{answer}

%[P1](15\%)
%IBM Model 1 is a model for word-alignment between bilingual segments, i.e. to identify likely correspondences between two languages at word level. 
%The model is trained in an unsupervised way based on a (large) training parallel corpus where sentences have been aligned, i.e., 
%a corpus of sentence pairs which are translation of each other. 
%The algorithm is a direct application of the Expectation Maximization (EM) algorithm: 
%
%1) Initially consider all alternative word alignments as equally likely
%
%2) Observe across sentences that (e.g.) source word 'x' often links to target word 'y'
%
%3) Increase probability of this word pair aligning
%
%4)  Iteratively redistribute probabilities, until probabilities stop changing (convergence)
%
%\medskip
%
%[P2](5\%) Once the corpus is aligned, a dictionary of source-target word pairs with translation probabilities can be extracted by simply taking the output of the final iteration.
%This dictionary can then be used as the most basic component in SMT. 
%
%\medskip
%
%[P2](10\%)
%Possible directions (only one necessary): 
%1) add a 'fertility' component, i.e., words in the source language can generate (align too) multiple words in the target language
%2) Add component to model absolute position of the words in the two sentences, i.e., words in the certain positions in the source segment tend to align to words in certain positions in the target segment, for example, to learn that the first word in the source segment may be more likely to align to the first or one of the first words in the target language as opposed to one of the last words in the target language
%3) Add a component to model word order dependencies, i.e., the fact the alignment of one source word to a given target word may be dependent on the neighboring words in one of both languages
%\end{answer}


Explain in a general way how word alignments are learnt from a parallel corpus in IBM Model 1. Full mathematical details are not necessary. \mypercent{20}


\begin{answer}
IBM Model 1 is a model for word-alignment between bilingual segments, i.e. to identify likely correspondences between two languages at word level. 

The model is trained in an unsupervised way on a (large) parallel training corpus in which sentences have been aligned, i.e.\ a corpus of sentence pairs that are translations of each other.

The algorithm is a direct application of the Expectation Maximization (EM) algorithm: 

\begin{enumerate}
\item Initially consider all alternative word alignments as equally likely.
\item Observe across sentences that (e.g.) source word $x$ often links to target word $y$.
\item Increase the probability of this word pair aligning.
\item Iteratively redistribute the probabilities until they stop changing (convergence).
\end{enumerate}

\medskip

Once the corpus is aligned, a dictionary of source-target word pairs with translation probabilities can be extracted by simply taking the output of the final iteration.
%This dictionary can then be used as the most basic component in SMT. 
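
\medskip

A minimal Python sketch of this EM procedure follows (for illustration only; the corpus is invented, and the NULL source word used in practice is omitted):

\begin{verbatim}
from collections import defaultdict

# Invented toy parallel corpus: (foreign, English) sentence pairs.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}

# Step 1: all translation probabilities t(f|e) start out uniform.
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(20):                       # repeat until convergence
    count = defaultdict(float)            # expected counts c(f, e)
    total = defaultdict(float)            # expected counts c(e)
    # E-step: collect fractional counts over all possible alignments.
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                frac = t[(f, e)] / norm   # weight of this alignment link
                count[(f, e)] += frac
                total[e] += frac
    # M-step: re-estimate t(f|e) from the expected counts.
    for f, e in t:
        t[(f, e)] = count[(f, e)] / total[e]

print(round(t[("haus", "house")], 2))     # tends towards 1.0
\end{verbatim}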
\end{answer}

\end{qupart} 


 \begin{qupart}
% Explain two alternative approaches to BLEU -- one based on human judgements and one based on automatic metrics -- that can be used for evaluating Machine
% Translation systems, indicating their advantages and disadvantages. 
% \mypercent{20}

% \begin{answer}

% Various approaches are possible for both manual and automatic evaluation - some possible answers (only 2 are necessary). 

% 1) One way is to rely on human judges,
% who compare sys`tem outputs to source sentences or manually-created reference translations,
% and attempt to assess the rate of incorrect sentences produced by the
% system by assessing criteria such as fluency and adequacy. 

% 2) Alternatively, subjects might read the system outputs and then
% complete a comprehension test, to measure the effectiveness of the
% translations in communicating the content of the source document.

% These two methods have the advantage of being more reliable, since they involve humans. However, such human-based approaches are very costly of
% time/money, and can only occasionally be repeated (if at all),
% i.e. cannot frequently be redone to aid the system development
% process. %In addition, human assessment of translation quality is a very subjective task, prone to disagreement.

% 3) Another possibility is to use the MT systems being evaluated as a
% component in some larger system, e.g. for cross-language IR or
% QA. Whatever means is available for evaluating/benchmarking the
% performance of that larger system can then be applied to determine
% whether the system works better or worse with one MT system than
% another, and hence, by implication, which of the MT systems is more
% effective. This is, however, a rather indirect means of evaluating the
% how good the MT system is at producing translations, so that its
% results have unclear status, and the approach will also carry with it
% whatever costs are associated with evaluation of the larger system. 

% 4) The back-translation or round-trip translation evaluation approach translates the original text, written in L1, into another language L2 and then back into L1. The quality of the translation is evaluated by checking how close the text produced is to the original text. 
% The advantages are that this approach is that it does not require knowledge of another language (L2) or significant resources (such as reference translations). The disadvantages are that two Machine Translation systems are involved (L1 to L2) and (L2 to L1). The approach is cheap but it cannot distinguish between them and cannot tell which MT system introduced any errors. In addition, one MT system may fix errors made by the other but this will not be detected. %In addition the basic premise behind this approach, that the text produced by round trip translation should be the same as the original text is flawed since human translators would not necessarily produce the same text. 

% Other metrics for automatic evaluation include Translation Error Rate (TER) and Meteor. 

% \end{answer}

Explain briefly how the BLEU measure, which is used to automatically evaluate the quality of machine-translated texts, is calculated. \mypercent{20}

\begin{answer}
BLEU assumes that the closer a machine translation is to a  professional human translation, the better it is. The approach therefore compares a candidate translation to one or more  human-created reference translations.
%The human reference translations ar costly to produce, but endlessly reusable. 

Comparison approach:
\begin{itemize}

\item \textit{Similarity} between MT output and reference is measured as follows:

  \begin{itemize}
\stretchit{2}
%  \item count word n-grams MT and reference(s) have in common 
  
%  \begin{itemize}
%   \item n-grams here are just sequences of words
%   \end{itemize}

  \item compute the word n-gram overlap between the candidate translation and the reference translation for each $n=1,\ldots,4$, and then combine these into a single score
  \item for $n=1$ the measure assesses word choice
  \item for $n>1$ the measure assesses both word choice and word order
  \item this gives an n-gram precision measure:

  \[ \frac{\textit{correct n-grams in candidate}}{\textit{total n-grams in candidate}} \]
  \end{itemize}

\item Count clipping
\begin{itemize}
\stretchit{2}
\item Counts are clipped so that no n-gram is counted more than the maximum
  number of times it appears in the reference translations. Otherwise a candidate translation consisting of multiple occurrences of a single correctly translated word could obtain a perfect score.

\item Scores are computed over a {\it multi-sentence test set}

\smallskip

\begin{itemize}
\stretchit{2}
\item system produces a set of candidate translations ($\{Cand\}$)

\item then, compute corpus precision for n-grams of length $n$ as: 

\[ p_n = \frac{\sum_{C\in\{Cand\}}\sum_{\mathit{ngram}\in C}
Count_{clip}(\mathit{ngram})}{\sum_{C'\in\{Cand\}}\sum_{\mathit{ngram}'\in
  C'} Count(\mathit{ngram}')} \]


\end{itemize}
\end{itemize}


\item Precision scores for the different n-gram lengths 1--4 are combined as a
 geometric mean, i.e.: \vspace*{-5mm}

\[ (p_1 \cdot p_2 \cdot p_3 \cdot p_4)^{\frac{1}{4}} \]

% harsh: if any $p_n$ = 0, average = 0

\item BLEU also contains a brevity penalty, because otherwise ``sentences'' that correctly translate a few words but are much shorter than the reference could score highly (the measure above being a precision measure only).

The brevity penalty is: if len($C$)$\,\geq\,$len($R$) then $BP=1$, else $BP = e^{\,1-{\rm len}(R)/{\rm len}(C)} < 1$, where $C$ is the candidate translation and $R$ is the reference translation.

The final BLEU score (a minimal computational sketch follows this list) is:

\[ \mathit{BLEU} = BP \times (p_1 \cdot p_2 \cdot p_3 \cdot p_4)^{\frac{1}{4}} \]

\end{itemize}

%   \item reference to key paper at end of these slides
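
\medskip

For illustration only, a minimal Python sketch of the calculation for a single candidate against a single reference (real BLEU is computed over a whole test set, as above, and may use several references per segment):

\begin{verbatim}
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipping: an n-gram counts at most as often as in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        precisions.append(clipped / total if total else 0.0)
    # Geometric mean of p_1..p_4 (zero if any p_n is zero).
    if min(precisions) == 0:
        gm = 0.0
    else:
        gm = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: 1 if candidate is at least reference length.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * gm

cand = "the the the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(round(bleu(cand, ref), 3))  # clipping penalises repeated 'the'
\end{verbatim}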




\end{answer}

\end{qupart} 



\end{question}