
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% classes and packages
 
%% document class
\documentclass[12pt]{article}


\usepackage[a4paper,foot=1cm,%
            top=3cm,bottom=2cm]{geometry}


%% standard packages
\usepackage{natbib,pslatex,epsfig,fancyhdr,amsmath}

%% interline spacing
\renewcommand{\baselinestretch}{.95}

%personal-packages 
\usepackage{markup}

\usepackage{extramarks} 
% headings
\pagestyle{fancy}
\lhead{} 
\chead{}
\rhead{\bfseries COM3110} 
\lfoot{\bfseries COM3110} 
\cfoot{\thepage} 
\rfoot{\firstmark}
\extramarks{}{\bfseries TURN OVER}
\extramarks{}{}
\renewcommand{\headrulewidth}{0pt} 
\renewcommand{\footrulewidth}{0pt}


%% define argmax
\newcommand{\argmax}{\operatornamewithlimits{argmax}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% spacing
%% page format

\frenchspacing
\sloppy

%% paragraph spacing
\parindent = 0em        
\parskip = 0em

%% list indentation
%\leftmargini = \parindent
%\leftmarginii = \parindent

%% example indentation 
%\exampleindent = \leftmargini



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% header


\title{\vspace{-5ex}\textbf{\epsfig{file=crest_white.ps,width=3.2cm}\\The University of Sheffield}}
\author{}

\date{}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% document body

\begin{document}

\maketitle

\thispagestyle{fancy}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace{-6ex}
\textbf{DEPARTMENT OF COMPUTER SCIENCE} 
\vspace{1ex}

\textbf{Autumn Semester 2003-2004} \hspace{12em} \textbf{2 hours}
\vspace{1ex}

\textbf{TEXT PROCESSING}
\vspace{1ex}

\textbf{Answer {\sc three} questions. All questions carry equal
  weight.  Figures in square brackets indicate the percentage of
  available marks allocated to each part of a question.}

\newpage 
\begin{enumerate}

\item Provide short answers to the following questions.

 \begin{enumerate}

\item What is mutual information? Give a formula as part of your
  answer. 

\hfill[10\%]

\item What is cross entropy?  Describe one way in which cross entropy
  is used in statistical text processing. Give a formula as part of
  your answer.

\hfill[10\%]

\item Probabilistic language models are very popular in text
  processing. Using the notation of conditional probabilities, write
  down an expression for the exact probability of a sentence
  containing $l$ words $w_1 w_2 \dots w_l$. Now write down an
  approximation to this. The approximation should use only unigram and
  bigram probabilities (i.e.,~those having the form $P(w_{k})$ where
  $k \in 1 \dots l$ or $P(w_i|w_{i-1})$ where $i \in 2 \dots l$).
  Briefly explain what information is used in the first expression but
  ignored in the second. Why might one choose to use the second form
  of the expression in preference to the first?

\hfill[10\%]

\item What is smoothing? Why is it important? Name and briefly
  describe two distinct methods of smoothing probability distributions
  which arise in probabilistic language models. Give formulas as part
  of your answer.


\hfill[20\%]

\item What is the purpose of the following pipeline of Unix commands? 

 {\tt  cat genesis.txt | tr -cs 'A-Za-z' '$\backslash$012' | sort | uniq -c |
 sort -nr}

What is the definition of a word which this pipeline is implicitly
adopting?  

\hfill[10\%]

\item Specify a pipeline of Unix tools which takes as input a plain
  text file and produces as output a list of words occurring in the
  file 20 times. 

\hfill[10\%]

\item State clearly what the following Perl program would print out
  when run, giving reasons.  

\begin{verbatim}
@y = ('a', 'b', 'c');  # line 1
print @y; 
$x = @y;               # line 2
print $x;              
$y[$x] = $y[@y - 1];   # line 3
print @y; 
\end{verbatim}

\hfill[20\%]

% \item  What is printed out by each of the following Perl programs? 
% \begin{enumerate}
% \item 
% \begin{verbatim}
% $_ = 'bananarama'; 
% s/(an)*a/$1; 
% print; 
% \end{verbatim}
% 
% \item 
% \begin{verbatim}
% $_ = 'uggawuggahuggamugga'; 
% s/((a)(.)(u))/$2$4/g; 
% print; 
% \end{verbatim}
% \hfill[10\%]
% \end{enumerate}
\textbf{QUESTION CONTINUES ON NEXT PAGE}
\newpage
\item The Perl regular expression 

\begin{verbatim}
/^((a|b)+)\1$/
\end{verbatim}

will match which of the following strings? 

\begin{verbatim}
aa
ab
ba
aba
abba
abab
aababba
babbab
baaa
baab
\end{verbatim}
\end{enumerate}

\hfill[10\%]


\item Write a Perl program which reads standard input (up to the end
  of file) and prints out an integer representing the number of
  occurrences in the input data of the most frequently occurring word
  in that data, along with a note of the word.

 For example, if the input data were: 

\begin{verbatim}
the man with the glasses
saw the woman with the saw
\end{verbatim}

then the most frequently occurring word would be ``the'', occurring 4
times, so the output should be: 

\begin{verbatim}
the : 4
\end{verbatim}

If there are several equally frequent words in the text, they should
all be printed out with the number of occurrences. For example if the
input data were: 

\begin{verbatim}
the glasses with the glasses
saw the glasses with the glasses
\end{verbatim} 

then the most frequently occurring words would be ``the'' and
``glasses'', occurring 4 times, so the output should be: 

\begin{verbatim}
the : 4
glasses : 4
\end{verbatim}

Take a word to be any consecutive sequence of alphabetic characters
separated by white space from any adjacent word. Assume the input
contains only alphabetic characters, spaces, and linebreaks, i.e.,
there is no punctuation or numeric data. 

\hfill [100\%]

\newpage
\item 
  
  In order for documents to be retrieved efficiently from an
  information retrieval system, information is stored in an inverted
  index. Suppose that you are given the two documents below:

\begin{center}
\begin{tabular}{l}
\textbf{Document 1} \\ 
She sells sea shells at the shore. 
\\
\\
\textbf{Document 2} \\
Many types of sea shells sell at the fish market. \\
\end{tabular}
\end{center}

\begin{enumerate}
\item Explain what an inverted index is. Show how Documents~1 and~2
  would be stored in an inverted index using a stoplist of your choice
  and stemming. Justify the types of words chosen to include in the
  stoplist.

\hfill[25\%] 
\item Create a document-by-word matrix for Documents~1 and~2 after
  stemming and removal of stopwords. Assume the matrix cells represent
  the number of times a word appears in Documents~1 and~2.

\hfill[25\%] 
\item Calculate the similarity of Documents~1 and~2 using a measure of
  your choice and justify why you have chosen this measure. (Useful
  information: $\sqrt{4} = 2$; $\sqrt{6} \approx 2.45$). 

\hfill[25\%] 
\item Write a Perl program which takes as input Documents~1 and~2 and,
  for each word present in the two documents, prints out the name of
  the document in which the word was found and its position in the
  sentence. The output of your program should therefore be a table
  that looks like this:

%\begin{table}[h!]
\begin{center}
\begin{tabular}{l}
{\tt document1: She 1}\\
{\tt document1: sells 2}\\
{\tt document1: sea 3}\\
{\tt document1: shells 4}\\
{\tt document1: at 5}\\
{\tt document1: the 6}\\
{\tt document1: shore 7}\\
{\tt document2: Many 1}\\
{\tt document2: types 2}\\
{\tt document2: of 3}\\
{\tt document2: sea 4}\\
{\tt document2: shells 5}\\
{\tt document2: sell 6}\\
{\tt document2: at 7}\\
{\tt document2: the 8}\\
{\tt document2: fish 9}\\
{\tt document2: market 10}\\
\end{tabular}
\end{center}
%\end{table}

\textbf{QUESTION CONTINUES ON NEXT PAGE}
\newpage

The first column shows the document name, the second column the words
found in the document and the third the position of the word in the
sentence. Assume that a word is any consecutive sequence of alphabetic
characters separated by white space from any adjacent word. Assume
your documents have been pre-processed so that each sentence occupies
a new line (i.e.,~you do not have to segment your documents into
sentences, this has been already done for you). 

\hfill [25\%]

\end{enumerate}



\item
\begin{enumerate}
\item Describe the main features of the approach to statistical
  machine translation adopted by the IBM research group. Justify your  answers. 

\hfill [50\%]
\item To what
  extent is this approach purely statistical and to what extent does
  it rely on linguistic intuitions?  Justify your  answers. 

\hfill [25\%]
\item Do you believe that the approach
  is limited to closely related language pairs such as English and
  French, or will it be applicable to any language pair for which a
  large enough corpus of text is available? Justify your  answers. 

\hfill [25\%]
\end{enumerate}



\item You work for a large company where there are many meetings, both
  of internal staff and between staff and external clients. Meetings
  are recorded in the form of minutes. The company's files of minutes are
  large and the material has to be kept for many years since it may be
  necessary to check back on decisions taken early in large projects. 

You are asked to design a retrieval system so that company staff can
locate minutes on a particular topic. The company wants the system to
be reliable and effective. 

\begin{enumerate}
\item Outline the design of your system, indicating the particular features
it will have that are intended to meet the company's requirements (you
can assume that minutes are always clearly dated and have explicit
lists of participants).  

\hfill [60\%]


\item The company is willing to allow the installation of a pilot
  system so your approach can be evaluated under realistic
  conditions. Describe, in detail, your design for the evaluation:
  what data will you consider, what aspects of your system will you
  evaluate, and why?  

\hfill [25\%]
  
\item Suppose you are running a pilot evaluation using a nine-document
  collection, and that the results of two information retrieval
  systems on this collection are as shown below. The presence of the
  symbol \key{$\surd$} indicates that the document is relevant to the query.

\vspace{2ex}
\textbf{QUESTION CONTINUES ON NEXT PAGE}
\newpage

%\begin{table}[h!]
\begin{small}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}\cline{1-3}\cline{5-7}
\multicolumn{3}{|c|}{System 1 Results} & & \multicolumn{3 }{|c|}{System 2 Results}\\\cline{1-3}\cline{5-7}
Docs   & Ranking & Relevant       &\hspace{2em} & Docs      &  Ranking & Relevant \\ \cline{1-3}\cline{5-7}
d8  & 1    &  \key{$\surd$}     &\hspace{2em} &    d8 & 1        & \key{$\surd$} \\
d2  & 2    &    \key{$\surd$}   &\hspace{2em} &    d2 & 2        &           \\
d7  & 3    &   \key{$\surd$}    &\hspace{2em} &    d5 & 3        &           \\
d3  & 4    &                    &\hspace{2em} &    d1 & 4        &           \\
d5  & 5    &  \key{$\surd$}     &\hspace{2em} &    d4 & 5        & \key{$\surd$}\\
d1  & 6    &                    &\hspace{2em} &    d6 & 6        & \key{$\surd$} \\  
d4  & 7    &                    &\hspace{2em} &    d3 & 7        & \\
d6  & 8    &                    &\hspace{2em} &    d9 & 7        & \\
d9  & 9    &                    &\hspace{2em} &    d7 & 9        &  \key{$\surd$}\\ \cline{1-3}\cline{5-7}
\end{tabular} 
\end{center}
\end{small}
%\end{table}                                     


Explain how you would evaluate the two information retrieval systems on the
basis of these results. Discuss the evaluation measures you would use
and illustrate your answer using  the examples above. 

\hfill [15\%]
\end{enumerate}



% \item One of the most successful features of mobile phones to date is
%   text messaging. Text messaging, known as Short Message Service (or
%   SMS) is the ability to send and receive text messages to and from
%   mobile telephones. The text can comprise of words or numbers or an
%   alphanumeric combination. Text messaging is popular in many parts of
%   the world, in particular in Europe (where 47\% of Swedes and 39\% of
%   Italians use the service) and in Asia. In the Philippines, the use
%   of text messaging as an organisational tool by protesters is even
%   credited with helping to overthrow Joseph Estrada, the country's
%   President in January 2001. 
%   
%   Typing text messages can be tiring in particular since the mobile
%   phone has a relatively small keyboard designed for typing by thumb.
%   There are only between 12 and 15 keys for numbers and letters on a
%   mobile phone keypad. So it's not really designed for writing all the
%   different numbers and letters you need when you write e-mail or a
%   note in your diary. These 12-15 keys must cover the 26 letters of
%   the alphabet (for English for instance) and all of the punctuation
%   and numerical characters. In order to help users type faster mobile
%   phones are equipped with word completion software. The software
%   completes a word when the user has entered only a part of the word.
%   For example, if the user begins to enter HELL the software completes
%   by adding a O to form the word HELLO.
%   
%   How would you implement a system that performs word completion?
%   Assume that you have access to a large corpus of English and that
%   you can obtain counts about the frequency of English letters and
%   words.  How would you model the completion process? Here are
%   examples of completions that your system should be able to handle: 

% \begin{tabular}{lll}
%    predict      &  You     \\
%    prediction   &  Your    \\
%    predictive   &  Young   \\
%    predicting   &  Youth   \\
%    predicted    &  Yourself\\
% \end{tabular}

\rfoot{}

\end{enumerate}

\vspace{1ex}
\begin{center}
{\bfseries END OF QUESTION PAPER}
\end{center}


\newpage
\section*{Solutions for COM3110 Exam 2003-2004 \\ 
Setter: Mirella Lapata}

\rhead{}
\lfoot{}
\begin{enumerate}
\item 
\begin{enumerate}

\item Mutual information is the reduction in uncertainty of one random
variable due to knowing about another, or in other words, the amount
of information one random variable contains about another. Mutual
information can be calculated as follows. 

\[ \begin{array}{lll}
I(X;Y) & = & H(X) - H(X|Y) \\
       & = & H(X) + H(Y) - H(X,Y) \\
       & = & \sum\limits_{x} p(x) \log \frac{1}{p(x)} + \sum\limits_{y} p(y) \log
       \frac{1}{p(y)} + \sum\limits_{x,y} p(x,y) \log p(x,y) \\
       & = &  \sum\limits_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \\
\end{array}
\]

\item The cross entropy is useful when we do not know the actual
  probability distribution $p$ that generated some data. It allows us
  to use some $m$ which is a model of $p$ (i.e.,~an approximation to
  $p$). The cross entropy of $m$ on $p$ is defined by: 

 \[ H(p,m) = \lim_{n\rightarrow\infty} -\frac{1}{n} \sum\limits_{W
  \in L} p(w_1, \dots, w_n) \log m(w_1, \dots, w_n) \] 

Cross entropy is used in language modeling to evaluate how well a
given language model performs on unseen data. The lower the cross
entropy, the better the model. 
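
In practice the limit cannot be evaluated, so for a sufficiently long
test sample of $n$ words the cross entropy is approximated by the
per-word log probability the model assigns to that sample, and is
often reported as perplexity (assuming base-2 logarithms):

\[
H(p,m) \approx -\frac{1}{n} \log_2 m(w_1, \dots, w_n), \qquad
\mathit{perplexity} = 2^{H(p,m)}
\]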

\item 
Here is the exact expression for the probability of a sentence
containing $l$ words:

\[ P(w_1 \dots w_l) = P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_1,w_2)
  \dots P(w_l|w_1 \dots w_{l-1}) \]

The above can be approximated as follows: 

\[ P(w_1 \dots w_l) \approx P(w_1) \cdot P(w_2|w_1) \cdot P(w_3|w_2) \dots
\cdot P(w_l|w_{l-1}) \] 

The first expression conditions each word on all of the preceding
words in the sentence; the bigram approximation keeps only the
immediately preceding word, so longer-range contextual information is
ignored. We might choose the second expression over the first because
it has far fewer parameters and is therefore much easier to estimate
reliably from training data.

\item The task of reevaluating some of the zero-probability and
  low-probability $n$-grams, and assigning them non-zero values, is
  called smoothing. Smoothing techniques are used in a variety of
  statistical natural language processing applications as a means to
  address data sparseness, an inherent problem for statistical methods
  which rely on the relative frequencies of word combinations. The
  problem arises when the probability of word combinations that do not
  occur in the training data needs to be estimated.  Methods for
  smoothing probability distributions include add-one smoothing (or
  Laplace's law), absolute discounting, and Good-Turing smoothing.

\vspace{1ex}
Add-one smoothing: 

\[
P(w_2|w_1) = \frac{C(w_1,w_2) + 1}{C(w_1) + V}
\]
%
where $C(w_1,w_2)$ and $C(w_1)$ are the number of times the bigram
$w_1 w_2$ and the word $w_1$ appear in the corpus, and $V$ is the size
of the vocabulary. 

\vspace{1ex}
Absolute discounting: 

\[\begin{array}{ll} P_{\mathrm{abs}}(w_n|w_1 \dots w_{n-1}) = \frac{r - \delta}{N} & \mathrm{if}~r > 0\\
& \\
P_{\mathrm{abs}}(w_n|w_1 \dots w_{n-1}) = \frac{(V - N_{0})\delta}{N_{0}N} & \mathrm{otherwise}
\end{array}
\]
%
where $r$ is $C(w_1 \dots w_{n})$, $N$ is the total number of times
the context $w_1 \dots w_{n-1}$ has been seen, $V$ is the size of the
vocabulary, and $N_{0}$ is the number of word types that have not been
seen in this context. 

\vspace{1ex}
Good-Turing smoothing: 

\[
r^{*} = (r + 1) \frac{N_{r+1}}{N_r}
\]
%
where $r^{*}$ is the corrected frequency estimate for a word occurring
$r$ times and $N_r$ is the number of word types occurring exactly $r$
times. The probability mass assigned to unseen words is estimated from
the proportion of singletons ($N_1$), the mass for singletons from the
proportion of $N_2$ doubletons, and so on.
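
For illustration only (not part of the required answer), a minimal
Perl sketch of add-one smoothing in the style of the other Perl
solutions in this paper; the toy corpus, the helper {\tt p\_addone}
and the variable names are all invented for this example:

\begin{verbatim}
# Toy illustration: add-one smoothed bigram probabilities
@corpus = qw(the man saw the woman with the saw);

foreach $i (1 .. $#corpus) {
    $unigram{$corpus[$i-1]}++;                   # count of w1 as a bigram history
    $bigram{"$corpus[$i-1] $corpus[$i]"}++;      # count of the bigram w1 w2
    $vocab{$corpus[$i-1]} = $vocab{$corpus[$i]} = 1;
}
$V = keys %vocab;                                # vocabulary size (5 here)

sub p_addone {                                   # P(w2|w1) = (C(w1,w2)+1)/(C(w1)+V)
    my ($w1, $w2) = @_;
    return (($bigram{"$w1 $w2"} || 0) + 1) / (($unigram{$w1} || 0) + $V);
}

printf "P(saw|the) = %.3f\n", p_addone('the', 'saw');  # seen bigram:   2/8 = 0.250
printf "P(man|saw) = %.3f\n", p_addone('saw', 'man');  # unseen bigram: 1/6 = 0.167
\end{verbatim}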

\item The pipeline tokenises the text by turning every run of
  non-alphabetic characters into a newline, so that each word appears
  on its own line; it then sorts the words alphabetically, collapses
  and counts repetitions with {\tt uniq -c}, and finally sorts the
  resulting counts numerically in descending order, producing a word
  frequency list for the file. The definition of a word implicitly
  adopted is any maximal sequence of alphabetic characters; note that
  case is not normalised, so {\tt The} and {\tt the} are counted
  separately. 

\item 

\begin{verbatim}
 cat genesis.txt | tr -cs 'A-Za-z' '\012' | sort | uniq -c | awk '$1 == 20 {print $2}'
\end{verbatim}

\item {\tt abc3abcc}
  
  The array {\tt @y} has three elements, {\tt a}, {\tt b}, {\tt c},
  which are printed without separators ({\tt abc}). {\tt \$x} takes on
  the value of {\tt @y} in scalar context, i.e.,~its length 3, which
  is printed. Line~3 assigns the element of {\tt @y} indexed by
  {\tt @y - 1}, i.e.,~2 (the value {\tt c}), to the position indexed
  by {\tt \$x} (3), extending the array with a further copy of
  {\tt c}, so the final print gives {\tt abcc} and the array has four
  elements.


\item It matches {\tt aa}, {\tt abab}, and {\tt babbab} only (the
  pattern matches any string consisting of a non-empty sequence of
  {\tt a}s and {\tt b}s followed immediately by an exact copy of that
  sequence). 
\end{enumerate}

\item 
\begin{verbatim}
while ($line = <>) {                    # read standard input line by line
    chomp $line;
    @words = split(/\s+/,$line);
    foreach $member (@words) {
        $hash_words{$member} += 1;      # count occurrences of each word
    }
}

$counter = 0;

foreach $k (keys %hash_words) {         # find the highest frequency
    if ($hash_words{$k} > $counter) {
        $counter = $hash_words{$k};
    }
}

foreach $k (keys %hash_words) {         # print every word with that frequency
    if ($hash_words{$k} == $counter) {
        print "$k : $hash_words{$k}\n";
    }
}
\end{verbatim}
  
\fbox{\parbox{15cm}{50\% for using a hash to store the words and their frequencies, 25\%
  for looping through the hash to find the words with the highest
  frequency, and 25\% for printing these words correctly.}}

\item 
\begin{enumerate}
\item An inverted index is a data structure that lists for each word
  in a document collection all documents that contain it and the
  frequency of occurrence in each document. An inverted index makes it
  easy to search for `hits' of a query word. One just goes to the part
  of the inverted index that corresponds to the query word and
  retrieves the documents listed there. A more sophisticated version
  of an inverted index also contains position information. Instead of
  just listing the documents that a word occurs in, the positions of
  all its occurrences within each document are also listed. A position of
  occurrence can be encoded as a byte offset relative to the beginning
  of the document. An inverted index with position information lets us
  search for \word{phrases}. Here's an inverted index for Documents~1
  and~2: 

\begin{table}[h!]
\begin{center}
\begin{tabular}{|c|l|l|}\hline
Number & Words & Documents\\ \hline 
1  &  fish    &  2\\
2  &  market  &  2\\
3  &  sea     &  1, 2\\
4  &  sell    &  1, 2\\
5  &  shell   &  1, 2\\
6  &  shore   &  1\\
7  &  type    &  2\\\hline
\end{tabular}
\end{center}
\end{table}

The following words have been included in the stoplist: \{she, at,
the, many, of\}. Words which are too frequent among the documents in a
collection are not good discriminators and are usually filtered out as
potential index terms. Articles, prepositions, and conjunctions are
natural candidates for a list of stopwords. Words like \word{many} and
\word{she} can also be treated as stopwords, since they carry little
semantic content and can be expected to occur in many documents.
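
For illustration only (not part of the required answer), a minimal
Perl sketch of how such an index could be built for the two documents;
the document strings are typed in directly, and stripping a final
{\tt s} stands in here for a real stemmer (e.g.,~the Porter stemmer):

\begin{verbatim}
# Toy illustration: build an inverted index (stem -> documents)
%stop = map { $_ => 1 } qw(she at the many of);  # stoplist chosen above

%docs = (
    document1 => 'She sells sea shells at the shore.',
    document2 => 'Many types of sea shells sell at the fish market.',
);

foreach $doc (sort keys %docs) {
    foreach $w (split ' ', lc $docs{$doc}) {
        $w =~ s/[^a-z]//g;               # keep alphabetic characters only
        next if $w eq '' || $stop{$w};   # drop empty tokens and stopwords
        $w =~ s/s$//;                    # crude stand-in for real stemming
        $index{$w}{$doc} = 1;            # record that this stem occurs in $doc
    }
}

foreach $w (sort keys %index) {
    print "$w : ", join(', ', sort keys %{$index{$w}}), "\n";
}
\end{verbatim}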



\item  This is the document-by-word matrix for Documents~1 and~2. 

\begin{table}[h!]
\begin{center}
\begin{tabular}{l|ccccccc}
           &    fish & market & sea & sell & shell & shore & type \\ \hline
Document 1 &     0   & 0      & 1   & 1    & 1     & 1     &  0 \\
Document 2 &     1   &  1     & 1   & 1    & 1     & 0     & 1 \\
\end{tabular}
\end{center}
\end{table}

\item 
\begin{small}
\[
cos(\vec{Doc~1},\vec{Doc~2}) = \frac{0\cdot1 + 0\cdot1 +
  1\cdot1 + 1\cdot1 + 1\cdot1 + 1\cdot0 + 0\cdot1}{\sqrt{0^{2} + 0^{2} + 1^{2}
  + 1^{2} + 1^{2} + 1^{2} + 0^{2}}\sqrt{1^{2} +1^{2}
  +1^{2}+1^{2}+1^{2}+0^{2} +1^{2}}} = \frac{3}{\sqrt{4}{\sqrt{6}}} =
  0.61
\]
\end{small}
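
The same calculation as a small Perl sketch (illustration only), with
each document represented as a hash from index terms to counts:

\begin{verbatim}
# Toy illustration: cosine similarity between two term-count vectors
%d1 = (sea => 1, sell => 1, shell => 1, shore => 1);
%d2 = (fish => 1, market => 1, sea => 1, sell => 1, shell => 1, type => 1);

foreach $t (keys %d1) { $dot  += $d1{$t} * ($d2{$t} || 0); }   # dot product
foreach $v (values %d1) { $len1 += $v * $v; }                  # squared length of Doc 1
foreach $v (values %d2) { $len2 += $v * $v; }                  # squared length of Doc 2

printf "cosine = %.2f\n", $dot / (sqrt($len1) * sqrt($len2));  # prints cosine = 0.61
\end{verbatim}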

\item 
\begin{verbatim}
foreach $member (@ARGV) {             # each command-line argument is a document
    open(FILE,$member);

    while ($line = <FILE>) {          # one sentence per line
        chomp $line;
        $line =~ s/\.//g;             # strip full stops
        @words = split(/\s+/,$line);

        $counter = 0;                 # position within the current sentence
        foreach $w (@words) {
            ++$counter;
            print "$member: $w $counter\n";
        }
    }
    close(FILE);
}
\end{verbatim}
  
\end{enumerate}


\item This is an essay-type question. These are general points that
  the students should discuss: 

\begin{enumerate}
\item Machine translation can be cast in terms of the noisy-channel
  model. The approach was pioneered by IBM in the early 90's. If we
  want to translate French into English, we assume that the writer had
  an English sentence $E$ in mind, but by the time it was written down
  it was \word{corrupted} into the French $F$ we observe. To recover
  the original $E$ we need to reason about: (a)~the kinds of things
  people say in English and (b)~how English gets turned into French.
  Given a French sentence $F$, the approach searches for the English
  $E$ that maximises $P(E|F)$.

\[
\begin{array}{lll}
\argmax_{E}P(E|F) &  = & \argmax_{E} \frac{P(E) \cdot P(F|E)}{P(F)} \\
                   
                  & =  &  \argmax_{E} P(E) \cdot P(F|E)  \\
\end{array}
\]

We choose not to model $P(E|F)$ directly. Instead, we break the
problem apart, which lets us obtain good translations even if the
individual probability estimates are not very accurate. $P(F|E)$
ensures that a good $E$ contains words that generally translate to the
words in $F$; it only needs to say whether a bag of English words
corresponds to a bag of French words. $P(E)$ steers us away from
ill-formed English. The two models $P(E)$ and $P(F|E)$ can be trained
independently.

In order to model $P(F|E)$ we need a sentence-aligned parallel
(bilingual) corpus and a translation model for $P(F|E)$. The latter
assumes word alignments, which we do not observe directly. We can
estimate them using EM, an unsupervised learning algorithm that learns
alignment (and word translation) probabilities from the parallel
corpus.
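
As one concrete illustration (IBM Model~1, the simplest member of the
IBM family, shown here only as an example of how $P(F|E)$ can be
decomposed), every French word is assumed to be generated by exactly
one English word, or by a special empty word $e_0$:

\[
P(F|E) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j|e_i)
\]

where $m$ and $l$ are the lengths of the French and English sentences
and $t(f|e)$ are word translation probabilities. EM alternates between
collecting expected alignment counts for each sentence pair under the
current $t(f|e)$ and renormalising those counts to obtain new
estimates of $t(f|e)$.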

\fbox{\parbox{15cm}{30\% for explaining the noisy channel model, 10\% for writing down
formulas that express how French gets converted into English, 10\% for
explaining how probabilistic models of machine translation are trained
(using aligned corpora and EM).}}


\item The approach as described above is essentially purely
  statistical: it does not rely on hand-coded linguistic knowledge
  about the two languages, and all of its parameters are estimated
  from a parallel corpus.
  
\fbox{\parbox{15cm}{10\% for discussing why the statistical MT approach is purely data
driven, 15\% for explaining how parallel corpora are used to provide
information about correspondences between two languages.}}




\item The approach is limited to closely related language pairs. It
  will be difficult to translate languages that are structurally very
  different (e.g.,~languages that have different notions of what
  counts as a word or languages that have radically different word
  orders).
  
\fbox{\parbox{15cm}{15\% for explaining why the statistical approach to MT assumes that
  the two languages are structurally similar. 10\% for coming up with
  examples of language pairs for which the approach would be
  problematic.}}

\end{enumerate}

\item This is an essay-type question. These are general points that
  the students should discuss: 

\begin{enumerate}
\item The idea is to build an IR system for the company, to enable
  staff to retrieve the minutes of meetings. The user will type in a
  query and the system will retrieve a list of documents. The answer
  should describe the IR model of choice (Boolean or vector-based),
  whether stemming and a stoplist will be used, the data structure
  that will hold the documents (e.g.,~an inverted index), how the
  index will be updated as new minutes arrive, etc.
  
  \fbox{\parbox{15cm}{25\% for discussing IR models, for explaining
      why the vector-based model is more appropriate and discussing how
      it is formalised, 15\% for mentioning the inverted index data
      structure, 10\% for discussing stemming, and 10\% for mentioning
      a stoplist.}}

\item Information retrieval systems are typically evaluated using
  precision, recall and the F-measure. Since this system is built for
  a company with real users, a more detailed evaluation can be
  conducted, where user satisfaction is taken into account.
 

\fbox{\parbox{15cm}{10\% for explaining precision and recall and giving the
  appropriate formulas, 10\% for mentioning the F-measure and
  precision at a cutoff, 5\% for discussing the possibility of
  involving real users in the evaluation task.}}

\item In terms of precision, recall, and F-measure computed over the
  complete rankings, the two systems perform identically. Precision at
  a cutoff gives an impression of how well a system ranks relevant
  documents before non-relevant ones. With a cutoff of 5, System~1 has
  retrieved all of the relevant documents, whereas System~2 has
  retrieved only half of them, as the calculation below shows.
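
Using the rankings in the table above (four relevant documents in
total), precision at a cutoff of 5 is

\[
P@5(\textrm{System 1}) = \frac{4}{5} = 0.8, \qquad
P@5(\textrm{System 2}) = \frac{2}{5} = 0.4,
\]

while recall at the same cutoff is $4/4$ for System~1 against $2/4$
for System~2.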
\end{enumerate}

\fbox{\parbox{15cm}{5\% for discussing the problem with precision and
    recall in this particular example and 10\% for suggesting
    precision at a cutoff as an alternative measure and showing how it
    works in practice.}}

\end{enumerate}
\end{document}



