\documentclass[12pt]{article}
\usepackage{pslatex}
\usepackage{amsfonts}
\usepackage{newexam+shield}
\usepackage{epsf}
\usepackage{myfancyverb}
\usepackage{examAnswers}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% TOGGLES
%\showAnswers
\hideAnswers
%\mscPaper
\ugPaper
%\input{toggle}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pressmark{COM3110}
\mscOnly{\pressmark{COM6150}}
\pressmark{COM3110/COM6150}
\department{\bf {DEPARTMENT OF COMPUTER SCIENCE}}
\examtitle{{\bf TEXT PROCESSING}}
\examdate{{\bf Autumn Semester 2005-06}}
\examtime{{\bf 2 hours}}
\setter{Mark Hepple \\ Mark Stevenson}
\rubric{{\bf Answer THREE questions.
All questions carry equal weight. Figures in square brackets indicate the
percentage of available marks allocated to each part of a question.}}
\mscOnly{\rubric{{\bf Answer the Question in Section A and TWO further questions
from Section B.
All questions carry equal weight. Figures in square brackets indicate the
percentage of available marks allocated to each part of a question.}}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\begin{exam}
\turnover
\mscOnly{\begin{center}\bf
SECTION A
\end{center}}
\begin{question}
\begin{qupart}
Explain what is meant by the ``bag of words'' model which is used for
various text processing tasks.
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The bag of words model assumes that text can be represented as a
simple list of the words it contains and, possibly, their frequencies
in the text. Information about the relations between these words and
their position in the text is ignored.
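For example, under this model the sentence ``the cat sat on the mat''
reduces to \{the: 2, cat: 1, sat: 1, on: 1, mat: 1\}, with no record of
the order in which the words occurred.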
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Explain how the bag of words model can be used in Information
Retrieval and word sense disambiguation.
\mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The standard approach to IR models each document as a bag of words,
and identifies relevant documents by comparing queries to documents
using similarity measures that work over such bag-of-words
representations.
The bag of words model is also used in word sense disambiguation: the naive
Bayesian classifier treats the context of an ambiguous word as a bag
of words, and the Lesk algorithm, which disambiguates using dictionary
definition overlap, treats each dictionary definition as a bag of words.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Write down Bayes Rule and explain how it is used in the naive
Bayesian classifier for word sense disambiguation.
\mypercent{20}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Bayes rule is:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
The naive Bayesian classifier aims to estimate the probability of a
sense given the context, i.e.\ $P(s|c)$, where $s$ is a particular sense
and $c$ is the context used by the algorithm. This cannot be
estimated directly, so Bayes rule is applied and the formula rewritten
as $P(s|c) = \frac{P(c|s)P(s)}{P(c)}$. $P(c)$ is constant for all senses and
can be ignored. We approximate $P(c|s)$ as
$P(a_1|s)\times\ldots\times{}P(a_n|s)$, where $a_1\ldots a_n$ are the
multiple features that make up
`context', i.e.\ making the naive Bayesian assumption that these
features are conditionally independent. The probabilities $P(s)$ and
$P(a_i|s)$ can be estimated from training text.
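The classifier then chooses the sense $\hat{s}$ that maximises this quantity:
\[ \hat{s} = \arg\max_{s} P(s)\prod_{i=1}^{n}P(a_i|s) \]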
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\answersonly{\newpage}
\begin{qupart}
What are the similarities between part of speech tagging and word
sense disambiguation? Why are techniques for part of speech tagging
not suitable for word sense disambiguation? \mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Both part of speech tagging and WSD aim to add extra information to
each token in a text (or possibly just each content word). Part of
speech tagging adds syntactic information while WSD selects amongst
word meanings.
Part of speech tagging approaches generally examine a narrow context
around the word being annotated, for example the previous one or two
words. WSD requires a wider context.
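(For example, a part of speech tagger need only label {\it bank\/} as a
noun, whereas a WSD system must choose between, say, its `financial
institution' and `edge of a river' senses.)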
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Write a short Perl program that will read from a file on {\tt STDIN}
and that does nothing with the lines that are read {\it until\/} it
encounters a line containing the string {\tt<BEGIN>}. Thereafter, each
subsequent line will be printed (to {\tt STDOUT}) {\it until\/} it
encounters a line containing the string {\tt <END>}. This line and
any that follow are not printed. You may assume that the {\tt<BEGIN>}
and {\tt<END>} strings will occur only once in a file. \mypercent{20}
\end{qupart}
\begin{SaveVerbatim}{vrb1}
$print = 0;
while (<>) {
  if (/<BEGIN>/) {
    $print = 1;
  } elsif (/<END>/) {
    $print = 0;
  } elsif ($print) {
    print;
  }
}
\end{SaveVerbatim}
\begin{SaveVerbatim}{vrb2}
while (<>) {
  last if /<BEGIN>/;
}
while (<>) {
  last if /<END>/;
  print;
}
\end{SaveVerbatim}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Could be done in more than one way:
e.g. as:~~
\fbox{~\useVerbStretchFNS{.9}{vrb1}~}
~~ or as:~~
\fbox{~\useVerbStretchFNS{.9}{vrb2}~}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Assume that, in reading text from an HTML file, we can identify a URL
(`web address') expression {\it within\/} a line of text provided that the string
meets the following requirements:
\begin{itemize}
\item[i.] the string should fall between double quotes ({\tt "})
\item[ii.] it should start with {\tt http://}
\item[iii.] it should end with {\tt.htm} or {\tt.html}
\item[iv.] between this start and end, there may be any sequence of
characters except that {\tt "}, {\tt <} and {\tt >} may not appear.
\end{itemize}
Write a Perl regular expression that will match against strings that
contain a URL expression, with the URL string found being assigned to the
variable \verb+$1+.
\mypercent{10}
\end{qupart}
\SaveVerb{vrb1}+/\"(http:[^\"<>]*\.htm(l)?)\"/+
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
\hspace*{40mm}
\UseVerb{vrb1}
\medskip
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{SaveVerbatim}{vrb1}
$_ = 'abcde';
s/(\w)(\w)/$2$1/g;
print "$_\n";
\end{SaveVerbatim}
\begin{SaveVerbatim}{vrb2}
$_ = 'abcde';
while (s/(a)(\w)/$2$1/) { }
print "$_\n";
\end{SaveVerbatim}
\begin{qupart}
Specify what will be printed by each of the following pieces of Perl
code: \useVerbLineCentering{t}
\begin{itemize}
\item[i.] \fbox{~\useVerbStretchFNS{.9}{vrb1}~}
\mypercent{5}
\medskip
\item[ii.] \fbox{~\useVerbStretchFNS{.9}{vrb2}~}
\mypercent{5}
\bigskip
\end{itemize}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
\begin{itemize}
\item[i.] gives: {\tt 'badce'} -- swaps each successive pair of
characters; the final {\tt e} has no partner and is left unchanged.
\medskip
\item[ii.] gives: {\tt 'bcdea'} -- iteratively transposes the {\tt a}
down to the end of the string.
\end{itemize}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\end{question}
%\ugOnly{\medskip}%
\mscOnly{\continued
\begin{center}\bf%
SECTION B%
\end{center}}%
\begin{question}
\begin{qupart}
Describe Lesk's algorithm for word sense disambiguation using
dictionary definition overlap and explain its
limitations. \mypercent{25}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Lesk's algorithm relies on the textual definitions of words in a
dictionary to disambiguate words. The pairwise combinations of senses
are considered and the number of words their definitions share in common
is counted. The pair of senses with the largest number of words in common
is chosen as the senses.
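In symbols, writing $D(s)$ for the bag of words in the dictionary
definition of sense $s$, the pair selected is the one maximising the overlap
\[ \mbox{score}(s_1,s_2) = |D(s_1) \cap D(s_2)|. \]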
To work well the algorithm requires the definitions to be pre-processed
in a number of ways. Empty heads and stop words are removed. (Empty
heads are short expressions generally found at the start of dictionary
definitions which indicate the hypernym of the definition, such as ``a
type of''.) The remaining words are stemmed.
The limitations are:
\begin{enumerate}
\item The approach depends on the correct senses sharing words in their
definitions. This means that the approach depends on the particular
dictionary used and will prefer senses with longer definitions over
shorter ones.
\item The approach may not be tractable for long sentences; the number
of sense combinations which have to be checked could be prohibitive.
\end{enumerate}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Describe two techniques which can be used to annotate text with word
sense information automatically without the need for manual
annotation. These techniques are used to provide data which can be
used to train and/or test a word sense disambiguation system. Describe
the disadvantages of each.\mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Two techniques are:
\begin{enumerate}
\item Make use of parallel corpora where the sense is defined as the
translation of a word into another language.
\item Use pseudowords to automatically introduce ambiguity into
text. (Pseudowords are created by choosing two or more words,
e.g. car and bottle, and replacing each occurrence of each with
their concatenation, car-bottle. The task of the disambiguation
algorithm is to identify the original word.)
\end{enumerate}
Their disadvantages are:
\begin{enumerate}
\item Parallel text may be hard to obtain.
\item Sense distinctions in pseudowords may not be appropriate for actual
applications (e.g. translation). The sense distinctions are
artificial and so an algorithm may learn to disambiguate between
contexts in which each of the words is likely to occur rather than
meanings of the words.
\end{enumerate}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
What assumption is made by a naive Bayes classifier when it is used for word
sense disambiguation?
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The naive Bayes classifier assumes that the probabilities of the words
in the context of the ambiguous word are conditionally
independent. That is, the presence, or otherwise, of a word in the
context has no influence on other words.
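In symbols, if $a_1,\ldots,a_n$ are the words in the context of a word
with sense $s$:
\[ P(a_1,\ldots,a_n|s) \approx \prod_{i=1}^{n}P(a_i|s) \]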
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\answersonly{\newpage}
\begin{qupart}
Explain the differences between direct, transfer and interlingua
approaches to Machine Translation. \mypercent{30}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The key difference between the three approaches is the level of
analysis which is applied to the source text.
Direct approaches apply very little analysis to the source text
and rely on simple translation of each word in the source text. Statistical
MT could be considered to be a direct approach.
Transfer approaches attempt to analyse the structure of the source
text to produce an intermediate representation. The intermediate
representation of a sentence from the input text is then used in
generating the translated sentence in the target language. Transfer
approaches can employ syntactic or semantic representations.
Interlingua approaches rely on a representation of the meaning which
is independent of both the source and target language. The source
sentence goes through syntactic and semantic analysis to be translated
into the interlingua. This representation is then used to generate a
sentence in the target language. The difference between transfer
approaches which use semantic representations and interlingua
approaches rests on the independence of the system used to represent
meaning; interlinguas are completely independent of the source and
target languages while the representation used in semantic transfer simply aims
to capture enough information to allow translation.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Describe the approach used by the BLEU system for evaluation of
Machine Translation systems.\mypercent{20}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The BLEU system relies on multiple reference translations, each of
which represents a possible way in which a text could be translated
into a target language. The translation being evaluated (the
candidate) is compared against the reference translations by counting
how many of its $n$-grams (sequences of $n$ words) also occur in
them. BLEU assumes that there are several possible ways to translate a
text, each of which is equally valid, and uses multiple reference
translations to provide these alternatives.
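In the usual formulation, modified $n$-gram precisions $p_{n}$ (for $n$ up
to some maximum $N$, typically 4) are combined as a weighted geometric mean
and multiplied by a brevity penalty $BP$, which penalises candidates that
are shorter than the reference length $r$:
\[ BLEU = BP \times \exp\left(\sum_{n=1}^{N} w_{n}\log p_{n}\right),
\qquad BP = \min\left(1,\ e^{1-r/c}\right) \]
where $c$ is the candidate length and the weights $w_{n}$ are normally uniform.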
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\end{question}
\ugOnly{\continued}
\mscOnly{\turnover}
\begin{question}
\begin{qupart}
What are stop words, in the context of an Information Retrieval
system? Why are they generally not included as index terms?
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Stop words are generally words belonging to closed grammatical classes
such as determiners, conjunctions and prepositions. They are
not included as index terms as they occur so frequently in documents
that they are not good discriminators between relevant and irrelevant
documents.
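(Typical examples include {\it the}, {\it of}, {\it and}, {\it a\/} and
{\it to}.)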
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\answersonly{\newpage}
\begin{qupart}
Consider the following collection of three short documents:\\
\\
\hspace{10ex}Document 1: Tyger, tyger, burning bright\\
\hspace{10ex}Document 2: Tyger sat on the mat\\
\hspace{10ex}Document 3: The bright mat\\
\\
Show how the documents would be represented in a vector
space model for Information Retrieval, as vectors in which term weights
correspond to term frequencies. Do not remove stop words,
and do not use stemming in creating these representations.
\mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
The documents would be represented as follows:
\begin{center}
\begin{tabular}{|cccccccc|}
\hline
& bright & burning & mat & on & sat & the & tyger\\
Document 1 & 1 & 1 & 0 & 0 & 0 & 0 & 2\\
Document 2 & 0 & 0 & 1 & 1 & 1 & 1 & 1\\
Document 3 & 1 & 0 & 1 & 0 & 0 & 1 & 0\\
\hline
\end{tabular}
\end{center}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
The cosine coefficient can be used by Information retrieval systems to
rank the relevance of documents in relation to a query. Compute the
similarity that would be produced between Document 1 and the query
``burning tyger'' using this measure. \mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
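Using the cosine formula
\[ \cos(d,q) = \frac{\sum_{i} d_{i}q_{i}}
{\sqrt{\sum_{i} d_{i}^{2}}\,\sqrt{\sum_{i} q_{i}^{2}}} \]
with the Document 1 vector from above and the query vector
(burning $=1$, tyger $=1$, all other terms $0$):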
\begin{math}
\cos(\textrm{Document 1}, \textrm{``burning tyger''}) = \frac{1\cdot 0 + 1\cdot 1 + 0\cdot 0 + 0\cdot 0 + 0\cdot 0 +
0\cdot 0 + 2\cdot 1}{\sqrt{1^{2} + 1^{2} + 0^{2} + 0^{2} + 0^{2} + 0^{2} + 2^{2}}\sqrt{1^{2} + 1^{2}}} =
\frac{3}{\sqrt{6}\sqrt{2}} = 0.866
\end{math}
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Explain what is meant by term weighting in Information Retrieval
systems and why it is used. \mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Term weighting is the process of deciding on the importance of each
term and is normally carried out by assigning a
numerical score to each term.
Term weighting is used in IR systems because not all terms are equally
useful for retrieval; some occur in many of the documents in the
collection (and are therefore bad discriminators) while others occur
infrequently (and will return few documents). Term weighting aims to assign high scores
to terms which
are likely to discriminate between relevant and irrelevant documents. The term
weights are taken into account by the ranking function with the aim of improving
retrieval performance.
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Explain how tf.idf term weighting is used to assign weights to terms in
Information Retrieval. Include the formula for computing tf.idf. \mypercent{15}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
tf.idf assigns weights to terms by taking into account two elements:
the frequency of that term in a particular document and the proportion
of documents in the corpus in which it occurs. The tf part prefers
terms which occur frequently in a document and the idf part gives
extra weight to terms which do not occur in many documents in the
collection. The tf.idf weight is produced by computing the product of
these two terms.
The formula for computing tf.idf is:
\[ tf.idf = tf_{ik} \times \log_{10}\left(\frac{N}{n_{k}}\right) \]
where $tf_{ik}$ is the frequency of term $k$ in document $i$, $N$ is the
total number of documents in the collection and $n_{k}$ is the number of
documents which contain term $k$.
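For example, {\it burning\/} occurs once in Document 1 of the collection
above and in only one of the three documents, so its tf.idf weight there is
$1 \times \log_{10}(3/1) = 0.477$.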
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\answersonly{\newpage}
\begin{qupart}
Show the weights which would be generated if tf.idf weighting was applied to Document 1
in the document collection of three documents shown above. \mypercent{10}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
Using tf.idf the weights in document 1 would become:
\begin{center}
\begin{tabular}{|cccccccc|}
\hline
& bright & burning & mat & on & sat & the & tyger\\
Document 1 & 0.176 & 0.477 & 0 & 0 & 0 & 0 & 0.352\\
\hline
\end{tabular}
\end{center}
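(For example, {\it tyger\/} occurs twice in Document 1 and in two of the
three documents, giving $2 \times \log_{10}(3/2) = 0.352$; terms which do
not occur in Document 1 receive weight 0.)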
} %%%%%%%%%%%%% **** END ANSWER **** %%%%%%
\begin{qupart}
Consider the following ranking of ten documents produced by an
Information Retrieval
system. The symbol $\checkmark$ indicates that a retrieved document is
relevant and $\times$ that it is not.
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
Document & Ranking & Relevant\\
\hline
d6 & 1 & $\checkmark$ \\
d1 & 2 & $\times$ \\
d2 & 3 & $\checkmark$ \\
d10 & 4 & $\times$ \\
d9 & 5 & $\checkmark$ \\
d3 & 6 & $\checkmark$ \\
d5 & 7 & $\times$ \\
d4 & 8 & $\times$ \\
d7 & 9 & $\checkmark$ \\
d8 & 10 & $\times$ \\
\hline
\end{tabular}
\end{center}
\vspace{4eX} Compute the (uninterpolated) average precision and
interpolated average precision for this ranked set of
documents. Explain your working. \mypercent{25}
\end{qupart}
\answer{%%%%%% **** BEGIN ANSWER **** %%%%%%
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
Document & Ranking & Relevant & Recall & Precision & Interpolated Precision\\
\hline
d6 & 1& $\checkmark$& 0.2& 1 & 1\\
d1 & 2& $\times$& 0.2& 0.5 & 0.67\\
d2 & 3& $\checkmark$& 0.4& 0.67 & 0.67\\