Commit adcd87f9 authored by Loïc Barrault

textprocessing updates

parent 025ecbab
import re

s = 'abracadabra'
# Print each non-overlapping match: 'a' or 'r' followed by any character.
for x in re.findall('[ar].', s):
    print x
import re

# Match tokens made up only of lowercase letters (anchored at both ends).
pattern = re.compile('^[a-z]+$')
s = "My baby don't love nobody but me."
for x in s.split():
    if pattern.search(x):
        print x

a = s.split()
print a
\documentclass[12pt]{book}
\usepackage{tuos_exam}
\usepackage{amsfonts}
\usepackage{amsmath}
%\showanswers % *** TOGGLE IN/OUT TO SHOW/HIDE ANSWERS ***
\pressmark{COM3110/COM6150}
%\pressmark{COM3110}
%\pressmark{COM6150}
\department{{\bf DEPARTMENT OF COMPUTER SCIENCE}}
\examtitle{{\bf TEXT PROCESSING}}
\examdate{{\bf Autumn Semester 2009-2010}}
\examtime{{\bf 2 hours}}
% \dataProvided{Deduction Rules}
\setter{Mark Hepple}
\rubric{{\bf Answer THREE questions.
All questions carry equal weight. Figures in square brackets indicate the
percentage of available marks allocated to each part of a question.}}
%\doNotRemove %% MODIFIES FRONT PAGE TO "DO NOT REMOVE FROM HALL" FORMAT
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcounter{myfboxdispx}
\newcounter{myfboxdispy}
\setcounter{myfboxdispx}{0}
\setcounter{myfboxdispy}{-4}
\newcommand{\myfboxdisplace}[2]{%
\addtocounter{myfboxdispx}{#1}%
\addtocounter{myfboxdispy}{#2}}
\newenvironment{myfbox}[2]{%
\setlength{\unitlength}{1mm}%
\begin{picture}(0,0)(\value{myfboxdispx},\value{myfboxdispy})
\put(0,-#2){\framebox(#1,#2){}}
\end{picture}~~\begin{minipage}[t]{#1mm}}%
{\end{minipage}%
\setcounter{myfboxdispx}{0}%
\setcounter{myfboxdispy}{-4}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\stretchit}[1]{\addtolength{\itemsep}{#1mm}}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\newcommand{\myargmax}[1]{\begin{array}[t]{c}\argmax\\{^#1}\end{array}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\begin{exam}
\begin{question}
\begin{qupart}
Explain the difference between {\it stemming\/} and {\it morphological
analysis}. Suggest a text processing context where stemming might be
used. Suggest a text processing context where morphological analysis
might be useful, but where simple stemming would {\it not\/} be
useful. \mypercent{20}
\begin{answer}
{\it Stemming\/} refers to the process of reducing words that are
morphological variants to their common root or stem, e.g. so that
variants {\it computer, computes, computed, computing}, etc., are
reduced to the stem {\it compute\/} (or some surrogate for the real
root, such as {\it comput\/}). This is a form of {\it term
conflation}.
{\it Morphological analysis\/} refers to the process of analysing the
morphological structure of words, i.e. decomposing them into
morphemes, including affixes and root.
One text processing context where stemming may be used is IR, where
its use allows a query containing one morphological variant of a word
to retrieve documents that do not contain that exact term but do
contain other variants of the same stem (e.g. a query with {\it
computing\/} retrieving a document containing {\it computer\/}). The usefulness of
this move, however, is debated. Stemming will also produce some
reduction in the size of document indexes.
An example of a text processing context where morphological analysis
might be useful, but where stemming would not be, is part-of-speech
(POS) tagging, and in particular the tagging of {\it unknown
words}. Here, identifying the affixes of an unknown word might
provide information valuable towards guessing the word's POS. Since
simple stemming throws affix information away, it would precisely {\it
not\/} be helpful in this context.
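\medskip
For illustration, the contrast can be sketched in a few lines of Python
(the suffix list and the segmentation below are simplified assumptions,
not a real stemmer or morphological analyser):
\begin{verbatim}
# Crude illustrative suffix list -- an assumption, not a real stemmer.
SUFFIXES = ['ing', 'ed', 'es', 'er', 's']

def stem(word):
    # Stemming: strip the affix and throw it away.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def analyse(word):
    # Morphological analysis: keep the decomposition into root + affix.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return (word[:-len(suffix)], suffix)
    return (word, '')

print stem('computing')      # comput
print analyse('computing')   # ('comput', 'ing')
\end{verbatim}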
\end{answer}
\end{qupart}
\answersonly{\newpage}
\begin{qupart}
Define the precision and recall measures in IR. Is Graph A a possible
precision/recall graph? Is Graph B a possible precision/recall graph?
Explain your answers. \mypercent{20}
\bigskip
\centerline{~~~~~~\includegraphics[width=10cm]{figures/prec_recall1.eps}}
\begin{answer}
Assuming that: $RET$ is the set of all documents the system has
retrieved for a specific query; $REL$ is the set of relevant documents
for a specific query; $RETREL$ is the set of the retrieved relevant
documents, i.e., $RETREL = RET \cap REL$. Precision is defined as
$|RETREL| / |RET|$ and recall is defined as $|RETREL| / |REL|$.
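\medskip
As an illustration, the definitions can be checked with a small Python
sketch (the document identifiers below are invented):
\begin{verbatim}
# Hypothetical document ids, purely for illustration.
RET = set([1, 2, 3, 4, 5])       # documents retrieved for the query
REL = set([2, 4, 6, 8])          # documents relevant to the query
RETREL = RET & REL               # retrieved relevant documents = {2, 4}

precision = float(len(RETREL)) / len(RET)   # 2/5 = 0.4
recall    = float(len(RETREL)) / len(REL)   # 2/4 = 0.5
print precision, recall
\end{verbatim}
\medskip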
It is not possible for a graph to look like Graph A. The fact that the
curve touches the point (1,1) indicates that both precision and recall
equal one, i.e. that every document retrieved is relevant. However,
this contradicts the fact that precision is less than one when recall
is low, which means that some irrelevant documents have already been
retrieved early in the ranking.
It is possible for a graph to look like Graph B. The curve means that
there are relatively few relevant documents at the beginning and end
of the ranked set of retrieved documents but some relevant documents
are concentrated in the middle of the ranking.
\end{answer}
\end{qupart}
\answersonly{\newpage}
\begin{qupart}
A common approach to classification in text processing is to assign a
category label $v\in V$ to an instance based on a number of feature
values $f_1\ldots f_n$ that serve to describe the instance or its
context, where the label $v$ that is chosen is the most probable or
MAP ({\it maximum a posteriori\/}) hypothesis, as follows:
\[ v_{MAP} ~= \myargmax{{v\in V}}P(v|f_1\ldots f_n) \]
Show how this approach can be reformulated using Bayes Theorem,
and an assumption of {\it conditional independence}, to give a {\it
Naive Bayes classification\/} method. Explain the benefits of this
reformulation in relation to the problem of {\it data sparseness}.
\mypercent{20}
\begin{answer}
We start with:
\[ v_{MAP} ~= \myargmax{{v\in V}}P(v|f_1\ldots f_n) \]
Using Bayes theorem, we restate this as:
\[ v_{MAP} ~= \myargmax{{v\in
V}}\frac{P(f_1\ldots f_n|v)P(v)}{P(f_1\ldots f_n)} \]
Here, the divisor $P(f_1\ldots f_n)$ does not affect the result of the
maximisation, so we can simplify to:
\[ v_{MAP} ~= \myargmax{{v\in V}}P(f_1\ldots f_n|v)P(v) \]
The conditional independence assumption used is that the values of the
features $f_1\ldots f_n$ are conditionally independent given the
target value $v$, which means that:
\[ P(f_1\ldots f_n|v) = \prod_{i=1}^n P(f_i|v) \]
Using this to modify our $v_{MAP}$ equation gives a Naive Bayes
classifier:
\[ v_{NB} ~= \myargmax{{v\in V}}P(v)\cdot\prod_{i=1}^n P(f_i|v) \]
The problem of {\it data sparseness\/} is that it may be difficult to
get good estimates of probabilities $P(f_1\ldots f_n|v)$ without
having an infeasibly large amount of data, i.e. there may be so many
different feature combinations $f_1\ldots f_n$ that each is seen only
rarely or not at all in the available data. The probabilities
$P(f_i|v)$ of the Naive Bayes approach can more reasonably be
estimated from a more limited amount of data.
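\medskip
A minimal Python sketch of the resulting classifier, assuming the prior
and conditional probabilities have already been estimated from training
data (the classes, features and numbers below are invented purely for
illustration):
\begin{verbatim}
# Invented estimates of P(v) and P(f_i|v), for illustration only.
prior = {'spam': 0.4, 'ham': 0.6}
cond  = {'spam': {'money': 0.20, 'meeting': 0.02},
         'ham':  {'money': 0.03, 'meeting': 0.15}}

def classify(features):
    best_v, best_score = None, 0.0
    for v in prior:
        score = prior[v]
        for f in features:
            score *= cond[v][f]   # conditional independence assumption
        if score > best_score:
            best_v, best_score = v, score
    return best_v

print classify(['money'])             # spam
print classify(['meeting', 'money'])  # ham
\end{verbatim}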
\end{answer}
\end{qupart}
\begin{qupart}
Indicate what will be printed by each of the following pieces of
Python code, explaining your answer:
\medskip
\begin{exlist}
\exitem
\begin{small}
\myfboxdisplace{1}{-1}
~ \begin{myfbox}{100}{22}
\begin{verbatim}
import re
s = 'abracadabra'
for x in re.findall('[ar].',s):
print x
\end{verbatim}
\end{myfbox}
\end{small}
\mypercent{10}
\answersonly{\bigskip}
\begin{answer}
Code prints the following:
\begin{verbatim}
ab
ra
ad
ab
ra
\end{verbatim}
The regex {\tt '[ar].'} matches two adjacent characters, of which the
first is either {\tt a} or {\tt r} and the second is any
character. The {\tt findall} returns a list of the matches for the
regex in the string, which are the successive underlined fragments
shown in: {\tt
\underline{ab}\,\underline{ra}\,c\,\underline{ad}\,\underline{ab}\,\underline{ra}}
\end{answer}
\bigskip
\medskip
\exitem
\begin{small}
\myfboxdisplace{1}{-1}
~ \begin{myfbox}{100}{32}
\begin{verbatim}
import re
pattern = re.compile('^[a-z]+$')
s = "My baby don't love nobody but me."
for x in s.split():
if pattern.search(x):
print x
\end{verbatim}
\end{myfbox}
\end{small}
\mypercent{10}
\answersonly{\bigskip}
\begin{answer}
Code prints the following:
\begin{verbatim}
baby
love
nobody
but
\end{verbatim}
The call to string method {\tt .split()} splits by default on whitespace,
returning the list: {\tt ['My', 'baby', "don't", 'love', 'nobody', 'but', 'me.']}
The for loop only prints the items from this list that match the regex
{\tt pattern}, which only accepts strings consisting of just lowercase
letters (since it is anchored at both ends). Thus, we lose three
items: {\tt 'My'} (contains an uppercase letter), and {\tt "don't"} and
{\tt 'me.'} (both of which contain punctuation characters).
\end{answer}
\end{exlist}
\end{qupart}
\smallskip
\begin{qupart}
What is {\it part-of-speech tagging\/}? Explain why this task is difficult,
i.e. why it requires something more than just simple dictionary
look-up.
\mypercent{20}
\begin{answer}
This is the task of assigning to each word (or, more generally, each
token) in a text a tag or label for the word's part-of-speech class,
such as noun, verb or adjective. These classes group together words that
exhibit similar distributional behaviour, i.e. which play similar
roles in the syntax of the language. The task is non-trivial due to
(i)~ambiguity, i.e. many words have more than one POS, and the correct
one must be selected (on the basis of local context), and because (ii)
it is common to encounter unknown words (i.e. ones not present in the
dictionary, even if this has been compiled from a large amount of
text), for which a POS tag must be `guessed' (using context and
morphological information).
\end{answer}
\end{qupart}
\end{question}
\newpage
\begin{question}
\begin{qupart}
Text compression techniques are important because growth in volume of
text continually threatens to outstrip increases in storage, bandwidth
and processing capacity. Briefly explain the differences between:
\begin{exlist}
\exitem {\bf symbolwise} (or statistical) and {\bf dictionary} text
compression methods;
\mypercent{10}
\begin{answer}
\vspace*{-4mm}
\begin{itemize}
\item {\bf Symbolwise methods} work by estimating the probabilities of
symbols (characters or words/non-words) and coding one symbol at a
time using shorter codewords for the more likely symbols
\item {\bf Dictionary methods} work by replacing word/text fragments
with an index to an entry in a dictionary
\end{itemize}
\vspace*{-4mm}
%{\bf 5\% for explanation of symbolwise methods, 5 \% for dictionary}
\end{answer}
\exitem {\bf modelling} versus {\bf coding} steps;
\mypercent{10}
\begin{answer}
Symbolwise methods rely on a modelling step and a coding step:
\begin{itemize}
\item {\bf Modelling} is the estimation of probabilities for the
symbols in the text -- the better the probability estimates, the
higher the compression that can be achieved
\medskip
\item {\bf Coding} is the conversion of the probabilities obtained
from a model into a bitstream
\end{itemize}
\vspace*{-4mm}
% {\bf 5\% for explanation of modelling step, 5 \% for coding step}
\end{answer}
\exitem {\bf static}, {\bf semi-static} and {\bf adaptive} techniques
for text compression.
\mypercent{10}
\begin{answer}
Compression techniques can also be distinguished by
whether they are
\begin{itemize}
\item {\bf Static} -- use a fixed model or fixed dictionary derived
in advance of any text to be compressed
\medskip
\item {\bf Semi-static} -- use the current text to build a
model or dictionary during a first pass, then apply it in a second pass
\medskip
\item {\bf Adaptive} -- build model or dictionary adaptively during
one pass
\end{itemize}
\vspace*{-4mm}
%{\bf 3\% each, 1 \% discretionary}
\end{answer}
\end{exlist}
\end{qupart}
\answersonly{\newpage}
\begin{qupart}
The script for the fictitious language Gavagese contains only the
7 characters {\it a}, {\it e}, {\it u}, {\it k}, {\it r}, {\it f}, {\it
d}. You assemble a large electronic corpus of Gavagese and now
want to compress it. You analyse the frequency of occurrence of each
of these characters in the corpus and, using these frequencies as
estimates of the probability of occurrence of the characters in the
language as a whole, produce the following table:
\bigskip
\begin{center}
\begin{tabular}{cc}
Symbol & Probability \\ \hline
a & 0.25\\
e & 0.20 \\
u & 0.30 \\
k & 0.05 \\
r & 0.07\\
f & 0.08\\
d & 0.05 \\
\end{tabular}
\end{center}
\begin{exlist}
\exitem Show how to construct a Huffman code tree for Gavagese,
given the above probabilities.
\mypercent{30}
\begin{answer}
Start by creating a leaf node for each character, with its
associated probability (a). Then repeatedly join the two nodes with the
smallest probabilities under a single parent node, whose probability is
their sum, until only one node is left. Finally, 0's and 1's are
assigned to the two branches of each binary split.
\medskip
\centerline{\includegraphics[width=90mm]{figures/huffman_eg1.eps}}
\vspace*{-4mm}
%{\bf 10 \% for the explanation of the method; 20 \% for a correct
%tree}
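\medskip
The construction can also be expressed as a short Python sketch using a
priority queue (ties may be broken differently from the tree in the
figure, giving a different but equally valid assignment of codewords;
the codeword lengths agree):
\begin{verbatim}
import heapq

probs = {'a': 0.25, 'e': 0.20, 'u': 0.30, 'k': 0.05,
         'r': 0.07, 'f': 0.08, 'd': 0.05}

# Each heap entry: (probability, tie-breaker, {symbol: code-so-far}).
heap = [(p, i, {sym: ''}) for i, (sym, p) in enumerate(probs.items())]
heapq.heapify(heap)
count = len(heap)
while len(heap) > 1:
    p1, _, codes1 = heapq.heappop(heap)   # two smallest probabilities
    p2, _, codes2 = heapq.heappop(heap)
    merged = dict((s, '0' + c) for s, c in codes1.items())
    merged.update((s, '1' + c) for s, c in codes2.items())
    heapq.heappush(heap, (p1 + p2, count, merged))
    count += 1

codes = heap[0][2]
for sym in sorted(codes):
    print sym, codes[sym]
\end{verbatim}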
\end{answer}
\answersonly{\newpage}
\exitem Use your code tree to encode the string {\it dukerafua} and
show the resulting binary encoding. For this string, how many bits
does your code tree encoding save over a minimal fixed-length binary
character encoding for a 7-character alphabet? \mypercent{10}
\begin{answer}
Encoding for {\it dukerafua} will be
\begin{verbatim}
d u k e r a f u a
1101 10 1100 01 1111 00 1110 10 00
\end{verbatim}
For a seven-letter alphabet, a minimal fixed-length binary character
encoding requires 3 bits per character. There are 9 characters in the
string, so a fixed-length encoding would require 27 bits. The code
tree encoding uses 26 bits, so only one bit is saved (the advantage
becomes apparent over longer, more statistically representative
strings).
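\medskip
The saving can be checked directly in Python (the codewords below are
those read off the tree above; a differently tie-broken tree would give
different codewords of the same lengths):
\begin{verbatim}
code = {'a': '00', 'e': '01', 'u': '10', 'k': '1100',
        'd': '1101', 'f': '1110', 'r': '1111'}

encoded = ''.join(code[c] for c in 'dukerafua')
print encoded                 # the 26-bit encoding shown above
print len(encoded), 9 * 3     # 26 bits versus 27 for fixed-length coding
\end{verbatim}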
\vspace*{-4mm}
%{\bf 5 \% for the encoding; 5 \% for getting the amount saved correct}
\end{answer}
\end{exlist}
\end{qupart}
\answersonly{\newpage}
\begin{qupart}
One popular compression technique is the LZ77 method, used in common
compression utilities such as {\it gzip}.
\begin{exlist}
\exitem Explain how LZ77 works.
\mypercent{20}
\begin{answer}
The {\bf key idea} underlying the LZ77 adaptive dictionary compression
method is to replace substrings with pointers to previous occurrences
of the same substrings in the same text. The encoder output is a
series of triples, where
\begin{itemize}
\stretchit{-2}
\item the first component indicates how far back in the decoded output to
look for the next phrase
\item the second indicates the length of that phrase
\item the third is the next character from the input (only necessary when
it is not found in the previous text, but included for simplicity)
\end{itemize}
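A minimal Python sketch of how a decoder interprets such triples
(simplified: no window limit and no bit-level coding, purely for
illustration):
\begin{verbatim}
def lz77_decode(triples):
    # Decode a list of (offset, length, next_char) triples.
    out = []
    for offset, length, char in triples:
        start = len(out) - offset
        for i in range(length):       # copying may overlap its own output
            out.append(out[start + i])
        out.append(char)
    return ''.join(out)

print lz77_decode([(0, 0, 'b'), (0, 0, 'a'), (0, 0, 'd'),
                   (3, 3, 'b')])      # prints 'badbadb'
\end{verbatim}
\medskip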
{\bf Issues} to be addressed in implementing an adaptive dictionary method
such as LZ77 include
\begin{itemize}
\stretchit{-2}
\item how far back in the text to allow pointers to refer
\begin{itemize}
\stretchit{-2}
\item references further back increase chance of longer matching
strings, but also increase bits required to store
pointer
\item typical value is a few thousand characters
\end{itemize}
\item how large the strings referred to can be
\begin{itemize}
\stretchit{-2}
\item the larger the string, the larger the width parameter
specifying it
\item typical value $\sim$ 16 characters
\end{itemize}
\item during encoding, how to search window of prior text for
longest match with the upcoming phrase
\begin{itemize}
\stretchit{-2}
\item linear search very inefficient
\item best to index prior text with a suitable data structure,
such as a trie, hash, or binary search tree
\end{itemize}
\end{itemize}
A popular high-performance implementation of LZ77 is
{\bf gzip}, which
\begin{itemize}
\stretchit{-2}
\item uses a hash table to locate previous occurrences of strings
\begin{itemize}
\stretchit{-2}
\item hash accessed by next 3 characters
\item holds pointers to prior locations
of the 3 characters
\end{itemize}
\item pointers and phrase lengths are stored using variable length
Huffman codes, computed semi-statically by processing 64K blocks
of data at a time %(can be held in memory, so appears as if
%single-pass)
\item pointer triples are reduced to pairs, by eliminating
3rd element
\begin{itemize}
\item first transmit phrase length --
if 1 treat pointer as raw character;
else treat pointer as genuine pointer
\end{itemize}
\end{itemize}
\vspace*{-4mm}
%{\bf 10 \% for the key idea; 5 \% for issues and 5 \% for gzip}
\end{answer}
\exitem How would the following LZ77 encoder output
% a b ad badb bbba
%badbadbbbbaaaaddddabba
\[
\langle 0, 0, b \rangle \langle 0, 0, a \rangle
\langle 0, 0, d \rangle
\langle 3, 3, b \rangle \langle 1, 3, a\rangle
\langle 1, 3, d \rangle \langle 1, 3, a \rangle
\langle 11, 2, a \rangle
\]
be decoded, assuming the encoding representation presented in the
lectures? Show how your answer is derived.
\mypercent{10}
\begin{answer}
\begin{enumerate}
\item $\langle 0, 0, b \rangle$ Go back 0, copy for length 0, and end with
$b$: $b$
\item $\langle 0, 0, a \rangle$ Go back 0, copy for length 0, and end with
$a$: $ba$
\item $\langle 0, 0, d \rangle$ Go back 0, copy for length 0, and end with
$d$: $bad$