\documentclass[12pt]{book}
\usepackage{tuos_exam}
\usepackage{amsfonts}
\usepackage{amsmath}

%\showanswers   % *** TOGGLE IN/OUT TO SHOW/HIDE ANSWERS ***

\pressmark{COM3110/COM6150}
%\pressmark{COM3110}
%\pressmark{COM6150}
\department{{\bf DEPARTMENT OF COMPUTER SCIENCE}}
\examtitle{{\bf TEXT PROCESSING}}
\examdate{{\bf Autumn Semester 2009-2010}}
\examtime{{\bf 2 hours}}
% \dataProvided{Deduction Rules}
\setter{Mark Hepple}
\rubric{{\bf Answer THREE questions. All questions carry equal weight. Figures in square brackets indicate the percentage of available marks allocated to each part of a question.}}
%\doNotRemove  %% MODIFIES FRONT PAGE TO "DO NOT REMOVE FROM HALL" FORMAT

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcounter{myfboxdispx}
\newcounter{myfboxdispy}
\setcounter{myfboxdispx}{0}
\setcounter{myfboxdispy}{-4}
\newcommand{\myfboxdisplace}[2]{%
  \addtocounter{myfboxdispx}{#1}%
  \addtocounter{myfboxdispy}{#2}}
\newenvironment{myfbox}[2]{%
  \setlength{\unitlength}{1mm}%
  \begin{picture}(0,0)(\value{myfboxdispx},\value{myfboxdispy})
    \put(0,-#2){\framebox(#1,#2){}}
  \end{picture}~~\begin{minipage}[t]{#1mm}}%
  {\end{minipage}%
  \setcounter{myfboxdispx}{0}%
  \setcounter{myfboxdispy}{-4}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\stretchit}[1]{\addtolength{\itemsep}{#1mm}}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\newcommand{\myargmax}[1]{\begin{array}[t]{c}\argmax\\{^#1}\end{array}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}
\begin{exam}

\begin{question}

\begin{qupart}
Explain the difference between {\it stemming\/} and {\it morphological
analysis}. Suggest a text processing context where stemming might be used.
Suggest a text processing context where morphological analysis might be
useful, but where simple stemming would {\it not\/} be useful.
\mypercent{20}
\begin{answer}
{\it Stemming\/} refers to the process of reducing words that are
morphological variants to their common root or stem, e.g. so that the
variants {\it computer, computes, computed, computing}, etc., are all
reduced to the stem {\it compute\/} (or some surrogate for the real root,
such as {\it comput\/}). This is a form of {\it term conflation}.
{\it Morphological analysis\/} refers to the process of analysing the
morphological structure of words, i.e. decomposing them into morphemes,
including affixes and root.

One text processing context where stemming may be used is IR, where it
allows a query containing one morphological variant of a word to retrieve
documents that do not contain that term, but rather contain other variants
of the same stem (e.g. a query with {\it computing\/} retrieving a
document containing {\it computer\/}). The usefulness of this move,
however, is debated. Stemming will also produce some reduction in the size
of document indexes.

An example of a text processing context where morphological analysis might
be useful, but where stemming would not be, is part-of-speech (POS)
tagging, and in particular the tagging of {\it unknown words}. Here,
identifying the affixes of an unknown word might provide information
valuable towards guessing the word's POS. Since simple stemming throws
affix information away, it would precisely {\it not\/} be helpful in this
context.
\end{answer}
\end{qupart}

\answersonly{\newpage}

\begin{qupart}
Define the precision and recall measures in IR. Is Graph A a possible
precision/recall graph? Is Graph B a possible precision/recall graph?
Explain your answers.
\mypercent{20}
\bigskip
\centerline{~~~~~~\includegraphics[width=10cm]{figures/prec_recall1.eps}}
\begin{answer}
Assume that: $RET$ is the set of all documents the system has retrieved
for a specific query; $REL$ is the set of relevant documents for a
specific query; $RETREL$ is the set of the retrieved relevant documents,
i.e., $RETREL = RET \cap REL$. Precision is defined as $|RETREL| / |RET|$
and recall is defined as $|RETREL| / |REL|$.

It is not possible for a graph to look like Graph A. The fact that the
curve touches the point (1,1) indicates both precision and recall at one,
meaning that all documents retrieved are relevant. However, this
contradicts the fact that precision is less than one when recall is low,
which means that some irrelevant documents have already been retrieved.

It is possible for a graph to look like Graph B. The curve means that
there are relatively few relevant documents at the beginning and end of
the ranked set of retrieved documents, but some relevant documents are
concentrated in the middle of the ranking.
\end{answer}
\end{qupart}

\answersonly{\newpage}

\begin{qupart}
A common approach to classification in text processing is to assign a
category label $v\in V$ to an instance based on a number of feature values
$f_1\ldots f_n$ that serve to describe the instance or its context, where
the label $v$ that is chosen is the most probable or MAP ({\it maximum a
posteriori\/}) hypothesis, as follows:
$v_{MAP} ~= \myargmax{{v\in V}}P(v|f_1\ldots f_n)$
Show how this approach can be reformulated using Bayes Theorem, and an
assumption of {\it conditional independence}, to give a {\it Naive Bayes
classification\/} method. Explain the benefits of this reformulation in
relation to the problem of {\it data sparseness}.
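The Naive Bayes rule this question asks for, $v_{NB} = \argmax_{v\in V} P(v)\prod_i P(f_i|v)$, can be illustrated with a minimal sketch (Python 3; the training data, labels and the use of add-one smoothing here are invented for illustration and are not part of the question):

```python
from collections import Counter, defaultdict

# Toy training data: (feature list, label) pairs -- purely illustrative.
train = [
    (["win", "cash", "now"], "spam"),
    (["win", "prize"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["agenda", "now"], "ham"),
]

label_counts = Counter(label for _, label in train)
feat_counts = defaultdict(Counter)
for feats, label in train:
    feat_counts[label].update(feats)

vocab = {f for feats, _ in train for f in feats}

def v_nb(features):
    """Return argmax_v P(v) * prod_i P(f_i|v), estimated from counts."""
    best, best_score = None, -1.0
    for v, n_v in label_counts.items():
        score = n_v / len(train)          # prior P(v)
        total = sum(feat_counts[v].values())
        for f in features:
            # add-one (Laplace) smoothing so an unseen feature
            # does not zero out the whole product
            score *= (feat_counts[v][f] + 1) / (total + len(vocab))
        if score > best_score:
            best, best_score = v, score
    return best

print(v_nb(["win", "now"]))   # spam
```

Only per-feature counts are needed to estimate each $P(f_i|v)$, which is exactly why the factored form suffers less from data sparseness than estimating $P(f_1\ldots f_n|v)$ jointly.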
\mypercent{20}
\begin{answer}
We start with:
$v_{MAP} ~= \myargmax{{v\in V}}P(v|f_1\ldots f_n)$
Using Bayes theorem, we restate this as:
$v_{MAP} ~= \myargmax{{v\in V}}\frac{P(f_1\ldots f_n|v)P(v)}{P(f_1\ldots f_n)}$
Here, the divisor $P(f_1\ldots f_n)$ does not affect the result of the
maximisation, so we can simplify to:
$v_{MAP} ~= \myargmax{{v\in V}}P(f_1\ldots f_n|v)P(v)$
The conditional independence assumption used is that the values of the
features $f_1\ldots f_n$ are conditionally independent given the target
value $v$, which means that:
$P(f_1\ldots f_n|v) = \prod_{i=1}^n P(f_i|v)$
Using this to modify our $v_{MAP}$ equation gives a Naive Bayes
classifier:
$v_{NB} ~= \myargmax{{v\in V}}P(v)\cdot\prod_{i=1}^n P(f_i|v)$

The problem of {\it data sparseness\/} is that it may be difficult to get
good estimates of the probabilities $P(f_1\ldots f_n|v)$ without having an
infeasibly large amount of data, i.e. there may be so many different
feature combinations $f_1\ldots f_n$ that each is seen only rarely or not
at all in the available data. The probabilities $P(f_i|v)$ of the Naive
Bayes approach can more reasonably be estimated from a limited amount of
data.
\end{answer}
\end{qupart}

\begin{qupart}
Indicate what will be printed by each of the following pieces of Python
code, explaining your answer:
\medskip
\begin{exlist}

\exitem
\begin{small}
\myfboxdisplace{1}{-1} ~
\begin{myfbox}{100}{22}
\begin{verbatim}
import re
s = 'abracadabra'
for x in re.findall('[ar].',s):
    print x
\end{verbatim}
\end{myfbox}
\end{small}
\mypercent{10}
\answersonly{\bigskip}
\begin{answer}
Code prints the following:
\begin{verbatim}
ab
ra
ad
ab
ra
\end{verbatim}
The regex {\tt '[ar].'} matches two adjacent characters, of which the
first is either {\tt a} or {\tt r} and the second is any character.
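The behaviour can be checked directly; a sketch in Python 3 (the exam code above is Python 2), showing that matching is left to right and non-overlapping because each match consumes its characters:

```python
import re

s = 'abracadabra'
# '[ar].' matches a two-character window: 'a' or 'r', then any character.
# The scan restarts after each match, so matches cannot overlap; the
# final 'a' of the string is left unmatched (no following character).
matches = re.findall('[ar].', s)
print(matches)   # ['ab', 'ra', 'ad', 'ab', 'ra']
```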
The {\tt findall} returns a list of the matches for the regex in the
string, which are the successive underlined fragments shown in:
{\tt \underline{ab}\,\underline{ra}\,c\,\underline{ad}\,\underline{ab}\,\underline{ra}}
\end{answer}
\bigskip
\medskip

\exitem
\begin{small}
\myfboxdisplace{1}{-1} ~
\begin{myfbox}{100}{32}
\begin{verbatim}
import re
pattern = re.compile('^[a-z]+$')
s = "My baby don't love nobody but me."
for x in s.split():
    if pattern.search(x):
        print x
\end{verbatim}
\end{myfbox}
\end{small}
\mypercent{10}
\answersonly{\bigskip}
\begin{answer}
Code prints the following:
\begin{verbatim}
baby
love
nobody
but
\end{verbatim}
The call to string method {\tt .split()} splits by default on whitespace,
returning the list:
{\tt ['My', 'baby', "don't", 'love', 'nobody', 'but', 'me.']}
The for loop only prints the items from this list that match the regex
{\tt pattern}, which only accepts strings consisting of just lowercase
letters (since it is anchored at both ends). Thus, we lose three items:
{\tt 'My'} (contains an uppercase letter), and {\tt "don't"} and
{\tt 'me.'} (both containing punctuation characters).
\end{answer}
\end{exlist}
\end{qupart}

\smallskip

\begin{qupart}
What is {\it part-of-speech tagging\/}? Explain why this task is
difficult, i.e. why it requires something more than just simple dictionary
look-up.
\mypercent{20}
\begin{answer}
This is the task of assigning to each word (or more generally, token) in a
text a tag or label for the word's part of speech class, such as noun,
verb or adjective. These classes group together words that exhibit similar
distributional behaviour, i.e. which play similar roles in the syntax of
the language. The task is non-trivial due to (i)~ambiguity, i.e. many
words have more than one POS, and the correct one must be selected (on the
basis of local context), and because (ii) it is common to encounter
unknown words (i.e.
ones not present in the dictionary, even if this has been compiled from a
large amount of text), for which a POS tag must be ``guessed'' (using
context and morphological information).
\end{answer}
\end{qupart}
\end{question}

\newpage

\begin{question}
\begin{qupart}
Text compression techniques are important because growth in the volume of
text continually threatens to outstrip increases in storage, bandwidth and
processing capacity. Briefly explain the differences between:
\begin{exlist}

\exitem {\bf symbolwise} (or statistical) and {\bf dictionary} text
compression methods;
\mypercent{10}
\begin{answer}
\vspace*{-4mm}
\begin{itemize}
\item {\bf Symbolwise methods} work by estimating the probabilities of
  symbols (characters or words/non-words) and coding one symbol at a time,
  using shorter codewords for the more likely symbols.
\item {\bf Dictionary methods} work by replacing words/text fragments with
  an index to an entry in a dictionary.
\end{itemize}
\vspace*{-4mm}
%{\bf 5\% for explanation of symbolwise methods, 5\% for dictionary}
\end{answer}

\exitem {\bf modelling} versus {\bf coding} steps;
\mypercent{10}
\begin{answer}
Symbolwise methods rely on a modelling step and a coding step.
\begin{itemize}
\item {\bf Modelling} is the estimation of probabilities for the symbols
  in the text -- the better the probability estimates, the higher the
  compression that can be achieved.
\medskip
\item {\bf Coding} is the conversion of the probabilities obtained from a
  model into a bitstream.
\end{itemize}
\vspace*{-4mm}
%{\bf 5\% for explanation of modelling step, 5\% for coding step}
\end{answer}

\exitem {\bf static}, {\bf semi-static} and {\bf adaptive} techniques for
text compression.
\mypercent{10}
\begin{answer}
Compression techniques can also be distinguished by whether they are:
\begin{itemize}
\item {\bf Static} -- use a fixed model or fixed dictionary, derived in
  advance of any text to be compressed.
\medskip
\item {\bf Semi-static} -- use the current text to build a model or
  dictionary during one pass, then apply it in a second pass.
\medskip
\item {\bf Adaptive} -- build the model or dictionary adaptively during a
  single pass.
\end{itemize}
\vspace*{-4mm}
%{\bf 3\% each, 1\% discretionary}
\end{answer}
\end{exlist}
\end{qupart}

\answersonly{\newpage}

\begin{qupart}
The script for the fictitious language Gavagese contains only the 7
characters {\it a}, {\it e}, {\it u}, {\it k}, {\it r}, {\it f}, {\it d}.
You assemble a large electronic corpus of Gavagese and now want to
compress it. You analyse the frequency of occurrence of each of these
characters in the corpus and, using these frequencies as estimates of the
probability of occurrence of the characters in the language as a whole,
produce the following table:
\bigskip
\begin{center}
\begin{tabular}{cc}
Symbol & Probability \\
\hline
a & 0.25 \\
e & 0.20 \\
u & 0.30 \\
k & 0.05 \\
r & 0.07 \\
f & 0.08 \\
d & 0.05 \\
\end{tabular}
\end{center}
\begin{exlist}

\exitem Show how to construct a Huffman code tree for Gavagese, given the
above probabilities.
\mypercent{30}
\begin{answer}
Start off by creating a leaf node for each character, with associated
probability (a). Then repeatedly join the two nodes with the smallest
probabilities under a single parent node, whose probability is their sum,
until only one node is left. Finally, 0's and 1's are assigned to the two
branches of each binary split.
\medskip
\centerline{\includegraphics[width=90mm]{figures/huffman_eg1.eps}}
\vspace*{-4mm}
%{\bf 10\% for the explanation of the method; 20\% for a correct tree}
\end{answer}

\answersonly{\newpage}

\exitem Use your code tree to encode the string {\it dukerafua} and show
the resulting binary encoding.
For this string, how much length does your codetree encoding save over a
minimal fixed-length binary character encoding for a 7 character alphabet?
\mypercent{10}
\begin{answer}
Encoding for {\it dukerafua} will be:
\begin{verbatim}
   d  u    k  e    r  a    f  u  a
1101 10 1100 01 1111 00 1110 10 00
\end{verbatim}
For a seven letter alphabet, a minimal fixed-length binary character
encoding requires 3 bits per character. There are 9 characters in the
string, so a fixed-length encoding would require 27 bits. The codetree
encoding uses 26 bits, so only one bit is saved (the advantages would
become apparent over larger, more statistically representative strings).
\vspace*{-4mm}
%{\bf 5\% for the encoding; 5\% for getting the amount saved correct}
\end{answer}
\end{exlist}
\end{qupart}

\answersonly{\newpage}

\begin{qupart}
One popular compression technique is the LZ77 method, used in common
compression utilities such as {\it gzip}.
\begin{exlist}

\exitem Explain how LZ77 works.
\mypercent{20}
\begin{answer}
The {\bf key idea} underlying the LZ77 adaptive dictionary compression
method is to replace substrings with a pointer to a previous occurrence of
the same substring in the same text.
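This replacement scheme is easiest to see from the decoding side. A minimal decoder sketch (Python 3), assuming the $\langle$back, length, char$\rangle$ triple representation described in this answer, run here on the triples from the decoding question later in this part:

```python
def lz77_decode(triples):
    """Decode LZ77 <back, length, char> triples into a string.

    back:   how far back in the decoded output to start copying;
    length: how many characters to copy (the copy may overlap its source);
    char:   the literal character appended after the copy.
    """
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])  # char-by-char, so overlaps work
        out.append(char)
    return ''.join(out)

triples = [(0, 0, 'b'), (0, 0, 'a'), (0, 0, 'd'), (3, 3, 'b'),
           (1, 3, 'a'), (1, 3, 'd'), (1, 3, 'a'), (11, 2, 'a')]
print(lz77_decode(triples))   # badbadbbbbaaaaddddabba
```

Note that a triple like $\langle 1, 3, a \rangle$ copies a region that overlaps the current end of the output; copying one character at a time makes this well defined (it repeats the last character three times).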
The encoder output is a series of triples, where:
\begin{itemize}
\stretchit{-2}
\item the first component indicates how far back in the decoded output to
  look for the next phrase;
\item the second indicates the length of that phrase;
\item the third is the next character from the input (only necessary when
  not found in the previous text, but included for simplicity).
\end{itemize}
{\bf Issues} to be addressed in implementing an adaptive dictionary method
such as LZ77 include:
\begin{itemize}
\stretchit{-2}
\item how far back in the text to allow pointers to refer
\begin{itemize}
\stretchit{-2}
\item references further back increase the chance of longer matching
  strings, but also increase the bits required to store a pointer
\item typical value is a few thousand characters
\end{itemize}
\item how large the strings referred to can be
\begin{itemize}
\stretchit{-2}
\item the larger the string, the larger the width parameter specifying it
\item typical value $\sim$ 16 characters
\end{itemize}
\item during encoding, how to search the window of prior text for the
  longest match with the upcoming phrase
\begin{itemize}
\stretchit{-2}
\item linear search is very inefficient
\item best to index the prior text with a suitable data structure, such as
  a trie, hash table, or binary search tree
\end{itemize}
\end{itemize}
A popular high performance implementation of LZ77 is {\bf gzip}, which:
\begin{itemize}
\stretchit{-2}
\item uses a hash table to locate previous occurrences of strings
\begin{itemize}
\stretchit{-2}
\item hash is accessed by the next 3 characters
\item holds pointers to prior locations of those 3 characters
\end{itemize}
\item stores pointers and phrase lengths using variable length Huffman
  codes, computed semi-statically by processing 64K blocks of data at a
  time
%(can be held in memory, so appears as if single-pass)
\item reduces pointer triples to pairs, by eliminating the 3rd element
\begin{itemize}
\item first transmit the phrase length -- if 1, treat the pointer as a raw
  character; else treat the pointer as a genuine pointer
\end{itemize}
\end{itemize}
\vspace*{-4mm}
%{\bf 10\% for the key idea; 5\% for issues and 5\% for gzip}
\end{answer}

\exitem How would the following LZ77 encoder output
% a b ad badb bbba
%badbadbbbbaaaaddddabba
$\langle 0, 0, b \rangle \langle 0, 0, a \rangle \langle 0, 0, d \rangle
\langle 3, 3, b \rangle \langle 1, 3, a\rangle \langle 1, 3, d \rangle
\langle 1, 3, a \rangle \langle 11, 2, a \rangle$
be decoded, assuming the encoding representation presented in the
lectures? Show how your answer is derived.
\mypercent{10}
\begin{answer}
\begin{enumerate}
\item $\langle 0, 0, b \rangle$ Go back 0, copy for length 0 and end with
  $b$: $b$
\item $\langle 0, 0, a \rangle$ Go back 0, copy for length 0 and end with
  $a$: $ba$
\item $\langle 0, 0, d \rangle$ Go back 0, copy for length 0 and end with
  $d$: $bad$
\item $\langle 3, 3, b \rangle$ Go back 3, copy for length 3 and end with