Commit adcd87f9 authored by Loïc Barrault

textprocessing updates

parent 025ecbab
%%%% local macros %%%%
\newcommand{\seq}[2]{\mbox{$#1_{1},\- \ldots ,\-#1_{#2}$}}
\newenvironment{proof}{\begin{trivlist}\item[\hskip \labelsep {\bf Proof}]}{\nopagebreak
%\newcommand{\psrule}[2]{\mbox{{\rm #1}\ \ $\rightarrow$\ \ {\rm #2}}}
\mbox{{\rm #1}\ \ $\rightarrow$\ \ {\rm #2}\ :\ ${\it #3}$}}
\def\argmax{\mathop{\rm argmax}}
\def\argmin{\mathop{\rm argmin}}
\newcommand{\pgm}[1]{{\sc #1}}
{\bf Solutions for COM3110 (Text Processing) Exam 2002-03\\
Setters: R. Gaizauskas (1,2)/L. Guthrie (3,4)}
{\bf Note: These solutions are more than model solutions; they are
{\it ideal} solutions in the sense that they contain more, in total,
than any student could reasonably be expected to produce. Full
marks for most questions could be obtained by producing less than is
included here. Their utility is as an aid in marking, as they
supply the union of all material the students might be expected to
present as an answer.}
Answer {\bf each} of the following short answer questions.
% Character encoding, compression and text markup
% Character Encoding + Unicode
Explain each of the following terms and make clear
the relations between them: {\bf language}, {\bf script},
{\bf character}, {\bf glyph}, {\bf font}. Give examples of each.
%\exitem Explain the overall goals and high level design principles
%underlying Unicode.
%What distinguishes Unicode from earlier character coding schemes?
\item[language] the fundamental means of communication between
humans; may be spoken or written -- spoken language both
historically and cognitively more fundamental than written
language (e.g. English, Chinese, Hindi)
\item[script] a system for writing down a language -- ``the system
of graphical marks employed to record expressions of a
language in visible form''; some languages have more than one
script (e.g. Japanese); some scripts serve for more than one
language (e.g. Roman)
\item[character] the smallest unit of written language that has
a semantic value (e.g. ``a'' in English; a Chinese
``logogram''; a consonant + vowel in an alphasyllabic script
like Devanagari);
\item[glyph] a representation of a character or characters, as
it/they is/are rendered or displayed.
A given character can be
rendered in many ways (italic, sanserif), i.e. can have many
corresponding glyphs; and multiple characters can be
represented as a single glyph (e.g. ligatures; characters +
diacritics in Devanagari)
\item[font] a repertoire of glyphs (e.g. 10 pt times roman)
{\bf 20 \%: 4 \% per term}
Briefly describe the function of, and give an example of, each of the
\verb+<!ELEMENT>+, \verb+<!ATTLIST>+ and \verb+<!ENTITY>+
declarations as found in SGML document type definitions.
For a document with DTD $D$
\item \verb+<!ELEMENT>+ declarations specify
the type and structure of the textual units or elements that will be
found in an SGML document of type $D$. They specify
\item the name of the element
%\item whether starting and ending tags are compulsory (-) or not
% (0)
\item the content model of the element -- the type and order of
sub-elements that may or must occur within this element.
Child elements are either:
other named SGML elements,
{\tt \#PCDATA} (parsed character data),
{\tt \#NDATA} (binary data),
{\tt EMPTY}.
%\item the sequence of elements is specified by writing their
% type specifiers in a comma-separated sequence; alternatives
% may be specified using {\tt |}
%\item regular expression quantifiers are used to specify number
% of occurrences -- no quantifier means exactly one, {\tt
% ?} means 0 or 1, {\tt \*} means 0 or more, and {\tt +}
% one or more.
E.g. An email element might be defined as:\\
\verb#<!ELEMENT email - - (header, contents) >#\\
which asserts that an email element is called {\tt email}, has
compulsory opening and closing tags, and consists of a header
element followed by a contents element (each of which may in turn
be composed of sub-elements).
%sender element, one or more
%address elements, 0 or 1 subject elements and 0 or more Cc elements.
\item \verb+<!ATTLIST>+ declarations specify
\item the attributes associated with each element
\item for each attribute:
the type of the attribute
a default declaration for the attribute -- indicates whether
a value is optional ({\tt \#IMPLIED}), required ({\tt
\#REQUIRED}), or if there is an explicit default
For example, the email element might have as associated attributes
a required {\tt date\_sent} attribute and a {\tt status} attribute
with default value {\tt public}:
<!ATTLIST email
date_sent DATE #REQUIRED
status (secret | public) public >
\item \verb+<!ENTITY>+ declarations provide a general ``macro''
definition facility which may be used in both DTD and document
\item SGML has a number of built-in entities. For example:
\verb+&lt;+ escapes \verb+<+ when \verb+<+
is to occur in character data (and is not signalling
the beginning of an SGML tag); \verb+&amp;+ escapes \verb+&+.
\item Entities may also be used to define substitute strings
as abbreviations. E.g:
<!ENTITY fname SYSTEM "/home/robertg/docs/foo.dtd">
allows \verb+&fname;+ to be used in place of the path name
in the document instance
{\bf 8 \% for Elements, and 7 \% each for Attributes and Entities. For
each 5\% for the description and 2/3\% for the example.}
Briefly describe the Unicode coding model, making clear the levels in the
model, explaining the differences between, and the motivations for,
UTF-8 and UTF-16.
%, and describing the purpose and implementation of
%surrogate pairs.
The Unicode model may be first approximated by
a three level model consisting of:
\item an abstract character repertoire %(ACR)
\item their mapping to a set of integers %(coded character set -- CCS)
\item their encoding forms + byte serialization
\item Abstract character repertoires are being established for all the
world's languages -- involves specifying what the set of characters
in each language are.
\item Each of these character repertoires is mapped onto integers in the
range 0 to $2^{16}-1$. Of these 65,536 values
\item 63,486 are available to represent characters with single
16-bit code values
\item 2048 code values are available to represent an additional
1,048,544 characters through paired 16-bit code values
(called {\bf surrogate pairs})
\item Once a mapping from an abstract character set to a set of
integers has been defined, two further mappings need to be specified:
\item {\bf Character Encoding Form} -- a mapping from a set of
integers to a set of sequences of code units of specified width
(e.g. 8-bit bytes).
\item {\bf Character Encoding Scheme} -- a mapping from a set of
sequences of code units to a {\bf serialized} sequence of bytes
(addresses issues of byte order, i.e. endianness, on different machines)
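The split between encoding form and encoding scheme can be sketched in code. The following is a minimal illustration in Python (not part of the exam answer); the function names are invented for this example, and it assumes the input is a valid Unicode scalar value.

```python
def utf16_code_units(code_point):
    """Encoding form: map one code point to one or two 16-bit code units."""
    if code_point < 0x10000:
        return [code_point]                # fits in a single 16-bit unit
    offset = code_point - 0x10000          # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)         # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # low surrogate carries the bottom 10 bits
    return [high, low]


def serialize(code_units, byteorder):
    """Encoding scheme: write 16-bit code units as bytes ('big' or 'little')."""
    return b"".join(unit.to_bytes(2, byteorder) for unit in code_units)


units = utf16_code_units(0x1D11E)          # a character outside the 16-bit range
print([hex(u) for u in units])
print(serialize(units, "big").hex(), serialize(units, "little").hex())
```

The same code units give different byte sequences under the two byte orders, which is exactly the issue the encoding scheme level addresses.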
The most common character encoding forms are:
\item UTF-16 (Unicode (or UCS) Transformation Format-16)
-- default Unicode encoding form -- characters are assigned to two
eight bit bytes, except for surrogate pairs which consist of two
16-bit values (4 bytes)
\item the simplest and most straightforward way to map Unicode
code points into code units (here 16-bit units, i.e. pairs of bytes)
\item files containing only Latin texts are twice as large as
they are in single byte encodings, such as ASCII or ISO
\item not backwards/forwards compatible with ASCII -- so
programs that expect a single-byte character set won't
work on a UTF-16 file {\it even if it only contains Latin text}.
\item UTF-8 (Unicode (or UCS) Transformation Format-8) --
maps a Unicode scalar value onto 1 to 4 bytes
\item 1st byte indicates number of bytes to follow
\item one byte sufficient for ASCII code values (0..127)
\item two bytes sufficient for most non-ideographic scripts
\item four bytes needed only for surrogate pairs
\item existing ASCII files are already UTF-8
\item most broadly supported encoding form today
\item ideographic (mostly Asian) languages require 3
bytes/character -- so UTF-8 encodings are larger for
Asian languages than UTF-16 and most existing encodings
% \noindent
% The most common character encoding schemes are:
% \begin{itemize}
% \item UTF-16BE -- UTF-16 with big endian byte sequencing
% \item UTF-16LE -- UTF-16 with little endian byte sequencing
% \end{itemize}
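The size trade-off between UTF-8 and UTF-16 described above can be checked directly. This is a small sketch in Python using its built-in codecs; the sample strings are chosen for illustration only and are written as escapes so that the exact code points are unambiguous.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string.

    The 'utf-16-be' codec is used so that no byte order mark is added.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


samples = {
    "Latin": "text",                                          # ASCII only
    "Greek": "\u03ba\u03b5\u03af\u03bc\u03b5\u03bd\u03bf",    # 7 Greek letters
    "Chinese": "\u6587\u672c",                                # 2 ideographs
    "Supplementary": "\U0001D11E",                            # needs a surrogate pair
}

for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

ASCII text doubles in size under UTF-16, ideographic text is half as large again under UTF-8, and a supplementary character takes four bytes in both.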
{\bf 20\%: 10\% for a clear picture of the levels of the model;
10 \% for UTF-8/16 distinction}
Define each of the following terms as they are used in discussing
Perl regular expressions and give an example of each:
{\bf metacharacters}, {\bf metasymbols}, {\bf anchors}, {\bf
quantifiers}, {\bf back references}.
{\bf Metacharacters} are single characters which do not match themselves
in regular expressions, but instead have special meanings. There
are 12 of them:
\ | ( ) [ { ^ $ * + ? .
{\bf Metasymbols} are sequences of two or more characters with special
meaning in regexs, the first character of which is always a
backslash. Most metasymbols fall into two categories:
\item {\it special characters}: certain characters in a regex that
cannot easily be typed at the keyboard -- e.g. \verb+\n+ (newline)
\item {\it character class shortcuts}: certain character classes are
used so frequently abbreviations have been created for them --
e.g. \verb+\d+ (digit character) abbreviates \verb+[0-9]+
{\bf Anchors}
The default behaviour of the Perl RE matcher is to attempt to
match the pattern against the string, starting at the beginning
of the string, then ``floating'' down it to find a match.
To force the pattern to match at particular points in the
string {\it anchors} can be included in the regex.
\item The beginning of a string is matched using the \verb+^+
character. E.g.
\verb+/bert/+ matches {\tt bert} and {\tt robert};
\verb+/^bert/+ matches {\tt bert} but not {\tt robert}.
\item The end of a string is matched using the \verb+$+ character.
\item \verb+\b+ (word-boundary anchor) anchors matching at the
beginning or end of ``words'' (strings matched by \verb?\w+?).
\verb+\B+ matches non-word boundaries -- everywhere \verb+\b+
does not.
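The anchors above can be tried out as follows. The sketch uses Python's re module rather than Perl, on the assumption that these four anchors behave the same way in both for simple single-line strings; the helper name is invented for this example.

```python
import re


def strings_matching(pattern, strings):
    """Return the strings in which the pattern matches somewhere."""
    return [s for s in strings if re.search(pattern, s)]


names = ["bert", "robert", "bertha"]
print(strings_matching(r"bert", names))     # unanchored: floats along the string
print(strings_matching(r"^bert", names))    # must match at the start
print(strings_matching(r"bert$", names))    # must match at the end

text = "cat concat cats"
print(re.findall(r"\bcat\b", text))         # 'cat' as a whole word only
print(re.findall(r"\Bcat", text))           # 'cat' not at the start of a word
```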
{\bf Quantifiers}
Quantifiers specify how many times a specific character pattern in
an RE is to match the input. There are three basic
quantifiers in Perl regexs:
\item \verb+*+ : match the preceding item 0 or more times
\item \verb?+? : match the preceding item 1 or more times
\item \verb+?+ : match the preceding item 0 or 1 times
Precise numbers of matches can be specified using numeric
quantifiers in curly braces.
Items to which quantifiers attach include characters, character
classes and groupings of characters in parentheses (``()'')
\item \verb?/:\w+:/? matches one or more ``word'' characters between
{\tt :}'s
% \item \verb?/foo+/? matches {\tt foo}, {\tt fooo}, {\tt foooo}, etc.
\item \verb?/(foo)+/? matches {\tt foo}, {\tt foofoo}, {\tt foofoofoo}, etc.
\item \verb?/\d{5,10}/? matches from 5 to 10 digits
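The quantifier examples can be exercised in the same way. Again this is a Python sketch, assuming the quantifier syntax shown is shared with Perl; fullmatch is used so that the whole string must be consumed, which makes the counting visible.

```python
import re


def fully_matches(pattern, string):
    """True if the pattern matches the entire string."""
    return re.fullmatch(pattern, string) is not None


print(fully_matches(r"(foo)+", "foofoo"))      # the group repeats as a unit
print(fully_matches(r"(foo)+", "fooo"))        # not a whole number of 'foo's
print(fully_matches(r"\d{5,10}", "1234"))      # too few digits
print(fully_matches(r"\d{5,10}", "12345"))     # within the 5 to 10 range
print(re.findall(r":\w+:", "a :tag: b :: c"))  # '+' needs at least one character
```

Without anchoring, a pattern such as the 5 to 10 digit one would still match inside a longer run of digits, since the matcher is free to match a substring.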
{\bf Back References}
{\it Capturing} provides the capability to ``remember'' a substring
matched by part of a pattern, and use that substring later on in
the pattern itself via a back reference. Capturing is done using
parentheses (``()'') -- any substring of the target matched by a
pattern segment in parentheses is remembered by the matcher.
Back referencing is done using a backslash followed by an
integer identifying which captured string is referred to --
the first captured substring is back referenced as \verb+\1+,
the second by \verb+\2+, etc.
For example \verb+/\B([a-z])\1(ing|ed)\b/+ would find all words ending
in a double-letter followed by {\tt ing} or {\tt ed}.
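The back reference example can be tested on a few words. The sketch below uses Python's re module, which accepts the same pattern; the helper name and the word list are invented for illustration.

```python
import re

# Same pattern as in the text: a letter, the same letter again, then ing or ed.
DOUBLED = re.compile(r"\B([a-z])\1(ing|ed)\b")


def words_with_doubled_ending(text):
    """Return the whitespace-separated words in which the pattern matches."""
    return [word for word in text.split() if DOUBLED.search(word)]


print(words_with_doubled_ending("running stopped singing added eel"))
```

Here the first group captures one letter and the back reference requires that exact letter to appear again, which a character class repeated twice could not express.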
{\bf 20\%: 4\% each -- 3\% for definition, 1\% for example}
What would be an appropriate Perl data structure to hold an email
folder, where an email folder is viewed as a numbered sequence of
messages each of which consists of a collection of fields, such as
\verb+To+, \verb+From+, \verb+Subject+, etc.?
% %Give the Perl code to
% %access a message field, given the message number, the message field
% %name and the data structure holding the email folder.
% Give the Perl
% %code to sort and print the folder alphabetically by \verb+From+
% code to print one line for each email message in the folder
% containing the \verb+From+ and \verb+Subject+ fields for the
% message. The whole list should be sorted alphabetically by
% \verb+From+ field, and within \verb+From+ field by message number.
Write a Perl subroutine which, given a sender's name and a reference
to the email folder data structure, will print out the subject field
of all messages from the sender, one per line.
An appropriate data structure would be an array of hash references,
where each entry in the array points to an email message hash, and each hash has
as keys the field names and as values the contents of the field.
%The following Perl code sorts and prints the folder alphabetically.
%Assume the folder is held in an array of hash references called
% %senders = ();
% $index = 1;
% for ($email in @email_folder) {
% $senders{$index} = $$email{``From''};
% $$index++;
% }
% for ($i=0;$i <= $#@senders;$i++) {
% }
sub print_sender_messages {
    my ($sender, $folder_ref) = @_;
    my ($subject);
    # Each element of @$folder_ref is a reference to one message hash
    foreach my $email_ref (@$folder_ref) {
        if ($$email_ref{"From"} eq $sender) {
            $subject = $$email_ref{"Subject"};
            print "$subject\n";
        }
    }
}
{\bf 5 \% for the data structure; 15 \% for the code:
5 \% for the subroutine, parameter passing;
5 \% for using references correctly; 5 \% for appropriate
looping constructs}
% Text Compression
Text compression techniques are important because growth in volume
of text continually threatens to outstrip increases in storage,
bandwidth and processing capacity. Briefly explain the differences between:
\exitem {\bf symbolwise} (or
statistical) and {\bf dictionary} text compression methods;
\item {\bf Symbolwise methods} work by estimating the probabilities
of symbols (characters or words/non-words) and coding one symbol
at a time using shorter codewords for the more likely symbols
\item {\bf Dictionary methods} work by replacing word/text fragments
with an index to an entry in a dictionary
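As one concrete instance of a dictionary method, the following Python sketch implements LZ78, chosen here only as an illustration (the answer above does not name a specific algorithm). Each output pair is an index into the dictionary of phrases seen so far plus the next character.

```python
def lz78_encode(text):
    """Encode text as (dictionary index, next character) pairs."""
    dictionary = {"": 0}           # phrase -> index; index 0 is the empty phrase
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch           # keep extending the longest known phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                     # flush a trailing phrase that is already known
        output.append((dictionary[phrase], ""))
    return output


def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary construction."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


pairs = lz78_encode("abababab")
print(pairs)
print(lz78_decode(pairs))
```

Repeated fragments are replaced by short indices, so the number of pairs grows more slowly than the text when the text is repetitive.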
{\bf 5\% for explanation of symbolwise methods, 5 \% for dictionary}
\exitem {\bf modelling} versus {\bf coding} steps
Symbolwise methods rely on a modelling step and a coding step
\item {\bf Modelling} is the estimation of probabilities for the
symbols in the text -- the better the probability estimates, the
higher the compression that can be achieved
\item {\bf Coding} is the conversion of the probabilities obtained
from a model into a bitstream
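The modelling step can be sketched as follows, assuming the simplest possible model: each character's probability is its relative frequency in the text, independent of context. The function names are invented for this Python example. A symbol with probability $p$ ideally costs $-\log_2 p$ bits, which is the target the coding step tries to reach.

```python
import math
from collections import Counter


def order0_model(text):
    """Estimate each symbol's probability as its relative frequency."""
    counts = Counter(text)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}


def ideal_code_lengths(model):
    """Ideal codeword length in bits for each symbol: -log2(p)."""
    return {sym: -math.log2(p) for sym, p in model.items()}


def entropy(model):
    """Average bits per symbol an ideal coder needs under this model."""
    return sum(-p * math.log2(p) for p in model.values())


model = order0_model("aaaabbcd")
print(model)
print(ideal_code_lengths(model))
print(entropy(model))
```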
{\bf 5\% for explanation of modelling step, 5 \% for coding step}
\exitem {\bf static}, {\bf semi-static} and {\bf adaptive}
techniques for text compression;
Compression techniques can also be distinguished by
whether they are
\item {\bf Static} -- use a fixed model or fixed dictionary derived
in advance of any text to be compressed
\item {\bf Semi-static} -- use current text to build a
model or dictionary during one pass, then apply it in a second pass
\item {\bf Adaptive} -- build model or dictionary adaptively during
one pass
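An adaptive model can be sketched in a few lines. This Python example assumes a simple scheme in which every symbol of a known alphabet starts with a count of one and counts are updated after each symbol is seen; the function name is invented for illustration. A decoder can rebuild the same counts as it decodes, so no model needs to be transmitted.

```python
from fractions import Fraction


def adaptive_probabilities(text, alphabet):
    """Return the probability the model assigned to each symbol when it arrived."""
    counts = {sym: 1 for sym in alphabet}   # start from a uniform model
    total = len(counts)
    assigned = []
    for sym in text:
        assigned.append(Fraction(counts[sym], total))
        counts[sym] += 1                    # update only after coding the symbol
        total += 1
    return assigned


print(adaptive_probabilities("aab", "ab"))
```

A static model would use the same fixed probabilities for every text, and a semi-static one would first count the whole text and then code it in a second pass.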
{\bf 3\% each, 1 \% discretionary}
% \exitem {\bf Huffman coding} and {\bf arithmetic coding} methods
% for text compression.
% \mypercent{10}
The script for the fictitious language Gavagese contains only the
7 characters {\it a}, {\it e}, {\it u}, {\it k}, {\it r}, {\it f}, {\it
d}. You assemble a large electronic corpus of Gavagese and now
want to compress it. You analyse the frequency of occurrence of each
of these characters in the corpus and, using these frequencies as
estimates of the probability of occurrence of the characters in the
language as a whole, produce the following table:
Symbol & Probability \\ \hline
a & 0.25\\
e & 0.20 \\
u & 0.30 \\
k & 0.05 \\
r & 0.07\\
f & 0.08\\
d & 0.05 \\
\exitem Show how to construct a Huffman code tree for Gavagese,
given the above probabilities.
Start off by creating a leaf node for each character, with its
associated probability (a). Then join the two nodes with the smallest
probabilities under a single parent node, whose probability is their
sum, and repeat until only one node is left. Finally, 0's and 1's are
assigned to each binary split.
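The construction just described can be sketched in Python as below. The probabilities are entered as integer percentages to avoid floating point rounding, and ties are broken by insertion order, which is an arbitrary choice: a tree drawn by hand may assign different codewords while being equally good. The function names are invented for this example.

```python
import heapq
from itertools import count

# Gavagese probabilities from the table, as percentages.
GAVAGESE = {"a": 25, "e": 20, "u": 30, "k": 5, "r": 7, "f": 8, "d": 5}


def huffman_code(weights):
    """Build a Huffman code; returns a dict mapping symbol -> bit string."""
    tiebreak = count()                       # makes heap entries always comparable
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two least probable nodes
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    _, _, tree = heap[0]

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: label the split 0 / 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a single character
            codes[node] = prefix

    walk(tree, "")
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:               # prefix-free, so the first hit is right
            out.append(inverse[current])
            current = ""
    return "".join(out)


codes = huffman_code(GAVAGESE)
bits = encode("dukerafua", codes)
print(codes)
print(bits, len(bits))
print(decode(bits, codes))
```

With these probabilities the three frequent characters receive 2-bit codewords and the four rare ones 4-bit codewords, giving an average of 2.5 bits per character.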
{\bf 10 \% for the explanation of the method; 20 \% for a correct tree}
\exitem Use your codetree to encode the string {\it dukerafua} and