Commit adcd87f9 authored by Loïc Barrault

textprocessing updates

parent 025ecbab
COM3110
DEPARTMENT OF COMPUTER SCIENCE
Text Processing
Autumn Semester 2001-02
2 hours
Answer THREE questions.
All questions carry equal weight. Figures in square brackets indicate the
percentage of available marks allocated to each part of a question.
Schemes for electronic encoding of text which support all human
languages and permit interoperability of software on a global scale
require careful analysis of underlying issues in text
representation and design of appropriate standards.
Explain each of the following terms and make clear
the relations between them: language, script,
character, glyph, font. Give examples of each.
[20%]
Describe the Unicode coding model, making clear the
levels in the model, explaining the differences between and the
motivations for UTF-8 and UTF-16, and describing the purpose and
implementation of surrogate pairs. [30%]
Information about documents is frequently stored in the document itself
using embedded annotations called ``markup''.
What is the difference between a markup
metalanguage and a markup language? Give at least two examples of
each.
[10%]
What is a DTD? Propose a simple SGML DTD for the abstract
of a journal article. It should require that the abstract
contain one or more author names, a title, a journal name,
volume number, issue number, page numbers, and the text of
the abstract. Each author should have associated with them an
affiliation (e.g. their university). Give a simple example
of a fictitious abstract marked up using your DTD.
[40%]
Perl has three basic data types.
What are these three types? Give an example of each and
indicate how the type of data stored in a variable is conveyed
syntactically by the name of the variable.
[10%]
With these basic types more complex data structures may
be built using references. Describe a data structure that would
be appropriate to hold contact details for a set of persons --
for each person a telephone number, email address and fax number
is to be held.
Give Perl code that adds a new person's data to the
structure -- you may assume each piece of data (person name,
telephone number, email address, fax number) is already held in
a distinct named variable. Also give code that extracts a data
element (e.g. fax number) into a variable given the data
structure, a person's name, and a data element keyword (e.g.
fax). Explain how your code works.
[30%]
Explain the difference between lexical and dynamic scoping of
variables in Perl, and the role of the my, local
and our declarations.
[20%]
Regular expressions provide a very expressive language for pattern
matching in strings.
Explain the difference between metacharacters and
metasymbols in Perl regular expressions and give at least two
examples of each.
[10%]
Write a Perl regular expression which will match
HTML anchor tags and capture the value of the HREF
attribute and the contents of the anchor tag itself.
For example, suppose the following assignment has been made
in a Perl program:
Your regular expression should match
such strings and capture the substrings
and .
Explain how your regular expression works.
[30%]
\documentclass[12pt]{article}
%\documentstyle[12pt,newexam,epsf]{article}
%\documentstyle[myexam]{article}
%\usepackage{graphicx}
\usepackage{newexam+shield}
\usepackage{epsf}
\input{abbrev}
%\input{matdec}
\pressmark{COM3110}
\department{\bf {DEPARTMENT OF COMPUTER SCIENCE}}
\examtitle{{\bf Text Processing}}
\examdate{{\bf Autumn Semester 2002-03}}
\examtime{{\bf 2 hours}}
\rubric{{\bf Answer THREE questions.
All questions carry equal weight. Figures in square brackets indicate the
percentage of available marks allocated to each part of a question.}}
%%%% local macros %%%%
\newcommand{\seq}[2]{\mbox{$#1_{1},\- \ldots ,\-#1_{#2}$}}
\newtheorem{prop}{Proposition}[questionnumber]
\newenvironment{proof}{\begin{trivlist}\item[\hskip \labelsep {\bf Proof}]}{\nopagebreak
\rule{1mm}{3mm}\end{trivlist}}
\newcommand{\es}{\mbox{$\lambda$}}
\newcommand{\mrewrites}{\mbox{$\stackrel{*}{\Rightarrow}$}}
\newcommand{\psrule}{\mbox{$\rightarrow$}}
%\newcommand{\psrule}[2]{\mbox{{\rm #1}\ \ $\rightarrow$\ \ {\rm #2}}}
\newcommand{\synsemrule}[3]{%
\mbox{{\rm #1}\ \ $\rightarrow$\ \ {\rm #2}\ :\ ${\it #3}$}}
\newcommand{\scbrl}{\left[\hspace*{-0.12em}\left[}
\newcommand{\scbrr}{\right]\hspace*{-0.12em}\right]}
%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\begin{exam}
%\hspace*{\fill}QUESTION CONTINUED ON NEXT PAGE
%\turnover
%\continued
%\questionsend
%\begin{question}
% \begin{qupart}
% \begin{exlist}
% \exitem
% \mypercent{10}
% \exitem
% \mypercent{20}
% \end{exlist}
% \end{qupart}
%\end{question}
\begin{question}
Answer {\bf each} of the following short answer questions.
% Character encoding, compression and text markup
% Character Encoding + Unicode
\begin{qupart}
Explain each of the following terms and make clear
the relations between them: {\bf language}, {\bf script},
{\bf character}, {\bf glyph}, {\bf font}. Give examples of each.
\mypercent{20}
%\exitem Explain the overall goals and high level design principles
%underlying Unicode.
%What distinguishes Unicode from earlier character coding schemes?
%\mypercent{20}
\end{qupart}
\begin{qupart}
Briefly describe the function of, and give an example of, each of the
\verb+<!ELEMENT>+, \verb+<!ATTLIST>+ and \verb+<!ENTITY>+
declarations as found in SGML document type definitions.
\mypercent{20}
\end{qupart}
\begin{qupart}
Briefly describe the Unicode coding model, making clear the levels in the
model, explaining the differences between, and the motivations for,
UTF-8 and UTF-16.
%, and describing the purpose and implementation of
%surrogate pairs.
\mypercent{20}
\end{qupart}
\begin{qupart}
Define each of the following terms as they are used in discussing
Perl regular expressions and give an example of each:
{\bf metacharacters}, {\bf metasymbols}, {\bf anchors}, {\bf
quantifiers}, {\bf back references}.
\mypercent{20}
\end{qupart}
\begin{qupart}
What would be an appropriate Perl data structure to hold an email
folder, where an email folder is viewed as a numbered sequence of
messages each of which consists of a collection of fields, such as
\verb+To+, \verb+From+, \verb+Subject+, etc.?
Write a Perl subroutine which, given a sender's name and a reference
to the email folder data structure, will print out the subject field
of all messages from the sender, one per line.
\mypercent{20}
\end{qupart}
\end{question}
\begin{question}
% Text Compression
\begin{qupart}
Text compression techniques are important because growth in volume
of text continually threatens to outstrip increases in storage,
bandwidth and processing capacity. Briefly explain the differences between:
\begin{exlist}
\exitem {\bf symbolwise} (or
statistical) and {\bf dictionary} text compression methods;
\mypercent{10}
\exitem {\bf modelling} versus {\bf coding} steps;
\mypercent{10}
\exitem {\bf static}, {\bf semi-static} and {\bf adaptive}
techniques for text compression.
\mypercent{10}
% \exitem {\bf Huffman coding} and {\bf arithmetic coding} methods
% for text compression.
% \mypercent{10}
\end{exlist}
\end{qupart}
%\begin{qupart}
% Explain and make clear the differences between the following
% terms, as normally used in discussing text compression techniques:
%\begin{exlist}
%\exitem symbolwise versus dictionary methods
% \mypercent{10}
%\exitem modelling versus coding steps
% \mypercent{10}
%\exitem static versus semi-static versus adaptive techniques
% \mypercent{10}
%\end{exlist}
%\end{qupart}
\turnover
\begin{qupart}
The script for the fictitious language Gavagese contains only the
7 characters {\it a}, {\it e}, {\it u}, {\it k}, {\it r}, {\it f}, {\it
d}. You assemble a large electronic corpus of Gavagese and now
want to compress it. You analyse the frequency of occurrence of each
of these characters in the corpus and, using these frequencies as
estimates of the probability of occurrence of the characters in the
language as a whole, produce the following table:
\bigskip
\begin{center}
\begin{tabular}{ll}
Symbol & Probability \\ \hline
a & 0.25\\
e & 0.20 \\
u & 0.30 \\
k & 0.05 \\
r & 0.07\\
f & 0.08\\
d & 0.05 \\
\end{tabular}
\end{center}
\begin{exlist}
\exitem Show how to construct a Huffman code tree for Gavagese,
given the above probabilities.
\mypercent{30}
\exitem Use your codetree to encode the string {\it dukerafua} and
show the resulting binary encoding. For this string, how much
length does your codetree encoding save over a minimal fixed-length
binary character encoding for a 7 character alphabet?
\mypercent{10}
\end{exlist}
\end{qupart}
\begin{qupart}
One popular compression technique is the LZ77 method, used in common
compression utilities such as {\it gzip}.
\begin{exlist}
\exitem Explain how LZ77 works.
\mypercent{20}
\exitem How would the following LZ77 encoder output
% a b ad badb bbba
%badbadbbbbaaaaddddabba
\[
\langle 0, 0, b \rangle \langle 0, 0, a \rangle
\langle 0, 0, d \rangle
\langle 3, 3, b \rangle \langle 1, 3, a\rangle
\langle 1, 3, d \rangle \langle 1, 3, a \rangle
\langle 11, 2, a \rangle
\]
be decoded, assuming the encoding representation presented in the
lectures?
\mypercent{10}
\end{exlist}
\end{qupart}
\end{question}
%\continued
\questionsend
\end{exam}
\end{document}
COM3110
DEPARTMENT OF COMPUTER SCIENCE
Text Processing
Autumn Semester 2002-03
2 hours
Answer THREE questions.
All questions carry equal weight. Figures in square brackets indicate the
percentage of available marks allocated to each part of a question.
1. Answer each of the following short answer questions.
a) Explain each of the following terms and make clear the relations
between them: language, script, character, glyph, font. Give
examples of each. [20%]
b) Briefly describe the function of, and give an example of, each
of the <!ELEMENT>, <!ATTLIST> and <!ENTITY> declarations as found in SGML
document type definitions. [20%]
c) Briefly describe the Unicode coding model, making clear the
levels in the model, explaining the differences between, and the
motivations for, UTF-8 and UTF-16. [20%]
d) Define each of the following terms as they are used in
discussing Perl regular expressions and give an example of each:
metacharacters, metasymbols, anchors, quantifiers, back references.
[20%]
e) What would be an appropriate Perl data structure to hold an
email folder, where an email folder is viewed as a numbered
sequence of messages each of which consists of a collection of
fields, such as To, From, Subject, etc.?
Write a Perl subroutine which, given a sender's name and a
reference to the email folder data structure, will print out the
subject field of all messages from the sender, one per line. [20%]
2. a) Text compression techniques are important because growth in
volume of text continually threatens to outstrip increases in
storage, bandwidth and processing capacity. Briefly explain the
differences between:
(i) symbolwise (or statistical) and dictionary text compression
methods; [10%]
(ii) modelling versus coding steps; [10%]
(iii) static, semi-static and adaptive techniques for text
compression. [10%]
b) The script for the fictitious language Gavagese contains only the
7 characters a, e, u, k, r, f, d. You assemble a large electronic
corpus of Gavagese and now want to compress it. You analyse the
frequency of occurrence of each of these characters in the corpus
and, using these frequencies as estimates of the probability of
occurrence of the characters in the language as a whole, produce the
following table:
Symbol Probability
a 0.25
e 0.20
u 0.30
k 0.05
r 0.07
f 0.08
d 0.05
(i) Show how to construct a Huffman code tree for Gavagese, given
the above probabilities. [30%]
(ii) Use your codetree to encode the string dukerafua and show the
resulting binary encoding. For this string, how much length does
your codetree encoding save over a minimal fixed-length binary
character encoding for a 7 character alphabet? [10%]
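For reference, the construction asked for in part b) can be sketched in Python as below. This is a minimal illustration rather than a model answer: the function name huffman_code, the use of integer percentages in place of the probabilities, and the tie-breaking counter are all assumptions made for the sketch, and the exact bit patterns depend on how ties are broken and on which branch is labelled 0.

```python
import heapq
from itertools import count


def huffman_code(weights):
    """Build a Huffman code; returns a dict mapping symbol -> bit string."""
    tiebreak = count()  # unique counter so heap entries never compare nodes
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    # Repeatedly merge the two least probable nodes into one parent node.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        # Internal nodes are (left, right) tuples; leaves are symbols.
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


# Probabilities from the table above, scaled to integer percentages
# (an assumption of this sketch, to avoid floating-point ties).
GAVAGESE = {"a": 25, "e": 20, "u": 30, "k": 5, "r": 7, "f": 8, "d": 5}
CODES = huffman_code(GAVAGESE)
ENCODED = "".join(CODES[ch] for ch in "dukerafua")
print(CODES)
print(ENCODED, len(ENCODED))
```

With these probabilities every valid Huffman tree gives 2-bit codes to a, e and u and 4-bit codes to k, r, f and d, so dukerafua takes 26 bits, against 9 x 3 = 27 bits for a fixed-length 3-bit code.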
c) One popular compression technique is the LZ77 method, used in
common compression utilities such as gzip.
(i) Explain how LZ77 works. [20%]
(ii) How would the following LZ77 encoder output
<0,0,b><0,0,a><0,0,d><3,3,b><1,3,a><1,3,d><1,3,a><11,2,a>
be decoded, assuming the encoding representation presented in the
lectures? [10%]
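The decoding in part c) (ii) can be sketched in Python as follows. The sketch assumes each triple is <offset back from the current position, match length, next literal character>, with matches allowed to overlap the text being produced; this is one common presentation of LZ77 and may differ in detail from the lecture representation. The name lz77_decode is hypothetical.

```python
def lz77_decode(triples):
    """Decode a list of (offset, length, char) LZ77 triples into a string."""
    out = []
    for offset, length, char in triples:
        start = len(out) - offset  # where the match begins in the output so far
        for i in range(length):
            # Copy one character at a time so that a match may run into
            # characters that this same triple has just produced.
            out.append(out[start + i])
        out.append(char)  # the literal character that follows the match
    return "".join(out)


TRIPLES = [
    (0, 0, "b"), (0, 0, "a"), (0, 0, "d"),
    (3, 3, "b"), (1, 3, "a"), (1, 3, "d"),
    (1, 3, "a"), (11, 2, "a"),
]
DECODED = lz77_decode(TRIPLES)
print(DECODED)
```

Under this assumed representation the output is badbadbbbbaaaaddddabba; the triples with offset 1 and length 3 show the overlapping copy, where a single preceding character is repeated three times.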