\documentclass[12pt]{article}
\usepackage{fancyhdr}
\usepackage{epsfig}
\setlength{\textwidth}{170mm}
%\setlength{\textheight}{250mm}
\setlength{\textheight}{240mm}
%\setlength{\topmargin}{-15mm}
\setlength{\topmargin}{-20mm}
%\setlength{\headheight}{0pt}
%\setlength{\headsep}{0pt}
\setlength{\oddsidemargin}{-5mm}
%\setlength{\evensidemargin}{0in}

\setlength{\parindent}{0mm}
\setlength{\parskip}{2mm}

\lfoot{COM3110/6150}
\rfoot{COM3110/6150}
\pagestyle{fancy}
%\setlength{\headrulewidth}{0cm}
\renewcommand{\headrulewidth}{0cm}

\newenvironment{my_enum}{\begin{list}{\arabic{equation}.}{%    
\usecounter{equation}%
\setlength{\labelwidth}{2em}%
\setlength{\labelsep}{0.8em}%
\addtolength{\itemsep}{-2mm}%
\setlength{\leftmargin}{\labelwidth}%
\addtolength{\leftmargin}{\labelsep}%
\addtolength{\leftmargin}{-1mm}%
}}{\end{list}}

\begin{document}

\begin{center}
\Large
{\bf COM3110/6150: Lab Class 3
\\[3mm]
Computing Word Overlap
}
\end{center}

\begin{my_enum}
\item This week's lab provides practice in word counting and
  file-handling. More specifically, we look at computing the level of
  word overlap between two documents as a means for determining cases
  of document duplication, text reuse (e.g. plagiarism), and as a
  basis for  determining whether two documents have the same topic. 

\item Download the file {\tt NEWS.zip} from the module homepage (under
  \underline{Lab class files}). Unzip this to produce a directory {\tt
    NEWS} of short news articles (with names {\tt news01.txt}, {\tt
    news02.txt},~\ldots). The directory also contains a file {\tt
    stop\_list.txt}: a list of `unimportant' words, that are often
  ignored during word counting (discussed later in the course).

\item Working within the directory {\tt NEWS}, write a script {\tt
  compare.py} which can be called from the command line as follows:
\vspace*{-2mm}

\begin{verbatim}
    > python compare.py -s stop_list.txt news01.txt  news02.txt 
    news01.txt <> news02.txt = 0.0234
\end{verbatim}

Here, the command-line flag \verb!-s! identifies the file
\verb!stop_list.txt! as an (optional) stoplist. It is followed by
two news files whose contents are to be compared. The script responds
by printing the names of the files and a score value, which is a
measure of the degree of word overlap between the two articles. (If
you would rather delay using the {\tt getopt} module, you could supply
the stoplist as a standard command line argument.  To compute scores
for when a stoplist is {\it not\/} used, you could substitute this
file with an empty file.)
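
A minimal sketch of the option handling with the {\tt getopt} module
(the function name {\tt parse\_args} is illustrative, not prescribed):

\begin{verbatim}
import getopt
import sys

def parse_args(argv):
    # Accept an optional '-s STOPFILE' flag; the remaining
    # arguments are the names of the files to compare.
    opts, args = getopt.getopt(argv[1:], 's:')
    stop_file = None
    for flag, value in opts:
        if flag == '-s':
            stop_file = value
    return stop_file, args
\end{verbatim}

Called as {\tt parse\_args(sys.argv)}, this returns the stoplist file
name (or {\tt None} if no {\tt -s} flag was given) together with the
list of news files.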

Some suggestions about components to implement as part of this script
follow. 

\item STOPLIST: Firstly, you might define a function to load the
  stop words from the stoplist file into a suitable data structure ---
  a python {\it set\/} is an obvious candidate.  The file specifies
  one stop word per line, for easy reading, but you must strip off the
  line break from the lines you read in (e.g. using string method {\tt
    .strip()}) or later look-up won't work.
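
One way to sketch this loading function (the name {\tt load\_stoplist}
is an illustrative choice):

\begin{verbatim}
def load_stoplist(filename):
    # One stop word per line; .strip() removes the trailing line
    # break, without which later set look-up would fail.
    with open(filename) as infile:
        return {line.strip() for line in infile if line.strip()}
\end{verbatim}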

\item COUNTING:  You might define a function (e.g. {\tt count\_words}) which
  counts the occurrences of (non-stoplist) words in a file, and
  returns these counts as a dictionary, i.e.  with words as keys, and
  counts as values. (Alternatively, you might define a simple class,
  whose instance will store this information, and have the word
  counting function as a method.)

\item TOKENISATION: To simplify, you might treat the words as being
  all the maximal alphabetic sequences found in the file (which can
  readily be extracted from each line with a regex findall). To
  improve overlap, you should convert the text to lowercase, so that
  different case variants are conflated in counting.
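
Putting the counting and tokenisation suggestions together, a possible
sketch (the regex and the plain-dictionary approach are one option
among several):

\begin{verbatim}
import re

def count_words(filename, stoplist=frozenset()):
    # Words are maximal alphabetic sequences; lowercasing first
    # conflates case variants. Stoplist words are not counted.
    counts = {}
    with open(filename) as infile:
        for line in infile:
            for word in re.findall('[a-z]+', line.lower()):
                if word not in stoplist:
                    counts[word] = counts.get(word, 0) + 1
    return counts
\end{verbatim}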

\item COMPARISON: You might define a function which, given the counts
  for two files, computes a measure of their word overlap.  We will
  use versions of the {\it Jaccard coefficient}. A simple version
  ignores the counts of words in the files, and instead treats each as
  just a {\it set\/} of word types. For word sets $A$ and $B$, this
  measure is computed as:

\[ \frac{| A \cap B |}{| A \cup B |} \]
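
Given two count dictionaries, this set-based version might be coded as
follows (the function name is illustrative):

\begin{verbatim}
def jaccard(counts_a, counts_b):
    # Use only the word types (the dictionary keys), not the counts.
    a, b = set(counts_a), set(counts_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)
\end{verbatim}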

\item For a count-sensitive version of the Jaccard metric, we might use
  the following. For compared files $A$ and $B$, let $w_A$ and $w_B$
  denote the counts of word $w$ in the two files respectively. Our
  measure is then computed as follows (where $\Sigma_w$ here
  implicitly ranges over the full set of terms appearing in either
  file):

\[ \frac{\Sigma_w \,{\tt min}(w_A,w_B)}{\Sigma_w \,{\tt max}(w_A,w_B)} \]
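
A sketch of this count-sensitive version, again taking two count
dictionaries as input:

\begin{verbatim}
def weighted_jaccard(counts_a, counts_b):
    # Sum over every word appearing in either file; a word absent
    # from one file contributes a count of 0 there.
    vocab = set(counts_a) | set(counts_b)
    num = sum(min(counts_a.get(w, 0), counts_b.get(w, 0)) for w in vocab)
    den = sum(max(counts_a.get(w, 0), counts_b.get(w, 0)) for w in vocab)
    return num / den if den else 0.0
\end{verbatim}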

\item Having produced a script that can compare two files, you might
modify it to perform pair-wise comparison across more than two files,
e.g. as in: 
\vspace*{-2mm}

\begin{verbatim}
  > python compare.py -s stop_list.txt news01.txt news02.txt news03.txt
  news01.txt <> news02.txt = 0.0187
  news01.txt <> news03.txt = 0.0118
  news02.txt <> news03.txt = 0.0112        
\end{verbatim}
\vspace*{-2mm}

On Unix-like platforms (e.g. Linux and macOS), you can call your code
in the following manner as a means of providing {\it all\/} news files
as arguments:
\vspace*{-2mm}

\begin{verbatim}
  > python compare.py -s stop_list.txt news??.txt
\end{verbatim}
\vspace*{-2mm}

Given the number of comparisons here, you might modify your code to
store the scores, and then print out only the top $N$ scoring
comparisons.

The standard Windows command prompt does not expand wildcards in this
way, so there you will need to enter the arguments explicitly.
(Fortunately, the cases of the kind we are seeking can all be found
within the first 10 documents.)
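
The pair-wise looping can be sketched with {\tt
itertools.combinations}; here {\tt score\_fn} stands in for whatever
comparison function you have implemented:

\begin{verbatim}
import itertools

def pairwise_scores(filenames, score_fn):
    # Score every unordered pair of files, then sort so that the
    # highest-scoring comparisons come first (for top-N printing).
    scores = [(score_fn(a, b), a, b)
              for a, b in itertools.combinations(filenames, 2)]
    scores.sort(reverse=True)
    return scores
\end{verbatim}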

\item Apply your code to the set of news articles to see if you can
  find the following cases of related files:

\begin{itemize}
\item {\it duplication\/}: two of the news files are identical

\item {\it plagiarism\/}: one of the news files has been produced by
  cutting and pasting together text from two of the other files

\item {\it related topics\/}: three of the articles address the same
  news story, as given by three different newspapers. Do these
  distinct presentations of the same story exhibit higher overlap than
  other separately authored articles? (Note, two of these related
  articles appear within the first 5 news stories.) 
\end{itemize}

\item You can investigate whether similarity detection works better
  with the simple or the weighted metric, and whether the use of a
  stoplist helps or not. 

\item If you have time, look back at your code and consider how the 
functionality might be grouped into classes, to make it easier to
understand (and potentially more reusable). 

\end{my_enum}
\end{document}