\documentclass[12pt]{article}
\usepackage{fancyhdr}
\usepackage{epsfig}

\setlength{\textwidth}{170mm}
%\setlength{\textheight}{250mm}
\setlength{\textheight}{240mm}
%\setlength{\topmargin}{-15mm}
\setlength{\topmargin}{-20mm}
%\setlength{\headheight}{0pt}
%\setlength{\headsep}{0pt}
\setlength{\oddsidemargin}{-5mm}
%\setlength{\evensidemargin}{0in}
\setlength{\parindent}{0mm}
\setlength{\parskip}{2mm}

\lfoot{COM3110/6150}
\rfoot{COM3110/6150}
\pagestyle{fancy}
%\setlength{\headrulewidth}{0cm}
\renewcommand{\headrulewidth}{0cm}

\newenvironment{my_enum}{\begin{list}{\arabic{equation}.}{%
    \usecounter{equation}%
    \setlength{\labelwidth}{2em}%
    \setlength{\labelsep}{0.8em}%
    \addtolength{\itemsep}{-2mm}%
    \setlength{\leftmargin}{\labelwidth}%
    \addtolength{\leftmargin}{\labelsep}%
    \addtolength{\leftmargin}{-1mm}%
}}{\end{list}}

\begin{document}

\begin{center}
\Large {\bf COM3110/6150: Lab Class 3 \\[3mm]
Computing Word Overlap }
\end{center}

\begin{my_enum}

\item This week's lab provides practice in word counting and file handling.
More specifically, we look at computing the level of word overlap between two
documents as a means for determining cases of document duplication, text
reuse (e.g.\ plagiarism), and as a basis for determining whether two
documents have the same topic.
\item Download the file {\tt NEWS.zip} from the module homepage (under
\underline{Lab class files}). Unzip this to produce a directory {\tt NEWS}
of short news articles (with names {\tt news01.txt}, {\tt news02.txt},~\ldots).
The directory also contains a file {\tt stop\_list.txt}: a list of
`unimportant' words that are often ignored during word counting (discussed
later in the course).

\item Working within the directory {\tt NEWS}, write a script {\tt compare.py}
which can be called from the command line as follows:
\vspace*{-2mm}
\begin{verbatim}
> python compare.py -s stop_list.txt news01.txt news02.txt
news01.txt <> news02.txt = 0.0234
\end{verbatim}
Here, the command-line flag \verb!-s! identifies the file
\verb!stop_list.txt! as an (optional) stoplist. It is followed by two news
files whose contents are to be compared. The script responds by printing the
names of the files and a score value, which is a measure of the degree of
word overlap between the two articles. (If you would rather delay using the
{\tt getopt} module, you could supply the stoplist as a standard command-line
argument. To compute scores for when a stoplist is {\it not\/} used, you
could substitute this file with an empty file.) Some suggestions about
components to implement as part of this script follow.

\item STOPLIST: Firstly, you might define a function that loads the stop
words from the stoplist file into a suitable data structure --- a Python
{\it set\/} is an obvious candidate. The file specifies one stop word per
line, for easy reading, but you must strip off the line break from the lines
you read in (e.g.\ using the string method {\tt .strip()}) or later look-up
won't work.

\item COUNTING: You might define a function (e.g.\ {\tt count\_words}) which
counts the occurrences of (non-stoplist) words in a file, and returns these
counts as a dictionary, i.e.\ with words as keys, and counts as values.
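One possible shape for these components (and for the two Jaccard measures
described in the items that follow) is sketched below. Only {\tt count\_words}
is a name suggested above; {\tt load\_stoplist}, {\tt jaccard\_simple} and
{\tt jaccard\_weighted} are illustrative names, and the tokenisation choice
(lowercased maximal alphabetic sequences) follows the simplification
suggested in this handout --- treat this as one sketch, not the required
solution:

```python
import re

def load_stoplist(path):
    """Read one stop word per line into a set, stripping line breaks."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def count_words(path, stoplist=frozenset()):
    """Count occurrences of non-stoplist words in a file.

    Words are treated as maximal alphabetic sequences, lowercased so
    that case variants are conflated. Returns {word: count}.
    """
    counts = {}
    with open(path) as f:
        for line in f:
            for word in re.findall(r'[a-z]+', line.lower()):
                if word not in stoplist:
                    counts[word] = counts.get(word, 0) + 1
    return counts

def jaccard_simple(counts_a, counts_b):
    """Set-based Jaccard: |A & B| / |A | B| over word types only."""
    a, b = set(counts_a), set(counts_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def jaccard_weighted(counts_a, counts_b):
    """Count-sensitive Jaccard: sum of min counts / sum of max counts."""
    vocab = set(counts_a) | set(counts_b)
    num = sum(min(counts_a.get(w, 0), counts_b.get(w, 0)) for w in vocab)
    den = sum(max(counts_a.get(w, 0), counts_b.get(w, 0)) for w in vocab)
    return num / den if den else 0.0
```

A thin {\tt main} handling the {\tt -s} flag (via {\tt getopt} or otherwise)
can then call {\tt count\_words} on each file and print the score line shown
earlier.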
(Alternatively, you might define a simple class, whose instances will store
this information, and have the word-counting function as a method.)

\item TOKENISATION: To simplify, you might treat the words as being all the
maximal alphabetic sequences found in the file (which can readily be
extracted from each line with a regex {\tt findall}). To improve overlap,
you should convert the text to lowercase, so that different case variants
are conflated in counting.

\item COMPARISON: You might define a function which, given the counts for
two files, computes a measure of their word overlap. We will use versions of
the {\it Jaccard coefficient}. A simple version ignores the counts of words
in the files, and instead treats each file as just a {\it set\/} of word
types. For word sets $A$ and $B$, this measure is computed as:
\[ \frac{| A \cap B |}{| A \cup B |} \]

\item For a count-sensitive version of the Jaccard metric, we might use the
following. For compared files $A$ and $B$, let $w_A$ and $w_B$ denote the
counts of word $w$ in the two files respectively. Our measure is then
computed as follows (where $\Sigma_w$ here implicitly ranges over the full
set of terms appearing in either file):
\[ \frac{\Sigma_w \,{\tt min}(w_A,w_B)}{\Sigma_w \,{\tt max}(w_A,w_B)} \]

\item Having produced a script that can compare two files, you might modify
it to perform pair-wise comparison across more than two files, e.g.\ as in:
\vspace*{-2mm}
\begin{verbatim}
> python compare.py -s stop_list.txt news01.txt news02.txt news03.txt
news01.txt <> news02.txt = 0.0187
news01.txt <> news03.txt = 0.0118
news02.txt <> news03.txt = 0.0112
\end{verbatim}
\vspace*{-2mm}
On unix-like platforms (i.e.
unix, linux, mac), you can call your code in the following manner as a means
of providing {\it all\/} news files as arguments:
\vspace*{-2mm}
\begin{verbatim}
> python compare.py -s stop_list.txt news??.txt
\end{verbatim}
\vspace*{-2mm}
Given the number of comparisons here, you might modify your code to store
the scores, and then print out only the top $N$ scoring comparisons. I don't
think that you can do this on Windows, so you will need to enter the
arguments explicitly. (Fortunately, cases of the kind that we are seeking
can all be found within the first 10 documents.)

\item Apply your code to the set of news articles to see if you can find
the following cases of related files:
\begin{itemize}
\item {\it duplication\/}: two of the news files are identical
\item {\it plagiarism\/}: one of the news files has been produced by cutting
and pasting together text from two of the other files
\item {\it related topics\/}: three of the articles address the same news
story, as given by three different newspapers. Do these distinct
presentations of the same story exhibit higher overlap than other separately
authored articles? (Note, two of these related articles appear within the
first 5 news stories.)
\end{itemize}

\item You can investigate whether similarity detection works better with
the simple or the weighted metric, and whether the use of a stoplist helps
or not.

\item If you have time, look back at your code and consider how the
functionality might be grouped into classes, to make it easier to understand
(and potentially more reusable).

\end{my_enum}

\end{document}