\documentclass[english]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage[a4paper]{geometry}
\usepackage{babel}
\usepackage{dblfloatfix}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{color}
\usepackage{xcolor}



\newcommand\ra[0]{$\rightarrow\ $}
\newcommand{\red}[1]{{\color{red} #1}}

\begin{document}

\begin{center}
	\LARGE
	{\bf COM4115: Text Processing (2020/2021)} \\[5mm]
	\Large
	\bf  Assignment:   Sentiment Analysis of Movie Reviews \\ %[2mm]
\end{center}

%\vspace{1cm}

\section{Project Description}

The aim of this project is to implement a machine learning model based on Naive Bayes for a sentiment analysis task using the Rotten Tomatoes movie review dataset. 
Obstacles such as sentence negation, sarcasm, terseness, and language ambiguity, among many others, make this task very challenging.

\section{Submission}
Submit your assignment work electronically via Blackboard. 
Precise instructions for what files to submit are given later in this document. 
Please check you have access to the relevant Blackboard unit and contact the module lecturer if not.

\vspace{.3cm}
\textbf{SUBMISSION DEADLINE: 15:00, Friday week 11 (11 December, 2020)}

\vspace{.3cm}
Penalties: standard departmental penalties apply for late hand-in and for the use of unfair means.


\section{Data Description }

The dataset is a corpus of movie reviews originally collected by Pang and Lee, distributed as tab-separated files containing phrases from the Rotten Tomatoes dataset.
The data are split into \textbf{train/dev/test} sets, and the sentences are shuffled from their original order.
~\\
\begin{itemize}
	\item Each sentence has a \textbf{SentenceId}.
	\item All sentences have already been \textbf{tokenized}.
\end{itemize}

~\\
The training, dev, and test sets contain 7529, 1000, and 3310 sentences respectively. 
The sentences are labelled on a scale of five values: 
\begin{enumerate}
	\setcounter{enumi}{-1}
	\item negative 
	\item somewhat negative
	\item neutral
	\item somewhat positive
	\item positive
\end{enumerate} 

The following table shows several sentences together with their \textbf{sentiment scores}.

\begin{center}
\begin{tabular}{lll}
\toprule
SentenceId &	Phrase&	Sentiment\\ \midrule
1292 & The Sweetest Thing leaves a bitter taste .& 0 \\
343 & It labours as storytelling & 1\\ 
999 & There 's plenty to enjoy -- in no small part thanks to Lau . & 3 \\
1227 & Compellingly watchable . & 4 \\ \bottomrule
\end{tabular}
\end{center}
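
As a starting point, here is a minimal sketch of how the data could be loaded with the \textbf{Pandas} library (suggested in the Notes and Comments section at the end of this document); the file name \texttt{train.tsv} and the exact column names are assumptions and should be adapted to the files actually provided:
\begin{verbatim}
# Minimal sketch: load a tab-separated data file with pandas.
# The file name and column names are assumptions; adapt them to the
# files provided with the assignment.
import pandas as pd

train = pd.read_csv("train.tsv", sep="\t")  # assumed columns: SentenceId, Phrase, Sentiment
print(train.shape)                          # the training set should contain 7529 sentences
print(train.head())
\end{verbatim}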

% EVALUATION
\section{Evaluation}
Systems are evaluated using classification accuracy (the percentage of sentences whose label is predicted correctly) on the dev and test sets. 
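
For reference, a minimal sketch of the accuracy computation, assuming two equal-length lists of labels (one predicted, one reference):
\begin{verbatim}
# Minimal sketch of the evaluation metric: the proportion of
# predicted labels that match the reference labels.
def accuracy(predicted, reference):
    assert len(predicted) == len(reference)
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Example: accuracy([0, 1, 2], [0, 1, 4]) returns 2/3.
\end{verbatim}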

% ROADMAP
\section{Project Roadmap}

\begin{enumerate}

\item Implement some preprocessing steps:
\begin{itemize}
	\item You are free to add any preprocessing step (e.g. lowercasing) before training your models. Explain what you did in your report.
	\item Integrate a preprocessing step to create a \textbf{subword-level model} using Byte Pair Encoding (BPE, see e.g. \url{https://github.com/rsennrich/subword-nmt}). \textbf{\red{For this step only}, you are allowed to re-use an already implemented tool.}
	\item Implement a function to map the 5-value sentiment scale to a 3-value sentiment scale, as sketched after this list. 
	Namely, the labels ``negative'' (value 0) and ``somewhat negative'' (value 1) are merged into the label ``negative'' (value 0); 
	``neutral'' (value 2) is mapped to ``neutral'' (value 1); and finally, ``somewhat positive'' (value 3) and ``positive'' (value 4) are mapped to the label ``positive'' (value 2).
\end{itemize}
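
A minimal sketch of this label mapping (the dictionary and function names are only illustrative):
\begin{verbatim}
# Minimal sketch of the 5-value to 3-value mapping described above:
# {0, 1} -> 0 (negative), {2} -> 1 (neutral), {3, 4} -> 2 (positive).
FIVE_TO_THREE = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}

def map_to_three(label):
    """Map a 5-value sentiment label to the 3-value scale."""
    return FIVE_TO_THREE[label]
\end{verbatim}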

\item Visualize the \textbf{training} data:
\begin{itemize}
\item Implement any relevant visualisations that answer all or some of the following questions (a minimal sketch of such statistics is given after this list):
\begin{itemize}
\item What are the vocabulary size and word frequency distribution?
\item What are the most relevant words for each category (for both the 3- and 5-value sentiment scales)?
\item Are the classes equally probable?
\end{itemize}
\end{itemize}
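
A minimal sketch of such basic statistics, assuming the training data has been loaded into a Pandas DataFrame with columns \texttt{Phrase} and \texttt{Sentiment} as in the earlier loading sketch:
\begin{verbatim}
# Minimal sketch of simple training-set statistics.
# File and column names are assumptions; adapt them to the provided files.
from collections import Counter
import pandas as pd

train = pd.read_csv("train.tsv", sep="\t")
vocab = Counter(tok for phrase in train["Phrase"] for tok in phrase.split())
print("vocabulary size:", len(vocab))
print("most frequent words:", vocab.most_common(10))
print(train["Sentiment"].value_counts(normalize=True))  # class distribution
\end{verbatim}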


\item Implement a Naive Bayes classifier \textbf{from scratch} (the standard decision rule is recalled below for reference).
\begin{itemize}
	\item You may \textbf{NOT} re-use already implemented classes/functions (e.g. from scikit-learn).
\end{itemize}
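
For a tokenized sentence $w_1 \dots w_n$ and a set of classes $C$, the standard Naive Bayes decision rule is
\[
\hat{c} \;=\; \arg\max_{c \in C} \Big[\, \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \,\Big] ,
\]
where a common (but not mandatory) choice is to estimate the word likelihoods with add-one smoothing over the training vocabulary $V$:
\[
P(w_i \mid c) \;=\; \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|} .
\]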

\item For each set of labels (5-value and 3-value scales), train \textbf{at least three different models} (at least 6 models in total):
\begin{itemize}
\item One considering all the words in the training set as features.
\item One considering as features the subwords obtained with the Byte Pair Encoding preprocessing.
\item One with a set of \textbf{features} of your choice, selected based on your own experimentation (explain how you selected the features in your short report).
\end{itemize}

\item Compute and display confusion matrices on the development set for each of the models you developed, and compare the results.
\item Process the test data with your best-performing model.
\item Write a report (see below for details).
\end{enumerate}

% WHAT TO SUBMIT
\section{What to Submit}
Your assignment work is to be submitted electronically via Blackboard, and should include:

\begin{enumerate}

\item \textbf{Your Python code}.\\ 
It can be either a Python file or a Python notebook.

\item  \textbf{Four files with the predictions on the development and test corpora}.\\
The format is tab-separated, as follows: \textbf{sentence\_id[tab]sentiment\_value}\\
An example file named ``SampleSubmission\_test\_predictions\_5classes\_John\_DOE.tsv'' is provided with the data.\\
These files \textbf{MUST BE NAMED}, respectively:
\begin{itemize}
\item dev\_predictions\_3classes\_Firstname\_LASTNAME.tsv 
\item test\_predictions\_3classes\_Firstname\_LASTNAME.tsv 
\item dev\_predictions\_5classes\_Firstname\_LASTNAME.tsv 
\item test\_predictions\_5classes\_Firstname\_LASTNAME.tsv
\end{itemize}
where \textbf{Firstname is your first name and LASTNAME is your last name}.
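
A minimal sketch of writing one of these files, where \texttt{predictions} is an illustrative list of \texttt{(sentence\_id, sentiment)} pairs:
\begin{verbatim}
# Minimal sketch: write predictions in the required tab-separated format.
# `predictions` is an illustrative placeholder; replace it with the
# output of your own model.
predictions = [(1292, 0), (343, 1)]   # (sentence_id, sentiment) pairs

with open("dev_predictions_3classes_Firstname_LASTNAME.tsv", "w") as f:
    for sentence_id, sentiment in predictions:
        f.write(f"{sentence_id}\t{sentiment}\n")
\end{verbatim}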

\item  \textbf{A short report (as a pdf file)}.\\
It should \textbf{NOT EXCEED 2 PAGES IN LENGTH}. The report should include a brief description of the extent of the implementation achieved, present the performance results you have collected under different configurations, and state any conclusions drawn from your analysis of these results.
Graphs and tables may be used to present your results and aid exposition.

\end{enumerate}

% ASSESSMENT
\section{Assessment Criteria}
A total of 25 marks is available for the assignment and will be awarded according to the following criteria.

\textbf{Implementation and Code Style (15 marks)}\\
Have appropriate Python constructs been used?  
Is the code comprehensible and clearly commented?
How many different models have been tested?
How did you choose the best model?

\textbf{Report (10 marks)}\\
Is the report a clear and accurate description of the implementation?   
How complete and accurate is the discussion of the performance of the different systems under a range of configurations? 
What is the most important aspect to be taken into account to get the best results with a Naive Bayes approach?

% NOTES / COMMENTS
\section{Notes and Comments}

\begin{itemize}
\item Consider using the \textbf{Pandas} library to load the data (\url{https://pandas.pydata.org/}).
\item Consider using a \textbf{Seaborn heatmap} to render the confusion matrices (\url{https://seaborn.pydata.org/}); a minimal example is sketched after this list.
\item You may search the internet for lists of English punctuation marks and/or stopwords (also called function words) to use in your assignment.
\end{itemize}
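
A minimal sketch of building a confusion matrix and rendering it with a Seaborn heatmap; the \texttt{gold} and \texttt{pred} lists below are only illustrative and should be replaced by the development-set labels and your model's predictions:
\begin{verbatim}
# Minimal sketch: build a confusion matrix with numpy and render it
# with a Seaborn heatmap. `gold` and `pred` are illustrative lists of
# labels on the 5-value scale.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def confusion_matrix(gold, pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for g, p in zip(gold, pred):
        cm[g, p] += 1
    return cm

gold = [0, 1, 2, 3, 4, 4, 0]
pred = [0, 1, 2, 4, 4, 3, 1]
cm = confusion_matrix(gold, pred, num_classes=5)

sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel("predicted label")
plt.ylabel("gold label")
plt.show()
\end{verbatim}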

\end{document}