% SentAnalysis_COM3110.tex -- Loïc Barrault, Mar 08, 2021
\documentclass[english]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage[a4paper]{geometry}
\usepackage{babel}
\usepackage{dblfloatfix}
\usepackage{hyperref}
\usepackage{booktabs}

\begin{document}

\begin{center}
\LARGE {\bf COM3110: Text Processing (2020/2021)} \\[5mm]
\Large \bf Assignment: Sentiment Analysis of Movie Reviews \\ %[2mm]
\end{center}
%\vspace{1cm}

\section{Project Description}
The aim of this project is to implement a machine learning model based on Naive Bayes for a sentiment analysis task, using the Rotten Tomatoes movie review dataset. Obstacles such as sentence negation, sarcasm, terseness, language ambiguity and many others make this task very challenging.

\section{Submission}
Submit your assignment work electronically via Blackboard. Precise instructions on which files to submit are given later in this document. Please check that you have access to the relevant Blackboard unit, and contact the module lecturer if not.
\vspace{.3cm}

\textbf{SUBMISSION DEADLINE: 15:00, Friday week 11 (11 December, 2020)}
\vspace{.3cm}

Penalties: standard departmental penalties apply for late hand-in and use of unfair means.

\section{Data Description}
The dataset is a corpus of movie reviews originally collected by Pang and Lee. It consists of tab-separated files containing phrases from the Rotten Tomatoes dataset.
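As a minimal illustration of the file format, the tab-separated files can be loaded with the \textbf{Pandas} library suggested in the notes at the end of this document. The column names below mirror the example table in this section; the file name \texttt{train.tsv} in the comment is only an assumption, not the actual name of a provided file:

```python
import io
import pandas as pd

# Tiny in-memory sample standing in for one of the provided TSV files.
# The headers (SentenceId, Phrase, Sentiment) mirror the example table
# in this document; a real file would be read with e.g.
#   df = pd.read_csv("train.tsv", sep="\t")   # "train.tsv" is an assumed name
sample_tsv = (
    "SentenceId\tPhrase\tSentiment\n"
    "1292\tThe Sweetest Thing leaves a bitter taste .\t0\n"
    "1227\tCompellingly watchable .\t4\n"
)

df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(df.shape)                  # (2, 3)
print(df["Sentiment"].tolist())  # [0, 4]
```

Note that the phrases are already tokenized (hence the space before the final full stop), so no further tokenization is needed at load time.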
The data are split into \textbf{train/dev/test} sets, and the sentences are shuffled from their original order.
\begin{itemize}
\item Each sentence has a \textbf{SentenceId}.
\item All sentences have already been \textbf{tokenized}.
\end{itemize}
The training, dev and test sets contain 7529, 1000 and 3310 sentences, respectively. The sentences are labelled on a scale of five values:
\begin{enumerate}
\setcounter{enumi}{-1}
\item negative
\item somewhat negative
\item neutral
\item somewhat positive
\item positive
\end{enumerate}
The following table shows several sentences together with their \textbf{sentiment score}.
\begin{center}
\begin{tabular}{lll}
\toprule
SentenceId & Phrase & Sentiment\\
\midrule
1292 & The Sweetest Thing leaves a bitter taste . & 0 \\
343 & It labours as storytelling & 1\\
999 & There 's plenty to enjoy -- in no small part thanks to Lau . & 3 \\
1227 & Compellingly watchable . & 4 \\
\bottomrule
\end{tabular}
\end{center}

% EVALUATION
\section{Evaluation}
Systems are evaluated on classification accuracy (the percentage of sentences whose label is predicted correctly) on the dev and test sets.

% ROADMAP
\section{Project Roadmap}
\begin{enumerate}
\item Implement some preprocessing steps:
\begin{itemize}
\item You are free to add any preprocessing step (e.g. lowercasing) before training your models. Explain what you did in your report.
\item Implement a function to map the 5-value sentiment scale to a 3-value sentiment scale: the labels ``negative'' (value 0) and ``somewhat negative'' (value 1) are merged into the label ``negative'' (value 0); ``neutral'' (value 2) is mapped to ``neutral'' (value 1); and finally, ``somewhat positive'' (value 3) and ``positive'' (value 4) are mapped to the label ``positive'' (value 2).
\end{itemize}
\item Implement a Naive Bayes classifier \textbf{from scratch}.
\begin{itemize}
\item You may \textbf{NOT} re-use already implemented classes/functions (e.g. scikit-learn).
\end{itemize}
\item For each set of labels (5-value and 3-value scales), train \textbf{at least two different models} (at least 4 models in total):
\begin{itemize}
\item one considering all the words in the training set as features;
\item one with a set of \textbf{features} of your choice, determined by your experience (you will explain how you selected the features in your short report).
\end{itemize}
\item Compute and display confusion matrices on the development set for each of the developed models. Compare the results.
\item Process the test data with your best-performing model.
\item Write a report (see below for details).
\end{enumerate}

% WHAT TO SUBMIT
\section{What to Submit}
Your assignment work is to be submitted electronically via Blackboard, and should include:
\begin{enumerate}
\item \textbf{Your Python code}.\\
This can be either a Python file or a Python notebook.
\item \textbf{Four files with the predictions on the development and test corpora}.\\
The format is tab-separated, as follows: \textbf{sentence\_id[tab]sentiment\_value}\\
An example file named ``SampleSubmission\_test\_predictions\_5classes\_John\_DOE.tsv'' is provided with the data.\\
These files \textbf{MUST BE NAMED} respectively:
\begin{itemize}
\item dev\_predictions\_3classes\_Firstname\_LASTNAME.tsv
\item test\_predictions\_3classes\_Firstname\_LASTNAME.tsv
\item dev\_predictions\_5classes\_Firstname\_LASTNAME.tsv
\item test\_predictions\_5classes\_Firstname\_LASTNAME.tsv
\end{itemize}
where \textbf{Firstname is your first name and LASTNAME is your last name}.
\item \textbf{A short report (as a PDF file)}.\\
It should \textbf{NOT EXCEED 2 PAGES IN LENGTH}. The report should briefly describe the extent of the implementation achieved, present the performance results you have collected under different configurations, and state any conclusions you draw from your analysis of these results. Graphs/tables may be used in presenting your results, to aid exposition.
\end{enumerate}

% ASSESSMENT
\section{Assessment Criteria}
A total of 25 marks are available for the assignment, and will be assigned based on the following criteria.

\textbf{Implementation and Code Style (15 marks)}\\
Have appropriate Python constructs been used? Is the code comprehensible and clearly commented? How many different models have been tested? How did you choose the best model?

\textbf{Report (10 marks)}\\
Is the report a clear and accurate description of the implementation? How complete and accurate is the discussion of the performance of the different systems under a range of configurations? What is the most important aspect to take into account to get the best results with a Naive Bayes approach?

% NOTES / COMMENTS
\section{Notes and Comments}
\begin{itemize}
\item Consider using the \textbf{Pandas} library to load the data: \url{https://pandas.pydata.org/}.
\item Consider using a \textbf{Seaborn heatmap} to render the confusion matrices: \url{https://seaborn.pydata.org/}.
\item You may search the internet for lists of English punctuation and/or stopwords (also called function words) to use in your assignment.
\end{itemize}

\end{document}