Commit eb4c5bd9 authored by Fethi Bougares

SentAna project

parent 0356c97c
\relax
\catcode `:\active
\catcode `;\active
\catcode `!\active
\catcode `?\active
\babel@aux{french}{}
\@writefile{toc}{\contentsline {section}{\numberline {1}Project Description}{1}}
\@writefile{toc}{\contentsline {section}{\numberline {2}Data Description }{1}}
\@writefile{lot}{\contentsline {table}{\numberline {1}{\ignorespaces Example of phrases from the training data and their scores}}{1}}
\@writefile{toc}{\contentsline {section}{\numberline {3}Evaluation}{2}}
\@writefile{toc}{\contentsline {section}{\numberline {4}Project Roadmap}{2}}
\documentclass[french]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage[a4paper]{geometry}
\usepackage{babel}
\usepackage{dblfloatfix}
\begin{document}
\begin{center}
\LARGE
{\bf Movie Review Sentiment Analysis} \\[5mm]
\Large
\bf Machine Learning for Languages Project \\
{\bf 2018/2019} \\[2mm]
\end{center}
\vspace{1cm}
\section{Project Description}
The aim of this project is to implement a machine learning model for a sentiment analysis task using the Rotten Tomatoes movie review dataset. During this project you are asked to label phrases on a scale of five values:
\begin{enumerate}
\setcounter{enumi}{-1}
\item negative,
\item somewhat negative,
\item neutral,
\item somewhat positive,
\item positive.
\end{enumerate}
Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging.
\section{Data Description}
The dataset is a corpus of movie reviews originally collected by Pang and Lee [1]. It consists of tab-separated files containing phrases from the Rotten Tomatoes dataset. The data are split into \textbf{train/test} sets, and the sentences have been shuffled from their original order.
~\\
\begin{itemize}
\item Each sentence has been parsed into many \textbf{phrases} by the Stanford parser.
\item Each phrase has a \textbf{PhraseId}.
\item Each sentence has a \textbf{SentenceId}.
\item Phrases that are repeated (such as short/common words) are only included once in the data.
\end{itemize}
~\\
The training set contains 156000 examples and the test set comprises 66300 phrases. The following table shows several phrases and their sentiment scores.
\begin{table}[!b]
\begin{tabular}{|c|c|l|c|}
\hline
PhraseId & SentenceId & Phrase& Sentiment\\ \hline
8140& 336 &of inept filmmaking &1\\ \hline
8143& 336 &joyless , idiotic , annoying , heavy-handed ,& 0\\ \hline
8146& 336 &joyless& 0 \\ \hline
8147& 336 &idiotic , annoying , heavy-handed& 2\\ \hline
\end{tabular}
\caption{Example of phrases from the training data and their scores}
\end{table}
\newpage
\section{Evaluation}
Systems are evaluated on classification accuracy (the percentage of labels that are predicted correctly), computed over every parsed phrase.
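
As a minimal formalization of this metric, assume $N$ evaluated phrases with reference labels $y_i$ and predicted labels $\hat{y}_i$ (this notation is introduced here for illustration only and is not part of the dataset):
\[
\mathrm{Accuracy} = \frac{100}{N} \sum_{i=1}^{N} \mathbf{1}\left[\hat{y}_i = y_i\right] \ \%
\]
where $\mathbf{1}[\cdot]$ equals 1 when the prediction matches the reference label and 0 otherwise. For instance, a hypothetical system that labels 3 out of 4 phrases correctly would score $100 \times 3/4 = 75\,\%$.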
\section{Project Roadmap}
\begin{enumerate}
\item Study and plot the training data.
\item Split the data (train/dev).
\item Train and evaluate a vanilla deep recurrent neural network (RNN).
\item Use the PyTorch framework to train the RNN.
\item Optimize the model and propose enhancements (regularization, network initialization, new architectures).
\item Prepare the final defense.
\end{enumerate}
\end{document}