Commit bcf985de authored by Fethi Bougares's avatar Fethi Bougares

NER project

-------------------------------------------------------------------------------
NOTE: ADDED 16 AUGUST 2016
The Reuters Corpus is no longer distributed on a cd but as a single
compressed file, rcv1.tar.xz. In order to extract the English shared
task files from this file, you need the script ner/bin/make.eng.2016:
1. download ner.tgz from http://www.clips.uantwerpen.be/conll2003/ner/
2. extract the files from ner.tgz: tar zxf ner.tgz
3. put the Reuters file rcv1.tar.xz in the new directory ner
4. run make.eng.2016 from directory ner: cd ner; bin/make.eng.2016
This should generate the three files eng.train, eng.testa and eng.testb
Contact: erikt(at)xs4all.nl
-------------------------------------------------------------------------------
20030423 CONLL-2003 SHARED TASK
GENERAL
This is the 20030423 release of the data for the CoNLL-2003 shared
task. In order to be able to use this data you need the Reuters
Corpus cd (for the English data) and the ECI Multilingual Text cd
(for the German data) which can be obtained via the following two
addresses:
http://trec.nist.gov/data/reuters/reuters.html
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5
This distribution only contains annotation information and filter
software. The words of the articles in the two corpora have not
been included here for copyright reasons. That is why you need the
two cds for building the complete data sets.
The CoNLL-2003 shared task deals with Language-Independent Named
Entity Recognition. The two languages we deal with are English
and German. More information about this shared task can be found
on the related web page http://www.cnts.ua.ac.be/conll2003/ner/
BUILDING THE TRAIN AND TEST DATA FILES
In order to obtain the data files you need to perform three steps:
1. Extract the CoNLL-2003 files from the tar file available at
http://www.cnts.ua.ac.be/conll2003/ner.tgz
(tar zxf ner.tgz)
2a (English data) Insert the first cd of the Reuters Corpus in
your computer and mount it (mount /mnt/cdrom)
2b (German data) Insert the ECI Multilingual Text cd in
your computer and mount it (mount /mnt/cdrom)
3. Run the relevant extraction software from the ner directory
English: cd ner; bin/make.eng
German: cd ner; bin/make.deu
This will generate the training data (either eng.train or deu.train),
the development test data (eng.testa or deu.testa) and the final test
data (eng.testb or deu.testb) in the ner directory. You can use the
development data as test data during the development process of your
system. When your system works well, it can be applied to the final
test data.
These instructions assume that you work on a Linux work station and
that the cd files are available from the directory /mnt/cdrom . This
procedure might not work on other platforms.
News: January 26, 2006: Sven Hartrumpf from the FernUniversität in
Hagen, Germany, has checked and revised the entity annotations of the
German data. The new version is believed to be more accurate than
the previous one which was done by nonnative speakers. The files
associated with the new annotations can be found in the directory
ner/etc.2006
BUILDING THE UNANNOTATED DATA FILES
The unannotated data files can be built in the same way as the train
and test files. However, because of their size the annotation of these
files has been stored in separate tar files which you should fetch
first. Make sure that you have fetched and unpacked the main tar file
ner.tgz because that contains the software for building the files
with unannotated data. Here are the steps you should perform:
1. Extract the CoNLL-2003 files from the tar file available at
http://www.cnts.ua.ac.be/conll2003/ner/ner.tgz
(tar zxf ner.tgz)
2a (English data) Extract the annotation files for the unannotated data from
http://www.cnts.ua.ac.be/conll2003/ner/eng.raw.tar
(tar xf eng.raw.tar)
2b (German data) Extract the annotation files for the unannotated data from
http://www.cnts.ua.ac.be/conll2003/ner/deu.raw.tar
(tar xf deu.raw.tar)
3a (English data) Insert the first cd of the Reuters Corpus in
your computer and mount it (mount /mnt/cdrom)
3b (German data) Insert the ECI Multilingual Text cd in
your computer and mount it (mount /mnt/cdrom)
4. Run the relevant extraction software from the ner directory
English: cd ner; bin/make.eng.raw
German: cd ner; bin/make.deu.raw
This will generate the file eng.raw.gz (or deu.raw.gz) in the ner
directory. These files have been compressed with gzip.
These instructions assume that you work on a Linux work station and
that the cd files are available from the directory /mnt/cdrom . This
procedure might not work on other platforms.
DATA FORMAT
The data files contain one word per line. Empty lines have been used
for marking sentence boundaries and a line containing the keyword
-DOCSTART- has been added to the beginning of each article in order
to mark article boundaries. Each non-empty line contains the following
tokens:
1. the current word
2. the lemma of the word (German only)
3. the part-of-speech (POS) tag generated by a tagger
4. the chunk tag generated by a text chunker
5. the named entity tag given by human annotators
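
The column layout above is easy to parse programmatically. A minimal
sketch (not part of the distribution; the example line and the
dictionary keys are illustrative):
```python
def parse_line(line, has_lemma=False):
    """Split one CoNLL-2003 data line into its fields.

    English lines have 4 fields (word, POS, chunk, NE tag);
    German lines carry an extra lemma column after the word.
    """
    fields = line.split(" ")
    if has_lemma:
        word, lemma, pos, chunk, ne = fields
        return {"word": word, "lemma": lemma, "pos": pos,
                "chunk": chunk, "ne": ne}
    word, pos, chunk, ne = fields
    return {"word": word, "pos": pos, "chunk": chunk, "ne": ne}

print(parse_line("U.N. NNP I-NP I-ORG"))
```
Empty lines (sentence boundaries) and -DOCSTART- lines should be
handled separately before calling such a parser.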
The tagger and chunker for English are roughly similar to the
ones used in the memory-based shallow parser demo available at
http://ilk.uvt.nl/. German POS and chunk information has been
generated by the Treetagger from the University of Stuttgart:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
In order to simulate a real natural language processing
environment, the POS tags and chunk tags have not been checked.
This means that they will contain errors. If you have access to
annotation software that performs better, you may replace these
tags with your own.
The chunk tags and the named entity tags use the IOB1 format. This
means that in general words inside an entity receive the tag I-TYPE
to denote that they are Inside an entity of type TYPE. Whenever
two entities of the same type immediately follow each other, the
first word of the second entity will receive tag B-TYPE rather than
I-TYPE in order to show that a new entity starts at that word.
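
The IOB1 convention described above can be sketched as a small
decoder that turns a tag sequence into entity spans (an illustrative
sketch, not part of the distribution; `end` is exclusive):
```python
def iob1_spans(tags):
    """Convert a sequence of IOB1 tags into (type, start, end) spans.

    In IOB1, I-TYPE opens a new entity after O or after a different
    type; B-TYPE appears only where an entity directly follows an
    entity of the same type.
    """
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "O":
            if cur is not None:           # close the open entity
                spans.append((cur, start, i))
                cur = None
        elif prefix == "B" or etype != cur:
            if cur is not None:           # close before reopening
                spans.append((cur, start, i))
            cur, start = etype, i
    if cur is not None:                   # entity runs to the end
        spans.append((cur, start, len(tags)))
    return spans

# Two consecutive PER entities: the second one starts with B-PER.
print(iob1_spans(["I-PER", "I-PER", "B-PER", "O", "I-LOC"]))
# → [('PER', 0, 2), ('PER', 2, 3), ('LOC', 4, 5)]
```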
The raw data has the same format as the training and test material
but the final column has been omitted. There are word lists for
English (extracted from the training data), German (extracted from
the training data), and Dutch in the directory lists. You can
probably use the Dutch person names (PER) for the English data as
well. Feel
free to use any other external data sources that you might have
access to.
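
One simple way to exploit such word lists is as a gazetteer feature.
A minimal sketch under the assumption of single-word, case-sensitive
matching (the list contents below are illustrative):
```python
def gazetteer_features(words, name_list):
    """Mark each word that appears in a name list.

    A crude gazetteer feature: real systems would typically also
    match multi-word names and normalize casing.
    """
    names = set(name_list)
    return [word in names for word in words]

print(gazetteer_features(["Ekeus", "heads", "for", "Baghdad"],
                         ["Ekeus", "Baghdad"]))
# → [True, False, False, True]
```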
GOALS
In the CoNLL-2002 shared task we have worked on named entity
recognition as well (Spanish and Dutch). The CoNLL-2003 shared
task deals with two different languages (English and German).
We also supply additional resources: lists of named
entities and non-annotated data. One of the main tasks of
the participants in the CoNLL-2003 shared task will be to find
out how these additional resources can be used to improve the
performance of their system.
BASELINE
The baseline performance for this shared task is assigning named
entity classes to word sequences that occur in the training data.
It can be computed as follows (example for English development data):
bin/baseline eng.train eng.testa | bin/conlleval
and the results are:
eng.testa: precision: 78.33%; recall: 65.23%; FB1: 71.18
eng.testb: precision: 71.91%; recall: 50.90%; FB1: 59.61
deu.testa: precision: 37.19%; recall: 26.07%; FB1: 30.65
deu.testb: precision: 31.86%; recall: 28.89%; FB1: 30.30
If you build a system for this task, it should at least improve on
the performance of this baseline system.
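
The memorization baseline described above can be sketched as a
longest-match lookup of entity phrases seen in training. This is a
simplified illustration, not the distributed bin/baseline script; it
ignores ambiguous phrases and the IOB1 B- convention for adjacent
entities of the same type:
```python
def baseline_tag(train_entities, words):
    """Tag `words` by longest match against entity phrases memorized
    from training data. `train_entities` maps a word tuple to its
    entity type, e.g. {("U.N.",): "ORG"}.
    """
    max_len = max((len(k) for k in train_entities), default=0)
    tags, i = [], 0
    while i < len(words):
        # try the longest phrase starting at position i first
        for n in range(min(max_len, len(words) - i), 0, -1):
            etype = train_entities.get(tuple(words[i:i + n]))
            if etype is not None:
                tags += ["I-" + etype] * n
                i += n
                break
        else:
            tags.append("O")
            i += 1
    return tags

ents = {("U.N.",): "ORG", ("Baghdad",): "LOC"}
print(baseline_tag(ents, ["U.N.", "official", "heads", "for", "Baghdad"]))
# → ['I-ORG', 'O', 'O', 'O', 'I-LOC']
```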
Antwerp, April 23, 2003
Erik Tjong Kim Sang <erik.tjongkimsang@ua.ac.be>
Fien De Meulder <fien.demeulder@ua.ac.be>
#!/usr/bin/perl -w
# conlleval: evaluate result of processing CoNLL-2000 shared task
# usage: conlleval [-l] [-r] [-d delimiterTag] [-o oTag] < file
# README: http://cnts.uia.ac.be/conll2000/chunking/output.html
# options: l: generate LaTeX output for tables like in
# http://cnts.uia.ac.be/conll2003/ner/example.tex
# r: accept raw result tags (without B- and I- prefix;
# assumes one word per chunk)
# d: alternative delimiter tag (default is single space)
# o: alternative outside tag (default is O)
# note: the file should contain lines with items separated
# by $delimiter characters (default space). The final
# two items should contain the correct tag and the
# guessed tag in that order. Sentences should be
# separated from each other by empty lines or lines
# with $boundary fields (default -X-).
# url: http://lcg-www.uia.ac.be/conll2000/chunking/
# started: 1998-09-25
# version: 2003-04-28
# author: Erik Tjong Kim Sang <erikt@uia.ua.ac.be>
use strict;
my $false = 0;
my $true = 42;
my $boundary = "-X-"; # sentence boundary
my $correct; # current corpus chunk tag (I,O,B)
my $correctChunk = 0; # number of correctly identified chunks
my $correctTags = 0; # number of correct chunk tags
my $correctType; # type of current corpus chunk tag (NP,VP,etc.)
my $delimiter = " "; # field delimiter
my $FB1 = 0.0; # FB1 score (Van Rijsbergen 1979)
my $firstItem; # first feature (for sentence boundary checks)
my $foundCorrect = 0; # number of chunks in corpus
my $foundGuessed = 0; # number of identified chunks
my $guessed; # current guessed chunk tag
my $guessedType; # type of current guessed chunk tag
my $i; # miscellaneous counter
my $inCorrect = $false; # currently processed chunk is correct until now
my $lastCorrect = "O"; # previous chunk tag in corpus
my $latex = 0; # generate LaTeX formatted output
my $lastCorrectType = ""; # type of previously identified chunk tag
my $lastGuessed = "O"; # previously identified chunk tag
my $lastGuessedType = ""; # type of previous chunk tag in corpus
my $lastType; # temporary storage for detecting duplicates
my $line; # line
my $nbrOfFeatures = -1; # number of features per line
my $precision = 0.0; # precision score
my $oTag = "O"; # outside tag, default O
my $raw = 0; # raw input: add B to every token
my $recall = 0.0; # recall score
my $tokenCounter = 0; # token counter (ignores sentence breaks)
my %correctChunk = (); # number of correctly identified chunks per type
my %foundCorrect = (); # number of chunks in corpus per type
my %foundGuessed = (); # number of identified chunks per type
my @features; # features on line
my @sortedTypes; # sorted list of chunk type names
# sanity check
while (@ARGV and $ARGV[0] =~ /^-/) {
if ($ARGV[0] eq "-l") { $latex = 1; shift(@ARGV); }
elsif ($ARGV[0] eq "-r") { $raw = 1; shift(@ARGV); }
elsif ($ARGV[0] eq "-d") {
shift(@ARGV);
if (not defined $ARGV[0]) {
die "conlleval: -d requires delimiter character";
}
$delimiter = shift(@ARGV);
} elsif ($ARGV[0] eq "-o") {
shift(@ARGV);
if (not defined $ARGV[0]) {
die "conlleval: -o requires outside tag";
}
$oTag = shift(@ARGV);
} else { die "conlleval: unknown argument $ARGV[0]\n"; }
}
if (@ARGV) { die "conlleval: unexpected command line argument\n"; }
# process input
while (<STDIN>) {
chomp($line = $_);
@features = split(/$delimiter/,$line);
if ($nbrOfFeatures < 0) { $nbrOfFeatures = $#features; }
elsif ($nbrOfFeatures != $#features and @features != 0) {
printf STDERR "unexpected number of features: %d (%d)\n",
$#features+1,$nbrOfFeatures+1;
exit(1);
}
if (@features == 0) { @features = ($boundary,"O","O"); }
if (@features < 2) {
die "conlleval: unexpected number of features in line $line\n";
}
if ($raw) {
if ($features[$#features] eq $oTag) { $features[$#features] = "O"; }
if ($features[$#features-1] eq $oTag) { $features[$#features-1] = "O"; }
if ($features[$#features] ne "O") {
$features[$#features] = "B-$features[$#features]";
}
if ($features[$#features-1] ne "O") {
$features[$#features-1] = "B-$features[$#features-1]";
}
}
($guessed,$guessedType) = split(/-/,pop(@features));
($correct,$correctType) = split(/-/,pop(@features));
$guessedType = $guessedType ? $guessedType : "";
$correctType = $correctType ? $correctType : "";
$firstItem = shift(@features);
# 1999-06-26 sentence breaks should always be counted as out of chunk
if ( $firstItem eq $boundary ) { $guessed = "O"; }
if ($inCorrect) {
if ( &endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and
&endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and
$lastGuessedType eq $lastCorrectType) {
$inCorrect=$false;
$correctChunk++;
$correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ?
$correctChunk{$lastCorrectType}+1 : 1;
} elsif (
&endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) !=
&endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) or
$guessedType ne $correctType ) {
$inCorrect=$false;
}
}
if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and
&startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and
$guessedType eq $correctType) { $inCorrect = $true; }
if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) ) {
$foundCorrect++;
$foundCorrect{$correctType} = $foundCorrect{$correctType} ?
$foundCorrect{$correctType}+1 : 1;
}
if ( &startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) ) {
$foundGuessed++;
$foundGuessed{$guessedType} = $foundGuessed{$guessedType} ?
$foundGuessed{$guessedType}+1 : 1;
}
if ( $firstItem ne $boundary ) {
if ( $correct eq $guessed and $guessedType eq $correctType ) {
$correctTags++;
}
$tokenCounter++;
}
$lastGuessed = $guessed;
$lastCorrect = $correct;
$lastGuessedType = $guessedType;
$lastCorrectType = $correctType;
}
if ($inCorrect) {
$correctChunk++;
$correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ?
$correctChunk{$lastCorrectType}+1 : 1;
}
if (not $latex) {
# compute overall precision, recall and FB1 (default values are 0.0)
$precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0);
$recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0);
$FB1 = 2*$precision*$recall/($precision+$recall)
if ($precision+$recall > 0);
# print overall performance
printf "processed $tokenCounter tokens with $foundCorrect phrases; ";
printf "found: $foundGuessed phrases; correct: $correctChunk.\n";
if ($tokenCounter>0) {
printf "accuracy: %6.2f%%; ",100*$correctTags/$tokenCounter;
printf "precision: %6.2f%%; ",$precision;
printf "recall: %6.2f%%; ",$recall;
printf "FB1: %6.2f\n",$FB1;
}
}
# sort chunk type names
undef($lastType);
@sortedTypes = ();
foreach $i (sort (keys %foundCorrect,keys %foundGuessed)) {
if (not($lastType) or $lastType ne $i) {
push(@sortedTypes,($i));
}
$lastType = $i;
}
# print performance per chunk type
if (not $latex) {
for $i (@sortedTypes) {
$correctChunk{$i} = $correctChunk{$i} ? $correctChunk{$i} : 0;
if (not($foundGuessed{$i})) { $precision = 0.0; }
else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; }
if (not($foundCorrect{$i})) { $recall = 0.0; }
else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; }
if ($precision+$recall == 0.0) { $FB1 = 0.0; }
else { $FB1 = 2*$precision*$recall/($precision+$recall); }
printf "%17s: ",$i;
printf "precision: %6.2f%%; ",$precision;
printf "recall: %6.2f%%; ",$recall;
printf "FB1: %6.2f\n",$FB1;
}
} else {
print " & Precision & Recall & F\$_{\\beta=1} \\\\\\hline";
for $i (@sortedTypes) {
$correctChunk{$i} = $correctChunk{$i} ? $correctChunk{$i} : 0;
if (not($foundGuessed{$i})) { $precision = 0.0; }
else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; }
if (not($foundCorrect{$i})) { $recall = 0.0; }
else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; }
if ($precision+$recall == 0.0) { $FB1 = 0.0; }
else { $FB1 = 2*$precision*$recall/($precision+$recall); }
printf "\n%-7s & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\",
$i,$precision,$recall,$FB1;
}
print "\\hline\n";
$precision = 0.0;
$recall = 0;
$FB1 = 0.0;
$precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0);
$recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0);
$FB1 = 2*$precision*$recall/($precision+$recall)
if ($precision+$recall > 0);
printf "Overall & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\\\hline\n",
$precision,$recall,$FB1;
}
exit 0;
# endOfChunk: checks if a chunk ended between the previous and current word
# arguments: previous and current chunk tags, previous and current types
# note: this code is capable of handling other chunk representations
# than the default CoNLL-2000 ones, see EACL'99 paper of Tjong
# Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006
sub endOfChunk {
my $prevTag = shift(@_);
my $tag = shift(@_);
my $prevType = shift(@_);
my $type = shift(@_);
my $chunkEnd = $false;
if ( $prevTag eq "B" and $tag eq "B" ) { $chunkEnd = $true; }
if ( $prevTag eq "B" and $tag eq "O" ) { $chunkEnd = $true; }
if ( $prevTag eq "I" and $tag eq "B" ) { $chunkEnd = $true; }
if ( $prevTag eq "I" and $tag eq "O" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" and $tag eq "E" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" and $tag eq "I" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" and $tag eq "O" ) { $chunkEnd = $true; }
if ($prevTag ne "O" and $prevTag ne "." and $prevType ne $type) {
$chunkEnd = $true;
}
# corrected 1998-12-22: these chunks are assumed to have length 1
if ( $prevTag eq "]" ) { $chunkEnd = $true; }
if ( $prevTag eq "[" ) { $chunkEnd = $true; }
return($chunkEnd);
}
# startOfChunk: checks if a chunk started between the previous and current word
# arguments: previous and current chunk tags, previous and current types
# note: this code is capable of handling other chunk representations
# than the default CoNLL-2000 ones, see EACL'99 paper of Tjong
# Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006
sub startOfChunk {
my $prevTag = shift(@_);
my $tag = shift(@_);
my $prevType = shift(@_);
my $type = shift(@_);
my $chunkStart = $false;
if ( $prevTag eq "B" and $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "I" and $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "O" and $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "O" and $tag eq "I" ) { $chunkStart = $true; }
if ( $prevTag eq "E" and $tag eq "E" ) { $chunkStart = $true; }
if ( $prevTag eq "E" and $tag eq "I" ) { $chunkStart = $true; }
if ( $prevTag eq "O" and $tag eq "E" ) { $chunkStart = $true; }
if ($tag ne "O" and $tag ne "." and $prevType ne $type) {
$chunkStart = $true;
}
# corrected 1998-12-22: these chunks are assumed to have length 1
if ( $tag eq "[" ) { $chunkStart = $true; }
if ( $tag eq "]" ) { $chunkStart = $true; }
return($chunkStart);
}
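
The chunk-level scoring performed by the conlleval script above
reduces to three counts. A minimal Python restatement of the final
computation (an illustrative sketch, not part of the distribution):
```python
def fb1(correct, guessed, found):
    """Chunk-level precision, recall and FB1 as computed by conlleval:
    correct = correctly identified chunks, guessed = chunks proposed
    by the system, found = chunks in the gold standard.
    """
    p = 100.0 * correct / guessed if guessed else 0.0
    r = 100.0 * correct / found if found else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = fb1(80, 100, 120)
print(p, r, f)  # precision 80.0, recall ~66.67, FB1 ~72.73
```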
<html><head><title>
Language-Independent Named Entity Recognition (II)
</title></head><body bgcolor="#ffffff"><p>
<table cellpadding="0" cellspacing="0" border="0" width="100%">
<tr><td bgcolor="#00ccff" valign="top">&nbsp;
</table><p>
<h1>Language-Independent Named Entity Recognition (II)</h1>
<p>
Named entities are phrases that contain the names of persons,
organizations, locations, times and quantities.
Example:
<p>
<blockquote>
[ORG <font color="#0000ff">U.N.</font> ]
official
[PER <font color="#ff0000">Ekeus</font> ]
heads
for
[LOC <font color="#00ff00">Baghdad</font> ]
.
</blockquote>
<p>
The shared task of CoNLL-2003 concerns language-independent named
entity recognition.
We will concentrate on four types of named entities: persons,
locations, organizations and names of miscellaneous entities that do
not belong to the previous three groups.
The participants of the shared task will be offered training and test
data for two languages.
They will use the data for developing a named-entity recognition
system that includes a machine learning component.
For each language, additional information (lists of names and
non-annotated data) will be supplied as well.
The challenge for the participants is to find ways of incorporating
this information in their system.
<p>
<h2>Background information</h2>
<p>
Named Entity Recognition (NER) is a subtask of Information Extraction.
Different NER systems were evaluated as a part of the Sixth Message
Understanding Conference in 1995
(<a href="http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html">MUC6</a>).
The target language was English.
The participating systems performed well.
However, many of them used language-specific resources for performing
the task and it is unknown how they would have performed on another
language than English [<a href="#PD97">PD97</a>].
<p>
Since 1995, NER systems have been developed for some European languages
and a few Asian languages.
There have been at least two studies that have applied one NER system
to different languages.
Palmer and Day [<a href="#PD97">PD97</a>] have used statistical methods
for finding named entities in newswire articles in Chinese, English,
French, Japanese, Portuguese and Spanish.
They found that the difficulty of the NER task was different for the
six languages but that a large part of the task could be performed
with simple methods.
Cucerzan and Yarowsky [<a href="#CY99">CY99</a>] used both
morphological and contextual clues for identifying named entities in
English, Greek, Hindi, Rumanian and Turkish.
With minimal supervision, they obtained overall F measures between 40
and 70, depending on the languages used.
In the shared task at
<a href="../../conll2002/ner/">CoNLL-2002</a>,
twelve different learning systems were applied to data in Spanish and
Dutch.
<p>
<h2>Software and Data</h2>
<p>
The CoNLL-2003 shared task data files contain four columns separated by
a single space.
Each word has been put on a separate line and there is an empty line
after each sentence.
The first item on each line is a word, the second a part-of-speech (POS)
tag, the third a syntactic chunk tag and the fourth the named entity
tag.
The chunk tags and the named entity tags have the format I-TYPE,
which means that the word is inside a phrase of type TYPE.
Only when two phrases of the same type immediately follow each other
does the first word of the second phrase receive the tag B-TYPE,
to show that it starts a new phrase.
A word with tag O is not part of a phrase.
Here is an example:
<p>
<pre>