Commit bcf985de authored by Fethi Bougares

NER project

parent aa48037f
-------------------------------------------------------------------------------
NOTE: ADDED 16 AUGUST 2016
The Reuters Corpus is not distributed on a cd anymore but as a single
compressed file rcv1.tar.xz . In order to extract the English shared
task files from this file, you need the script ner/bin/make.eng.2016
1. download ner.tgz from http://www.clips.uantwerpen.be/conll2003/ner/
2. extract the files from ner.tgz: tar zxf ner.tgz
3. put the Reuters file rcv1.tar.xz in the new directory ner
4. run make.eng.2016 from directory ner: cd ner; bin/make.eng.2016
This should generate the three files eng.train, eng.testa and eng.testb
Contact: erikt(at)xs4all.nl
-------------------------------------------------------------------------------
20030423 CONLL-2003 SHARED TASK
GENERAL
This is the 20030423 release of the data for the CoNLL-2003 shared
task. In order to be able to use this data you need the Reuters
Corpus cd (for the English data) and the ECI Multilingual Text cd
(for the German data) which can be obtained via the following two
addresses:
http://trec.nist.gov/data/reuters/reuters.html
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5
This distribution only contains annotation information and filter
software. The words of the articles in the two corpora have not
been included here for copyright reasons. That is why you need the
two cds for building the complete data sets.
The CoNLL-2003 shared task deals with Language-Independent Named
Entity Recognition. The two languages we deal with are English
and German. More information about this shared task can be found
on the related web page http://www.cnts.ua.ac.be/conll2003/ner/
BUILDING THE TRAIN AND TEST DATA FILES
In order to obtain the data files you need to perform three steps:
1. Extract the CoNLL-2003 files from the tar file available at
http://www.cnts.ua.ac.be/conll2003/ner.tgz
(tar zxf ner.tgz)
2a (English data) Insert the first cd of the Reuters Corpus in
your computer and mount it (mount /mnt/cdrom)
2b (German data) Insert the ECI Multilingual Text cd in
your computer and mount it (mount /mnt/cdrom)
3. Run the relevant extraction software from the ner directory
English: cd ner; bin/make.eng
German: cd ner; bin/make.deu
This will generate the training data (either eng.train or deu.train),
the development test data (eng.testa or deu.testa) and the final test
data (eng.testb or deu.testb) in the ner directory. You can use the
development data as test data during the development process of your
system. When your system works well, it can be applied to the final
test data.
These instructions assume that you work on a Linux work station and
that the cd files are available from the directory /mnt/cdrom . This
procedure might not work on other platforms.
News: January 26, 2006: Sven Hartrumpf from the FernUniversität in
Hagen, Germany has checked and revised the entity annotations of the
German data. The new version is believed to be more accurate than
the previous one which was done by nonnative speakers. The files
associated with the new annotations can be found in the directory
ner/etc.2006
BUILDING THE UNANNOTATED DATA FILES
The unannotated data files can be built in the same way as the train
and test files. However, because of their size the annotation of these
files has been stored in separate tar files which you should fetch
first. Make sure that you have fetched and unpacked the main tar file
ner.tgz because that contains the software for building the files
with unannotated data. Here are the steps you should perform:
1. Extract the CoNLL-2003 files from the tar file available at
http://www.cnts.ua.ac.be/conll2003/ner/ner.tgz
(tar zxf ner.tgz)
2a (English data) Extract the annotation files for the unannotated
data from http://www.cnts.ua.ac.be/conll2003/ner/eng.raw.tar
(tar xf eng.raw.tar)
2b (German data) Extract the annotation files for the unannotated
data from http://www.cnts.ua.ac.be/conll2003/ner/deu.raw.tar
(tar xf deu.raw.tar)
3a (English data) Insert the first cd of the Reuters Corpus in
your computer and mount it (mount /mnt/cdrom)
3b (German data) Insert the ECI Multilingual Text cd in
your computer and mount it (mount /mnt/cdrom)
4. Run the relevant extraction software from the ner directory
English: cd ner; bin/make.eng.raw
German: cd ner; bin/make.deu.raw
This will generate the file eng.raw.gz (or deu.raw.gz) in the ner
directory. These files have been compressed with gzip.
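Because the raw files are gzip-compressed, they can be read without
unpacking them first. The following is a minimal Python sketch, not part
of the distribution. The function names are hypothetical and the
latin-1 encoding is an assumption; adjust it if your files differ.

```python
import gzip


def iter_raw_lines(path, encoding="latin-1"):
    """Yield the lines of a gzip-compressed data file, newline stripped.

    The default encoding is an assumption, not something the README states.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_tokens(path):
    """Count token lines: non-empty lines that are not -DOCSTART- markers."""
    return sum(
        1
        for line in iter_raw_lines(path)
        if line.strip() and not line.startswith("-DOCSTART-")
    )
```

For example, count_tokens("eng.raw.gz") would report the number of
tokens in the English raw file once it has been generated.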
These instructions assume that you work on a Linux work station and
that the cd files are available from the directory /mnt/cdrom . This
procedure might not work on other platforms.
DATA FORMAT
The data files contain one word per line. Empty lines have been used
for marking sentence boundaries and a line containing the keyword
-DOCSTART- has been added to the beginning of each article in order
to mark article boundaries. Each non-empty line contains the following
tokens:
1. the current word
2. the lemma of the word (German only)
3. the part-of-speech (POS) tag generated by a tagger
4. the chunk tag generated by a text chunker
5. the named entity tag given by human annotators
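The format above can be read with a few lines of code. The following is
a minimal Python sketch under stated assumptions: columns are separated
by whitespace, the English files have four columns and the German files
five (with the lemma in second position). The function name read_conll
and the sample lines are illustrative only.

```python
def read_conll(lines, german=False):
    """Group token lines into documents, each a list of sentences.

    Every token becomes a dict keyed by the column names listed above.
    """
    if german:
        columns = ["word", "lemma", "pos", "chunk", "ner"]
    else:
        columns = ["word", "pos", "chunk", "ner"]
    documents, sentence = [], []
    for line in lines:
        fields = line.split()
        if not fields:                        # empty line: sentence boundary
            if sentence:
                documents[-1].append(sentence)
                sentence = []
        elif fields[0] == "-DOCSTART-":       # article boundary
            if sentence:                      # flush an unterminated sentence
                documents[-1].append(sentence)
                sentence = []
            documents.append([])
        else:
            if not documents:                 # tolerate input without -DOCSTART-
                documents.append([])
            sentence.append(dict(zip(columns, fields)))
    if sentence:                              # flush the final sentence
        documents[-1].append(sentence)
    return documents


# Illustrative lines in the format described above
sample = [
    "-DOCSTART- -X- O O",
    "",
    "EU NNP I-NP I-ORG",
    "rejects VBZ I-VP O",
    "",
    "Peter NNP I-NP I-PER",
    "Blackburn NNP I-NP I-PER",
]
documents = read_conll(sample)
```

Here documents holds one article with two sentences, and each token can
be inspected by column name, for instance documents[0][0][0]["ner"].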
The tagger and chunker for English are roughly similar to the
ones used in the memory-based shallow parser demo available at
http://ilk.uvt.nl/ . German POS and chunk information has been
generated by the Treetagger from the University of Stuttgart:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
In order to simulate a real natural language processing
environment, the POS tags and chunk tags have not been checked.
This means that they will contain errors. If you have access to
annotation software with a performance that is superior to this,
you may replace these tags by yours.
The chunk tags and the named entity tags use the IOB1 format. This
means that in general words inside an entity receive the tag I-TYPE
to denote that they are Inside an entity of type TYPE. Whenever
two entities of the same type immediately follow each other, the
first word of the second entity will receive tag B-TYPE rather than
I-TYPE in order to show that a new entity starts at that word.
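The IOB1 rule above can be made concrete with a small decoder. The
following Python sketch is an illustration, not part of the
distribution: it recovers entity spans from a list of IOB1 tags and
rewrites them so that every entity starts with a B- tag (the iob2
variant that the chrep script also supports). The function names are
hypothetical.

```python
def iob1_spans(tags):
    """Return (start, end, type) spans, end exclusive, from IOB1 tags."""
    spans = []
    start = None       # index where the open entity began
    open_type = None   # type of the open entity, None if no entity is open
    for i, tag in enumerate(tags):
        if tag == "O":
            prefix, etype = "O", None
        else:
            prefix, etype = tag.split("-", 1)
        # The open entity ends at O, at a type change, or at an explicit B- tag.
        if open_type is not None and (etype != open_type or prefix == "B"):
            spans.append((start, i, open_type))
            start, open_type = None, None
        if etype is not None and open_type is None:
            start, open_type = i, etype
    if open_type is not None:
        spans.append((start, len(tags), open_type))
    return spans


def iob1_to_iob2(tags):
    """Rewrite IOB1 tags so that every entity begins with B-TYPE."""
    out = ["O"] * len(tags)
    for start, end, etype in iob1_spans(tags):
        out[start] = "B-" + etype
        for i in range(start + 1, end):
            out[i] = "I-" + etype
    return out
```

With the tags I-PER I-PER B-PER, the decoder returns two PER entities:
the B-PER tag is what separates the second entity from the first.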
The raw data has the same format as the training and test material
but the final column has been omitted. There are word lists for
English (extracted from the training data), German (extracted from
the training data), and Dutch in the directory lists. Probably you
can use the Dutch person names (PER) for English data as well. Feel
free to use any other external data sources that you might have
access to.
GOALS
In the CoNLL-2002 shared task we have worked on named entity
recognition as well (Spanish and Dutch). The CoNLL-2003 shared
task deals with two different languages (English and German).
In addition, we supply extra resources: lists of named
entities and non-annotated data. One of the main tasks of
the participants in the CoNLL-2003 shared task will be to find
out how these additional resources can be used to improve the
performance of their system.
BASELINE
The baseline performance for this shared task is assigning named
entity classes to word sequences that occur in the training data.
It can be computed as follows (example for English development data):
bin/baseline eng.train eng.testa | bin/conlleval
and the results are:
eng.testa: precision: 78.33%; recall: 65.23%; FB1: 71.18
eng.testb: precision: 71.91%; recall: 50.90%; FB1: 59.61
deu.testa: precision: 37.19%; recall: 26.07%; FB1: 30.65
deu.testb: precision: 31.86%; recall: 28.89%; FB1: 30.30
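FB1 is the F-measure with beta=1, that is, the harmonic mean of
precision and recall. The following Python sketch shows the arithmetic
so that the figures above can be checked by hand. The function name
f_beta is hypothetical; conlleval computes the value itself.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure of precision and recall (both given as percentages)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall figures copied from the table above
results = {
    "eng.testa": (78.33, 65.23),
    "eng.testb": (71.91, 50.90),
    "deu.testa": (37.19, 26.07),
    "deu.testb": (31.86, 28.89),
}
for name, (precision, recall) in results.items():
    print("%s: FB1: %.2f" % (name, f_beta(precision, recall)))
```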
If you build a system for this task, it should at least improve on
the performance of this baseline system.
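The idea behind the baseline can be sketched in a few lines of Python.
This is a simplified illustration, not a port of bin/baseline: it
collects the entity phrases of the training data, then tags the test
words greedily from left to right with the longest known phrase and
its most frequent class. It ignores the -u and nbr options and is not
guaranteed to reproduce the figures above. All function names are
hypothetical.

```python
from collections import Counter, defaultdict


def train_phrases(sentences):
    """Map each entity phrase (tuple of words) to a Counter of its types.

    sentences: list of sentences, each a list of (word, IOB1 tag) pairs.
    """
    phrases = defaultdict(Counter)
    for sentence in sentences:
        words, etype = [], None
        # The sentinel ("", "O") flushes an entity that ends the sentence.
        for word, tag in list(sentence) + [("", "O")]:
            new_type = None if tag == "O" else tag.split("-", 1)[1]
            if words and (new_type != etype or tag.startswith("B-")):
                phrases[tuple(words)][etype] += 1
                words, etype = [], None
            if new_type is not None:
                words.append(word)
                etype = new_type
    return phrases


def tag_sentence(words, phrases):
    """Greedy longest-match tagging; first word gets B-, the rest I-."""
    longest = max((len(p) for p in phrases), default=0)
    tags, i = [], 0
    while i < len(words):
        match = None
        for n in range(min(longest, len(words) - i), 0, -1):  # longest first
            candidate = tuple(words[i:i + n])
            if candidate in phrases:
                match = candidate
                break
        if match is None:
            tags.append("O")
            i += 1
        else:
            etype = phrases[match].most_common(1)[0][0]
            tags.extend(["B-" + etype] + ["I-" + etype] * (len(match) - 1))
            i += len(match)
    return tags
```

A phrase that never occurs as an entity in the training data is always
tagged O, which is one reason why recall is lower than precision for
such a lookup approach.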
Antwerp, April 23, 2003
Erik Tjong Kim Sang <erik.tjongkimsang@ua.ac.be>
Fien De Meulder <fien.demeulder@ua.ac.be>
# CONLL-2003: List of tags with associated categories of names
# Based on: Nancy Chinchor, Erica Brown, Lisa Ferro, Patty Robinson,
# "1999 Named Entity Task Definition". MITRE and SAIC, 1999.
# 2003 Fien De Meulder
Locations: roads (streets, motorways)
trajectories
regions (villages, towns, cities, provinces, countries, continents,
dioceses, parishes)
structures (bridges, ports, dams)
natural locations (mountains, mountain ranges, woods,
rivers, wells, fields, valleys, gardens,
nature reserves, allotments, beaches,
national parks)
public places (squares, opera houses, museums, schools,
markets, airports, stations, swimming pools,
hospitals, sports facilities, youth centers,
parks, town halls, theaters, cinemas, galleries,
camping grounds, NASA launch pads, club
houses, universities, libraries, churches,
medical centers, parking lots, playgrounds,
cemeteries)
commercial places (chemists, pubs, restaurants, depots,
hostels, hotels, industrial parks,
nightclubs, music venues)
assorted buildings (houses, monasteries, creches, mills,
army barracks, castles, retirement
homes, towers, halls, rooms, vicarages,
courtyards)
abstract ``places'' (e.g. {\it the free world})
Miscellaneous: words of which one part is a location, organisation,
miscellaneous, or person
adjectives and other words derived from a word
which is location, organisation, miscellaneous, or
person
religions
political ideologies
nationalities
languages
programs
events (conferences, festivals, sports competitions,
forums, parties, concerts)
wars
sports related names (league tables, leagues, cups)
titles (books, songs, films, stories, albums, musicals,
TV programs)
slogans
eras in time
types (not brands) of objects (car types, planes,
motorbikes)
Organizations: companies (press agencies, studios, banks, stock
markets, manufacturers, cooperatives)
subdivisions of companies (newsrooms)
brands
political movements (political parties, terrorist
organisations)
government bodies (ministries, councils, courts, political unions
of countries (e.g. the {\it U.N.}))
publications (magazines, newspapers, journals)
musical companies (bands, choirs, opera companies, orchestras)
public organisations (schools, universities, charities)
other collections of people (sports clubs, sports
teams, associations, theater companies,
religious orders, youth organisations)
Persons: first, middle and last names of people, animals and fictional
characters
aliases
20110624 updates:
the scripts make.eng and xml2txt.eng have been updated in order to
allow for a successful generation of the English data files in 2011
Erik TKS <erikt(at)xs4all.nl>
#!/usr/bin/perl -w
# baseline: compute a baseline classification for named entities
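# usage: baseline [-d] [-u] [nbr] train test
# notes: option -d: print debug output for each candidate phrase
# option -u: only classify entities with unique class in train
# nbr: minimum number of training occurrences of a phrase's best class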
# method used: only tag phrases present in training data
# greedy search: tag longest possible phrases
# train and test are supposed to be in
# CoNLL-2002 format
# url: http://lcg-www.uia.ac.be/conll2002/ner/
# 20020524 erikt@uia.ua.ac.be
use strict;
my (
$i,$j,$k,
$ambiguous,$bestCat,$bestCatNbr,$buffer,$bufferType,$debug,
$key,$line,$onlyUniq,$tag,$test,$train,$type,$uniqNbr,$word,
@classes,@test,@words,
%hash, # hash of hashes for categories of word sequences
%outWords # hash of words that appear outside of entities
);
$onlyUniq = 0;
$uniqNbr = 0;
$debug = 0;
if (defined $ARGV[0] and $ARGV[0] eq "-d") {
$debug = 1;
shift(@ARGV);
}
if (defined $ARGV[0] and $ARGV[0] eq "-u") {
$onlyUniq = 1;
shift(@ARGV);
}
if (defined $ARGV[0] and $ARGV[0] =~ /^[0-9]+$/) {
$uniqNbr = shift(@ARGV);
}
if ($#ARGV != 1) { die "usage: baseline [-d] [-u] [nbr] train test\n"; }
$train = shift(@ARGV);
$test = shift(@ARGV);
# read train file
$buffer = "";
$bufferType = "";
%hash = ();
open(INFILE,$train) or die "cannot open $train\n";
while (<INFILE>) {
$line = $_;
chomp($line);
$line = "-X- O" if ($line =~ /^\s*$/);
@words = split(/\s+/,$line);
$word = shift(@words); # word is first item on line
$tag = pop(@words); # tag is last item on line
if ($tag eq "O") { $outWords{$word} = 1; }
$type = $tag;
$type =~ s/^.*-//;
# if previous tagged phrase is complete
if ($buffer and
($type eq "O" or $type ne $bufferType or $tag =~ /^B/)) {
if (not defined $hash{$buffer}{$bufferType}) {
$hash{$buffer}{$bufferType} = 1;
} else { $hash{$buffer}{$bufferType}++; }
@words = split(/\s+/,$buffer);
pop(@words);
# store all prefixes of entity in hash with tag PREFIX
while (@words) {
$line = join(" ",@words);
if (not defined $hash{$line}{"PREFIX"}) {
$hash{$line}{"PREFIX"} = 1;
} else { $hash{$line}{"PREFIX"}++; }
pop(@words);
}
$buffer = "";
$bufferType = "";
}
# append current word to buffer if we are processing a tagged phrase
if ($tag ne "O") {
$buffer = $buffer ? "$buffer $word" : $word;
$bufferType = $bufferType ? $bufferType : $type;
}
}
if ($buffer) {
if (not defined $hash{$buffer}{$bufferType}) {
$hash{$buffer}{$bufferType} = 1;
} else { $hash{$buffer}{$bufferType}++; }
@words = split(/\s+/,$buffer);
pop(@words);
# store all prefixes of entity in hash with tag PREFIX
while (@words) {
$line = join(" ",@words);
if (not defined $hash{$line}{"PREFIX"}) {
$hash{$line}{"PREFIX"} = 1;
} else { $hash{$line}{"PREFIX"}++; }
pop(@words);
}
}
close(INFILE);
# read test file
@test = ();
open(INFILE,$test) or die "cannot open $test\n";
while (<INFILE>) {
$line = $_;
chomp($line);
push(@test,$line);
}
close(INFILE);
# assign entity tags to test file
$i = 0;
LOOP: while ($i<=$#test) {
if (not $test[$i]) { print "\n"; $i++; next LOOP; }
@words = split(/\s+/,$test[$i]);
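# note: defined(%hash) is no longer accepted by recent Perl versions,
# so test for a missing or empty entry instead
if (not $hash{$words[0]} or not %{$hash{$words[0]}}) {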
print "$test[$i] O\n";
$i++;
} else {
$j = 0;
$buffer = "$words[0]";
# add words to phrase while we are in a phrase prefix and
# the next word exists and is not a line break
while (defined $hash{$buffer}{"PREFIX"} and
$i+$j < $#test and $test[$i+$j+1]) {
$j++;
@words = split(/\s+/,$test[$i+$j]);
$buffer .= " $words[0]";
}
# remove words from entity
@classes = defined $hash{$buffer} ? %{$hash{$buffer}}: ();
# note: classes always contains pairs tag/amount
# remove words from phrase while current phrase is nonempty and
# does not contain a phrase or is only a prefix
while ($buffer and
($#classes < 0 or
($#classes == 1 and defined $hash{$buffer}{"PREFIX"})) or
($onlyUniq and
($#classes > 3 or
($#classes > 1 and not defined $hash{$buffer}{"PREFIX"})))) {
$j--;
@words = split(/\s+/,$buffer);
pop(@words);
$buffer = join(" ",@words);
@classes = defined $hash{$buffer} ? %{$hash{$buffer}}: ();
}
if ($debug) {
# show phrase with possible classification and nbr of examples
print ">>> $#classes $buffer ";
foreach $i (@classes) { print "# $i "; }
print "\n";
}
# if no complete entity was found
if (not $buffer) {
print "$test[$i] O\n";
$i++;
next LOOP;
}
# get category
$bestCat = "UNDEF";
$bestCatNbr = 0;
foreach $key (sort keys %{$hash{$buffer}}) {
if ($key ne "PREFIX" and $hash{$buffer}{$key} > $bestCatNbr) {
$bestCatNbr = $hash{$buffer}{$key};
$bestCat = $key;
}
}
# does the phrase occur frequently enough in the training data?
if ($bestCatNbr < $uniqNbr) {
print "$test[$i] O\n";
$i++;
next LOOP;
}
for ($k=$i;$k<=$i+$j;$k++) {
if ($k == $i) { print "$test[$k] B-$bestCat\n"; }
else { print "$test[$k] I-$bestCat\n"; }
}
$i += $j+1;
}
}
exit(0);
#!/usr/bin/perl -w
# chrep: change the representation format of IOBE tags
# notes: the tags are assumed to be the last item on a line
# there are seven formats:
# - iob1: standard RM95
# - iob2: RM95 plus B tag at every chunk start
# - ioe1: replaces RM95 IB sequences by EI
# - ioe2: as ioe1 but with all end of chunks marked with E
# - io: as iob1 but with all B's replaced by I's
# - openb: mark chunk-initial word with [, rest with .
# - closeb: mark chunk-final word with ], rest with .
# usage: chrep [iob1|iob2|ioe1|ioe2|io|openb|closeb] < file
# 981125 erikt@uia.ua.ac.be
$false = 0;
$sep = " ";
$true = 1;
$usage = "usage: chrep [iob1|iob2|ioe1|ioe2|io|openb|closeb] < file";
sub endOfChunk {
local($prevTag) = shift(@_);
local($tag) = shift(@_);
local($prevType) = shift(@_);
local($type) = shift(@_);
local($chunkEnd) = $false;
if ( $prevTag eq "B" && $tag eq "B" ) { $chunkEnd = $true; }
if ( $prevTag eq "B" && $tag eq "O" ) { $chunkEnd = $true; }
if ( $prevTag eq "I" && $tag eq "B" ) { $chunkEnd = $true; }
if ( $prevTag eq "I" && $tag eq "O" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" && $tag eq "E" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" && $tag eq "I" ) { $chunkEnd = $true; }
if ( $prevTag eq "E" && $tag eq "O" ) { $chunkEnd = $true; }
if ( $prevTag eq "I" && $tag eq "O" ) { $chunkEnd = $true; }
if ($prevTag ne "O" && $prevType ne $type) { $chunkEnd = $true; }
$chunkEnd;
}
sub startOfChunk {
local($prevTag) = shift(@_);
local($tag) = shift(@_);
local($prevType) = shift(@_);
local($type) = shift(@_);
local($chunkStart) = $false;
if ( $prevTag eq "B" && $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "I" && $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "O" && $tag eq "B" ) { $chunkStart = $true; }
if ( $prevTag eq "O" && $tag eq "I" ) { $chunkStart = $true; }
if ( $prevTag eq "E" && $tag eq "E" ) { $chunkStart = $true; }
if ( $prevTag eq "E" && $tag eq "I" ) { $chunkStart = $true; }
if ( $prevTag eq "O" && $tag eq "E" ) { $chunkStart = $true; }
if ( $prevTag eq "O" && $tag eq "I" ) { $chunkStart = $true; }
if ($tag ne "O" && $prevType ne $type) { $chunkStart = $true; }
$chunkStart;
}
sub iob1 {
local($prevTag) = shift(@_);
local($tag) = shift(@_);
local($prevType) = shift(@_);
local($type) = shift(@_);
local($newTag) = $tag;
if ( &startOfChunk($prevTag,$tag,$prevType,$type) &&
&endOfChunk($prevTag,$tag,$prevType,$type) &&
$prevType eq $type) {
$newTag = "B";
} elsif ( $tag ne "O" ) {
$newTag = "I";
}
$newTag;
}
sub iob2 {
local($prevTag) = shift(@_);
local($tag) = shift(@_);
local($prevType) = shift(@_);
local($type) = shift(@_);
local($newTag) = $tag;
if ( &startOfChunk($prevTag,$tag,$prevType,$type) ) {
$newTag = "B";
} elsif ( $tag ne "O" ) {
$newTag = "I";
}
$newTag;
}
sub ioe1 {
local($prevTag) = shift(@_);
local($tag) = shift(@_);
local($prevType) = shift(@_);
local($type) = shift(@_);
local($newTag) = $prevTag;
if ( &startOfChunk($prevTag,$tag,$prevType,$type) &&
&endOfChunk($prevTag,$tag,$prevType,$type) &&
$prevType eq $type) {
$newTag = "E";
} elsif ( $prevTag ne "O" ) {
$newTag = "I";
}
$newTag;
}
sub ioe2 {
local($prevTag) = shift(@_);
local($tag) = shift(@_);
local($prevType) = shift(@_);
local($type) = shift(@_);
local($newTag) = $prevTag;
if ( &endOfChunk($prevTag,$tag,$prevType,$type) ) {
$newTag = "E";
} elsif ( $prevTag ne "O" ) {
$newTag = "I";
}
$newTag;
}
sub openb {
local($prevTag) = shift(@_);