See **ALLIES_evaluation_plan_V0.pdf** in this repository.
![The ALLIES lifelong learning framework](./allies_baseline.png)
#### 1 Incremental Diarization across time
The task of Diarization across time aims at evaluating automatic systems across time. Systems have to process a sequence of TV and radio shows in chronological order, without being allowed to come back and modify the annotations of any already processed show.
Systems can update their models using any audio data sent in, to improve acoustic models or speaker representations for instance, in order to generate a new version of them that will be used to process the next shows.
See the evaluation plan for more details.
#### 2 Lifelong learning speaker diarization
The protocol is similar to the one in *Diarization across time*, except that the system is allowed to perform *active* and *interactive* learning to improve the quality of its output or to adapt its models.
To enable fair and reproducible evaluation of systems, a simulation of a human operator is released to answer the questions of the automatic systems.
The use of the human simulation requires access to the test references. Participants are advised not to use these references in their systems by mistake.
The user simulation that is used to evaluate human assisted learning allows for two types of actions: active or interactive learning.
### The metrics
Hypotheses generated by the systems will be evaluated using classical metrics (DER, JER) but also with metrics specifically developed for the two tasks.
#### DER Across time
Evaluation-wise, validated speakers are associated with same-name
Two datasets are available:
that will be evaluated. All information for the files of this dataset are available to the User Simulation to answer the questions.
Iterate over the dataset by using the following code:
```python
for idx, (show, file_info, uem, ref, filename) in enumerate(lifelong_data):
    # execute your code here for the current show
    pass
```
Where:
* *show* is the **ID** of the show to process (given as a `string`)
* *file_info* is a `FileInfo` object that includes all information required to process the current file
* *uem* is a `UEM` object that describes the part of the audio signal that is annotated
* *ref* is a `Reference` object that contains the ground truth annotation to be given to the Human simulation
* *filename* is the name of the audio file to process
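
A minimal per-show skeleton of the same loop is sketched below, assuming the attribute names described in the Evallies API section further down (`uem.start_time`, `uem.end_time`, `file_info.supervision`); the diarization step itself is left as a placeholder.

```python
for idx, (show, file_info, uem, ref, filename) in enumerate(lifelong_data):
    # Regions of the signal that are annotated (and evaluated) for this show
    annotated_regions = list(zip(uem.start_time, uem.end_time))

    # Run your diarization system on `filename` and build the hypothesis here
    hypothesis = None  # placeholder for a Reference-like hypothesis

    # The supervision flag drives the human assisted learning mode (see below)
    if file_info.supervision in ("active", "interactive"):
        pass  # interact with the user simulation, as described in the next sections
```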
## How to interact with the User Simulation, input and output formats
This section describes:
1. The input and output file formats
2. The different types of possible interactions between the human simulation and the system
3. The process required to interact with the user simulation
4. The API provided to handle this interaction
### Input and Output formats
Inputs consist of a list of file triplets: **(audio, mdtm, uem)**
Formats of those files are described below.
As an output, participants are required to send an archive including one single **MDTM** file for each input triplet. The **MDTM** files must be named after the show, with a **.mdtm** extension.
Example of input and output file names:
* Input:
* (`BFMTV_PlaneteShowbiz_2011-11-11_065040.wav`,
* `BFMTV_PlaneteShowbiz_2011-11-11_065040.mdtm`,
* `BFMTV_PlaneteShowbiz_2011-11-11_065040.uem`)
* Expected output: `BFMTV_PlaneteShowbiz_2011-11-11_065040.mdtm`
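
As a sketch of how the outputs could be packaged (the archive format, show ID and archive name below are only illustrative; the text above only mandates one `<show>.mdtm` file per input triplet):

```python
import zipfile

# hypotheses maps each show ID to the text content of its MDTM file
# (see the MDTM format description below); values here are illustrative.
hypotheses = {"BFMTV_PlaneteShowbiz_2011-11-11_065040": "..."}

with zipfile.ZipFile("submission.zip", "w") as archive:
    for show, mdtm_text in hypotheses.items():
        # One single MDTM file per input triplet, named after the show
        archive.writestr(f"{show}.mdtm", mdtm_text)
```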
#### Audio format description
Audio files are provided in WAV format, encoded as 16-bit PCM at 16 kHz.
#### UEM format description
Audio files are not fully annotated. The range of audio that is annotated (and thus evaluated) for each **WAV** file is given in **UEM**.
The **UEM** format is a text format with four space-separated columns:
* File name without the extension
* Channel number (always 1)
* Start time of zone to diarize (in seconds)
* End time of zone to diarize (in seconds)
Example extract:
```
TV8_LaPlaceDuVillage_2011-03-14_172834 1 476.920 493.571
TV8_LaPlaceDuVillage_2011-03-14_172834 1 492.927 495.556
```
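
If you need to read UEM files yourself, a minimal parser could look like the sketch below; the evallies package already exposes this information through `UEM` objects, so this is only illustrative.

```python
def read_uem(path):
    """Parse a UEM file into a dict mapping show ID to a list of (start, end) ranges."""
    ranges = {}
    with open(path) as uem_file:
        for line in uem_file:
            if not line.strip():
                continue
            show, _channel, start, end = line.split()
            ranges.setdefault(show, []).append((float(start), float(end)))
    return ranges
```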
#### MDTM format description
The MDTM format is used to provide the reference (ground truth) required for evaluation and for human-in-the-loop simulation.
This format is also the one to use for submission of the system hypotheses. It is a space-separated format, with eight columns:
* File name without the extension
* Channel number (always 1)
* Start time of the speaker turn (in seconds)
* Duration of the speaker turn, in seconds (beware, this is a duration, not an end time)
* Event type (always "speaker")
* Event subtype (always "na")
* Gender ("adult_male" or "adult_female", "unknown" for hypothese, not evaluated in any case)
......@@ -209,64 +231,136 @@ The MDTM format is a format describing the reference or an hypothesis for the sp
In the references, the speaker ID is the speaker name in the form "Firstname\_LASTNAME"; in the hypotheses it is a unique, space-less identifier per speaker.
Example extract:
```
TV8_LaPlaceDuVillage_2011-03-14_172834 1 407.621 15.040 speaker na adult_male Michel_THABUIS
TV8_LaPlaceDuVillage_2011-03-14_172834 1 422.661 18.148 speaker na adult_male Philippe_DEPARIS
TV8_LaPlaceDuVillage_2011-03-14_172834 1 440.809 30.357 speaker na adult_male Michel_THABUIS
TV8_LaPlaceDuVillage_2011-03-14_172834 1 471.666 6.730 speaker na adult_male Philippe_DEPARIS
```
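
As an illustration, the sketch below writes one hypothesis to a `<show>.mdtm` file following the eight columns described above; the hypothesis is assumed to be given as parallel lists of speakers, start times and end times (as in a `Reference` object), and the helper name is not part of the evallies API.

```python
def write_mdtm(show, speakers, start_times, end_times, out_dir="."):
    """Write a system hypothesis as an MDTM file named <show>.mdtm (illustrative sketch)."""
    lines = []
    for spk, start, end in zip(speakers, start_times, end_times):
        duration = end - start  # MDTM stores a duration, not an end time
        # Gender is set to "unknown" for hypotheses, as it is not evaluated
        lines.append(f"{show} 1 {start:.3f} {duration:.3f} speaker na unknown {spk}")
    with open(f"{out_dir}/{show}.mdtm", "w") as mdtm_file:
        mdtm_file.write("\n".join(lines) + "\n")
```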
### Type of possible interactions
Each file from the lifelong learning dataset comes with a flag stored in the __file_info__ variable and named __supervision__, which specifies the mode of human assisted learning for this file. The mode is given by the value of a *string* object that can be:
* __active__ the system is allowed to ask questions to the human in the loop;
* __interactive__ once the system produces a first hypothesis, the human in the loop provides corrections to the system to improve the hypothesis;
* __none__ Human assisted learning is OFF for this file. The system can still adapt the model in an unsupervised manner.
While processing an audio file, the system can perform unsupervised learning and go through the Human Assisted Learning process if the supervision mode is either __active__ or __interactive__.
For __active__ mode, your system is free to initiate the interaction with the `UserSimulation` by sending it a `MessageToUser`.
For __interactive__ mode, the communication is also initiated by the system so that a system not taking into account this mode of interaction can just ignore this step.
For systems developed to interact this way, a fake `MessageToUser` needs to be sent to the `UserSimulation` as long as the system decides to take into account the corrections from the `UserSimulation`. As an answer to this fake message, the `UserSimulation` will then return an `Answer`.
### Send a message to the human in the loop:
In order to interact with the human simulation, an automatic system must send a **message** to which the human simulation will reply by returning an **answer**.

A message is sent to the human in the loop as an object of type `MessageToUser`.
The message is the question to be sent to the user simulation; it includes an object of type `Request` together with the ID of the file of interest and the current hypothesis, given as a `Reference` object.
```python
message_to_user = {
    "file_info": file_info,          # FileInfo object of the file the question is related to
    "hypothesis": hypothesis,        # the current hypothesis, as a Reference object
    "system_request": request        # the question for the human in the loop, as a Request object
}
# Send the request to the user simulation and receive the answer
human_assisted_learning, user_answer = user.validate(message_to_user)
```
* `file_info` is the `FileInfo` object obtained from the `database` with each file
* `hypothesis` is the current system hypothesis as a `Reference` object
* `request` is a `Request` object that contains the question for the human in the loop
The user simulation returns two objects:
* `human_assisted_learning`, a boolean: True if the system can ask more questions, False otherwise
* `user_answer`, the answer of the user simulation, given as an `Answer` object (described in the Evallies API section below)
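
Putting these pieces together, a typical interaction loop might look like the following sketch; `build_next_request` is a hypothetical placeholder for your own question-selection logic, and the way `user_answer` is used to update the hypothesis is entirely system-specific.

```python
def build_next_request(hypothesis):
    # Placeholder: your system decides what to ask next
    # (here the fake request used to initiate the interaction)
    return {"request_type": "toto", "time_1": 0.0, "time_2": 0.0}

human_assisted_learning = True
while human_assisted_learning:
    request = build_next_request(hypothesis)
    message_to_user = {
        "file_info": file_info,          # FileInfo of the current show
        "hypothesis": hypothesis,        # current hypothesis as a Reference object
        "system_request": request        # the question for the human in the loop
    }
    # Send the message and receive the stop flag and the user's answer
    human_assisted_learning, user_answer = user.validate(message_to_user)
    if human_assisted_learning:
        pass  # update the current hypothesis according to user_answer
```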
### Evallies API
This section describes the different objects provided as part of the evallies API.
#### FileInfo
Class of object that is used to describe an input file, together with information about its processing. Database objects return a `FileInfo` for each show in the sequence.
`FileInfo` contains three elements:
* `file_id` the unique ID of the current file
* `supervision` a `string` that indicates the type of supervision enabled for the current file (can be __active__, __interactive__ or __none__)
* `time_stamp` the date of the current show
#### Reference
Class used to exchange hypotheses of the automatic system and the ground truth reference.
After processing a file, your system exchanges with the user simulation through a **Reference** object.
A `Reference` object contains:
* `speaker` the list of speaker IDs, one per segment
* `start_time` the list of start times for each speech segment (in seconds)
* `end_time` the list of end times for each speech segment (in seconds)
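
Since a `Reference` stores its annotation as three parallel lists, segments can be iterated by zipping them; a small illustrative helper (not part of the evallies API):

```python
def reference_segments(ref):
    """Return (speaker, start, end) tuples from a Reference-like object (illustrative)."""
    return list(zip(ref.speaker, ref.start_time, ref.end_time))
```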
#### UEM
Class of object used to describe the part of an audio file that is annotated.
A `UEM` object contains:
* `start_time` the list of start times for each annotated segment (in seconds)
* `end_time` the list of end times for each annotated segment (in seconds)
#### Request
Class of objects used to ask questions to the user simulation.
A `Request` object contains:
* `request_type` the type of question asked to the user, which can be:
  * **same** to check whether it is the same speaker that speaks at two points in time
  * **boundary** to ask for the boundaries of a speech segment
  * **name** to ask the name of the speaker speaking at a specific time
* `time_1` one of the two points in time to compare for a *same* request; for *boundary* and *name* requests it is the time of the segment on which the question is asked
* `time_2` only useful for *same* requests
* `second_show` show ID of the second show to compare with, in case of a *same* request involving several shows
The code below shows how to create a fake request that can be used to initiate the interaction with the user simulation:
```python
# Create a fake request that is used to initiate interactive learning
# For the case of active learning, this request is overwritten by your system
request = {"request_type": "toto", "time_1": 0.0, "time_2": 0.0}
```
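
Beyond this fake request, a real question could for instance be a *same* request comparing two points in time (the times below are arbitrary):

```python
# Ask whether the speaker at 12.3 s and the speaker at 456.7 s are the same person
request = {"request_type": "same", "time_1": 12.3, "time_2": 456.7}
```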
#### MessageToUser
Object used to ask a question to the human simulation.
A `MessageToUser` object contains:
* `file_info` a `FileInfo` object describing the show the question refers to
* `hypothesis` a `Reference` object, the hypothesis generated by the system
* `system_request` a `Request` object that describes the question for the user simulation
#### Answer
A class that contains the answer from the `UserSimulation` for the automatic system.
* `answer` the answer, given as a boolean
* `response_type` the type of question the `UserSimulation` is answering to; can be:
  * **same** whether it is the same speaker that speaks at two points in time
  * **boundary** the boundaries of a speech segment
  * **name** the name of the speaker speaking at a specific time
* `time_1` start boundary in case of a *boundary* request, `time_1` of the *same* question
* `time_2` end boundary in case of a *boundary* request, `time_2` of the *same* question
* `second_show` show ID of the second show to compare with, in case of a *same* request involving several shows
* `name` ID of the speaker in case of a *name* request
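
As an illustration of how an `Answer` might be consumed (the update logic itself is entirely up to the system), assuming the attribute names listed above:

```python
def handle_answer(answer):
    """Sketch of dispatching on the answer returned by the user simulation (illustrative)."""
    if answer.response_type == "same":
        if answer.answer:
            pass  # same speaker at both times: consider merging the two clusters
        else:
            pass  # different speakers: consider keeping the clusters apart
    elif answer.response_type == "boundary":
        pass  # use answer.time_1 and answer.time_2 as corrected segment boundaries
    elif answer.response_type == "name":
        pass  # rename the corresponding cluster with answer.name
```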