    Catalog Reference: ELRA-M0052
    EnToFrNE - a Parallel English-French Lexicon of Named Entities
    In any text document, there are particular terms that represent specific entities that are more informative and have a unique context. These entities are known as named entities, which more specifically refer to terms that represent real-world objects like people, places, organizations, and so on. They are often denoted by proper names and can be abstract or have a physical existence. Examples of named entities include: United States of America, Paris, Google, Mercedes Benz, Microsoft Windows, or anything else that can be named.

    Certain natural terms like biological species and substances, which are sometimes considered named entities, are not included in the lexicon.

    The lexicon consists of 1,167,263 parallel named entities in English and French.

    Classification
    Named entities in the lexicon are tagged. The tags used are: PERSON, ORGANIZATION, LOCATION, PRODUCT and MISC. Each named entity belongs to one of these classes. The classes comprise:
    PERSON: humans, gods, saints, fictional characters;
    ORGANIZATION: political organizations, companies, schools, rock bands, sport teams;
    LOCATION: geographical terms, fictional places, cosmic terms;
    PRODUCT: industrial products, software products, weapons, art works, documents, concepts, standards, laws, formats, anthems, algorithms, journals, coats of arms, platforms, websites;
    MISC: events, languages, peoples, tribes, alliances, orders, scientific discoveries, theories, titles, currencies, holidays, dynasties, positions, projects, historical periods, battles, competitions, diseases, breeds, programs, sets of locations, awards, musical genres, missions, artistic directions, sets of organizations, networks.

    There are 1,167,263 entries in the lexicon. At least one tag is assigned to each one of them. The distribution of tags is as follows:

    PERSON: 387,676
    ORGANIZATION: 107,865
    LOCATION: 309,533
    PRODUCT: 149,137
    MISC: 247,655

    The total number of tags, 1,201,866, is slightly higher than the number of entries because some named entities belong to more than one class. For example, Tom Sawyer is tagged as both PRODUCT (the title of the novel) and PERSON (the character from the novel).
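
    The arithmetic behind these totals can be checked with a short sketch. The counts are copied from the distribution above; the variable names are illustrative only.

```python
# Tag counts per class, copied from the distribution above.
tag_counts = {
    "PERSON": 387_676,
    "ORGANIZATION": 107_865,
    "LOCATION": 309_533,
    "PRODUCT": 149_137,
    "MISC": 247_655,
}
total_entries = 1_167_263

total_tags = sum(tag_counts.values())

# Tags beyond the first one on each entry. This equals the number of
# multi-tagged entries only if no entry carries more than two tags,
# which the description does not state.
extra_tags = total_tags - total_entries

print(total_tags)  # 1201866
print(extra_tags)  # 34603
```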

    Evaluation
    To evaluate the tagging, two metrics common in information retrieval have been used: precision and recall. Precision is the percentage of assigned tags that are correct. Recall is the percentage of all relevant tags that the algorithm assigned correctly.
    An alternative to having two measures is the F-measure, which combines precision and recall into a single performance measure. The variant used here is the F1-score, which is simply the harmonic mean of precision and recall.
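
    As a minimal sketch of these definitions, the three metrics can be computed from a set of manually assigned tags and a set of automatically assigned tags. The function name and the toy data below are invented for illustration; they are not taken from the lexicon's evaluation sample or from the tagging algorithm itself.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for two sets of (entity, tag) pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = {
    ("Tom Sawyer", "PERSON"),
    ("Tom Sawyer", "PRODUCT"),
    ("Paris", "LOCATION"),
    ("Google", "ORGANIZATION"),
}
predicted = {
    ("Tom Sawyer", "PERSON"),
    ("Paris", "LOCATION"),
    ("Google", "PRODUCT"),  # wrong tag, counts against precision
}

p, r, f1 = precision_recall_f1(gold, predicted)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 0.5 0.57
```

    In this toy case two of the three predicted tags are correct (precision 2/3), and two of the four manual tags are recovered (recall 1/2).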

    In order to evaluate the tagging, a random sample of 1,000 entries has been extracted from the lexicon. The entries in the sample have been tagged manually and then compared to the tagging performed by the algorithm. The following table shows the results:

    Class         Precision  Recall  F1-score
    NE            0.99       0.91    0.95
    PERSON        0.99       0.97    0.98
    ORGANIZATION  0.94       0.87    0.90
    LOCATION      0.98       0.92    0.95
    PRODUCT       0.96       0.83    0.89
    MISC          0.96       0.83    0.89
    The first row (NE) refers to named entity recognition, regardless of the class. The precision of tagging is between 0.94 for ORGANIZATION and 0.99 for PERSON. The recall is slightly lower, from 0.83 for PRODUCT and MISC to 0.97 for PERSON. The higher values of precision show that the tagging algorithm was adjusted to tag the named entities correctly, rather than to extract more named entities for the lexicon.
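
    The F1-score column can be reproduced from the reported precision and recall values. The sketch below recomputes the harmonic mean for each row; the numbers are copied from the table, and the helper name f1_score is an assumption made for this example.

```python
# (precision, recall, reported F1) per row, copied from the results table.
reported = {
    "NE": (0.99, 0.91, 0.95),
    "PERSON": (0.99, 0.97, 0.98),
    "ORGANIZATION": (0.94, 0.87, 0.90),
    "LOCATION": (0.98, 0.92, 0.95),
    "PRODUCT": (0.96, 0.83, 0.89),
    "MISC": (0.96, 0.83, 0.89),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    label: round(f1_score(p, r), 2) for label, (p, r, _) in reported.items()
}

for label, (p, r, f1) in reported.items():
    print(f"{label:<13} reported={f1:.2f} recomputed={recomputed[label]:.2f}")
```

    Rounded to two decimals, each recomputed value agrees with the reported F1-score.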

    Formats
    The lexicon comes in two formats: CSV and XML.
    The first row in the CSV file is a header row, and the tab character is used as the field separator. The column titles are: en, fr, PERSON, ORGANIZATION, LOCATION, PRODUCT and MISC. The following rows contain the data: the English name, the French name and five digits, each 0 or 1, indicating which classes the named entity belongs to. For example,

    en           fr            PERSON  ORGANIZATION  LOCATION  PRODUCT  MISC
    Linus's Law  Loi de Linus  0       0             0         1        0

    means that the named entity Linus's Law is tagged as PRODUCT, since the column PRODUCT contains 1. All other classes contain 0’s.
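
    A minimal sketch of reading this layout with Python's standard csv module follows. It parses an in-memory sample built from the example row; the function name read_lexicon is hypothetical, and disabling quote handling is an assumption, since the description does not say whether fields are ever quoted.

```python
import csv
import io

CLASSES = ["PERSON", "ORGANIZATION", "LOCATION", "PRODUCT", "MISC"]

# In-memory sample in the layout described above: header row, tab-separated.
sample = (
    "en\tfr\tPERSON\tORGANIZATION\tLOCATION\tPRODUCT\tMISC\n"
    "Linus's Law\tLoi de Linus\t0\t0\t0\t1\t0\n"
)


def read_lexicon(handle):
    """Yield (english, french, tags) for each data row of a tab-separated file."""
    # QUOTE_NONE is an assumption: quote characters are treated as plain text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Keep the names of the class columns that contain "1".
        tags = [name for name in CLASSES if row[name] == "1"]
        yield row["en"], row["fr"], tags


entries = list(read_lexicon(io.StringIO(sample)))
print(entries)  # [("Linus's Law", 'Loi de Linus', ['PRODUCT'])]
```

    For the distributed file, the same function could be given a handle opened with newline="" and an explicit encoding; the file name and encoding are not stated in this description and would need to be checked against the delivered data.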

    The structure of the xml file is similar. The columns’ names from the csv file are now names of elements:

    Linus's Law
    Loi de Linus

    PRODUCT


    Technical Information
    Distribution medium: Downloadable
    Contents: written lexicon

    Members Prices
    Academic - Commercial 2000.00 EUR
    Academic - Research 600.00 EUR
    Commercial - Commercial 2000.00 EUR
    Commercial - Research 2000.00 EUR
    Non Member Prices
    Academic - Commercial 4000.00 EUR
    Academic - Research 1200.00 EUR
    Commercial - Commercial 4000.00 EUR
    Commercial - Research 4000.00 EUR

    Copyright © 2008 ELRA