  • Catalogue of Language Resources

    ELRA releases free Language Resources.

    The ELRA Catalogue of Language Resources offers a repository of Language Resources (LRs) made available through ELRA.

    An increasing number of LRs in the various fields of Human Language Technology (see image on the left-hand side) are distributed on behalf of ELRA via its operational body, ELDA, thanks to contributions from various players in the HLT community.

    Our aim is to provide Language Resources through this repository, so that researchers and developers do not have to invest effort in rebuilding resources that already exist, and to help them identify and access those resources.

    Other resources identified, but not available through ELRA, can be viewed in the Universal Catalogue.

    If you have any suggestions or comments, or need further details about ELRA and its Catalogue of Language Resources, please refer to the Contact Us section.

    ELRA is a partner of OLAC (Open Language Archives Community). The catalogue can be viewed as an OLAC repository.

    New Resources
  • ELRA-S0405 : Gram Vanni data set
    The Gram Vanni data set consists of 130
    hours (21,000 different audio
    recordings) recorded by 4,000 unique
    Hindi speakers in India (20-25% female,
    60% people under 30 years of age, mostly
    rural). The data set was collected via a
    voice-based community media platform
    that runs over IVR (Interactive Voice
    Response) telephone systems. The
    platform is used for discussions on
    local policies, local news, questions
    and answers on agriculture, health and
    social norms, and poetry. The
    environment for recordings is mostly
    outdoor, with a medium level of
    background noise from roadside and
    public places. Speech samples are stored
    as 8 kHz audio in MP3 files. An
    orthographic transcription is provided
    (transliteration in Latin characters),
    including tagged named entities.

  • ELRA-S0403 : CLE Pakistan Urdu Speech Corpus
    This corpus consists of phonetically
    rich Urdu sentences and additional
    sentences covering telephone numbers,
    addresses and personal names. The
    corpus was recorded with a variety of
    microphone types. The sampling rate of
    the speech files is 16 kHz. Each utterance
    is stored in a separate file and is
    accompanied by its orthographic
    transcription file in Unicode.

  • ELRA-M0051 : EnToSSLNE - a Lexicon of Parallel Named Entities from English to South Slavic Languages
    This lexicon consists of 26,155 parallel
    named entities in seven languages:
    English and six South Slavic ones:
    Bosnian, Bulgarian, Croatian,
    Macedonian, Serbian and Slovenian. The
    lexicon contains multiword entries which
    are not strictly named entities, but
    contain a word which is. Slovenian,
    Croatian and Bosnian are written in
    Latin script, Macedonian and Bulgarian
    in Cyrillic. Serbian is a special case,
    since it comes in two scripts
    (Cyrillic and Latin) and two dialects
    (ekavica and ijekavica). This lexicon
    uses the Serbian ekavica variant in
    Cyrillic script. The lexicon
    comes in two formats: csv and xml.

  • ELRA-W0128 : ECPC Corpus (European Comparable and Parallel Corpora of Parliamentary Speeches Archive) – set 1
    This corpus is a collection of XML
    metatextually tagged corpora containing
    speeches from European chambers. It is a
    bilingual, bidirectional written corpus
    in English and Spanish. This
    first set (ECPC_EP-05) consists of (1) a
    "clean" version in XML of the European
    Parliament's 2005 daily sessions; (2) a
    POS-tagged version of the 2005 daily
    sessions; and (3) a sentence-aligned
    version of the 2005 daily sessions.
    In its raw format, ECPC_EP-05 contains
    3,668,476 tokens/words (excluding
    tagging) in English distributed over 60
    utf-8 files and 3,993,867 tokens/words
    (excluding tagging) in Spanish
    distributed over 60 utf-8 files.

  • ELRA-S0402 : Speaking atlas of the regional languages of France
    The Speaking atlas of the regional
    languages of France offers the same
    Aesop’s fable read in French and in a
    number of varieties of languages of
    France. This work, which has both a
    scientific and a heritage dimension,
    aims to highlight the linguistic
    diversity of Metropolitan France and
    the Overseas Territories, through
    recordings collected in the field and
    presented via an interactive map, with
    their orthographic transcription. As far as
    Occitan is concerned, about sixty
    varieties were collected in Gascony,
    Languedoc, Provence, northern Occitania
    and the Linguistic Crescent. Varieties
    of Basque, Breton, Franconian, West
    Flemish, Alsatian, Corsican, Catalan,
    Francoprovençal and Oïl language(s) are
    also provided, as well as about fifty
    languages of the French Overseas
    Territories and non-territorial languages
    such as Rromani and French Sign Language.

  • (last update: June 2019)

    Copyright © 2008 ELRA