  • R&D Catalogue of Language Resources

    In response to needs expressed by several academic institutions in the Human Language Technology field, ELDA is pleased to offer a version of its Catalogue of Language Resources dedicated to academic research. In discussions with members of the academic R&D community on various occasions, we concluded that it was important to provide quick and easy access to a list of resources produced specifically for R&D purposes in Human Language Technology.

    We therefore now provide a list of Language Resources that are available at very affordable prices and intended for research use. To make this list easier to access, we have kept the interface and browsing tools of the ELDA catalogue. You may of course return to the full version of the catalogue at any time. Very soon, we will also add an advanced search that will let you browse the catalogue using pre-defined selection criteria, such as resource type or price, among many others.

    As in the full version of the catalogue, the language resources available here are divided into 4 categories: "Speech and Related Resources", "Written Resources", "Terminological Resources", and "Multimodal/Multimedia Resources".

    1/ Spoken LRs

    a - Telephone recordings
    The databases catalogued in this section were produced from speaker recordings made over the telephone network (fixed or mobile) or through a microphone. You will find speech resources recorded in various environments and covering a large number of European and non-European languages, e.g. the databases produced in the framework of the SpeechDat project.

    b - Desktop/Microphone recordings
    The databases catalogued in this section were produced from speaker recordings made with a microphone, e.g. the databases produced in the framework of the BABEL project.

    c - Broadcast Resources
    The databases catalogued in this section were produced from speech recorded from radio, television or internet broadcasts, such as the Italian Broadcast News Corpus.

    d - Speech Related Resources
    This section contains pronunciation and phonetic lexicons, such as the BDLEX, PHONOLEX, and MHATLEX databases.

    2/ Written LRs

    a - Corpora
    This section contains monolingual and multilingual corpora, parallel or not, which may also be annotated. Examples include the corpora developed in the framework of the MULTEXT project, the Multilingual and Parallel Corpora (MLCC), French scientific corpora, and newspaper corpora in Arabic.

    b - Monolingual lexicons
    The section dedicated to monolingual lexicons contains various types of dictionaries, e.g. a dictionary of French verbs, the Japanese word dictionary, and PAROLE lexicons in many languages.

    c - Multilingual lexicons
    Here you can find either bilingual or multilingual dictionaries and lexicons, such as the EuroWordNet databases.

    3/ Terminological LRs

    Monolingual, bilingual and multilingual terminological databases are available. They cover a large number of specialised domains, such as automobile engineering, insurance, linguistics and finance, in a wide variety of languages.

    4/ Multimodal/Multimedia LRs

    The resources in this section were produced using different modalities, including speech. One example is the database produced in the framework of the M2VTS project.


    New Resources
  • ELRA-W0128 : ECPC Corpus (European Comparable and Parallel Corpora of Parliamentary Speeches Archive) – set 1
    This corpus is a collection of XML
    metatextually tagged corpora containing
    speeches from European chambers. It is a
    bilingual, bidirectional written
    corpus in English and Spanish. This
    first set (ECPC_EP-05) consists of (1) a
    "clean" version in XML of European
    Parliament's 2005 daily sessions; (2) a
    POS-tagged version of the 2005 daily
    sessions; and (3) a sentence-based
    aligned version of 2005 daily sessions.
    In its raw format, ECPC_EP-05 contains
    3,668,476 tokens/words (excluding
    tagging) in English distributed over 60
    UTF-8 files and 3,993,867 tokens/words
    (excluding tagging) in Spanish
    distributed over 60 UTF-8 files.

  • ELRA-S0402 : Speaking atlas of the regional languages of France
    The Speaking atlas of the regional
    languages of France offers the same
    Aesop’s fable read in French and in a
    number of varieties of languages of
    France. This work, which has both a
    scientific and a heritage dimension,
    highlights the linguistic
    diversity of Metropolitan France and
    the Overseas Territories through recordings
    collected in the field and presented via
    an interactive map, with their
    orthographic transcription. As far as
    Occitan is concerned, about sixty
    varieties were collected in Gascony,
    Languedoc, Provence, northern Occitania
    and the Linguistic Crescent. Varieties
    of Basque, Breton, Franconian, West
    Flemish, Alsatian, Corsican, Catalan,
    Francoprovençal and Oïl language(s) are
    also provided, as well as about fifty
    languages of the French Overseas Territories and
    non-territorial languages such as
    Rromani and French Sign Language.

  • ELRA-W0126 : Training and test data for Arabizi detection and transliteration
    The dataset is composed of: a
    collection of mixed English and Arabizi
    text intended to train and test a system
    for the automatic detection of
    code-switching in mixed English and
    Arabizi texts; and a set of 3,452
    Arabizi tokens manually transliterated
    into Arabic, intended to train and test
    a system that performs Arabizi to Arabic

  • ELRA-S0396 : Mbochi speech corpus
    This corpus consists of 5131 sentences
    recorded in Mbochi, together with their
    transcription and French translation, as
    well as the results of the work done
    during the JSALT workshop: alignments at
    the phonetic level and various results
    of unsupervised word segmentation from
    audio. The audio corpus comprises
    4.5 hours, downsampled to 16 kHz, 16 bits,
    with Linear PCM encoding. Data is
    distributed into 2 parts, one for
    training consisting of 4617 sentences,
    and one for development consisting of
    514 sentences.

  • ELRA-S0394 : Metalogue Multi-Issue Bargaining Dialogue
    This corpus consists of approximately
    2.5 hours of semantically annotated
    English dialogue data that includes
    speech and transcripts. Six unique
    subjects (undergraduates between 19 and
    25 years of age) participated in the
    collection. The dialogue speech was
    captured with two headset microphones
    and saved in 16kHz, 16-bit mono linear
    PCM FLAC format. Transcripts were
    produced semi-automatically, using an
    automatic speech recognizer followed by
    manual correction. All text is presented
    in UTF-8 as either plain text or XML.

  • (last update: February 2019)

    Copyright © 2006 ELRA
    ELRACatalogue R&D 0.8.0