  • Catalogue of Language Resources

    ELRA releases free Language Resources.


    The ELRA Catalogue of Language Resources offers a repository of Language Resources (LRs) made available through ELRA.


    An increasing number of LRs in the various fields of Human Language Technology (HLT) are distributed on behalf of ELRA via its operational body, ELDA, thanks to contributions from various players in the HLT community.

    Our aim in providing Language Resources through this repository is to spare researchers and developers the effort of rebuilding resources that already exist, and to help them identify and access those resources.

    Other resources identified, but not available through ELRA, can be viewed in the Universal Catalogue.

    If you have any suggestions or comments, or need further details about ELRA and its Catalogue of Language Resources, please refer to the Contact Us section.

    ELRA is a partner of OLAC (Open Language Archives Community). The catalogue can be viewed as an OLAC repository.

    New Resources
  • ELRA-W0126 : Training and test data for Arabizi detection and transliteration
    The dataset has two parts: a
    collection of mixed English and Arabizi
    text, intended to train and test a system
    for the automatic detection of
    code-switching in mixed English and
    Arabizi texts; and a set of 3,452
    Arabizi tokens manually transliterated
    into Arabic, intended to train and test
    a system that performs Arabizi-to-Arabic
    transliteration.

  • ELRA-S0396 : Mbochi speech corpus
    This corpus consists of 5131 sentences
    recorded in Mbochi, together with their
    transcriptions and French translations, as
    well as the results of the work carried out
    during the JSALT workshop: alignments at
    the phonetic level and various results
    of unsupervised word segmentation from
    audio. The audio corpus comprises
    4.5 hours, downsampled to 16 kHz, 16 bits,
    with linear PCM encoding. The data is
    split into two parts: one for
    training, consisting of 4617 sentences,
    and one for development, consisting of
    514 sentences.

  • ELRA-S0397 : Chinese Mandarin (South) database
    This database contains recordings of
    1000 Chinese Mandarin speakers from
    Southern China (500 males and 500
    females), aged 18 to 60,
    recorded in quiet studios. Recordings
    were made through microphone headsets
    and consist of 341 hours of audio data
    (about 30 minutes per speaker), stored
    in .WAV files as 48 kHz,
    mono, 16-bit, linear PCM.

  • ELRA-S0398 : Chinese Mandarin (North) database
    This database contains recordings of
    500 Chinese Mandarin speakers from
    Northern China (250 males and 250
    females), aged 18 to 60,
    recorded in quiet studios. Recordings
    were made through microphone headsets
    and consist of 172 hours of audio data
    (about 30 minutes per speaker), stored
    in .WAV files as 48 kHz,
    mono, 16-bit, linear PCM.

  • ELRA-W0125 : TRAD Chinese-French Parallel Text - Blog
    This parallel corpus comprises 15,809
    Chinese characters and their French
    reference translations (11,769 words).
    The source texts are a selection of blog
    texts.

  • (last update: August 2018)
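
    Several of the speech resources listed above state their audio format explicitly (for example 48 kHz, mono, 16-bit linear PCM in .WAV files, or 16 kHz, 16-bit linear PCM). As a minimal sketch, a user could check a received file against the stated format with Python's standard `wave` module. The function names `describe_wav` and `matches_spec` are illustrative and not part of any ELRA tooling, and the sketch assumes the files carry a standard RIFF/WAVE header that `wave` can read.

```python
import wave


def describe_wav(path):
    """Read the header of a PCM .WAV file and return its basic properties."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            # getsampwidth() returns bytes per sample
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


def matches_spec(info, sample_rate_hz, channels=1, bits_per_sample=16):
    """Compare header properties with the format stated in a catalogue entry."""
    return (
        info["sample_rate_hz"] == sample_rate_hz
        and info["channels"] == channels
        and info["bits_per_sample"] == bits_per_sample
    )
```

    For a file from one of the Chinese Mandarin databases, one would call `matches_spec(describe_wav(path), 48000)`, where `path` is a hypothetical location of a .WAV file. Summing `duration_s` over all files gives a total that can be compared with the number of hours stated in the entry.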

    Copyright © 2008 ELRA