Language and Speech Resources
Language and speech resources are of crucial importance for research
and development in language and speech technology. ELSNET aims at the
creation and distribution of pilot resources for experimentation purposes,
and acts as a platform for exchange of expertise across languages, and for
discussion of emerging standards. ELSNET collaborates closely with the
main organisations in the field of resources.
ELSNET, in close collaboration with the former ENABLER Network, is in
the process of building a map of the Resources Landscape. This map should
facilitate the identification of and access to Language Resources: surveys,
metadata, networks, projects, ...
The first release of the landscape can now be found at
http://www.ilc.cnr.it/elsnet4/
- The European Corpus Initiative Multilingual Corpus I
- The ECI/MCI CD-ROM contains over 98 million
words, covering most of the major European languages, as well as
Turkish, Japanese, Russian, Chinese, Malay and more. The primary focus
in this effort is on textual material of all kinds, including
transcriptions of spoken material.
- Newspapers on the internet
- A list of links to electronic versions of newspapers from various
countries in several languages. The URL is
http://www.ims.uni-stuttgart.de/info/Newspapers.html
- The HCRC Map Task Corpus
- The HCRC Map Task Corpus is a set of 8 CD-ROMs containing linked
audio and transcriptions of a total of about 18 hours of spontaneous
speech that was recorded from 128 two-person conversations according to
a detailed experimental design.
- CD-ROMs available from LDC (no longer from ELSNET). The non-member price is approximately $200.
- The project URL is http://www.hcrc.ed.ac.uk/maptask/
- The Groningen Speech Corpus
- The Groningen Speech Corpus was
collected by A.M. Sulter, MD and Prof. H.K. Schutte as part of a
research project funded by NWO (Netherlands Organization for Scientific
Research). The 4 CD-ROMs contain over 20 hours of speech. It is a corpus
of read speech material in Dutch, recorded on PCM tape under fairly good
conditions.
- CD-ROMs available from ELRA/ELDA (no longer from ELSNET). The non-member price is approximately 800 euros.
- The Syntax/Semantic Annotation Task
- During 2000-2001 ELSNET produced two small sample
corpora of parallel structure for German and Italian, about 1000
sentences of each language, illustrating 20 verbs, and their syntactic
and semantic subcategorization. The annotation concentrates on the
verbal predicates and their subcategorized complements, as well as on a
few relevant modifiers. A short report can be found at
http://www.elsnet.org/ssa
- ELRA
- The European Language Resources
Association (ELRA) was established as a non-profit organization
in Luxembourg in February, 1995. The overall goal of ELRA is to
provide a centralized organization for the validation,
management, and distribution of speech, text, and terminology
resources and tools, and to promote their use within the
European telematics R&TD community. The URL is http://www.icp.grenet.fr/ELRA/home.html.
- LDC
- The Linguistic Data Consortium
(LDC) is an open consortium of universities, companies and
government research laboratories. It creates, collects and
distributes speech and text databases, lexicons, and other
resources for research and development purposes. The University
of Pennsylvania is the LDC's host institution. The LDC was
founded in 1992 with a grant from the Advanced Research
Projects Agency (ARPA), and is partly supported by grant
IRI-9528587 from the Information and Intelligent Systems
division of the National Science Foundation. The URL is http://www.ldc.upenn.edu
- ENABLER
- The Enabler Network aims at improving cooperation among
national activities established by national authorities for
providing Language Resources for their languages. The
action aims at: establishing a regular exchange of
information; identifying and fostering possible synergies
and cooperation; promoting the compatibility and
interoperability of their results, thus facilitating the
successful transfer of technologies and tools among
languages and the construction of multilingual Language
Resources; increasing the visibility and the strategic
impact of those national activities in the field of HLT;
contributing to the creation of an overall framework in
which the public and private sectors, national efforts and
international coordination could cooperate in order to
answer the IST need for Language Resources.
- URL: http://www.enabler-network.org/
- NEMLAR
- The goal of NEMLAR (Network for Euro-Mediterranean
LAnguage Resources) is to create a network of qualified
Euro-Mediterranean partners to specify and support the
development of high priority LRs for Arabic and other local
languages in a systematic, standards-driven, collaborative
learning context. The project will focus on identifying the
state of the art of LRs in the region, assessing priority
requirements through consultations with language industry
and communication players, and establishing a protocol for
developing a basic LR kit for the major forms of the
region's predominant language, Arabic, and for other widely
spoken local languages where appropriate.
- URL: http://www.nemlar.org
- TELRI
- The TELRI association aims at
collecting, promoting, and making available monolingual and
multilingual language resources and tools for the extraction of
language data and linguistic knowledge, with a special focus on
Central and Eastern European languages. The URL is http://www.telri.de.