Central and Eastern European Survey
Resources
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics, Charles University
NL and Speech Resources available: Textual, Speech, Software, Lexical resources
(including terminology).
Name: Czech National Corpus
Nature: text
Language: Czech
Size: under construction; up to 100 million words by the end of 1997
Format: ASCII, SGML
Coverage: mostly newspaper text, but will certainly also include prose, fiction, dialogues, and a historical (diachronic) part
Medium: diskette
Availability: part available through the Internet (free), the rest for commercial purposes
Name: Penn Treebank
Nature: text
Language: English
Size: tagged part up to 5 million words
Format: ASCII, SGML
Coverage: newspapers, technical manuals, Brown Corpus, Dow Jones newswire, WBUR radio
Medium: CD-ROM
Availability: unknown (we are only users)
Name: Brown Corpus
Nature: text
Language: English
Size: 1,013,644 words
Format: ASCII
Coverage: newspaper text, prose
Medium: diskette
Availability: unknown (we are only users)
Name: hand POS-tagged corpus
Nature: text
Language: Czech
Size: 600 000 tokens, each tagged with a POS tag
Format: ASCII, one TOKEN|TAG pair per line
Coverage: newspaper text from the 1960s and 1970s
Medium: diskette
Availability: free for research purposes
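The TOKEN|TAG-per-line format of the hand-tagged corpus can be read with a few lines of code. The following is a minimal sketch, not a tool distributed with the corpus: the function names are hypothetical, the sample tags (N, V) are invented for illustration, and splitting on the last "|" of each line is an assumption made so that a token containing "|" would still parse.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def parse_token_tag_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs from lines of the form TOKEN|TAG."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines (assumed to carry no token)
        # Assumption: the tag follows the LAST "|" on the line.
        token, separator, tag = line.rpartition("|")
        if not separator:
            raise ValueError(f"line {number}: no '|' separator in {line!r}")
        yield token, tag


def tag_frequencies(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each tag occurs."""
    return Counter(tag for _, tag in pairs)


# Invented sample; the real tag set of the corpus is not given in this survey.
SAMPLE = ["Praha|N\n", "je|V\n", "\n", "město|N\n"]

pairs = list(parse_token_tag_lines(SAMPLE))
print(pairs)
print(tag_frequencies(pairs))
```

For a file on disk, the same generator can be fed an open file object, since iterating a text file yields one line at a time.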
Name: manually tagged corpus
Nature: text
Language: Czech
Size: 150 000 tokens
Format: SGML
Coverage: newspaper and magazine text (1991-1997)
Medium: diskette
Availability: upon individual agreement
Name: Korektor (Spell Checker for Slovak Language)
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries; over 6 000 000 word forms (can be generated)
Format: ASCII (with Slovak character set, with special semantic structure)
Commercial format: proprietary "binary" structure
Software: Application Programming Interface (API), library in C
Coverage: main sources are newspaper text, law, and economy
Availability: available, regularly updated
Software description: For each word form, it returns a Boolean value indicating whether the word
is a correct form in the Slovak language.
Name: Spell Checker with Hyphenation for Slovak Language
Nature: lexical, software
Language: Slovak
Size: around 5000 entries for the TeX hyphenation algorithm; exception list of around 2000 entries
Format: ASCII (with Slovak character set), data for the TeX hyphenation algorithm
Commercial format: proprietary "binary" structure
Software: Application Programming Interface (API), library in C
(Note: usually in a single system with the Spell Checker, because of the list of
exceptions)
Coverage: general, domain independent
Precision of algorithm: 99.5% on the word list from Item 3
List of exceptions: all known entries from the Spell Checker (see above)
that are incorrectly hyphenated by the algorithm
Note: the remaining errors are almost exclusively semantically dependent
Medium: hard disk, diskette
Availability: available, regularly updated
Software description: It returns the hyphenated form, using a special "hyphenation" character.
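The combination described in this entry, a pattern-based TeX-style hyphenation algorithm backed by a list of exceptions, can be sketched as follows. This is an illustrative sketch only and not the API of the C library listed here: the function names, the single toy pattern "a1m", the toy exception entry, and the minimum fragment lengths (two letters on each side) are all assumptions. The exception list is consulted first; otherwise pattern scores are combined as in the TeX algorithm, where an odd score between two letters permits a break.

```python
import re
from typing import Dict, Iterable, List


def compile_patterns(patterns: Iterable[str]) -> Dict[str, List[int]]:
    """Turn patterns such as 'a1m' into {letters: scores between letters}."""
    compiled = {}
    for pattern in patterns:
        letters = re.sub(r"\d", "", pattern)
        points = [0] * (len(letters) + 1)
        position = 0
        for ch in pattern:
            if ch.isdigit():
                points[position] = int(ch)  # score before letters[position]
            else:
                position += 1
        compiled[letters] = points
    return compiled


def break_points(word: str, compiled: Dict[str, List[int]],
                 left_min: int = 2, right_min: int = 2) -> List[int]:
    """Return indices p such that a break is allowed before word[p]."""
    work = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(work) + 1)
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            points = compiled.get(work[start:end])
            if points:
                for offset, value in enumerate(points):
                    index = start + offset
                    scores[index] = max(scores[index], value)
    # Position p in the word corresponds to index p + 1 in the dotted string.
    return [p for p in range(left_min, len(word) - right_min + 1)
            if scores[p + 1] % 2 == 1]


def hyphenate(word: str, compiled: Dict[str, List[int]],
              exceptions: Dict[str, str], hyphen_char: str = "-") -> str:
    """Hyphenate a word, preferring the exception list over the patterns."""
    known = exceptions.get(word.lower())
    if known is not None:
        return known.replace("-", hyphen_char)
    pieces, previous = [], 0
    for p in break_points(word, compiled):
        pieces.append(word[previous:p])
        previous = p
    pieces.append(word[previous:])
    return hyphen_char.join(pieces)


# Toy data, invented for illustration; not taken from the Slovak resources.
PATTERNS = compile_patterns(["a1m"])
EXCEPTIONS = {"dynamo": "dy-na-mo"}

print(hyphenate("mama", PATTERNS, EXCEPTIONS))    # ma-ma (from the pattern)
print(hyphenate("dynamo", PATTERNS, EXCEPTIONS))  # dy-na-mo (from the exception list)
print(hyphenate("dynamo", PATTERNS, {}))          # dyna-mo (pattern only)
```

Checking the exceptions before running the patterns mirrors the role the entry gives to its exception list: it stores exactly those words that the algorithm alone would hyphenate incorrectly.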
Name: Hyphenated word list for Slovak Language (Note: no special name)
Nature: lexical
Language: Slovak
Size: around 150 000 hyphenated word forms (Note: all word forms from the Spell Checker can be generated
and hyphenated)
Format: ASCII (with Slovak character set), word list for the TeX hyphenation algorithm
Coverage: for training purposes, word forms incorrectly hyphenated by the algorithm were added in particular
Medium: hard disk, diskette
Availability: for internal use, possibly as a commercial product. Status: available,
regularly updated
Name: Lematizator: Lemmatizer and Stemmer for Slovak Language
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries
Format: ASCII (with Slovak character set)
Commercial format: proprietary "binary" structure
Software: Application Programming Interface (API), library in C
Coverage: general, domain independent
Medium: hard disk, diskette
Availability: commercial product. Status: available, regularly updated
Software description: Word form analyzer; result: basic form(s) (lemma) and stem(s).
Word form generator from a given lemma. (Note: some lemmas are semantically
distinguished)
Name: Morphology (Note: Morphology for Slovak Language)
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries
Format: ASCII (with Slovak character set)
Commercial format: proprietary "binary" structure
Software: Application Programming Interface (API), library in C
Coverage: general, domain independent
Medium: hard disk, diskette
Availability: commercial product. Status: available, regularly updated
Software description: Word form analyzer; result: basic form(s) (lemma) and
morphological information (POS, case, number, etc.).
(Note: some lemmas are semantically distinguished)
Name: Frequency list (Note: Frequency list for Slovak Language)
Nature: lexical
Language: Slovak
Size: 10 000 word forms
Format: ASCII (with Slovak character set)
Coverage: newspapers
Medium: hard disk, diskette
Availability: commercial product. Status: available
Software description: Special-purpose tool (Unix and Windows platform) for easy disambiguation
of morphological output. Available upon personal agreement.