Name: Czech National Corpus
Nature: text
Language: Czech
Size: under construction; up to 100 million words expected
by the end of 1997
Format: ASCII, SGML
Coverage: mostly newspaper text, but will certainly also include prose,
fiction, dialogues, and a historical (diachronic) part
Medium: diskette
Availability: part available free through the Internet; the rest for commercial
purposes
Name: Penn Treebank
Nature: text
Language: English
Size: tagged part up to 5 million words
Format: ASCII, SGML
Coverage: newspapers, technical manuals, Brown Corpus, Dow Jones newswire,
WBUR radio
Medium: CD-ROM
Availability: unknown; we are only users
Name: Brown Corpus
Nature: text
Language: English
Size: 1,013,644 words
Format: ASCII
Coverage: newspaper text, prose
Medium: diskette
Availability: unknown; we are only users
Name: hand POS-tagged corpus
Nature: text
Language: Czech
Size: 600 000 tokens, each tagged with a POS tag
Format: ASCII, one TOKEN|TAG pair per line
Coverage: newspaper text from the 1960s and 1970s
Medium: diskette
Availability: free for research purposes
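
A minimal sketch of how a file in the one-pair-per-line TOKEN|TAG layout could be read. The function names and the sample tags are hypothetical; the corpus's actual tagset and character encoding are not specified here. The line is split on the last "|" on the assumption that a token may itself contain that character while a tag does not.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_tagged(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs from lines of the form TOKEN|TAG."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the LAST "|" so that a token containing "|" survives.
        token, sep, tag = line.rpartition("|")
        if not sep:
            raise ValueError(f"no TOKEN|TAG separator in line: {line!r}")
        yield token, tag


def tag_counts(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each tag occurs."""
    return Counter(tag for _, tag in pairs)


if __name__ == "__main__":
    # Hypothetical sample; "N" and "V" are placeholder tags.
    sample = ["Praha|N\n", "je|V\n"]
    print(tag_counts(read_tagged(sample)))
```

With a real file, the same functions could be fed an open file object, opened with whatever encoding the distribution actually uses.
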
Name: manually tagged corpus
Nature: text
Language: Czech
Size: 150 000 tokens
Format: SGML
Coverage: newspaper and magazine text (1991-1997)
Medium: diskette
Availability: upon individual agreement
Name: Korektor (Spell Checker for Slovak Language)
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries; over 6 000 000 word forms can be generated
Format: ASCII (with Slovak character set, with special semantic structure)
Commercial format: proprietary "binary" structure
Software: Application Programming Interface (API), library in C
Coverage: main sources are newspaper text, law, economy
Availability: available, regularly updated
Software description: For each word form, returns a boolean indicating whether the word
is a correct form in Slovak
Name: Spell Checker with Hyphenation for Slovak Language
Nature: lexical, software
Language: Slovak
Size: around 5000 entries for the TeX hyphenation algorithm; exception list of around 2000 entries
Format: ASCII (with Slovak character set), data for the TeX hyphenation algorithm
Commercial format: proprietary "binary" structure
Software: Application Programming Interface (API), library in C
(Note: usually combined in a single system with the Spell Checker, because of the
exception list)
Coverage: general, domain independent
Precision of algorithm: 99.5% on the word list from Item 3.
List of exceptions: all known entries from the Spell Checker (see above)
that are incorrectly hyphenated by the algorithm.
Note: the remaining errors are almost exclusively semantically dependent
Medium: Hard disc, diskette
Availability: available, regularly updated
Software description: Returns the hyphenated form, using a special "hyphenation" character
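
The combination described in this entry, pattern-based TeX hyphenation backed by an exception list, can be illustrated with a small sketch of Liang's pattern algorithm. This is not the product's C API, which is not documented here: the class and parameter names are hypothetical, the two patterns in the usage example are toy patterns rather than real Slovak hyphenation data, and the minimum fragment lengths of 2 and 2 are an assumption.

```python
import re


def parse_pattern(pattern):
    """Split a Liang pattern such as 'o1v' into letters 'ov' and weights [0, 1, 0]."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight of the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


class Hyphenator:
    def __init__(self, patterns, exceptions=(), left_min=2, right_min=2, hyphen="-"):
        self.patterns = dict(parse_pattern(p) for p in patterns)
        # Exceptions are given already hyphenated, e.g. "slo-vo".
        self.exceptions = {e.replace(hyphen, ""): e for e in exceptions}
        self.left_min = left_min
        self.right_min = right_min
        self.hyphen = hyphen

    def hyphenate(self, word):
        key = word.lower()
        # The exception list is consulted first and overrides the patterns
        # (the stored form is returned as given, so original casing is lost).
        if key in self.exceptions:
            return self.exceptions[key]
        padded = "." + key + "."  # dots mark the word boundaries
        scores = [0] * (len(padded) + 1)  # scores[i]: gap before padded[i]
        for start in range(len(padded)):
            for end in range(start + 1, len(padded) + 1):
                weights = self.patterns.get(padded[start:end])
                if weights is not None:
                    for offset, w in enumerate(weights):
                        scores[start + offset] = max(scores[start + offset], w)
        pieces, last = [], 0
        # An odd final weight allows a break; an even one forbids it.
        for j in range(self.left_min, len(word) - self.right_min + 1):
            if scores[j + 1] % 2 == 1:
                pieces.append(word[last:j])
                last = j
        pieces.append(word[last:])
        return self.hyphen.join(pieces)


if __name__ == "__main__":
    h = Hyphenator(["o1v"])  # toy pattern, not real Slovak data
    print(h.hyphenate("slovo"))
```

The `hyphen` parameter plays the role of the special "hyphenation" character mentioned in the software description; a caller could pass a soft hyphen or any other marker instead of "-".
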
Name: Hyphenated word list for Slovak Language. (Note: no special name)
Nature: lexical
Language: Slovak
Size: around 150 000 hyphenated word forms (Note: All word forms from the Spell Checker can be generated
and hyphenated)
Format: ASCII (with Slovak character set), word list for the
TeX hyphenation algorithm
Coverage: for training purposes, word forms incorrectly hyphenated
by the algorithm were added in particular
Medium: Hard disc, diskette
Availability: for internal use; possibly as a commercial product. Status: available,
regularly updated
Name: Lematizator: Lemmatization and Stemmer for Slovak Language
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries
Format: ASCII (with Slovak Character set)
Commercial format: proprietary "binary" structure
Software: Application Programming Interface (API), library in C
Coverage: General, domain independent
Medium: Hard disc, diskette
Availability: commercial product. Status: Available, regularly updated
Software description: Word form analyzer; result: basic form(s) (lemma) and stem(s).
Word form generator from a given lemma. (Note: some lemmas are semantically
distinguished)
Name: Morphology (Note: Morphology for Slovak Language)
Nature: lexical, software
Language: Slovak
Size: over 120 000 lexical entries
Format: ASCII (with Slovak Character set)
Commercial format: proprietary "binary" structure
Software: Application Programming Interface (API), library in C
Coverage: general, domain independent
Medium: Hard disc, diskette
Availability: commercial product. Status: available, regularly updated.
Software description: Word form analyzer; result: basic form(s) (lemma) and
morphological information (POS, case, number, etc.)
(Note: Some lemmas are semantically distinguished)
Name: Frequency list. (Note: Frequency list for Slovak Language)
Nature: lexical
Language: Slovak
Size: 10 000 word forms
Format: ASCII (with Slovak Character set)
Coverage: newspapers
Medium: Hard disc, diskette
Availability: commercial product. Status: available
Software description: Special-purpose tool (Unix and Windows platform) for easy disambiguation
of morphological output. Available upon personal agreement.