Lang-uk
Lang-uk projects
Open data
Word Embeddings (Word2Vec, LexVec and GloVe 300d vectors). The lemmatized version of the vectors was built with the help of the Ukrainian POS tag dictionary.
UberText corpus contains 67,496,871 sentences comprising 665 million tokens. Sources: periodicals, Wikipedia, fiction. We provide access to the whole corpus (including its tokenized version) as well as to any part of it.
Corpus of laws and legal acts containing about 580 million tokens.
NER annotation corpus includes 229 texts from the Ukrainian Brown corpus with 217,381 tokens and 6,751 annotated Named Entities.
The Sentiment Dictionary for Ukrainian contains approximately 3,500 Ukrainian words that have non-neutral sentiment.
Gazetteers include collections of names for makes and models of automobiles, motorcycles, trucks, and boats, as well as a list of country names.
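As a rough illustration of how the word embeddings listed above might be used, the sketch below reads vectors from a word2vec-style text stream and compares two words by cosine similarity. It assumes the vectors are available in the plain-text word2vec layout (a header line with vocabulary size and dimension, then one word and its values per line); the actual distribution format is not stated here. The function names and the three-dimensional sample data are hypothetical and stand in for a real 300d file.

```python
import io
import math


def load_text_vectors(handle):
    """Read vectors from a word2vec-style text stream into a dict."""
    header = handle.readline().split()
    dim = int(header[1])  # header is assumed to be "<vocab_size> <dim>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up toy data, NOT taken from the published vectors.
SAMPLE = "3 3\nкіт 1.0 0.0 0.0\nпес 0.8 0.6 0.0\nзакон 0.0 0.0 1.0\n"
vectors = load_text_vectors(io.StringIO(SAMPLE))
similarity = cosine(vectors["кіт"], vectors["пес"])
```

For a real file, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"`. A dedicated library would be a better fit for full-size 300d vectors, since a plain dict of Python lists uses a lot of memory.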
Data with limited access
Corpora that contain periodicals, fiction, laws, and legal acts are available to group members only and will be published openly if copyright allows.
Models
NER models for MITIE
Tools
Simple tokenizer
Annotation tool for crowdsourced data processing
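To give a sense of what a simple tokenizer for Ukrainian text might do, here is a minimal regex-based sketch. It is not the project's tokenizer: the function name `simple_tokenize` and the splitting rules are assumptions made for illustration. The pattern keeps words joined by an apostrophe or hyphen together and emits every other non-space symbol as its own token.

```python
import re

# A run of word characters, optionally continued by an apostrophe or hyphen
# plus more word characters (e.g. "п'ять", "будь-який"); otherwise a single
# non-word, non-space symbol such as punctuation.
TOKEN_RE = re.compile(r"\w+(?:['’ʼ-]\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text)


tokens = simple_tokenize("п'ять слів, будь-який текст.")
```

A regex like this ignores harder cases such as abbreviations, URLs, and numbers with separators, which a production tokenizer would need to handle.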