(Word2Vec, LexVec and GloVe 300d vectors). The lemmatized version of the vectors was built with the help of the Ukrainian POS tag dictionary.
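As a minimal sketch of how such vectors might be used, the example below parses embeddings stored in the common word2vec text format (a header line with the vocabulary size and dimensionality, then one word and its values per line) and compares two words by cosine similarity. The format of the published files is an assumption here, and the three-dimensional sample data and the function names are purely illustrative; the real vectors have 300 dimensions.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec text format: header 'count dim', then 'word v1 ... vN'."""
    header = handle.readline().split()
    _count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy data standing in for a real 300d vector file.
SAMPLE = "3 3\nкіт 1.0 0.0 0.0\nпес 0.8 0.6 0.0\nзакон 0.0 0.0 1.0\n"

dim, vectors = load_text_vectors(io.StringIO(SAMPLE))
print(dim, len(vectors))
print(cosine(vectors["кіт"], vectors["пес"]))
```

For a real file, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"`.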
contains 67,496,871 sentences comprising 665 million tokens. Sources: periodicals, Wikipedia, fiction. We provide access to the whole corpus (including its tokenized version) as well as to any part of it.
Corpus of laws and legal acts containing about 580 million tokens.
NER annotation corpus includes 229 texts from the Ukrainian Brown corpus with 217,381 tokens and 6,751 annotated named entities.
The Sentiment Dictionary for Ukrainian contains approximately 3.5K Ukrainian words that have non-neutral sentiment.
include a collection of names for makes and models of automobiles, motorcycles, trucks, boats and also a list of country names
Data with limited access
Corpora that contain periodicals, fiction, laws, and legal acts are available for group members only and will be published openly if the copyright allows it.
Annotation tool for crowdsourced data processing