Lang-uk projects

Open data

Word embeddings (Word2Vec, LexVec, and GloVe 300d vectors). The lemmatized version of the vectors was built with the help of the Ukrainian POS tag dictionary.
The UberText corpus contains 67,496,871 sentences totaling 665 million tokens. Sources: periodicals, Wikipedia, fiction. We provide access to the whole corpus (including its tokenized version) as well as to any part of it.
Corpus of laws and legal acts that contains about 580 million tokens.
The NER annotation corpus includes 229 texts from the Ukrainian Brown corpus, with 217,381 tokens and 6,751 annotated named entities.
The Sentiment Dictionary for Ukrainian contains approximately 3,500 words that have non-neutral sentiment.
Gazetteers include collections of names for makes and models of automobiles, motorcycles, trucks, and boats, as well as a list of country names.
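
As an illustration of how the word embeddings listed above might be used, here is a minimal sketch in plain Python. It assumes the vectors are distributed in the common word2vec text format (a header line with the vocabulary size and dimensionality, then one word per line followed by its values); that format is an assumption and is not stated on this page. The function names `load_text_vectors` and `cosine` are hypothetical helpers, and the three-dimensional sample data is made up so that the example runs on its own.

```python
import io
import math


def load_text_vectors(stream, limit=None):
    """Read vectors from a stream in word2vec text format.

    Assumed layout (not confirmed by the project page):
    first line "vocab_size dim", then "word v1 v2 ... v_dim" per line.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimensionality
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # optionally stop early to keep memory use small
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data standing in for a real vector file; the values are invented.
sample = io.StringIO(
    "3 3\n"
    "кіт 1.0 0.0 0.0\n"
    "пес 0.8 0.6 0.0\n"
    "закон 0.0 0.0 1.0\n"
)
vectors = load_text_vectors(sample)
print(cosine(vectors["кіт"], vectors["пес"]))
print(cosine(vectors["кіт"], vectors["закон"]))
```

With a downloaded file, the same function could be called on `open(path, encoding="utf-8")`, where `path` points to wherever the vectors were saved. The `limit` argument is there because 300-dimensional vectors for a large vocabulary may not fit comfortably in memory as Python lists.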

Corpora that contain periodicals, fiction, laws, and legal acts are available to group members only and will be published openly if copyright permits.