NER models for MITIE
With the help of an annotated Ukrainian corpus, we trained a model that automatically labels words in unseen texts with the corresponding entity types (person names, geographic locations, companies, etc.). For NER we chose the MITIE library: it is open source, allows free commercial use, and delivers high quality by combining standard text features with CCA embeddings. MITIE is written primarily in C++ and also provides bindings for C, Python, Java, and MATLAB. For further details, see the MITIE documentation and examples.
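MITIE's Python binding returns recognized entities as (token range, tag) pairs over the tokenized input. A minimal sketch of consuming that kind of output, with the actual MITIE calls shown in comments and one hard-coded result for illustration (the model path and the sample sentence are ours, not from the release):

```python
# Sketch of handling MITIE-style NER output. The real calls would be:
#   from mitie import named_entity_extractor, tokenize
#   ner = named_entity_extractor("uk_model.dat")   # hypothetical model path
#   entities = ner.extract_entities(tokenize(text))
# Below, one such result is hard-coded to show the span handling.

tokens = ["Тарас", "Шевченко", "народився", "в", "Моринцях"]
entities = [(range(0, 2), "PERS"), (range(4, 5), "LOC")]  # MITIE-style (range, tag) pairs

def label_spans(tokens, entities):
    """Join each entity's token range into a (surface text, tag) pair."""
    return [(" ".join(tokens[i] for i in rng), tag) for rng, tag in entities]

print(label_spans(tokens, entities))
```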
To calculate the CCA embeddings, we used a genre-balanced Ukrainian corpus consisting of newswire, Wikipedia articles, and fiction.
Download the NER model for Ukrainian
We have also developed a NER model for Russian, using the annotated corpus compiled by the organizers of the Dialogue 2016 conference. The CCA embeddings for this model were computed on Wikipedia articles.
Download NER model for Russian
Word embeddings (Word2Vec, GloVe, LexVec)
Based on the collected corpora of newswire, articles, fiction, and legal texts, we computed the most widely used word embeddings: Word2Vec (and its refinement, LexVec) and GloVe. Because these computations take considerable time and computational resources, we decided to share the results and make them publicly accessible.
We have created separate models with 300-dimensional vectors for each genre, and computed the same vectors for lemmatized versions of the corpora as well. If you need different model settings, use the compiled corpora and train the models according to your needs.
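The released vectors use the standard word2vec text format: a header line with vocabulary size and dimensionality, then one word and its vector per line. A stdlib-only sketch of loading such a file and comparing two words by cosine similarity (the two-word sample is an in-memory stand-in for a real vectors file):

```python
import io
import math

def load_word2vec_text(fileobj):
    """Parse word2vec .txt format: a 'vocab_size dim' header, then 'word v1 ... vd' lines."""
    vocab_size, dim = map(int, fileobj.readline().split())
    vecs = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Tiny illustrative stand-in for a real vectors file (2 words, 3 dimensions).
sample = io.StringIO("2 3\nкіт 0.1 0.2 0.3\nпес 0.1 0.2 0.25\n")
vecs = load_word2vec_text(sample)
print(cosine(vecs["кіт"], vecs["пес"]))
```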
Tetiana Kodliuk has developed test sets, analogous to the English ones, for evaluating the quality of the vectors. The results of the evaluation can be found here.
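Such test sets typically include word analogies (a : b :: c : ?), solved by the offset method: find the word nearest to v_b − v_a + v_c. A stdlib-only sketch with toy 2-dimensional vectors (the words and vector values are made up for illustration, not taken from the released models):

```python
import math

def nearest(query, vecs, exclude):
    """Return the word (outside `exclude`) whose vector is most cosine-similar to `query`."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return max((w for w in vecs if w not in exclude), key=lambda w: cos(query, vecs[w]))

def analogy(a, b, c, vecs):
    """Solve a : b :: c : ? via the offset method: nearest word to v_b - v_a + v_c."""
    query = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    return nearest(query, vecs, exclude={a, b, c})

# Toy vectors where the second axis encodes gender (illustrative only).
vecs = {
    "король":   [1.0, 1.0],
    "королева": [1.0, -1.0],
    "чоловік":  [0.2, 1.0],
    "жінка":    [0.2, -1.0],
}
print(analogy("чоловік", "жінка", "король", vecs))
```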
word2vec for small corpora (fiction):
./word2vec_standalone.py -size 300 -negative 7 -window 4 -threads 6 -min_count 10 -iter 5 -alpha 0.030
lexvec for small corpora (fiction):
./lexvec -dim 300 -verbose 2 -negative 7 -subsample 1e-3 -window 4 -threads 6 -minfreq 10 -iterations 5 -alpha 0.030
GloVe for small corpora (fiction):
MEMORY=4.0
VOCAB_MIN_COUNT=10
VECTOR_SIZE=300
MAX_ITER=15
WINDOW_SIZE=9
BINARY=2
NUM_THREADS=12
X_MAX=10
For large corpora the minimum frequency (min_count / minfreq / VOCAB_MIN_COUNT) is 25.
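The variables above follow the structure of GloVe's demo.sh; a sketch of how they are passed to GloVe's command-line tools (the corpus and output file names are illustrative, not from the original runs):

```shell
# Sketch of the GloVe pipeline these settings feed into (cf. GloVe's demo.sh).
# CORPUS and SAVE_FILE are illustrative names.
CORPUS=fiction.txt
SAVE_FILE=vectors
build/vocab_count -min-count $VOCAB_MIN_COUNT -verbose 2 < $CORPUS > vocab.txt
build/cooccur -memory $MEMORY -vocab-file vocab.txt -verbose 2 \
    -window-size $WINDOW_SIZE < $CORPUS > cooccurrence.bin
build/shuffle -memory $MEMORY -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
build/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file cooccurrence.shuf.bin \
    -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY \
    -vocab-file vocab.txt -verbose 2
```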
Sentiment analysis model
Sergyi Shehovets and Oles Petriv have developed a neural network sentiment analysis model, which also generates similar words with the help of word2vec and LexVec.
Examples: