NER models for MITIE
With the help of an annotated Ukrainian corpus, we trained a model that automatically labels words in unseen texts with the corresponding entity types (person names, geographic locations, companies, etc.). For NER we chose the MITIE library: it is open source, allows free commercial use, and delivers high quality by combining standard text features with CCA embeddings. MITIE is written primarily in C++ and also provides bindings for C, Python, Java, and MATLAB. For further details, see the MITIE documentation and examples.
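MITIE's Python binding returns recognized entities as (token range, tag) pairs over the tokenized input. A minimal sketch of consuming that kind of output, with the actual MITIE calls shown in comments and one hard-coded result for illustration (the model path and the sample sentence are ours, not from the release):

```python
# Sketch of handling MITIE-style NER output. The real calls would be:
#   from mitie import named_entity_extractor, tokenize
#   ner = named_entity_extractor("uk_model.dat")   # hypothetical model path
#   entities = ner.extract_entities(tokenize(text))
# Below, one such result is hard-coded to show the span handling.

tokens = ["Тарас", "Шевченко", "народився", "в", "Моринцях"]
entities = [(range(0, 2), "PERS"), (range(4, 5), "LOC")]  # MITIE-style (range, tag) pairs

def label_spans(tokens, entities):
    """Join each entity's token range into a (surface text, tag) pair."""
    return [(" ".join(tokens[i] for i in rng), tag) for rng, tag in entities]

print(label_spans(tokens, entities))
```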
To calculate the CCA embeddings, we used a genre-balanced Ukrainian corpus consisting of newswire, Wikipedia articles, and fiction.
Download the NER model for Ukrainian
We have also developed a NER model for Russian, using the annotated corpus compiled by the organizers of the Dialogue 2016 conference. The CCA embeddings for this model were computed on Wikipedia articles.
Download NER model for Russian
Word embeddings (Word2Vec, GloVe, LexVec)
Based on the collected corpora of newswire, articles, fiction, and legal texts, we computed the most widely used word embeddings: Word2Vec (and its refinement, LexVec) and GloVe. Because these computations take considerable time and computational resources, we decided to share the results and make them publicly accessible.
We have created separate models with 300-dimensional vectors for each genre, and computed the same vectors for lemmatized versions of the corpora as well. If you need different model settings, use the compiled corpora and train the models according to your needs.
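The released vectors use the standard word2vec text format: a header line with vocabulary size and dimensionality, then one word and its vector per line. A stdlib-only sketch of loading such a file and comparing two words by cosine similarity (the two-word sample is an in-memory stand-in for a real vectors file):

```python
import io
import math

def load_word2vec_text(fileobj):
    """Parse word2vec .txt format: a 'vocab_size dim' header, then 'word v1 ... vd' lines."""
    vocab_size, dim = map(int, fileobj.readline().split())
    vecs = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Tiny illustrative stand-in for a real vectors file (2 words, 3 dimensions).
sample = io.StringIO("2 3\nкіт 0.1 0.2 0.3\nпес 0.1 0.2 0.25\n")
vecs = load_word2vec_text(sample)
print(cosine(vecs["кіт"], vecs["пес"]))
```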
Tetiana Kodliuk has developed test sets, analogous to the English ones, for evaluating the quality of the vectors. The results of the evaluation can be found here.
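Such test sets typically include word analogies (a : b :: c : ?), solved by the offset method: find the word nearest to v_b − v_a + v_c. A stdlib-only sketch with toy 2-dimensional vectors (the words and vector values are made up for illustration, not taken from the released models):

```python
import math

def nearest(query, vecs, exclude):
    """Return the word (outside `exclude`) whose vector is most cosine-similar to `query`."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    return max((w for w in vecs if w not in exclude), key=lambda w: cos(query, vecs[w]))

def analogy(a, b, c, vecs):
    """Solve a : b :: c : ? via the offset method: nearest word to v_b - v_a + v_c."""
    query = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    return nearest(query, vecs, exclude={a, b, c})

# Toy vectors where the second axis encodes gender (illustrative only).
vecs = {
    "король":   [1.0, 1.0],
    "королева": [1.0, -1.0],
    "чоловік":  [0.2, 1.0],
    "жінка":    [0.2, -1.0],
}
print(analogy("чоловік", "жінка", "король", vecs))
```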
word2vec for small corpora (fiction):
./word2vec_standalone.py -size 300 -negative 7 -window 4 -threads 6 -min_count 10 -iter 5 -alpha 0.030
lexvec for small corpora (fiction):
./lexvec -dim 300 -verbose 2 -negative 7 -subsample 1e-3 -window 4 -threads 6 -minfreq 10 -iterations 5 -alpha 0.030
GloVe for small corpora (fiction):
MEMORY=4.0
VOCAB_MIN_COUNT=10
VECTOR_SIZE=300
MAX_ITER=15
WINDOW_SIZE=9
BINARY=2
NUM_THREADS=12
X_MAX=10
For large corpora the minimum frequency (min_count / minfreq / VOCAB_MIN_COUNT) is 25.
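The variables above follow the structure of GloVe's demo.sh; a sketch of how they are passed to GloVe's command-line tools (the corpus and output file names are illustrative, not from the original runs):

```shell
# Sketch of the GloVe pipeline these settings feed into (cf. GloVe's demo.sh).
# CORPUS and SAVE_FILE are illustrative names.
CORPUS=fiction.txt
SAVE_FILE=vectors
build/vocab_count -min-count $VOCAB_MIN_COUNT -verbose 2 < $CORPUS > vocab.txt
build/cooccur -memory $MEMORY -vocab-file vocab.txt -verbose 2 \
    -window-size $WINDOW_SIZE < $CORPUS > cooccurrence.bin
build/shuffle -memory $MEMORY -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
build/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file cooccurrence.shuf.bin \
    -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY \
    -vocab-file vocab.txt -verbose 2
```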
Sentiment analysis model
Sergyi Shehovets and Oles Petriv have developed a neural network sentiment analysis model, which also generates similar words with the help of word2vec and LexVec.
Examples: