Corpora: lang-uk

NER annotation corpus

It includes 229 texts from the Ukrainian Brown corpus with 217,381 tokens and 6,751 annotated Named Entities.

Data description

The corpus of annotated data can be found in the data folder. It includes:

229 texts
217,381 tokens
6,751 Named Entities:
PERS - 4,060
LOC - 1,442
ORG - 649
MISC - 600

The source of the data is the open Brown corpus of Ukrainian texts. For each processed text, there are two files:

the file with the extension tok.txt contains the tokenized version of the text (tokenization was done with the following rules)
the file with the extension tok.ann contains the NER annotations for this text in Brat Standoff Format. Each line of the file has three tab-separated fields: the annotation ID, the entity type followed by the start and end character offsets in the tokenized text (separated by spaces), and the text of the named entity.
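As an illustration of the file layout described above, here is a minimal Python sketch that reads Brat standoff text-bound annotations and checks them against the tokenized text. The names Entity and parse_ann, the sample sentence, and its offsets are made up for this example and are not taken from the corpus; the sketch also assumes that offsets are character positions in the tok.txt file and that every entity is a single continuous span.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str   # annotation ID, e.g. "T1"
    label: str    # entity type, e.g. "PERS"
    start: int    # start character offset in the tokenized text
    end: int      # end character offset (exclusive)
    text: str     # surface form of the entity


def parse_ann(content: str) -> List[Entity]:
    """Parse text-bound annotations (lines starting with 'T') from a .ann file."""
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes, and empty lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # not the three-field layout described above
        ann_id, span, text = parts
        fields = span.split(" ")
        if len(fields) != 3:
            continue  # discontinuous spans are ignored in this sketch
        label, start, end = fields
        entities.append(Entity(ann_id, label, int(start), int(end), text))
    return entities


# Hypothetical sample, standing in for a tok.txt / tok.ann pair.
sample_text = "Тарас Шевченко народився в селі Моринці ."
sample_ann = "T1\tPERS 0 14\tТарас Шевченко\nT2\tLOC 32 39\tМоринці\n"

for ent in parse_ann(sample_ann):
    # The slice of the tokenized text should match the annotated surface form.
    assert sample_text[ent.start:ent.end] == ent.text
    print(ent.label, ent.text)
```

For real data, the same function could be applied to the contents of each tok.ann file, opened with encoding="utf-8", next to its tok.txt counterpart.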

Each text was annotated by two annotators according to the following rules. Disagreements between the annotators were resolved by a third annotator.

For training and validating models, we recommend using the standard split into dev and test sets.

Licensing

This data is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

UberText Corpus

We have collected and organized a large collection (more than 6 GB) of texts from Ukrainian periodicals, and we continue to extend the archive. Our aim is to reach one billion words.

The same texts are used to calculate Word Embeddings. Unfortunately, the licence restrictions of some periodicals prohibit publishing their texts unchanged.

To give public access to this data, we split the texts into sentences and shuffled the sentences randomly. Thus, anyone can use these texts to build statistical models that work at the sentence level. We have also published a lemmatized version of these texts in a separate archive. For tokenization and lemmatization, we used the nlp-uk library from Andriy Rysin and the BrUk group.
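To make "sentence level" concrete: because the sentences are shuffled, any statistic has to be computed within a single sentence, never across neighbouring ones. The Python sketch below counts bigrams this way. It assumes one tokenized sentence per line with tokens separated by whitespace, which is an assumption about the archive layout rather than something stated here; the function name bigram_counts and the sample sentences are made up for the example.

```python
from collections import Counter
from typing import Iterable


def bigram_counts(sentences: Iterable[str]) -> Counter:
    """Count adjacent token pairs, never pairing tokens across sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # zip pairs each token with its right neighbour in the same sentence
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical sample standing in for lines read from the archive.
sample = ["Я люблю читати книги .", "Я люблю каву ."]
counts = bigram_counts(sample)
print(counts.most_common(3))
```

Since the function accepts any iterable of strings, an open file object can be passed directly, so a large archive is processed line by line without loading it into memory.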

The archive contains sentences from the following periodicals:

Download

Corpus	# of tokens	# of sentences	Download	Download the lemmatized version
News	461451019	31021650	1.1GB	951MB
Wikipedia	185645357	15786948	403MB	371MB
Fiction	18323509	1811548	41MB	38MB
Ubercorpus	665419885	48620146	1.6GB	1.5GB
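As a quick consistency check on the table above, the Ubercorpus row should equal the sum of the News, Wikipedia, and Fiction rows. The short Python sketch below verifies this using the figures from the table and also derives the average sentence length per corpus; the variable names are illustrative only.

```python
# (tokens, sentences) for each part, copied from the table above
parts = {
    "News": (461451019, 31021650),
    "Wikipedia": (185645357, 15786948),
    "Fiction": (18323509, 1811548),
}
ubercorpus = (665419885, 48620146)

total_tokens = sum(tokens for tokens, _ in parts.values())
total_sentences = sum(sentences for _, sentences in parts.values())

# The combined corpus should match the sum of its parts.
assert (total_tokens, total_sentences) == ubercorpus

# Average number of tokens per sentence in each part.
avg_len = {name: tokens / sentences for name, (tokens, sentences) in parts.items()}
for name, value in avg_len.items():
    print(f"{name}: {value:.1f} tokens per sentence")
```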

Information for licensors:

We distribute texts in the form of shuffled sentences, which makes it impossible to restore the original text.
Texts are distributed on condition of Fair Use for the needs of statistical and scientific analysis.
We are a non-profit organization and do not gain any advantage from distributing the mentioned materials.
If you have any comments on the texts distributed here, please contact us at abuse@lang.org.ua

Corpus of laws and legal acts

Thanks to Oleksandr Shvets, we have a large (more than 9 GB) corpus of the laws and legal acts of Ukraine. We have tokenized and lemmatized it and have started calculating Word Embeddings for it.

Download

Corpus	# of tokens	# of sentences	Download	Download the lemmatized version
Laws and legal acts	578988264	29208302	560MB	498MB