NER annotation corpus
The corpus includes 229 texts from the Ukrainian Brown corpus, with 217,381 tokens and 6,751 annotated Named Entities.
Data description
The corpus of annotated data can be found in the data folder. It includes:
- 229 texts
- 217,381 tokens
- 6,751 Named Entities:
  - PERS - 4,060
  - LOC - 1,442
  - ORG - 649
  - MISC - 600
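The per-type counts above add up to the stated total, and they show a clear class imbalance that is worth keeping in mind when evaluating a model. A minimal sketch of that arithmetic (the variable names are illustrative and not part of the corpus tooling):

```python
# Entity counts as listed in the data description.
entity_counts = {"PERS": 4060, "LOC": 1442, "ORG": 649, "MISC": 600}

# The total should match the 6,751 Named Entities reported for the corpus.
total = sum(entity_counts.values())

# Share of each entity type, in percent of all annotated entities.
shares = {label: 100 * count / total for label, count in entity_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```

PERS accounts for well over half of all annotations, so per-type scores are likely to be more informative than a single aggregate score.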
The source of the data is the open Brown corpus of Ukrainian texts. For each processed text, there are two files:
- the file with the extension tok.txt contains the tokenized version of the text (tokenized according to the project's tokenization rules)
- the file with the extension tok.ann contains the NER annotations for this text in Brat Standoff Format. Each line has 3 tab-separated fields: the annotation ID, the entity type followed by the start and end offsets in the tokenized text (separated by spaces), and the text of the named entity.
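To illustrate how a tok.txt / tok.ann pair fits together, here is a minimal sketch of a parser for the annotation lines. It assumes the usual Brat Standoff layout for text-bound annotations (`ID<TAB>TYPE START END<TAB>TEXT`) with contiguous spans only. The `Entity` class, the `parse_ann` function and the sample sentence are hypothetical and are not taken from the corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    ann_id: str  # annotation ID, e.g. "T1"
    label: str   # entity type, e.g. "PERS"
    start: int   # start character offset in the tokenized text
    end: int     # end character offset (exclusive)
    text: str    # surface form of the entity


def parse_ann(ann_content: str) -> List[Entity]:
    """Parse text-bound annotations (lines starting with "T")."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip empty lines and other annotation kinds
        ann_id, span, text = line.split("\t")
        # Assumption: contiguous span written as "TYPE START END".
        label, start, end = span.split(" ")
        entities.append(Entity(ann_id, label, int(start), int(end), text))
    return entities


# Invented sample data standing in for one tok.txt / tok.ann pair.
tok_txt = "Тарас Шевченко народився в селі Моринці ."
tok_ann = "T1\tPERS 0 14\tТарас Шевченко\nT2\tLOC 32 39\tМоринці\n"

entities = parse_ann(tok_ann)
for ent in entities:
    # The offsets index directly into the tokenized text.
    print(ent.label, tok_txt[ent.start:ent.end])
```

Checking that `tok_txt[ent.start:ent.end] == ent.text` for every entity is a cheap sanity test after loading a file pair.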
Each text was annotated by two annotators according to the project's annotation rules. Disagreements between the annotators were resolved by a third annotator.
For training and validating models, we recommend using the standard split into dev and test sets.
Licensing
This data is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
UberText Corpus
We have collected and organized a large collection (more than 6 GB) of texts from Ukrainian periodicals, and we continue to extend the archive. Our goal is to reach one billion words.
The same texts are used to compute word embeddings. Unfortunately, the licence restrictions of some periodicals prohibit publishing their texts unchanged.
To give public access to this data, we split the texts into sentences and then shuffled the sentences randomly. As a result, anyone can use these texts to build statistical models that work at the sentence level. We also publish a lemmatized version of the texts in a separate archive. For tokenization and lemmatization, we used the nlp-uk library from Andriy Rysin and the BrUk group.
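The split-and-shuffle step can be sketched as follows. This is only an illustration of the idea: it uses a naive regex-based sentence splitter rather than the nlp-uk library mentioned above, and the function names and sample texts are hypothetical.

```python
import random
import re
from typing import Iterable, List, Optional


def split_sentences(text: str) -> List[str]:
    # Naive assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def shuffled_sentences(texts: Iterable[str], seed: Optional[int] = None) -> List[str]:
    # Pool the sentences of all texts, then shuffle them so that the
    # original order (and therefore the original documents) is lost.
    sentences = [s for text in texts for s in split_sentences(text)]
    random.Random(seed).shuffle(sentences)
    return sentences


# Invented sample input standing in for two source articles.
texts = ["Перше речення. Друге речення!", "Третє речення?"]
result = shuffled_sentences(texts, seed=0)
for sentence in result:
    print(sentence)
```

Every sentence survives intact, which is why sentence-level statistics remain usable, while anything that depends on document order or cross-sentence context does not.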
The archive contains sentences from the following periodicals:
Download
| Corpus | # of tokens | # of sentences | Download | Download the lemmatized version |
|---|---|---|---|---|
| News | 461451019 | 31021650 | 1.1GB | 951MB |
| Wikipedia | 185645357 | 15786948 | 403MB | 371MB |
| Fiction | 18323509 | 1811548 | 41MB | 38MB |
| Ubercorpus | 665419885 | 48620146 | 1.6GB | 1.5GB |
Information for licensors:
- We distribute the texts in the form of shuffled sentences, which makes it impossible to restore the original text.
- Texts are distributed on condition of Fair Use for the needs of statistical and scientific analysis.
- We are a non-profit organization and do not gain any advantage from distributing the mentioned materials.
- If you have any comments on the texts distributed here, please contact us at abuse@lang.org.ua
Corpus of laws and legal acts
Thanks to Oleksandr Shvets, we have a large (more than 9 GB) corpus of the laws and legal acts of Ukraine. We have tokenized and lemmatized it and have started computing word embeddings for it.
Download
| Corpus | # of tokens | # of sentences | Download | Download the lemmatized version |
|---|---|---|---|---|
| Laws and legal acts | 578988264 | 29208302 | 560MB | 498MB |