UberText 2.0

We introduce UberText 2.0, a new and extended version of UberText, a corpus of modern Ukrainian texts designed to meet various NLP needs.

The recent development of word embeddings, transformers, neural machine translation, speech-to-text models, and question answering systems opens new horizons for natural language processing. Most of these models rely heavily on the availability of corpora in the target language. While obtaining such a dataset is usually not a problem for languages such as English, Chinese, or Spanish, for low-resource languages the absence of publicly available corpora is a severe barrier for researchers.

The core concept of our corpus is that the same data, once collected and processed, can later be used to produce various deliverables suitable for different computational linguistics tasks. The corpus size, the additional annotation layers (such as POS tags and lemmas), and its availability for direct download make it an invaluable dataset. At the same time, the underlying data model and flexible architecture allow exporting a corpus version tailored to a particular task or research need.

The pipeline behind the corpus simplifies data collection, pre- and post-processing, and export of the deliverables, helping set up a regular release cycle so that end users can work with a fresh copy of the data or, when needed, update models built on previous versions.
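To illustrate the export idea, here is a minimal Python sketch of how a task-specific deliverable could be assembled from processed documents. The Document schema, its field names, and the export_sentences helper are hypothetical, introduced only for illustration; they are not the actual UberText data model or tooling.

from dataclasses import dataclass

# Hypothetical document record; the real UberText data model is richer.
@dataclass
class Document:
    subcorpus: str   # e.g. "news", "fiction", "social", "wikipedia", "court"
    date: str        # publication date, ISO 8601
    sentences: list  # pre-split sentences

def export_sentences(docs, subcorpora, min_date=""):
    """Yield sentences from the selected subcorpora, filtered by date."""
    for doc in docs:
        if doc.subcorpus in subcorpora and doc.date >= min_date:
            yield from doc.sentences

# Example deliverable: only recent news and social media sentences.
docs = [
    Document("news", "2022-05-01", ["Перше речення.", "Друге речення."]),
    Document("court", "2019-01-10", ["Ухвала суду."]),
]
with open("news_social_2022.txt", "w", encoding="utf-8") as out:
    for sentence in export_sentences(docs, {"news", "social"}, "2022-01-01"):
        out.write(sentence + "\n")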

Corpus composition

UberText 2.0 has five subcorpora:

  • news (short news, longer articles, interviews, opinions, and blogs) scraped from 38 central, regional, and industry-specific news websites;
  • fiction (novels, prose, and some poetry) scraped from two public libraries;
  • social (264 public Telegram channels), acquired from the TGSearch project;
  • wikipedia — the Ukrainian Wikipedia as of January 2023;
  • court (decisions of the Supreme Court of Ukraine), received upon request for public information.

Projects using UberText 2.0

This is an incomplete list of models and datasets created using UberText 2.0:

  • the first flair embeddings (Akbik et al., 2018) for the Ukrainian language (see the loading sketch after this list);
  • compact POS tagging and NER models trained on these embeddings;
  • high-quality fastText vectors (skipgram, cbow);
  • lean language models for a Ukrainian speech-to-text project;
  • models for punctuation restoration;
  • GPT-2 models of different sizes for the Ukrainian language, and their instruction fine-tuning for various tasks;
  • a paraphrase-multilingual-mpnet-base-v2 sentence transformer fine-tuned on sentences mined from the corpus to achieve better performance on the WSD task;
  • Electra embeddings trained on UberText 2.0 and a combined corpus, made by Stefan Schweter;
  • transformer-based accentuation (word-stress placement) for Ukrainian, trained on sentences from UberText 2.0 accented with ukrainian-word-stress; it works 10 times faster than ukrainian-word-stress.
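As an illustration, below is a minimal sketch of loading the Ukrainian flair embeddings through the flair library. The model names "uk-forward" and "uk-backward" are assumptions here; check the flair documentation for the exact identifiers registered for Ukrainian.

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

# Stack the forward and backward character language models;
# the "uk-forward"/"uk-backward" names are an assumption.
embeddings = StackedEmbeddings([
    FlairEmbeddings("uk-forward"),
    FlairEmbeddings("uk-backward"),
])

sentence = Sentence("Привіт, світе!")
embeddings.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)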

Citing UberText 2.0

@inproceedings{chaplynskyi-2023-introducing,
title = "Introducing {U}ber{T}ext 2.0: A Corpus of Modern {U}krainian at Scale",
author = "Chaplynskyi, Dmytro",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.1",
pages = "1--10",
}

The preprint is available here.

Download links

Below you can find download links for the different layers of each subcorpus.

subcorpus   base     cleansed   split into sentences   tokenized
court       302 MB   302 MB     300 MB                 300 MB
fiction     414 MB   414 MB     398 MB                 398 MB
news        3.5 GB   3.5 GB     3.4 GB                 3.4 GB
social      91 MB    91 MB      87 MB                  87 MB
wikipedia   806 MB   806 MB     803 MB                 798 MB
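Once downloaded, each layer can be streamed line by line. The sketch below assumes the sentence-split layer is UTF-8 plain text with one sentence per line and uses a hypothetical file name; verify both against the actual release.

# Stream a sentence-split layer without loading it into memory.
def iter_sentences(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# The file name below is hypothetical; adjust it to the downloaded file.
n_sentences = sum(1 for _ in iter_sentences("ubertext.fiction.sentences.txt"))
print(n_sentences, "sentences")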

The lemma frequency dictionary built on the combined corpora is also available. For each lemma it lists the POS tag (Universal Dependencies), the lemma's frequency in the corpora, and the number of documents it appears in.
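A minimal sketch of loading such a frequency dictionary follows. The tab-separated layout with columns lemma, UD POS tag, frequency, and document count, as well as the file name, are assumptions to verify against the actual file.

import csv

# Assumed TSV layout: lemma, UD POS tag, corpus frequency, document count.
freq = {}
with open("lemma_freq.tsv", encoding="utf-8") as f:
    for lemma, pos, count, doc_count in csv.reader(f, delimiter="\t"):
        freq[(lemma, pos)] = (int(count), int(doc_count))

# Example lookup: frequency of the noun "мова" ("language").
print(freq.get(("мова", "NOUN")))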

Previous versions of UberText

The first version of the UberText corpus is available here. Use it when you need to compare your results against earlier experiments or to measure the impact of corpus size on intrinsic/extrinsic metrics.