Filling the Gap in Greek NLP
Modern Greek is the official language of Greece, one of the official languages of Cyprus, and the native language of around 13 million people. Yet, despite its long history and linguistic richness, Modern Greek remains relatively under-served in mainstream natural language processing (NLP) tools and resources.
Most widely used multilingual models and toolkits have been optimized for English and a handful of other high-resource languages. Greek, with its own alphabet, complex morphology, flexible word order, and the widespread phenomenon of Greeklish, presents challenges that are not adequately addressed by generic tools.
GR-NLP-TOOLKIT was created precisely to respond to this gap: a dedicated, open-source NLP toolkit tailored to Modern Greek, built on top of pre-trained Transformers and designed to deliver state-of-the-art performance.
What is GR-NLP-TOOLKIT?
GR-NLP-TOOLKIT is an open-source natural language processing toolkit developed specifically for Modern Greek by researchers at the Athens University of Economics and Business, in collaboration with Archimedes/Athena RC and helvia.ai.
The toolkit:
- is based on pre-trained Transformer models,
- is freely available,
- can be easily installed in Python via PyPI (
pip install gr-nlp-toolkit), - and is also exposed through a HuggingFace demo space and a public HTTP API (GREEK-NLP-API) for non-commercial use.
Its aim is to serve as a robust, ready-to-use solution for anyone working seriously with Greek text: researchers, students, public bodies, companies, and independent developers.
Five Core NLP Tasks for Modern Greek
GR-NLP-TOOLKIT delivers state-of-the-art performance in five key NLP tasks:
- Part-of-Speech (POS) Tagging
Assigning a POS tag (verb, noun, adjective, etc.) to each token. - Morphological Tagging
Enriching POS tagging with detailed grammatical features such as tense, mood, voice, person, number, gender, and case. - Dependency Parsing
Identifying syntactic relations between words (e.g., subject, object, modifiers), producing dependency trees for Greek sentences. - Named Entity Recognition (NER)
Detecting and classifying entities such as persons, organizations, locations, dates, quantities, and more in Greek text. - Greeklish-to-Greek Transliteration
Converting Greek written with Latin-keyboard characters (Greeklish) back into properly written Greek script.
The last feature is particularly important for real-world Greek text, where informal communication often mixes Greek, Greeklish, and English.
Why Greek is Difficult for Multilingual Models
Multilingual transformer models such as XLM-R have enabled cross-lingual NLP at scale. However, Greek is still under-represented in many of the corpora used for pretraining such models. This leads to two main issues:
- Vocabulary under-representation: many Greek words are rare in multilingual corpora, causing tokenizers to over-fragment them, sometimes down to individual characters.
- Complex morphology and flexible word order: make tasks like POS tagging, morphological tagging, and parsing more challenging.
GR-NLP-TOOLKIT addresses these issues by building on GREEK-BERT, a dedicated Greek Transformer model, which has been shown to consistently outperform generic multilingual models like XLM-R on several Greek NLP tasks, including NER and dependency parsing.
Greeklish: When Greek Switches Keyboard
A distinctive phenomenon in Greek digital communication is Greeklish: Greek written with Latin characters, for instance:h athina kai h thessaloniki einai poleis.
There is no universally accepted mapping between Greek and Latin characters, which means:
- the same Greek word can be written in many different Greeklish variants,
- most Greek NLP datasets and models are trained only on Greek script,
- generic NLP tools struggle to interpret Greeklish text.
GR-NLP-TOOLKIT integrates a state-of-the-art Greeklish-to-Greek converter, based on the BYT5 model, which operates directly on bytes and is well-suited for mixed-alphabet text.
This allows users to:
- first normalize Greeklish into Greek script,
- then apply POS tagging, NER, or parsing on the normalized output, even within a single processing pipeline.
How Does It Compare to Existing Toolkits?
Several well-known NLP toolkits provide some level of support for Modern Greek:
- NLTK: offers only very basic functionality (tokenization and stop-word lists).
- spaCy: supports POS tagging, morphological tagging, lemmatization, NER, and dependency parsing for Greek, but relies on static FASTTEXT embeddings, without Greek-specific Transformers.
- Stanza: provides POS tagging, morphological tagging, lemmatization, and dependency parsing for Greek, again based on FASTTEXT, and does not support NER or Greeklish-to-Greek.
In contrast, GR-NLP-TOOLKIT:
- leverages GREEK-BERT for POS tagging, morphological tagging, NER, and dependency parsing,
- includes a BYT5-based Greeklish-to-Greek transliteration module,
- and, according to experiments reported by the authors, achieves state-of-the-art results on Greek NER, POS and morphological tagging, and dependency parsing, outperforming or matching spaCy and Stanza on these tasks.
Most importantly, it is the only open-source toolkit that currently offers integrated Greeklish-to-Greek conversion as a first-class feature.
Access: Python Package, Demo Space, and Public API
GR-NLP-TOOLKIT is designed to be accessible across different levels of technical expertise:
- Python developers can install it with:
pip install gr-nlp-toolkitand build processing pipelines combining POS tagging, NER, dependency parsing, and Greeklish-to-Greek. - Non-programmers or exploratory users can experiment via an open-access demo on HuggingFace, where they can choose tasks, type or paste Greek/Greeklish text, and inspect the results through a web interface.
- Developers in other languages (e.g., Java, JavaScript, Go, etc.) can integrate the functionality via the GREEK-NLP-API, a fully documented HTTP API that follows the OpenAPI standard and is available for non-commercial use.
The source code is hosted on GitHub, and contributions from the open-source community are explicitly encouraged.
Conclusions and Future Directions
GR-NLP-TOOLKIT is a major step forward for Greek-language NLP:
- it provides a modern, Transformer-based, open-source toolkit tailored to the linguistic realities of Modern Greek,
- it bridges the gap between general-purpose multilingual models and the specific needs of Greek text processing,
- and it uniquely addresses Greeklish, an essential aspect of real-world Greek digital communication.
The authors plan to extend the toolkit with additional components, such as toxicity detection and sentiment analysis, and invite collaboration from researchers, organizations, and developers interested in strengthening the Greek AI and NLP ecosystem.
In short, GR-NLP-TOOLKIT turns Modern Greek from a “low-resource” afterthought into a first-class citizen in the world of open, state-of-the-art natural language processing.
—
Source of this article: arxiv.org