Natural Language Processing for Modern Greek has long faced structural challenges: limited high-quality datasets, fragmented modeling efforts, and a shortage of architectures capable of handling complex, domain-specific text. Legal documents, in particular, require models with exceptional contextual depth and linguistic precision.
To address these gaps, Novelcore introduces the Greek Embedding Models (GEMs), a new family of transformer-based models designed from the ground up to advance Greek NLP through architectural diversity and quality-driven corpus curation.
Why Greek Needs New Models
Although Greek NLP has evolved with models such as Greek-BERT, Greek-Legal-BERT, Greek-Legal-RoBERTa and Greek Longformer, progress has remained uneven. Prior work relied heavily on early encoder architectures and small context windows, leaving many of the advances that define modern transformer design unexplored.
GEMs unify these efforts through a systematic, research-driven methodology.
A Diverse and Modern Set of Architectures
The GEM family encompasses both established and cutting-edge encoders:
• RoBERTa
• ELECTRA
• ConvBERT
• ModernBERT
• Longformer with 1024-token context
Additionally, GEMs introduce the first bilingual Greek–English embedding models optimized for cross-lingual legal tasks.
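For readers who want to see how such an encoder would be used in practice, here is a minimal sketch of extracting sentence embeddings with the Hugging Face transformers library. The checkpoint name "novelcore/gem-roberta" is an assumed placeholder, not a confirmed model ID from the GEM release, and the mean-pooling step is a common convention rather than the authors' prescribed recipe.

```python
# Minimal embedding sketch; the checkpoint name below is a placeholder, not
# an identifier published with the GEM release.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "novelcore/gem-roberta"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = [
    "Ο νόμος δημοσιεύεται στην Εφημερίδα της Κυβερνήσεως.",
    "Το δικαστήριο απέρριψε την αίτηση αναίρεσης.",
]

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool token vectors, ignoring padding positions, to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # e.g. torch.Size([2, 768])
```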
Quality-Based Corpus Curation
A central innovation of GEMs is the emphasis on dataset quality:
• curated Greek legal corpora from official sources,
• large general-domain datasets carefully cleaned and deduplicated,
• targeted repetition of high-quality legal subcorpora such as the Raptarchis dictionary,
• a custom bilingual corpus combining Greek and English legal text.
This data-centric approach ensures that GEMs learn robust, domain-appropriate representations.
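To make the cleaning and deduplication step concrete, the sketch below shows a generic exact-deduplication pass over a small corpus: Unicode normalization, whitespace collapsing, and hashing of the normalized text. It illustrates the kind of operation described above; the actual GEM pipeline is not specified here, so treat this as an illustrative example rather than the authors' code.

```python
# Generic cleaning + exact-deduplication sketch (not the GEM authors' pipeline).
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Yield the first occurrence of each normalized document."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = [
    "Άρθρο 1. Πεδίο εφαρμογής.",
    "Άρθρο 1.  Πεδίο   εφαρμογής.",  # duplicate after whitespace collapsing
    "Άρθρο 2. Ορισμοί.",
]
print(list(deduplicate(corpus)))  # two unique documents remain
```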
Experimental Results
Across three major benchmarks—NER, legal topic classification and natural language inference—the GEM-RoBERTa and GEM-ConvBERT models consistently outperform all existing Greek models. Statistical analysis confirms their superiority, while bilingual models provide competitive and sometimes improved performance, demonstrating the benefits of cross-lingual learning.
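To give a sense of how such benchmark evaluations are typically run, here is a hypothetical fine-tuning skeleton for legal topic classification built on the Hugging Face Trainer. The checkpoint name and the two-example toy dataset are placeholders introduced for illustration only; they are not identifiers or data published with the GEM release, and the hyperparameters are generic defaults, not the paper's settings.

```python
# Hypothetical fine-tuning skeleton for legal topic classification with a GEM
# encoder; checkpoint name, toy data, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "novelcore/gem-roberta"  # assumed checkpoint name

# Toy labelled examples standing in for a real legal topic classification set.
train = Dataset.from_dict({
    "text": [
        "Άρθρο 1: Σκοπός του παρόντος νόμου ...",
        "Η σύμβαση εργασίας λύεται με καταγγελία ...",
    ],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = train.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="gem-topic-clf",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer)
trainer.train()
```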
Looking Ahead
The GEMs family establishes new baselines for Greek NLP and provides a unified foundation for future research. Upcoming directions include applications in summarization, question answering, information retrieval and expansion into other high-value domains such as biomedicine and financial text analysis.
—
Source of this article: arxiv.org