GRDD+: A Large-Scale Greek Dialectal Dataset for the Age of LLMs

LLMs meet Greek dialects

Modern Greek is far from monolithic. From Cretan and Cypriot Greek to Pontic, Heptanesian, Tsakonian and Griko, the Greek-speaking world is characterised by rich dialectal variation shaped by geography, contact, and history.

Meanwhile, Large Language Models (LLMs) dominate contemporary NLP. Despite their impressive performance on standard language, LLMs struggle significantly with dialects, especially in low-resource languages. Dialectal Greek is a textbook example: most models have seen little to no data from regional varieties, and their performance quickly degrades outside Standard Modern Greek.

The GRDD+ (Greek Dialectal Dataset Plus) project directly addresses this gap, offering a large, carefully constructed dataset of Greek dialects and using it to evaluate how far fine-tuned LLMs can go in generating natural dialectal text.

What is GRDD+?

GRDD+ is an extended Greek dialectal dataset that builds on the original GRDD corpus and now includes:

  • 10 dialects/varieties,
  • a total of 6,374,939 words,
  • a mix of high-resource and severely endangered Greek varieties.

The original GRDD included four dialects:

  • Cretan Greek,
  • Pontic Greek,
  • Cypriot Greek,
  • Northern Greek,

GRDD+ significantly enlarges these corpora and introduces six additional varieties:

  • Greco-Corsican (the now extinct Greek of Cargèse in Corsica),
  • Griko (Southern Italian Greek, an endangered minority language),
  • Heptanesian (Ionian Islands),
  • Tsakonian (a highly divergent variety with Doric roots),
  • Maniot (Laconian Mani),
  • Katharevousa (the historical “purist” written variety).

It also adds CretDeiAdv, a specialised subcorpus focusing on Cretan deictic adverbs.

To date, this is the first dataset of such size and dialectal breadth for Greek, and one of the very few of its kind internationally for a language with rich internal variation.

Data collection and structure

The GRDD+ dataset was compiled from freely available dialectal sources, including:

  • blogs and websites,
  • literary and folklore texts (songs, poems, tales, dialogues),
  • translations into dialects made by native speakers,
  • digitised books (via Google Cloud Vision OCR; see the sketch below), especially for varieties such as Greco-Corsican, Griko, Heptanesian, Maniot, and Pontic.
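For the OCR step, a minimal sketch of how a scanned page might be processed with the Google Cloud Vision Python client; the filename is hypothetical and credentials are assumed to be configured separately:

    from google.cloud import vision

    def ocr_page(path: str) -> str:
        """Return the raw text detected on a single scanned book page."""
        client = vision.ImageAnnotatorClient()
        with open(path, "rb") as f:
            image = vision.Image(content=f.read())
        # document_text_detection targets dense text such as book pages
        response = client.document_text_detection(image=image)
        if response.error.message:
            raise RuntimeError(response.error.message)
        return response.full_text_annotation.text

    print(ocr_page("maniot_page_001.png"))  # hypothetical filename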

After collection, the data was cleaned by the following steps (sketched in code after the list):

  • removing numbers, URLs, special characters, duplicate lines, and extra whitespace,
  • extracting only the dialectal text from OCR’d books (excluding metadata and paratext).
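A minimal sketch of such a cleaning pass in Python; the exact rules used for GRDD+ are not spelled out here, so the regular expressions below are illustrative assumptions:

    import re

    def clean_lines(raw_text: str) -> str:
        """Remove URLs, numbers, special characters, extra whitespace,
        and duplicate lines (assumed rules, not GRDD+'s exact ones)."""
        seen, kept = set(), []
        for line in raw_text.splitlines():
            line = re.sub(r"https?://\S+", " ", line)         # URLs
            line = re.sub(r"\d+", " ", line)                  # numbers
            line = re.sub(r"[^\w\s.,;·!?'\"()-]", " ", line)  # special characters; \w keeps Greek letters
            line = re.sub(r"\s+", " ", line).strip()          # extra whitespace
            if line and line not in seen:                     # duplicate lines
                seen.add(line)
                kept.append(line)
        return "\n".join(kept)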

The result is a corpus that approximately doubles the size of the original GRDD (from ≈3.15M to ≈6.37M words). It now spans:

  • major living dialects like Cretan, Cypriot and Pontic,
  • endangered or near-extinct varieties like Griko, Tsakonian and Greco-Corsican,
  • and a large body of Katharevousa texts, important for diachronic and stylistic studies.

GRDD+ thus serves both as a technical resource for NLP and a linguistic documentation asset.

From data to models: fine-tuning LLMs on Greek dialects

Beyond dataset construction, the authors conducted a systematic study of how dialectal fine-tuning affects LLM performance.

They:

  • built a fine-tuning dataset of 26,118 examples from four dialects (Cretan, Pontic, Northern Greek, Cypriot Greek),
  • segmented the texts into 100-word chunks with a sliding window and created prompt–completion pairs with dialect-specific instructions (e.g. “Γράψε στην κρητική διάλεκτο: …”, “Write in the Cretan dialect: …”), as sketched after this list,
  • trained three 8B-parameter model architectures:
    • Llama-3-8B,
    • Llama-3.1-8B,
    • Krikri-8B (a Greek-specialised LLM trained on 56.7B Greek tokens).
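A minimal sketch of the chunking step; the 100-word window comes from the paper, while the non-overlapping stride and the corpus filename are assumptions:

    def make_pairs(text: str, instruction: str,
                   window: int = 100, stride: int = 100) -> list[dict]:
        """Split a dialectal text into word chunks and wrap each chunk
        in a prompt–completion pair with a dialect-specific instruction."""
        words = text.split()
        pairs = []
        for start in range(0, len(words) - window + 1, stride):
            chunk = " ".join(words[start:start + window])
            pairs.append({"prompt": instruction, "completion": chunk})
        return pairs

    # Hypothetical usage on a Cretan corpus file:
    cretan_text = open("cretan.txt", encoding="utf-8").read()
    pairs = make_pairs(cretan_text, "Γράψε στην κρητική διάλεκτο:")
    print(len(pairs), pairs[0]["completion"][:50])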

Fine-tuning was performed using LoRA (Low-Rank Adaptation), a parameter-efficient method that updates only a small fraction of model parameters while keeping the base model frozen.
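With Hugging Face's PEFT library, such a setup looks roughly as follows; the rank, scaling, dropout, target modules, and checkpoint name are illustrative assumptions, not the paper's reported configuration:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    lora = LoraConfig(
        r=16,                                 # low-rank dimension (assumed)
        lora_alpha=32,                        # scaling factor (assumed)
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora)        # base weights stay frozen
    model.print_trainable_parameters()        # only a small fraction is trainable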

The fine-tuned models were then compared against:

  • their base (non–fine-tuned) versions, and
  • three frontier models: Claude-3.7-Sonnet, Gemini-2.5-Pro, and ChatGPT-5.

Evaluation with native speakers

Model quality was assessed via human evaluation rather than automatic metrics.

For each dialect:

  • 7 different prompts were used (a short story, medium-length stories, a long story, a dialogue, and creative writing),
  • each model generated 7 texts per dialect,
  • native speakers of the relevant dialect rated each output on a 1–5 scale of dialectal naturalness:
    • 5 = perfectly natural, native-like,
    • 1 = not dialectal / completely unnatural.

Inter-rater reliability was analysed with multiple metrics:

  • Krippendorff’s Alpha (≈0.37–0.55) showed fair to moderate agreement on absolute scores,
  • ICC(3,1) (0.87–0.96) showed excellent consistency in relative rankings,
  • Weighted Cohen’s Kappa fell in between, confirming that raters broadly agreed on which models were better or worse even if they used the 1–5 scale somewhat differently.

This validates the use of averaged scores for comparing models.
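For readers who want to run such checks themselves, here is a minimal sketch on a toy raters × items matrix; the library choices (krippendorff, pingouin, scikit-learn) and the toy scores are assumptions, not the paper's code:

    import numpy as np
    import pandas as pd
    import pingouin as pg
    import krippendorff
    from sklearn.metrics import cohen_kappa_score

    ratings = np.array([   # rows = raters, columns = generated texts (toy data)
        [4, 3, 1, 5, 2],
        [5, 3, 2, 4, 2],
        [4, 2, 1, 5, 3],
    ])

    # Agreement on absolute scores, treating the 1-5 scale as ordinal
    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="ordinal")

    # Consistency of relative rankings: ICC needs long-format data
    long_df = pd.DataFrame(
        [(r, i, ratings[r, i])
         for r in range(ratings.shape[0])
         for i in range(ratings.shape[1])],
        columns=["rater", "item", "score"],
    )
    icc = pg.intraclass_corr(data=long_df, targets="item",
                             raters="rater", ratings="score")

    # Pairwise weighted kappa, penalising large disagreements more heavily
    kappa = cohen_kappa_score(ratings[0], ratings[1], weights="quadratic")

    print(f"alpha={alpha:.2f}  kappa(raters 1 vs 2)={kappa:.2f}")
    print(icc[icc["Type"] == "ICC3"])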

Key findings: what works for dialectal Greek?

The results offer several important insights:

1. Base models are almost dialect-blind

The non–fine-tuned versions of Llama-3, Llama-3.1 and Krikri scored around 1.0–1.5 out of 5, showing essentially no dialectal capability.

2. Dialectal fine-tuning makes a big difference

All fine-tuned models saw gains of roughly +1.5 to +2.0 points on the 5-point naturalness scale. In practice, this means moving from “clearly wrong or non-dialectal” to “reasonably to highly natural” dialectal text.

3. Greek-specialised doesn’t always mean best

Although Krikri-8B is the only model explicitly trained on large-scale Greek data, it did not consistently outperform the other Llama variants after fine-tuning:

  • it ranked first only for Northern Greek,
  • second or third for Cretan, Cypriot, and Pontic.

This suggests that dialect-specific adaptation may outweigh general Greek specialisation when the goal is dialectal generation.

4. Frontier models are strong but not unbeatable

  • Claude-3.7-Sonnet performed very well across the board, topping Cretan and Northern Greek and placing second (close to the top) for Cypriot and Pontic.
  • ChatGPT-5 showed solid but somewhat inconsistent performance across dialects.
  • Gemini-2.5-Pro consistently underperformed in dialectal tasks.

In several cases, a fine-tuned 8B model outperformed all frontier models on a given dialect – a key result for resource-constrained scenarios.

5. More data is not always better (on its own)

A particularly intriguing pattern emerged:

  • Northern Greek, with only 333 fine-tuning examples (1.7% of the training set), achieved strong scores (up to 3.86/5),
  • Pontic, with twelve times more examples, scored lower and never crossed the 3.0 threshold for any fine-tuned model.

This points to a more complex relationship between:

  • data size,
  • data quality and homogeneity,
  • and linguistic distance from Standard Modern Greek.

Limitations and future directions

The authors are explicit about the limitations:

  • Dataset imbalance: Cretan and Cypriot are heavily represented, while Northern Greek is severely under-represented.
  • Subjective evaluation: even with good reliability metrics, naturalness judgments retain an inevitable subjective component.
  • Restricted model set: only three 8B models and one LoRA configuration were explored.
  • Lack of formal distance metrics: hypotheses about linguistic distance to Standard Modern Greek remain qualitative.
  • Sociolinguistic complexity: phenomena like diglossia, code-switching, register and genre were not modelled in depth.

Future work will include:

  • fine-tuning on all six newly added varieties (Greco-Corsican, Griko, Heptanesian, Tsakonian, Maniot, Katharevousa),
  • experimenting with additional architectures (e.g. Mistral, Gemma) and parameter-efficient strategies,
  • designing automatic metrics for dialectal quality,
  • expanding and refining the dataset with better genre labels and sociolinguistic annotations.

Why GRDD+ matters for Greek AI and beyond

The GRDD+ project demonstrates a crucial, practical point:

Even relatively small amounts of high-quality dialectal data can unlock substantial improvements in LLM performance on regional varieties.

For the Greek AI ecosystem, this means:

  • we can build dialect-aware LLMs without needing frontier-scale resources,
  • we can better support linguistic diversity, cultural heritage and minority communities,
  • we can design applications – from conversational agents to educational tools – that speak to users in their own dialects, not just in a standardised variety.

GRDD+ is more than a dataset. It is a foundation for Greek dialectal NLP and a concrete example of how targeted data and careful evaluation can make LLMs more inclusive, accurate, and culturally aware – in Greek and, by extension, in any language with rich internal variation.

Source of this article: arxiv.org