Local Open AI and AI Factories: a practical architecture for safer, cheaper and more democratic AI
From PHAROS to local models: a layered architecture for open AI
The right strategy for artificial intelligence is not to choose one single technological solution. Not everything needs to run on a supercomputer, and it is equally unreasonable for every public body, university, school or business to depend permanently on commercial cloud APIs. The rational approach is layered: national high-performance computing infrastructure for heavy workloads, local open infrastructure for secure everyday use, and alternative hardware platforms to avoid a new form of technological lock-in.
PHAROS and DAEDALUS belong to the first layer. They are national infrastructures for large-scale tasks: training or substantial adaptation of large models, benchmarking Greek language models, creating and validating high-quality datasets, scientific simulations, applications in health, culture, climate and sustainability, and support for research teams and start-ups that need compute capacity beyond the reach of a single organisation. This is the role of the supercomputer: it creates, tests, compares and improves models.
The second layer is local open-source AI. Here the goal is not to train a huge model from scratch, but to run a secure, affordable and auditable AI service close to the data. A pilot deployment may use, for example, two Apple Mac Studio M3 Ultra nodes with 256 GB unified memory and two NVIDIA DGX Spark GB10 nodes with 128 GB unified memory, 4 TB NVMe storage per node, and support for Metal, CUDA, llama.cpp, vLLM and TensorRT-LLM. The Apple nodes are well suited for low-power always-on inference, small and medium models, embeddings, document search, summarisation, transcription and privacy-sensitive workflows. The NVIDIA nodes cover heavier inference, larger models, batch processing and more demanding experimentation.
The third layer, which is becoming increasingly important, is the AMD/ROCm ecosystem. AMD now offers an alternative path for low-cost open LLMs, both in data centres and in local deployments. At the data centre level, AMD Instinct MI300X, MI325X and MI350 are relevant mainly because of their very large memory per accelerator: 192 GB HBM3 on MI300X, 256 GB HBM3E on MI325X and up to 288 GB HBM3E on the MI350 series. For large open models, memory is a key cost factor. When more parameters fit on one GPU or on fewer GPUs, complexity, interconnect requirements, power consumption and total cost of ownership can be reduced.
The value of the AMD ecosystem is not only in hardware. It is also in ROCm, AMD’s open source software stack for accelerated AI and HPC computing. ROCm documentation now describes support for major LLM serving engines, including vLLM and Hugging Face Text Generation Inference. At the same time, tools such as llama.cpp, Ollama, Vulkan and HIP/ROCm enable heterogeneous deployments, where different hardware can be used depending on the workload. This matters for organisations that do not want to be locked into a single vendor chain.
In practice, a mature strategy for low-cost open LLMs should not be framed as “NVIDIA or AMD”, “Apple or data centre”, “PHAROS or local node”. It should be combined. PHAROS and DAEDALUS should be used for heavy training, evaluation and national model infrastructure. Local NVIDIA and Apple nodes should be used for immediate, reliable and secure inference inside organisations. AMD/ROCm solutions add competition, large memory per GPU, potential cost advantages and an alternative open software ecosystem.
This is especially important for public administration, businesses and education. A municipality, university or hospital can use local models for everyday tasks such as regulatory search, document summarisation, request classification, user support and secure access to internal knowledge. A ministry or research centre can use PHAROS for larger experiments and evaluations. A small or medium-sized enterprise can start with a workstation or small local node and later scale to a data centre. What matters is that the architecture is based on open standards, open models where possible, interchangeable backends and publicly accountable governance.
This avoids two mistakes. The first is the illusion that everything must be solved by one central super-system. The second is dependence on thousands of disconnected small solutions without common standards, security or evaluation. Democratic artificial intelligence needs central compute power where it is necessary, local control where it is critical, and an open hardware and software ecosystem so that public money builds public capability.
Article sources:
GRNET, Pharos: The Greek AI Factory: The official description of the Greek AI Factory, aiming to connect supercomputing power, research, the public sector, businesses and critical domains such as health, language, culture and sustainability: https://grnet.gr/business-directory/Pharos-AI-Factory/,
GRNET, DAEDALUS in Lavrio: The official announcement on the implementation of the new European supercomputer DAEDALUS in Lavrio and its total computing power: https://grnet.gr/en/2025/03/26/daedalus-dc-ylopoihsh-lavrio/,
Apple, Mac Studio Technical Specifications: Apple’s official technical page documents that the Mac Studio with M3 Ultra can be configured with a 32-core CPU, 80-core GPU, 32-core Neural Engine, 256 GB of unified memory and SSD storage up to 16 TB, features that are important for local inference and large quantized models: https://support.apple.com/en-us/122211/,
Apple, Mac Studio Specs: Apple’s official product specification page also lists 819 GB/s memory bandwidth for the M3 Ultra, an important characteristic for AI workloads that benefit from Apple Silicon’s unified memory architecture: https://www.apple.com/mac-studio/specs/,
Apple Developer, Accelerated PyTorch training on Mac: Apple’s documentation shows that PyTorch can be accelerated on Mac through Metal Performance Shaders, confirming that Apple Silicon can support accelerated machine learning workflows: https://developer.apple.com/metal/pytorch/,
Apple Machine Learning Research, MLX: MLX is Apple’s machine learning framework for Apple Silicon, optimized for the unified memory architecture, and is a key source for documenting Apple’s local AI software ecosystem: https://opensource.apple.com/projects/mlx,
AMD ROCm Documentation, Deploying your model: The ROCm documentation explicitly states support for vLLM and Hugging Face Text Generation Inference for serving large language models on AMD GPUs, making it the most important source for the software layer of low-cost open LLMs on AMD: https://rocm.docs.amd.com/en/docs-7.0.0/how-to/rocm-for-ai/inference/deploy-your-model.html,
AMD Instinct MI300X: AMD’s official page documents the MI300X as an accelerator for Generative AI and HPC, with 192 GB of HBM3 memory and 5.325 TB/s theoretical memory bandwidth, features that are critical for large open models: https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html,
AMD Instinct MI325X / MI350 Series: AMD’s official pages list 256 GB of HBM3E memory and 6 TB/s theoretical bandwidth for the MI325X, as well as 288 GB of HBM3E memory, 8 TB/s and new data types such as MXFP6 and MXFP4 for the MI350 series, supporting the argument that large memory per GPU can reduce complexity and cost in large open LLM deployments: https://www.amd.com/en/products/accelerators/instinct/mi300/mi325x.html and https://www.amd.com/en/products/accelerators/instinct/mi350.html,
AMD ROCm on Radeon and Ryzen: The documentation for Radeon and Ryzen shows that ROCm is not limited to data centers, but also covers client and edge environments, with support for tools such as PyTorch, TensorFlow, vLLM and llama.cpp: https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/index.html,
Unsloth, Fine-tuning LLMs on AMD GPUs: The Unsloth documentation states support for AMD GPUs, including Radeon RX and MI300X, for low-cost local fine-tuning of large language models: https://unsloth.ai/docs/get-started/install/amd,
llama.cpp, Build documentation: The llama.cpp documentation lists support for multiple backends, including Metal, CUDA, HIP and Vulkan, making it a critical tool for heterogeneous local AI infrastructures using Apple, NVIDIA and AMD hardware: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.