June 15, 2026

Why Today’s LLM Agents Do Not Self-Evolve as Faithfully as We Assume

Performance gains are not the same as faithful learning

A powerful assumption has entered the debate on large language model agents: if an agent stores past experience, summarizes it, retrieves it later and performs better, then it must be learning from experience. This assumption is central to many current ideas about self-evolving LLMs. It supports the belief that agents can improve during deployment without changing the underlying model weights, simply by accumulating memory, extracting lessons and reusing those lessons in future tasks.

The problem is that improved performance does not prove faithful learning. A system may perform better because previous traces help it imitate a useful trajectory. It may benefit from formatting cues, retrieval effects, longer prompts or task-specific priors. It may even appear to use an experience while ignoring its actual meaning. This distinction matters. If we want to deploy LLM agents in public administration, education, health, research or cybersecurity, we cannot be satisfied with the claim that memory “helps”. We need to know whether the system’s decisions are causally grounded in the experience it was given.

The recent paper “Large Language Model Agents Are Not Always Faithful Self-Evolvers” is important precisely because it tests this question. It does not ask only whether experience improves agents. It asks whether agents faithfully rely on that experience. The answer is uncomfortable for the dominant narrative: agents are much more faithful to raw experience than to condensed experience. They often rely on concrete traces, but disregard, misuse or weakly integrate abstract summaries.

The fragile promise of condensed experience

Current self-evolving methods often distinguish between raw and condensed experience. Raw experience contains detailed traces: observations, actions, intermediate states, trajectories and outcomes. Condensed experience is different. It consists of distilled lessons, heuristics, reusable strategies or abstract guidance extracted from previous successes and failures.

At first sight, condensed experience seems more scalable. Full trajectories are long and expensive. Summaries are compact, transferable and easier to retrieve. This makes them attractive for long-running agents. But the paper shows that this convenience comes at a cost. When the authors perturb raw experience by removing it, shuffling it or replacing it with irrelevant trajectories, agent performance often changes substantially. This indicates genuine dependence on concrete experience. When they perturb condensed experience in the same way, however, the effect is frequently weak. In some settings, corrupted, irrelevant or even semantically empty condensed content barely changes downstream behavior.
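The logic of such a perturbation test can be illustrated with a minimal sketch. It is not the paper's evaluation code: the names `perturb`, `faithfulness_gap` and `run_agent`, the representation of experience as a list of strings, and the use of a single numeric score per task are all assumptions made for illustration.

```python
import random
from statistics import mean
from typing import Callable, Sequence

# Hypothetical signature: an agent takes a task and a list of experience
# items and returns a score in [0, 1]. Real agents are far more complex.
Agent = Callable[[str, Sequence[str]], float]


def perturb(experience: Sequence[str], mode: str,
            distractors: Sequence[str], rng: random.Random) -> list[str]:
    """Return a perturbed copy of the experience (illustrative modes only)."""
    if mode == "remove":
        return []
    if mode == "shuffle":
        items = list(experience)
        rng.shuffle(items)  # reorder items; the content itself is unchanged
        return items
    if mode == "replace":
        # Swap every item for an irrelevant one drawn from the distractors.
        return [rng.choice(list(distractors)) for _ in experience]
    raise ValueError(f"unknown perturbation mode: {mode}")


def faithfulness_gap(run_agent: Agent, tasks: Sequence[str],
                     experience: Sequence[str], distractors: Sequence[str],
                     mode: str, seed: int = 0) -> float:
    """Mean score with intact experience minus mean score when perturbed.

    A gap near zero suggests the agent's behavior does not depend on the
    experience; a large gap suggests that it does.
    """
    rng = random.Random(seed)
    intact = mean(run_agent(t, experience) for t in tasks)
    broken = mean(
        run_agent(t, perturb(experience, mode, distractors, rng))
        for t in tasks
    )
    return intact - broken
```

Under these assumptions, an agent that genuinely depends on its memory shows a large gap under removal or replacement, while an agent that ignores its memory shows a gap close to zero regardless of the perturbation. The same comparison can be run separately for raw and for condensed experience.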

This is not a minor implementation flaw. It challenges a core assumption behind memory-based self-evolution: that abstraction preserves the useful part of experience. In practice, abstraction can remove the very details that make experience actionable. A condensed lesson may be too generic, too detached from the current task, or too confident about a pattern that no longer applies. The agent may follow an outdated heuristic instead of inspecting the current environment. Or it may ignore the summary entirely because stronger local context signals dominate the prompt.

This explains why some systems look better with memory while still failing to use memory faithfully. Experience becomes a performance artifact rather than a reliable source of grounded adaptation.

Bigger models do not automatically solve the problem

A common response is to assume that scaling will fix this. Larger models may produce better summaries and process context more effectively. Yet the paper's evidence does not support a simple scaling solution. Larger models often achieve higher baseline performance, but the gap between raw-experience faithfulness and condensed-experience faithfulness remains. In other words, scale improves capability, but it does not automatically produce faithful integration of experience.

This matters for the politics of AI. The mythology of self-improving intelligence is often used to justify accelerated deployment, privileged access to compute, reduced scrutiny and closer integration between large AI companies and state functions. But if the mechanisms of self-evolution remain weakly grounded, then claims about autonomous improvement should not be treated as a mandate for institutional exception. They should be treated as a research hypothesis requiring rigorous evaluation.

The real question is not whether AI systems can assist with coding, research, administration or scientific exploration. They can. The real question is whether societies should allow closed, opaque systems to become strategic infrastructure before we can verify how they learn, remember, retrieve, forget and act.

From memory as storage to experience as accountable infrastructure

The way forward is not to abandon agent memory. It is to make experience integration more reliable, interpretable and accountable. First, condensed experience should not be optimized only for brevity. It must preserve context, scope conditions, failure modes, uncertainty and evidence. A useful lesson is not merely a sentence that sounds general. It is a structured claim about what worked, under which conditions, with which exceptions and with which risks.
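One way to picture a lesson as a structured claim is a small record type. The sketch below is only an illustration under stated assumptions: the class name `Lesson`, its field names and the `missing_fields` check are hypothetical and do not come from any of the cited frameworks.

```python
from dataclasses import dataclass, field


@dataclass
class Lesson:
    """A condensed experience stored as a structured claim (hypothetical schema)."""
    claim: str                                              # what worked
    conditions: list[str] = field(default_factory=list)    # under which conditions
    exceptions: list[str] = field(default_factory=list)    # known exceptions
    risks: list[str] = field(default_factory=list)         # known risks or failure modes
    evidence_ids: list[str] = field(default_factory=list)  # links to raw trajectories
    confidence: float = 0.5                                 # stated uncertainty, 0..1

    def missing_fields(self) -> list[str]:
        """List the parts that would make this lesson a bare slogan if left empty."""
        required = {
            "claim": self.claim.strip(),
            "conditions": self.conditions,
            "evidence_ids": self.evidence_ids,
        }
        return [name for name, value in required.items() if not value]

    def render(self) -> str:
        """Render the lesson as prompt text that keeps its scope visible."""
        lines = [f"Lesson: {self.claim}"]
        lines.append("Applies when: " + "; ".join(self.conditions or ["unspecified"]))
        if self.exceptions:
            lines.append("Exceptions: " + "; ".join(self.exceptions))
        if self.risks:
            lines.append("Risks: " + "; ".join(self.risks))
        lines.append(f"Confidence: {self.confidence:.2f}")
        return "\n".join(lines)
```

The design choice here is that a lesson without conditions or without links to the raw evidence is flagged rather than silently stored, so that brevity cannot be gained by dropping scope and provenance.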

Second, experience should not be statically prepended to every prompt. Agents need dynamic memory activation based on task demands, uncertainty, interaction state and verified relevance. Some tasks may require full trajectories. Some may require retrieval-augmented evidence. Some may require no past experience at all. Indiscriminate memory injection can dilute attention, mislead the model and reduce reliability.
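A minimal sketch of such gating might look as follows. The function name `select_memory`, the two input scores and both thresholds are assumptions chosen for illustration; a real system would need to estimate relevance and uncertainty and to calibrate the thresholds empirically.

```python
def select_memory(relevance: float, uncertainty: float,
                  min_relevance: float = 0.5,
                  high_uncertainty: float = 0.7) -> str:
    """Decide which kind of past experience, if any, to inject into the prompt.

    relevance:   estimated match between stored experience and the task (0..1)
    uncertainty: estimated uncertainty of the agent on the current task (0..1)
    Both thresholds are hypothetical defaults, not recommended values.
    """
    for name, value in (("relevance", relevance), ("uncertainty", uncertainty)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    if relevance < min_relevance:
        # Irrelevant memory can dilute attention, so inject nothing.
        return "none"
    if uncertainty >= high_uncertainty:
        # When the agent is unsure, prefer the detailed trace over a summary.
        return "raw_trajectory"
    return "condensed_lesson"
```

The point of the sketch is the default: experience is withheld unless it passes a relevance check, instead of being prepended to every prompt.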

Third, critical systems need auditability. For every important answer or action, we should be able to inspect which experience was retrieved, why it was considered relevant, whether it was raw or condensed, how it influenced the output and whether a human operator reviewed the result. Without this, self-evolution becomes a black-box claim.
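As an illustration, the items listed above could be captured in one audit record per retrieved experience. The sketch below assumes a hypothetical schema (`MemoryAuditRecord`, `make_record`, `to_log_line`) and a JSON-lines log; it is not a standard and not an existing framework's API.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

ALLOWED_KINDS = ("raw", "condensed")


@dataclass(frozen=True)
class MemoryAuditRecord:
    """One retrieved experience and its role in an action (hypothetical schema)."""
    action_id: str          # the answer or action being audited
    experience_id: str      # which experience was retrieved
    experience_kind: str    # "raw" or "condensed"
    relevance_reason: str   # why it was considered relevant
    influence_note: str     # how it influenced the output, as far as known
    human_reviewed: bool    # whether a human operator reviewed the result
    timestamp: str          # UTC time of the record, ISO 8601


def make_record(action_id: str, experience_id: str, experience_kind: str,
                relevance_reason: str, influence_note: str,
                human_reviewed: bool = False) -> MemoryAuditRecord:
    """Build a record, rejecting unknown experience kinds."""
    if experience_kind not in ALLOWED_KINDS:
        raise ValueError(f"experience_kind must be one of {ALLOWED_KINDS}")
    return MemoryAuditRecord(
        action_id=action_id,
        experience_id=experience_id,
        experience_kind=experience_kind,
        relevance_reason=relevance_reason,
        influence_note=influence_note,
        human_reviewed=human_reviewed,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(record: MemoryAuditRecord) -> str:
    """Serialize a record as one JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Recording the influence of an experience is the hardest field to fill honestly. A free-text note is used here only as a placeholder; in practice it would need to be backed by tests such as perturbing the retrieved experience and observing whether the output changes.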

For Europe and Greece, this is not only a technical issue. It is a matter of digital sovereignty. If public-sector AI systems are built on closed models, opaque memory stores and proprietary agent frameworks, then the state becomes dependent on infrastructures it cannot meaningfully inspect. If, instead, AI is built as public infrastructure, with open standards, open-source components, documented datasets, model cards, datasheets, retrieval with verifiable sources and human final responsibility, then experience can become a public knowledge asset rather than a private control layer.

The lesson is clear. The future of AI should not be governed by the myth of a machine that recursively improves beyond social oversight. It should be built around systems that learn in ways we can test, reproduce, contest and govern. Faithful self-evolution, if it is to become real, will not be magic. It will be engineered, audited and democratically accountable.

Sources:

Zhao et al., “Large Language Model Agents Are Not Always Faithful Self-Evolvers”: The core paper provides the main evidence that LLM agents consistently rely more on raw trajectories than on condensed summaries, and that this faithfulness gap persists across frameworks, model scales and single-agent or multi-agent settings: https://arxiv.org/html/2601.22436v3.
GlossAPI, “Artificial Intelligence as an Infrastructure of Power”: This article provides the political frame used here, connecting the mythology of self-improving AI with compute concentration, strategic state power and the need for open, accountable public AI infrastructure: https://blog.glossapi.gr/en/artificial-intelligence-as-an-infrastructure-of-power/.
Zhao et al., “ExpeL: LLM Agents Are Experiential Learners”: This work is one of the foundational examples of an agent that autonomously gathers task experience, extracts natural-language insights and reuses both insights and successful examples at inference time: https://arxiv.org/abs/2308.10144.
Ouyang et al., “ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory”: This paper is directly relevant because it proposes a memory framework that distills reasoning strategies from successful and failed experiences, placing condensed memory at the center of agent self-evolution: https://arxiv.org/abs/2509.25140.
Hu et al., “Memory in the Age of AI Agents”: This survey gives a broader conceptual map of agent memory, distinguishing its forms, functions and dynamics, and showing why memory should be treated as a first-class design primitive in agentic AI: https://arxiv.org/abs/2512.13564.
Mitchell et al., “Model Cards for Model Reporting”: This paper supports the need for transparent model documentation, especially when machine learning systems are used in high-impact settings and require clear reporting of intended uses, limits and evaluation conditions: https://arxiv.org/abs/1810.03993.
Gebru et al., “Datasheets for Datasets”: This work is relevant because reliable experience integration depends not only on model behavior, but also on documented datasets, provenance, composition, collection processes and recommended uses: https://arxiv.org/abs/1803.09010.
European Commission, “General-Purpose AI Models in the AI Act”: This source documents the European regulatory context for general-purpose AI models, including transparency, reporting and lifecycle obligations that matter for accountable public AI systems: https://digital-strategy.ec.europa.eu/en/faqs/general-purpose-ai-models-ai-act-questions-answers.