Code & Cluster: Agentic AI

Showing posts with label Agentic AI. Show all posts

Monday, June 15, 2026

Everyone Is Arguing About MCP vs Skills vs Tools vs CLI. They're Missing the Point.

The agentic AI ecosystem has developed a terminology problem.

Depending on which startup, or conference talk or group you're listening to, the future of AI agents is supposedly built around:

Tools
Skills
MCP
CLI environments

After building and studying agent systems for the last couple of years, I feel these debates are missing the real question. It isn't which abstraction to use.

It's: where should orchestration live? Once you understand that, tools, skills, MCP, and CLI all start to make sense.

To illustrate this, let's imagine we're building a fictional AI Operations Agent.

The agent can:

investigate incidents
analyze logs and metrics
review deployments
create Jira tickets
optimize cloud costs
assist with root cause analysis

As we'll see, the architecture evolves dramatically as the system grows.

The Fundamental Problem

At its core, an LLM cannot actually perform operations work.

It cannot:

query Datadog
inspect GitHub commits
retrieve deployment history
restart services
create Jira tickets

All it can do is reason over information provided in its context.

Everything useful in agentic AI comes from connecting the model to external capabilities.

The question becomes:

How should those capabilities be exposed?

The Tool-Based Agent

Most agent systems start with tools.

A tool is simply a primitive capability exposed to the model.

For our operations agent, the tools might look like:


def query_logs(service_name: str):
    ...

def query_metrics(service_name: str):
    ...

def get_recent_deployments(service_name: str):
    ...

def create_jira_ticket(summary: str):
    ...

The architecture is straightforward.


User
 ↓
LLM decides next action
 ↓
Agent Runtime
 ↓
Tool Execution
 ↓
Results returned to LLM

Suppose a user asks:

Why is checkout-service returning elevated 500 errors?

The interaction loop might look like this:

The LLM determines that logs are required.
The runtime executes query_logs().
Results are returned.
The LLM determines deployment history is also needed.
The runtime executes get_recent_deployments().
Results are returned.
The LLM generates a root cause hypothesis.

This works surprisingly well. Initially.

The Hidden Scaling Problem

The challenge appears when the system grows.

At first the agent has only a handful of tools.


query_logs()
query_metrics()
query_deployments()
create_ticket()

Six months later:


query_logs()
query_metrics()
query_traces()
query_alerts()
query_runbooks()
restart_service()
rollback_deployment()
create_ticket()
query_cost_data()
analyze_utilization()
deploy_hotfix()
...

Now the LLM is no longer merely selecting tools.

It is orchestrating an operational workflow.

It must decide:

which tools to invoke
in what order
when enough information has been gathered
how to recover from failures
what actions should be taken

In other words:

Tool-based systems push orchestration into the model.

This turns out to be both powerful and dangerous.

The problem has three dimensions. First, reliability: too much execution logic is living inside a probabilistic system that may sequence it differently on every run. Second, cost: every invocation pays LLM inference tokens to rediscover a workflow that never changes. Third, latency: each tool-selection step is a round-trip to the model — a waterfall of LLM calls where a single deterministic function would do.

The sequence:

query logs → query metrics → retrieve deployment history → retrieve recent alerts

is completely deterministic. You already know exactly what to fetch. Forcing the LLM to reason its way to that sequence on every invocation is not intelligence — it's waste. Unreliable, slow, and expensive waste.

We are slowly realizing that not all orchestration belongs inside the model. This realization leads directly to skills.

The Two Orchestration Layers

Every agent system, no matter how it's described, is really making decisions at two different layers:

Semantic orchestration — deciding what needs to happen. Which capability is relevant. What the user actually meant. This is inherently ambiguous, and it's the one job LLMs are genuinely good at.

Execution orchestration — deciding how it happens. Ordering, parallelism, retries, error handling, state. This is inherently deterministic, and it's the one job LLMs are bad at — slow, expensive, and inconsistent across runs.

Most of the confusion in this space comes from putting both layers in the same place. Tools, skills, MCP, and CLI environments are best understood as different ways of drawing the line between these two layers — not as competing abstractions. Keep this distinction in mind; every section below maps back to it.

Skills: Coarse-Grained Semantic Capabilities

Skills are often described vaguely, so let's define them carefully. A skill is not simply a larger tool. Nor is it necessarily another autonomous agent.

More precisely: a skill is a subgraph with a semantic boundary. It encapsulates three things the LLM should not have to think about: execution order, parallelism, and retry policy. The planner sees a named capability. The skill owns how that capability runs.

For our operations platform, skills might include: IncidentAnalysisSkill, DeploymentReviewSkill, CapacityPlanningSkill, CostOptimizationSkill.

Tools describe primitive operations. Skills describe business-level capabilities and the execution policy that implements them.

The Planner LLM Still Matters

Without an LLM, mapping human requests to skills becomes extremely difficult.

Consider:

Why did checkout-service start failing after lunch?

Is this:

an incident investigation?
a deployment review?
a performance analysis?

The ambiguity lives in natural language.

The LLM remains extremely useful for:

intent understanding
semantic routing
contextual interpretation

The architecture increasingly becomes:


User Request
 ↓
Planner LLM
 ↓
Skill Selection
 ↓
Skill Execution
 ↓
Results
 ↓
Planner LLM
 ↓
Response

Skills Move Execution Orchestration Out Of The Model

Skills are the clearest example of this split in action: semantic orchestration stays with the planner, execution orchestration moves into the skill. This is the architectural shift many modern systems are making.

For example:


class IncidentAnalysisSkill:

  async def execute(self, service_name):

    # Fetch concurrently — no reason to serialize
    logs, metrics, deployments, alerts = (
      await asyncio.gather(
        query_logs(service_name),
        query_metrics(service_name),
        get_recent_deployments(service_name),
        get_recent_alerts(service_name)
      )
    )

    return {
      "logs": logs,
      "metrics": metrics,
      "deployments": deployments,
      "alerts": alerts
    }

The planner LLM still decides: I need IncidentAnalysisSkill. But once selected, deterministic execution logic takes over — including decisions the LLM would never make well, like whether to fetch in parallel or how to handle a partial failure.

Note: this example is deliberately simplified. In production, skills typically need streaming partial results, per-tool timeouts, and circuit breakers. The point is that all of that complexity lives in the skill — not in the model.

Where MCP Fits

MCP doesn't touch this split at all — it just makes the capabilities on the execution side portable across runtimes. This is where many online discussions become confused.

MCP is not competing with skills.

MCP solves a completely different problem.

The ecosystem currently suffers from massive fragmentation.

Every framework invents:

different tool schemas
different invocation mechanisms
different discovery protocols

MCP attempts to standardize that layer.

The easiest way to think about MCP is:

MCP is trying to become the interoperability layer for AI capabilities.

Closer to OpenAPI for agent ecosystems — a shared interface contract that defines how capabilities are described, discovered, and invoked, without caring what carries the bytes underneath.

What MCP Exposes

Most coverage of MCP focuses only on tools. But the protocol defines three primitives, and the distinction matters:

Tools — callable functions the model can invoke. This is what most people mean when they talk about MCP.

Resources — structured data the model can read: log files, database records, API responses. Resources are pull-based; the model requests them rather than the server pushing them.

Prompts — reusable prompt templates with parameters, surfaced by the server. This lets server authors encode domain knowledge — e.g., a Datadog MCP server can expose a prompt template for structured incident analysis, not just raw metric endpoints.

For our operations agent, a well-designed Datadog MCP server doesn't just expose query_metrics(). It might also expose a resource for the current service topology and a prompt template that structures how the model should reason about an anomaly. That's a qualitatively richer integration than a tool-only mental model suggests.

from mcp.server import Server
server = Server("datadog-server")

@server.tool()
async def query_service_metrics(service):

...

@server.resource("topology://services")
async def get_service_topology():
...

@server.prompt()
async def analyze_anomaly(service, window):
return f"Given metrics for {service} over {window}"

One practical implication: an MCP server built today for Claude can work tomorrow with any other MCP-compatible runtime. That portability is the actual value proposition — not the protocol mechanics themselves.

Why Coding Agents Love CLI Environments

Now let's discuss the most interesting trend in the ecosystem. CLI environments push the line further than either tools or skills: here, the model itself participates in execution orchestration by writing commands, which is exactly why sandboxing becomes non-negotiable.

Coding and operations agents are increasingly bypassing structured tools altogether.

Instead of exposing:


query_git_diff()
run_tests()
deploy_service()

they expose:


git
docker
kubectl
terraform
pytest

The shell becomes the capability layer.

The agent generates commands directly.


kubectl logs checkout-service
git diff
terraform plan

This is a radically different philosophy.

Instead of carefully defining capabilities, you expose a real execution environment.

Why CLI Agents Feel More Powerful

Unix environments already solved composability decades ago. Commands naturally chain together. The shell already provides: composition, state, interoperability, flexibility.

Instead of creating hundreds of tool schemas, the environment itself becomes the API. This is why modern coding agents often feel dramatically more capable than traditional tool-calling agents.

The Tradeoff

Of course, flexibility comes with risk. A structured tool:

restart_service(service_name)

is easier to constrain than:

rm -rf /

This is not a theoretical concern. A CLI agent operating against a production Kubernetes cluster with broad permissions is one hallucinated flag away from an outage. The blast radius of a mistake is bounded only by what the execution environment allows — not by what the tool schema permits. With structured tools, you constrain at the interface. With CLI, you constrain at the environment.

This is why sandboxed execution environments have become infrastructure in their own right. Tools like E2B, Daytona, and Modal provide ephemeral, isolated runtimes specifically built for agent workloads: short-lived containers with scoped filesystem access, network egress controls, and resource limits. The architecture shifts from "give the agent a shell" to "give the agent a shell inside a box it can't break out of." This also changes the observability story — a sandboxed run produces a complete audit trail of every command executed, which structured tool-calling often doesn't.

The deeper implication for our orchestration framework: CLI agents are the one place where the semantic/execution boundary genuinely blurs. The model is no longer just selecting a capability — it is participating in execution orchestration by writing the commands themselves. That is both the source of their power and the reason the safety layer beneath them has to be more sophisticated than a permission flag.Where The Industry Appears To Be Heading

If you step back from the terminology wars, a pattern begins to emerge.

Modern agent systems increasingly look like this:


                 LLM
                  ↓
        Semantic Planning
                  ↓
               Skills
                  ↓
      Deterministic Workflows
            ↓          ↓
         Tools       MCP
            ↓          ↓
      Local Systems  External Systems
                  ↓
            CLI Runtime

Notice what happened.

The LLM is not "the system."

It has become one component inside a larger architecture.

Reasoning remains probabilistic.

Execution becomes deterministic.

Interoperability becomes protocol-driven.

Capabilities become reusable infrastructure.

Final Thoughts

Tools, Skills, MCP, and CLI environments are different answers to where the line falls between semantic and execution orchestration — not competing abstractions.

They exist at different layers of the stack.

Tools expose primitive capabilities.

Skills expose semantic capabilities.

MCP standardizes interoperability.

CLI environments expose raw execution contexts.

The real architectural question is not:

Which abstraction wins?

It is:

Where should orchestration live?

The most successful systems increasingly place semantic orchestration inside LLMs and execution orchestration inside deterministic software.

The future of agentic AI looks less like autonomous chatbots wandering through APIs and more like carefully engineered systems where probabilistic reasoning and deterministic execution work together.

Thursday, April 9, 2026

Why Deterministic AI Agents Are The Wrong Goal ?

1. The Probabilistic Machine

In the rush to build "enterprise-grade" AI agents, many teams are chasing a seductive idea: what if we could make AI fully deterministic — predictable, repeatable, always correct?

It sounds reasonable. That's how traditional software works. But it starts with a fundamental misunderstanding of what these systems actually are.

Large Language Models are not knowledge databases or rules engines. At their core, they are statistical machines — predicting the probability of the next token in a sequence, over and over, until a response takes shape. When you craft a precise prompt or inject carefully curated context, you are not overriding that mechanism. You are nudging it. Good context narrows the probability distribution and raises the likelihood of a useful answer. But it doesn't change the fundamental nature of the system. The model remains probabilistic. You can engineer the inputs to the ceiling — you cannot engineer your way out of uncertainty.

2. The Enterprise Trap

This creates a real tension — because the properties that make LLMs powerful are almost perfectly opposed to what enterprises want from software.

Enterprises don't want suggestions. They want answers. They don't want "usually correct" — they want auditable, repeatable, defensible outputs. When something goes wrong, someone needs to explain exactly why the system did what it did.

A probabilistic system doesn't give you that cleanly. And so the instinct is to reach for determinism — to constrain the model until it behaves like a well-behaved service. Temperature to zero. Rigid output schemas. Exhaustive prompt engineering. Rules stacked on rules.

This isn't irrational. Repeatability, auditability, compliance — these are legitimate needs. But chasing determinism in the model itself is solving the wrong problem. You end up with a system that is neither reliably deterministic nor making full use of what the model can actually do. You've neutered the intelligence without gaining the guarantees you wanted.

The goal shouldn't be a deterministic model. It should be a reliable system. Those are not the same thing — and confusing them is where most enterprise AI projects go wrong.

3. What Actually Works: Hybrid Architecture

The real breakthrough of modern AI systems wasn't just model quality. It was a design philosophy.

Tools like ChatGPT and Claude succeeded not because they eliminated uncertainty, but because they made uncertainty part of the interaction. They don't say "here is the correct answer." They say "here's a strong answer — want me to refine it?" That subtle shift changes everything. The human stays in the loop. The model doesn't pretend to be infallible. And because of that, users actually trust the output enough to act on it.

This points toward the pattern that wins in production: a hybrid architecture where determinism and intelligence each live where they belong.

The structure looks like this. A deterministic shell handles everything that must be correct and repeatable — workflows, APIs, validation rules, policy enforcement. A probabilistic core handles everything that requires reasoning — summarization, analysis, decision support, generation. And control points sit between them — confidence thresholds, structured outputs, human approvals — managing the boundary between the two.

The LLM proposes. The system validates. The human, when needed, decides.

Determinism doesn't disappear. It moves to where it actually belongs.

4. Vestra: A Concrete Example

To make this real, consider Vestra — an AI-powered investment analysis agent I've been building.

A naive approach would let the LLM do everything: fetch market data, apply financial rules, generate investment decisions. That system would fail badly. It would hallucinate stock picks, misinterpret regulations, produce outputs that couldn't survive a compliance review.

So Vestra is deliberately split into three layers.

The deterministic shell ingests user portfolio data, computes financial metrics, and enforces business rules. This is traditional code. It behaves identically every time.

The probabilistic core — the LLM — receives only clean, verified data from that shell. It doesn't touch raw market feeds. It reasons over what it's given: evaluating the portfolio against the user's goals, time horizon, and macro context, and generating high-level strategic insights. Rebalance allocations. Increase international exposure. Add diversified index funds.

The control points manage what happens next. The LLM is constrained to return structured JSON, so downstream code can validate and process the output safely. And Vestra never acts autonomously — it presents recommendations to the user with clear disclosure that they are AI-generated. The user accepts, rejects, or refines them, and is encouraged to seek professional advice for significant decisions.

The LLM doesn't fetch data. It doesn't execute trades. It doesn't make final calls.

Determinism lives in the code. Intelligence lives in the model. Control lives with the user.

5. The Broader Principle

There's a persistent belief that keeping humans in the loop is a temporary crutch — something to tolerate until the models get good enough to go fully autonomous. That belief keeps getting disproven in production.

Human involvement isn't a limitation to engineer around. It's a design pattern that makes systems more accurate and more trusted. Users who can steer, refine, and push back on AI outputs consistently get better results than users handed a black-box decision. The interaction is where the value compounds.

That said, not every system should work this way. Where the stakes are high but the timeline allows deliberation — investment analysis, legal drafting, code review, customer support — interactivity is a feature, not a compromise. The human brings judgment the model lacks; the model brings breadth the human can't match.

But in real-time fraud detection, high-throughput automation, or millisecond trading systems, that loop collapses. There's no time for human approval. These systems need hard rules and hard thresholds, with AI informing the design rather than driving the execution.

The mistake isn't choosing the wrong architecture. It's assuming one architecture fits everything.

6. The Real Opportunity

A year or two ago, the bold prediction was that autonomous agents would replace entire workflows by now. That hasn't happened — not because the models aren't capable enough, but because the systems around them weren't designed for it. Fully autonomous agents keep failing in the same ways: confidently wrong outputs, no graceful degradation, users who don't trust them enough to act on what they produce.

What's actually working is quieter and less glamorous. Systems that surface their own uncertainty. Systems where the human is a genuine collaborator, not an afterthought. Systems where the experience is designed so carefully that raw model capability almost becomes secondary.

The opportunity isn't in building perfect AI agents. It's in building systems that help humans navigate imperfection — that make strong suggestions, explain their reasoning, accept feedback, and improve through iteration.

Progress doesn't come from eliminating uncertainty. It comes from designing systems that help people work with it.

Chasing deterministic AI agents may feel like building the future. But the real future belongs to systems that are interactive, collaborative, and intelligently imperfect.

And that's not a limitation. That's the breakthrough.

Saturday, March 7, 2026

AI Context Explained: The Real Engineering Behind Modern AI Systems

Most discussions about large language models focus on prompts — how to phrase instructions to get better responses. But in real AI systems, prompts are only a small part of the story.

What actually determines the quality of an AI system is context: the information available to the model when it generates a response. This includes prompts, conversation history, retrieved documents, tool outputs, and sometimes structured application state. Designing how this information is assembled and provided to the model is what many engineers now call context engineering.

Providing the right context to the LLM is the only reliable way to get accurate, production-grade answers. In this blog, I explore what context actually is, the hidden dangers of massive context windows, and how it should be used in Agentic AI.

Context

Context refers to all the information that is not in the user's immediate question, but is required to help the LLM generate a relevant, highly specific answer. It is the data that gives the LLM situational awareness.

Consider this basic prompt:

Prompt: What is a good stock mutual fund to invest in? Response (Abbreviated): 1. T. Rowe Price Global Technology Fund (PRGTX) 2. Wasatch Ultra Growth Fund (WAMCX).

For many investors, both of these are far too aggressive, high-risk, and expensive. Let's change the prompt slightly to inject some context:

Prompt: What is a good stock mutual fund to invest in? I am 56 years old, nearing retirement. I prefer low-risk, low-cost, highly diversified funds. Response (Abbreviated): 1. Vanguard Target Retirement 2035 Fund (VTTHX) 2. Fidelity ZERO Total Market Index Fund (FZROX) 3. Vanguard Total Bond Market Index Fund (VBTLX).

This response is drastically different and entirely appropriate for a conservative investor. The phrase "56 years old, nearing retirement. I prefer low-risk, low-cost, highly diversified funds" is the context. Without it, asking the LLM the same question multiple times will yield scattered, generic, or even dangerous financial advice.

How is Context Passed to the Model?

Whether you use native provider APIs (OpenAI, Google) or orchestration frameworks like LangChain, context is not a separate magical parameter. It is embedded directly into the input messages.

A raw API call looks like this:

# Python
client.responses.create(
    model="gpt-5.2",
    messages=[ 
        {"role": "system", "content": "You are a financial advisor."}, # system prompt
        {"role": "user", "content": "What is a good fund to invest in?"} # User prompt or query
        {"role": "user", "content": "I am 56 and prefer low risk."}, # Context    ],
    temperature=0.0
)

Everything the LLM knows is stuffed into that messages array. In an agentic system, it is all about getting the right information into that array at the right time.

Context generally falls into three categories:

Static: Data that rarely changes (e.g., "User is a male, NY Yankees fan, foodie").
Dynamic: Data that evolves as the agent runs and interacts with tools (e.g., the results of a real-time stock price lookup).
Long-Lived: Data that spans across multiple sessions or days (e.g., "User already rejected the Vanguard recommendation yesterday").

In practice, building AI systems is often less about “prompt engineering” and more about deciding what information should be included in the model’s context at the moment of inference.

The data comes from a variety of sources. Cache would have the most recent data. The databases have the system of record. The log have events as they occurs. Context engineering involves getting the right data from the right place at the right time.

Note that context can go stale. If you feed in stale context, you get less the accurate answers. Keeping it current is part of the engineering.

The Illusion of the Infinite Context Window

The context window refers to how much of your context the LLM retains for a conversation. The more it can remember, the better. Right ?

LLM providers aggressively advertise their context window sizes. Bigger appears better, but that is a dangerous trap for developers.

The context window simply represents the hard cap on how much text an LLM can "see" at once. Look at the landscape in early 2026:

Meta Llama 4 Scout: ~10 Million tokens
Gemini 3 Pro: ~1.0M - 2.0M tokens
OpenAI GPT-5.2: ~400,000 tokens
Claude 4.5 Sonnet: ~1.0M tokens
DeepSeek-R1: ~164,000 tokens

However, the context window is a model attribute, not an agent capability. Research in 2025 and 2026 has consistently proven that models severely degrade well before hitting their upper limits. This phenomenon is known as Context Rot.

Just because a model can accept 1 million tokens (about 8 full novels) doesn't mean it pays equal attention to all of them. Studies show that when a context window passes 50% capacity, models begin to heavily favor tokens at the very beginning or the very end of the prompt, completely ignoring critical constraints buried in the middle.

The industry is now focusing on the Maximum Effective Context Window (MECW). A model might advertise 1 million tokens, but its MECW—the point where accuracy actually drops off a cliff—might be only 130k tokens.

The Agent Loop

Because of Context Rot, you cannot just dump an entire database into the LLM and expect it to figure things out. This is why we build Agents.

An LLM is a stateless text predictor. An Agent is a software loop that uses the LLM as a reasoning engine to manage its own context. Agents operate in a continuous cycle: Observe → Think → Act.

Imagine building an AI-based investment analysis product. The agent doesn't just ask the LLM one massive question. It loops:

Observe: The user asks, "Should I adjust my portfolio for the upcoming rate cuts?"
Think (LLM): The model realizes it lacks context. It outputs a tool-call: get_user_portfolio() and get_risk_tolerance().
Act (Code): Agent code queries a PostgreSQL database to fetch the financial profile.
Update Context: It appends appends only the relevant portfolio metrics into the messages array.
Loop: The agent sends this newly enriched, highly specific context back to the LLM to generate the final advice.

In this loop, the context is actively mutating. The agent is continuously pruning the messages array, summarizing old turns, and injecting fresh tool outputs to keep the token count well within the Maximum Effective Context Window.

Prompt Engineering vs. Context Engineering

If prompt engineering is about how you ask the question, context engineering is about giving the model a little more data before it attempts to answer.

To use an operating system analogy: The LLM is the CPU, the Prompt is the executable command, and the Context Window is the RAM and context is the data in RAM.

Prompt Engineering is writing a better command. It is user-facing, static, and brittle.
Context Engineering is managing the data in RAM. It is developer-facing, dynamic, and systemic.

As we move toward enterprise-grade AI, prompts are no longer enough. Context engineering involves building the infrastructure that feeds the model. It encompasses Retrieval-Augmented Generation (RAG) to find specific documents, Episodic Memory Graphs to track user decisions over time, and Context Pruning to prevent token overflow.

The Frontier: Context Graphs

While context engineering today is mostly about managing lists of messages, the future of enterprise AI lies in Context Graphs. Current LLM context is linear—a flat, chronological scroll of "User said X, Agent did Y." This works for chat, but it fails for complex enterprise workflows. Real-world business data isn't a timeline; it's a web of relationships.

Enter the Context Graph. Instead of dumping raw logs into the window, advanced agents now build and maintain a dynamic graph structure. Nodes represent entities (User, File, Decision, Error). Edges represent causality or relationships (e.g., User Upload caused Error 500, which triggered Retry Logic).

This structure transforms the context from a "temporary scratchpad" into an organizational brain. If a human auditor later asks, "Why did the agent reject this loan application?", a linear log forces the LLM to re-read thousands of lines of text to guess the reason. A Context Graph simply traverses the edge: Loan Application -> {rejected\_because}} -> Risk Score > 80.

For enterprise applications, this is the missing link. It allows agents to reason across disconnected data points (e.g., linking a Slack message from Tuesday to a Code Commit on Friday) without needing a massive, expensive context window to hold all the noise in between.

Conclusion

A perfectly engineered prompt might get you a clever answer once. But a well-engineered context pipeline ensures your Agent gets the accurate answer securely, cost-effectively, and consistently, every single time.

Popular LLM models are advanced and sophisticated. But everyone has access to them. You have no advantage because you an LLM. Your advantage and intellectual property is in how you manage and feed context to the LLM. Though not new, how you collect, store and retrieve the data that forms the context is the real engineering.

Monday, February 9, 2026

What is (Agentic) AI Memory ?

I have seen a lot of posts on X and LinkedIn on the importance of Agentic AI memory. What exactly is it ? Is it just another name for RAG ? Why is it different from any other application memory ? In this blog, I try to answer these questions.

What do people mean when they say "AI Memory" ?

Most production LLM interactions rely on external memory systems. Everything called “memory” today is mostly external.

At their core LLMs are stateless functions. You make a request with a prompt and some context data and it provides you with a response

In real systems, AI memory usually means:

Storing past interactions, user preferences, decisions, goals, or facts.
Retrieving relevant parts later
Feeding a compressed version back into the prompt

So yes — at its core:

Memory = save → retrieve → summarize → inject into context

Nothing magical. But is that all ? seems just like a regular cache ? Read on.

Is this just RAG (Retrieval Augmented Generation) ?

They are related but not the same.

RAG (Retrieval Augmented Generation)

Purpose:

Bring external knowledge into the LLM
Docs, PDFs, financial data, code, policies

Typical traits:

retrieval is stateless per query
Large text chunks
Query-driven retrieval
“What additional data can we provide to LLM to help answer this question?”

Agent / User Memory

Purpose:

Maintain continuity
Personalization
Learning user intent and preferences over time

Typical traits:

Long-lived
Highly structured
Small, distilled facts
“What can I provide to LLM so it remembers this user?”

Think of it this way:

They often use can use the same retrieval tools, but they serve different roles.

Where is the memory ?

Option 1: Agent process memory

Any suitable data structure like a HashMap.
Suitable for cases where the Agent loop is short and no persistence is needed.

Option 2: Redis /Cache

Suitable for session info, recent conversation history, tool results cache, temporary state.

Option 3: PostgreSQL/RDBMS

Suitable when you need durability, auditability, explainability.

Option 4: Vector databases

Suitable for semantic search.

Option 5: AI memory tools

Such as LangGraph memory, LlamaIndex memory, Memgpt. They try to make it easier for agents to store and retrieve.

Here is example of data that might be stored in memory:

{

"user_id": "123",

"fact": "User prefers concise python code",

"source": "conversation_turn_5",

"timestamp": "2026-02-09"

}

The mental model for AI memory

Short term memory

This is about recent interactions. It is data relevant to the current topic being discussed. For example, the user prefers conservative answers.

Long term memory

This is stored externally, perhaps even to persistent storage. It is retrieved and inserted into context selectively. For example, the user is a vegetarian or the user's risk tolerance is low.

Memory and the LLM

The LLM takes as input only messages. Agent has to read the data from memory and insert it into the text message. This is what they refer to as context.

You do not want add large amount of arbitrary data as context because:

text is converted to token and token cost spirals
LLM attention degrades with noise
Latency increases
Reasoning quality declines

Real Agentic Memory

At the start of the blog, I asked "is this just a regular cache ?".

To be useful in the agentic way, what is stored in the memory needs to evolve. Older or maybe irrelevant data in the memory needed to be "forgotten" or evicted based on intelligence (not standard algorithms like FIFO, LIFO etc). Updates and evictions need to happen based on recent interactions. If the historical information is too long and should not be evicted, it might need to be compressed.

Agentic systems require more dynamic memory evolution than typical CRUD applications. In the case of long running agents, the quality of data in the memory has to get better with interactions over time.

How exactly that can be implemented is beyond the scope of this blog and could be a topic for a future one.

Considerations

Memory != Raw History

Bad Use : Here are the last 47 conversations ......

Better Use : We were talking about my retirement goals with this income and number of years to retire.

Summarize and abstract to extract intelligence - as opposed to dumping large quantity of data.

In conclusion

AI memory is structured state, sometimes summarized that is retrieved when needed and included as LLM input as "context".

Conceptually, it is similar to RAG but they apply to different use cases.

Better and smaller contexts beat large contexts and large memory.

Agentic AI Memory adds value only when

The system changes behavior ( for the better ) because of it
It produces better response, explanations, reasonings
It saves time

These ideas are not purely theoretical. While building Vestra — an AI agent focused on personal financial planning and modeling — I’ve had to think deeply about what should be remembered, what should be abstracted, and what should be discarded. In financial reasoning especially, raw history is far less useful than structured, evolving state.

But yes, Agentic memory will be different than what we know as memory in regular apps — in the ways it is updated, evicted, and retrieved.