
Thursday, April 9, 2026

Why Deterministic AI Agents Are the Wrong Goal

1. The Probabilistic Machine

In the rush to build "enterprise-grade" AI agents, many teams are chasing a seductive idea: what if we could make AI fully deterministic — predictable, repeatable, always correct?

It sounds reasonable. That's how traditional software works. But it starts with a fundamental misunderstanding of what these systems actually are.

Large Language Models are not knowledge databases or rules engines. At their core, they are statistical machines — predicting the probability of the next token in a sequence, over and over, until a response takes shape. When you craft a precise prompt or inject carefully curated context, you are not overriding that mechanism. You are nudging it. Good context narrows the probability distribution and raises the likelihood of a useful answer. But it doesn't change the fundamental nature of the system. The model remains probabilistic. You can engineer the inputs to the ceiling — you cannot engineer your way out of uncertainty.

2. The Enterprise Trap

This creates a real tension — because the properties that make LLMs powerful are almost perfectly opposed to what enterprises want from software.

Enterprises don't want suggestions. They want answers. They don't want "usually correct" — they want auditable, repeatable, defensible outputs. When something goes wrong, someone needs to explain exactly why the system did what it did.

A probabilistic system doesn't give you that cleanly. And so the instinct is to reach for determinism — to constrain the model until it behaves like a well-behaved service. Temperature to zero. Rigid output schemas. Exhaustive prompt engineering. Rules stacked on rules.

This isn't irrational. Repeatability, auditability, compliance — these are legitimate needs. But chasing determinism in the model itself is solving the wrong problem. You end up with a system that is neither reliably deterministic nor making full use of what the model can actually do. You've neutered the intelligence without gaining the guarantees you wanted.

The goal shouldn't be a deterministic model. It should be a reliable system. Those are not the same thing — and confusing them is where most enterprise AI projects go wrong.

3. What Actually Works: Hybrid Architecture

The real breakthrough of modern AI systems wasn't just model quality. It was a design philosophy.

Tools like ChatGPT and Claude succeeded not because they eliminated uncertainty, but because they made uncertainty part of the interaction. They don't say "here is the correct answer." They say "here's a strong answer — want me to refine it?" That subtle shift changes everything. The human stays in the loop. The model doesn't pretend to be infallible. And because of that, users actually trust the output enough to act on it.

This points toward the pattern that wins in production: a hybrid architecture where determinism and intelligence each live where they belong.

The structure looks like this. A deterministic shell handles everything that must be correct and repeatable — workflows, APIs, validation rules, policy enforcement. A probabilistic core handles everything that requires reasoning — summarization, analysis, decision support, generation. And control points sit between them — confidence thresholds, structured outputs, human approvals — managing the boundary between the two.

The LLM proposes. The system validates. The human, when needed, decides.
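This division of labor can be sketched in a few lines. The routing names and the confidence threshold below are illustrative assumptions, not a prescribed design:

```python
from dataclasses import dataclass

# Hypothetical control point between the probabilistic core and the
# deterministic shell. The threshold and route names are illustrative.

@dataclass
class Proposal:
    action: str
    confidence: float  # a 0..1 score attached to the model's output

APPROVAL_THRESHOLD = 0.85  # below this, a human must decide

def route(proposal: Proposal) -> str:
    """Deterministic routing: high-confidence proposals go to the
    validating shell, everything else goes to a human."""
    if proposal.confidence >= APPROVAL_THRESHOLD:
        return "auto_execute"   # shell validates, then runs
    return "human_review"       # queued for human approval
```

The point is that the gate itself is ordinary, testable code; only the proposal behind it is probabilistic.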

Determinism doesn't disappear. It moves to where it actually belongs.

4. Vestra: A Concrete Example

To make this real, consider Vestra — an AI-powered investment analysis agent I've been building.

A naive approach would let the LLM do everything: fetch market data, apply financial rules, generate investment decisions. That system would fail badly. It would hallucinate stock picks, misinterpret regulations, produce outputs that couldn't survive a compliance review.

So Vestra is deliberately split into three layers.

The deterministic shell ingests user portfolio data, computes financial metrics, and enforces business rules. This is traditional code. It behaves identically every time.

The probabilistic core — the LLM — receives only clean, verified data from that shell. It doesn't touch raw market feeds. It reasons over what it's given: evaluating the portfolio against the user's goals, time horizon, and macro context, and generating high-level strategic insights. Rebalance allocations. Increase international exposure. Add diversified index funds.

The control points manage what happens next. The LLM is constrained to return structured JSON, so downstream code can validate and process the output safely. And Vestra never acts autonomously — it presents recommendations to the user with clear disclosure that they are AI-generated. The user accepts, rejects, or refines them, and is encouraged to seek professional advice for significant decisions.
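A minimal sketch of such a control point, validating the structured JSON before anything downstream touches it. The field names (action, ticker, rationale) are assumptions for illustration, not Vestra's actual schema:

```python
import json
from typing import Optional

# Illustrative output validation; the schema is assumed, not Vestra's real one.
REQUIRED_FIELDS = {"action", "ticker", "rationale"}
ALLOWED_ACTIONS = {"buy", "sell", "hold"}

def validate_recommendation(raw: str) -> Optional[dict]:
    """Deterministic check on the probabilistic core's output.
    Returns the parsed recommendation, or None if it fails validation."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not REQUIRED_FIELDS.issubset(rec):
        return None
    if rec["action"] not in ALLOWED_ACTIONS:
        return None
    return rec
```

Anything that fails this check never reaches the user as a recommendation.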

The LLM doesn't fetch data. It doesn't execute trades. It doesn't make final calls.

Determinism lives in the code. Intelligence lives in the model. Control lives with the user.

5. The Broader Principle

There's a persistent belief that keeping humans in the loop is a temporary crutch — something to tolerate until the models get good enough to go fully autonomous. That belief keeps getting disproven in production.

Human involvement isn't a limitation to engineer around. It's a design pattern that makes systems more accurate and more trusted. Users who can steer, refine, and push back on AI outputs consistently get better results than users handed a black-box decision. The interaction is where the value compounds.

That said, not every system should work this way. Where the stakes are high but the timeline allows deliberation — investment analysis, legal drafting, code review, customer support — interactivity is a feature, not a compromise. The human brings judgment the model lacks; the model brings breadth the human can't match.

But in real-time fraud detection, high-throughput automation, or millisecond trading systems, that loop collapses. There's no time for human approval. These systems need hard rules and hard thresholds, with AI informing the design rather than driving the execution.

The mistake isn't choosing the wrong architecture. It's assuming one architecture fits everything.

6. The Real Opportunity

A year or two ago, the bold prediction was that autonomous agents would replace entire workflows by now. That hasn't happened — not because the models aren't capable enough, but because the systems around them weren't designed for it. Fully autonomous agents keep failing in the same ways: confidently wrong outputs, no graceful degradation, users who don't trust them enough to act on what they produce.

What's actually working is quieter and less glamorous. Systems that surface their own uncertainty. Systems where the human is a genuine collaborator, not an afterthought. Systems where the experience is designed so carefully that raw model capability almost becomes secondary.

The opportunity isn't in building perfect AI agents. It's in building systems that help humans navigate imperfection — that make strong suggestions, explain their reasoning, accept feedback, and improve through iteration.

Progress doesn't come from eliminating uncertainty. It comes from designing systems that help people work with it.

Chasing deterministic AI agents may feel like building the future. But the real future belongs to systems that are interactive, collaborative, and intelligently imperfect.

And that's not a limitation. That's the breakthrough.



Saturday, March 7, 2026

AI Context Explained: The Real Engineering Behind Modern AI Systems

Most discussions about large language models focus on prompts — how to phrase instructions to get better responses. But in real AI systems, prompts are only a small part of the story.

What actually determines the quality of an AI system is context: the information available to the model when it generates a response. This includes prompts, conversation history, retrieved documents, tool outputs, and sometimes structured application state. Designing how this information is assembled and provided to the model is what many engineers now call context engineering.

Providing the right context to the LLM is the only reliable way to get accurate, production-grade answers. In this blog, I explore what context actually is, the hidden dangers of massive context windows, and how it should be used in Agentic AI.

Context

Context refers to all the information that is not in the user's immediate question, but is required to help the LLM generate a relevant, highly specific answer. It is the data that gives the LLM situational awareness.

Consider this basic prompt:

Prompt: What is a good stock mutual fund to invest in?

Response (Abbreviated):
1. T. Rowe Price Global Technology Fund (PRGTX)
2. Wasatch Ultra Growth Fund (WAMCX)

For many investors, both of these are far too aggressive, high-risk, and expensive. Let's change the prompt slightly to inject some context:

Prompt: What is a good stock mutual fund to invest in? I am 56 years old, nearing retirement. I prefer low-risk, low-cost, highly diversified funds.

Response (Abbreviated):
1. Vanguard Target Retirement 2035 Fund (VTTHX)
2. Fidelity ZERO Total Market Index Fund (FZROX)
3. Vanguard Total Bond Market Index Fund (VBTLX)

This response is drastically different and entirely appropriate for a conservative investor. The phrase "56 years old, nearing retirement. I prefer low-risk, low-cost, highly diversified funds" is the context. Without it, asking the LLM the same question multiple times will yield scattered, generic, or even dangerous financial advice.

How is Context Passed to the Model?

Whether you use native provider APIs (OpenAI, Google) or orchestration frameworks like LangChain, context is not a separate magical parameter. It is embedded directly into the input messages.

A raw API call looks like this:

# Python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "system", "content": "You are a financial advisor."},     # System prompt
        {"role": "user", "content": "What is a good fund to invest in?"},  # User prompt or query
        {"role": "user", "content": "I am 56 and prefer low risk."},       # Context
    ],
    temperature=0.0,
)

Everything the LLM knows is stuffed into that messages array. In an agentic system, it is all about getting the right information into that array at the right time.

Context generally falls into three categories:

  • Static: Data that rarely changes (e.g., "User is a male, NY Yankees fan, foodie").

  • Dynamic: Data that evolves as the agent runs and interacts with tools (e.g., the results of a real-time stock price lookup).

  • Long-Lived: Data that spans across multiple sessions or days (e.g., "User already rejected the Vanguard recommendation yesterday").
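As a sketch, the three categories might be assembled into the messages array like this. The profile strings and message layout are illustrative, not a fixed convention:

```python
# Illustrative assembly of the three context categories into a messages
# array; the profile and history strings are examples from this post.

static_ctx = "User is a male, NY Yankees fan, foodie."
dynamic_ctx = "Live lookup: AAPL is trading at $212."  # fresh tool output
long_lived_ctx = "User already rejected the Vanguard recommendation yesterday."

def build_messages(user_query: str) -> list:
    return [
        {"role": "system", "content": "You are a financial advisor."},
        {"role": "system", "content": f"Profile: {static_ctx}"},        # static
        {"role": "system", "content": f"History: {long_lived_ctx}"},    # long-lived
        {"role": "user", "content": user_query},
        {"role": "user", "content": f"Tool result: {dynamic_ctx}"},     # dynamic
    ]
```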



In practice, building AI systems is often less about “prompt engineering” and more about deciding what information should be included in the model’s context at the moment of inference.

The data comes from a variety of sources. The cache holds the most recent data, the databases hold the system of record, and the logs capture events as they occur. Context engineering involves getting the right data from the right place at the right time.

Note that context can go stale. If you feed in stale context, you get less accurate answers. Keeping it current is part of the engineering.
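One minimal way to guard against staleness is to attach a fetch timestamp and a per-source TTL to each piece of context. The TTL values below are illustrative, not recommendations:

```python
import time

# Minimal freshness check, assuming each context entry records when it
# was fetched. The TTLs here are illustrative examples.

TTL_SECONDS = {
    "stock_price": 60,       # market data goes stale in seconds
    "user_profile": 86_400,  # profile data can live for a day
}

def is_fresh(kind: str, fetched_at: float, now: float = None) -> bool:
    """Return True if this context entry is still within its TTL."""
    now = time.time() if now is None else now
    return (now - fetched_at) <= TTL_SECONDS[kind]
```

Entries that fail the check get re-fetched before they are injected into the messages array.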

The Illusion of the Infinite Context Window


The context window determines how much of your context the LLM can hold for a conversation. The more it can remember, the better, right?

LLM providers aggressively advertise their context window sizes. Bigger appears better, but that is a dangerous trap for developers.

The context window simply represents the hard cap on how much text an LLM can "see" at once. Look at the landscape in early 2026:

  • Meta Llama 4 Scout: ~10 Million tokens

  • Gemini 3 Pro: ~1.0M - 2.0M tokens

  • OpenAI GPT-5.2: ~400,000 tokens

  • Claude 4.5 Sonnet: ~1.0M tokens

  • DeepSeek-R1: ~164,000 tokens

However, the context window is a model attribute, not an agent capability. Research in 2025 and 2026 has consistently shown that models degrade severely well before hitting their upper limits. This phenomenon is known as Context Rot.

Just because a model can accept 1 million tokens (about 8 full novels) doesn't mean it pays equal attention to all of them. Studies show that when a context window passes 50% capacity, models begin to heavily favor tokens at the very beginning or the very end of the prompt, completely ignoring critical constraints buried in the middle.

The industry is now focusing on the Maximum Effective Context Window (MECW). A model might advertise 1 million tokens, but its MECW—the point where accuracy actually drops off a cliff—might be only 130k tokens.
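A context pipeline can budget against an assumed MECW rather than the advertised limit. The sketch below uses a rough 4-characters-per-token heuristic in place of a real tokenizer, and the 130k figure is the illustrative MECW from above:

```python
# Budget context against an assumed Maximum Effective Context Window.
# The 4-chars-per-token estimate is a common rough heuristic, not a
# real tokenizer; swap in an actual tokenizer in production.

MECW_TOKENS = 130_000  # illustrative effective limit, not the advertised one

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fit_effective_window(chunks: list) -> list:
    """Keep the most recent chunks that fit inside the MECW budget.
    Assumes chunks are ordered oldest-first, newest-last."""
    kept, used = [], 0
    for chunk in reversed(chunks):  # walk newest to oldest
        cost = estimate_tokens(chunk)
        if used + cost > MECW_TOKENS:
            break
        kept.append(chunk)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Older turns that fall outside the budget are candidates for summarization rather than inclusion verbatim.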

The Agent Loop

Because of Context Rot, you cannot just dump an entire database into the LLM and expect it to figure things out. This is why we build Agents.

An LLM is a stateless text predictor. An Agent is a software loop that uses the LLM as a reasoning engine to manage its own context. Agents operate in a continuous cycle: Observe → Think → Act.

Imagine building an AI-based investment analysis product. The agent doesn't just ask the LLM one massive question. It loops:

  1. Observe: The user asks, "Should I adjust my portfolio for the upcoming rate cuts?"

  2. Think (LLM): The model realizes it lacks context. It outputs a tool-call: get_user_portfolio() and get_risk_tolerance().

  3. Act (Code): Agent code queries a PostgreSQL database to fetch the financial profile.

  4. Update Context: It appends only the relevant portfolio metrics to the messages array.

  5. Loop: The agent sends this newly enriched, highly specific context back to the LLM to generate the final advice.

In this loop, the context is actively mutating. The agent is continuously pruning the messages array, summarizing old turns, and injecting fresh tool outputs to keep the token count well within the Maximum Effective Context Window.
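The loop above can be sketched with a stand-in model and a toy tool registry. Everything here is illustrative scaffolding, not a real agent framework or a real LLM call:

```python
# Toy Observe -> Think -> Act loop. fake_llm stands in for a real model;
# the tool registry is a single hard-coded lookup.

def fake_llm(messages):
    """Stand-in model: requests the portfolio once, then answers."""
    has_tool_result = any("TOOL:" in m["content"] for m in messages)
    if not has_tool_result:
        return {"tool_call": "get_user_portfolio"}
    return {"answer": "Shift 10% from growth funds into bonds."}

TOOLS = {"get_user_portfolio": lambda: "60% stocks, 30% bonds, 10% cash"}

def run_agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]      # Observe
    while True:
        decision = fake_llm(messages)                        # Think
        if "tool_call" in decision:
            result = TOOLS[decision["tool_call"]]()          # Act
            messages.append(
                {"role": "user", "content": f"TOOL: {result}"}  # Update context
            )
            continue                                         # Loop
        return decision["answer"]
```

A production loop adds the pruning and summarization described above so the messages array never outgrows the effective window.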

Prompt Engineering vs. Context Engineering

If prompt engineering is about how you ask the question, context engineering is about what the model knows before it attempts to answer.

To use an operating system analogy: the LLM is the CPU, the prompt is the executable command, the context window is the RAM, and the context is the data loaded into that RAM.

  • Prompt Engineering is writing a better command. It is user-facing, static, and brittle.

  • Context Engineering is managing the data in RAM. It is developer-facing, dynamic, and systemic.

As we move toward enterprise-grade AI, prompts are no longer enough. Context engineering involves building the infrastructure that feeds the model. It encompasses Retrieval-Augmented Generation (RAG) to find specific documents, Episodic Memory Graphs to track user decisions over time, and Context Pruning to prevent token overflow.

The Frontier: Context Graphs

While context engineering today is mostly about managing lists of messages, the future of enterprise AI lies in Context Graphs. Current LLM context is linear—a flat, chronological scroll of "User said X, Agent did Y." This works for chat, but it fails for complex enterprise workflows. Real-world business data isn't a timeline; it's a web of relationships.

Enter the Context Graph. Instead of dumping raw logs into the window, advanced agents now build and maintain a dynamic graph structure. Nodes represent entities (User, File, Decision, Error). Edges represent causality or relationships (e.g., User Upload caused Error 500, which triggered Retry Logic).

This structure transforms the context from a "temporary scratchpad" into an organizational brain. If a human auditor later asks, "Why did the agent reject this loan application?", a linear log forces the LLM to re-read thousands of lines of text to guess the reason. A Context Graph simply traverses the edge: Loan Application -> rejected_because -> Risk Score > 80.
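A toy version of such a graph and its traversal, with illustrative node and edge names mirroring the loan example:

```python
# Minimal context graph: nodes are entities, edges are labeled
# relationships. All names here are illustrative.

class ContextGraph:
    def __init__(self):
        self.edges = {}  # (source_node, edge_label) -> target_node

    def add_edge(self, source, label, target):
        self.edges[(source, label)] = target

    def why(self, node, label):
        """Answer an audit question by traversing one edge,
        instead of re-reading a linear log."""
        return self.edges.get((node, label))

graph = ContextGraph()
graph.add_edge("LoanApplication#123", "rejected_because", "RiskScore > 80")
```

The traversal is a constant-time lookup, regardless of how many log lines the agent produced along the way.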

For enterprise applications, this is the missing link. It allows agents to reason across disconnected data points (e.g., linking a Slack message from Tuesday to a Code Commit on Friday) without needing a massive, expensive context window to hold all the noise in between.

Conclusion

A perfectly engineered prompt might get you a clever answer once. But a well-engineered context pipeline ensures your Agent gets the accurate answer securely, cost-effectively, and consistently, every single time.

Popular LLMs are advanced and sophisticated, but everyone has access to them. You gain no advantage simply because you use an LLM. Your advantage, and your intellectual property, lies in how you manage and feed context to the LLM. Though the underlying data work is not new, how you collect, store, and retrieve the data that forms the context is the real engineering.