Code & Cluster
Thursday, April 9, 2026
Why Deterministic AI Agents Are The Wrong Goal?
Saturday, March 7, 2026
AI Context Explained: The Real Engineering Behind Modern AI Systems
Most discussions about large language models focus on prompts — how to phrase instructions to get better responses. But in real AI systems, prompts are only a small part of the story.
What actually determines the quality of an AI system is context: the information available to the model when it generates a response. This includes prompts, conversation history, retrieved documents, tool outputs, and sometimes structured application state. Designing how this information is assembled and provided to the model is what many engineers now call context engineering.
Providing the right context to the LLM is the only reliable way to get accurate, production-grade answers. In this blog, I explore what context actually is, the hidden dangers of massive context windows, and how context should be used in agentic AI.
Context
Context refers to all the information that is not in the user's immediate question, but is required to help the LLM generate a relevant, highly specific answer. It is the data that gives the LLM situational awareness.
Consider this basic prompt:
Prompt: What is a good stock mutual fund to invest in?
Response (abbreviated):
1. T. Rowe Price Global Technology Fund (PRGTX)
2. Wasatch Ultra Growth Fund (WAMCX)
For many investors, both of these are far too aggressive, high-risk, and expensive. Let's change the prompt slightly to inject some context:
Prompt: What is a good stock mutual fund to invest in? I am 56 years old, nearing retirement. I prefer low-risk, low-cost, highly diversified funds.
Response (abbreviated):
1. Vanguard Target Retirement 2035 Fund (VTTHX)
2. Fidelity ZERO Total Market Index Fund (FZROX)
3. Vanguard Total Bond Market Index Fund (VBTLX)
This response is drastically different and entirely appropriate for a conservative investor. The phrase "56 years old, nearing retirement. I prefer low-risk, low-cost, highly diversified funds" is the context. Without it, asking the LLM the same question multiple times will yield scattered, generic, or even dangerous financial advice.
How is Context Passed to the Model?
Whether you use native provider APIs (OpenAI, Google) or orchestration frameworks like LangChain, context is not a separate magical parameter. It is embedded directly into the input messages.
A raw API call looks like this:
client.responses.create(
    model="gpt-5.2",
    messages=[
        {"role": "system", "content": "You are a financial advisor."},  # system prompt
        {"role": "user", "content": "What is a good fund to invest in?"},  # user prompt or query
        {"role": "user", "content": "I am 56 and prefer low risk."},  # context
    ],
    temperature=0.0
)
Everything the LLM knows is stuffed into that messages array. In an agentic system, it is all about getting the right information into that array at the right time.
Context generally falls into three categories:
Static: Data that rarely changes (e.g., "User is a male, NY Yankees fan, foodie").
Dynamic: Data that evolves as the agent runs and interacts with tools (e.g., the results of a real-time stock price lookup).
Long-Lived: Data that spans across multiple sessions or days (e.g., "User already rejected the Vanguard recommendation yesterday").
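As a concrete (and purely illustrative) sketch, the three categories can be merged into the messages array before each call. The helper below and its field names are hypothetical, not part of any provider SDK:

```python
# Toy sketch: assembling static, dynamic, and long-lived context into
# the messages array before an LLM call. Names are illustrative only.

def build_messages(static_ctx, dynamic_ctx, long_lived_ctx, user_query):
    """Combine the three context categories with the user's query."""
    context_lines = []
    context_lines.extend(static_ctx)      # rarely changes
    context_lines.extend(long_lived_ctx)  # spans sessions
    context_lines.extend(dynamic_ctx)     # produced during this run
    return [
        {"role": "system", "content": "You are a financial advisor."},
        {"role": "user", "content": "\n".join(context_lines)},
        {"role": "user", "content": user_query},
    ]

msgs = build_messages(
    static_ctx=["User is 56 years old, nearing retirement."],
    dynamic_ctx=["Latest VTTHX price: $48.12."],
    long_lived_ctx=["User rejected the Vanguard recommendation yesterday."],
    user_query="What is a good fund to invest in?",
)
```

The point is that the model never sees "categories"; everything is flattened into that one array.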
The Illusion of the Infinite Context Window
LLM providers aggressively advertise their context window sizes. Bigger appears better, but that is a dangerous trap for developers.
The context window simply represents the hard cap on how much text an LLM can "see" at once. Look at the landscape in early 2026:
Meta Llama 4 Scout: ~10 Million tokens
Gemini 3 Pro: ~1.0M - 2.0M tokens
OpenAI GPT-5.2: ~400,000 tokens
Claude 4.5 Sonnet: ~1.0M tokens
DeepSeek-R1: ~164,000 tokens
However, the context window is a model attribute, not an agent capability. Research in 2025 and 2026 has consistently shown that models degrade severely well before hitting their upper limits. This phenomenon is known as Context Rot.
Just because a model can accept 1 million tokens (about 8 full novels) doesn't mean it pays equal attention to all of them. Studies show that when a context window passes 50% capacity, models begin to heavily favor tokens at the very beginning or the very end of the prompt, completely ignoring critical constraints buried in the middle.
The industry is now focusing on the Maximum Effective Context Window (MECW). A model might advertise 1 million tokens, but its MECW—the point where accuracy actually drops off a cliff—might be only 130k tokens.
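One way to respect an MECW in practice is to trim the conversation before every call. The sketch below assumes a 130k-token effective budget and approximates token counts with a word count; a real system would use the provider's tokenizer:

```python
# Toy sketch of keeping a conversation under an assumed "effective"
# token budget (MECW) rather than the advertised maximum.

MECW_TOKENS = 130_000  # assumed effective limit, far below the advertised max

def approx_tokens(message):
    """Crude stand-in for a real tokenizer: count whitespace-split words."""
    return len(message["content"].split())

def trim_to_budget(messages, budget=MECW_TOKENS):
    """Keep system messages; evict the oldest other turns until we fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(approx_tokens, system + rest)) > budget:
        rest.pop(0)  # evict the oldest turn first
    return system + rest
```

Eviction-by-age is the simplest policy; summarizing old turns instead of dropping them is the obvious refinement.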
The Agent Loop
Because of Context Rot, you cannot just dump an entire database into the LLM and expect it to figure things out. This is why we build Agents.
An LLM is a stateless text predictor. An Agent is a software loop that uses the LLM as a reasoning engine to manage its own context. Agents operate in a continuous cycle: Observe → Think → Act.
Imagine building an AI-based investment analysis product. The agent doesn't just ask the LLM one massive question. It loops:
Observe: The user asks, "Should I adjust my portfolio for the upcoming rate cuts?"
Think (LLM): The model realizes it lacks context. It outputs tool calls: get_user_portfolio() and get_risk_tolerance().
Act (Code): Agent code queries a PostgreSQL database to fetch the financial profile.
Update Context: It appends only the relevant portfolio metrics to the messages array.
Loop: The agent sends this newly enriched, highly specific context back to the LLM to generate the final advice.
In this loop, the context is actively mutating. The agent is continuously pruning the messages array, summarizing old turns, and injecting fresh tool outputs to keep the token count well within the Maximum Effective Context Window.
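A stripped-down version of this loop, with a stand-in for the LLM and the hypothetical get_user_portfolio tool from the example above, might look like:

```python
# Minimal Observe -> Think -> Act loop. fake_llm and get_user_portfolio
# are stand-ins; real agents call a model API and real tools.

def fake_llm(messages):
    """Stand-in for the model: asks for a tool until context is present."""
    if not any("portfolio:" in m["content"] for m in messages):
        return {"type": "tool_call", "name": "get_user_portfolio"}
    return {"type": "answer", "content": "Shift 10% from equities to bonds."}

def get_user_portfolio():
    """Stand-in for the PostgreSQL query in the Act step."""
    return "portfolio: 70% equities, 30% bonds"

def run_agent(user_query):
    messages = [{"role": "user", "content": user_query}]   # Observe
    while True:
        step = fake_llm(messages)                          # Think
        if step["type"] == "tool_call":                    # Act
            messages.append(
                {"role": "user", "content": get_user_portfolio()}
            )                                              # Update context
            continue                                       # Loop
        return step["content"]
```

The loop terminates only when the model stops asking for tools, which is why real agents also cap the number of iterations.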
Prompt Engineering vs. Context Engineering
If prompt engineering is about how you ask the question, context engineering is about what the model knows before it attempts to answer.
To use an operating system analogy: the LLM is the CPU, the prompt is the executable command, the context window is the RAM, and the context is the data in that RAM.
Prompt Engineering is writing a better command. It is user-facing, static, and brittle.
Context Engineering is managing the data in RAM. It is developer-facing, dynamic, and systemic.
As we move toward enterprise-grade AI, prompts are no longer enough. Context engineering involves building the infrastructure that feeds the model. It encompasses Retrieval-Augmented Generation (RAG) to find specific documents, Episodic Memory Graphs to track user decisions over time, and Context Pruning to prevent token overflow.
The Frontier: Context Graphs
While context engineering today is mostly about managing lists of messages, the future of enterprise AI lies in Context Graphs. Current LLM context is linear—a flat, chronological scroll of "User said X, Agent did Y." This works for chat, but it fails for complex enterprise workflows. Real-world business data isn't a timeline; it's a web of relationships.
Enter the Context Graph. Instead of dumping raw logs into the window, advanced agents now build and maintain a dynamic graph structure. Nodes represent entities (User, File, Decision, Error). Edges represent causality or relationships (e.g., User Upload caused Error 500, which triggered Retry Logic).
This structure transforms the context from a "temporary scratchpad" into an organizational brain. If a human auditor later asks, "Why did the agent reject this loan application?", a linear log forces the LLM to re-read thousands of lines of text to guess the reason. A Context Graph simply traverses the edge: Loan Application -> {rejected_because} -> Risk Score > 80.
For enterprise applications, this is the missing link. It allows agents to reason across disconnected data points (e.g., linking a Slack message from Tuesday to a Code Commit on Friday) without needing a massive, expensive context window to hold all the noise in between.
Conclusion
A perfectly engineered prompt might get you a clever answer once. But a well-engineered context pipeline ensures your Agent gets the accurate answer securely, cost-effectively, and consistently, every single time.
Popular LLMs are advanced and sophisticated, but everyone has access to them. You gain no advantage merely by using an LLM. Your advantage and intellectual property lie in how you manage and feed context to the LLM. Though not new, how you collect, store, and retrieve the data that forms the context is the real engineering.
Monday, February 9, 2026
What is (Agentic) AI Memory?
What do people mean when they say "AI Memory"?
At their core, LLMs are stateless functions: you make a request with a prompt and some context data, and the model provides a response.
In real systems, AI memory usually means:
- Storing past interactions, user preferences, decisions, goals, or facts.
- Retrieving relevant parts later
- Feeding a compressed version back into the prompt
So yes — at its core:
Memory = save → retrieve → summarize → inject into context
Nothing magical. But is that all? It sounds just like a regular cache. Read on.
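As a toy illustration of save → retrieve → summarize → inject, with keyword overlap standing in for real relevance scoring:

```python
# Toy memory: save facts, retrieve the relevant ones by keyword overlap,
# and inject a compressed version into the prompt. Illustrative only;
# real systems use embeddings and smarter summarization.

class Memory:
    def __init__(self):
        self.facts = []

    def save(self, fact):
        self.facts.append(fact)

    def retrieve(self, query, k=2):
        """Rank facts by word overlap with the query; keep the top k."""
        q = set(query.lower().split())
        scored = sorted(self.facts,
                        key=lambda f: len(q & set(f.lower().split())),
                        reverse=True)
        return scored[:k]

    def inject(self, query):
        """Build a prompt with only the relevant, distilled facts."""
        relevant = self.retrieve(query)
        return "Known facts: " + "; ".join(relevant) + "\nQuestion: " + query

mem = Memory()
mem.save("User prefers low-risk index funds")
mem.save("User is a NY Yankees fan")
prompt = mem.inject("Which low-risk funds should the user buy?")
```

Note that only the relevant facts reach the prompt; the irrelevant baseball fact can stay in storage.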
Is this just RAG (Retrieval-Augmented Generation)?
Purpose:
- Bring external knowledge into the LLM
- Docs, PDFs, financial data, code, policies
Typical traits:
- Retrieval is stateless per query
- Large text chunks
- Query-driven retrieval
- “What additional data can we provide to LLM to help answer this question?”
Agent / User Memory
- Maintain continuity
- Personalization
- Learning user intent and preferences over time
Typical traits:
- Long-lived
- Highly structured
- Small, distilled facts
- “What can I provide to LLM so it remembers this user?”
Think of it this way:
They can often use the same retrieval tools, but they serve different roles.
Where is the memory?
The simplest place is inside the agent process itself, which is suitable for cases where the agent loop is short and no persistence is needed. For longer-lived memory, frameworks such as LangGraph memory, LlamaIndex memory, and MemGPT try to make it easier for agents to store and retrieve.
The mental model for AI memory
Memory and the LLM
You do not want to add large amounts of arbitrary data as context because:
- Text is converted to tokens, and token costs spiral
- LLM attention degrades with noise
- Latency increases
- Reasoning quality declines
Real Agentic Memory
To be useful in an agentic setting, what is stored in memory needs to evolve. Older or irrelevant data needs to be "forgotten" or evicted based on intelligence, not standard algorithms like FIFO or LRU. Updates and evictions need to happen based on recent interactions. If historical information is too long but should not be evicted, it may need to be compressed.
Agentic systems require more dynamic memory evolution than typical CRUD applications. For long-running agents, the quality of the data in memory has to get better with interactions over time.
How exactly that can be implemented is beyond the scope of this blog and could be a topic for a future one.
Considerations
Summarize and abstract to extract intelligence, as opposed to dumping large quantities of data.
In conclusion
Conceptually, it is similar to RAG, but the two apply to different use cases.
Better and smaller contexts beat large contexts and large memory.
Agentic AI memory adds value only when:
- The system changes behavior (for the better) because of it
- It produces better responses, explanations, and reasoning
- It saves time
These ideas are not purely theoretical. While building Vestra — an AI agent focused on personal financial planning and modeling — I’ve had to think deeply about what should be remembered, what should be abstracted, and what should be discarded. In financial reasoning especially, raw history is far less useful than structured, evolving state.
But yes, Agentic memory will be different than what we know as memory in regular apps — in the ways it is updated, evicted, and retrieved.
Thursday, January 15, 2026
Unique ID generation
Unique ID generation is a deceptively simple task in application development. While it can seem trivial in a monolithic system, getting it right in a distributed system presents a few challenges. In this blog, I explore the commonly used ID generation techniques, where they fail and how to improve on them.
Why are unique IDs needed in applications?
They are needed to prevent duplicates. Say the system creates users, products, or any other entity. You could use the name as the key for a uniqueness constraint, but there could be two users or products with the same name. For users, email is an option, but that might not be available for other kinds of entities.
Sometimes you are consuming messages from other systems over platforms like Kafka. Processing the same message multiple times can lead to errors, and duplicate delivery can happen for a variety of reasons outside the consumer's control. Senders therefore include a unique ID with the message so that the consumer can ignore IDs it has already processed (idempotency).
They can be useful for ordering. Ordering determines which event happened before others, which is useful if you want to query for the most recent events or ignore older ones.
What should the size and form of an ID be?
Should it be numeric or alphanumeric? Should it be 32-bit, 64-bit, or larger?
With 32 bits, the maximum numeric ID you can generate is ~4 billion: sufficient for most systems, but not enough if your product is internet scale. With 64 bits, you can generate roughly 18 quintillion IDs (about 1.8 x 10^19).
But size is not the only issue. IDs will not be generated in sequence in one place, one after the other. In any system of even moderate scale, unique IDs will need to be generated from multiple nodes.
On the topic of form, numeric IDs are generally preferred as they take up less storage and can be easily ordered and indexed.
In the rest of the blog, I go over some unique id generation techniques I have come across.
ID Generation Techniques
1. Auto increment feature of the database
If your service uses a database, this becomes an obvious choice, as every database supports auto increment.
With PostgreSQL, you would set up the table as
CREATE TABLE users (id SERIAL PRIMARY KEY, name TEXT not null);
With MySQL,
CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(255) not null);
Every insert into the table generates an id in the id column.
INSERT INTO users(name) VALUES ('JOHN') ;
INSERT INTO users(name) VALUES ('MARY') ;
 id | name
----+------
  1 | JOHN
  2 | MARY
While this is simple to set up and use, it is appropriate only for simple CRUD applications with low usage. The disadvantages are:
- It requires an insert to the db to get an id
- If an insert fails for other reasons, you can have gaps in the sequence
- For distributed applications, it is a single point of failure
The single point of failure issue can be solved by setting up multiple databases in a multi-master arrangement, as shown below. In fact, here we are handing out IDs in batches of a hundred to reduce the load on the database.
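The batching idea can be sketched as follows. Here next_block() stands in for a single database round trip that reserves a range of 100 IDs; the class and method names are illustrative:

```python
# Sketch: hand out IDs in batches of 100 so the application only goes
# to the database once per 100 IDs. The in-memory counter stands in
# for the database's auto-increment sequence.

class BatchedIdAllocator:
    BATCH = 100

    def __init__(self):
        self._db_counter = 0  # stand-in for the database sequence
        self._next = 0
        self._limit = 0

    def _next_block(self):
        """One 'database' round trip reserves a block of 100 IDs."""
        self._db_counter += self.BATCH
        return self._db_counter - self.BATCH + 1, self._db_counter

    def next_id(self):
        if self._limit == 0 or self._next > self._limit:
            self._next, self._limit = self._next_block()
        nid = self._next
        self._next += 1
        return nid

alloc = BatchedIdAllocator()
first = [alloc.next_id() for _ in range(3)]  # only the first call hits the "DB"
```

The trade-off: if a node crashes mid-batch, the unused IDs in its block are lost, producing gaps.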
2. UUID
These are 128-bit identifiers, typically written as 36-character alphanumeric strings, that are quite easy to generate and suitable as IDs because collisions are extremely unlikely. An example UUID is e3d6d0e9-b99f-4704-b672-b7f3cdaf5618
The advantages of UUIDs are:
- Easy to generate: just add a UUID generator to the service
- Low probability of collision
- Each replica will generate unique IDs independently
The disadvantages are:
- 128 bits per ID eventually takes up space
- No ordering: IDs are random
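In Python, for example, the standard library can generate a random (version 4) UUID:

```python
# Generate a random (version 4) UUID with the standard library.
import uuid

uid = uuid.uuid4()  # 128 random bits
print(uid)          # rendered as 32 hex digits plus 4 hyphens
```

Most languages ship an equivalent generator, which is what makes UUIDs so convenient despite their size.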
3. Flickr ticket server
The Flickr ticket server is an improvement on the auto increment feature. In the example in technique 1, the ID is tied to the users table: you get a new ID only when you insert into the users table. If you needed unique IDs for another entity, you would need to add an auto-increment column to that table. But what if you needed unique IDs in increasing order across tables or databases, as would be the case in a distributed system?
We could create a generic table
CREATE TABLE idgenerator (id SERIAL PRIMARY KEY);
This can work, but it would keep accumulating rows of IDs that are also stored elsewhere.
What they did at Flickr was this:
create table ticketserver (id int auto_increment primary key, stub char(1));
When they need an ID, they do:
replace into ticketserver(stub) values ('a');
select LAST_INSERT_ID();
This table always has only one row, because every time you need an ID you are replacing the previous row. The code above is MySQL specific.
For SQL that will work on PostgreSQL, you can do it a little differently:
create table ticketserver (stub char(1) primary key, id bigint not null default 0);

insert into ticketserver (stub, id)
values ('a', coalesce((select id from ticketserver where stub = 'a'), 0))
on conflict (stub) do update
  set id = ticketserver.id + 1
returning id;
4. Snowflake (Twitter approach)
This is a Twitter approach that does not rely on the database or any third-party service. IDs can be generated in code just by following the specification.
It is a 64-bit ID.
The left-most bit is a sign bit reserved for future use.
The next 41 bits are used for the timestamp: the current time in ms since a chosen epoch (Twitter used Nov 4, 2010 01:42:54 UTC, but you can use any epoch).
The next 5 bits are for the datacenter ID, giving 2^5 = 32 datacenters.
The next 5 bits are for the machine ID: 32 machines per datacenter.
The last 12 bits are a sequence ID, giving 2^12 = 4096 IDs per ms (per machine per datacenter).
The value of 2^41 - 1 is 2,199,023,255,551. With 41 bits for the timestamp, you get a lot of IDs, and the scheme lasts almost 70 years from the chosen epoch.
You can change things to reduce the size from 64 bits if needed: you may not need a datacenter ID, or you can decide to use fewer bits for the timestamp.
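Composing such an ID is just bit shifting. The sketch below uses Twitter's original epoch; the function name and defaults are illustrative:

```python
# Pack a Snowflake-style 64-bit ID:
# 41-bit timestamp | 5-bit datacenter | 5-bit machine | 12-bit sequence.
import time

EPOCH_MS = 1288834974657  # Twitter's epoch: Nov 4, 2010 01:42:54 UTC

def snowflake(datacenter_id, machine_id, sequence, now_ms=None):
    """Combine the four fields into one 64-bit integer."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ts = now_ms - EPOCH_MS          # milliseconds since the chosen epoch
    assert 0 <= datacenter_id < 32  # 5 bits
    assert 0 <= machine_id < 32     # 5 bits
    assert 0 <= sequence < 4096     # 12 bits
    return (ts << 22) | (datacenter_id << 17) | (machine_id << 12) | sequence

sid = snowflake(datacenter_id=1, machine_id=2, sequence=0)
```

A production generator also has to increment the sequence within a millisecond and handle clock drift, which this sketch omits.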
The advantages of this approach are:
- Decentralized generation: no dependency on a DB, lower latency
- Time-based ordering
- Higher throughput
- Datacenter ID and machine ID can help in debugging
The disadvantages are:
- Clocks need to be synchronized
- Datacenters/machines need unique IDs
- The epoch needs to be chosen wisely
Summary
ID generation evolves with your system's scale. When you are starting out, it is normal to keep things simple and go with auto increment, but sooner or later you will need to scale (a good problem to have). The Flickr and Twitter methods are solid. I personally like the Twitter approach, as it has no dependency on the database: it offers an excellent balance of decentralization, ordering, and efficiency, but requires clock synchronization. Whatever approach you choose, ensure it aligns with your system's consistency requirements, scaling needs, and tolerance for operational complexity.
Sunday, November 30, 2025
Review : Facebook Memcache Paper
Introduction
The Facebook (Meta) application handles billions of requests per second. Responses are built from thousands of items that need to be retrieved at low latency to ensure a good user experience. Low latency is achieved by retrieving the items from a cache. Once the business grew, a single-node cache like Memcached could obviously not handle the load. This paper is about how they solved the problem by building a distributed cache on top of Memcached.
In 2025, some might say the paper (2013) is a little dated. But I think it is still interesting for several reasons. It is one of the early papers from a time when cloud and distributed systems exploded; I put it in the same category as the Dynamo paper from Amazon. While better technology is available today, this paper teaches important concepts. More importantly, it shows how to take available technology and make more out of it: in this case, they took a single-node cache and built a distributed cache on top of it.
This is my summary of the paper Scaling Memcache at Facebook.
Requirements at Facebook
- Allow near real-time communication
- Aggregate content on the fly
- Access and update popular shared content
- Scale to millions of user requests per second
The starting point was single node Memcached servers.
The target was a general-purpose distributed in-memory key-value store, called Memcache, that would be used by a variety of application use cases.
In the paper and in this blog, Memcached refers to the popular open source in-memory key-value store, and Memcache is the distributed cache that Facebook built.
Observations:
- Read volumes are several orders of magnitude higher than write volumes.
- Data is fetched from multiple sources: HDFS, MySQL, etc.
- Memcached supports simple primitives: set, get, delete.
Details
Memcache is a demand-filled, look-aside cache. There can be thousands of Memcached servers within a Memcache cluster.
When an application needs to read data, it tries to get it from memcache. If not found in Memcache, it gets the data from the original source and populates Memcache.
When the application needs to write data, it writes to the original source and the corresponding key in Memcache is invalidated.
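The read and write paths described above amount to a few lines of Python, with dicts standing in for Memcache and the storage tier:

```python
# Sketch of the demand-filled look-aside pattern: reads try the cache
# and fill it on a miss; writes go to storage and invalidate the key.

cache = {}                    # stand-in for Memcache
storage = {"user:1": "JOHN"}  # stand-in for MySQL

def read(key):
    if key in cache:          # cache hit
        return cache[key]
    value = storage[key]      # miss: fetch from the source of truth
    cache[key] = value        # demand fill
    return value

def write(key, value):
    storage[key] = value      # write to storage first
    cache.pop(key, None)      # then invalidate (not update) the cached key

read("user:1")                # fills the cache
write("user:1", "JOHN SMITH") # invalidates the stale entry
assert "user:1" not in cache
assert read("user:1") == "JOHN SMITH"
```

Invalidating instead of updating on write is what makes deletes idempotent and keeps the cache from holding values the storage tier never committed.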
Wide fan-out: when the front-end servers scale, the backend cache needs to scale too.
Items are distributed across the Memcached servers using consistent hashing.
To handle a request, a front-end server might need to get data from many Memcached servers.
Front-end servers use a Memcache client to talk to the Memcached servers. The client is either a library or a proxy called mcrouter. Memcached servers do not communicate with each other.
Invalidation of a Memcached key is done by code running on the storage tier, not by the client.
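The consistent hashing used for item placement can be sketched as a simple hash ring; real deployments add many virtual nodes per server to even out the load:

```python
# Minimal consistent-hash ring: each server gets a position on the ring
# and a key is owned by the first server clockwise from its hash.
import bisect
import hashlib

def point(s):
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        self.points = sorted((point(s), s) for s in servers)
        self.keys = [p for p, _ in self.points]

    def server_for(self, key):
        # First server clockwise from the key's position (wrapping around).
        i = bisect.bisect(self.keys, point(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["mc1", "mc2", "mc3"])
owner = ring.server_for("user:42")
```

The payoff is that adding or removing a server only remaps the keys on the affected arc, instead of reshuffling every key as a plain modulo hash would.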
Communication
UDP is used for get requests (surprised? clearly an optimization), sent directly from the client in the web server to Memcached. Dropped requests are treated as a cache miss.
TCP via mcrouter is used for set and delete requests. Using mcrouter helps manage the number of open connections.
The client implements flow control to limit load on backend components.
Leases
Leases were implemented to address stale sets and thundering herds. Stale sets are caused by updating the cache with out-of-date values; requiring a lease lets the system verify that an update is still valid. A thundering herd happens when there is heavy read-write activity on the same key at the same time; by handing out leases for a key only every so often (say, every 10 seconds), the system slows things down.
Memcache pools
This is a general-purpose caching layer used by different applications and workloads with different requirements. To avoid interference, a cluster's servers are partitioned into pools: for example, a pool for keys that are accessed frequently and cannot tolerate a cache miss, and another pool for infrequently accessed keys. Each pool can be scaled separately depending on requirements.
Failures
For small failures, requests are directed to a set of dedicated backup servers called gutters. When a large number of servers in a cluster are down, the entire cluster is considered offline and traffic is routed to another cluster.
Topology
A frontend cluster is a group of web servers with Memcache.
A region is one or more frontend clusters plus storage.
This keeps the failure domains small.
Storage is the final source of truth. A daemon called mcsqueal, together with mcrouter, invalidates the cache when the database is updated.
When going across datacenters and geographic regions, the storage master in one region replicates to replicas in other regions. On an update, when Memcached needs to be invalidated by the storage layer, there is no problem if the update happens in the master region. But if the update is in a replica region, a read after the write might return stale data, since the replica might not have caught up. In replica regions, remote markers are used to ensure that only data in sync with the master is read.
Summary
In summary, this paper shows how Facebook took a single-node Memcached and scaled it to its growing needs. No new fundamental theory is applied or discussed, but the paper demonstrates the engineering innovation and trade-offs needed to scale a product in production to meet user demand.
The key point is that separating cache and storage allowed each to be scaled separately. The team kept the focus on monitoring, debugging, and operational efficiency, and gradual rollout and rollback of features kept a running system running.
Some might say that Memcache is a hack. But some loss in architectural purity is worth it -- if your users and business stay happy.
Memcache and Facebook were developed together, with application and systems programmers working together to evolve and scale the system. This does not happen when teams work in isolation.
Sunday, October 5, 2025
Data Storage For Analytics And AI
For a small or medium-sized company, storing all the data in a relational database like PostgreSQL or MySQL is sufficient. If analytics is needed, they might also use a columnar store, more like a data warehouse.
If your business grows to handle large volumes of unstructured data—maybe logs from your e-commerce site, emails, support tickets, images, or customer audio—storing everything in a single RDBMS becomes impossible. These new data types require specialized architectures designed for scale, flexibility, and advanced analytics (like Machine Learning and Generative AI).
Here is a guide to the key data storage paradigms you will encounter:
1. Relational Database Management Systems (RDBMS)
This needs no introduction.
Primary Use Case: Online Transaction Processing (OLTP). Applications requiring fast, frequent reads and writes, and ACID compliance.
Data Structure: Data is modeled as normalized rows and columns. Explicit relationships are enforced using foreign keys. In most cases, storage is implemented as a B+ tree.
2. Data Warehouse (DW)
JOIN queries.
3. Data Lake
4. Data Lakehouse
5. NoSQL Database
6. Vector Database
Summary
In essence, the modern enterprise no longer relies on a single data storage solution. The journey usually starts with the RDBMS for transactional integrity, moves to the Data Warehouse for structured BI, and expands into the Data Lake to capture all raw, unstructured data necessary for Machine Learning and discovery.
The Data Lakehouse is the cutting-edge step, unifying these functions by bringing governance and performance directly to the lake. Vector Databases bridge the gap between unstructured data and the world of Generative AI.
Note that there is some overlap between the categories. For example, PostgreSQL supports JSONB and vector storage, making it useful for some NoSQL and AI use cases. Some products that started off as data lakes added features to become lakehouses.
Saturday, September 13, 2025
What Does Adding AI To Your Product Even Mean?
Introduction
I have been asked this question multiple times: "My management sent out a directive to all teams to add AI to the product, but I have no idea what that means."
In this blog I discuss what adding AI actually entails, moving beyond the hype to practical applications and some things you might try.
At its core, adding AI to a product means using an AI model, either the more popular large language model (LLM) or a traditional ML model, to either:
- predict answers
- generate new data: text, images, audio, etc.
The effect is that it enables the product to:
- do a better job of responding to queries
- automate repetitive tasks
- personalize responses
- extract insights
- reduce manual labor
It's about making your product smarter, more efficient, and more valuable by giving it capabilities it didn't have before.
In any domain where there is a huge body of published knowledge (programming, healthcare) or vast quantities of data (e-commerce, financial services, health, manufacturing, etc.), too large for the human brain to comprehend, AI has a place and will outperform what we currently do.
So how do you go about adding AI?
1. Requirements
2. Model
The recent explosion of interest in AI is largely due to Large Language Models (LLMs) like ChatGPT. At its core, an LLM is a text prediction engine: give it some text and it will give you the text likely to follow.
But beyond text generation, LLMs have been trained with a lot of published digital data, and they retain associations between pieces of text. On top of that, they are trained with real-world examples of questions and answers. For example, the reason they do such a good job at generating programming code is that they are trained on real source code from GitHub repositories.
Which model to use?
The choices are:
- Commercial LLMs like ChatGPT, Claude, Gemini, etc.
- Open source LLMs like Llama, Mistral, DeepSeek etc
- Traditional ML models
3. Agent
- Accepts requests either from a UI or another service
- Makes requests to the model on behalf of your system
- Makes multiple API calls to systems to fetch data
- May search the internet
- May save state to a database at various times
- In the end, returns a response or starts some process to finish a task
4. Data pipeline
A generic AI model can only do so much. Even without additional training, just adding your data to the prompts can yield better results.
The data pipeline is what makes the data in your databases, logs, ticket systems, github, Jira etc available to the models and agents.
- get the data from source
- clean it
- format it
- transform it
- use it in prompts or to further train the model
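Those steps compose naturally as small functions. The source data and cleaning rules below are made up for illustration:

```python
# Sketch of the five pipeline steps: get, clean, format, transform,
# and use the data as prompt context. All data here is fabricated.

def get_data():
    return ["  Ticket #1: refund REQUESTED ", "", "Ticket #2: login error"]

def clean(rows):
    """Drop empties and strip whitespace."""
    return [r.strip() for r in rows if r.strip()]

def format_rows(rows):
    """Normalize casing."""
    return [r.lower() for r in rows]

def transform(rows):
    """Add structure useful downstream."""
    return [{"text": r, "length": len(r)} for r in rows]

def to_prompt_context(records):
    """Flatten records into text that can be injected into a prompt."""
    return "\n".join(r["text"] for r in records)

context = to_prompt_context(transform(format_rows(clean(get_data()))))
```

In a real pipeline, each stage would be a scheduled job reading from your databases, logs, or ticket systems, but the shape is the same.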
5. Monitoring
Now let us see how these concepts translate into some very simple real-world applications across different industries.
Examples
1. Healthcare: Enhancing Diagnostics and Patient Experience
Adding AI can mean:
Personalized Treatment Pathways: An AI Agent can analyze vast amounts of research papers, clinical trial data, and individual patient responses to suggest the most effective treatment plan tailored to a specific patient's profile.
Example: For a person with high cholesterol, an AI agent can come up with a personalized diet and exercise plan.
2. Finance: Personalized Investing
Adding AI could mean:
Personalized Financial Advice: Here, an AI agent can serve as an "advisor," offering highly tailored investment portfolios and financial planning advice.
Example: A banking app's AI agent uses an LLM to understand your financial goals and then uses its "tools" to connect to your accounts, pull real-time market data, and recommend trades on your behalf. It can then use its LLM to explain in simple terms why it made a specific trade or rebalanced your portfolio.
3. E-commerce: Customer Experience
Adding AI could mean:
Personalized shopping: AI models can find the right product at the right price with the right characteristics for the user's requirements.
Example: Instead of me shopping and comparing for hours, AI does it for me and makes a recommendation on the final product to purchase.
In Conclusion
Adding AI to your product to make it better means using the proven power of AI models:
- To better answer customer requests with insights
- To automate repetitive, time-consuming tasks
- To make predictions that were hard before
- To gain insights into vast bodies of knowledge
Start small. Focus on one specific business problem you want to solve, and build from there.