The agentic AI ecosystem has developed a terminology problem.
Depending on which framework, startup, or conference talk you're listening to, the future of AI agents is supposedly built around:
- Tools
- Skills
- MCP
- CLI environments
The real question isn't which abstraction to use. It's: where should orchestration live? That one question cuts through most of the noise.
Because once you understand that, tools, skills, MCP, and CLI all start to make sense.
To illustrate this, let's imagine we're building a fictional AI Operations Agent.
The agent can:
- investigate incidents
- analyze logs and metrics
- review deployments
- create Jira tickets
- optimize cloud costs
- assist with root cause analysis
As we'll see, the architecture evolves dramatically as the system grows.
The Fundamental Problem
At its core, an LLM cannot actually perform operations work.
It cannot:
- query Datadog
- inspect GitHub commits
- retrieve deployment history
- restart services
- create Jira tickets
All it can do is reason over information provided in its context.
Everything useful in agentic AI comes from connecting the model to external capabilities.
The question becomes:
How should those capabilities be exposed?
The Tool-Based Agent
Most agent systems start with tools.
A tool is simply a primitive capability exposed to the model.
For our operations agent, the tools might look like:
def query_logs(service_name: str):
...
def query_metrics(service_name: str):
...
def get_recent_deployments(service_name: str):
...
def create_jira_ticket(summary: str):
...
The architecture is straightforward.
User
↓
LLM decides next action
↓
Agent Runtime
↓
Tool Execution
↓
Results returned to LLM
Suppose a user asks:
Why is checkout-service returning elevated 500 errors?
The interaction loop might look like this:
- The LLM determines logs are required.
- The runtime executes
query_logs(). - Results are returned.
- The LLM determines deployment history is also needed.
- The runtime executes
get_recent_deployments(). - Results are returned.
- The LLM generates a root cause hypothesis.
This works surprisingly well.
Initially.
The Hidden Scaling Problem
The challenge appears when the system grows.
At first the agent has only a handful of tools.
query_logs()
query_metrics()
query_deployments()
create_ticket()
Six months later:
query_logs()
query_metrics()
query_traces()
query_alerts()
query_runbooks()
restart_service()
rollback_deployment()
create_ticket()
query_cost_data()
analyze_utilization()
deploy_hotfix()
...
Now something subtle has happened.
The LLM is no longer merely selecting tools.
It is orchestrating an operational workflow.
It must decide:
- which tools to invoke
- in what order
- when enough information has been gathered
- how to recover from failures
- what actions should be taken
In other words:
Tool-based systems push orchestration into the model.
This turns out to be both powerful and dangerous.
The Real Problem Is Not Tools
Many people look at this situation and conclude: We need better tools.
But tools are not actually the problem.
The problem has three dimensions. First, reliability: too much execution logic is living inside a probabilistic system that may sequence it differently on every run. Second, cost: every invocation pays LLM inference tokens to rediscover a workflow that never changes. Third, latency: each tool-selection step is a round-trip to the model — a waterfall of LLM calls where a single deterministic function would do.
Consider an incident investigation. The sequence:
query logs → query metrics → retrieve deployment history → retrieve recent alerts
is often deterministic. There is little value in forcing the LLM to rediscover that workflow every time.
is completely deterministic. You already know exactly what to fetch. Forcing the LLM to reason its way to that sequence on every invocation is not intelligence — it's waste. Unreliable, slow, and expensive waste.
The industry is slowly realizing that not all orchestration belongs inside the model. This realization leads directly to skills.
The Two Orchestration Layers
Every agent system, no matter how it's described, is really making decisions at two different layers:
Semantic orchestration — deciding what needs to happen. Which capability is relevant. What the user actually meant. This is inherently ambiguous, and it's the one job LLMs are genuinely good at.
Execution orchestration — deciding how it happens. Ordering, parallelism, retries, error handling, state. This is inherently deterministic, and it's the one job LLMs are bad at — slow, expensive, and inconsistent across runs.
Most of the confusion in this space comes from putting both layers in the same place. Tools, skills, MCP, and CLI environments are best understood as different ways of drawing the line between these two layers — not as competing abstractions. Keep this distinction in mind; every section below maps back to it.
Skills: Coarse-Grained Semantic Capabilities
More precisely: a skill is a subgraph with a semantic boundary. It encapsulates three things the LLM should not have to think about: execution order, parallelism, and retry policy. The planner sees a named capability. The skill owns how that capability runs.
For our operations platform, skills might include: IncidentAnalysisSkill, DeploymentReviewSkill, CapacityPlanningSkill, CostOptimizationSkill.
Notice the difference. Tools describe primitive operations. Skills describe business-level capabilities and the execution policy that implements them.
The Planner LLM Still Matters
At this point, some people assume skills eliminate the need for an LLM.
They don't.
Without an LLM, mapping human requests to skills becomes extremely difficult.
Consider:
Why did checkout-service start failing after lunch?
Is this:
- an incident investigation?
- a deployment review?
- a performance analysis?
The ambiguity lives in natural language.
The LLM remains extremely useful for:
- intent understanding
- semantic routing
- contextual interpretation
The architecture increasingly becomes:
User Request
↓
Planner LLM
↓
Skill Selection
↓
Skill Execution
↓
Results
↓
Planner LLM
↓
Response
Skills Move Execution Orchestration Out Of The Model
Skills are the clearest example of this split in action: semantic orchestration stays with the planner, execution orchestration moves into the skill. This is the architectural shift many modern systems are making.
For example:
class IncidentAnalysisSkill: async def execute(self, service_name): # Fetch concurrently — no reason to serialize logs, metrics, deployments, alerts = ( await asyncio.gather( query_logs(service_name), query_metrics(service_name), get_recent_deployments(service_name), get_recent_alerts(service_name) ) ) return { "logs": logs, "metrics": metrics, "deployments": deployments, "alerts": alerts }
Notice what happened. The planner LLM still decides: I need IncidentAnalysisSkill. But once selected, deterministic execution logic takes over — including decisions the LLM would never make well, like whether to fetch in parallel or how to handle a partial failure.
Note: this example is deliberately simplified. In production, skills typically need streaming partial results, per-tool timeouts, and circuit breakers. The point is that all of that complexity lives in the skill — not in the model.
Where MCP Fits
MCP doesn't touch this split at all — it just makes the capabilities on the execution side portable across runtimes. This is where many online discussions become confused.
MCP is not competing with skills.
MCP solves a completely different problem.
The ecosystem currently suffers from massive fragmentation.
Every framework invents:
- different tool schemas
- different invocation mechanisms
- different discovery protocols
MCP attempts to standardize that layer.
The easiest way to think about MCP is:
MCP is trying to become the interoperability layer for AI capabilities.
Closer to OpenAPI for agent ecosystems — a shared interface contract that defines how capabilities are described, discovered, and invoked, without caring what carries the bytes underneath.
What MCP Exposes
Most coverage of MCP focuses only on tools. But the protocol defines three primitives, and the distinction matters:
Tools — callable functions the model can invoke. This is what most people mean when they talk about MCP.
Resources — structured data the model can read: log files, database records, API responses. Resources are pull-based; the model requests them rather than the server pushing them.
Prompts — reusable prompt templates with parameters, surfaced by the server. This lets server authors encode domain knowledge — e.g., a Datadog MCP server can expose a prompt template for structured incident analysis, not just raw metric endpoints.
For our operations agent, a well-designed Datadog MCP server doesn't just expose query_metrics(). It might also expose a resource for the current service topology and a prompt template that structures how the model should reason about an anomaly. That's a qualitatively richer integration than a tool-only mental model suggests.
from mcp.server import Server
server = Server("datadog-server")
@server.tool()
async def query_service_metrics(service):
...
@server.resource("topology://services")
async def get_service_topology():
...
@server.prompt()
async def analyze_anomaly(service, window):
return f"Given metrics for {service} over {window}"
MCP Does Not Solve Orchestration
This is probably the biggest misconception surrounding MCP today.
MCP standardizes:
- discovery
- schemas
- invocation
It does not solve:
- planning
- retries
- workflow state
- execution policy
- skill design
- orchestration
You still need those layers.
MCP makes capabilities portable.
It does not tell you how to use them.
One practical implication: an MCP server built today for Claude can work tomorrow with any other MCP-compatible runtime. That portability is the actual value proposition — not the protocol mechanics themselves.
Why Coding Agents Love CLI Environments
Now let's discuss the most interesting trend in the ecosystem. CLI environments push the line further than either tools or skills: here, the model itself participates in execution orchestration by writing commands, which is exactly why sandboxing becomes non-negotiable.
Coding and operations agents are increasingly bypassing structured tools altogether.
Instead of exposing:
query_git_diff()
run_tests()
deploy_service()
they expose:
git
docker
kubectl
terraform
pytest
The shell becomes the capability layer.
The agent generates commands directly.
kubectl logs checkout-service
git diff
terraform plan
This is a radically different philosophy.
Instead of carefully defining capabilities, you expose a real execution environment.
Why CLI Agents Feel More Powerful
Unix environments already solved composability decades ago. Commands naturally chain together. The shell already provides: composition, state, interoperability, flexibility.
Instead of creating hundreds of tool schemas, the environment itself becomes the API. This is why modern coding agents often feel dramatically more capable than traditional tool-calling agents.
The Tradeoff
Of course, flexibility comes with risk. A structured tool:
restart_service(service_name) is easier to constrain than:
rm -rf /
This is not a theoretical concern. A CLI agent operating against a production Kubernetes cluster with broad permissions is one hallucinated flag away from an outage. The blast radius of a mistake is bounded only by what the execution environment allows — not by what the tool schema permits. With structured tools, you constrain at the interface. With CLI, you constrain at the environment.
This is why sandboxed execution environments have become infrastructure in their own right. Tools like E2B, Daytona, and Modal provide ephemeral, isolated runtimes specifically built for agent workloads: short-lived containers with scoped filesystem access, network egress controls, and resource limits. The architecture shifts from "give the agent a shell" to "give the agent a shell inside a box it can't break out of." This also changes the observability story — a sandboxed run produces a complete audit trail of every command executed, which structured tool-calling often doesn't.
The deeper implication for our orchestration framework: CLI agents are the one place where the semantic/execution boundary genuinely blurs. The model is no longer just selecting a capability — it is participating in execution orchestration by writing the commands themselves. That is both the source of their power and the reason the safety layer beneath them has to be more sophisticated than a permission flag.Where The Industry Appears To Be Heading
If you step back from the terminology wars, a pattern begins to emerge.
Modern agent systems increasingly look like this:
LLM
↓
Semantic Planning
↓
Skills
↓
Deterministic Workflows
↓ ↓
Tools MCP
↓ ↓
Local Systems External Systems
↓
CLI Runtime
Notice what happened.
The LLM is not "the system."
It has become one component inside a larger architecture.
Reasoning remains probabilistic.
Execution becomes deterministic.
Interoperability becomes protocol-driven.
Capabilities become reusable infrastructure.
Final Thoughts
Tools, Skills, MCP, and CLI environments are different answers to where the line falls between semantic and execution orchestration — not competing abstractions.
They exist at different layers of the stack.
Tools expose primitive capabilities.
Skills expose semantic capabilities.
MCP standardizes interoperability.
CLI environments expose raw execution contexts.
The real architectural question is not:
Which abstraction wins?
It is:
Where should orchestration live?
The most successful systems increasingly place semantic orchestration inside LLMs and execution orchestration inside deterministic software.
That may ultimately be the most important lesson the industry has learned over the past few years.
The future of agentic AI looks less like autonomous chatbots wandering through APIs and more like carefully engineered systems where probabilistic reasoning and deterministic execution work together



.jpg)
.jpg)