PRE-ALPHA — v0.1.0 PUBLISHED ON PYPI

Most agent frameworks assume nothing ever crashes mid-task.

Kestrion is an open-source Python framework for AI agents built around one idea: state is never mutated directly — it's derived by replaying an immutable log of events. That single decision is what makes crash recovery, human-approval gates, and full execution traces the default behavior, not something you build yourself on top of someone else's framework.

kubectl_agent.py — actual run, unedited
# An agent proposes scaling up a deployment, then needs approval
# before it's allowed to touch the cluster.
$ python3 examples/kubectl_agent.py
status=waiting_on_human current_node=apply_change
pending approval: {'tool': 'apply_manifest', 'kwargs': {'yaml': 'replicas: 3'}}
# The run is now parked. No thread is blocked, nothing is held
# in memory. A second, independent process resumes it later —
# this is the actual crash-recovery test, not a simulation.
$ engine.resume(run_id) # from a fresh process
status=completed
apply_result={'applied': True, 'yaml': 'replicas: 3'}
total events for this run: 13

Prototyping an agent is easy. Running it in production is where things break.

Several frameworks can get a working agent loop running in an afternoon. The gap shows up after that — when the agent has to survive contact with a real environment.

The process restarts mid-run.
A pod gets rescheduled, a deploy rolls out, a laptop sleeps. What happens to the agent that was three tool calls into a task when that happens?
A tool call needs a human first.
Before an agent runs a database migration or applies a Kubernetes manifest, someone needs to approve it — and that approval might arrive five seconds or five hours later, from an entirely different process.
You need to know what happened.
Not just "the agent finished" — which tool calls were made, in what order, what each step cost, and why the agent took the path it took.
You need more than one agent.
Ten thousand concurrent sessions, not one notebook cell — shared rate limits, no thread blocked per run, and a store that doesn't fall over under load.

Everything that happens is an event. State is just a replay.

Every LLM call, tool call, and transition is recorded as an immutable event with its own timestamp and cost. The agent's current state is never mutated in place — it's derived by folding that event log. This is the single design decision underneath every other claim Kestrion makes about durability.

PROVEN
Execution engine
Drives the node graph, checkpoints on every transition, enforces approval gates centrally
PROVEN
Event log + checkpoint store
SQLite-backed today, behind a protocol that swaps to Postgres without engine changes
PROVEN
Agent / @tool decorators
Declarative API on top of the raw engine primitives — signature introspection generates JSON schemas automatically
PROVEN
LLM providers
Anthropic, OpenAI, and Ollama behind one protocol — swap providers without touching agent code
PLANNED
MCP integration
Connect to external MCP servers as tools, or expose a Kestrion agent as one
PLANNED
Scheduler
Rate-limited, concurrent execution across many agent runs sharing one provider quota
PLANNED
CLI + deploy
kestrion deploy --target k8s generates the manifest to run agents as stateless pods

Most frameworks are strong at authoring. Few make production behavior the default.

This is the honest version, including where Kestrion is not uniquely positioned. AWS Strands and Bedrock AgentCore make a similar production-grade pitch, backed by far more engineering capacity. The bet here is an independent, MCP-native alternative for teams that don't want cloud lock-in — not an unchallenged advantage.

LangGraph CrewAI AWS Strands Kestrion
State model Mutable, opt-in checkpoints Mostly in-memory Event-driven hooks Event-sourced by default
Crash recovery Possible, needs setup Not native Tracing, not full replay Default, independently verified
Approval gates Manual graph wiring No native primitive Hooks can cancel a call Enforced by the engine itself
Storage backend In-memory or custom In-memory Bedrock-default Pluggable: SQLite to Postgres

What's actually built, and what's still a plan.

Kestrion is pre-alpha at 0.1.0, published on PyPI. The claims above are backed by working code — 35 passing tests — where marked built below; everything else is designed but not yet implemented.

BUILT
Core execution engine — state types, the run loop, checkpointing, approval gating.
BUILT
SQLite checkpoint store — behind a protocol designed for Postgres to replace it without touching the engine.
BUILT
Agent / @tool decorator API — function signatures auto-convert to JSON schemas; Agent(provider=..., tools=[...]) wraps the engine in an LLM tool-calling loop.
BUILT
Three LLM providers — Anthropic, OpenAI, and Ollama, each an optional install extra behind one shared protocol.
BUILT
Worked example — a Kubernetes scaling agent demonstrating pause-on-approval and resume-after-restart.
PLANNED
MCP client/server, scheduler, CLI, Postgres store, docs site.
# published on PyPI — each provider is an optional extra
pip install kestrion[anthropic] # or [openai], [ollama], or [all]

The actual code you'd write.

Every line below maps to code that exists and is covered by the test suite — not aspirational API design.

quickstart.py
from kestrion.agent.agent import Agent
from kestrion.agent.decorators import tool
from kestrion.llm.anthropic_provider import AnthropicProvider
@tool
def get_cluster_state() -> dict:
    """Read current deployment replica counts."""
    return {'deployment': 'checkout-api', 'replicas': 2}
@tool(requires_approval=True)
def apply_manifest(yaml: str) -> dict:
    """kubectl apply a manifest against the cluster."""
    return {'applied': True}
agent = Agent(
    provider=AnthropicProvider(model="claude-sonnet-4-6"),
    tools=[get_cluster_state, apply_manifest],
    store="sqlite:///agent_runs.db",
)
result = await agent.run("Scale up checkout-api if it's under 3 replicas")
status=waiting_on_human # paused before the mutating call
What you can build today
Tool-calling agents where some actions are safe to auto-run and others need a human gate first. Agents that survive a crash mid-task — agent.resume(run_id) works from a completely independent process. Long-running approval workflows: start a run, let it sit paused for hours, approve and resume it from anywhere with access to the same store. Multi-turn tool use, where the agent keeps calling tools and reasoning over results until it has a final answer.
Known gaps, honestly
No MCP integration yet — every tool today is a hand-written Python function via @tool. No concurrency control for running many agents at once. No CLI or deploy story. Agent.approve() is still a stub — approving a paused run means manually setting scratch state and saving a checkpoint by hand. SQLite only; Postgres support is designed but not implemented.

Reference — install, resume, and develop locally.

kestrion 0.1.0 — pypi.org/project/kestrion

Install

Each LLM provider is an optional extra. Install only what you use.

pip install kestrion[anthropic] pip install kestrion[openai] pip install kestrion[ollama] pip install kestrion[all] # every provider

Resuming a paused run

Works from a completely independent process — the actual crash-recovery guarantee, not just a convenience method.

state.scratch["_approved_tools"] = { "apply_manifest": True } # persist as a checkpoint, then: result = await agent.resume(run_id)

Local development

Clone, install in editable mode with dev extras, run the suite.

git clone https://github.com/VinayakDubey07/kestrion.git cd kestrion python3 -m venv .venv source .venv/bin/activate pip install -e ".[dev]" pytest tests/ -v

35 tests, all passing.

Examples

examples/kubectl_agent — the original worked example, demonstrating pause-on-approval and resume-after-restart using the raw Engine/Node primitives directly, useful for understanding what Agent builds on top of.

Links

GitHub repository  ·  PyPI package  ·  Full README  ·  Issues

Licensed under Apache 2.0.