PRE-ALPHA — v0.1.0 PUBLISHED ON PYPI

Most agent frameworks assume nothing ever crashes mid-task.

Kestrion is an open-source Python framework for AI agents built around one idea: state is never mutated directly — it's derived by replaying an immutable log of events. That single decision is what makes crash recovery, human-approval gates, and full execution traces the default behavior, not something you build yourself on top of someone else's framework.

kubectl_agent.py — actual run, unedited

# An agent proposes scaling up a deployment, then needs approval
# before it's allowed to touch the cluster.
$ python3 examples/kubectl_agent.py
status=waiting_on_human current_node=apply_change
pending approval: {'tool': 'apply_manifest', 'kwargs': {'yaml': 'replicas: 3'}}

# The run is now parked. No thread is blocked, nothing is held
# in memory. A second, independent process resumes it later;
# this is the actual crash-recovery test, not a simulation.
$ engine.resume(run_id) # from a fresh process
status=completed
apply_result={'applied': True, 'yaml': 'replicas: 3'}
total events for this run: 13

The problem

Prototyping an agent is easy. Running it in production is where things break.

Several frameworks can get a working agent loop running in an afternoon. The gap shows up after that — when the agent has to survive contact with a real environment.

The process restarts mid-run.

A pod gets rescheduled, a deploy rolls out, a laptop sleeps. What happens to the agent that was three tool calls into a task when that happens?

A tool call needs a human first.

Before an agent runs a database migration or applies a Kubernetes manifest, someone needs to approve it — and that approval might arrive five seconds or five hours later, from an entirely different process.

You need to know what happened.

Not just "the agent finished" — which tool calls were made, in what order, what each step cost, and why the agent took the path it took.

You need more than one agent.

Ten thousand concurrent sessions, not one notebook cell — shared rate limits, no thread blocked per run, and a store that doesn't fall over under load.

How it works

Everything that happens is an event. State is just a replay.

Every LLM call, tool call, and transition is recorded as an immutable event with its own timestamp and cost. The agent's current state is never mutated in place — it's derived by folding that event log. This is the single design decision underneath every other claim Kestrion makes about durability.
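
To make "state is just a replay" concrete, here is a minimal sketch of folding an event log into state. It is not Kestrion's actual data model: the Event and RunState classes, the event kinds, and the cost_usd field are assumptions chosen for illustration.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Any, Optional


@dataclass(frozen=True)
class Event:
    """One immutable log entry (hypothetical shape)."""
    kind: str
    payload: dict
    cost_usd: float = 0.0


@dataclass(frozen=True)
class RunState:
    """Derived state; frozen so it cannot be mutated in place."""
    status: str = "running"
    current_node: Optional[str] = None
    total_cost_usd: float = 0.0
    tool_results: tuple = ()


def apply_event(state: RunState, event: Event) -> RunState:
    """Return a new state for one event; the old state is left untouched."""
    state = replace(state, total_cost_usd=state.total_cost_usd + event.cost_usd)
    if event.kind == "node_entered":
        return replace(state, current_node=event.payload["node"])
    if event.kind == "tool_called":
        result: Any = event.payload["result"]
        return replace(state, tool_results=state.tool_results + (result,))
    if event.kind == "approval_requested":
        return replace(state, status="waiting_on_human")
    if event.kind == "run_completed":
        return replace(state, status="completed")
    return state  # unknown event kinds are ignored in this sketch


def replay(events: list) -> RunState:
    """Fold the whole log, starting from an empty state."""
    return reduce(apply_event, events, RunState())


log = [
    Event("node_entered", {"node": "read_state"}),
    Event("tool_called", {"result": {"replicas": 2}}, cost_usd=0.25),
    Event("node_entered", {"node": "apply_change"}),
    Event("approval_requested", {"tool": "apply_manifest"}),
]
state = replay(log)
print(state.status, state.current_node, state.total_cost_usd)
```

Because the state is only ever a function of the log, a fresh process that can read the same events arrives at the same state, which is the property crash recovery relies on.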

PROVEN

Execution engine

Drives the node graph, checkpoints on every transition, enforces approval gates centrally

PROVEN

Event log + checkpoint store

SQLite-backed today, behind a protocol designed so that a Postgres store can replace it without engine changes

PROVEN

Agent / @tool decorators

Declarative API on top of the raw engine primitives — signature introspection generates JSON schemas automatically

PROVEN

LLM providers

Anthropic, OpenAI, and Ollama behind one protocol — swap providers without touching agent code

PLANNED

MCP integration

Connect to external MCP servers as tools, or expose a Kestrion agent as one

PLANNED

Scheduler

Rate-limited, concurrent execution across many agent runs sharing one provider quota

PLANNED

CLI + deploy

kestrion deploy --target k8s generates the manifest to run agents as stateless pods
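
The "store behind a protocol" idea from the component list above can be sketched as follows. This is an illustration under stated assumptions, not Kestrion's real interface: CheckpointStore, InMemoryStore, SQLiteStore, advance, and the checkpoints table are hypothetical names.

```python
import json
import sqlite3
from typing import Optional, Protocol


class CheckpointStore(Protocol):
    """Hypothetical storage contract the engine would depend on."""
    def save(self, run_id: str, state: dict) -> None: ...
    def load(self, run_id: str) -> Optional[dict]: ...


class InMemoryStore:
    def __init__(self) -> None:
        self._data = {}

    def save(self, run_id: str, state: dict) -> None:
        self._data[run_id] = json.dumps(state)

    def load(self, run_id: str) -> Optional[dict]:
        raw = self._data.get(run_id)
        return json.loads(raw) if raw is not None else None


class SQLiteStore:
    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(run_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )

    def save(self, run_id: str, state: dict) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO checkpoints (run_id, state) VALUES (?, ?)",
                (run_id, json.dumps(state)),
            )

    def load(self, run_id: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


def advance(store: CheckpointStore, run_id: str, next_node: str) -> dict:
    """Engine-side code sees only the protocol, never a concrete backend."""
    state = store.load(run_id) or {"visited": []}
    state["visited"].append(next_node)
    state["current_node"] = next_node
    store.save(run_id, state)  # checkpoint on every transition
    return state


for store in (InMemoryStore(), SQLiteStore(":memory:")):
    advance(store, "run-1", "read_state")
    print(type(store).__name__, advance(store, "run-1", "apply_change"))
```

Since advance only calls save and load, a Postgres-backed class with the same two methods could be passed in without changing it. That is the sense in which a backend swap would need no engine changes.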

Comparison

Most frameworks are strong at authoring. Few make production behavior the default.

This is the honest version, including where Kestrion is not uniquely positioned. AWS Strands and Bedrock AgentCore make a similar production-grade pitch, backed by far more engineering capacity. The bet here is an independent, MCP-native alternative for teams that don't want cloud lock-in — not an unchallenged advantage.

	LangGraph	CrewAI	AWS Strands	Kestrion
State model	Mutable, opt-in checkpoints	Mostly in-memory	Event-driven hooks	Event-sourced by default
Crash recovery	Possible, needs setup	Not native	Tracing, not full replay	Default, verified from a separate process
Approval gates	Manual graph wiring	No native primitive	Hooks can cancel a call	Enforced by the engine itself
Storage backend	In-memory or custom	In-memory	Bedrock-default	Pluggable protocol: SQLite today, Postgres planned

Current status

What's actually built, and what's still a plan.

Kestrion is pre-alpha at 0.1.0, published on PyPI. Where an item is marked BUILT below, the claim is backed by working code and covered by the 35 passing tests. Everything else is designed but not yet implemented.

BUILT

Core execution engine — state types, the run loop, checkpointing, approval gating.

BUILT

SQLite checkpoint store — behind a protocol designed for Postgres to replace it without touching the engine.

BUILT

Agent / @tool decorator API — function signatures auto-convert to JSON schemas; Agent(provider=..., tools=[...]) wraps the engine in an LLM tool-calling loop.

BUILT

Three LLM providers — Anthropic, OpenAI, and Ollama, each an optional install extra behind one shared protocol.

BUILT

Worked example — a Kubernetes scaling agent demonstrating pause-on-approval and resume-after-restart.

PLANNED

MCP client/server, scheduler, CLI, Postgres store, docs site.

# published on PyPI; each provider is an optional extra
pip install "kestrion[anthropic]" # or [openai], [ollama], or [all]
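
The signature-to-schema conversion mentioned for the @tool decorator can be pictured with the standard library alone. The sketch below is an assumption about the general technique, not Kestrion's implementation: schema_from_signature, the type mapping, the output shape, and the extra dry_run parameter are all illustrative.

```python
import inspect
from typing import get_type_hints

# Simplified mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {
    str: "string",
    int: "integer",
    float: "number",
    bool: "boolean",
    dict: "object",
    list: "array",
}


def schema_from_signature(fn) -> dict:
    """Build a tool description from a function's signature and docstring."""
    hints = get_type_hints(fn)
    properties = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def apply_manifest(yaml: str, dry_run: bool = False) -> dict:
    """kubectl apply a manifest against the cluster."""
    return {"applied": not dry_run}


print(schema_from_signature(apply_manifest))
```

Parameters without a default end up in required, so here yaml is required and dry_run is optional. A real implementation would also need to handle Optional, nested types, and enums.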

Quickstart

The actual code you'd write.

Every line below maps to code that exists and is covered by the test suite — not aspirational API design.

quickstart.py

import asyncio

from kestrion.agent.agent import Agent
from kestrion.agent.decorators import tool
from kestrion.llm.anthropic_provider import AnthropicProvider


@tool
def get_cluster_state() -> dict:
    """Read current deployment replica counts."""
    return {'deployment': 'checkout-api', 'replicas': 2}


@tool(requires_approval=True)
def apply_manifest(yaml: str) -> dict:
    """kubectl apply a manifest against the cluster."""
    return {'applied': True}


agent = Agent(
    provider=AnthropicProvider(model="claude-sonnet-4-6"),
    tools=[get_cluster_state, apply_manifest],
    store="sqlite:///agent_runs.db",
)


async def main():
    # agent.run is awaited, so it has to be called from inside a coroutine
    result = await agent.run("Scale up checkout-api if it's under 3 replicas")
    return result


asyncio.run(main())

status=waiting_on_human # paused before the mutating call

What you can build today

Tool-calling agents where some actions are safe to auto-run and others need a human gate first.

Agents that survive a crash mid-task: agent.resume(run_id) works from a completely independent process.

Long-running approval workflows: start a run, let it sit paused for hours, then approve and resume it from anywhere with access to the same store.

Multi-turn tool use, where the agent keeps calling tools and reasoning over results until it has a final answer.

Known gaps, honestly

No MCP integration yet: every tool today is a hand-written Python function via @tool.

No concurrency control for running many agents at once.

No CLI or deploy story.

Agent.approve() is still a stub: approving a paused run means manually setting scratch state and saving a checkpoint by hand.

SQLite only; Postgres support is designed but not implemented.
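
Until a scheduler exists, one common way to cap how many runs hit a provider at once is an asyncio semaphore. The sketch below shows that general technique only. It does not use any Kestrion API: fake_llm_call stands in for a provider call and MAX_IN_FLIGHT is a hypothetical limit.

```python
import asyncio

MAX_IN_FLIGHT = 3  # hypothetical provider concurrency limit


async def fake_llm_call(run_id: int) -> str:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return f"run-{run_id}: done"


async def run_all(n_runs: int):
    quota = asyncio.Semaphore(MAX_IN_FLIGHT)
    in_flight = 0
    peak = 0

    async def one_run(run_id: int) -> str:
        nonlocal in_flight, peak
        async with quota:  # waiting here suspends the task, not a thread
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_llm_call(run_id)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(one_run(i) for i in range(n_runs)))
    return results, peak


results, peak = asyncio.run(run_all(10))
print(len(results), "runs finished; peak concurrency:", peak)
```

This caps concurrency, not requests per second. A real shared rate limit would also need a token bucket or similar, and would have to be coordinated across processes.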

Docs

Reference — install, resume, and develop locally.

kestrion 0.1.0 — pypi.org/project/kestrion

Install

Each LLM provider is an optional extra. Install only what you use.

pip install "kestrion[anthropic]"
pip install "kestrion[openai]"
pip install "kestrion[ollama]"
pip install "kestrion[all]" # every provider

Resuming a paused run

Works from a completely independent process — the actual crash-recovery guarantee, not just a convenience method.

state.scratch["_approved_tools"] = {"apply_manifest": True}
# persist as a checkpoint, then:
result = await agent.resume(run_id)
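
To show why this works across processes, here is a toy model of a central approval gate plus resume. Only the _approved_tools scratch key comes from the snippet above. TOOLS, call_tool, resume, and the JSON checkpoint format are assumptions for illustration, not Kestrion's internals.

```python
import json

TOOLS = {
    # name: (callable, requires_approval)
    "get_cluster_state": (lambda: {"replicas": 2}, False),
    "apply_manifest": (lambda yaml: {"applied": True, "yaml": yaml}, True),
}


def call_tool(state: dict, name: str, **kwargs) -> dict:
    """Single choke point: every tool call passes through the gate."""
    fn, requires_approval = TOOLS[name]
    approved = state["scratch"].get("_approved_tools", {}).get(name, False)
    if requires_approval and not approved:
        # Park the run: record what is pending instead of executing it.
        state["status"] = "waiting_on_human"
        state["pending"] = {"tool": name, "kwargs": kwargs}
        return state
    state["result"] = fn(**kwargs)
    state["status"] = "completed"
    state.pop("pending", None)
    return state


def resume(checkpoint: str) -> dict:
    """Rebuild state from a serialized checkpoint and retry the pending call."""
    state = json.loads(checkpoint)
    pending = state["pending"]
    return call_tool(state, pending["tool"], **pending["kwargs"])


# First "process": the run parks at the gate and is checkpointed.
state = call_tool({"scratch": {}, "status": "running"}, "apply_manifest", yaml="replicas: 3")
checkpoint = json.dumps(state)
print(state["status"])

# Approval is recorded in scratch state, then persisted again.
state = json.loads(checkpoint)
state["scratch"]["_approved_tools"] = {"apply_manifest": True}
checkpoint = json.dumps(state)

# Second "process": only the checkpoint string crosses the boundary.
final = resume(checkpoint)
print(final["status"], final["result"])
```

Nothing in resume depends on objects from the first half of the script, only on the serialized checkpoint. Resuming without the approval flag simply parks the run again.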

Local development

Clone, install in editable mode with dev extras, run the suite.

git clone https://github.com/VinayakDubey07/kestrion.git
cd kestrion
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v

35 tests, all passing.

Examples

examples/kubectl_agent is the original worked example. It demonstrates pause-on-approval and resume-after-restart using the raw Engine/Node primitives directly, which makes it useful for understanding what Agent builds on.

Links

GitHub repository · PyPI package · Full README · Issues

Licensed under Apache 2.0.