Most agent frameworks assume nothing ever crashes mid-task.
Kestrion is an open-source Python framework for AI agents built around one idea: state is never mutated directly — it's derived by replaying an immutable log of events. That single decision is what makes crash recovery, human-approval gates, and full execution traces the default behavior, not something you build yourself on top of someone else's framework.
Prototyping an agent is easy. Running it in production is where things break.
Several frameworks can get a working agent loop running in an afternoon. The gap shows up after that — when the agent has to survive contact with a real environment.
Everything that happens is an event. State is just a replay.
Every LLM call, tool call, and transition is recorded as an immutable event with its own timestamp and cost. The agent's current state is never mutated in place — it's derived by folding that event log. This is the single design decision underneath every other claim Kestrion makes about durability.
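As a rough illustration of "state is a fold over the log", the sketch below uses only the Python standard library. It is not Kestrion's API: the `Event` and `State` classes, their field names, and the `apply` and `replay` functions are all hypothetical, chosen only to show the pattern of deriving state from immutable events.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any


@dataclass(frozen=True)
class Event:
    """One immutable log entry (illustrative fields, not Kestrion's schema)."""
    kind: str                 # "llm_call", "tool_call", or "transition"
    payload: dict[str, Any]
    timestamp: float
    cost_usd: float = 0.0


@dataclass(frozen=True)
class State:
    """Derived view of a run; a new instance is built for every event."""
    node: str = "start"
    tool_results: tuple[str, ...] = ()
    total_cost_usd: float = 0.0


def apply(state: State, event: Event) -> State:
    """Return a new State reflecting one event; the input is left untouched."""
    cost = state.total_cost_usd + event.cost_usd
    if event.kind == "transition":
        return State(event.payload["to"], state.tool_results, cost)
    if event.kind == "tool_call":
        return State(state.node, state.tool_results + (event.payload["result"],), cost)
    # llm_call (and any unknown kind) only contributes its cost here
    return State(state.node, state.tool_results, cost)


def replay(events: list[Event]) -> State:
    """Fold the whole log, starting from the empty state."""
    return reduce(apply, events, State())


log = [
    Event("llm_call", {"prompt": "plan"}, timestamp=1.0, cost_usd=0.25),
    Event("transition", {"to": "act"}, timestamp=2.0),
    Event("tool_call", {"result": "ok"}, timestamp=3.0, cost_usd=0.5),
]
current = replay(log)
```

Because `replay` is a pure function of the log, replaying a prefix such as `log[:1]` reconstructs the state as it was at that point, which is what makes traces and recovery fall out of the same mechanism.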
Most frameworks are strong at authoring. Few make production behavior the default.
This is the honest version, including where Kestrion is not uniquely positioned. AWS Strands and Bedrock AgentCore make a similar production-grade pitch, backed by far more engineering capacity. The bet here is an independent, MCP-native alternative for teams that don't want cloud lock-in — not an unchallenged advantage.
| | LangGraph | CrewAI | AWS Strands | Kestrion |
|---|---|---|---|---|
| State model | Mutable, opt-in checkpoints | Mostly in-memory | Event-driven hooks | Event-sourced by default |
| Crash recovery | Possible, needs setup | Not native | Tracing, not full replay | Default, independently verified |
| Approval gates | Manual graph wiring | No native primitive | Hooks can cancel a call | Enforced by the engine itself |
| Storage backend | In-memory or custom | In-memory | Bedrock-default | Pluggable: SQLite to Postgres |
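To make the crash-recovery, approval-gate, and storage rows more concrete, here is a generic sketch of the underlying pattern using only `sqlite3` from the standard library. It does not use or reproduce Kestrion's engine: the table layout, the event names (`approval_requested`, `approval_granted`, `tool_executed`), and the helper functions are assumptions made for illustration. Only the tool name `apply_manifest` is borrowed from the resume example in this document.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical append-only schema; not Kestrion's actual storage format.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS events ("
    "seq INTEGER PRIMARY KEY AUTOINCREMENT, run_id TEXT, kind TEXT, payload TEXT)"
)


def open_log(path):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def append(conn, run_id, kind, payload):
    """Append-only write; rows are never updated or deleted."""
    with conn:  # commits the transaction on success
        conn.execute(
            "INSERT INTO events (run_id, kind, payload) VALUES (?, ?, ?)",
            (run_id, kind, json.dumps(payload)),
        )


def replay(conn, run_id):
    """Rebuild run state purely from the stored events."""
    state = {"status": "running", "pending_tool": None, "approved": set(), "done": []}
    rows = conn.execute(
        "SELECT kind, payload FROM events WHERE run_id = ? ORDER BY seq", (run_id,)
    )
    for kind, raw in rows:
        payload = json.loads(raw)
        if kind == "approval_requested":
            state["status"], state["pending_tool"] = "paused", payload["tool"]
        elif kind == "approval_granted":
            state["approved"].add(payload["tool"])
            state["status"], state["pending_tool"] = "running", None
        elif kind == "tool_executed":
            state["done"].append(payload["tool"])
    return state


def call_tool(conn, run_id, tool, needs_approval):
    """The gate lives in the engine path: an unapproved tool can only pause the run."""
    state = replay(conn, run_id)
    if needs_approval and tool not in state["approved"]:
        append(conn, run_id, "approval_requested", {"tool": tool})
        return "paused"
    append(conn, run_id, "tool_executed", {"tool": tool})
    return "executed"


# "Process" 1 runs until the gate, then goes away (simulated by closing the connection).
db_path = os.path.join(tempfile.mkdtemp(), "events.db")
first = open_log(db_path)
first_result = call_tool(first, "run-1", "apply_manifest", needs_approval=True)
first.close()

# "Process" 2 knows only the file path and the run id.
second = open_log(db_path)
recovered = replay(second, "run-1")
append(second, "run-1", "approval_granted", {"tool": "apply_manifest"})
second_result = call_tool(second, "run-1", "apply_manifest", needs_approval=True)
final = replay(second, "run-1")
second.close()
```

The second connection shares no memory with the first, yet it recovers the paused run and the pending tool from the log alone. Swapping SQLite for Postgres in a design like this would change `open_log` and the SQL dialect but not the replay logic.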
What's actually built, and what's still a plan.
Kestrion is pre-alpha at 0.1.0, published on PyPI. The claims above are backed by working code — 35 passing tests — where marked built below; everything else is designed but not yet implemented.
The actual code you'd write.
Every line below maps to code that exists and is covered by the test suite — not aspirational API design.
Reference — install, resume, and develop locally.
Install
Each LLM provider is an optional extra. Install only what you use.

    pip install "kestrion[anthropic]"
    pip install "kestrion[openai]"
    pip install "kestrion[ollama]"
    pip install "kestrion[all]"   # every provider

Resuming a paused run
Resuming works from a completely independent process. That is the actual crash-recovery guarantee, not just a convenience method.

    state.scratch["_approved_tools"] = {"apply_manifest": True}
    # persist as a checkpoint, then:
    result = await agent.resume(run_id)

Local development
Clone, install in editable mode with dev extras, run the suite.

    git clone https://github.com/VinayakDubey07/kestrion.git
    cd kestrion
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -e ".[dev]"
    pytest tests/ -v

35 tests, all passing.
Examples
examples/kubectl_agent: the original worked example. It demonstrates pause-on-approval and resume-after-restart using the raw Engine/Node primitives directly, which makes it useful for understanding what Agent builds on top of.