SR2

Context engineering for AI agents

Agents get worse the longer conversations go. SR2 manages the full lifecycle of your LLM's context window — so your agent stays sharp on turn 30.

74%
Token Reduction
139μs
Compile Overhead
100%
KV-Cache Hit Rate
688
Tests Passing
$ pip install sr2

Context windows fill up. Agents break down.

Every turn adds tokens. System prompts, tool results, conversation history — it all compounds. By turn 20, your agent is spending 10x what it should and losing coherence as critical context gets pushed out.

Most agent stacks leave this unmanaged. SR2 handles it — automatically, through config, not code.

Token usage over 30 turns: Naive 847K tokens vs. SR2 220K tokens (74% reduction)

Three-layer pipeline

SR2 compiles your context through three composable layers before every LLM call. Each layer is config-driven and independently tunable.

Layer 1

Core

System prompts, tool definitions, and static context. Optimized for KV-cache prefix reuse across turns.

Layer 2

Memory

Persistent facts, summaries, and retrieved knowledge. Survives compaction and carries forward across sessions.

Layer 3

Conversation

Three-zone message management with active, buffer, and archive zones. Compacts and summarizes automatically.
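In code, the compile order can be sketched roughly like this (a simplified illustration, not SR2's internals; the `ThreeLayerContext` class and its fields are assumptions for the sketch):

```python
from dataclasses import dataclass, field

@dataclass
class ThreeLayerContext:
    """Simplified three-layer context: stable layers first, volatile last."""
    core: list[dict] = field(default_factory=list)          # system prompt, tools
    memory: list[dict] = field(default_factory=list)        # persistent facts
    conversation: list[dict] = field(default_factory=list)  # recent messages

    def compile(self) -> list[dict]:
        # Core and memory come first and change rarely, so the token
        # prefix stays identical across turns -- the property that lets
        # a provider's KV-cache serve repeated prefixes.
        return self.core + self.memory + self.conversation

ctx = ThreeLayerContext(
    core=[{"role": "system", "content": "You are a data analyst."}],
    memory=[{"role": "system", "content": "Fact: user prefers CSV output."}],
)
ctx.conversation.append({"role": "user", "content": "Analyze this dataset..."})
messages = ctx.compile()
```

Because only the conversation layer changes turn to turn, everything before it is a byte-identical prefix on every call.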

Everything your agent's context needs

Config-driven context management that works with any LLM provider.

Three-Zone Conversations

Active, buffer, and archive zones with configurable boundaries. Messages flow through zones as conversation grows, with automatic compaction at each boundary.
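As a rough sketch, zone routing by recency might look like the following (illustrative only; the function name and boundary semantics are assumptions, with zone sizes mirroring the `active_turns` and `buffer_turns` config keys):

```python
def route_zones(messages, active_turns=10, buffer_turns=5):
    """Split messages into active / buffer / archive zones by recency.

    The newest messages stay active (sent verbatim), older ones wait in
    a buffer, and the oldest land in the archive, where they become
    candidates for compaction and summarization.
    """
    active = messages[-active_turns:]
    older = messages[:-active_turns]
    buffer = older[-buffer_turns:]
    archive = older[:-buffer_turns]
    return active, buffer, archive

msgs = [f"turn-{i}" for i in range(20)]
active, buffer, archive = route_zones(msgs)
```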

KV-Cache Optimization

100% prefix hit rate through stable context ordering. Core and memory layers maintain consistent prefixes, so your provider's KV-cache stays warm across turns.

Compaction & Summarization

Automatic message compaction preserves key information while reducing token count. LLM-powered summarization captures what matters from archived messages.

Graceful Degradation

When context pressure turns critical, SR2 progressively drops lower-priority content while preserving essential context. Your agent stays functional, never crashes.
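A minimal sketch of that policy, assuming each piece of context carries a priority and a token estimate (the names and the greedy keep-by-priority strategy are illustrative, not SR2's actual algorithm):

```python
def degrade(items, token_budget):
    """Keep the highest-priority items that fit inside the budget.

    items: (priority, token_count, content) tuples; higher priority
    survives longer as the budget shrinks.
    """
    kept, total = [], 0
    for priority, tokens, content in sorted(items, key=lambda it: -it[0]):
        if total + tokens <= token_budget:
            kept.append(content)
            total += tokens
    return kept

items = [
    (3, 50, "system prompt"),        # essential, dropped last
    (2, 80, "conversation summary"),
    (1, 100, "stale tool output"),   # first to go under pressure
]
survivors = degrade(items, token_budget=140)
```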


Memory System

Extract and persist facts across conversation turns and sessions. Semantic retrieval surfaces relevant memories when they matter most.

Tool State Machine

Track tool call lifecycles and manage their context impact. Completed tool results can be compacted while preserving their outcomes.
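One way to picture such a lifecycle is a small state machine (the states and transitions below are assumptions for illustration, not SR2's actual types):

```python
from enum import Enum, auto

class ToolCall(Enum):
    PENDING = auto()    # call emitted, awaiting execution
    RUNNING = auto()    # tool executing
    COMPLETED = auto()  # full result present in context
    COMPACTED = auto()  # result replaced by its outcome summary

# Allowed transitions: a result is only compacted after it completes.
TRANSITIONS = {
    ToolCall.PENDING: {ToolCall.RUNNING},
    ToolCall.RUNNING: {ToolCall.COMPLETED},
    ToolCall.COMPLETED: {ToolCall.COMPACTED},
    ToolCall.COMPACTED: set(),
}

def advance(state: ToolCall, nxt: ToolCall) -> ToolCall:
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state.name} -> {nxt.name}")
    return nxt
```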


Prometheus Metrics

Built-in observability with token usage, cache hit rates, compaction events, and context pressure gauges. Know exactly what's happening inside your context window.


Config-Driven

Define your context strategy in YAML or dict. No subclassing, no framework lock-in. Change behavior by changing config, not code.
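A YAML version of a context strategy might look like this (the keys mirror the dict config in the quickstart below; the exact YAML shape is an assumption, not a documented schema):

```yaml
model: claude-sonnet-4-20250514
context_window: 200000
conversation:
  active_turns: 10       # newest turns kept verbatim
  buffer_turns: 5        # awaiting compaction
  compaction: summarize  # LLM-powered summaries for archived turns
```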


Apache 2.0

Fully open source. Use it in production, fork it, extend it. No vendor lock-in, no usage fees, no strings attached.

Measured, not promised

Real numbers from 688 tests across 71 files.

74%
Token reduction
Over 30 conversation turns
139μs
Avg compile overhead
0.028% of a 500ms LLM call
100%
KV-cache prefix hits
Stable ordering across turns

Token Growth: Naive vs SR2-Managed

Claude Sonnet: $0.29 saved per session
GPT-4o: $0.24 saved per session

Up and running in minutes

Install SR2 and start managing your agent's context window.

main.py
from sr2 import ContextEngine

# Define your context strategy in config
config = {
    "model": "claude-sonnet-4-20250514",
    "context_window": 200_000,
    "conversation": {
        "active_turns": 10,
        "buffer_turns": 5,
        "compaction": "summarize",
    },
}

# Create the engine
engine = ContextEngine(config)

# Add messages as your agent runs
engine.add_message("user", "Analyze this dataset...")

# Compile context for the LLM call
context = engine.compile()
# → Optimized, compacted, cache-friendly messages