Academy Persistent State

Demonstrates how Academy agents can persist state and resume after restarts.

Why Persistent State?

Checkpoint and resume flow

Scientific workflows often run for hours, days, or weeks. Persistent state enables:

Resumability: Continue from where you left off after crashes or restarts
Checkpointing: Save progress after each major step
Auditability: Track what was done and when
Efficiency: Don’t repeat completed work

Running the Example

cd Capabilities/local-agents/AgentsPersistent
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Run workflow (will checkpoint to .workflow_state/)
python main.py

# To see resumption in action:
# 1. Start workflow, Ctrl+C to interrupt mid-way
# 2. Run again - it resumes from last checkpoint

# Start fresh (ignore existing checkpoints)
python main.py --reset

# Custom state directory
python main.py --state-dir ./my-experiment-state

Example Output

First run (complete):

============================================================
ACADEMY PERSISTENT STATE EXAMPLE
============================================================
State directory: .workflow_state
Reset: False
Task: discover catalysts for sustainable ammonia synthesis
------------------------------------------------------------
Starting fresh workflow
------------------------------------------------------------
17:00:00 [academy.persist] [Workflow] Starting task: discover...
17:00:00 [academy.persist] [Workflow] Starting fresh workflow
17:00:00 [academy.persist] [Workflow] Executing step 1/5: literature_review
17:00:01 [academy.persist] [Workflow] Step 1 complete, checkpoint saved
17:00:01 [academy.persist] [Workflow] Executing step 2/5: data_collection
...
17:00:03 [academy.persist] [Workflow] All steps completed
------------------------------------------------------------
WORKFLOW COMPLETED
------------------------------------------------------------
Status: completed
Steps completed: 5
============================================================

Interrupted run + resume:

# First run (interrupted with Ctrl+C after step 2)
$ python main.py
...
17:00:00 [academy.persist] [Workflow] Executing step 2/5: data_collection
^C
------------------------------------------------------------
WORKFLOW INTERRUPTED
------------------------------------------------------------
Progress saved at step 2/5
Run again to resume from this point
============================================================

# Second run (resumes from step 3)
$ python main.py
...
Resuming from step 2/5
------------------------------------------------------------
17:00:10 [academy.persist] [Workflow] Resuming from step 2
17:00:10 [academy.persist] [Workflow] Skipping completed step 1: literature_review
17:00:10 [academy.persist] [Workflow] Skipping completed step 2: data_collection
17:00:10 [academy.persist] [Workflow] Executing step 3/5: computation
...

Key Concepts

1. State Management Class

class AgentState:
    """Manages persistent state for an Academy agent."""

    def __init__(self, agent_name: str, state_dir: Path):
        self.state_file = state_dir / f"{agent_name}_state.json"
        self._load()  # Load existing state if present

    def set(self, key: str, value) -> None:
        """Set a state value and save to disk."""
        self._state[key] = value
        self.save()  # Auto-persist on change

    def get(self, key: str, default=None):
        """Get a state value."""
        return self._state.get(key, default)

2. Checkpointing After Each Step

@action
async def execute_workflow(self, task: str) -> dict:
    current_step = self._state.get("current_step", 0)

    for i, (step_name, step_func) in enumerate(steps):
        if i < current_step:
            logger.info(f"Skipping completed step {i+1}: {step_name}")
            continue

        # Execute step
        result = await step_func(task)

        # Checkpoint progress
        self._state.set("current_step", i + 1)
        self._state.set("step_results", step_results)

3. Graceful Interruption Handling

try:
    results = await workflow.execute_workflow(task)
except asyncio.CancelledError:
    status = await workflow.get_status()
    print(f"Progress saved at step {status['current_step']}")
    print("Run again to resume")

State File Format

State is stored as JSON for transparency and debugging:

{
  "created_at": "2026-02-04T17:00:00.000000",
  "updated_at": "2026-02-04T17:00:02.500000",
  "task": "discover catalysts for sustainable ammonia synthesis",
  "current_step": 3,
  "step_results": [
    {
      "step": 1,
      "name": "literature_review",
      "result": "Reviewed 47 papers...",
      "completed_at": "2026-02-04T17:00:00.500000"
    },
    ...
  ],
  "status": "in_progress"
}

Production Considerations

Aspect	Development	Production
Storage	Local filesystem	Object store (S3), database
Format	JSON	Consider binary for large state
Locking	None	Use file locks for concurrent access
Backup	Manual	Automated snapshots

Comparison to AgentsCheckpoint

Both examples demonstrate persistent state, but with different approaches:

Feature	AgentsPersistent (this)	AgentsCheckpoint
Framework	Academy	LangGraph
State format	Custom JSON	LangGraph checkpoints
Distribution	Multi-machine ready	Single process
Best for	Distributed workflows	LLM reasoning chains

Prerequisites:

AgentsAcademyBasic - Academy fundamentals
AgentsRemoteTools - Remote tool pattern

Next steps:

AgentsFederated - Cross-institutional collaboration
AgentsHybrid - Add LLM reasoning to persistent workflows
Long-Lived Agents - Extended agent lifecycles

Alternatives:

AgentsCheckpoint - LangGraph checkpointing (different approach)

Requirements

Python 3.10+
academy-py