Long-Lived Agent with Checkpoint/Resume

An agent that persists across sessions through automatic checkpointing.

What It Does

Demonstrates patterns for agents that run for days to months:

Automatic checkpointing: Save state after each task
Resume from interruption: Restart exactly where you left off
Signal handling: Graceful shutdown on Ctrl+C or SIGTERM
Multi-step workflows: Track tasks, progress, and findings

Running the Example

Start a new workflow:

cd Capabilities/long-lived-agents/AgentsCheckpoint
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python main.py --goal "Analyze catalyst performance"

Interrupt with Ctrl+C, then resume:

python main.py --resume

List available checkpoints:

python main.py --list

LLM Configuration

Supports OpenAI, FIRST (HPC inference), Ollama (local), or mock mode.

See LLM Configuration for details.

Checkpoint Architecture

The agent state holds identity, goals, tasks, and memory, and is persisted through checkpoints.

Key Patterns

Automatic checkpointing after each task:

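def complete_task(self, task_id: str, result: str):
    # Look up the task by ID (assumes each Task has an `id` attribute);
    # raises StopIteration if no task matches
    task = next(t for t in self.tasks if t.id == task_id)
    task.status = TaskStatus.COMPLETED
    task.result = result
    self.steps_completed += 1
    # Persist immediately so finished work survives a later crash
    self.save_checkpoint("task_completed")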
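
The save itself is not shown above. One way it could be written so that a crash mid-write never leaves a truncated checkpoint is sketched below. This is a minimal sketch, not the example's actual implementation: the function name `save_checkpoint_atomic` and the plain-dict state are assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint_atomic(state: dict, path: Path) -> None:
    """Write `state` as JSON without ever exposing a half-written file.

    Hypothetical helper: the name and the dict-based state are assumptions.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # the rename is atomic on POSIX
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Readers of the checkpoint then see either the previous complete file or the new complete file, never a partial one.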

Graceful shutdown on signals:

import signal
import sys

def handle_signal(signum, frame):
    print("Received signal, saving checkpoint...")
    STATE.save_checkpoint("signal_interrupt")
    sys.exit(0)

signal.signal(signal.SIGINT, handle_signal)
signal.signal(signal.SIGTERM, handle_signal)
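
The handler above saves from inside the signal handler. An alternative, sketched here under assumptions, is to have the handler only set a flag and let the main loop checkpoint at the next task boundary, so a signal can never interrupt a save that is already in progress. The names `shutdown_requested`, `request_shutdown`, `install_handlers`, and `run` are hypothetical and not part of the example.

```python
import signal
import threading

# Hypothetical flag: set by the handler, polled by the main loop.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler that only records the request; no I/O happens here."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGINT, request_shutdown)
    signal.signal(signal.SIGTERM, request_shutdown)


def run(tasks, execute, save_checkpoint):
    """Run tasks in order, stopping at the next task boundary after a signal."""
    done = []
    for task in tasks:
        if shutdown_requested.is_set():
            save_checkpoint("signal_interrupt")
            break
        execute(task)
        done.append(task)
        save_checkpoint("task_completed")
    return done
```

The trade-off is latency: a long-running task finishes before the agent exits, whereas the in-handler version stops immediately.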

Resume interrupted work:

def get_next_task(self) -> Optional[Task]:
    # First, resume any in-progress tasks
    for task in self.tasks:
        if task.status == TaskStatus.IN_PROGRESS:
            return task
    # Then get next pending task
    for task in self.tasks:
        if task.status == TaskStatus.PENDING:
            return task
    return None
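
To show how the selection rule above behaves after a restart, here is a self-contained sketch that rebuilds a task list from a checkpoint file and picks the task to resume. The JSON layout (`tasks`, `id`, `status`, `result`) and the helper names `load_tasks` and `next_task` are assumptions for illustration; the example's real checkpoint schema may differ.

```python
import json
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class TaskStatus(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


@dataclass
class Task:
    id: str
    status: TaskStatus
    result: Optional[str] = None


def load_tasks(path: Path) -> list[Task]:
    """Rebuild tasks from a checkpoint file (assumed JSON layout)."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return [
        Task(t["id"], TaskStatus(t["status"]), t.get("result"))
        for t in data["tasks"]
    ]


def next_task(tasks: list[Task]) -> Optional[Task]:
    """In-progress tasks first, then pending ones, in list order."""
    for wanted in (TaskStatus.IN_PROGRESS, TaskStatus.PENDING):
        for task in tasks:
            if task.status == wanted:
                return task
    return None
```

A task that was interrupted mid-run is still marked in progress in the checkpoint, so it is returned before any pending task.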

Tools

Tool	Description
`save_checkpoint`	Manually save state
`remember_finding`	Store important finding
`recall_finding`	Retrieve stored finding
`get_progress`	Show workflow progress
`mark_task_complete`	Complete a task with result

Checkpoint Files

Checkpoints are stored in checkpoints/:

checkpoints/
  agent_20240115_103045.json       # Latest state
  agent_20240115_103045_0001.json  # Backup after checkpoint 1
  agent_20240115_103045_0002.json  # Backup after checkpoint 2
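
The layout above could be produced by writing every checkpoint twice: once to the latest-state file and once to a numbered backup. The following is a sketch assuming that scheme. The function `write_checkpoint` and its `seq` parameter are hypothetical names, and the zero-padded four-digit suffix is inferred from the listing.

```python
import json
from pathlib import Path


def write_checkpoint(directory: Path, agent_id: str, state: dict, seq: int) -> tuple[Path, Path]:
    """Write the latest-state file plus a numbered backup (hypothetical helper)."""
    directory.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(state, indent=2)
    latest = directory / f"{agent_id}.json"             # overwritten every time
    backup = directory / f"{agent_id}_{seq:04d}.json"   # one file per checkpoint
    latest.write_text(payload, encoding="utf-8")
    backup.write_text(payload, encoding="utf-8")
    return latest, backup
```

Zero-padding the sequence number keeps the backups in chronological order under a plain lexicographic sort.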

Example Session

$ python main.py --goal "Analyze catalyst data"

============================================================
LONG-LIVED AGENT WITH CHECKPOINT/RESUME
============================================================
Agent ID: agent_20240115_103045
Goal: Analyze catalyst data
Progress: 0/4
============================================================

--- Task: t1_analyze ---
Description: Analyze existing data
Result: Analysis complete: Mean=42.3, Std=5.7

--- Task: t2_simulate ---
Description: Run computational simulation
^C
[SIGNAL] Received signal 2, saving checkpoint...
[SIGNAL] Checkpoint saved. Exiting.

$ python main.py --resume

Resuming agent agent_20240115_103045...
Progress: 1/4

--- Task: t2_simulate ---
Description: Run computational simulation
(continues from where it left off)

Production Considerations

This example demonstrates the core patterns only. Production systems should add:

Distributed storage: Store checkpoints in S3, GCS, or shared filesystem
Checkpoint compression: Compress large state for efficiency
Retention policy: Clean up old checkpoint backups
Health monitoring: Alert if agent stalls or fails repeatedly
Idempotent tasks: Ensure tasks can be safely re-run
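
As one illustration of the retention policy item above, the sketch below keeps only the newest numbered backups for an agent and leaves the latest-state file untouched. It assumes the file naming shown under Checkpoint Files. The function `prune_backups` and its `keep` parameter are hypothetical.

```python
from pathlib import Path


def prune_backups(directory: Path, agent_id: str, keep: int = 5) -> list[Path]:
    """Delete all but the newest `keep` numbered backups; return deleted paths."""
    # Matches agent_<id>_0001.json style backups, not the latest-state file.
    pattern = f"{agent_id}_[0-9][0-9][0-9][0-9].json"
    backups = sorted(directory.glob(pattern))  # zero-padded, so sorted by age
    stale = backups[:-keep] if keep > 0 else backups
    for path in stale:
        path.unlink()
    return stale
```

Calling it after each successful checkpoint bounds disk usage at `keep` backups per agent.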

Requirements

Python 3.10+
LangGraph 1.0+
An OpenAI API key, a FIRST token, a local Ollama installation, or mock mode (no credentials needed)