HPC Job Submission Agent
An agent that submits and monitors batch jobs on HPC systems.
| Tech: LangGraph | Code: View on GitHub |
What It Does
- User requests a computation (e.g., “run a DFT calculation”)
- Agent submits a batch job to the HPC scheduler
- Agent monitors job status until completion
- Agent retrieves and interprets the results
This demonstrates federated agent execution — an agent running locally (or on one system) that orchestrates computations on remote HPC resources.
The Code
@tool
def submit_job(script_name: str, nodes: int = 1, walltime_hours: int = 1) -> str:
"""Submit a batch job to the HPC cluster."""
# In production: SSH to cluster or call scheduler API
return f"Job {job_id} submitted to queue"
@tool
def check_job_status(job_id: int) -> str:
"""Check the status of a submitted job."""
# In production: Query SLURM/PBS scheduler
return f"Job {job_id}: RUNNING"
@tool
def get_job_output(job_id: int) -> str:
"""Retrieve output from a completed job."""
# In production: Fetch stdout/stderr from cluster
return "DFT calculation results..."
agent = create_react_agent(llm, [submit_job, check_job_status, get_job_output])
Running the Example
cd Capabilities/federated-agents/AgentsHPCJob
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python main.py
Custom task:
python main.py --task "Run an MD simulation and report the results"
LLM Configuration
Supports OpenAI, FIRST (HPC inference), Ollama (local), or mock mode.
See LLM Configuration for details on configuring LLM backends, including Argonne’s FIRST service.
Sample Job Scripts
The data/ directory contains example SLURM job scripts:
data/
├── dft_catalyst.sh # DFT calculation for Cu-catalyst CO2 adsorption
└── md_simulation.sh # MD simulation of Cu nanoparticle in water
Tools
| Tool | Description |
|---|---|
submit_job |
Submit a batch job to the cluster |
check_job_status |
Check if job is QUEUED, RUNNING, or COMPLETED |
get_job_output |
Retrieve stdout/results from completed job |
list_jobs |
List all submitted jobs |
cancel_job |
Cancel a queued or running job |
Production Integration
This example simulates HPC interaction. In production, the tools would connect to real schedulers:
SLURM:
def submit_job(script):
result = subprocess.run(["sbatch", script], capture_output=True)
job_id = parse_job_id(result.stdout)
return job_id
def check_status(job_id):
result = subprocess.run(["squeue", "-j", str(job_id)], capture_output=True)
return parse_status(result.stdout)
REST API (e.g., via Globus Compute):
def submit_job(script):
response = requests.post(f"{HPC_API}/jobs", json={"script": script})
return response.json()["job_id"]
Key Points
- Job lifecycle: Submit → Queue → Run → Complete → Retrieve
- Asynchronous: Jobs run independently; agent polls for status
- Federated: Agent and HPC system are separate; connected via scheduler
Related Examples
- AgentsCalculator - Start here if new to LangGraph agents
- AgentsRemoteTools - Academy pattern for remote tool invocation
- CharacterizeChemicals - Academy-based molecular property agent
Requirements
- Python 3.10+
- LangGraph 0.2+
- OpenAI API key, FIRST token, Ollama, or run in mock mode