CharacterizeChemicals
An LLM-planned, tool-executing molecular property agent.
| Tech: Academy | Code: View on GitHub |
This project demonstrates a full LLM-planned computational chemistry agent built on the Academy agent framework. Given one or more SMILES strings, the agent:
- Uses a local LLM (default:
microsoft/Phi-3.5-mini-instruct) to plan a multi-step workflow. - Executes real computational tools (RDKit, xTB, xTB GBSA) according to that plan.
- Parses quantum-chemical outputs (energies, dipole, HOMO–LUMO gap, solvation).
- Returns a structured properties dictionary per molecule.
- Logs the entire plan and every step executed for transparency and debugging.
Main entry point:
python run_chem_agent.py
Features
RDKit Descriptors
logPTPSA- Internal descriptors (e.g., exact MolWt)
- 3D geometry generation via RDKit embedding
xTB (GFN2-xTB) Quantum Properties
Extracted from real xTB output:
E_scc_hartree,E_scc_eVE_total_hartree,E_total_eVHOMO_LUMO_gap_eVdipole_moment_D(from the “molecular dipole”full:line)
xTB GBSA Solvation
solvation_free_energy_kcal_per_mol
Parsed from theGsolvterm in the SUMMARY block.
Transparency and Debugging
- Raw LLM-generated JSON plan printed before execution
- Parsed execution DAG printed
- Per-step logging of inputs and outputs
- xTB command lines logged
- Plan validation + stalled-plan detection
- Final aggregated properties for each molecule
Installation
The easiest way to install RDKit and xTB is with conda.
1. Create conda env
conda create -n chem-agent python=3.11 rdkit xtb -c conda-forge
conda activate chem-agent
2. Install Python dependencies
pip install torch transformers accelerate academy-py huggingface_hub
3. Verify xTB installation
xtb --version
If this fails, your agent will not be able to run xTB-dependent steps.
Usage
Run with default settings:
python run_chem_agent.py
Command-line arguments
python run_chem_agent.py [--model MODEL] [--smiles SMILES ...] [--props PROPERTIES ...] [--accuracy-profile {fast,balanced,high}]
Arguments
--model, -m
Hugging Face model ID for the planner LLM. Default:
microsoft/Phi-3.5-mini-instruct
--smiles, -s
One or more SMILES strings. Default:
CCO c1ccccc1 CC(=O)O
--props, -p
Desired properties (used as hints for the planner). Default:
logP dipole_moment solvation_free_energy
--accuracy-profile, -a
Planner behavior hint:
fast | balanced | high
Default: balanced
Examples
Default run:
python run_chem_agent.py
Single molecule:
python run_chem_agent.py --smiles "CCO"
Custom planner model:
python run_chem_agent.py --model Qwen/Qwen2.5-7B-Instruct
Multiple molecules:
python run_chem_agent.py -s CCO "c1ccccc1" "CC(=O)O"
Architecture Overview
1. Planner LLM → JSON Plan
The LLM receives:
- tool catalog
- SMILES string
- requested properties
- accuracy profile
It outputs a structured plan like:
{
"steps": [
{
"id": "s1_rdkit",
"tool": "rdkit_descriptors",
"inputs": { "smiles": "CCO", "descriptor_set": ["logP", "TPSA"] },
"depends_on": []
},
{
"id": "s2_xtb",
"tool": "xtb_opt",
"inputs": { "smiles": "CCO", "level": "GFN2-xTB" },
"depends_on": ["s1_rdkit"]
},
{
"id": "s3_solv",
"tool": "solvation_energy_from_xtb",
"inputs": {
"geometry_path": "step:s2_xtb.optimized_geometry",
"solvent": "water"
},
"depends_on": ["s2_xtb"]
}
]
}
The agent prints the raw plan and parsed version.
2. Execution DAG
The executor:
- resolves references like
"step:s2_xtb.optimized_geometry" - executes steps in dependency order
- logs before/after each step
- aborts on dependency cycles or stalled execution
3. Tool Layer
RDKit
- SMILES → molecule
- descriptor calculation
- 3D embedding for xTB
xTB optimization
xtb structure.xyz --opt --gfn 2
Parsers extract:
- SCC energy
- total energy
- HOMO-LUMO gap
- dipole moment
xTB GBSA solvation
xtb structure.xyz --gfn 2 --gbsa water
Parses:
- Gsolv (Eh → kcal/mol)
Property Aggregation
Final results include:
logPdipole_moment_Dsolvation_free_energy_kcal_per_molE_total_hartree,E_total_eVE_scc_hartree,E_scc_eVHOMO_LUMO_gap_eV
Returned as:
{
"status": "success",
"molecule_smiles": "...",
"properties": {...},
"plan_used": {...},
"provenance": {...}
}
Logging and Debugging
You will see:
- LLM planner call timing
- Raw JSON plan and parsed DAG
- Inputs/outputs for every tool step
- All xTB command lines
- Parsing warnings (missing fields, regex mismatches)
- Final property summary
This makes it easy to identify:
- bad plans,
- geometry failures,
- xTB crashes,
- solver issues.
Caveats
- Planner LLM can be slow on CPU/MPS; reduce
max_new_tokensor use a smaller model. - xTB may crash on malformed geometries; RDKit embedding quality matters.
- Regex parsers tuned for xTB 6.7.1; future versions may differ.
- This is an experimental prototype — production systems should add:
- retries,
- plan validation,
- caching,
- error propagation and recovery.
Related Examples
- AgentsAcademyBasic - Start here if new to Academy
- AgentsHybrid - Combining Academy with LangGraph
- AgentsHPCJob - LangGraph-based HPC job submission
Attribution
This project uses:
- Academy — https://github.com/academy-agents/academy
- RDKit — https://www.rdkit.org
- xTB — https://github.com/grimme-lab/xtb
- Transformers / Accelerate / torch
- Phi-3.5-mini-instruct — https://huggingface.co/microsoft/Phi-3.5-mini-instruct