Rule Discovery

Multi-agent simulation where autonomous agents discover hidden rules governing a simulated world through experimentation and peer-to-peer communication.

Overview

This project demonstrates automated scientific discovery: multiple AI agents independently run experiments, observe outcomes, identify patterns, and share findings with each other. The goal is to discover the hidden rules that govern how objects behave in a simulated world.

Key concepts:

- Hidden-rule discovery through repeated experimentation
- LLM-guided exploration of the experiment space
- Statistical hypothesis mining instead of LLM-generated hypotheses
- Probabilistic peer-to-peer sharing of structured observations

The Problem

A simulated world contains objects with properties and hidden rules that determine experimental outcomes.

Objects have four properties:

- material (e.g., metal, glass, rubber, wood)
- color (e.g., red, blue)
- size (e.g., small, large)
- shape (e.g., sphere, cube, pyramid)

Available experiments (8 types):

- drop into water
- drop on the floor
- throw
- apply electricity
- expose to fire
- place in the freezer
- place in sunlight
- place on a scale

Example hidden rules (unknown to agents):

- Metal objects conduct electricity
- Glass objects shatter when dropped
- Rubber objects bounce when thrown
- Wood objects burn when exposed to fire

Example experiment and outcome:

Experiment: "Drop the small red glass sphere into water"
Result: "The small red glass sphere shatters into pieces"
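The world's rule application can be sketched as a simple match over object properties. The `WorldObject`/`HiddenRule` names below are illustrative assumptions, not the project's actual API:

```python
from dataclasses import dataclass

# Illustrative sketch of how the world simulator could apply a hidden rule;
# WorldObject, HiddenRule, and run_experiment are assumed names.

@dataclass
class WorldObject:
    material: str
    color: str
    size: str
    shape: str

@dataclass
class HiddenRule:
    material: str     # material the rule keys on
    experiment: str   # experiment type that triggers it
    outcome: str      # outcome text returned to the agent

def run_experiment(obj: WorldObject, experiment: str, rules: list) -> str:
    """Check each hidden rule; the first match determines the outcome."""
    desc = f"{obj.size} {obj.color} {obj.material} {obj.shape}"
    for rule in rules:
        if obj.material == rule.material and experiment == rule.experiment:
            return f"The {desc} {rule.outcome}"
    return f"The {desc} shows no notable change"

rules = [HiddenRule("glass", "drop_water", "shatters into pieces")]
sphere = WorldObject("glass", "red", "small", "sphere")
print(run_experiment(sphere, "drop_water", rules))
# -> The small red glass sphere shatters into pieces
```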

How It Works

Agent Architecture

Each agent runs an asynchronous loop combining LLM reasoning with statistical pattern mining:

┌─────────────────────────────────────────────────────────────────┐
│  1. PROPOSE EXPERIMENT (LLM)                                    │
│     LLM suggests what to try next, encouraged to explore        │
│     diverse material/experiment combinations                    │
│     → "Apply electricity to the small blue metal cube"          │
├─────────────────────────────────────────────────────────────────┤
│  2. RUN EXPERIMENT (World Simulator)                            │
│     World applies hidden rules and returns outcome              │
│     → "The small blue metal cube conducts electricity"          │
├─────────────────────────────────────────────────────────────────┤
│  3. PARSE & RECORD (Structured Mining)                          │
│     Extract: material=metal, experiment=electricity,            │
│              outcome=conducts                                   │
│     Track correlation: metal+electricity → conducts (count: 5)  │
├─────────────────────────────────────────────────────────────────┤
│  4. GENERATE HYPOTHESES (Statistical)                           │
│     Find patterns with high evidence and confidence             │
│     → "Metal objects conducts when electricity" (100%, n=5)     │
├─────────────────────────────────────────────────────────────────┤
│  5. SHARE WITH PEERS (Probabilistic)                            │
│     Send structured observations to random peer                 │
│     Peer merges observations into their own dataset             │
└─────────────────────────────────────────────────────────────────┘
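A stripped-down, synchronous sketch of that loop follows; the real implementation is asynchronous and LLM-driven, so the class and method names here are assumptions, and the LLM proposal step is stubbed with a random choice:

```python
import random
from collections import Counter

class Agent:
    def __init__(self, name: str, peers=None):
        self.name = name
        self.peers = peers or []
        self.observations = []   # (material, experiment, outcome) tuples
        self.counts = Counter()  # correlation counts for hypothesis mining

    def propose_experiment(self) -> str:
        # Step 1: stand-in for the LLM's experiment proposal.
        return random.choice(["drop_water", "electricity", "fire", "throw"])

    def record(self, material: str, experiment: str, outcome: str):
        # Step 3: parse & record a structured observation.
        self.observations.append((material, experiment, outcome))
        self.counts[(material, experiment, outcome)] += 1

    def share(self, comm_prob: float = 0.3):
        # Step 5: probabilistically send observations to one random peer,
        # who merges them into its own dataset.
        if self.peers and random.random() < comm_prob:
            peer = random.choice(self.peers)
            for obs in self.observations:
                peer.record(*obs)
```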

Hypothesis Mining

Instead of asking the LLM to formulate hypotheses (which can be unreliable), we use statistical pattern mining:

  1. Parse observations into structured form: (material, experiment_type, outcome)
  2. Count correlations: How often does glass + drop → shatter?
  3. Calculate confidence: shatter_count / total_glass_drop_experiments
  4. Generate rules: If confidence > 60% and evidence count >= 2, create hypothesis
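The mining procedure above fits in a few lines. The thresholds come from step 4; the function name and observation format are illustrative:

```python
from collections import defaultdict

def mine_hypotheses(observations, min_conf=0.6, min_evidence=2):
    """observations: list of (material, experiment, outcome) tuples."""
    totals = defaultdict(int)    # (material, experiment) -> experiment count
    outcomes = defaultdict(int)  # (material, experiment, outcome) -> count
    for material, experiment, outcome in observations:
        totals[(material, experiment)] += 1
        outcomes[(material, experiment, outcome)] += 1

    hypotheses = []
    for (material, experiment, outcome), n in outcomes.items():
        confidence = n / totals[(material, experiment)]
        if confidence > min_conf and n >= min_evidence:
            hypotheses.append(
                (f"{material.capitalize()} objects {outcome} when {experiment}",
                 confidence, n))
    return hypotheses

obs = [("glass", "drop", "shatter")] * 3 + [("metal", "drop", "dent")]
print(mine_hypotheses(obs))
# -> [('Glass objects shatter when drop', 1.0, 3)]
```

The metal observation is dropped here because it has only one piece of evidence, below the `min_evidence` threshold.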

This approach produces clean, accurate hypotheses like:

- Metal objects conducts when electricity (100% confidence, 5 observations)
- Glass objects shatters when dropped (100% confidence, 8 observations)
- Rubber objects bounces when throw (100% confidence, 12 observations)

Communication

Agents share structured observations (not just text) with peers:
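The exact message schema is not shown in this section, so the field names below are hypothetical; the sketch illustrates the idea of merging a peer's structured observations while skipping duplicates:

```python
# Hypothetical shape of a shared message; real field names in
# hidden_rule_discovery.py may differ.
message = {
    "from": "Agent 0",
    "observations": [
        {"material": "glass", "experiment": "drop_water", "outcome": "shatters"},
        {"material": "metal", "experiment": "electricity", "outcome": "conducts"},
    ],
}

def merge_observations(dataset: set, message: dict) -> int:
    """Merge incoming observations into a peer's dataset; return how many
    were new (duplicates are skipped)."""
    added = 0
    for obs in message["observations"]:
        key = (obs["material"], obs["experiment"], obs["outcome"])
        if key not in dataset:
            dataset.add(key)
            added += 1
    return added

dataset = {("glass", "drop_water", "shatters")}
print(merge_observations(dataset, message))  # -> 1 (one duplicate skipped)
```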

Installation

pip install -r requirements.txt

Usage

Quick Start (Mock Mode)

python hidden_rule_discovery.py --agents 4 --steps 30 --difficulty easy

With Real LLM

Anthropic Claude:

export ANTHROPIC_API_KEY=your-key-here
LLM_PROVIDER=anthropic LLM_MODEL=claude-3-haiku-20240307 \
python hidden_rule_discovery.py --agents 4 --steps 30

OpenAI:

export OPENAI_API_KEY=your-key-here
LLM_PROVIDER=openai_compatible LLM_BASE_URL=https://api.openai.com/v1 \
LLM_MODEL=gpt-4o-mini python hidden_rule_discovery.py --agents 4 --steps 30

Argonne ALCF / vLLM / Other OpenAI-compatible:

LLM_PROVIDER=openai_compatible \
LLM_BASE_URL=https://your-endpoint/v1 \
LLM_API_KEY=your-key \
LLM_MODEL=meta-llama/Llama-3-70b-instruct \
python hidden_rule_discovery.py --agents 4 --steps 30

Command Line Options

Options:
  --provider, -p    LLM provider: mock, anthropic, openai_compatible
  --model, -m       Model name (provider-specific)
  --agents, -a      Number of agents (default: 4)
  --steps, -s       Number of simulation steps (default: 50)
  --difficulty, -d  easy | medium | hard (default: medium)
  --comm-prob, -c   Probability of sharing per step (default: 0.3)
  --no-comm         Disable communication between agents
  --seed            Random seed for reproducibility
  --log-agents      Enable detailed logging of LLM queries and agent actions

Verbose Logging

To see what agents are doing:

python hidden_rule_discovery.py --agents 2 --steps 10 --log-agents 2>logs.txt

Log output includes:

[Agent 0] LLM_QUERY #1: propose_experiment:
...
[Agent 0] LLM_RESPONSE #1: Drop the small red glass sphere into water
[Agent 0] OBSERVATION: glass/drop_water -> shatters
[Agent 0] HYPOTHESES_MINED: 2 rules: ['Glass objects shatters when dropped', ...]
[Agent 0] MSG_SEND #1: to=peer | My top hypotheses: ...
[Agent 1] MSG_RECV #1: from=Agent 0 | ...
[Agent 1] MERGED_OBS: added 3 observations from Agent 0

Difficulty Levels

Level   Rules  Description
easy    3      Simple material-based rules
medium  5      Includes color and size interactions
hard    8      Complex multi-property rules

Easy (3 rules)

- Metal objects conduct electricity
- Glass objects shatter when dropped
- Rubber objects bounce when thrown

Medium (5 rules)

- Metal objects conduct electricity
- Wood objects burn when exposed to fire
- Blue objects float in water
- Large objects sink in water
- Glass objects crack in the freezer

Hard (8 rules)

- Metal objects conduct electricity
- Metal objects heat up in sunlight
- Wood objects burn in fire
- Wood objects float in water
- Red objects are fireproof
- Small objects fly far when thrown
- Spheres roll when dropped on the floor
- Pyramids tip over on a scale

Sample Output

======================================================================
Hidden Rule Discovery with LLM Agents
======================================================================

LLM Provider: OpenAI-Compatible(gpt-4o-mini)
Difficulty: easy
Number of hidden rules: 3
Number of agents: 4
Communication probability: 0.3

[Hidden rules - for reference only]
- Metal objects conduct electricity
- Glass objects shatter when dropped
- Rubber objects bounce when thrown

--- Step 10 ---
Agent 0: Observations: 10, Messages: sent=3, recv=4
  Top hypotheses:
    - Metal objects conducts when electricity (100%)
    - Glass objects shatters when dropped (100%)

--- Step 30 ---
...

======================================================================
FINAL EVALUATION
======================================================================

DISCOVERED HYPOTHESES:
  [TRUE] Metal objects conducts when electricity... (100%) -> matches: Metal objects conduct electricity
  [TRUE] Glass objects shatters when dropped... (100%) -> matches: Glass objects shatter when dropped
  [TRUE] Rubber objects bounces when throw... (100%) -> matches: Rubber objects bounce when thrown

RULES FOUND (3/3):
  + Metal objects conduct electricity
  + Glass objects shatter when dropped
  + Rubber objects bounce when thrown

RULES MISSED (0):

SCORE: 100/100

Benchmark Results

We benchmarked multiple LLMs on the Argonne ALCF inference endpoints, varying the number of agents from 1 to 128.

Best Scores by Model

Model           Easy (3 rules)  Medium (5 rules)  Hard (8 rules)
Mixtral-8x22B   66%             80%               37%
Llama-3.1-8B    66%             40%               37%
Llama-3.1-70B   66%             40%               37%
Llama-3.1-405B  66%             40%               12%
Gemma-3-27B     33%             20%               25%

Performance by Agent Count

[Figure: Results by Difficulty]

Best Scores Comparison

[Figure: Best Scores]

Key Findings

  1. Mixtral MoE wins on medium difficulty (80%) - best overall performance
  2. Model size doesn’t correlate with performance - 8B performs as well as 405B for Llama models
  3. 405B actually worse on hard (12% vs 37%) - possibly over-thinking simple patterns
  4. Gemma underperforms across all difficulties despite being 27B
  5. More agents help on medium - Mixtral peaks at 16 agents with 80%
  6. Easy plateaus at 66% (2/3 rules) - one rule consistently missed across models

Models Tested

Models tested on the Argonne ALCF Sophia cluster:

- Mixtral-8x22B
- Llama-3.1-8B
- Llama-3.1-70B
- Llama-3.1-405B
- Gemma-3-27B

Communication & Scaling Experiments

We conducted extensive experiments to understand how communication strategies and agent count affect rule discovery.

Impact of Communication

Communication between agents significantly improves discovery rates. The best strategy is N1_E2: sharing with 1 peer every 2 experiments.

[Figure: Communication Impact]

Key finding: With 8 agents, communication improves scores from 40% to 60% on medium difficulty.

Variance Across Seeds

Communication introduces beneficial variance - some runs discover more rules than deterministic no-communication runs.

[Figure: Seed Variance]

Key finding: No-communication is deterministic (always 33%), while communication produces variance (33-66%) with higher mean (46%).

Hard Difficulty Improvements

We identified and fixed several issues that were limiting Hard difficulty performance.

Before vs After Improvements

[Figure: Before vs After]

Issues Fixed

  1. Missing “scale” action - The “place on a scale” experiment was missing from agent prompts, making “Pyramids tip over on a scale” impossible to discover

  2. Forced experiment diversity - Agents now cycle through all 8 experiment types instead of relying on LLM preferences (which biased toward “exciting” experiments like fire/electricity)

  3. Lower evidence threshold - Reduced from 2 to 1 observations required, allowing rare property+experiment combinations to be captured

  4. Targeted follow-up - 30% chance to re-explore properties that showed interesting outcomes
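The forced-diversity fix (item 2) and targeted follow-up (item 4) can be sketched as a round-robin proposer. The experiment-type names are inferred from the rules listed earlier, not confirmed identifiers:

```python
import itertools
import random

# Eight experiment types inferred from the listed rules; actual
# identifiers in hidden_rule_discovery.py may differ.
EXPERIMENT_TYPES = [
    "drop_water", "drop_floor", "throw", "electricity",
    "fire", "freezer", "sunlight", "scale",
]

def make_proposer(followup_prob=0.3, rng=random):
    """Cycle through every experiment type so none is starved by LLM bias,
    with a followup_prob chance to revisit an interesting experiment."""
    cycle = itertools.cycle(EXPERIMENT_TYPES)
    def propose(interesting=()):
        if interesting and rng.random() < followup_prob:
            return rng.choice(list(interesting))
        return next(cycle)
    return propose

propose = make_proposer(followup_prob=0.0)
first_eight = [propose() for _ in range(8)]  # covers all eight types once
```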

Scaling with Experiments

[Figure: Hard Scaling]

Best result: 87% (7/8 rules) with 640 experiments (8 agents × 80 each)

Summary Dashboard

[Figure: Summary Dashboard]

Hard Difficulty Results

Experiments  Before  After  Improvement
160          31%     56%    +25%
320          37%     62%    +25%
480          50%     75%    +25%
640          -       87%    Best

Remaining Challenge

The color-based rule “Red objects are fireproof” remains the hardest to discover: it manifests only as the absence of an outcome (red objects fail to burn), so agents must notice what does not happen rather than observe a positive result.

Files

File                        Description
hidden_rule_discovery.py    Main simulation with agents, world, and evaluation
llm_providers.py            LLM provider abstraction (Mock, Anthropic, OpenAI-compatible)
compare_models.py           Script to compare different LLMs and agent counts
compare_comm_strategies.py  Compare communication strategies (no_comm vs N1_E2, etc.)
vary_agents.py              Test scaling with different agent counts
multi_seed_test.py          Variance analysis across random seeds
plot_results.py             Generate visualization graphs
results.json                Cached benchmark results
figures/                    Generated graphs and visualizations
requirements.txt            Python dependencies

How Agents Discover Rules

  1. Exploration: LLM proposes diverse experiments (prompted to try under-tested combinations)
  2. Observation: World simulator returns deterministic outcomes based on hidden rules
  3. Pattern Detection: Statistical mining identifies correlations (e.g., “every time I drop glass, it shatters”)
  4. Hypothesis Formation: High-confidence patterns become hypotheses
  5. Knowledge Sharing: Agents share structured observations, pooling evidence
  6. Collective Discovery: Combined observations lead to faster, more complete discovery

Extending

Adding New Rules

Modify create_world() in hidden_rule_discovery.py:

rules = [
    Rule(
        condition="material == metal",      # Property condition
        experiment_type="electricity",      # Which experiment triggers it
        outcome="conducts electricity",     # Result text
        natural_language="Metal objects conduct electricity"  # Human description
    ),
    # Add more rules...
]

Adding New LLM Providers

  1. Create a class inheriting from LLMProvider in llm_providers.py
  2. Implement complete(system: str, user: str, max_tokens: int) -> str
  3. Add to create_llm_provider() factory
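A new provider might look like the sketch below. The `complete` signature is taken from step 2; the base class is re-declared as a stand-in for the one in llm_providers.py, and the HTTP payload follows the OpenAI-compatible chat-completions convention already used by the project's `openai_compatible` provider:

```python
import json
import urllib.request

class LLMProvider:
    """Stand-in for the base class defined in llm_providers.py."""
    def complete(self, system: str, user: str, max_tokens: int) -> str:
        raise NotImplementedError

class MyProvider(LLMProvider):
    """Illustrative provider for an OpenAI-compatible endpoint."""
    def __init__(self, base_url: str, api_key: str, model: str):
        self.base_url = base_url
        self.api_key = api_key
        self.model = model

    def complete(self, system: str, user: str, max_tokens: int) -> str:
        payload = json.dumps({
            "model": self.model,
            "max_tokens": max_tokens,
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        }).encode()
        req = urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=payload,
            headers={"Authorization": f"Bearer {self.api_key}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]
```

After defining the class, register it in the `create_llm_provider()` factory so it can be selected via `--provider`.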

LLM Usage

The system uses an LLM for two tasks:

  1. Experiment Proposal - The LLM suggests which experiment to run next, encouraged to explore diverse combinations
  2. Final Evaluation - The LLM compares discovered hypotheses against true rules for scoring

Hypothesis generation uses statistical mining instead of LLM - this produces more reliable, cleaner rules than asking the LLM to reason about patterns.

Dependencies