Lightweight Social Game

TradeCraft sets up controlled competition and cooperation: agents must trade to reach hidden target items instead of solving isolated QA tasks.

Structured ToM Traces

Each turn elicits item-wise value reports V0/V1/V2, enabling direct measurement of self-utility, opponent modeling, and second-order beliefs.

Behavior-Level Evidence

Beyond verbal claims, results connect ToM traces to proposal fairness and accept/reject behavior through quantitative fitting.

Why TradeCraft Matters

Theory of Mind is often reported in LLMs, but most evaluations test explicit verbal reasoning rather than whether inferred beliefs are actually used for action selection. TradeCraft addresses this gap by placing agents in a multiplayer environment with publicly visible inventories and private goals, requiring strategic negotiation and resource exchange.

Core Question

Do LLM agents use ToM to make better proposals and accept/reject decisions, and how does this usage change across model generations?

Across GPT-4o, o3, o4-mini, and GPT-5 self-play, explicit ToM elicitation systematically changes behavior. Increasing ToM order produces opposite trends across models, revealing that higher-order mental-state reasoning is integrated into policy in heterogeneous ways.

Trade, Decide, and Craft Under Hidden Goals

Step 01: Proposal. One player offers and requests items, with an optional message.
Step 02: Decision. The target player accepts or rejects using inventory, message, and goal inference.
Step 03: Craft. All players craft simultaneously under Minecraft/Little-Alchemy rule sets.

All hands are public; target items are private. Winning requires both planning and social reasoning.
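The proposal and decision steps can be sketched as a small data structure plus an inventory update. This is a minimal illustration under stated assumptions, not the project's implementation: the `Proposal` record, `is_feasible`, and `apply_trade` are hypothetical names, hands are modeled as `Counter` objects, and the crafting step is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical proposal record: who offers what to whom."""
    proposer: str
    target: str
    offer: Counter      # items the proposer gives away
    request: Counter    # items the proposer wants in return
    message: str = ""   # optional free-text message


def is_feasible(p, hands):
    """Both sides must actually hold the items being exchanged."""
    has_offer = all(hands[p.proposer][k] >= n for k, n in p.offer.items())
    has_request = all(hands[p.target][k] >= n for k, n in p.request.items())
    return has_offer and has_request


def apply_trade(p, hands, accepted):
    """Return the new public hands; a rejected or infeasible proposal changes nothing."""
    if not accepted or not is_feasible(p, hands):
        return hands
    new = {name: Counter(h) for name, h in hands.items()}
    new[p.proposer] = new[p.proposer] - p.offer + p.request
    new[p.target] = new[p.target] - p.request + p.offer
    return new
```

For example, if player A holds four `oak_planks` and player B holds two `stick`, an accepted proposal offering two planks for one stick leaves A with two planks and one stick. Private target items never appear in `hands`, which is what makes goal inference necessary.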

TradeCraft pipeline and elicited ToM report design
Figure. Pipeline and ToM reporting interface: each proposal/decision stage asks agents to output item-wise values over public inventories.

Environment Design

01

Rule Variability

Supports Minecraft Java 1.20 and Little Alchemy 2-style crafting; recipes and game modes are configurable.

02

Partial Information

Inventories are observable while each player's target item remains private, forcing belief modeling.

03

Tool-Augmented Agents

Agents can call `item_info` and `possible_recipes_from_hands` while keeping a turn-level dialogue memory.
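As a rough picture of what a recipe-lookup tool might compute, the sketch below lists the outputs craftable from a hand. It is not the project's `possible_recipes_from_hands`, whose signature is not given here: the function `craftable_items` and the table `TOY_RECIPES` are hypothetical, and the recipes are simplified toy entries rather than a real rule set.

```python
from collections import Counter

# Toy recipe table (illustrative only): output item -> required ingredients.
TOY_RECIPES = {
    "oak_planks": Counter(oak_log=1),
    "stick": Counter(oak_planks=2),
    "wooden_pickaxe": Counter(oak_planks=3, stick=2),
}


def craftable_items(hand, recipes):
    """List outputs whose ingredients are all present in `hand`, sorted by name."""
    return sorted(
        out for out, need in recipes.items()
        if all(hand[item] >= n for item, n in need.items())
    )
```

With a hand of three `oak_planks` and two `stick`, this returns `["stick", "wooden_pickaxe"]`; `oak_planks` is excluded because the hand holds no `oak_log`.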

04

Task Difficulty

40 predefined Minecraft 1v1 instances vary by crafting chain length and minimum trade requirements.

From Verbal Claims to Quantitative Belief Traces

Zero-Order ToM

V0: Self Utility

Item-wise value (0-10) for achieving the agent's own hidden target under current public inventories.

First-Order ToM

V1: Opponent Utility

The agent's estimate of opponent V0, used to measure goal inference quality via KL divergence to opponent reports.
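One way to compute such a score is sketched below. Several choices here are assumptions rather than the paper's stated protocol: the 0-10 values are normalized into a distribution with a small smoothing constant `eps`, the divergence is taken from the opponent's own report to the agent's estimate, and the helper names are hypothetical.

```python
import math


def to_distribution(values, eps=1e-6):
    """Turn 0-10 item values into a probability distribution (with smoothing)."""
    smoothed = [v + eps for v in values]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same items, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def belief_error(opponent_v0, agent_v1, items):
    """Lower is better: 0 means the V1 estimate matches the opponent's V0 profile."""
    p = to_distribution([opponent_v0[i] for i in items])
    q = to_distribution([agent_v1[i] for i in items])
    return kl_divergence(p, q)
```

Because the values are normalized first, this score compares the relative shape of the two value profiles, not their absolute scale.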

Second-Order ToM

V2: Opponent-on-Self

The agent's estimate of how the opponent values the agent's items, revealing strategic image and bargaining effects.

Per-Result Deep-Dive Subpages

Subpage 01 | Figure 4

Belief Calibration

Detailed walkthrough of KL-based first-order belief accuracy under V1/V2 elicitation, including protocol assumptions and interpretation scope.

  • Full setup: models, groups, instance split, and no-policy-instruction design.
  • Metric definition and turn-level reading guide.
  • Implications for ToM expression vs policy integration.

Subpage 02 | Figure 5

Proposal Strategy Shift

Detailed analysis of offer/request value geometry with m-r trajectories and bootstrap confidence intervals across ToM orders.

  • Construction of proposal value points from V0 reports.
  • Interpretation of m (exchange-rate proxy) and r (coupling strength).
  • Model-generation-specific strategy transitions.
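One plausible reading of the m-r analysis is sketched below; the exact definitions are on the subpage, so treat every detail here as an assumption. Each proposal is taken to be a point (value of offered items, value of requested items) under the proposer's V0, `m` is the least-squares slope through those points, `r` is their Pearson correlation, and the interval is a simple percentile bootstrap. All function names are hypothetical.

```python
import random


def slope_and_correlation(points):
    """Least-squares slope m and Pearson r for (offer_value, request_value) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    if sxx == 0 or syy == 0:
        return None  # degenerate sample: slope or correlation undefined
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def bootstrap_ci(points, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for m and r, resampling proposals."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_boot):
        sample = [rng.choice(points) for _ in points]
        fit = slope_and_correlation(sample)
        if fit is not None:
            fits.append(fit)
    ms = sorted(m for m, _ in fits)
    rs = sorted(r for _, r in fits)
    lo = int((alpha / 2) * len(fits))
    hi = int((1 - alpha / 2) * len(fits)) - 1
    return (ms[lo], ms[hi]), (rs[lo], rs[hi])
```

Under this reading, a slope near 1 with high correlation would indicate that requested value tracks offered value closely, while a steep slope would indicate proposals that ask for more than they give.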

Subpage 03 | Section 4.2

Decision Utility Fitting

Full derivation and interpretation of utility-based accept/reject modeling, including equations, fitting procedure, Macro-F1 table, and coefficient semantics.

  • Formal definitions of G0/L0, G1/L1, G2/L2 and utility equations.
  • LBFGS + grid-search + episode-wise CV protocol.
  • How coefficient patterns reflect strategic preference structure.
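The evaluation side of this protocol can be sketched without the fitting step. The code below shows Macro-F1 over accept/reject labels and a fold generator that keeps each episode entirely in train or test; the function names, label strings, and the round-robin fold assignment are assumptions for illustration, and the utility model and LBFGS fit are not reproduced here.

```python
def macro_f1(y_true, y_pred, labels=("accept", "reject")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def episode_folds(episode_ids, k=5):
    """Yield (train_idx, test_idx) so that no episode is split across the two."""
    episodes = sorted(set(episode_ids))
    for fold in range(k):
        held_out = set(episodes[fold::k])
        test = [i for i, e in enumerate(episode_ids) if e in held_out]
        train = [i for i, e in enumerate(episode_ids) if e not in held_out]
        yield train, test
```

Splitting by episode rather than by turn avoids leaking information between decisions made within the same game, and Macro-F1 weights accept and reject equally even when one label is much rarer.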

Subpage 04 | Sections 5-7

Integrated Discussion

Cross-result synthesis of implications, limitations, and future directions with terminology aligned to policy-level ToM integration.

  • Connects Figures 4-6 into one coherent narrative.
  • Clarifies scope boundaries and external validity constraints.
  • Outlines extensions toward human-in-the-loop and long-horizon outcomes.

Web-GUI Snapshots Across Turn Phases

TradeCraft proposal phase interface

Proposal Phase

The proposer sets offer and request bundles and an optional persuasion message for targeted negotiation.

TradeCraft response phase interface

Response Phase

The decision maker inspects the proposal and chooses to accept or reject, with live game messages shown alongside.

TradeCraft possible crafts phase interface

Possible Crafts

Rule-aware suggestions show the available crafting combinations to support long-horizon planning.

TradeCraft apply crafting phase interface

Apply Crafting

Craft plans are validated against selected rule constraints and inventory balances.

BibTeX

@inproceedings{tradecraft2026,
  title     = {TradeCraft: Exploring Theory of Mind in LLM Agents' Strategic Decision-Making and Communication},
  author    = {Anonymous Authors},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
  note      = {Under review}
}