Figure 4: Belief Calibration
As model capability increases, first-order belief calibration improves (lower KL), and second-order elicitation can mildly regularize first-order estimates.
Discussion Subpage 04
This page corresponds to Sections 5-7 in the paper. It synthesizes what the three result analyses jointly imply about Theory-of-Mind in LLM agents: when beliefs are merely expressed versus when they are integrated into actual decision policy.
Synthesis
As model capability increases, first-order belief calibration improves (lower KL divergence), and second-order elicitation can mildly regularize first-order estimates.
ToM-order effects on proposal fairness are heterogeneous: earlier and later model generations can move in opposite strategic directions.
Accept/reject behavior is predictable from low-dimensional ToM-valued gain/loss terms, and higher-order terms reshape coefficient geometry.
Calibration, exchange behavior, and decision weights are linked but not equivalent. Good belief reports do not automatically imply cooperative/fair strategy.
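To make the "lower KL" reading of calibration concrete, here is a minimal sketch of how a KL-based calibration score could be computed for one reported belief. The function name, the direction of the divergence (reference first, reported second), and the example distributions are all assumptions for illustration; they are not taken from the paper.

```python
import math

def kl_divergence(reference, reported, eps=1e-12):
    """KL(reference || reported) in nats for two discrete distributions.

    Assumption: both lists are probabilities over the same outcomes,
    in the same order. `eps` guards against log(0) in the reported belief.
    """
    if len(reference) != len(reported):
        raise ValueError("distributions must have the same support")
    total = 0.0
    for p, q in zip(reference, reported):
        if p > 0:  # terms with p == 0 contribute nothing
            total += p * math.log(p / max(q, eps))
    return total

# Hypothetical numbers: a reference distribution over three partner states
# and two first-order belief reports elicited from different models.
reference_belief = [0.7, 0.2, 0.1]
report_close = [0.65, 0.25, 0.10]
report_far = [0.2, 0.3, 0.5]

kl_close = kl_divergence(reference_belief, report_close)
kl_far = kl_divergence(reference_belief, report_far)
print(f"close report KL: {kl_close:.4f}")
print(f"far report KL:   {kl_far:.4f}")
```

Under this convention a lower score means the reported belief sits closer to the reference, which is the sense in which calibration "improves" above.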
Terminology used in this project page:
ToM expression capability = ability to report plausible belief traces;
Policy-level ToM integration = consistent use of those beliefs to guide proposal and accept/reject actions.
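One way to picture policy-level integration is a low-dimensional model that predicts accept/reject from gain and loss terms. The sketch below fits a plain logistic regression by gradient descent on made-up data; the two features (`gain`, `loss`), the toy observations, and the function names are hypothetical stand-ins, and the paper's actual feature set and fitting procedure may differ.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def accept_probability(gain, loss, weights):
    """P(accept) under a logistic model with weights (w_gain, w_loss, bias)."""
    w_gain, w_loss, bias = weights
    return sigmoid(w_gain * gain + w_loss * loss + bias)

def fit_accept_model(data, lr=0.1, steps=500):
    """Fit (w_gain, w_loss, bias) by batch gradient descent on log loss.

    `data` is a list of (gain, loss, accepted) tuples with accepted in {0, 1}.
    """
    w_gain = w_loss = bias = 0.0
    n = len(data)
    for _ in range(steps):
        g_gain = g_loss = g_bias = 0.0
        for gain, loss, accepted in data:
            err = accept_probability(gain, loss, (w_gain, w_loss, bias)) - accepted
            g_gain += err * gain
            g_loss += err * loss
            g_bias += err
        w_gain -= lr * g_gain / n
        w_loss -= lr * g_loss / n
        bias -= lr * g_bias / n
    return w_gain, w_loss, bias

# Hypothetical observations: (perceived gain, perceived loss, accepted?)
toy_data = [(3, 1, 1), (1, 3, 0), (2, 0, 1), (0, 2, 0), (4, 2, 1), (1, 2, 0)]
weights = fit_accept_model(toy_data)
print("w_gain, w_loss, bias =", weights)
```

In a model of this shape, adding higher-order ToM terms would simply mean extra feature columns, and comparing the fitted weights with and without them is one way to see how the coefficients shift.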
Implications
Standard ToM evaluation often tests direct question answering. TradeCraft instead evaluates whether inferred beliefs are operationalized in a dynamic, strategic, multi-agent loop where negotiation and crafting constraints jointly matter.
Limitations
Future Work
TradeCraft provides evidence that the role of ToM in LLM decision-making is model-dependent and can change qualitatively with scaling and training. The benchmark is designed to make this transition observable through linked belief, proposal, and decision measurements.