Integrated Discussion, Limitations, and Outlook

This page corresponds to Sections 5-7 in the paper. It synthesizes what the three result analyses jointly imply about Theory-of-Mind (ToM) in LLM agents: distinguishing cases where beliefs are merely expressed from cases where they are integrated into the actual decision policy.


How Figures 4-6 Connect

Figure 4: Belief Calibration

As model capability increases, first-order belief calibration improves (lower KL), and second-order elicitation can mildly regularize first-order estimates.
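The calibration metric can be illustrated with a small sketch. The function below is a hypothetical re-implementation, not the paper's code; the KL direction (empirical against reported) and the smoothing constant are assumptions.

```python
import math

def belief_kl(empirical, reported, eps=1e-9):
    """KL divergence of a reported belief from the empirical distribution
    over a discrete outcome space. Lower values mean the model's stated
    beliefs track the true distribution more closely. The divergence
    direction and smoothing are assumptions, not taken from the paper."""
    zp, zq = sum(empirical), sum(reported)
    p = [x / zp for x in empirical]
    q = [max(x / zq, eps) for x in reported]  # clip to avoid log(0)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical beliefs over four opponent-inventory states
empirical = [0.5, 0.3, 0.15, 0.05]   # ground truth from rollouts
reported  = [0.4, 0.35, 0.15, 0.10]  # model's stated first-order belief
print(round(belief_kl(empirical, reported), 4))
```

A perfectly calibrated report drives the score to zero, so improvements across model generations show up directly as lower values.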

Figure 5: Proposal Behavior

ToM-order effects on proposal fairness are heterogeneous: earlier and later model generations can move in opposite strategic directions.

Figure 6 + Table 1: Decision Policy

Accept/reject behavior is predictable from a low-dimensional set of ToM-derived gain/loss terms, and higher-order ToM reshapes the coefficient geometry.
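The low-dimensional readout can be sketched as a logistic regression over gain/loss features. Everything below is illustrative: the two-feature model, the synthetic trades, and the training setup are assumptions; the paper's actual regression may include more terms per ToM order.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_accept_policy(trades, lr=0.5, steps=2000):
    """Fit P(accept) = sigmoid(b + w_g * gain + w_l * loss) by gradient
    descent on the logistic loss. `trades` is a list of
    (gain, loss, accepted) tuples, where gain/loss stand in for
    ToM-derived expected value changes (hypothetical features)."""
    w_g = w_l = b = 0.0
    n = len(trades)
    for _ in range(steps):
        gg = gl = gb = 0.0
        for gain, loss, y in trades:
            err = sigmoid(b + w_g * gain + w_l * loss) - y
            gg += err * gain
            gl += err * loss
            gb += err
        w_g -= lr * gg / n
        w_l -= lr * gl / n
        b  -= lr * gb / n
    return w_g, w_l, b

# Synthetic trades: accepted whenever expected gain exceeds expected loss
trades = [(0.9, 0.1, 1), (0.8, 0.3, 1), (0.7, 0.2, 1), (0.6, 0.5, 1),
          (0.2, 0.8, 0), (0.3, 0.7, 0), (0.1, 0.9, 0), (0.4, 0.6, 0)]
w_g, w_l, b = fit_accept_policy(trades)
```

On data like this the fitted weights recover the expected geometry (positive weight on gain, negative on loss); "coefficient reshaping" across ToM orders would appear as changes in these fitted weights.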

Joint Message

Calibration, exchange behavior, and decision weights are linked but not equivalent: well-calibrated belief reports do not automatically imply a cooperative or fair strategy.

Terminology used in this project page:

  • ToM expression capability: the ability to report plausible belief traces.
  • Policy-level ToM integration: the consistent use of those beliefs to guide proposal and accept/reject actions.
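The distinction can be made operational with a minimal consistency check: does the agent's action agree with its own stated valuation of the options? This is an illustrative sketch; the record format and the tie-breaking rule are assumptions, not the paper's protocol.

```python
def policy_integration_rate(records):
    """Fraction of decisions consistent with the agent's own stated beliefs.

    `records` is a list of (believed_accept_value, believed_reject_value,
    action) tuples, where action is "accept" or "reject". A decision is
    belief-consistent when it picks the option the agent itself valued
    higher (ties resolved toward accepting, an arbitrary assumption)."""
    consistent = sum(
        1 for v_accept, v_reject, action in records
        if (action == "accept") == (v_accept >= v_reject)
    )
    return consistent / len(records)

# Hypothetical transcript: two consistent decisions, one inconsistent
records = [(1.0, 0.2, "accept"), (0.1, 0.9, "reject"), (0.8, 0.3, "reject")]
print(policy_integration_rate(records))
```

An agent with high expression capability but low integration would report plausible valuations yet score poorly on this rate.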

What This Adds Beyond Standard ToM Benchmarks

From Ability Probing to Policy Probing

Standard ToM evaluation often tests direct question answering. TradeCraft instead evaluates whether inferred beliefs are operationalized in a dynamic, strategic, multi-agent loop where negotiation and crafting constraints jointly matter.

  • Same base model can produce different strategic profiles under different ToM-order elicitation settings.
  • Higher-order ToM can add a strategic layer related to reputation, bargaining position, and informational consequences.
  • Model-generation differences are qualitative, not merely scalar performance gaps.

Scope Boundaries in the Current Study

Current Constraints (Paper Section 6)

  • The environment is intentionally lightweight; realism and tractability are balanced rather than maximized.
  • ToM orders are analyzed only up to second order.
  • ToM is elicited via prompting rather than learned belief modules.
  • The main evaluation is self-play, which may not capture all human-agent dynamics.

Next Research Directions

Extensions Proposed in the Paper

  • Expand task diversity: more players, alternative rulesets, richer communication protocols.
  • Stress-test generalization under distribution shift and rule perturbation.
  • Connect ToM traces to longer-horizon outcomes such as win probability and exploitability.
  • Increase human-in-the-loop evaluation to study ad-hoc collaboration and robustness.

Final Takeaway (Section 7)

TradeCraft provides evidence that the role of ToM in LLM decision-making is model-dependent and can change qualitatively with scaling and training. The benchmark is designed to make this transition observable through linked belief, proposal, and decision measurements.