Belief Calibration Under ToM Elicitation

This page expands the analysis behind Figure 4 in the paper: whether agents form accurate first-order beliefs about opponents, and whether eliciting second-order ToM signals indirectly improves first-order calibration.

Figure 4 (paper Section 4, Experiment). Turn-wise KL divergence between an agent's first-order ToM estimate (V1) and the opponent's zero-order report (V0) under the V1-group and V2-group settings.

Experimental Setup

The protocol follows Section 4 of the paper and keeps the decision policy unconstrained by explicit strategy instructions.

  • Models: GPT-4o, o3, o4-mini, GPT-5 (medium reasoning effort).
  • Evaluation regime: self-play (same-model vs same-model).
  • Task set: first half of predefined instances (20 out of 40).
  • Opponent baseline: same-model agent outputting only V0.
  • Compared groups: V0-group, V1-group, V2-group.
  • No policy instruction is added for how ToM reports should be used; usage is implicit.

Why this matters: the setup isolates whether richer ToM traces change behavior/calibration without hand-coding a reasoning policy.

Belief Error as Distribution Distance

KL Divergence

For each turn, the paper compares a player's first-order estimate of opponent item values (V1) with the opponent's actual zero-order self-valuation (V0). Lower KL indicates closer alignment between inferred and reported opponent preferences.

\[ D_{\mathrm{KL}}(P\|Q)=\sum_i P(i)\log\frac{P(i)}{Q(i)} \]

\[ P:=\mathrm{normalize}(V1),\quad Q:=\mathrm{normalize}(V0_{\text{opponent}}) \]

Lower \(D_{\mathrm{KL}}\) means the first-order belief is closer to the opponent's reported self-valuation.
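The per-turn metric above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dict-based representation of V1/V0 and the additive `eps` smoothing (to avoid zeros in either distribution) are assumptions made here.

```python
import math

def kl_divergence(p_raw, q_raw, eps=1e-9):
    """D_KL(P || Q) between two item-value maps after normalization.

    p_raw: the agent's first-order estimate V1 (item -> value).
    q_raw: the opponent's reported V0 (item -> value).
    eps: additive smoothing (an assumption here, not from the paper)
         so that log(p/q) is defined when a value is zero.
    """
    items = sorted(set(p_raw) | set(q_raw))
    p = [p_raw.get(i, 0.0) + eps for i in items]
    q = [q_raw.get(i, 0.0) + eps for i in items]
    # Normalize raw values into probability distributions P and Q.
    p_sum, q_sum = sum(p), sum(q)
    p = [v / p_sum for v in p]
    q = [v / q_sum for v in q]
    # Sum P(i) * log(P(i) / Q(i)) over the shared item set.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the estimate matches the opponent's report exactly, the divergence is zero; mismatched rankings (e.g. `{"book": 3, "hat": 1}` vs. `{"book": 1, "hat": 3}`) yield a strictly positive value.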

The curve is plotted against turn index, letting us inspect whether belief quality improves with interaction history, rather than relying only on aggregate averages.

  • Observed trend: weaker models show larger KL overall.
  • Turn dynamics: KL often decreases over turns for stronger models.
  • V2 effect: the V2-group can slightly improve first-order calibration.
  • Main claim: belief calibration differs significantly by model generation.

What Figure 4 Demonstrates

Model Separation

GPT-4o has substantially higher KL, indicating difficulty in accurately estimating opponent goals/utilities. o3 and GPT-5 show tighter alignment.

Turn-Level Adaptation

In stronger models, KL tends to drop as the game progresses, suggesting they update opponent models from observed trade history.
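One simple way to quantify "KL tends to drop" is the least-squares slope of per-turn KL within a game: a negative slope indicates convergence toward the opponent's reported valuation. The helper below is a hypothetical sketch for illustration, not an analysis from the paper.

```python
def kl_trend(kl_by_turn):
    """Least-squares slope of KL over turn index (0, 1, 2, ...).

    kl_by_turn: list of per-turn KL values for one game.
    A negative slope suggests the agent's first-order belief is
    converging toward the opponent's reported self-valuation.
    """
    n = len(kl_by_turn)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(kl_by_turn) / n
    # Standard simple-regression slope: cov(x, y) / var(x).
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, kl_by_turn))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den
```

For a steadily improving game such as `[1.0, 0.8, 0.6, 0.4]` the slope is -0.2 per turn, while a flat series returns 0.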

Indirect Regularization

Asking agents to also provide second-order estimates (V2) can sharpen first-order estimates, even though the metric directly evaluates only V1.

Behavioral Relevance

This calibration signal is linked to downstream proposal/decision analyses in Sections 4.1 and 4.2, rather than being treated as an isolated score.

Interpretation, Scope, and Caveats

Interpretation

Figure 4 supports a key separation emphasized later in the paper: producing plausible ToM text is not equivalent to policy-level integration. Belief calibration quality appears to be a necessary condition for strategic ToM use, but not a sufficient one.

  • Necessary signal: lower KL means the model's opponent model is closer to the opponent's reported self-valuation.
  • Not sufficient: calibration alone does not guarantee fair proposals or cooperative acceptance behavior.
  • Bridge to later sections: proposal slope/correlation and logistic weights test whether these beliefs are translated into action.

Limitations in This Analysis

  • Self-play only in the main experiment; human-agent calibration is future work.
  • Prompt-based elicitation of ToM reports rather than learned latent belief modules.
  • ToM is evaluated only up to second order in this study.