The Dossier · Live Record

Cross-play, on the record.

Deception × detection across 6 models, logged live from the study runner.

Exhibit A · Case summary

Runner idle

Games logged

166

snapshot v2

Leakage rate

—

intent caught by monitor

Monitor AUC

—

impostor ID from reasoning

Compute

$11.33

of $40.00 cap

Budget consumed: 28.3%
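The budget figure above is spend divided by cap. A minimal sketch of that arithmetic follows; the function name `budget_consumed` is hypothetical and not taken from the study runner.

```python
def budget_consumed(spent: float, cap: float) -> float:
    """Return the share of the compute cap already spent, in percent."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return 100.0 * spent / cap


# Values from the case summary: $11.33 spent of a $40.00 cap.
share = budget_consumed(11.33, 40.00)
print(f"{share:.1f}%")  # 28.3%
```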

Fig. 1 — Cross-play win-rate matrix

Read — Rows: impostor model. Columns: villager panel. Cell = impostor win rate on a bone → ember scale. Marginals rank deception (right) and detection (bottom).

imp ╲ vil

Rows (impostor model): GPT-5.5 · Opus 4.8 · Gemini 3.5 Flash · DeepSeek V4 Pro · Llama 4 Maverick · Grok 4.3 · detect ↓

Columns (villager panel): GPT-5.5 · Opus 4.8 · Gemini 3.5 Flash · DeepSeek V4 Pro · Llama 4 Maverick · Grok 4.3 · deceive →

Fig. 2 — Marginal rankings

Read — Deception aggregates each model's impostor win rate across all panels; detection aggregates the crew win rate when it sits on the panel.
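The two marginals described above can be sketched from per-game records. This is an illustrative sketch only. The record layout `(impostor, panel, impostor_won)`, the placeholder model names, and the choice to pool over games (rather than average the matrix cells) are all assumptions, not the study runner's confirmed method.

```python
from collections import defaultdict

# Hypothetical game records: (impostor model, villager panel model, impostor won?)
games = [
    ("model_a", "model_b", True),
    ("model_a", "model_c", False),
    ("model_b", "model_a", False),
    ("model_b", "model_c", True),
    ("model_c", "model_a", False),
    ("model_c", "model_b", True),
]


def marginals(records):
    """Return (deception, detection) win rates pooled over all games."""
    imp_wins, imp_games = defaultdict(int), defaultdict(int)
    crew_wins, crew_games = defaultdict(int), defaultdict(int)
    for impostor, panel, impostor_won in records:
        # Deception: how often this model wins when it plays impostor.
        imp_games[impostor] += 1
        imp_wins[impostor] += impostor_won
        # Detection: how often the crew wins when this model is the panel.
        crew_games[panel] += 1
        crew_wins[panel] += not impostor_won
    deception = {m: imp_wins[m] / imp_games[m] for m in imp_games}
    detection = {m: crew_wins[m] / crew_games[m] for m in crew_games}
    return deception, detection


deception, detection = marginals(games)
```

Pooling weights each matchup by how many games it logged. Averaging the matrix cells instead would weight every matchup equally, and the two only agree when each cell has the same number of games.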

Exhibit B · Deception

Best deceivers

impostor win rate, marginalised across all panels

01 · GPT-5.5 · 57.1%
02 · Opus 4.8 · 60.0%
03 · Gemini 3.5 Flash · 55.0%
04 · DeepSeek V4 Pro · 42.1%
05 · Llama 4 Maverick · 38.1%
06 · Grok 4.3 · 45.0%

Exhibit C · Detection

Best detectors

crew win rate when acting as the villager panel

01 · GPT-5.5 · 94.7%
02 · Opus 4.8 · 64.0%
03 · Gemini 3.5 Flash · 52.6%
04 · DeepSeek V4 Pro · 34.8%
05 · Llama 4 Maverick · 30.0%
06 · Grok 4.3 · 27.3%

Exhibit · Interrogation logs

Note — Each turn is numbered. Public statements are on the record; private reasoning is sealed — declassify it line by line, or reveal the whole log.

Evidence index

40 logged interrogations

Interrogation log