The Dossier · Live Record
Cross-play, on the record.
Deception × detection across 6 models, logged live from the study runner.
Exhibit A · Case summary
Runner idle
Games logged
166
snapshot v2
Leakage rate
—
intent caught by monitor
Monitor AUC
—
impostor ID from reasoning
Compute
$11.33
of $40.00 cap
Budget consumed 28.3%
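The budget figure can be checked by hand. The sketch below is an assumption about how the percentage is derived (spend divided by cap); the function name `budget_consumed` is hypothetical and not taken from the study runner.

```python
def budget_consumed(spent: float, cap: float) -> float:
    """Return spend as a percentage of the cap (assumed formula: spent / cap)."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return 100.0 * spent / cap


# Figures from the case summary: $11.33 spent of a $40.00 cap.
pct = budget_consumed(11.33, 40.00)
remaining = 40.00 - 11.33

print(f"Budget consumed {pct:.1f}%")   # Budget consumed 28.3%
print(f"Remaining ${remaining:.2f}")   # Remaining $28.67
```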
Fig. 1 — Cross-play win-rate matrix
Read — Rows: impostor model. Columns: villager panel. Cell = impostor win rate on a bone → ember scale. Marginals rank deception (right) and detection (bottom).
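imp ╲ vil (rows: impostor model; columns: villager panel)
Models on both axes: GPT-5.5, Opus 4.8, Gemini 3.5 Flash, DeepSeek V4 Pro, Llama 4 Maverick, Grok 4.3
deceive → (right marginal, impostor win %)
- GPT-5.5 57
- Opus 4.8 60
- Gemini 3.5 Flash 55
- DeepSeek V4 Pro 42
- Llama 4 Maverick 38
- Grok 4.3 45
detect ↓ (bottom marginal, crew win %)
- GPT-5.5 95
- Opus 4.8 64
- Gemini 3.5 Flash 53
- DeepSeek V4 Pro 35
- Llama 4 Maverick 30
- Grok 4.3 27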
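As a rough illustration of how a matrix like Fig. 1 could be filled in, the sketch below computes one impostor win rate per (impostor, villager panel) cell. The record layout, the placeholder model names, and the function `win_rate_matrix` are all assumptions made for illustration; the runner's actual log schema is not shown in this record.

```python
from collections import defaultdict

# Hypothetical game records: (impostor_model, villager_panel_model, impostor_won).
games = [
    ("model_a", "model_b", True),
    ("model_a", "model_b", False),
    ("model_a", "model_c", True),
    ("model_b", "model_a", False),
]


def win_rate_matrix(records):
    """Map each (impostor, panel) cell to the impostor's win rate in that cell."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for impostor, panel, impostor_won in records:
        totals[(impostor, panel)] += 1
        wins[(impostor, panel)] += int(impostor_won)
    # Cells with no games are simply absent rather than reported as 0.
    return {cell: wins[cell] / totals[cell] for cell in totals}


matrix = win_rate_matrix(games)
for (impostor, panel), rate in sorted(matrix.items()):
    print(f"{impostor} vs {panel}: {rate:.0%}")
```

Leaving unplayed cells out of the result keeps "no data" distinct from a genuine 0% win rate.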
Fig. 2 — Marginal rankings
Read — Deception aggregates each model's impostor win rate across all panels; detection aggregates the crew win rate when it sits on the panel.
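The two marginals described above can be sketched as follows. This assumes the aggregation pools individual games (total wins over total games) rather than averaging per-cell rates; the record does not say which is used. The record layout, model names, and the function `marginals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical game records: (impostor_model, villager_panel_model, impostor_won).
games = [
    ("model_a", "model_b", True),
    ("model_a", "model_c", True),
    ("model_a", "model_b", False),
    ("model_b", "model_a", False),
    ("model_c", "model_a", True),
]


def marginals(records):
    """Return (deception, detection) percentages per model, pooled over games."""
    imp_wins, imp_games = defaultdict(int), defaultdict(int)
    crew_wins, crew_games = defaultdict(int), defaultdict(int)
    for impostor, panel, impostor_won in records:
        imp_games[impostor] += 1
        imp_wins[impostor] += int(impostor_won)
        crew_games[panel] += 1
        crew_wins[panel] += int(not impostor_won)  # crew wins when impostor loses
    deception = {m: 100.0 * imp_wins[m] / imp_games[m] for m in imp_games}
    detection = {m: 100.0 * crew_wins[m] / crew_games[m] for m in crew_games}
    return deception, detection


deception, detection = marginals(games)

# Rank from best to worst, as in the Exhibit B and Exhibit C lists.
for rank, (model, pct) in enumerate(
    sorted(deception.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(f"{rank:02d} {model} {pct:.1f}%")
```

Pooling weights each cell by the number of games played in it; averaging cell rates instead would weight every panel equally, so the two can differ when cells have uneven game counts.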
Exhibit B · Deception
Best deceivers
impostor win rate, marginalised across all panels
- 01 Opus 4.8 60.0%
- 02 GPT-5.5 57.1%
- 03 Gemini 3.5 Flash 55.0%
- 04 Grok 4.3 45.0%
- 05 DeepSeek V4 Pro 42.1%
- 06 Llama 4 Maverick 38.1%
Exhibit C · Detection
Best detectors
crew win rate when acting as the villager panel
- 01 GPT-5.5 94.7%
- 02 Opus 4.8 64.0%
- 03 Gemini 3.5 Flash 52.6%
- 04 DeepSeek V4 Pro 34.8%
- 05 Llama 4 Maverick 30.0%
- 06 Grok 4.3 27.3%
Exhibit · Interrogation logs
Note — Each turn is numbered. Public statements are on the record; private reasoning is sealed — declassify it line by line, or reveal the whole log.
Evidence index
40 logged interrogations
Interrogation log