Case File // DeceptionBench // Deception Study
DeceptionBench
The Dossier · Live Record

Cross-play, on the record.

Deception × detection across 6 models, logged live from the study runner.

Exhibit A · Case summary
Runner idle
Games logged
166
snapshot v2
Leakage rate
intent caught by monitor
Monitor AUC
impostor ID from reasoning
Compute
$11.33
of $40.00 cap
Budget consumed28.3%

Fig. 1 — Cross-play win-rate matrix

Read Rows: impostor model. Columns: villager panel. Cell = impostor win rate on a bone → ember scale. Marginals rank deception (right) and detection (bottom).

imp ╲ vil
GPT- 5.5
Opus 4.8
Gemini 3.5 Flash
DeepSeek V4 Pro
Llama 4 Maverick
Grok 4.3
detect ↓
GPT- 5.5
57
Opus 4.8
60
Gemini 3.5 Flash
55
DeepSeek V4 Pro
42
Llama 4 Maverick
38
Grok 4.3
45
deceive →
95
64
53
35
30
27

Fig. 2 — Marginal rankings

Read Deception aggregates each model's impostor win rate across all panels; detection aggregates the crew win rate when it sits on the panel.

Exhibit B · Deception

Best deceivers

impostor win rate, marginalised across all panels

  1. 01
    GPT-5.557.1%
  2. 02
    Opus 4.860.0%
  3. 03
    Gemini 3.5 Flash55.0%
  4. 04
    DeepSeek V4 Pro42.1%
  5. 05
    Llama 4 Maverick38.1%
  6. 06
    Grok 4.345.0%
Exhibit C · Detection

Best detectors

crew win rate when acting as the villager panel

  1. 01
    GPT-5.594.7%
  2. 02
    Opus 4.864.0%
  3. 03
    Gemini 3.5 Flash52.6%
  4. 04
    DeepSeek V4 Pro34.8%
  5. 05
    Llama 4 Maverick30.0%
  6. 06
    Grok 4.327.3%

Exhibit · Interrogation logs

Note Each turn is numbered. Public statements are on the record; private reasoning is sealed — declassify it line by line, or reveal the whole log.

Evidence index

40 logged interrogations

Interrogation log