Tag
LLM safety
2 articles

Research/May 12
Policy Invariance as a Better LLM Judge Test
This paper argues that accuracy alone is not enough to trust LLM safety judges and proposes policy invariance as a reliability test.

Research/Apr 20
ASMR-Bench Tests Sabotage Detection in ML Code
ASMR-Bench probes whether auditors can spot subtle sabotage in ML research codebases, and the answer so far is: not reliably.