Tag: evaluation
2 articles

Research/May 12
Policy Invariance as a Better LLM Judge Test
This paper argues that accuracy alone is not enough to trust LLM safety judges, and proposes policy invariance as a reliability test.

Research/Apr 6
BAS scores LLM confidence for abstain decisions
BAS evaluates whether LLM confidence helps decide when to answer or abstain, exposing overconfident errors that standard metrics can miss.