Tag: transitivity violations

1 article

How to Trust LLM Judges, Per Input

A diagnostic toolkit shows LLM judges can look stable on average while still being unreliable on individual inputs.