Tag
1 article
A harness probes how LLM judges' verdicts change under formatting changes, paraphrasing, verbosity, and flipped labels.