Tag
benchmark
Benchmarking is how teams check whether models, agents, and compilers hold up under fixed tasks and real constraints. It covers long-horizon reasoning, data-viz workflows, code safety, and performance, while also exposing how much a score can be distorted by the test itself.
4 articles

LongMemEval-V2 tests agent memory in web workflows
A new benchmark checks whether agent memory can retain web-environment experience, not just user history, and improve long-term task recall.

When LLMs Stop Following Procedural Steps
A diagnostic benchmark shows LLMs lose procedural fidelity as step counts grow, even when the arithmetic stays simple.

ASMR-Bench Tests Sabotage Detection in ML Code
ASMR-Bench probes whether auditors can spot subtle sabotage in ML research codebases, and the answer so far is: not reliably.

LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning
LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.