Tag
benchmark
Benchmarking is how teams check whether models, agents, and compilers hold up under fixed tasks and real constraints. It covers long-horizon reasoning, data-viz workflows, code safety, and performance, while also exposing how much a score can be distorted by the test itself.
4 articles

LongMemEval-V2 tests agent memory in web workflows
A new benchmark checks whether agent memory can retain web-environment experience, not just user history, and improve long-term task recall.

When LLMs Stop Following Procedural Steps
A diagnostic benchmark shows LLMs lose procedural fidelity as step counts grow, even when the arithmetic stays simple.

ASMR-Bench Tests Sabotage Detection in ML Code
ASMR-Bench probes whether auditors can spot subtle sabotage in ML research codebases, and the answer so far is: not reliably.

LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning
LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.