Tag
LLM evaluation
LLM evaluation examines whether models reason, judge, and stay consistent, not just whether they produce a plausible answer. It spans long-horizon reasoning benchmarks like LongCoT, LLM-based ASR quality assessment, and agreement with human labels on tasks where accuracy alone misses real failure modes.
4 articles

DeepTest 2026 benchmarks an LLM car manual assistant
DeepTest’s first LLM testing competition compared four tools on car manual retrieval, showing how to benchmark automotive assistants.

Why Databricks RAG Is a Platform Play, Not a Feature
Databricks treats RAG as an end-to-end platform problem, and that is the right way to build it.

LLMs for ASR Evaluation: Beyond WER
This paper tests decoder-based LLMs as ASR evaluators and finds they track human judgments more closely than WER, reaching 92–94% agreement on one task.
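
For context on what WER measures and why it can disagree with human judgment, here is a minimal sketch (not taken from the paper): WER is word-level edit distance, so it charges a harmless paraphrase the same as a meaning-changing error, which is the gap an LLM-based evaluator is meant to close.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(round(wer("turn left at the next junction",
                "turn left at the next intersection"), 3))
# 0.167 -- one substitution, even though the meaning is preserved
```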

LongCoT Benchmark: 2,500-Problem Long-Horizon Reasoning
LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.
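
To make "long, interdependent reasoning chains" concrete, here is a minimal harness sketch; the data format, step chaining, and scoring rule below are assumptions for illustration, not the LongCoT spec.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChainedProblem:
    steps: list[str]       # prompt templates with a {prev} slot for the prior answer
    expected_final: str    # only the final answer is checked

def solve_chain(problem: ChainedProblem, model: Callable[[str], str]) -> bool:
    """Feed each step's prompt, with the previous answer spliced in, to the model.
    One early slip propagates forward and the whole chain is marked wrong."""
    prev = ""
    for step in problem.steps:
        prev = model(step.format(prev=prev)).strip()
    return prev == problem.expected_final

def accuracy(problems: list[ChainedProblem], model: Callable[[str], str]) -> float:
    return sum(solve_chain(p, model) for p in problems) / len(problems)

if __name__ == "__main__":
    # Toy demo with a stand-in "model" that hard-codes the right intermediate answers.
    answers = iter(["51", "60", "30"])
    demo = ChainedProblem(
        steps=["Compute 17 * 3.", "Add 9 to {prev}.", "Halve {prev}."],
        expected_final="30",
    )
    print(solve_chain(demo, lambda prompt: next(answers)))  # True
```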