Research
AI research papers, breakthroughs, and technical deep dives. From academic publications to lab findings shaping the future of AI.

TurboQuant and the SEO Shift for Small Sites
TurboQuant is a rumored Google search system that could widen the pool of pages ranked, giving smaller sites a better shot.

TurboQuant vs FP8: vLLM’s first broad test
vLLM found FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.

LLMbda calculus gives agents safety rules
A formal calculus for AI agents models conversations and enforces information-flow rules for safer LLM-based programming.

A simpler beamspace denoiser for mmWave MIMO
A beamspace denoiser for mmWave MIMO that avoids heavy math, models low-resolution ADC noise, and targets FPGA-friendly deployment.

Why AI benchmark wins in cyber should scare defenders
AI cyber benchmarks show autonomous offensive capability advancing faster than defenders are planning for.

Why Linux security needs a patch-wave mindset
Linux security is entering a patch-wave era, and teams must treat rapid remediation as the default.

Judge Reliability Harness Stress-Tests LLM Judges
A harness probes how LLM judges change under formatting, paraphrasing, verbosity, and flipped labels.

Taming Black-Box LLM Inference Scheduling
A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing friction at scale.

AISafetyBenchExplorer maps AI safety benchmarks
A catalog of 195 AI safety benchmarks shows how fragmented measurement and weak governance make safety evaluation hard to compare.

Ollama flaw can leak process memory remotely
A critical Ollama bug can leak process memory remotely, exposing keys, prompts, and user data across exposed servers.

Why coding benchmarks are finally telling the truth
BenchLM’s coding leaderboard says LiveCodeBench and SWE-bench Pro are the only signals that still matter.

Pion keeps LLM weights’ spectrum fixed
Pion is a new LLM optimizer that updates weights with orthogonal transforms, preserving singular values instead of adding gradients directly.

LongMemEval-V2 tests agent memory in web workflows
A new benchmark checks whether agent memory can retain web-environment experience, not just user history, and improve long-term task recall.

AlphaGRPO teaches multimodal models to self-correct
AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.

Why LLM agents are becoming real vulnerability hunters
LLM agents are now useful for finding real software vulnerabilities, not just writing code.

Why GPT-5.5 Should Be Your Default Coding LLM in 2026
GPT-5.5 should be the default coding LLM in 2026 because it leads the benchmark stack and sets the performance bar.

How Memory Shapes Autonomous LLM Agents
A survey of how memory is built, measured, and used in autonomous LLM agents, with a focus on design choices and open problems.

Policy Invariance as a Better LLM Judge Test
This paper argues that accuracy alone is not enough to trust LLM safety judges, and proposes policy invariance as a reliability test.

SAGA makes AI agent GPU scheduling workflow-aware
SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.

PARNESS automates scientific research workflows
PARNESS is a paper harness for automated scientific research, with dynamic workflows, full-text indexing, and cross-run knowledge accumulation.

VibeServe asks if AI agents can build LLM serving
VibeServe explores whether AI agents can assemble bespoke LLM serving systems, though benchmark results have not yet been reported.

Why Agentic RAG Is Better Than Static RAG for Real Work
Agentic RAG beats static RAG for complex, multi-step questions, but it costs more and needs tighter controls.

Conformal Path Reasoning for safer KGQA
CPR adds path-level conformal calibration to KGQA, aiming for tighter answer sets with coverage guarantees.

Normalizing Trajectory Models for 4-Step Generation
NTM turns few-step generation into an exact-likelihood flow model and hits strong text-to-image results in four steps.

AutoTTS lets LLMs discover test-time scaling
AutoTTS turns test-time scaling into an environment search problem, letting LLMs discover cheaper reasoning strategies automatically.

Microsoft’s GoalCover finds fine-tuning gaps
Microsoft Research’s GoalCover spots missing capabilities in fine-tuning data before training and improves Qwen-3-14B reward scores.

BAMI tackles GUI grounding bias without retraining
BAMI is a training-free method that improves GUI grounding by reducing precision and ambiguity bias in complex interfaces.

UniPool shares MoE experts across layers
UniPool replaces per-layer MoE experts with one shared pool, cutting redundancy and lowering validation loss across five LLaMA-scale models.

ActCam adds joint camera and motion control
ActCam is a zero-shot way to steer both actor motion and camera path in video generation without training a new model.

Why Solana Developer Hiring Should Stop Treating Skills as Static
Solana developer hiring should treat skills as a moving target, not a fixed checklist.