Research
AI research papers, breakthroughs, and technical deep dives. From academic publications to lab findings shaping the future of AI.

TurboQuant and the SEO Shift for Small Sites
TurboQuant is a rumored Google search system that could widen the pool of pages ranked, giving smaller sites a better shot.

TurboQuant vs FP8: vLLM’s first broad test
vLLM found FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.

LLMbda calculus gives agents safety rules
A formal calculus for AI agents models conversations and enforces information-flow rules for safer LLM-based programming.

A simpler beamspace denoiser for mmWave MIMO
A beamspace denoiser for mmWave MIMO that avoids heavy math, models low-resolution ADC noise, and targets FPGA-friendly deployment.

Why AI benchmark wins in cyber should scare defenders
AI cyber benchmarks show autonomous offensive capability advancing faster than defenders are planning for.

Why Linux security needs a patch-wave mindset
Linux security is entering a patch-wave era, and teams must treat rapid remediation as the default.

Judge Reliability Harness Stress-Tests LLM Judges
A harness probes how LLM judges change under formatting, paraphrasing, verbosity, and flipped labels.

Taming Black-Box LLM Inference Scheduling
A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing friction at scale.

AISafetyBenchExplorer maps AI safety benchmarks
A catalog of 195 AI safety benchmarks shows how fragmented measurement and weak governance make safety evaluation hard to compare.

Ollama flaw can leak process memory remotely
A critical Ollama bug can leak process memory remotely, exposing keys, prompts, and user data across exposed servers.

Why coding benchmarks are finally telling the truth
BenchLM’s coding leaderboard says LiveCodeBench and SWE-bench Pro are the only signals that still matter.

Pion keeps LLM weights’ spectrum fixed
Pion is a new LLM optimizer that updates weights with orthogonal transforms, preserving singular values instead of adding gradients directly.

LongMemEval-V2 tests agent memory in web workflows
A new benchmark checks whether agent memory can retain web-environment experience, not just user history, and improve long-term task recall.

AlphaGRPO teaches multimodal models to self-correct
AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.

Why LLM agents are becoming real vulnerability hunters
LLM agents are now useful for finding real software vulnerabilities, not just writing code.

Why GPT-5.5 Should Be Your Default Coding LLM in 2026
GPT-5.5 should be the default coding LLM in 2026 because it leads the benchmark stack and sets the performance bar.

How Memory Shapes Autonomous LLM Agents
A survey of how memory is built, measured, and used in autonomous LLM agents, with a focus on design choices and open problems.

Policy Invariance as a Better LLM Judge Test
This paper argues that accuracy alone is not enough to trust LLM safety judges, and proposes policy invariance as a reliability test.

SAGA makes AI agent GPU scheduling workflow-aware
SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.

PARNESS automates scientific research workflows
PARNESS is a paper harness for automated scientific research, with dynamic workflows, full-text indexing, and cross-run knowledge accumulation.

VibeServe asks if AI agents can build LLM serving
VibeServe explores whether AI agents can assemble bespoke LLM serving systems, though benchmark results have not yet been reported.

Why Agentic RAG Is Better Than Static RAG for Real Work
Agentic RAG beats static RAG for complex, multi-step questions, but it costs more and needs tighter controls.

Conformal Path Reasoning for safer KGQA
CPR adds path-level conformal calibration to KGQA, aiming for tighter answer sets with coverage guarantees.

Normalizing Trajectory Models for 4-Step Generation
NTM turns few-step generation into an exact-likelihood flow model and hits strong text-to-image results in four steps.

AutoTTS lets LLMs discover test-time scaling
AutoTTS turns test-time scaling into an environment search problem, letting LLMs discover cheaper reasoning strategies automatically.

Microsoft’s GoalCover finds fine-tuning gaps
Microsoft Research’s GoalCover spots missing capabilities in fine-tuning data before training and improves Qwen-3-14B reward scores.

BAMI tackles GUI grounding bias without retraining
BAMI is a training-free method that improves GUI grounding by reducing precision and ambiguity bias in complex interfaces.

UniPool shares MoE experts across layers
UniPool replaces per-layer MoE experts with one shared pool, cutting redundancy and lowering validation loss across five LLaMA-scale models.

ActCam adds joint camera and motion control
ActCam is a zero-shot way to steer both actor motion and camera path in video generation without training a new model.

Why Solana Developer Hiring Should Stop Treating Skills as Static
Solana developer hiring should treat skills as a moving target, not a fixed checklist.