Research

AI research papers, breakthroughs, and technical deep dives. From academic publications to lab findings shaping the future of AI.

TurboQuant and the SEO Shift for Small Sites
May 15

TurboQuant is a rumored Google search system that could widen the pool of pages ranked, giving smaller sites a better shot.

TurboQuant vs FP8: vLLM’s first broad test
May 15

vLLM found FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.

LLMbda calculus gives agents safety rules
May 15

A formal calculus for AI agents models conversations and enforces information-flow rules for safer LLM-based programming.

A simpler beamspace denoiser for mmWave MIMO
May 15

A beamspace denoiser for mmWave MIMO that avoids heavy math, models low-resolution ADC noise, and targets FPGA-friendly deployment.

Why AI benchmark wins in cyber should scare defenders
May 15

AI cyber benchmarks now show autonomous capability is advancing faster than defenders are planning for.

Why Linux security needs a patch-wave mindset
May 14

Linux security is entering a patch-wave era, and teams must treat rapid remediation as the default.

Judge Reliability Harness Stress-Tests LLM Judges
May 14

A stress-test harness probes how LLM judges' verdicts shift under formatting changes, paraphrasing, verbosity, and flipped labels.

Taming Black-Box LLM Inference Scheduling
May 14

A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing friction at scale.

AISafetyBenchExplorer maps AI safety benchmarks
May 14

A catalog of 195 AI safety benchmarks shows how fragmented measurement and weak governance make safety evaluation hard to compare.

Ollama flaw can leak process memory remotely
May 14

A critical Ollama bug can leak process memory remotely, exposing keys, prompts, and user data across exposed servers.

Why coding benchmarks are finally telling the truth
May 13

BenchLM’s coding leaderboard says LiveCodeBench and SWE-bench Pro are the only signals that still matter.

Pion keeps LLM weights’ spectrum fixed
May 13

Pion is a new LLM optimizer that updates weights with orthogonal transforms, preserving singular values instead of adding gradients directly.

LongMemEval-V2 tests agent memory in web workflows
May 13

A new benchmark checks whether agent memory can retain web-environment experience, not just user history, and improve long-term task recall.

AlphaGRPO teaches multimodal models to self-correct
May 13

AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.

Why LLM agents are becoming real vulnerability hunters
May 12

LLM agents are now useful for finding real software vulnerabilities, not just writing code.

Why GPT-5.5 Should Be Your Default Coding LLM in 2026
May 12

GPT-5.5 should be the default coding LLM in 2026 because it leads the benchmark stack and sets the performance bar.

How Memory Shapes Autonomous LLM Agents
May 12

A survey of how memory is built, measured, and used in autonomous LLM agents, with a focus on design choices and open problems.

Policy Invariance as a Better LLM Judge Test
May 12

This paper argues that accuracy alone is not enough to trust LLM safety judges, and proposes policy invariance as a reliability test.

SAGA makes AI agent GPU scheduling workflow-aware
May 12

SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.

PARNESS automates scientific research workflows
May 12

PARNESS is a paper harness for automated scientific research, with dynamic workflows, full-text indexing, and cross-run knowledge accumulation.

VibeServe asks if AI agents can build LLM serving
May 12

VibeServe explores whether AI agents can assemble bespoke LLM serving systems, though it does not yet report benchmark results.

Why Agentic RAG Is Better Than Static RAG for Real Work
May 12

Agentic RAG beats static RAG for complex, multi-step questions, but it costs more and needs tighter controls.

Conformal Path Reasoning for safer KGQA
May 11

CPR adds path-level conformal calibration to KGQA, aiming for tighter answer sets with coverage guarantees.

Normalizing Trajectory Models for 4-Step Generation
May 11

NTM turns few-step generation into an exact-likelihood flow model and hits strong text-to-image results in four steps.

AutoTTS lets LLMs discover test-time scaling
May 11

AutoTTS turns test-time scaling into an environment search problem, letting LLMs discover cheaper reasoning strategies automatically.

Microsoft’s GoalCover finds fine-tuning gaps
May 11

Microsoft Research’s GoalCover spots missing capabilities in fine-tuning data before training, improving Qwen-3-14B reward scores.

BAMI tackles GUI grounding bias without retraining
May 8

BAMI is a training-free method that improves GUI grounding by reducing precision and ambiguity bias in complex interfaces.

UniPool shares MoE experts across layers
May 8

UniPool replaces per-layer MoE experts with one shared pool, cutting redundancy and improving validation loss in five LLaMA-scale models.

ActCam adds joint camera and motion control
May 8

ActCam is a zero-shot way to steer both actor motion and camera path in video generation without training a new model.

Why Solana Developer Hiring Should Stop Treating Skills as Static
May 8

Solana developer hiring should treat skills as a moving target, not a fixed checklist.