AWS details RFT with LLM-as-a-judge for Nova
AWS outlines reinforcement fine-tuning with LLM-as-a-judge, plus a legal contract review case study using Amazon Nova and SageMaker AI.

AWS explains how reinforcement fine-tuning can use an LLM judge to score model outputs and improve alignment.
On 30 Apr 2026, AWS published a guide to reinforcement fine-tuning (RFT) with LLM-as-a-judge for Amazon Nova models on Amazon SageMaker AI. The post says the method can outperform both base models and supervised fine-tuning in a legal contract review case study, where a GPT OSS 120B judge helped train a model to flag risks and produce assessments and recommended actions from contract text.
| Item | Value |
|---|---|
| Publish date | 30 Apr 2026 |
| Judge model in case study | GPT OSS 120B |
| Production timeout recommendation | 15 minutes |
| Provisioned concurrency guidance | ~100 |
What changed
AWS frames LLM-as-a-judge as a more flexible reward signal than simple rule-based scoring. Instead of checking only for substring matches or fixed labels, the judge can score outputs on correctness, tone, safety, relevance, and domain nuance.
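To make the contrast concrete, a rule-based reward might only check for a fixed phrase, while a judge-based reward asks an LLM to rate several dimensions and aggregates them; the toy Python snippet below is an illustration of that difference, not code from the AWS post.

```python
def rule_based_reward(response: str) -> float:
    """Brittle substring check: rewards one phrase regardless of tone or correctness."""
    return 1.0 if "breach of contract" in response.lower() else 0.0

# A judge-based reward instead prompts an LLM to rate each dimension and
# combines the ratings into a single training signal.
JUDGE_DIMENSIONS = ("correctness", "tone", "safety", "relevance", "domain nuance")

def aggregate_judge_scores(scores: dict[str, float]) -> float:
    """Average the judge's per-dimension scores (0-1) into one reward."""
    return sum(scores[d] for d in JUDGE_DIMENSIONS) / len(JUDGE_DIMENSIONS)
```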

The post breaks the workflow into six steps: choose a judge type, define criteria, pick and configure the judge model, refine the prompt, align reward metrics with production evaluation, and build a reward Lambda that can handle scale and failures.
- Rubric-based judging scores one response against predefined criteria.
- Preference-based judging compares two responses and picks the better one.
- Boolean pass/fail scoring is recommended for rubric judges.
- Reward functions should mix LLM judgments with deterministic checks for format, length, language, and safety.
- Lambda guidance includes exponential backoff, parallel calls, neutral rewards on error, and a 15-minute timeout; a sketch of such a function follows this list.
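A minimal sketch of a reward Lambda along those lines, written against the Bedrock Converse API; the judge model ID, event shape, scoring weights, and length threshold are assumptions for illustration, not AWS's reference implementation.

```python
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

JUDGE_MODEL_ID = "openai.gpt-oss-120b-1:0"  # assumed judge model ID on Bedrock
bedrock = boto3.client("bedrock-runtime")

def call_judge(prompt: str, response: str, retries: int = 4) -> float:
    """Ask the judge for a boolean pass/fail verdict, retrying with exponential backoff."""
    judge_prompt = (
        "Grade the response against the rubric. Reply with JSON only: "
        '{"pass": true} or {"pass": false}.\n\n'
        f"Prompt:\n{prompt}\n\nResponse:\n{response}"
    )
    for attempt in range(retries):
        try:
            result = bedrock.converse(
                modelId=JUDGE_MODEL_ID,
                messages=[{"role": "user", "content": [{"text": judge_prompt}]}],
            )
            text = result["output"]["message"]["content"][0]["text"]
            verdict = json.loads(text)
            return 1.0 if verdict.get("pass") else 0.0
        except Exception:
            time.sleep(2 ** attempt + random.random())  # exponential backoff with jitter
    return 0.5  # neutral reward when the judge cannot be reached or parsed

def deterministic_checks(response: str) -> float:
    """Cheap rule-based checks on format and length, blended with the judge score."""
    if not response.strip():
        return 0.0
    return 1.0 if len(response) <= 4000 else 0.5

def lambda_handler(event, context):
    # Assumed event shape: {"samples": [{"prompt": ..., "response": ...}, ...]}
    samples = event.get("samples", [])
    with ThreadPoolExecutor(max_workers=8) as pool:  # judge calls run in parallel
        judge_scores = list(
            pool.map(lambda s: call_judge(s["prompt"], s["response"]), samples)
        )
    rewards = [
        0.7 * judge + 0.3 * deterministic_checks(s["response"])  # illustrative weights
        for judge, s in zip(judge_scores, samples)
    ]
    return {"rewards": rewards}
```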
For model choice, AWS says larger judges fit complex reasoning and multi-dimensional scoring, while smaller models can work for common tasks such as math, coding, or general chat if prompts are tight enough. The post also stresses structured outputs, clear scoring rules, and edge-case handling so reward signals stay parseable and stable.
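Structured outputs and edge-case handling might look like the sketch below; the rubric wording for contract review and the fallback behavior are assumptions rather than details from the post.

```python
import json
import re

# Illustrative rubric for the contract-review case; real criteria would mirror
# the team's internal guidance and production evaluation checks.
RUBRIC = """You are grading a contract-review response.
Pass only if all of the following hold:
1. Every flagged risk cites the clause it refers to.
2. Each risk includes an assessment and a recommended action.
3. Nothing is asserted beyond the provided guidance and contract text.
Reply with exactly one JSON object: {"pass": true|false, "reason": "<one sentence>"}"""

def parse_verdict(judge_text: str) -> tuple[float, str]:
    """Pull a pass/fail verdict out of judge output, tolerating surrounding prose."""
    match = re.search(r"\{.*\}", judge_text, re.DOTALL)
    if match is None:
        return 0.5, "unparseable verdict"  # neutral reward keeps training stable
    try:
        verdict = json.loads(match.group(0))
    except json.JSONDecodeError:
        return 0.5, "malformed JSON"
    return (1.0 if verdict.get("pass") else 0.0), verdict.get("reason", "")
```

The reason string is also what makes reward debugging tractable: failed samples arrive with an explanation attached rather than a bare score.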
Why it matters
For developers, the appeal is faster alignment without hand-labeling every sample. An LLM judge can surface why a response failed, which helps teams debug reward logic and spot hidden misalignment before deployment.

The legal contract review example shows the practical angle: a small labeled dataset was enough to train a system that evaluates contract clauses against internal guidance, prior contracts, and local law. That matters for teams building domain tools where quality depends on nuanced judgment, not just exact text matches.
AWS also ties reward design to production metrics, arguing that training signals should mirror the same accuracy, safety, and compliance checks used after launch. That reduces the risk of optimizing for the wrong target.
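One way to follow that advice is to share the same check functions between the training reward and the post-launch evaluation job; the module below is a hypothetical sketch of that pattern, with made-up section names and thresholds.

```python
# shared_checks.py -- hypothetical module imported by both the reward Lambda and
# the production evaluation job, so the model is optimized against the same
# format and compliance targets it is measured on after launch.

REQUIRED_SECTIONS = ("Risk", "Assessment", "Recommended action")

def has_required_sections(response: str) -> bool:
    """Format check: a contract-review answer must carry all three sections."""
    return all(section in response for section in REQUIRED_SECTIONS)

def within_length_budget(response: str, max_chars: int = 4000) -> bool:
    """Length check applied identically during training and in production."""
    return 0 < len(response) <= max_chars

def evaluate(response: str) -> dict[str, bool]:
    """Named metrics for the launch dashboard; the reward blends the same booleans."""
    return {
        "format": has_required_sections(response),
        "length": within_length_budget(response),
    }
```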
The key question now is not whether RFT works, but which tasks are better served by an LLM judge than by cheaper rules or human review.
Related Articles
- [MODEL] MiniMax-M1 brings 1M-token open reasoning model
- [MODEL] Gemini Omni Video Review: Text Rendering Beats Rivals
- [MODEL] Why Xiaomi’s MiMo-V2.5-Pro Changes Coding Agents More Than Chatbots
- [MODEL] OpenAI’s Realtime Audio Models Target Live Voice
- [MODEL] Anthropic Releases 10 Financial AI Agents
- [MODEL] Why Claude’s “Infinite” Context Window Still Won’t Make AI Autonomous