[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-how-to-fine-tune-llms-with-sft-lora-and-rlhf-en":3,"article-related-how-to-fine-tune-llms-with-sft-lora-and-rlhf-en":31,"series-research-a7495002-c056-4f43-a567-2b844f4ba52d":85},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"a7495002-c056-4f43-a567-2b844f4ba52d","how-to-fine-tune-llms-with-sft-lora-and-rlhf-en","How to Fine-Tune LLMs with SFT, LoRA, and RLHF","\u003Cp data-speakable=\"summary\">This guide shows how to fine-tune an \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> with supervised data, LoRA, and alignment methods.\u003C\u002Fp>\u003Cp>If you are a developer starting with model adaptation, this guide walks you from supervised fine-tuning to parameter-efficient tuning and alignment. 
By the end, you will have a working training setup, a LoRA-based adapter workflow, and a clear path to RLHF or DPO for safer outputs.\u003C\u002Fp>\u003Ch2>Before you start\u003C\u002Fh2>\u003Cul>\u003Cli>A \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002F\">Hugging Face Hub\u003C\u002Fa> account and access to the \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex\">Transformers docs\u003C\u002Fa>.\u003C\u002Fli>\u003Cli>GitHub access to the \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft\">PEFT repo\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\">TRL repo\u003C\u002Fa>.\u003C\u002Fli>\u003Cli>Python 3.10+.\u003C\u002Fli>\u003Cli>PyTorch 2.1+.\u003C\u002Fli>\u003Cli>CUDA 12+ if you plan to train on NVIDIA GPUs.\u003C\u002Fli>\u003Cli>At least 16 GB GPU VRAM for small LoRA runs, or a cloud GPU instance with comparable memory.\u003C\u002Fli>\u003Cli>A prepared instruction dataset in JSONL or CSV format.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Step 1: Prepare your training dataset\u003C\u002Fh2>\u003Cp>Your first outcome is a clean supervised dataset that matches the behavior you want the model to learn. For instruction tuning, each record should pair a prompt with a target response. For preference tuning, each record should also carry a chosen and a rejected answer.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780121881469-ao1d.png\" alt=\"How to Fine-Tune LLMs with SFT, LoRA, and RLHF\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cpre>\u003Ccode>import json\n\n# Load one JSON object per line, skipping blank lines.\nwith open(\"train.jsonl\", encoding=\"utf-8\") as f:\n    rows = [json.loads(line) for line in f if line.strip()]\n\n# Inspect the first record to confirm the schema.\nprint(rows[0])\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>You should see a structured example with fields such as prompt, response, or preference labels. 
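If the first record looks malformed, fix the schema before training so the tokenizer and trainer do not fail later.\u003C\u002Fp>\u003Ch2>Step 2: Run a supervised fine-tuning baseline\u003C\u002Fh2>\u003Cp>Your second outcome is a baseline model that learns your task directly from labeled examples. Start with supervised fine-tuning, because it gives you a measurable reference before you add adapters or alignment layers. Full fine-tuning of an 8B-parameter model needs far more than 16 GB of VRAM, so on a smaller GPU run the baseline with a smaller base model or go straight to LoRA in Step 3.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780121875676-edpm.png\" alt=\"How to Fine-Tune LLMs with SFT, LoRA, and RLHF\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cpre>\u003Ccode>from datasets import Dataset\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments\nfrom transformers import DataCollatorForLanguageModeling\n\nmodel_name = \"meta-llama\u002FLlama-3.1-8B-Instruct\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n# Fall back to EOS if the tokenizer has no pad token, so batches can be padded.\ntokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token\nmodel = AutoModelForCausalLM.from_pretrained(model_name)\n\n# Trainer needs tokenized examples, not raw dicts.\n# Assumes each row has \"prompt\" and \"response\" keys; adjust to your schema.\ndef tokenize(row):\n    return tokenizer(row[\"prompt\"] + \" \" + row[\"response\"], truncation=True, max_length=1024)\n\ndataset = Dataset.from_list(rows)\ntrain_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)\n\nargs = TrainingArguments(\n    output_dir=\".\u002Fsft-output\",\n    per_device_train_batch_size=1,\n    num_train_epochs=1,\n)\n\n# The collator pads each batch and copies input_ids into labels for causal LM loss.\ncollator = DataCollatorForLanguageModeling(tokenizer, mlm=False)\ntrainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=collator)\ntrainer.train()\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>You should see training loss decrease over the first few steps and an output directory containing checkpoints. If loss stays flat or spikes, inspect your prompt formatting and make sure the labels are aligned with the target text.\u003C\u002Fp>\u003Ch2>Step 3: Add LoRA adapters for efficient tuning\u003C\u002Fh2>\u003Cp>Your third outcome is a smaller, cheaper training run that updates only adapter weights instead of the full model. 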
LoRA is the practical choice when you want to iterate quickly or train on limited \u003Ca href=\"\u002Ftag\u002Fgpu\">GPU\u003C\u002Fa> memory.\u003C\u002Fp>\u003Cpre>\u003Ccode>from peft import LoraConfig, get_peft_model, TaskType\n\nlora_config = LoraConfig(\n    task_type=TaskType.CAUSAL_LM,\n    r=8,                # rank of the low-rank update matrices\n    lora_alpha=16,      # scaling factor applied to the update\n    lora_dropout=0.05,  # dropout on the adapter inputs\n)\n# Freeze the base weights and attach trainable adapters.\nmodel = get_peft_model(model, lora_config)\nmodel.print_trainable_parameters()\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>You should see a much smaller trainable parameter count than the base model. If the trainable count is still too high, confirm that PEFT targeted the right modules and that you did not accidentally unfreeze the full network.\u003C\u002Fp>\u003Ch2>Step 4: Align outputs with preference training\u003C\u002Fh2>\u003Cp>Your fourth outcome is a model that follows human preferences more closely instead of only imitating labels. This is where RLHF-style workflows and DPO come in, with DPO often being the simpler path for teams that already have preference pairs.\u003C\u002Fp>\u003Cpre>\u003Ccode>from datasets import Dataset\nfrom trl import DPOConfig, DPOTrainer\n\n# rows must hold preference records here, with prompt, chosen, and rejected fields.\npref_dataset = Dataset.from_list(rows)\n\ndpo_config = DPOConfig(output_dir=\".\u002Fdpo-output\")\ntrainer = DPOTrainer(\n    model=model,\n    ref_model=None,  # with a PEFT model, TRL uses the base weights with adapters disabled as the reference\n    args=dpo_config,\n    train_dataset=pref_dataset,\n    processing_class=tokenizer,  # this argument is named tokenizer in older TRL releases\n)\ntrainer.train()\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>You should see preference optimization logs and a saved aligned checkpoint. If outputs on held-out prompts shift toward the style of your chosen answers, the preference objective is taking effect.\u003C\u002Fp>\u003Ch2>Step 5: Evaluate and package the tuned model\u003C\u002Fh2>\u003Cp>Your fifth outcome is a model artifact you can ship, compare, and reproduce. 
Run a small evaluation set that checks instruction following, refusal behavior, and domain accuracy, then save the adapter or merged weights for deployment.\u003C\u002Fp>\u003Cpre>\u003Ccode>model.save_pretrained(\".\u002Ffinal-adapter\")\ntokenizer.save_pretrained(\".\u002Ffinal-adapter\")\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>You should see a model folder with adapter weights or merged weights plus tokenizer files. If the saved directory is incomplete, verify that both model and tokenizer were written and that your deployment runtime can load the same base model version.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Metric\u003C\u002Fth>\u003Cth>Before\u002FBaseline\u003C\u002Fth>\u003Cth>After\u002FResult\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Trainable parameters\u003C\u002Ftd>\u003Ctd>Full model\u003C\u002Ftd>\u003Ctd>LoRA adapters only\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>GPU memory use\u003C\u002Ftd>\u003Ctd>Higher with full fine-tuning\u003C\u002Ftd>\u003Ctd>Lower with parameter-efficient tuning\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Output quality\u003C\u002Ftd>\u003Ctd>Base model behavior\u003C\u002Ftd>\u003Ctd>Task-specific and preference-aligned behavior\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Common mistakes\u003C\u002Fh2>\u003Cul>\u003Cli>Using raw chat logs without a schema. Fix: convert examples into consistent prompt-response or chosen-rejected pairs before training.\u003C\u002Fli>\u003Cli>Fine-tuning the full model when LoRA is enough. Fix: start with adapters first, then move to full training only if the task truly needs it.\u003C\u002Fli>\u003Cli>Skipping evaluation after alignment. 
Fix: compare base, SFT, and DPO outputs on the same test prompts so you can catch regressions early.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What's next\u003C\u002Fh2>\u003Cp>Once this workflow is stable, move on to multi-turn chat formatting, dataset curation for your domain, and multimodal fine-tuning if your use case includes images or other non-text inputs.\u003C\u002Fp>","Learn how to fine-tune a large language model with supervised training, LoRA, and alignment methods like RLHF and DPO.","amazingelearning.com","https:\u002F\u002Famazingelearning.com\u002Fllm-fine-tuning-course-from-supervised-ft-to-rlhf-lora-and-multimodal\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780121881469-ao1d.png","research","en","71ea7637-0a33-4242-9533-622973e1a7de",[17,18,19,20,21,22],"LLM fine-tuning","Supervised fine-tuning","LoRA","RLHF","DPO","Hugging Face",[24,25,26],"Start with supervised fine-tuning to create a reliable baseline.","Use LoRA to reduce GPU memory and training cost.","Add preference optimization such as DPO or RLHF to improve helpfulness and safety.",2,"2026-05-30T06:17:24.967007+00:00","2026-05-30T06:17:24.96+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":32,"relatedLang":44,"relatedPosts":48},[33,35,37,39,42],{"name":19,"slug":34},"lora",{"name":20,"slug":36},"rlhf",{"name":17,"slug":38},"llm-fine-tuning",{"name":40,"slug":41},"supervised fine-tuning","supervised-fine-tuning",{"name":21,"slug":43},"dpo",{"id":15,"slug":45,"title":46,"language":47},"how-to-fine-tune-llms-with-sft-lora-and-rlhf-zh","怎麼做 LLM 微調","zh",[49,55,61,67,73,79],{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":13},"850449f2-e75b-4dbf-97c0-3590c6cbf097","crdts-keep-replicas-in-sync-without-locks-en","CRDTs keep replicas in sync without 
locks","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781011086602-cokl.png","2026-06-09T13:17:35.890527+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":13},"7c6b6428-ba8d-4c59-840b-cf96a95139e5","post-deterministic-systems-autonomous-infra-en","Post-Deterministic Systems for Autonomous Infra","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781010190497-1grq.png","2026-06-09T13:02:33.235795+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":13},"53ec2203-e127-4bf8-8b3d-2dce8d156a54","causal-learnability-formal-language-tasks-en","Causal methods for measuring task learnability","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780987698514-ky8m.png","2026-06-09T06:47:35.103221+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":13},"55e7197e-f114-4b6c-b3e2-af1a3cd9dfa4","rl-training-hands-off-control-gradually-en","RL Training That Hands Off Control Gradually","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780986801034-gf8m.png","2026-06-09T06:32:33.516452+00:00",{"id":74,"slug":75,"title":76,"cover_image":77,"image_url":77,"created_at":78,"category":13},"93fc6735-b524-4baf-989f-645c4c47d593","omnigamearena-vlm-game-agent-benchmark-en","OmniGameArena benchmarks VLM game agents 
better","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780985895695-ugcj.png","2026-06-09T06:17:32.668876+00:00",{"id":80,"slug":81,"title":82,"cover_image":83,"image_url":83,"created_at":84,"category":13},"9f0c9505-6d75-411c-ba46-2382e8f295a5","turboquant-cuts-kv-cache-memory-6x-google-tests-en","TurboQuant cuts KV cache memory 6x in Google tests","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780906679116-fqdo.png","2026-06-08T08:17:22.276769+00:00",[86,91,96,101,106,111,116,121,126,131],{"id":87,"slug":88,"title":89,"created_at":90},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":92,"slug":93,"title":94,"created_at":95},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving 
Styles","2026-03-28T14:54:26.148181+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":132,"slug":133,"title":134,"created_at":135},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]