[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-how-to-fine-tune-llms-with-sft-lora-and-rlhf-zh":3,"article-related-how-to-fine-tune-llms-with-sft-lora-and-rlhf-zh":31,"series-research-71ea7637-0a33-4242-9533-622973e1a7de":84},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"71ea7637-0a33-4242-9533-622973e1a7de","how-to-fine-tune-llms-with-sft-lora-and-rlhf-zh","怎麼做 LLM 微調","\u003Cp data-speakable=\"summary\">這篇教你用 SFT、LoRA 和偏好對齊方法\u003Ca href=\"\u002Fnews\u002Fwhy-fine-tuning-still-beats-prompt-only-ai-zh\">微調\u003C\u002Fa>大型語言模型。\u003C\u002Fp>\u003Cp>這篇給剛開始做模型適配的\u003Ca href=\"\u002Fnews\u002F5-docker-desktop-features-for-developers-zh\">開發\u003C\u002Fa>者看。照著做完，你會得到一份可訓練的資料集、可重現的 SFT 基線、LoRA 適配器流程，以及一條通往 RLHF 或 DPO 的實作路線。\u003C\u002Fp>\u003Ch2>開始之前\u003C\u002Fh2>\u003Cul>\u003Cli>Hugging Face 帳號，並可讀取 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex\">Transformers 文件\u003C\u002Fa>與 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002F\">Hugging Face Hub\u003C\u002Fa>。\u003C\u002Fli>\u003Cli>GitHub 存取權，並可使用 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft\">PEFT repo\u003C\u002Fa> 與 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\">TRL repo\u003C\u002Fa>。\u003C\u002Fli>\u003Cli>Python 3.10+。\u003C\u002Fli>\u003Cli>PyTorch 2.1+。\u003C\u002Fli>\u003Cli>CUDA 12+，如果你要在 NVIDIA GPU 上訓練。\u003C\u002Fli>\u003Cli>至少 16 GB GPU VRAM，或等效雲端 GPU。\u003C\u002Fli>\u003Cli>已整理好的 instruction 資料集，格式為 JSONL 或 CSV。\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Step 1: 整理訓練資料集\u003C\u002Fh2>\u003Cp>這一步的產出是「可直接餵給訓練器的監督式資料集」。如果你做指令微調，每筆資料要有 prompt 與 target response；如果你做偏好訓練，還要保留 chosen 與 rejected 兩個答案。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg 
src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780121874501-suc1.png\" alt=\"怎麼做 LLM 微調\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cpre>\u003Ccode>import json\n\n# 以 UTF-8 讀取 JSONL，並略過空行\nwith open(\"train.jsonl\", encoding=\"utf-8\") as f:\n    rows = [json.loads(line) for line in f if line.strip()]\n\nprint(rows[0])\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>你應該看到結構完整的樣本，欄位像是 prompt、response，或 preference labels。若第一筆資料格式不對，先修正 schema，再進入訓練，否則 tokenizer 和 trainer 很容易在後面報錯。\u003C\u002Fp>\u003Ch2>Step 2: 建立 SFT 基線模型\u003C\u002Fh2>\u003Cp>這一步的產出是「可比較的監督式微調基線」。先做 SFT，因為它能先告訴你模型是否真的學會任務，之後再加 LoRA 或對齊方法，才知道改善來自哪一段流程。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780121873492-gg73.png\" alt=\"怎麼做 LLM 微調\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cpre>\u003Ccode>from datasets import Dataset\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments\nfrom transformers import DataCollatorForLanguageModeling\n\nmodel_name = \"meta-llama\u002FLlama-3.1-8B-Instruct\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\ntokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # 批次 padding 需要 pad token\nmodel = AutoModelForCausalLM.from_pretrained(model_name)\n\n# Trainer 需要 token 化後的 Dataset；此處假設每筆資料有 prompt 與 response 欄位，max_length 為示意值\ndef tokenize(row):\n    text = row[\"prompt\"] + row[\"response\"] + tokenizer.eos_token\n    return tokenizer(text, truncation=True, max_length=1024)\n\ndataset = Dataset.from_list(rows)\ntrain_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)\n\nargs = TrainingArguments(\n    output_dir=\".\u002Fsft-output\",\n    per_device_train_batch_size=1,\n    num_train_epochs=1,\n)\n\n# mlm=False 時，collator 會以 input_ids 建立 causal LM 的 labels\ncollator = DataCollatorForLanguageModeling(tokenizer, mlm=False)\ntrainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=collator)\ntrainer.train()\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>你應該看到 loss 在前幾個 step 下降，並且 output 目錄裡出現 checkpoints。若 loss 持平或突然飆高，先檢查 prompt 格式，再確認 labels 是否和 target text 對齊。\u003C\u002Fp>\u003Ch2>Step 3: 套用 LoRA 適配器\u003C\u002Fh2>\u003Cp>這一步的產出是「只\u003Ca href=\"\u002Fnews\u002F5-openai-product-updates-for-teams-zh\">更新\u003C\u002Fa>少量參數的 LoRA 訓練配置」。LoRA 適合 \u003Ca href=\"\u002Ftag\u002Fgpu\">GPU\u003C\u002Fa> 記憶體有限、又想快速迭代的情境，因為它不需要每次都重訓整個模型。\u003C\u002Fp>\u003Cpre>\u003Ccode>from peft import LoraConfig, get_peft_model, TaskType\n\nconfig = LoraConfig(\n 
   task_type=TaskType.CAUSAL_LM,\n    r=8,\n    lora_alpha=16,\n    lora_dropout=0.05,\n)\nmodel = get_peft_model(model, config)\nmodel.print_trainable_parameters()\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>你應該看到可訓練參數數量明顯小於 base model。若可訓練參數仍然太多，請確認 PEFT 包到正確模組，並檢查你沒有把整個網路意外解凍。\u003C\u002Fp>\u003Ch2>Step 4: 執行偏好對齊訓練\u003C\u002Fh2>\u003Cp>這一步的產出是「更符合人類偏好的對齊模型」。RLHF 類工作流的重點，是讓模型不只模仿標註答案，還要學會在多個候選回應中偏向更好的那個；若你已經有 preference pairs，DPO 通常是更直接的做法。\u003C\u002Fp>\u003Cpre>\u003Ccode>from datasets import Dataset\nfrom trl import DPOTrainer, DPOConfig\n\n# DPO 需要含 prompt、chosen、rejected 欄位的 Dataset，而不是 Python list\npref_dataset = Dataset.from_list(rows)\n\nconfig = DPOConfig(output_dir=\".\u002Fdpo-output\")\ntrainer = DPOTrainer(\n    model=model,\n    ref_model=None,  # model 為 PEFT 模型時，TRL 會停用 adapter 當作參考模型\n    args=config,\n    train_dataset=pref_dataset,\n    processing_class=tokenizer,  # 較舊的 TRL 版本參數名稱為 tokenizer\n)\ntrainer.train()\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>你應該看到 preference optimization 的 log，並且輸出一份對齊後的 checkpoint。若 log 中的 rewards\u002Faccuracies 與 rewards\u002Fmargins 逐步上升，通常代表偏好目標已經開始生效；回答只是變短或變保守，並不足以證明對齊有效。\u003C\u002Fp>\u003Ch2>Step 5: 評估並封裝模型\u003C\u002Fh2>\u003Cp>這一步的產出是「可部署的最終模型資產」。先用一小組測試題檢查指令遵循、拒答行為與領域正確率，再把 adapter 或合併後的權重存成可部署格式。\u003C\u002Fp>\u003Cpre>\u003Ccode>model.save_pretrained(\".\u002Ffinal-adapter\")\ntokenizer.save_pretrained(\".\u002Ffinal-adapter\")\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>你應該看到一個模型資料夾，裡面有 adapter 權重或 merged weights，還有 tokenizer 檔案。若資料夾內容不完整，請確認 model 和 tokenizer 都有寫出，並且部署 runtime 使用的是同一個 base model 版本。\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>指標\u003C\u002Fth>\u003Cth>基準／優化前\u003C\u002Fth>\u003Cth>結果／優化後\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>可訓練參數\u003C\u002Ftd>\u003Ctd>完整模型\u003C\u002Ftd>\u003Ctd>只更新 LoRA adapters\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>GPU 記憶體占用\u003C\u002Ftd>\u003Ctd>全量微調較高\u003C\u002Ftd>\u003Ctd>參數高效微調較低\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>輸出品質\u003C\u002Ftd>\u003Ctd>Base model 原始行為\u003C\u002Ftd>\u003Ctd>更貼近任務且更符合偏好\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>常見錯誤\u003C\u002Fh2>\u003Cul>\u003Cli>直接拿原始聊天紀錄訓練。修法：先整理成一致的 prompt-response，或 chosen-rejected pair 格式，再開始訓練。\u003C\u002Fli>\u003Cli>明明 
LoRA 就夠，卻整個模型全量微調。修法：先用 adapter 跑通流程，只有在任務確實需要時，再考慮 full fine-tuning。\u003C\u002Fli>\u003Cli>做完對齊卻沒做同題比較。修法：把 base、SFT、DPO 的輸出放在同一組測試 prompt 上比對，及早抓回歸。\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>接下來可以看什麼\u003C\u002Fh2>\u003Cp>如果這套流程已經穩定，下一步可以看多輪對話格式化、你的領域資料清理，以及當工作場景包含圖片時的多模態微調。\u003C\u002Fp>","學會用 SFT、LoRA、RLHF 或 DPO 微調大型語言模型，建立可訓練、可對齊、可部署的完整流程。","amazingelearning.com","https:\u002F\u002Famazingelearning.com\u002Fllm-fine-tuning-course-from-supervised-ft-to-rlhf-lora-and-multimodal\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780121874501-suc1.png","research","zh","a7495002-c056-4f43-a567-2b844f4ba52d",[17,18,19,20,21,22],"SFT","LoRA","RLHF","DPO","Hugging Face","PyTorch",[24,25,26],"先把資料整理成固定 schema，訓練流程才不會在中途失敗。","先做 SFT 基線，再加 LoRA 與偏好對齊，才能清楚比較每一步的效果。","完成後要輸出 adapter 或合併權重，並用同一組測試題驗證指令遵循與偏好行為。",8,"2026-05-30T06:17:24.581159+00:00","2026-05-30T06:17:24.574+00:00","0c35a120-52fc-41fc-afa3-d404eb934158",{"tags":32,"relatedLang":43,"relatedPosts":47},[33,35,37,39,41],{"name":18,"slug":34},"lora",{"name":21,"slug":36},"hugging-face",{"name":19,"slug":38},"rlhf",{"name":17,"slug":40},"sft",{"name":20,"slug":42},"dpo",{"id":15,"slug":44,"title":45,"language":46},"how-to-fine-tune-llms-with-sft-lora-and-rlhf-en","How to Fine-Tune LLMs with SFT, LoRA, and RLHF","en",[48,54,60,66,72,78],{"id":49,"slug":50,"title":51,"cover_image":52,"image_url":52,"created_at":53,"category":13},"f374155a-c29e-478c-b7a5-679cad1c51e4","crdts-keep-replicas-in-sync-without-locks-zh","CRDT 
讓副本不用鎖也能同步","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781011086259-4p4k.png","2026-06-09T13:17:34.493426+00:00",{"id":55,"slug":56,"title":57,"cover_image":58,"image_url":58,"created_at":59,"category":13},"4b3b5a50-45b7-4238-a38b-160f82e323ff","post-deterministic-systems-autonomous-infra-zh","後決定性分散系：自治基礎設施新框架","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781010194792-5ogb.png","2026-06-09T13:02:32.717551+00:00",{"id":61,"slug":62,"title":63,"cover_image":64,"image_url":64,"created_at":65,"category":13},"04e45398-9814-4907-b416-fcb5b8d69508","causal-learnability-formal-language-tasks-zh","用因果法量化任務可學性","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780987696075-l4g0.png","2026-06-09T06:47:34.438642+00:00",{"id":67,"slug":68,"title":69,"cover_image":70,"image_url":70,"created_at":71,"category":13},"75bcc569-5e89-45c8-b809-6f169e929f4b","rl-training-hands-off-control-gradually-zh","RL 先接管再放手","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780986786312-03yo.png","2026-06-09T06:32:32.849589+00:00",{"id":73,"slug":74,"title":75,"cover_image":76,"image_url":76,"created_at":77,"category":13},"e3ecab4b-7cc7-4246-baf6-e1c170d86ca5","omnigamearena-vlm-game-agent-benchmark-zh","OmniGameArena 讓 VLM 遊戲代理更好比","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780985893022-70pl.png","2026-06-09T06:17:32.189729+00:00",{"id":79,"slug":80,"title":81,"cover_image":82,"image_url":82,"created_at":83,"category":13},"6f25a29c-cbb8-4f53-9af7-1656b394333a","turboquant-cuts-kv-cache-memory-6x-google-tests-zh","TurboQuant 在 Google 測試中省下 6x KV 
快取","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780906682236-sqe2.png","2026-06-08T08:17:21.878314+00:00",[85,90,95,100,105,110,115,120,125,130],{"id":86,"slug":87,"title":88,"created_at":89},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 AI","2026-03-26T08:16:02.367355+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"53a0dc54-0371-4e40-8d5e-74e94a73840c","geometry-aware-similarity-metrics-for-neural-representations-zh","超越距離測量：用微分幾何重新理解神經網路","2026-03-31T06:01:01.241968+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"fee7d472-a775-4b1d-bbc2-1e8bca1bbf8b","on-the-fly-repulsion-in-the-contextual-space-for-rich-divers-zh","讓AI繪圖更有創意：用排斥力提升生成多樣性","2026-03-31T06:01:25.439673+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"a9901203-d69b-447b-8854-15d14eab32b4","vision-aided-beam-prediction-cnn-eca-zh","影像輔助波束預測升級 CNN","2026-04-01T10:00:25.8073+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"b55e7dd4-0a24-4b3d-804d-b0309a03f498","triple-band-fss-mimo-antenna-sub-6-ghz-zh","三頻 FSS MIMO 天線瞄準 sub-6 GHz","2026-04-01T13:18:36.857305+00:00",{"id":126,"slug":127,"title":128,"created_at":129},"f68290bd-e7f3-4b30-ba22-dcd4e0130a66","openclaw-1299-repos-eight-weeks-analysis-zh","OpenClaw 1299 個 Repo 
的資料解讀","2026-04-02T05:03:45.208411+00:00",{"id":131,"slug":132,"title":133,"created_at":134},"ed9f80eb-eb02-4d35-8ad4-0ddf428751dd","beam-coherence-aware-combining-mmwave-mimo-zh","毫米波 MIMO 的雙階合併法","2026-04-02T05:27:26.897188+00:00"]