<p data-speakable="summary">Build a 1930-cutoff <a href="/tag/llm">LLM</a> testbed in 5 steps to study historical reasoning and contamination-free generalization.</p>
<p>This guide is for ML engineers, research scientists, and platform teams who want to stand up an experimental pipeline that consumes only pre-1931 English text. By the end you will have a traceable historical corpus, a temporal-decontamination filtering workflow, OCR-cleaned training data, vintage instruction-tuning data, and an evaluation setup you can compare against modern baselines.</p>
<p>This setup is well suited to studying historical reasoning, data contamination, and how models generalize past their knowledge cutoff. You will also see first-hand that OCR quality, date filtering, and post-training data design matter as much as model scale.</p>
<h2>Before you start</h2>
<ul><li>Python 3.11+</li><li>A CUDA GPU with at least 28 GB of VRAM, enough for a 13B model</li><li>PyTorch 2.4+</li><li>A Hugging Face account with access to the model weights</li><li>Git 2.40+</li><li>An OCR tool, e.g. Tesseract 5+ or your own document OCR pipeline</li><li>Access to historical corpora: books, newspapers, journals, patents, and case law</li><li>Optional: a judge-model API key or a local model for preference optimization</li></ul>
<p>For a reference implementation, start with the project page, paper, and demo, then cross-check against the source code and weights. Public resources include the <a href="https://talkie-lm.com/chat">Talkie-1930 demo</a>, the <a href="https://www.marktechpost.com/2026/04/27/meet-talkie-1930-a-13b-open-weight-llm-trained-on-pre-1931-english-text-for-historical-reasoning-and-generalization-research/">MarkTechPost project write-up</a>, and the <a href="/tag/github">GitHub</a> and model resources it links to.</p>
<figure class="my-6"><img src="https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1777945255070-tzh6.png" alt="How to build a 1930-cutoff LLM testbed in 5 steps" class="rounded-xl w-full" loading="lazy" /></figure>
<h2>Step 1: Assemble a pre-1931 corpus manifest</h2>
<p>Goal: build a historical corpus manifest that is clearly dated, auditable, and legally usable, so the model's knowledge cutoff is unambiguous.</p>
<p>Start by collecting public-domain English text from books, newspapers, journals, scientific magazines, patents, and case law. Keep metadata for every item, especially the publication date, source type, and scan origin. Then assign each item a stable ID so that every training <a href="/tag/token">token</a> can be traced back to a physical original.</p>
<pre><code>python build_corpus.py \
  --sources books,newspapers,journals,patents,case_law \
  --cutoff-date 1930-12-31 \
  --output corpus/manifests/pre1931.jsonl</code></pre>
<p>You should see a manifest in which every document is verified to predate 1930-12-31. When you spot-check a record, the source, publication year, and scan path should all be present.</p>
<h2>Step 2: Filter out temporally contaminated documents</h2>
<p>Goal: remove anachronistic content so that mislabeled dates, later reprints, or editorial notes do not smuggle post-1930 knowledge into the model.</p>
<figure class="my-6"><img src="https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1777945259007-i5nr.png" alt="How to build a 1930-cutoff LLM testbed in 5 steps" class="rounded-xl w-full" loading="lazy" /></figure>
<p>Build a document-level filter that combines date checks with n-gram or classifier-based anachronism detection. Flag pages that mention post-1930 events, technologies, or people, and exclude items whose metadata is uncertain. This step is critical because date fields alone can still let contaminated samples through.</p>
<p>In practice, keep a quarantine set: route suspicious documents there first, then decide after human review whether to admit them into the final corpus. If you want to reproduce a Talkie-1930-style experiment, do not skip this step; a correct manifest does not guarantee clean content.</p>
<p>You should see the corpus shrink, plus a leakage report. A good acceptance signal is that random samples no longer contain obvious post-1930 references, such as the Second World War, modern computers, or later political events.</p>
<h2>Step 3: Run OCR and clean the scanned pages</h2>
<p>Goal: turn page images into trainable text while minimizing the noise typical of historical scans.</p>
<p>OCR every scanned page first, then handle hyphenation, running headers and footers, marginal notes, and ligature variants. If you can, keep a small human-transcribed sample as a baseline and compare raw OCR output against the cleaned text. With historical material the problem is usually not just recognition accuracy but systematic errors caused by layout and typefaces.</p>
<pre><code>python ocr_pipeline.py \
  --input scans/ \
  --engine tesseract \
  --cleanup rules/historical_regex.yml \
  --output text/ocr_cleaned/</code></pre>
<p>You should see page-aligned text files plus a quality report covering character error rate, token retention, and cleanup gains. If cleaned samples still show broken lines, repeated headers, or odd whitespace, fix the preprocessing rules before you start training.</p>
<h2>Step 4: Train the base model on historical tokens</h2>
<p>Goal: constrain the base model to the language distribution of a 1930-cutoff world, so it learns only the patterns of the historical corpus.</p>
<p>Use a standard causal language modeling setup, but keep the data stream strictly historical. Track token counts precisely; the reference project trained on roughly 260 billion tokens. Save checkpoints regularly and compute perplexity on held-out pre-1931 text to confirm the model is learning the distribution rather than memorizing scan noise.</p>
<p>For reproducibility, pin the tokenizer, sequence length, optimizer, and mixed-precision settings. If you want a modern control, train a twin model with the same architecture and hyperparameters on a contemporary corpus, so the comparison is fair.</p>
<p>You should see training loss fall steadily and held-out historical perplexity keep improving. A healthy sign is that the model continues text in a period-appropriate voice without modern vocabulary slipping in.</p>
<h2>Step 5: Post-train with vintage instructions and evaluate</h2>
<p>Goal: teach the model to follow instructions without importing modern chat habits or contemporary knowledge.</p>
<p>Assemble instruction-response pairs from pre-1931 sources: etiquette manuals, letter-writing guides, cookbooks, dictionaries, encyclopedias, poetry collections, and fable anthologies. Then run supervised fine-tuning, followed by preference optimization with a judge model. The reference pipeline adds one more round of synthetic dialogue to improve instruction following while keeping the historical constraint.</p>
<pre><code>python post_train.py \
  --base_model checkpoints/talkie_base \
  --instruction_data data/vintage_instructions.jsonl \
  --dpo_judge claude-sonnet-4.6 \
  --output checkpoints/talkie_it</code></pre>
<p>You should see instruction-following scores rise on a five-point scale and responses become more useful. At evaluation time, test on both anachronistic and filtered benchmarks, and compare against your modern twin, so you can tell whether gaps come from the historical cutoff, OCR noise, or topic mismatch.</p>
<h2>Common mistakes</h2>
<ul><li>Filtering on date fields alone. Fix: add an anachronism classifier and human review of suspicious documents.</li><li>Training directly on raw OCR text. Fix: apply cleanup rules first, then validate against a human-transcribed subset.</li><li>Mixing modern instruction data into post-training. Fix: extract prompts and answers only from pre-1931 manuals, encyclopedias, and similar sources.</li></ul>
<p>Another common problem is underestimating hardware requirements. Running a 13B model reliably usually takes careful tuning of batch size and <a href="/tag/cuda">CUDA</a> memory headroom; if you scale out to multiple nodes, also pin the data order and checkpoint naming so the historical experiment stays reproducible.</p>
<h2>What to look at next</h2>
<p>Once this pipeline runs end to end, the next steps are scaling to larger historical corpora, adding OCR models better suited to vintage typesetting, and running controlled experiments on forecasting, temporal surprise, and code generalization. From there you can put the 1930-cutoff model side by side with a modern LLM and see which capabilities depend on internet-era knowledge and which come from language modeling itself.</p>
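<p>The manifest validation in Step 1 can be sketched in a few lines. This is a minimal illustration, not the project's actual <code>build_corpus.py</code>: the record fields (<code>publication_date</code>, <code>scan_path</code>) and the ID scheme are assumptions you should adapt to your own manifest schema.</p>
<pre><code>```python
import hashlib
from datetime import date

CUTOFF = date(1930, 12, 31)

def stable_id(source: str, scan_path: str) -> str:
    """Derive a stable document ID from source metadata, so every
    training token can be traced back to a physical scan."""
    return hashlib.sha256(f"{source}|{scan_path}".encode()).hexdigest()[:16]

def validate_record(record: dict) -> bool:
    """Accept a manifest record only if its publication date is known
    and falls on or before the cutoff."""
    pub = record.get("publication_date")
    if pub is None:
        return False  # uncertain metadata is excluded, not guessed
    year, month, day = (int(x) for x in pub.split("-"))
    return date(year, month, day) <= CUTOFF

record = {
    "source": "newspapers",
    "scan_path": "scans/times_1929_05_01_p3.png",
    "publication_date": "1929-05-01",
}
record["id"] = stable_id(record["source"], record["scan_path"])
print(validate_record(record))                               # True (pre-cutoff)
print(validate_record({"publication_date": "1931-01-01"}))   # False (post-cutoff)
```</code></pre>
<p>The point of hashing source metadata rather than assigning sequential IDs is that the ID stays stable if the manifest is rebuilt in a different order.</p>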
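<p>The anachronism flagging in Step 2 can be approximated with a keyword pass that routes hits to the quarantine set. The term list here is a tiny hypothetical seed; a real filter would pair it with a trained classifier and human review, as the step describes.</p>
<pre><code>```python
import re

# Hypothetical seed list of post-1930 terms; extend per domain.
ANACHRONISMS = [
    r"world war ii", r"second world war", r"television network",
    r"nuclear", r"united nations",
]
PATTERN = re.compile("|".join(ANACHRONISMS), re.IGNORECASE)

def triage(text: str) -> tuple[str, list[str]]:
    """Return ('clean' | 'quarantine', matched_terms) for one page of text.
    Quarantined pages go to human review, not straight to deletion."""
    hits = sorted({m.group(0).lower() for m in PATTERN.finditer(text)})
    return ("quarantine" if hits else "clean"), hits

status, hits = triage("The Second World War changed everything.")
print(status, hits)   # quarantine ['second world war']
```</code></pre>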
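<p>The cleanup rules in Step 3 are mostly mechanical. A minimal sketch of two of them, end-of-line dehyphenation and running-header removal, assuming the header string is already known per title:</p>
<pre><code>```python
import re

def clean_page(raw: str, running_header: str | None = None) -> str:
    """Apply common historical-scan fixes: drop a repeated running
    header, join words broken across lines, collapse stray whitespace."""
    text = raw
    if running_header:
        text = "\n".join(
            line for line in text.splitlines()
            if line.strip() != running_header
        )
    # join hyphenated line breaks: "immedi-\nately" -> "immediately"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

raw = "THE DAILY HERALD\nThe committee met immedi-\nately after luncheon."
print(clean_page(raw, running_header="THE DAILY HERALD"))
# The committee met immediately after luncheon.
```</code></pre>
<p>Rules like these belong in a versioned file (the article uses <code>rules/historical_regex.yml</code>) so the quality report can attribute cleanup gains to specific rules.</p>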
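<p>The checkpoint health check in Step 4 reduces to one formula: perplexity is the exponential of the mean per-token negative log-likelihood on held-out pre-1931 text. A minimal sketch, with made-up loss values standing in for real checkpoint evaluations:</p>
<pre><code>```python
import math

def perplexity(nll_per_token: list[float]) -> float:
    """exp of the mean per-token negative log-likelihood; track this
    on the same held-out pre-1931 set at every checkpoint."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Hypothetical per-token losses from two checkpoints on one held-out set.
early = [3.2, 3.5, 3.1, 3.4]
late  = [2.6, 2.8, 2.5, 2.7]
assert perplexity(late) < perplexity(early)  # historical fit improving
print(round(perplexity(early), 2), round(perplexity(late), 2))  # 27.11 14.15
```</code></pre>
<p>The important discipline is holding the evaluation set fixed across checkpoints; otherwise a falling number may reflect easier text, not a better model.</p>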
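<p>The instruction-pair extraction in Step 5 can be sketched as a wrapper that turns one manual entry into a JSONL record. Field names and the prompt template are illustrative, not the reference pipeline's schema; the one non-negotiable part is carrying a <code>source_id</code> so the historical provenance of every pair stays auditable.</p>
<pre><code>```python
import json

def to_instruction_pair(heading: str, body: str, source_id: str) -> dict:
    """Wrap one pre-1931 manual entry as an instruction-response pair."""
    return {
        "instruction": f"Explain the proper way to {heading.lower()}.",
        "response": body.strip(),
        "source_id": source_id,  # e.g. a manifest ID from Step 1
    }

pair = to_instruction_pair(
    "Address a Formal Letter",
    "Begin with the full name and title of the recipient.",
    "etiquette-manual-1928-p41",
)
print(json.dumps(pair))
```</code></pre>
<p>One such record per line yields the <code>data/vintage_instructions.jsonl</code> file that <code>post_train.py</code> expects in the example command above.</p>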