[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-paperless-ai-document-chat-rag-hybrid-search-zh":3,"article-related-paperless-ai-document-chat-rag-hybrid-search-zh":37,"series-tools-8299ded2-e180-43cf-b78e-96ac23033d26":85},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":21,"translated_content":10,"views":22,"is_premium":23,"created_at":24,"updated_at":24,"cover_image":11,"published_at":25,"rewrite_status":26,"rewrite_error":10,"rewritten_from_id":27,"slug":28,"category":29,"related_article_id":30,"status":31,"google_indexed_at":32,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":33,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":23},"8299ded2-e180-43cf-b78e-96ac23033d26","Paperless-AI：把文件庫變聊天機器人","\u003Cp data-speakable=\"summary\">Paperless-AI 讓 Paperless-ngx 的文件庫能聊天、能自動標籤，也能用 \u003Ca href=\"\u002Ftag\u002Frag\">RAG\u003C\u002Fa> 找出文件裡的答案。\u003C\u002Fp>\u003Cp>說真的，這東西很實用。\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fv8u7\u002Fpaperless-ai\" target=\"_blank\" rel=\"noopener\">Paperless-AI\u003C\u002Fa> 不是在做花俏 Demo。它是把文件管理系統，直接拉進 AI 工作流。\u003C\u002Fp>\u003Cp>如果你手上有幾千份發票、合約、信件，人工翻找真的會崩潰。\u003Ca href=\"https:\u002F\u002Fwww.paperless-ngx.com\u002F\" target=\"_blank\" rel=\"noopener\">Paperless-ngx\u003C\u002Fa> 很會存檔，但它不會幫你回答「這份合約的終止條款在哪」。Paperless-AI 就是在補這個洞。\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>項目\u003C\u002Fth>\u003Cth>Paperless-AI 做法\u003C\u002Fth>\u003Cth>意義\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>架構\u003C\u002Ftd>\u003Ctd>Node.js + Express，搭配 Python + FastAPI\u003C\u002Ftd>\u003Ctd>把網頁流程和 AI 運算拆開\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>檢索\u003C\u002Ftd>\u003Ctd>Hybrid search，結合 BM25 與 cosine similarity\u003C\u002Ftd>\u003Ctd>關鍵字不同，還是找得到\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>向量庫\u003C\u002Ftd>\u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fwww.trychroma.com\u002F\" target=\"_blank\" rel=\"noopener\">ChromaDB\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>存 embeddings，做語意查找\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>本機狀態\u003C\u002Ftd>\u003Ctd>better-sqlite3\u003C\u002Ftd>\u003Ctd>設定和處理狀態都放本地\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>模型來源\u003C\u002Ftd>\u003Ctd>\u003Ca href=\"https:\u002F\u002Fopenai.com\u002F\" target=\"_blank\" rel=\"noopener\">OpenAI\u003C\u002Fa>、\u003Ca href=\"https:\u002F\u002Fazure.microsoft.com\u002Fen-us\u002Fproducts\u002Fai-services\u002Fopenai-service\" target=\"_blank\" rel=\"noopener\">Azure OpenAI\u003C\u002Fa>、\u003Ca href=\"https:\u002F\u002Follama.com\u002F\" target=\"_blank\" rel=\"noopener\">Ollama\u003C\u002Fa>\u003C\u002Ftd>\u003Ctd>可選雲端或本機推論\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>這個專案到底在解什麼問題\u003C\u002Fh2>\u003Cp>文件系統最常見的痛點，不是存不下。是找不到。你可以把資料塞進伺服器，也可以加上標籤和欄位。但一旦使用者想問「去年那份供應商合約怎麼寫」，搜尋就會卡住。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778062851674-apws.png\" alt=\"Paperless-AI：把文件庫變聊天機器人\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Paperless-AI 的思路很直接。它不把文件當死檔案。它把文件當資料來源，再接上 AI。這樣一來，文件可以自動分類，也可以被摘要，甚至可以直接對話。\u003C\u002Fp>\u003Cp>我覺得這種設計比單純加一個聊天框更合理。因為它先處理檢索，再處理回答。模型如果沒有抓到對的段落，講再多都只是亂掰。\u003C\u002Fp>\u003Cul>\u003Cli>自動分類新文件\u003C\u002Fli>\u003Cli>抽取自訂欄位\u003C\u002Fli>\u003Cli>對整個文件庫問答\u003C\u002Fli>\u003Cli>可自架，資料不必外流\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>混合架構才是重點\u003C\u002Fh2>\u003Cp>Paperless-AI 把 \u003Ca href=\"https:\u002F\u002Fnodejs.org\u002F\" target=\"_blank\" rel=\"noopener\">Node.js\u003C\u002Fa> 和 \u003Ca href=\"https:\u002F\u002Fexpressjs.com\u002F\" target=\"_blank\" rel=\"noopener\">Express\u003C\u002Fa> 拿來做協調層。\u003Ca href=\"https:\u002F\u002Fwww.python.org\u002F\" target=\"_blank\" rel=\"noopener\">Python\u003C\u002Fa> 
和 \u003Ca href=\"https:\u002F\u002Ffastapi.tiangolo.com\u002F\" target=\"_blank\" rel=\"noopener\">FastAPI\u003C\u002Fa> 則負責 AI 任務。這個切法很務實。前端流程、佇列、驗證，和 embeddings、retrieval、模型呼叫，本來就不是同一種工作。\u003C\u002Fp>\u003Cp>Python 端用 \u003Ca href=\"https:\u002F\u002Fwww.sbert.net\u002F\" target=\"_blank\" rel=\"noopener\">sentence-transformers\u003C\u002Fa> 做 embeddings，再丟進 \u003Ca href=\"https:\u002F\u002Fwww.trychroma.com\u002F\" target=\"_blank\" rel=\"noopener\">ChromaDB\u003C\u002Fa>。Node.js 端處理登入、文件佇列、EJS 畫面。這樣拆開，維護起來比較不會一團亂。\u003C\u002Fp>\u003Cp>更重要的是，它不是只靠語意搜尋。它還混了關鍵字檢索。這點很關鍵。因為文件世界很髒，很多內容會有固定名詞、編號、日期，單靠向量相似度常常會漏掉。\u003C\u002Fp>\u003Cblockquote>“Retrieval-augmented generation is the best way to keep models grounded in your data.” — Harrison Chase\u003C\u002Fblockquote>\u003Cp>這句是 \u003Ca href=\"https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fharrison-chase-6b5b3b4\u002F\" target=\"_blank\" rel=\"noopener\">Harrison Chase\u003C\u002Fa> 說的。他是 \u003Ca href=\"https:\u002F\u002Fwww.langchain.com\u002F\" target=\"_blank\" rel=\"noopener\">LangChain\u003C\u002Fa> 的共同創辦人。講白了，Paperless-AI 就是在做這件事：先找對資料，再讓模型回答。\u003C\u002Fp>\u003Cp>它也支援多種模型來源。你可以接 \u003Ca href=\"https:\u002F\u002Fopenai.com\u002F\" target=\"_blank\" rel=\"noopener\">OpenAI\u003C\u002Fa>，也可以接 \u003Ca href=\"https:\u002F\u002Fazure.microsoft.com\u002Fen-us\u002Fproducts\u002Fai-services\u002Fopenai-service\" target=\"_blank\" rel=\"noopener\">Azure OpenAI\u003C\u002Fa>，或是用 \u003Ca href=\"https:\u002F\u002Follama.com\u002F\" target=\"_blank\" rel=\"noopener\">Ollama\u003C\u002Fa> 跑本機模型。對有敏感文件的團隊來說，這很重要。\u003C\u002Fp>\u003Ch2>RAG、chunking、hybrid search 在忙什麼\u003C\u002Fh2>\u003Cp>這套系統的核心，不是「把文件丟給模型」。而是先切 chunk，再做索引，再找相關段落。這樣做很土，但有效。因為長文件一多，context 很快就爆掉。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg 
src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778062863654-ws43.png\" alt=\"Paperless-AI：把文件庫變聊天機器人\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Paperless-AI 還會把既有標籤和 correspondent 塞進 prompt。這個細節很小，但效果很大。模型比較不會自己亂生新分類，導致你的文件夾整個失控。\u003C\u002Fp>\u003Cp>它也有 \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa>-aware truncation。這是很實際的保護機制。文件太長時，先算 token 再送出，能少掉很多失敗請求。說白了，就是少浪費 \u003Ca href=\"\u002Ftag\u002Fapi\">API\u003C\u002Fa> 成本。\u003C\u002Fp>\u003Cul>\u003Cli>chunking 先縮小內容範圍\u003C\u002Fli>\u003Cli>語意搜尋加關鍵字搜尋，命中率更穩\u003C\u002Fli>\u003Cli>沿用既有標籤，避免分類亂掉\u003C\u002Fli>\u003Cli>token 控制，減少 context overflow\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>跟其他做法比，差在哪\u003C\u002Fh2>\u003Cp>如果只看 \u003Ca href=\"https:\u002F\u002Fwww.paperless-ngx.com\u002F\" target=\"_blank\" rel=\"noopener\">Paperless-ngx\u003C\u002Fa>，它擅長的是存檔、權限、基本搜尋。它不是拿來做文件問答的。你可以說它是文件倉庫，但不是 AI 助理。\u003C\u002Fp>\u003Cp>如果改用一般雲端 \u003Ca href=\"\u002Ftag\u002Fai-工具\">AI 工具\u003C\u002Fa>，流程會更快，但資料外流風險也更高。很多公司一碰到合約、採購單、內部信件，就不敢把內容直接丟出去。這時候自架方案就有價值。\u003C\u002Fp>\u003Cp>Paperless-AI 卡在中間，位置很漂亮。它保留原本的文件系統，再補上檢索和聊天。對已經在用 Paperless-ngx 的團隊來說，這比整套重做實際多了。\u003C\u002Fp>\u003Cul>\u003Cli>Paperless-ngx：強在存檔與 metadata\u003C\u002Fli>\u003Cli>Paperless-AI：加上檢索、聊天、標籤\u003C\u002Fli>\u003Cli>雲端 AI 工具：方便，但資料控制較弱\u003C\u002Fli>\u003Cli>Ollama 本機部署：較適合敏感文件\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這類工具的產業脈絡\u003C\u002Fh2>\u003Cp>文件管理和 AI 的結合，現在已經不是新鮮事。但很多產品都只做半套。不是只有搜尋框，就是只有聊天框。真正有用的方案，通常要同時處理 ingestion、索引、權限、回覆品質。\u003C\u002Fp>\u003Cp>這也是為什麼 RAG 會一直出現在企業工具裡。因為企業不缺模型。企業缺的是把資料找準的能力。模型本身很會講，但如果上下文錯了，答案也會跟著歪掉。\u003C\u002Fp>\u003Cp>我覺得 Paperless-AI 的價值，不在於它多炫。它的價值在於它很像真的會被部署的東西。它沒有要求你換掉整個文件流程，只是把 AI 
疊上去。\u003C\u002Fp>\u003Cp>如果你在做內部工具，這個方向很值得參考。先把資料整理好，再讓 AI 接手高摩擦工作。不要反過來。很多團隊就是先追模型，再補資料，最後整個系統都很難維護。\u003C\u002Fp>\u003Ch2>結論：這種文件庫會越來越像知識介面\u003C\u002Fh2>\u003Cp>我猜接下來的文件系統，會把搜尋、標籤、問答放在同一條流程。不是三個功能分開按，而是使用者輸入一句話，系統自己去找、去判斷、去整理。\u003C\u002Fp>\u003Cp>如果你已經有 Paperless-ngx，下一步不一定是換模型。你更該先問：你的文件能不能被準確檢索，能不能被穩定分類，能不能在不外流的前提下回答問題。這三件事做好，AI 才真的有用。\u003C\u002Fp>","Paperless-AI 把 Paperless-ngx 變成可聊天的文件庫，結合 RAG、hybrid search、AI 標籤與自架部署，適合大量合約、發票與內部文件。","viblo.asia","https:\u002F\u002Fviblo.asia\u002Fp\u002Fopen-source-220-paperless-ai-he-thong-quan-tri-tri-thuc-tai-lieu-thong-minh-voi-kien-truc-hybrid-nodejspython-co-che-rag-va-hybrid-search-toi-uu-AY4qQdbr4Pw",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778062851674-apws.png",[13,14,15,16,17,18,19,20],"Paperless-AI","Paperless-ngx","RAG","hybrid search","文件管理","自架 AI","ChromaDB","Ollama","zh",1,false,"2026-05-06T10:20:34.644825+00:00","2026-05-06T10:20:34.527+00:00","done","65082419-30ac-4c8b-8206-5000fefc4cd1","paperless-ai-document-chat-rag-hybrid-search-zh","tools","ba6e64b7-f424-464c-90f6-dc5d66ccaf0a","published","2026-05-07T09:00:19.409+00:00",[34,35,36],"Paperless-AI 把 Paperless-ngx 變成可聊天的文件庫，重點在 RAG 與 hybrid search。","它用 Node.js + Python 的混合架構，分開處理流程與 AI 任務。","對敏感文件來說，自架與本機推論比純雲端工具更實際。",{"tags":38,"relatedLang":10,"relatedPosts":48},[39,41,43,45,46],{"name":13,"slug":40},"paperless-ai",{"name":15,"slug":42},"rag",{"name":16,"slug":44},"hybrid-search",{"name":17,"slug":17},{"name":14,"slug":47},"paperless-ngx",[49,55,61,67,73,79],{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":29},"a109dac1-43f3-4a6b-982c-13b59e8f61e9","vibe-research-ai-tools-workflows-zh","Vibe Research：用 AI 
加速研究流程","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778904653705-zekc.png","2026-05-16T04:10:33.15767+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":29},"cb68bb90-3638-4334-87c7-02580f59877a","aws-repository-wide-security-scanner-matters-zh","為什麼 AWS 的全倉庫安全掃描比更快的 SAST 更重要","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778901047875-n4l9.png","2026-05-16T03:10:24.757504+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":29},"7c966206-36f7-4d6b-b2e5-088a4732ede4","why-docker-microvm-sandboxes-ai-agents-zh","為什麼 Docker 的 microVM 沙盒才是 AI 代理的正解","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778893837158-gpxf.png","2026-05-16T01:10:20.668094+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":29},"d058a76f-6548-4135-8970-f3a97f255446","why-gemini-api-pricing-is-cheaper-than-it-looks-zh","為什麼 Gemini API 定價其實比看起來更便宜","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778869845081-j4m7.png","2026-05-15T18:30:25.797639+00:00",{"id":74,"slug":75,"title":76,"cover_image":77,"image_url":77,"created_at":78,"category":29},"68e4be16-dc38-4524-a6ea-5ebe22a6c4fb","why-vidhub-huiyuan-hutong-bushi-quan-shebei-tongyong-zh","為什麼 VidHub 會員互通不是「買一次全設備通用」","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778789450987-advz.png","2026-05-14T20:10:24.048988+00:00",{"id":80,"slug":81,"title":82,"cover_image":83,"image_url":83,"created_at":84,"category":29},"7a1e174f-746b-4e82-a0e3-b2475ab39747","why-buns-zig-to-rust-experiment-is-right-zh","為什麼 Bun 的 Zig-to-Rust 
實驗是對的","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778767879127-5dna.png","2026-05-14T14:10:26.886397+00:00",[86,91,96,101,106,111,116,121,126,131],{"id":87,"slug":88,"title":89,"created_at":90},"de769291-4574-4c46-a76d-772bd99e6ec9","googles-biggest-gemini-launches-in-2026-zh","Google 2026 最大 Gemini 盤點","2026-03-26T07:26:39.21072+00:00",{"id":92,"slug":93,"title":94,"created_at":95},"855cd52f-6fab-46cc-a7c1-42195e8a0de4","surepath-real-time-mcp-policy-controls-zh","SurePath 推出即時 MCP 政策控管","2026-03-26T07:57:40.77233+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"9b19ab54-edef-4dbd-9ce4-a51e4bae4ebb","mcp-in-2026-the-ai-tool-layer-teams-use-zh","2026 年 MCP：團隊真的在用的 AI 工具層","2026-03-26T08:01:46.589694+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"af9c46c3-7a28-410b-9f04-32b3de30a68c","prompting-in-2026-what-actually-works-zh","2026 提示工程，真正有用的是什麼","2026-03-26T08:08:12.453028+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"05553086-6ed0-4758-81fd-6cab24b575e0","garry-tan-open-sources-claude-code-toolkit-zh","Garry Tan 開源 Claude Code 工具包","2026-03-26T08:26:20.068737+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"042a73a2-18a2-433d-9e8f-9802b9559aac","github-ai-projects-to-watch-in-2026-zh","2026 必看 20 個 GitHub AI 專案","2026-03-26T08:28:09.619964+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"a5f94120-ac0d-4483-9a8b-63590071ac6a","claude-code-vs-cursor-2026-zh","Claude Code 與 Cursor 深度對比：202…","2026-03-26T13:27:14.279193+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"0975afa1-e0c7-4130-a20d-d890eaed995e","practical-github-guide-learning-ml-2026-zh","2026 機器學習入門 GitHub 實用指南","2026-03-27T01:16:49.712576+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"bfdb467a-290f-4a80-b3a9-6f081afb6dff","aiml-2026-student-ai-ml-lab-repo-review-zh","AIML-2026：像課綱的學生實驗 
Repo","2026-03-27T01:21:51.467798+00:00",{"id":132,"slug":133,"title":134,"created_at":135},"80cabc3e-09fc-4ff5-8f07-b8d68f5ae545","ai-trending-github-repos-and-research-feeds-zh","AI Trending：把 AI 資源收成一張表","2026-03-27T01:31:35.262183+00:00"]