[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-build-code-aware-rag-pipeline-langchain-en":3,"article-related-build-code-aware-rag-pipeline-langchain-en":31,"series-ai-agent-e7be4c51-f2a0-44fb-b829-c5f2c0edb102":79},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"e7be4c51-f2a0-44fb-b829-c5f2c0edb102","build-code-aware-rag-pipeline-langchain-en","Build a code-aware RAG pipeline with LangChain","\u003Cp data-speakable=\"summary\">Set up a code-aware retrieval augmented generation pipeline with \u003Ca href=\"\u002Ftag\u002Flangchain\">LangChain\u003C\u002Fa>.\u003C\u002Fp>\u003Cp>This guide is for developers who want to build a retrieval augmented generation system that handles Python and Markdown files cleanly, splits content by tokens, and returns grounded answers from your own documents. 
By the end, you will have a working LangChain-based \u003Ca href=\"\u002Ftag\u002Frag\">RAG\u003C\u002Fa> workflow that loads files, chunks them with syntax awareness, stores embeddings, and answers questions with retrieved context.\u003C\u002Fp>\u003Ch2>Before you start\u003C\u002Fh2>\u003Cul>\u003Cli>Python 3.10+; every example in this guide is Python.\u003C\u002Fli>\u003Cli>A local Python environment where you can install the LangChain packages. No LangChain account is required.\u003C\u002Fli>\u003Cli>An LLM API key. The examples use OpenAI and read the key from the \u003Ccode>OPENAI_API_KEY\u003C\u002Fcode> environment variable; Anthropic or another supported provider also works if you swap the model classes.\u003C\u002Fli>\u003Cli>An embeddings API key for the same provider, or a local embeddings model.\u003C\u002Fli>\u003Cli>A small document set with .py and .md files.\u003C\u002Fli>\u003Cli>Git installed so you can clone a sample repo or your own project docs.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Step 1: Install LangChain packages\u003C\u002Fh2>\u003Cp>Goal: create a clean project with the libraries needed for loading files, splitting text, embedding chunks, and running retrieval.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781811178447-we5p.png\" alt=\"Build a code-aware RAG pipeline with LangChain\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cpre>\u003Ccode>pip install langchain langchain-community langchain-text-splitters langchain-openai faiss-cpu tiktoken\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>Verification: you should see the packages install without errors, and \u003Ccode>python -c \"import langchain\"\u003C\u002Fcode> should run successfully.\u003C\u002Fp>\u003Ch2>Step 2: Load Python and Markdown files\u003C\u002Fh2>\u003Cp>Goal: ingest source files into LangChain documents so the pipeline can treat code and docs as searchable inputs.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg 
src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781811180770-cua7.png\" alt=\"Build a code-aware RAG pipeline with LangChain\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cpre>\u003Ccode>from langchain_community.document_loaders import DirectoryLoader, TextLoader\n\npy_loader = DirectoryLoader(\".\u002Fdocs\", glob=\"**\u002F*.py\", loader_cls=TextLoader)\nmd_loader = DirectoryLoader(\".\u002Fdocs\", glob=\"**\u002F*.md\", loader_cls=TextLoader)\n\npython_docs = py_loader.load()\nmarkdown_docs = md_loader.load()\nall_docs = python_docs + markdown_docs\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>Verification: \u003Ccode>len(all_docs)\u003C\u002Fcode> should be greater than zero, and each document's \u003Ccode>page_content\u003C\u002Fcode> should hold the text of one of your files.\u003C\u002Fp>\u003Ch2>Step 3: Split documents by tokens\u003C\u002Fh2>\u003Cp>Goal: chunk content with \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa>-aware boundaries so the model sees complete ideas instead of arbitrary character slices.\u003C\u002Fp>\u003Cpre>\u003Ccode>from langchain_text_splitters import RecursiveCharacterTextSplitter\n\nsplitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n    chunk_size=800,\n    chunk_overlap=120,\n)\nchunks = splitter.split_documents(all_docs)\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>Verification: you should see more chunks than source files, and each chunk should stay at or under your 800-token target. The splitter prefers paragraph and line breaks, so splits mid-function or mid-paragraph become less common but can still occur.\u003C\u002Fp>\u003Ch2>Step 4: Create a vector index\u003C\u002Fh2>\u003Cp>Goal: turn chunks into embeddings and store them in a retriever-friendly index for \u003Ca href=\"\u002Fnews\u002Fbuild-semantic-search-opensearch-vectors-en\">semantic search\u003C\u002Fa>.\u003C\u002Fp>\u003Cpre>\u003Ccode>from langchain_openai import OpenAIEmbeddings\nfrom langchain_community.vectorstores import FAISS\n\nembeddings = OpenAIEmbeddings()\nvectorstore 
= FAISS.from_documents(chunks, embeddings)\nretriever = vectorstore.as_retriever(search_kwargs={\"k\": 4})\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>Verification: you should see the FAISS index build successfully, and calling \u003Ccode>retriever.invoke(\"your sample query\")\u003C\u002Fcode> should return the four best-matching chunks.\u003C\u002Fp>\u003Ch2>Step 5: Wire the RAG chain\u003C\u002Fh2>\u003Cp>Goal: connect retrieval to generation so the model answers using the most relevant chunks from your dataset. \u003Ccode>RetrievalQA\u003C\u002Fcode> is a legacy chain; if the import below fails on a recent LangChain release, it has likely moved to the separate \u003Ccode>langchain-classic\u003C\u002Fcode> package, so check the migration notes for your installed version.\u003C\u002Fp>\u003Cpre>\u003Ccode>from langchain_openai import ChatOpenAI\nfrom langchain.chains import RetrievalQA\n\nllm = ChatOpenAI(model=\"gpt-4o-mini\", temperature=0)\nqa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)\n\nresult = qa.invoke({\"query\": \"What does the codebase do?\"})\nprint(result[\"result\"])\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>Verification: you should see an answer that references your documents instead of a generic response, and the retrieved context should align with the question.\u003C\u002Fp>\u003Ch2>Step 6: Test chunk quality and retrieval\u003C\u002Fh2>\u003Cp>Goal: confirm that syntax-aware splitting and token-based chunking improve answer quality on code-heavy questions.\u003C\u002Fp>\u003Cp>Run a few targeted prompts such as function names, setup instructions, or architecture questions, then compare the retrieved chunks to the final answer. 
If the model misses key details, reduce chunk size, increase overlap, or add metadata filters for file type and path.\u003C\u002Fp>\u003Cp>Verification: you should see more precise answers for code and documentation questions, with fewer broken snippets and fewer irrelevant chunks in the top results.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Metric\u003C\u002Fth>\u003Cth>Before\u002FBaseline\u003C\u002Fth>\u003Cth>After\u002FResult\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Chunking method\u003C\u002Ftd>\u003Ctd>Character-based splits\u003C\u002Ftd>\u003Ctd>Token-based splits\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Code awareness\u003C\u002Ftd>\u003Ctd>Functions and blocks may break mid-way\u003C\u002Ftd>\u003Ctd>Splits stay closer to syntax boundaries\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Retrieval quality\u003C\u002Ftd>\u003Ctd>More noisy context\u003C\u002Ftd>\u003Ctd>More relevant top-k chunks\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Answer grounding\u003C\u002Ftd>\u003Ctd>Higher chance of generic responses\u003C\u002Ftd>\u003Ctd>More document-specific responses\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Common mistakes\u003C\u002Fh2>\u003Cul>\u003Cli>Using plain character chunking for code files. Fix: switch to a token-aware splitter and tune chunk size for functions and sections.\u003C\u002Fli>\u003Cli>Embedding too much content in one chunk. Fix: lower chunk size and increase overlap so retrieval returns focused context.\u003C\u002Fli>\u003Cli>Forgetting to verify retrieved sources. 
Fix: print the top-k chunks before generation and inspect whether the context matches the query.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What's next\u003C\u002Fh2>\u003Cp>Once this pipeline works, add metadata filters, source citations, persistence for the vector store, and evaluation tests so you can measure retrieval quality as your document set grows.\u003C\u002Fp>","Set up a code-aware retrieval augmented generation pipeline with LangChain.","www.datacamp.com","https:\u002F\u002Fwww.datacamp.com\u002Fcourses\u002Fretrieval-augmented-generation-rag-with-langchain",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781811178447-we5p.png","ai-agent","en","b8b8e12b-0e01-4204-bc59-eddef030606d",[17,18,19,20,21,22],"LangChain","RAG","embeddings","FAISS","token splitting","retrieval",[24,25,26],"Use token-aware splitting for code and Markdown so chunks stay semantically useful.","Store embeddings in a vector index and expose it through a retriever.","Verify retrieval before generation to improve answer grounding and reduce noise.",0,"2026-06-18T19:32:32.646714+00:00","2026-06-18T19:32:32.638+00:00","c58956f2-0e6f-4be5-b68a-39eda67428b3",{"tags":32,"relatedLang":38,"relatedPosts":42},[33,35,37],{"name":18,"slug":34},"rag",{"name":17,"slug":36},"langchain",{"name":19,"slug":19},{"id":15,"slug":39,"title":40,"language":41},"build-code-aware-rag-pipeline-langchain-zh","建立具程式感知的 RAG 管線","zh",[43,49,55,61,67,73],{"id":44,"slug":45,"title":46,"cover_image":47,"image_url":47,"created_at":48,"category":13},"a882d067-6acb-447d-993e-27a057d19e16","glm-5-vibe-coding-agentic-engineering-en","GLM-5 turns vibe coding into agentic 
engineering","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781853520038-t5b8.png","2026-06-19T07:18:09.934598+00:00",{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":13},"221ce4cc-ac8a-486b-97ed-b5ddaf6c6cf7","kimi-k2-6-turns-agents-into-a-swarm-en","Kimi K2.6 turns agents into a swarm","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781824696228-ongx.png","2026-06-18T23:17:48.267558+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":13},"6908129c-aaf5-4ffa-bbee-00c0c64d8332","lightrag-simple-defaults-beat-rag-complexity-en","LightRAG proves graph RAG needs simpler defaults, not more complexity","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781812063896-xlys.png","2026-06-18T19:47:20.976816+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":13},"2c508377-9009-41ad-8a60-32531961b37b","ebay-mcp-ai-assistants-ebay-sell-apis-en","ebay-mcp puts eBay Sell APIs in AI assistants","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781809378267-qvns.png","2026-06-18T19:02:33.802715+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":13},"e60c0f75-2fb3-4038-b0ab-4b0012007c73","github-last30days-skill-ai-research-model-en","GitHub’s last30days skill is the right model for AI 
research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781752667618-v8qc.png","2026-06-18T03:17:23.11071+00:00",{"id":74,"slug":75,"title":76,"cover_image":77,"image_url":77,"created_at":78,"category":13},"91107dfa-fd91-433e-8d63-6dc73fc925ca","tcs-anthropic-enterprise-ai-partnership-en","TCS and Anthropic strike enterprise AI pact","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781713079931-ntip.png","2026-06-17T16:17:35.913583+00:00",[80,85,90,95,100,105,110,115,120,125],{"id":81,"slug":82,"title":83,"created_at":84},"03db8de8-8dc2-4ac1-9cf7-898782efbb1f","anthropic-claude-ai-agent-task-automation-en","Anthropic's Claude AI Agent: A New Era of Task Automation","2026-03-25T16:25:06.513026+00:00",{"id":86,"slug":87,"title":88,"created_at":89},"045d1abc-190d-4594-8c95-91e2a26f0c5a","googles-2026-ai-agent-report-decoded-en","Google’s 2026 AI Agent Report, Decoded","2026-03-26T11:15:23.046616+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"e64aba21-254b-4f93-aa21-837484bb52ec","kimi-k25-review-stronger-still-not-legend-en","Kimi K2.5 review: stronger, still not a legend","2026-03-27T07:15:55.385951+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"30dfb781-a1b2-4add-aebe-b3df40247c37","claude-code-controls-mac-desktop-en","Claude Code now controls your Mac desktop","2026-03-28T03:01:59.384091+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"254405b6-7833-4800-8e13-f5196deefbe6","cloudflare-100x-faster-ai-agent-sandbox-en","Cloudflare’s 100x Faster AI Agent Sandbox","2026-03-28T03:09:44.356437+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"04f29b7f-9b91-4306-89a7-97d725e6e1ba","openai-backs-isara-agent-swarm-bet-en","OpenAI backs Isara’s agent-swarm 
bet","2026-03-28T03:15:27.849766+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"3b0bf479-e4ae-4703-9666-721a7e0cdb91","openai-plan-automated-ai-researcher-en","OpenAI’s plan for an automated AI researcher","2026-03-28T03:17:42.312819+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"fe91bce0-b85d-4efa-a207-24ae9939c29f","harness-engineering-ai-agent-reliability-2026","Harness Engineering: From Bridle to Operating System, The Missing Link in AI Agent Reliability","2026-03-31T06:36:55.648751+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"7a09007d-820f-43b3-8607-8ad1bfcb94c8","mcp-explained-from-prompts-to-production-en","MCP Explained: From Prompts to Production","2026-04-01T09:24:40.089177+00:00",{"id":126,"slug":127,"title":128,"created_at":129},"116d5ee9-a4f1-4b5a-aac5-5d035dd22bbe","amazon-bedrock-agents-multi-agent-workflows-en","Amazon Bedrock Agents Gets Multi-Agent Workflows","2026-04-01T09:30:30.197685+00:00"]