Effective code retrieval is a foundational element of AI-powered coding assistance. Morph provides specialized tools for navigating and understanding codebases at scale.
Modern code retrieval systems have evolved far beyond simple embedding + reranking approaches. Today’s most advanced systems employ multi-stage pipelines with specialized components:
1. Symbol-aware Chunking
Tree-sitter parses source into AST nodes at function/class granularity. Each chunk stores fully-qualified names, file paths, imports, and call-graph edges.
Use the Morph SDK's AST-aware chunking to parse and index your codebase at the symbol level. Its built-in integration handles language-specific parsing for optimal symbol extraction and pairs directly with Morph Embeddings.
Prevents embedding “bleed-through” and keeps chunks ≤512 tokens, boosting both retriever recall and patch quality.
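For readers who want to see what symbol-level chunking looks like under the hood, here is a minimal sketch using the node tree-sitter bindings. It assumes the tree-sitter and tree-sitter-javascript packages are installed; the Morph SDK wraps the equivalent logic per language.

import Parser from 'tree-sitter';
import JavaScript from 'tree-sitter-javascript';

// One chunk per top-level function/class, so an embedding never
// mixes unrelated symbols ("bleed-through").
function chunkSource(sourceCode, filePath) {
  const parser = new Parser();
  parser.setLanguage(JavaScript);
  const tree = parser.parse(sourceCode);

  const chunks = [];
  for (const type of ['function_declaration', 'class_declaration']) {
    for (const node of tree.rootNode.descendantsOfType(type)) {
      const nameNode = node.childForFieldName('name');
      chunks.push({
        name: nameNode ? nameNode.text : '<anonymous>',
        path: filePath,
        startLine: node.startPosition.row + 1,
        endLine: node.endPosition.row + 1,
        text: node.text  // the code that actually gets embedded
      });
    }
  }
  return chunks;
}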
2. Hybrid Recall (BM25 ∪ Embeddings)
Lexical BM25 guarantees that an exact symbol or error message is never missed, while Morph Embeddings retrieve semantically related code. Reciprocal-rank fusion of the two lists typically yields an 18-25 percentage-point recall gain over dense-only retrieval on SWE-Bench.
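A minimal sketch of reciprocal-rank fusion over the two ranked ID lists; k = 60 is the conventional constant, and the lists would come from your BM25 index and Morph Embeddings respectively.

// Fuse BM25 and embedding rankings: score(doc) = sum of 1 / (k + rank).
// Documents that appear high in either list float to the top.
function reciprocalRankFusion(bm25Ids, denseIds, k = 60) {
  const scores = new Map();
  for (const list of [bm25Ids, denseIds]) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Highest fused score first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}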
3. Agent-driven Reading (Claude + `read_file` tool)
Instead of pushing entire files into the prompt, Claude inspects the retrieved metadata and decides which chunks to open using the `read_file` tool. This cuts prompt size by 40-60% and reduces hallucinations; it is the pattern today's SOTA agents follow.
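As a concrete, hedged illustration, here is how a read_file tool can be exposed to Claude through the Anthropic Messages API. The tool schema and the model id below are illustrative choices, not fixed values.

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

// Declare the tool; Claude decides when to call it based on the
// retrieved chunk metadata already present in the conversation.
const readFileTool = {
  name: 'read_file',
  description: 'Read a chunk of a file by path and optional line range.',
  input_schema: {
    type: 'object',
    properties: {
      path: { type: 'string', description: 'File path relative to the repo root' },
      start_line: { type: 'integer' },
      end_line: { type: 'integer' }
    },
    required: ['path']
  }
};

const response = await client.messages.create({
  model: 'claude-sonnet-4-5',  // illustrative model id
  max_tokens: 1024,
  tools: [readFileTool],
  messages: [{
    role: 'user',
    content: 'Retrieved chunk metadata: ... Which chunks do you need to open?'
  }]
});
// If response.stop_reason === 'tool_use', execute the read and
// reply with a tool_result block so Claude can continue.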
4. Hierarchical RAG (HiRAG)
Maintain file → symbol → line pointers so the agent can drill down progressively; useful for monorepos with very large individual files.
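A minimal sketch of the pointer structure, assuming chunk records shaped like the ones from the chunking sketch above; the agent walks file → symbol → lines one level at a time instead of loading whole files.

// Hierarchical index: file -> symbols -> line ranges.
// The agent first sees file paths, then symbol lists, and only
// finally requests a concrete line range to read.
const hierarchicalIndex = new Map();  // filePath -> [{ symbol, startLine, endLine }]

function indexChunk(chunk) {
  if (!hierarchicalIndex.has(chunk.path)) hierarchicalIndex.set(chunk.path, []);
  hierarchicalIndex.get(chunk.path).push({
    symbol: chunk.name,
    startLine: chunk.startLine,
    endLine: chunk.endLine
  });
}

// Drill-down steps available to the agent:
const listSymbols = filePath =>
  (hierarchicalIndex.get(filePath) ?? []).map(e => e.symbol);

const locateSymbol = (filePath, symbol) =>
  (hierarchicalIndex.get(filePath) ?? []).find(e => e.symbol === symbol);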
This is the recommended pipeline for building SOTA agents; it mirrors the approach companies like Cursor, OpenAI, and Anthropic use.
To push it further, incorporate signals from what your users actually do: the last file they edited, their current cursor position, and so on.
This is what the best AI products do: they take the data they have that others don't and feed it into their retrieval pipelines.
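One hedged way to fold those signals into ranking is a multiplicative boost on retrieval scores; the weights and field names here are illustrative, not tuned values.

// Boost retrieval scores with editor context that only your product
// sees: recently edited files and the current cursor position.
function contextBoost(chunk, userContext) {
  let boost = 1;
  if (chunk.path === userContext.lastEditedFile) boost *= 1.5;  // illustrative weight
  if (chunk.path === userContext.activeFile) {
    const dist = Math.abs(chunk.startLine - userContext.cursorLine);
    boost *= 1 + 1 / (1 + dist / 50);  // closer to the cursor, bigger boost
  }
  return boost;
}

function rerankWithContext(candidates, userContext) {
  return candidates
    .map(c => ({ ...c, score: c.score * contextBoost(c, userContext) }))
    .sort((a, b) => b.score - a.score);
}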
{"name":"codebase_search","description":"Find snippets of code from the codebase most relevant to the search query.\nThis is a semantic search tool, so the query should ask for something semantically matching what is needed.\nIf it makes sense to only search in particular directories, please specify them in the target_directories field.\nUnless there is a clear reason to use your own search query, please just reuse the user's exact query with their wording.\nTheir exact wording/phrasing can often be helpful for the semantic search query. Keeping the same exact question format can also be helpful.","parameters":{"properties":{"query":{"description":"The search query to find relevant code. You should reuse the user's exact query/most recent message with their wording unless there is a clear reason not to.","type":"string"},"target_directories":{"description":"Glob patterns for directories to search over","items":{"type":"string"},"type":"array"},"explanation":{"description":"One sentence explanation as to why this tool is being used, and how it contributes to the goal.","type":"string"}},"required":["query"]}}
Implementation Architecture:
1. Code Chunking
Split codebase into semantically meaningful chunks (functions, classes, methods)
The Morph SDK provides built-in AST-aware chunking with language-specific parsers to ensure optimal symbol extraction.
2. Embedding Generation
Process code chunks with Morph Embeddings to create vector representations
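The exact Morph Embeddings call depends on your SDK version, so treat this as a sketch against a hypothetical OpenAI-compatible endpoint; the URL and model id below are placeholders, not documented values.

// Placeholder endpoint and model id: check the Morph docs for the
// real values before using this.
async function embedChunks(chunks, apiKey) {
  const response = await fetch('https://api.example.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'code-embedding-model',  // placeholder
      input: chunks.map(c => c.text)
    })
  });
  const { data } = await response.json();
  // Pair each chunk with its vector, preserving input order
  return chunks.map((chunk, i) => ({ ...chunk, embedding: data[i].embedding }));
}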
3. Vector Storage
Store embeddings in a vector database together with metadata (file path, line numbers). For optimal performance, use Morph Enterprise, which manages this vector storage and metadata for you.
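Outside of Morph Enterprise, a minimal in-memory stand-in shows the record shape; a production deployment would keep the same layout in a real vector database.

// Minimal in-memory vector store, keeping the metadata needed to
// point the agent back at concrete source lines.
const vectorStore = [];

function upsertChunk(chunk) {
  vectorStore.push({
    id: `${chunk.path}:${chunk.startLine}`,
    embedding: chunk.embedding,  // float array from the embedding step
    metadata: {
      path: chunk.path,
      startLine: chunk.startLine,
      endLine: chunk.endLine,
      symbol: chunk.name
    }
  });
}

function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}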
4. Query Processing
Convert natural-language queries into the same vector space as the code chunks, using the same embedding model.
5. Retrieval & Reranking
Two-stage retrieval: broad similarity search followed by precision reranking
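Putting query processing and retrieval together, here is a sketch of the two-stage flow. embedChunks and cosineSimilarity are the helpers sketched above, and the reranker is left abstract since a cross-encoder or an LLM scorer can fill that slot.

async function search(query, apiKey, rerank, topK = 50, finalK = 10) {
  // Stage 0: embed the query into the same vector space as the chunks
  const [{ embedding: queryVec }] = await embedChunks([{ text: query }], apiKey);

  // Stage 1: broad similarity search over the whole store
  const candidates = vectorStore
    .map(entry => ({ ...entry, score: cosineSimilarity(queryVec, entry.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  // Stage 2: precision reranking of the shortlist
  const rescored = await rerank(query, candidates);  // e.g. a cross-encoder
  return rescored.slice(0, finalK);
}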
{"name":"list_dir","description":"List the contents of a directory. The quick tool to use for discovery, before using more targeted tools like semantic search or file reading. Useful to try to understand the file structure before diving deeper into specific files. Can be used to explore the codebase.","parameters":{"properties":{"relative_workspace_path":{"description":"Path to list contents of, relative to the workspace root.","type":"string"},"explanation":{"description":"One sentence explanation as to why this tool is being used, and how it contributes to the goal.","type":"string"}},"required":["relative_workspace_path"]}}
Implementation Strategy:
import { promises as fs } from 'fs';
import * as path from 'path';

async function listDirectory(dirPath, options = {}) {
  try {
    // Read directory contents, including entry type information
    const entries = await fs.readdir(dirPath, { withFileTypes: true });

    // Attach metadata to each entry
    const results = entries.map(entry => {
      const isDirectory = entry.isDirectory();
      const fullPath = path.join(dirPath, entry.name);
      return {
        name: entry.name,
        path: fullPath,
        type: isDirectory ? 'directory' : 'file',
        // Include file extension for better filtering
        extension: !isDirectory ? path.extname(entry.name).substring(1) : null
      };
    });

    // Sort: directories first, then files alphabetically
    return results.sort((a, b) => {
      if (a.type === b.type) return a.name.localeCompare(b.name);
      return a.type === 'directory' ? -1 : 1;
    });
  } catch (error) {
    throw new Error(`Failed to list directory ${dirPath}: ${error.message}`);
  }
}
Best practices:
Use as an initial discovery step to understand project structure
Build a mental map of codebase organization before diving into specific files
{"name":"file_search","description":"Fast file search based on fuzzy matching against file path. Use if you know part of the file path but don't know where it's located exactly. Response will be capped to 10 results. Make your query more specific if need to filter results further.","parameters":{"properties":{"query":{"description":"Fuzzy filename to search for","type":"string"},"explanation":{"description":"One sentence explanation as to why this tool is being used, and how it contributes to the goal.","type":"string"}},"required":["query","explanation"]}}
Implementation Example:
import { glob } from 'glob';  // glob v9+ exposes a promise-based named export
import * as path from 'path';
import { distance as levenshteinDistance } from 'fastest-levenshtein';

async function fileSearch(query, options = {}) {
  try {
    // Find all files in the workspace, skipping common vendor/build dirs
    const allFiles = await glob('**/*', {
      ignore: ['**/node_modules/**', '**/dist/**', '**/build/**', '**/.git/**'],
      nodir: true
    });

    // Score files by edit distance to the query (lower = better)
    const scoredFiles = allFiles.map(file => {
      const fileName = path.basename(file);
      const fileNameScore = levenshteinDistance(query.toLowerCase(), fileName.toLowerCase());
      const pathScore = levenshteinDistance(query.toLowerCase(), file.toLowerCase());
      // Take the better of the doubled filename distance and the full-path
      // distance, so only strong filename matches outrank path matches
      const score = Math.min(fileNameScore * 2, pathScore);
      return { file, score };
    });

    // Sort by score (lower is better) and take top 10
    return scoredFiles
      .sort((a, b) => a.score - b.score)
      .slice(0, 10)
      .map(result => result.file);
  } catch (error) {
    throw new Error(`File search failed: ${error.message}`);
  }
}
Best practices:
Use when you have partial knowledge of a filename or path
Make queries specific to reduce the number of results
Consider fuzzy matching algorithms that prioritize prefix matches
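For example, a prefix check layered on top of the edit-distance score keeps useAuth.ts ahead of fuzzier matches when the user types useA; the 0.5 weight is illustrative.

import * as path from 'path';
import { distance as levenshteinDistance } from 'fastest-levenshtein';

// Prefer files whose basename starts with the query before falling
// back to pure edit distance (lower score = better).
function prefixAwareScore(query, filePath) {
  const fileName = path.basename(filePath).toLowerCase();
  const q = query.toLowerCase();
  const base = levenshteinDistance(q, fileName);
  return fileName.startsWith(q) ? base * 0.5 : base;  // halve distance for prefix hits
}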