
Building Production-Ready RAG Applications: Architecture, Vector Search, and Performance
A practical look at the building blocks of a production-grade RAG system, from document processing and vector search to streaming responses, performance tuning, and evaluation.
Aug 14, 2025
10 min read
written by Utkarsh Khanna

Retrieval-Augmented Generation (RAG) is transforming how modern applications access and use knowledge. Instead of relying solely on a model’s training data, RAG systems bring in relevant, fresh, and domain-specific context at runtime—bridging the gap between static LLMs and real-world intelligence.
This guide unpacks the building blocks of a production-grade RAG system—from architecture decisions and document processing to vector search, performance tuning, and evaluation metrics. Whether you’re creating an internal research assistant, customer support bot, or AI-driven PDF tool, this foundation will help you ship smarter, faster, and more reliable RAG applications.
RAG stands for Retrieval-Augmented Generation, a paradigm where a language model is enhanced with dynamic, relevant context from a connected knowledge base. Instead of asking the model to “remember everything,” RAG systems retrieve information on-the-fly to ground the response.
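At its core, the loop is: embed the query, retrieve the nearest chunks, and prompt the model with them. The sketch below wires together helpers that are defined later in this guide; it is a minimal outline, not the full application.
// End-to-end RAG loop (generateEmbeddings, searchSimilarChunks, and
// streamRAGResponse are defined later in this guide)
async function answerWithRAG(query) {
const [{ embedding }] = await generateEmbeddings([{ content: query }]);
const chunks = await searchSimilarChunks(embedding);
let answer = '';
for await (const token of streamRAGResponse(query, chunks)) {
answer += token;
}
return answer;
}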
Why it matters: retrieval grounds answers in verifiable sources, which reduces hallucination; it keeps responses current without retraining; and it lets a general-purpose model work with private, domain-specific data.
Common use cases include the four patterns below, all of which show up in the application this guide describes.

PDF insights: Users upload PDFs and instantly receive AI-powered summaries, bullet-point insights, and extracted metadata. This saves time and cognitive load during legal reviews, research reading, or market analysis.
Conversational document Q&A: Static documents become interactive. Users can ask natural-language questions and receive contextual answers from within the document, navigating complex reports conversationally.
Real-time streaming: Thanks to OpenAI’s streaming APIs and frameworks like Next.js 14, responses flow in real time, boosting perceived performance and keeping the interaction smooth.
Graceful failure handling: Even the best systems face interruptions. Production-grade RAG apps monitor connectivity and inform users gracefully if the chat fails or the connection is lost (see the connectivity sketch below).
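A browser-side connectivity watcher for this can be tiny. The sketch below is framework-agnostic and uses only standard browser events; the helper name is illustrative.
// Minimal connectivity watcher (illustrative helper, standard browser APIs)
function watchConnectivity(onChange) {
const update = () => onChange(navigator.onLine);
window.addEventListener('online', update);
window.addEventListener('offline', update);
update(); // report the initial state immediately
return () => {
window.removeEventListener('online', update);
window.removeEventListener('offline', update);
};
}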

The application leverages Next.js 14 with the App Router, which brings several architectural advantages: React Server Components keep heavy processing on the server and shrink the client bundle, Route Handlers provide a natural home for the upload and chat API endpoints, and built-in streaming support pairs well with token-by-token responses.
The UI layer uses Tailwind CSS with a custom theme configuration supporting both light and dark modes. This approach ensures consistent styling while maintaining flexibility for future design iterations.
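For reference, a minimal configuration along those lines might look like this; the file contents are illustrative, not the app's actual config.
// tailwind.config.js -- illustrative configuration
module.exports = {
darkMode: 'class', // dark mode toggled by a class on the root element
content: ['./app/**/*.{js,jsx,ts,tsx}'],
theme: {
extend: {} // custom theme tokens would be added here
}
};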

The heart of any RAG system lies in its document processing pipeline. Our implementation follows a sophisticated multi-stage approach:
When users upload PDFs, the system first extracts text content while preserving document structure. This involves parsing each page for raw text, keeping paragraph and heading boundaries intact, and capturing metadata such as the file name and page count for later citation. A minimal extraction step is sketched below.
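As an illustration, extraction could use a library such as pdf-parse; the package choice and helper name here are assumptions, not necessarily what the production pipeline uses.
// Minimal PDF text extraction sketch (assumes the pdf-parse npm package)
const pdfParse = require('pdf-parse');
const fs = require('fs');

async function extractPdfText(filePath) {
const buffer = fs.readFileSync(filePath);
const data = await pdfParse(buffer);
// data.text holds the concatenated text; data.numpages and data.info
// provide metadata worth storing alongside each chunk
return {
text: data.text,
pageCount: data.numpages,
metadata: data.info
};
}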
We use semantic chunking strategies:
// Intelligent chunking strategy: fixed-size sliding window with overlap
function chunkDocument(text, options = {}) {
const {
maxChunkSize = 1000,
overlap = 200,
preserveStructure = true // hook for paragraph-aware splitting
} = options;
const chunks = [];
const step = maxChunkSize - overlap;
for (let i = 0; i < text.length; i += step) {
// each window re-reads 'overlap' characters from the previous chunk
chunks.push({ content: text.slice(i, i + maxChunkSize) });
}
return chunks;
}
Effective chunking strategies include fixed-size windows with overlap (as above), sentence- and paragraph-aware splitting that avoids cutting a thought mid-stream, structure-aware chunking that respects headings and sections, and semantic chunking that splits where the embedding similarity between adjacent passages drops.
Each document chunk is converted to high-dimensional vector representations using OpenAI's embedding models. These embeddings capture semantic meaning, enabling similarity-based retrieval:
// assumes an initialized client: import OpenAI from 'openai'; const openai = new OpenAI();
async function generateEmbeddings(chunks) {
const embeddings = await Promise.all(
chunks.map(chunk => openai.embeddings.create({
model: "text-embedding-3-small",
input: chunk.content
}))
);
return embeddings.map((embedding, index) => ({
...chunks[index],
embedding: embedding.data[0].embedding
}));
}
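One request per chunk works, but the embeddings endpoint also accepts an array of inputs and returns vectors in the same order, so a batched variant cuts request overhead considerably:
// Batched variant: one request for many chunks (the API preserves input order)
async function generateEmbeddingsBatched(chunks) {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: chunks.map(chunk => chunk.content)
});
return response.data.map((item, index) => ({
...chunks[index],
embedding: item.embedding
}));
}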
We store vectors in MongoDB Atlas with vector search enabled, alongside metadata.
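A stored chunk document might then look like the following; the field names are illustrative, not a fixed schema.
// Example chunk document in the 'documents' collection (fields illustrative)
{
content: "…chunk text…",
embedding: [0.012, -0.094 /* …1536 floats in total… */],
metadata: {
fileName: "report.pdf",
page: 12,
uploadedAt: ISODate("2025-08-14T00:00:00Z")
}
}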
The most technically challenging aspect of RAG systems is implementing efficient vector search. Our MongoDB-based approach includes several optimizations:

// Create the Atlas Vector Search index (mongosh syntax)
db.documents.createSearchIndex(
"vector_index",
"vectorSearch",
{
fields: [
{
type: "vector",
path: "embedding",
numDimensions: 1536, // matches text-embedding-3-small
similarity: "cosine"
}
]
}
);
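Atlas builds search indexes asynchronously, so it is worth confirming the index is queryable before issuing searches:
// Returns index metadata; wait for status "READY" before querying
db.documents.getSearchIndexes("vector_index");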
Vector queries are optimized for both accuracy and performance:
async function searchSimilarChunks(queryEmbedding, options = {}) {
const {
limit = 5,
threshold = 0.7,
filters = null
} = options;
const vectorStage = {
index: "vector_index",
path: "embedding",
queryVector: queryEmbedding,
numCandidates: limit * 10, // over-fetch candidates for better recall
limit
};
// $vectorSearch filters only apply to fields indexed with type "filter"
if (filters) vectorStage.filter = filters;
return await db.collection('documents').aggregate([
{ $vectorSearch: vectorStage },
// expose the similarity score so results can be thresholded
{ $addFields: { score: { $meta: "vectorSearchScore" } } },
{ $match: { score: { $gte: threshold } } }
]).toArray();
}
Advanced RAG systems combine vector search with traditional text search for improved accuracy: semantic similarity catches paraphrases that keyword matching misses, while full-text (BM25-style) search catches exact terms, codes, and names that embeddings can blur. The two ranked lists are then merged, as sketched below.
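One common merging technique is reciprocal rank fusion (RRF); the helper below is a generic sketch, independent of any particular driver.
// Reciprocal rank fusion: merge ranked lists from vector and text search.
// Each input is an ordered array of documents carrying an _id field.
function reciprocalRankFusion(resultLists, k = 60) {
const scores = new Map();
for (const results of resultLists) {
results.forEach((doc, rank) => {
const entry = scores.get(String(doc._id)) || { doc, score: 0 };
entry.score += 1 / (k + rank + 1); // earlier rank -> larger contribution
scores.set(String(doc._id), entry);
});
}
return [...scores.values()]
.sort((a, b) => b.score - a.score)
.map(entry => entry.doc);
}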

Production RAG applications require careful optimization across multiple dimensions.
Document processing can be memory-intensive, especially for large PDFs. A key optimization is processing chunks in batches, so only one batch's embeddings are held in memory at a time, and persisting each batch before the next is built:
async function processDocumentInBatches(document, batchSize = 50) {
const chunks = chunkDocument(document);
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const embeddings = await generateEmbeddings(batch);
await storeBatch(embeddings); // persist before building the next batch
// hint the collector between batches (only available with --expose-gc)
if (global.gc) global.gc();
}
}
Response streaming improves user experience significantly:
async function* streamRAGResponse(query, retrievedChunks) {
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{
role: "system",
content: buildSystemPrompt(retrievedChunks)
},
{
role: "user",
content: query
}
],
stream: true
});
for await (const chunk of completion) {
if (chunk.choices[0]?.delta?.content) {
yield chunk.choices[0].delta.content;
}
}
}
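To expose this in the Next.js app, the generator can be wrapped in a ReadableStream inside an App Router route handler. The route path and retrieval helper below are illustrative, not the app's actual names.
// app/api/chat/route.js -- illustrative route path
export async function POST(request) {
const { query } = await request.json();
const chunks = await retrieveChunksForQuery(query); // hypothetical retrieval helper
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
for await (const token of streamRAGResponse(query, chunks)) {
controller.enqueue(encoder.encode(token));
}
controller.close();
}
});
return new Response(stream, {
headers: { "Content-Type": "text/plain; charset=utf-8" }
});
}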
Production RAG systems require comprehensive evaluation frameworks covering three layers: retrieval quality (precision and recall at k, mean reciprocal rank), generation quality (faithfulness to the retrieved context, answer relevance), and system behavior (latency, token cost, failure rates).
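As a concrete starting point, retrieval recall@k can be computed directly from labeled query/relevant-chunk pairs; a simple sketch:
// recall@k: fraction of the labeled relevant chunks that appear
// in the top-k retrieved results for a query
function recallAtK(retrievedIds, relevantIds, k = 5) {
const topK = new Set(retrievedIds.slice(0, k));
const hits = relevantIds.filter(id => topK.has(id)).length;
return relevantIds.length === 0 ? 0 : hits / relevantIds.length;
}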
Looking ahead, three directions stand out:
Multimodal RAG: extending beyond text to include images, tables, and other media types in document understanding and retrieval.
Agentic RAG: integrating retrieval with autonomous agents that can reason about when and how to retrieve information, potentially querying multiple knowledge bases.
Specialized retrievers: moving beyond general-purpose embedding models to domain-specific, fine-tuned retrievers that better understand specialized terminology and concepts.
Building production-ready RAG applications requires careful attention to architecture, performance, and user experience. The combination of intelligent document processing, efficient vector search, and optimized generation creates powerful systems that can transform how users interact with information.
The key to successful RAG implementation lies in understanding the specific requirements of your use case, choosing appropriate technologies, and continuously optimizing based on user feedback and performance metrics. As the field continues advancing, RAG applications will become increasingly sophisticated, enabling even more natural and effective human-document interactions.
Whether you're building a PDF insight tool, a customer support system, or a research assistant, the principles and techniques outlined in this guide provide a solid foundation for creating RAG applications that deliver real value to users while maintaining the performance and reliability required for production environments.

Software Engineer
Utkarsh is a mid-level engineer with strong experience in networking and server-side technologies.