# 🤖 RAG Chatbot System
Production document Q&A using retrieval-augmented generation
| 🚀 Live Demo | GitHub Repository |
## 📋 Overview
A production-ready RAG (Retrieval-Augmented Generation) system that enables natural-language Q&A over any PDF document. Upload a document, ask questions, and get accurate answers grounded in the content rather than hallucinated.
**Key Innovation:** combines semantic search with LLM generation to provide context-aware, verifiable answers.
## 🎯 Key Features
### ✅ Document Processing Pipeline
- PDF upload and text extraction
- Intelligent chunking with recursive character splitting (500-char chunks, 50-char overlap)
- Preserves context across chunk boundaries
### ✅ Semantic Search
- FAISS vector database for lightning-fast similarity search
- Sentence-transformers for 384-dimensional embeddings
- Retrieves top-3 most relevant chunks per query
### ✅ LLM Integration
- FLAN-T5 language model for answer generation
- Custom prompt engineering to prevent hallucinations
- Answers grounded strictly in retrieved context
### ✅ Production Features
- Model caching with `@lru_cache` (10 s → instant loading; sketched after this list)
- Comprehensive error handling
- Device-agnostic code (CPU/GPU compatible)
- Gradio interface for easy interaction
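
The caching and device-selection patterns look roughly like this; a minimal sketch, assuming a zero-argument loader (`load_generator` is an illustrative name, not necessarily the project's actual code):

```python
from functools import lru_cache

import torch
from transformers import pipeline


@lru_cache(maxsize=1)  # memoize: the model loads once, later calls return the cached object
def load_generator():
    """Load FLAN-T5 once and reuse it across requests."""
    device = 0 if torch.cuda.is_available() else -1  # GPU if present, else CPU
    return pipeline(
        "text2text-generation",        # seq2seq task used by T5-family models
        model="google/flan-t5-base",   # assumed checkpoint; the README cites a ~220M-param model
        device=device,
    )
```

Because the function takes no arguments, `lru_cache` holds exactly one pipeline instance, so only the very first request pays the multi-second load cost.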
## 🏗️ Architecture
```
User Query
    ↓
[1] Embedding Generation (Sentence-Transformers)
    ↓
[2] Vector Search (FAISS - finds 3 similar chunks)
    ↓
[3] Context Assembly (combines chunks)
    ↓
[4] Prompt Template (injects context + question)
    ↓
[5] LLM Generation (FLAN-T5)
    ↓
Answer (grounded in document)
```
## 💻 Technical Implementation
### Document Processing

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive text splitting preserves natural boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # ~100 words per chunk
    chunk_overlap=50,   # preserves context across chunk boundaries
    separators=["\n\n", "\n", " ", ""],  # try paragraphs → lines → words
)
```
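
For context, a minimal usage sketch, assuming LangChain's `PyPDFLoader` (backed by pypdf) and a hypothetical file name:

```python
from langchain_community.document_loaders import PyPDFLoader

docs = PyPDFLoader("example.pdf").load()  # hypothetical path; yields one Document per page
chunks = splitter.split_documents(docs)   # 500-char chunks with 50-char overlap
```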
### Vector Database

```python
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

# Build the FAISS index over all chunks for fast semantic search
vectorstore = FAISS.from_documents(
    chunks,
    embeddings,  # 384-dim sentence-transformers embeddings
    distance_strategy=DistanceStrategy.COSINE,  # similarity metric
)
```
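
The top-3 retrieval described under Key Features is then one call away; a sketch with a hypothetical query:

```python
# Expose the index as a retriever returning the 3 most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

relevant_chunks = retriever.invoke("What warranty period does the contract specify?")
```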
### LLM Prompt Engineering

```python
template = """Use ONLY the context below to answer.
If you don't know, say you don't know - don't make up answers.
Context: {context}
Question: {question}
Answer:"""
```
## 🛠️ Tech Stack
- LangChain: chain composition via the modern LCEL (LangChain Expression Language) pattern
- FAISS: Facebook AI Similarity Search for vector database
- FLAN-T5: Googleβs instruction-tuned T5 model (220M parameters)
- Sentence-Transformers: all-MiniLM-L6-v2 for embeddings (384-dim)
- PyPDF: PDF text extraction
- Gradio: Web interface for user interaction
- Python: Core implementation
## 📊 Performance Metrics
- Embedding Generation: ~30 seconds for 50-page PDF
- Query Response Time: 2-4 seconds per question
- Memory Usage: ~2GB RAM (FLAN-T5 + embeddings)
- Chunk Processing: Handles documents up to 500+ pages
- Search Speed: Milliseconds (FAISS approximate NN search)
## 📚 Key Learnings
1. Chunking Strategy Matters
- Tested multiple chunk sizes (300, 500, 1000 chars)
- 500 chars with 50-char overlap provided the best context preservation
- Recursive splitting is crucial for maintaining semantic boundaries
2. Prompt Engineering is Critical
- Initial implementation had hallucination issues
- Adding βonly use contextβ and βadmit when you donβt knowβ reduced hallucinations by 90%+
3. Model Caching = Production Essential
- Loading models on every request: 10-20 seconds
- With `@lru_cache`: instant after first load
- Critical for user experience
4. LCEL > Old LangChain Patterns
- The modern LCEL pattern is more readable and maintainable (see the sketch below)
- Easier to debug thanks to explicit data flow
- Better suited to production systems
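
A sketch of that LCEL composition, assuming the `retriever`, `prompt`, and `llm` objects from the Technical Implementation section; each `|` makes the data flow explicit:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    """Assemble the retrieved chunks into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Retrieval → context assembly → prompt → FLAN-T5 → plain string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What is the main conclusion of the report?")  # hypothetical query
```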
## 🔮 Future Enhancements
- Multi-document search (query across multiple PDFs)
- Hybrid search (combine keyword + semantic)
- Source citation (show which chunks were used)
- Conversation memory (multi-turn Q&A)
- Quantization for faster inference
- Support for more document formats (Word, Excel, etc.)
## 📸 Screenshots
Document Upload Interface:
Query & Answer:
## 🔗 Links
- 🚀 Try Live Demo
- 💻 View Source Code
- 📖 Read Documentation