About This Offering

This intensive workshop provides a technical foundation in Retrieval-Augmented Generation (RAG) using a "Primitives-First" methodology. We reject the "magic wrapper" approach in favor of a progressive architecture.

Registration

  • Registration: Open until March 29, 2026
  • Course Dates: April 7 & 9, 2026
  • PDH: 16
  • Price: $1,249
  • Location: 2127 Innerbelt Business Center Drive, St. Louis, MO 63114
    • Also delivered live online via Zoom
  • Required:
    • Access to laptop computer
    • Basic Python literacy (can read and modify code)
    • Willingness to learn by doing
    • Google account (for Colab access)
  • Helpful but Not Required:
    • Familiarity with numpy/pandas
    • Understanding of machine learning concepts
    • Experience with APIs or transformers library
  • Not Required:
    • Deep learning expertise
    • NLP background
    • Information retrieval knowledge
    • Production deployment experience
    • Paid API keys or services (everything is open-source!)
  • This bootcamp is ideal for:
    • Software engineers exploring AI/ML applications
    • Data scientists integrating LLMs into workflows
    • Technical leaders evaluating RAG solutions
    • Researchers in information retrieval and NLP
    • Anyone building document-based AI systems
  • This bootcamp is especially valuable if you:
    • Want to understand RAG internals, not just use high-level APIs
    • Are building custom RAG solutions requiring fine-grained control
    • Want hands-on experience with both primitives and frameworks

The workshop's progressive architecture unfolds over two days:

Day 1 — The Mechanics (Build from Scratch):
 You will build retrieval engines and RAG pipelines using raw Python libraries (numpy, FAISS, transformers). Every line of code will be transparent. You'll work with the Vaswani dataset (11,429 IR research abstracts with 93 queries) to understand how semantic search actually works—no frameworks, no abstraction layers.

Day 2 — The Production (Orchestrate at Scale):
 You will scale to production using LlamaIndex on the BEIR Programmers benchmark (32K StackExchange programming posts with 876 queries). You'll see how frameworks simplify complexity while building a technical Q&A system that answers real programming questions.

Technology Stack

We use a specific toolset for each stage of learning:

Day 1: The Primitives (White-Box)

Purpose: Understand every component by building from scratch

  • Embeddings: sentence-transformers (all-MiniLM-L6-v2)
  • Vector Operations: numpy for manual similarity calculations
  • Indexing: FAISS for vector search
  • LLM: Llama 3.2 (3B) via transformers library
  • Evaluation: PyTerrier for IR metrics (MRR, NDCG, MAP)
  • Dataset: Vaswani corpus
    • 11,429 scientific abstracts from information retrieval research
    • 93 queries with complete relevance judgments
    • Built into PyTerrier (zero setup)
    • Domain: Learning RAG by searching IR research papers

Day 2: The "Production" Stack

Purpose: Scale to production with orchestration frameworks

  • Orchestration: LlamaIndex for pipeline management
  • LLM Integration: Same Llama 3.2 through LlamaIndex
  • Data Loading: ir_datasets (standard IR benchmark library)
  • Dataset: BEIR CQADupStack Programmers
    • 32,000 StackExchange programming posts
    • 876 queries (real programming questions)
    • 1,700+ relevance judgments (qrels)
    • Domain: Technical Q&A from StackExchange
    • Zero preparation needed (built into ir_datasets)

Day 1: Retrieval Architectures (No Frameworks)

Goal: Build a RAG system from scratch to understand the internal mechanics.

09:00 - 09:30 | Welcome & The "Watch It Fail" Demo

  • The Hook: Live hallucination demonstration with Llama 3.2
    • Watch the model confidently fabricate citations and facts
    • Understand viscerally why RAG is necessary
  • The Solution: Why RAG exists—limitations of parametric knowledge
  • Roadmap: Build Manually (Day 1) → Scale with Frameworks (Day 2)
  • Two-day overview: Vaswani → BEIR Programmers, Primitives → Frameworks

09:30 - 10:30 | Lecture: LLM Fundamentals & Setup

  • Theory: How LLMs work
    • Tokenization and next-token prediction
    • Context windows and attention mechanisms
    • Why hallucination happens
    • The need for external knowledge
  • Open-source vs. Proprietary: Cost, privacy, control, learning benefits
  • Lab Setup:
    • Colab GPU verification and configuration
    • Loading Llama 3.2 (3B parameters)
    • First generation test (observe hallucination)
  • Dataset Introduction: The Vaswani Corpus
    • What it contains: 11,429 IR research abstracts
    • Why it's perfect for learning: built-in, scientific, manageable
    • Loading with PyTerrier (one line of code!)
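To make the lab setup concrete, here is a minimal sketch of the Colab steps above. It assumes the meta-llama/Llama-3.2-3B-Instruct checkpoint on Hugging Face (the workshop notebooks may point at a different Llama 3.2 variant) and uses PyTerrier's built-in Vaswani dataset handle:

```python
# Sketch only: verify the Colab GPU, load Llama 3.2, and grab the Vaswani corpus.
import torch
import pyterrier as pt
from transformers import AutoModelForCausalLM, AutoTokenizer

print("GPU available:", torch.cuda.is_available())   # Colab GPU verification

model_id = "meta-llama/Llama-3.2-3B-Instruct"         # assumed checkpoint (gated; requires HF access)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# First generation test: no retrieval yet, so expect confident-sounding fabrication.
prompt = "List three papers that introduced retrieval-augmented generation, with authors and years."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=200)[0], skip_special_tokens=True))

# The Vaswani corpus ships with PyTerrier: one call returns the dataset handle.
if not pt.started():
    pt.init()
vaswani = pt.get_dataset("vaswani")
print(vaswani.get_topics().head())   # the 93 queries
print(vaswani.get_qrels().head())    # the relevance judgments
```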

10:30 - 11:00 | Break

11:00 - 12:30 | Lab 1: Vector Spaces & Embeddings

Learning Objective: Understand semantic similarity by working directly with vectors.

Activities:

  • Load the Vaswani dataset (11,429 abstracts, 93 queries)
  • Encode text into numerical vectors using sentence-transformers
  • Calculate cosine similarity manually with numpy
  • Visualize why semantically similar documents rank higher
  • Explore the vector space with concrete examples
  • See how "meaning" becomes "mathematics"
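A minimal sketch of the core idea in this lab, using the all-MiniLM-L6-v2 model from the Day 1 stack (the texts here are illustrative, not drawn from the Vaswani corpus):

```python
# Sketch: encode text, then compute cosine similarity by hand with numpy.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # produces 384-dimensional embeddings

docs = [
    "Methods for clustering documents by topical similarity.",
    "Measurement of thermal expansion in metal alloys.",
]
query = "grouping texts that discuss the same subject"

doc_vecs = model.encode(docs)     # shape (2, 384)
query_vec = model.encode(query)   # shape (384,)

def cosine(a, b):
    # dot product of the vectors divided by the product of their lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for doc, vec in zip(docs, doc_vecs):
    print(f"{cosine(query_vec, vec):.3f}  {doc}")
# The clustering text should score noticeably higher than the metallurgy one:
# semantically similar text ends up close together in the vector space.
```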

Key Concepts:

  • Embeddings as dense representations of meaning
  • Cosine similarity for measuring semantic closeness
  • High-dimensional vector spaces

Deliverable: Deep intuition of how text becomes searchable mathematics.

12:30 - 01:30 | Lunch

01:30 - 02:45 | Lab 2: Building the Dense Retriever

Learning Objective: Build a vector search engine from scratch and evaluate it.

Activities:

  • Encode all 11,429 Vaswani abstracts (~5 minutes)
  • Build a FAISS index manually—no frameworks, you see every step
    • Initialize index with correct dimensionality
    • Normalize vectors for cosine similarity
    • Add embeddings to index
  • Retrieve top-100 documents for all 93 queries
  • Evaluate your retriever using PyTerrier:
    • MRR@10 (Mean Reciprocal Rank): How quickly do you find relevant docs?
    • NDCG@10 (Normalized Discounted Cumulative Gain): Quality of ranking
    • MAP (Mean Average Precision): Overall precision across all queries
    • Recall@100: Coverage of relevant documents
  • Compare your scores to baseline systems
  • Analyze: Which queries work well? Which fail? Why?
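A sketch of the indexing and search steps above, with placeholder arrays standing in for the real Vaswani and query embeddings:

```python
# Sketch: build a transparent FAISS index and retrieve top-100 documents per query.
import numpy as np
import faiss

# Placeholders for the real sentence-transformer encodings from Lab 1:
doc_embeddings = np.random.rand(11429, 384).astype("float32")    # one row per abstract
query_embeddings = np.random.rand(93, 384).astype("float32")     # one row per query

dim = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)        # exact inner-product index with the correct dimensionality
faiss.normalize_L2(doc_embeddings)    # unit-length vectors: inner product == cosine similarity
index.add(doc_embeddings)             # add all document embeddings

faiss.normalize_L2(query_embeddings)
scores, ranked_ids = index.search(query_embeddings, 100)   # top-100 per query
print(ranked_ids.shape)   # (93, 100): row i holds the ranked document positions for query i
# These rankings are what you hand to PyTerrier for MRR@10, NDCG@10, MAP, and Recall@100.
```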

Checkpoint: You should achieve approximately:

  • MRR@10 ≈ 0.25
  • NDCG@10 ≈ 0.35
  • MAP ≈ 0.20

Key Understanding: See where dense retrieval excels (semantic/conceptual queries) and where it struggles (entity/name queries).

Deliverable: A transparent, working search engine with evaluation metrics—you built every component.

02:45 - 03:15 | Break

03:15 - 05:00 | Lab 3: The Manual RAG Pipeline

Learning Objective: Orchestrate the "Retrieve → Augment → Generate" flow manually.

Activities:

  • Connect your FAISS retriever to Llama 3.2
  • Format retrieved documents into context (manual string manipulation)
  • Experiment with four prompt strategies:
    1. Baseline prompt: No context provided
      • Result: Observe hallucination and fabrication
    2. Context injection: Add retrieved documents
      • Result: Better, but still sometimes drifts from sources
    3. Grounded prompting: "Based ONLY on the provided sources..."
      • Result: Much better adherence to sources
    4. With citations: Add explicit citation requirements
      • Result: Best—grounded AND attributed
  • Manually parse and validate citations from responses
  • Generate answers for 10 Vaswani queries
  • Compare quality across all four prompt strategies
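Here is a sketch of the grounded, citation-style prompting (strategies 3 and 4 above) wired to Llama 3.2 through the transformers pipeline. The prompt wording and the stand-in passages are illustrative; in the lab the passages come from your FAISS retriever, and the model id is the same assumed checkpoint as in the morning setup:

```python
# Sketch: manual Retrieve -> Augment -> Generate with a grounded, citation-style prompt.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct",
                     device_map="auto")

retrieved_docs = [   # illustrative stand-ins for the top FAISS hits
    {"docno": "1042", "text": "Probabilistic models rank documents by estimated relevance..."},
    {"docno": "8873", "text": "Term weighting schemes approximate these relevance estimates..."},
]
question = "How do probabilistic models rank documents?"

# Manual context formatting: number the sources so the model can cite them.
context = "\n\n".join(f"[{i + 1}] {d['text']}" for i, d in enumerate(retrieved_docs))
prompt = (
    "Answer the question based ONLY on the provided sources. "
    "Cite sources as [1], [2], ... after each claim. "
    "If the sources do not contain the answer, say so.\n\n"
    f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
)

answer = generator(prompt, max_new_tokens=300, return_full_text=False)[0]["generated_text"]
print(answer)

# Manual citation parsing: which numbered sources does the answer actually reference?
print("Cited sources:", sorted(set(re.findall(r"\[(\d+)\]", answer))))
```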

Analysis Activities:

  • Measure hallucination rate for each prompt strategy
  • Identify which prompt elements matter most
  • Understand the dramatic impact of prompt engineering on RAG quality

Key Learning: Prompt design is absolutely critical for RAG performance. Small changes in instructions produce dramatically different results.

Deliverable: A complete RAG system where you control every component and understand exactly how each piece affects output quality.

Day 2: Production Architecture (LlamaIndex)

Goal: Scale to production complexity using orchestration frameworks on a larger, more challenging dataset.

09:00 - 09:30 | Day 1 Recap + Scaling to Production

Review Yesterday:

  • Built retrieval manually on 11K documents (Vaswani)
  • Saw every component: embeddings, indexing, retrieval, prompting
  • Understand the mechanics deeply

Today's Challenge:

  • Scale from 11K → 32K documents
  • Shift from academic abstracts → real-world technical Q&A
  • Use production frameworks (LlamaIndex) to handle complexity
  • Build a system that answers actual programming questions

Introduction to BEIR:

  • BEIR = Benchmarking IR (a benchmark suite widely used in research)
  • 18 different datasets across domains
  • We'll use: CQADupStack/Programmers
    • 32,000 StackExchange programming posts
    • Real questions from programmers
    • 876 test queries with relevance judgments

Why This Dataset:

  • Relevant domain (you're likely programmers!)
  • Real-world content (not sanitized academic text)
  • Interesting questions you can relate to
  • Zero preparation needed (built into ir_datasets)

09:30 - 10:30 | Lecture: Introduction to LlamaIndex

The Core Question: "Why use frameworks when we can build manually?"

Comparing Approaches:

Yesterday's Manual Approach (50+ lines):

  • Load documents manually
  • Encode with sentence-transformers
  • Build FAISS index step-by-step
  • Write retrieval logic
  • Format context manually
  • Craft prompts
  • Parse responses

Today with LlamaIndex (10 lines):

  • Load documents
  • Create index (LlamaIndex handles embedding + indexing)
  • Create query engine (handles retrieval + context + prompting)
  • Query and get response
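Those roughly ten lines look something like the sketch below. It assumes recent llama-index releases with the llama-index-embeddings-huggingface and llama-index-llms-huggingface integrations installed; module paths differ across versions:

```python
# Sketch: the pipeline you built by hand yesterday, expressed through LlamaIndex.
from llama_index.core import Settings, VectorStoreIndex, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Point LlamaIndex at local open-source models instead of its hosted defaults.
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Settings.llm = HuggingFaceLLM(model_name="meta-llama/Llama-3.2-3B-Instruct",
                              tokenizer_name="meta-llama/Llama-3.2-3B-Instruct")

documents = [   # illustrative documents; Lab 4 loads the real BEIR posts
    Document(text="An interface declares behavior only; an abstract class may also carry state."),
    Document(text="Prefer composition over inheritance for flexible designs."),
]

index = VectorStoreIndex.from_documents(documents)          # chunking + embedding + indexing
query_engine = index.as_query_engine(similarity_top_k=2)    # retrieval + context packing + prompting
print(query_engine.query("What is the difference between an abstract class and an interface?"))
```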

What LlamaIndex Abstracts:

  • Document loading and parsing
  • Automatic chunking strategies
  • Embedding and indexing orchestration
  • Retrieval with configurable parameters
  • Context window management
  • Response synthesis from multiple sources
  • Citation tracking

What You Still Control:

  • Choice of LLM and embedding model
  • Custom prompt templates
  • Retrieval parameters (top-k, similarity threshold)
  • Response modes (compact, tree_summarize, etc.)
  • Evaluation and optimization

When to Use Frameworks:

  • Standard RAG use cases (your own documents)
  • Rapid prototyping and iteration
  • Production deployments (tested, maintained code)
  • When you need features like streaming, callbacks, agents

When to Stay Low-Level:

  • Custom requirements (special formats, unusual workflows)
  • Performance-critical applications (need fine-grained control)
  • Research (exploring novel approaches)
  • Debugging complex issues (need to see everything)

Key Insight: After Day 1, you understand what LlamaIndex is doing. You're not using "magic"—you know the mechanics underneath.

10:30 - 11:00 | Break

11:00 - 12:30 | Lab 4: Building Retrieval with LlamaIndex

Learning Objective: Use LlamaIndex to build retrieval at scale, understanding what it abstracts from your Day 1 manual work.

Activities:

  1. Load BEIR CQADupStack/Programmers:

  • Use ir_datasets library (standard for IR research)
  • Load 32K StackExchange posts
  • Load 876 test queries
  • Load relevance judgments (qrels)
  2. Explore the data:

  • Examine sample posts (questions + answers)
  • Look at code snippets, technical terminology
  • Review sample queries (real programming questions)
  • Examples you'll see:
    • "How to debug memory leaks in C++?"
    • "What is the difference between abstract class and interface?"
    • "Best practices for REST API design"
  3. Convert to LlamaIndex format:

  • Transform StackExchange posts into LlamaIndex Documents
  • Preserve metadata (document IDs for evaluation)
  4. Build index with LlamaIndex (see the sketch after this list):

  • One function call handles: chunking, embedding, indexing
  • Takes ~10-15 minutes for 32K documents
  • Compare to Day 1: same result, much less code
  5. Test retrieval:

  • Query with sample programming questions
  • Examine retrieved posts
  • Verify relevant content is returned
  6. Evaluate retrieval quality:

  • Calculate MRR@10, NDCG@10 using official qrels
  • Compare to published baselines (if available)
  • Analyze which types of queries work well
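A sketch of steps 1, 3, and 4, using the dataset id as registered in ir_datasets; field names follow the BEIR CQADupStack schema exposed by ir_datasets, and the embedding/LLM settings are assumed to be configured as in the earlier LlamaIndex sketch:

```python
# Sketch: load the BEIR posts, convert them to LlamaIndex Documents, and build the index.
import ir_datasets
from llama_index.core import VectorStoreIndex, Document

dataset = ir_datasets.load("beir/cqadupstack/programmers")

# StackExchange posts -> LlamaIndex Documents, preserving the BEIR doc_id for evaluation.
documents = [
    Document(text=f"{doc.title}\n\n{doc.text}", doc_id=doc.doc_id)
    for doc in dataset.docs_iter()
]
queries = {q.query_id: q.text for q in dataset.queries_iter()}   # the 876 test queries
qrels = list(dataset.qrels_iter())                               # relevance judgments

index = VectorStoreIndex.from_documents(documents)   # ~10-15 minutes for the 32K posts
retriever = index.as_retriever(similarity_top_k=10)

sample_query = next(iter(queries.values()))
for hit in retriever.retrieve(sample_query)[:3]:
    print(round(hit.score, 3), hit.node.ref_doc_id)
```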

Reflection Questions:

  • How does LlamaIndex's retrieval compare to your Day 1 FAISS retriever?
  • What's different about retrieving from technical posts vs. academic abstracts?
  • Can you identify where LlamaIndex is doing what you did manually yesterday?

Checkpoint: Everyone should have:

  • Working LlamaIndex retriever over 32K documents
  • Evaluation metrics calculated
  • Understanding of retrieval quality on this dataset

Deliverable: Production-scale retriever built with framework, evaluated with standard metrics.

12:30 - 01:30 | Lunch

01:30 - 02:30 | Lecture: Generation with LlamaIndex

RAG for Technical Q&A:

Unique Challenges:

  • Code snippets: Need to preserve formatting, syntax
  • Technical terminology: Must be precise and accurate
  • Multiple answers: StackExchange posts have various solutions
  • Context length: Posts can be long, need smart selection
  • Synthesis: Combine insights from multiple posts

Prompt Engineering for Technical Content:

  • Clear instructions for handling code
  • Emphasis on accuracy and precision
  • Instructions for synthesizing multiple sources
  • Handling cases where posts contradict each other

LlamaIndex Query Engines:

  • Retriever mode: Just get documents
  • Query engine mode: Retrieve + synthesize answer
  • Response modes:
    • compact: Best for focused questions
    • tree_summarize: Best for synthesis across many documents
    • refine: Iterative refinement of answer

Configuring for Programming Q&A:

  • Custom prompt templates
  • Appropriate top-k (number of posts to retrieve)
  • Response mode selection
  • Citation configuration

Example Use Cases:

  • Debug help: "Why is my code throwing NullPointerException?"
  • Concept explanation: "What's the difference between processes and threads?"
  • Best practices: "How should I structure a REST API?"

02:30 - 03:00 | Lab 5: End-to-End RAG with LlamaIndex

Learning Objective: Build a complete RAG pipeline that answers programming questions using retrieved StackExchange posts.

Activities:

  1. Configure LlamaIndex with Llama 3.2:
  • Set up the LLM connection
  • Choose embedding model (same as Day 1 for consistency)
  2. Create custom prompt template:
  • Design prompt specifically for programming Q&A
  • Include instructions for:
    • Using retrieved StackExchange posts
    • Preserving code formatting
    • Synthesizing multiple answers
    • Being precise with technical terms
    • Citing source posts
  3. Build query engine (see the sketch after this list):
  • Connect retriever to LLM
  • Configure response mode
  • Set retrieval parameters (top-k)
  4. Generate answers for test queries:
  • Pick 10 diverse programming questions
  • Generate answers using your RAG system
  • Review outputs manually
  5. Quality assessment:
  • Accuracy: Is technical information correct?
  • Completeness: Does it address the full question?
  • Faithfulness: Does it stay grounded in retrieved posts?
  • Usefulness: Would this actually help a programmer?
  • Code handling: Are code snippets preserved correctly?
  6. Compare to source posts:
  • Look at what was retrieved
  • See how the answer synthesizes information
  • Identify cases of good vs. poor synthesis
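A sketch of steps 2 through 4 above: a programming-specific prompt template plugged into the query engine. It assumes the `index` built in Lab 4 and the model settings configured earlier; the template wording is ours, not a fixed standard:

```python
# Sketch: custom prompt template + query engine over the BEIR index from Lab 4.
from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(
    "You are answering a programming question using the StackExchange posts below.\n"
    "Use ONLY these posts, preserve code formatting, be precise with technical terms,\n"
    "synthesize across posts when they agree, and cite posts by number like [1].\n"
    "---------------------\n{context_str}\n---------------------\n"
    "Question: {query_str}\nAnswer: "
)

query_engine = index.as_query_engine(       # `index` from Lab 4
    similarity_top_k=5,                     # number of posts to retrieve
    response_mode="compact",                # or "tree_summarize" for broader synthesis
    text_qa_template=qa_prompt,
)

response = query_engine.query("What is the difference between a process and a thread?")
print(response)
for source in response.source_nodes:        # inspect what was retrieved behind the answer
    print(round(source.score, 3), source.node.ref_doc_id)
```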

Reflection:

  • How does generation quality depend on retrieval quality?
  • What makes a "good" answer for programming questions?
  • Where does the system struggle?

Deliverable: Working RAG system that can answer real programming questions from StackExchange knowledge.

03:00 - 03:15 | Break

03:15 - 04:30 | Lab 6 & 7: Evaluation & Iterative Optimization

Learning Objective: Build evaluation suite and systematically improve your RAG system through measured iteration.

Part 1: Comprehensive Evaluation (45 minutes)

Retrieval Evaluation:

  • Run your retriever on all 876 test queries
  • Calculate standard metrics using qrels:
    • MRR@10 (Mean Reciprocal Rank)
    • NDCG@10 (Normalized Discounted Cumulative Gain)
    • MAP (Mean Average Precision)
    • Recall@100 (Coverage)
  • Generate per-query breakdown
  • Identify queries with poor retrieval
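One way to compute these retrieval numbers is with the ir_measures package (a tool choice on our part; the notebooks may use PyTerrier instead), scoring your run against the official qrels:

```python
# Sketch: aggregate and per-query retrieval metrics against the official BEIR qrels.
import ir_datasets
import ir_measures
from ir_measures import RR, nDCG, AP, R

dataset = ir_datasets.load("beir/cqadupstack/programmers")
qrels = list(dataset.qrels_iter())

# run: {query_id: {doc_id: score}} produced by your retriever for all 876 queries.
run = {"some_query_id": {"some_doc_id": 12.3, "another_doc_id": 11.8}}   # placeholder shape

print(ir_measures.calc_aggregate([RR @ 10, nDCG @ 10, AP, R @ 100], qrels, run))

# Per-query breakdown: sort by nDCG@10 to find the queries with poor retrieval.
per_query = list(ir_measures.iter_calc([nDCG @ 10], qrels, run))
for m in sorted(per_query, key=lambda m: m.value)[:10]:
    print(m.query_id, round(m.value, 3))
```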

Generation Evaluation:

Automated metrics:

  • Response length distribution
  • Citation coverage (how many retrieved docs cited?)
  • Format validation (proper structure?)

Manual assessment (on 20 sample queries):

  • Faithfulness: Answer grounded in sources? (Score 1-5)
  • Accuracy: Technical information correct? (Score 1-5)
  • Completeness: Addresses full question? (Score 1-5)
  • Usefulness: Actually helpful for a programmer? (Score 1-5)

LLM-as-judge (optional):

  • Use Llama 3.2 to evaluate your system's responses
  • Automated scoring of larger query set
  • Compare to your manual assessments
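If you try the optional LLM-as-judge step, a sketch might look like the following; the rubric wording and the helper name are ours, not a fixed standard:

```python
# Sketch: reuse Llama 3.2 as a judge for faithfulness on a 1-5 scale.
from transformers import pipeline

judge = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct", device_map="auto")

def judge_faithfulness(question, sources, answer):
    prompt = (
        "Rate how faithful the answer is to the sources on a 1-5 scale "
        "(5 = every claim is supported, 1 = mostly unsupported). Reply with the number only.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\n\nAnswer: {answer}\n\nScore:"
    )
    return judge(prompt, max_new_tokens=5, return_full_text=False)[0]["generated_text"].strip()

score = judge_faithfulness(
    "What is a deadlock?",
    "[1] A deadlock occurs when two threads each wait on a lock the other holds.",
    "A deadlock is two threads waiting forever on each other's locks.",
)
print(score)   # compare these automated scores against your manual 1-5 assessments
```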

Generate Evaluation Report:

  • Summary statistics (averages, distributions)
  • Best-performing queries (what went right?)
  • Worst-performing queries (what went wrong?)
  • Patterns in failures (types of questions that struggle)

Analysis Questions:

  • Where does retrieval fail? (wrong posts retrieved)
  • Where does generation fail? (poor synthesis, hallucination)
  • What types of questions work best?
  • What types of questions need improvement?

Part 2: Iterative Optimization (45 minutes)

The Optimization Loop:

  1. Identify Target Queries:
  • Select bottom 20% by metrics
  • Focus on fixable failures (not data limitations)
  2. Diagnose Root Causes:
  • Retrieval failure?
    • Relevant posts exist but not retrieved
    • Wrong posts ranked higher
    • Query needs reformulation
  • Generation failure?
    • Retrieved posts are good but answer is poor
    • Hallucination or drift from sources
    • Poor synthesis of multiple posts
    • Code formatting lost
  3. Apply Targeted Fixes:
     For retrieval issues:
  • Adjust similarity_top_k (retrieve more/fewer)
  • Experiment with different embedding models
  • Try query reformulation or expansion
     For generation issues:
  • Refine prompt template (add specific instructions)
  • Change response mode (compact vs. tree_summarize)
  • Adjust temperature or other generation parameters
  • Add examples to prompt (few-shot)
  4. Re-evaluate:
  • Run evaluation again on your fixes
  • Measure improvement in metrics
  • Verify fixes didn't break other queries
  5. Document Changes:
  • Track what you tried
  • Record which improvements worked
  • Note unexpected side effects
  6. Iterate:
  • Repeat cycle until metrics plateau or time runs out
  • Aim for 15-20% improvement over baseline
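One pass through the loop might look like this sketch; it reuses the `index`, `qa_prompt`, and evaluation code from the earlier labs, and the configuration names are illustrative:

```python
# Sketch: try a couple of targeted configuration changes, then re-measure.
candidate_engines = {
    "top_k_10_compact": index.as_query_engine(similarity_top_k=10, response_mode="compact",
                                              text_qa_template=qa_prompt),
    "top_k_5_tree": index.as_query_engine(similarity_top_k=5, response_mode="tree_summarize",
                                          text_qa_template=qa_prompt),
}

target_queries = ["How to debug memory leaks in C++?"]   # replace with your bottom-20% queries

for name, engine in candidate_engines.items():
    for q in target_queries:
        print(name, "->", str(engine.query(q))[:120])

# After each change: re-run the retrieval metrics from Part 1, check nothing else regressed,
# and keep notes on what you tried and what actually moved the numbers.
```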

Collaborative Learning:

  • Share strategies with classmates
  • Discuss what worked and what didn't
  • Learn from each other's approaches

Goal: Achieve measurable, documented improvement in your RAG system through systematic optimization.

Deliverable:

  • Optimized RAG system with improved metrics
  • Documentation of optimization process
  • Understanding of what works and why

04:30 - 05:00 | Results Showcase & Wrap-Up

Class Showcase:

Lightning Demos (2 minutes each):

  • Share one query where your system excels
  • Show one challenging query and how you addressed it
  • Demonstrate your biggest improvement

Class Metrics Competition:

  • Best Retrieval: Highest MRR@10 or NDCG@10
  • Best Generation: Highest manual quality scores
  • Most Improved: Biggest gain from baseline
  • Best Technical Answer: Vote on most helpful programming response

Discussion: Production Considerations

Scaling Beyond the Bootcamp:

  • Handling larger document collections (millions of posts)
  • Serving queries with low latency (<500ms)
  • Caching strategies for repeated queries
  • Incremental index updates (new StackExchange posts)

Real-World Technical Q&A Systems:

  • GitHub Copilot Chat (code generation + search)
  • StackOverflow search enhancements
  • Internal developer documentation systems
  • IDE-integrated help systems

Cost and Performance:

  • Embedding costs (one-time vs. query-time)
  • LLM inference costs (per query)
  • Infrastructure requirements
  • Trade-offs: accuracy vs. speed vs. cost

When to Use What We Learned:

Use primitives (Day 1 approach) when:

  • Building novel research systems
  • Need fine-grained performance optimization
  • Debugging complex production issues
  • Custom requirements not supported by frameworks

Use frameworks (Day 2 approach) when:

  • Standard RAG use cases
  • Rapid prototyping
  • Production systems (tested, maintained)
  • Team projects (shared abstractions)

Next Steps & Resources:

Continue Learning:

  • Try other BEIR datasets (18 total, various domains)
  • Build RAG for your own documents
  • Explore LlamaIndex advanced features (agents, streaming)
  • Learn LangChain for multi-step workflows

Recommended Resources:

  • LlamaIndex documentation and tutorials
  • BEIR benchmark leaderboard (see state-of-the-art)
  • RAG research papers and best practices
  • Communities: Discord servers, GitHub discussions

Frequently Asked Questions

Q: I'm not a strong Python programmer. Can I keep up?
 A: Yes. We provide starter code with clear instructions. You'll modify working code, not write from scratch. If you can read Python, you'll be fine. We focus on concepts, not syntax.

Q: Will Colab's free GPU be enough?
 A: Yes. Everything is optimized for Colab's free tier. Llama 3.2 (3B parameters) runs comfortably. Indexing times are manageable (~5 min for Vaswani, ~15 min for BEIR).

Q: Why build from scratch on Day 1 instead of just using frameworks?
 A: Understanding primitives means you can debug when frameworks fail. You'll know what LlamaIndex is doing under the hood, making you far more effective. It's like learning to drive a manual transmission—you understand the car better.

Q: What if I get stuck during hands-on sessions?
 A: Every notebook has recovery checkpoints. The instructor provides real-time help. Collaborative debugging with peers is encouraged. We learn together.

Q: Do I need to know about StackExchange or programming Q&A for Day 2?
 A: Not in detail. The content is relatable if you've ever programmed. Questions range from beginner to advanced. You can focus on queries matching your expertise level.

Q: Can I use these techniques on my own documents?
 A: Absolutely! That's the goal. The techniques you practice on Vaswani and BEIR transfer directly to any document collection—company docs, research papers, legal documents, etc.

Q: Will we cover fine-tuning LLMs?
 A: No, this bootcamp focuses specifically on RAG. We'll discuss when fine-tuning is appropriate vs. RAG, but implementation is out of scope.

Q: What's the difference between BEIR and TREC?
 A: Both are IR research benchmarks. BEIR is a suite of 18 datasets (we use one). TREC is a long-running competition series. Both are prestigious and portfolio-worthy.

Q: Can I work on this after the bootcamp?
 A: Yes! BEIR has 18 total datasets to explore. Your code will run anywhere you have a GPU. We'll provide resources for continued learning.

Q: What if I want to learn more about LlamaIndex features not covered?
 A: We cover core RAG functionality. LlamaIndex has many advanced features (agents, streaming, multi-document queries) you can explore after. We'll provide tutorial links.

Q: How much does participation cost?
 A: Registration is $1,249 (see Registration above), which includes all materials, lunches, and certification. No hidden costs—everything runs on free open-source tools.

Q: What if I can't attend in person?
 A: The bootcamp is also delivered live online via Zoom (see Registration above), so you can participate remotely in real time. In-person attendance is encouraged for the collaborative, hands-on labs.

This workshop is offered through the Computer Science Department at Missouri University of Science and Technology. Developed and taught by Dr. Shubham Chatterjee, who leads the IRIS Lab and has published extensively in neural information retrieval and RAG systems at premier venues including SIGIR, EMNLP, and ECIR.

Ready to master RAG from the ground up? Join us for this intensive hands-on bootcamp where you'll build real systems on real benchmarks!