About This Offering

This intensive workshop provides a technical foundation in Retrieval-Augmented Generation (RAG) using a "Primitives-First" methodology. We reject the "magic wrapper" approach in favor of a progressive architecture.

Registration

  • Registration: Open until March 29, 2026
  • Course Dates: April 7 & 9, 2026
  • PDH: 16
  • Price: $1,249
  • Location: 2127 Innerbelt Business Center Drive, St. Louis, MO 63114
    • Also delivered live online via Zoom
  • Required:
    • Access to laptop computer
    • Basic Python literacy (can read and modify code)
    • Willingness to learn by doing
    • Google account (for Colab access)
  • Helpful but Not Required:
    • Familiarity with numpy/pandas
    • Understanding of machine learning concepts
    • Experience with APIs or transformers library
  • Not Required:
    • Deep learning expertise
    • NLP background
    • Information retrieval knowledge
    • Production deployment experience
    • Paid API keys or services (everything is open-source!)
  • This bootcamp is ideal for:
    • Software engineers exploring AI/ML applications
    • Data scientists integrating LLMs into workflows
    • Technical leaders evaluating RAG solutions
    • Researchers in information retrieval and NLP
    • Anyone building document-based AI systems
  • This bootcamp is especially valuable if you:
    • Want to understand RAG internals, not just use high-level APIs
    • Are building custom RAG solutions requiring fine-grained control
    • Want hands-on experience with both primitives and frameworks

The workshop's progressive architecture unfolds over two days:

Day 1 — The Mechanics (Build from Scratch):
 You will build retrieval engines and RAG pipelines using raw Python libraries (numpy, FAISS, transformers). Every line of code will be transparent. You'll work with the Vaswani dataset (11,429 IR research abstracts with 93 queries) to understand how semantic search actually works—no frameworks, no abstraction layers.

Day 2 — The Production (Orchestrate at Scale):
 You will scale to production using LlamaIndex on the BEIR Programmers benchmark (32K StackExchange programming posts with 876 queries). You'll see how frameworks simplify complexity while building a technical Q&A system that answers real programming questions.

Technology Stack

We use a specific toolset for each stage of learning:

Day 1: The Primitives (White-Box)

Purpose: Understand every component by building from scratch

  • Embeddings: sentence-transformers (all-MiniLM-L6-v2)
  • Vector Operations: numpy for manual similarity calculations
  • Indexing: FAISS for vector search
  • LLM: Llama 3.2 (3B) via transformers library
  • Evaluation: PyTerrier for IR metrics (MRR, NDCG, MAP)
  • Dataset: Vaswani corpus
    • 11,429 scientific abstracts from information retrieval research
    • 93 queries with complete relevance judgments
    • Built into PyTerrier (zero setup)
    • Domain: Learning RAG by searching IR research papers

Day 2: The "Production" Stack

Purpose: Scale to production with orchestration frameworks

  • Orchestration: LlamaIndex for pipeline management
  • LLM Integration: Same Llama 3.2 through LlamaIndex
  • Data Loading: ir_datasets (standard IR benchmark library)
  • Dataset: BEIR CQADupStack Programmers
    • 32,000 StackExchange programming posts
    • 876 queries (real programming questions)
    • 1,700+ relevance judgments (qrels)
    • Domain: Technical Q&A from StackExchange
    • Zero preparation needed (built into ir_datasets)

Day 1: Retrieval Architectures (No Frameworks)

Goal: Build a RAG system from scratch to understand the internal mechanics.

09:00 - 09:30 | Welcome & The "Watch It Fail" Demo

  • The Hook: Live hallucination demonstration with Llama 3.2
    • Watch the model confidently fabricate citations and facts
    • Understand viscerally why RAG is necessary
  • The Solution: Why RAG exists—limitations of parametric knowledge
  • Roadmap: Build Manually (Day 1) → Scale with Frameworks (Day 2)
  • Two-day overview: Vaswani → BEIR Programmers, Primitives → Frameworks

09:30 - 10:30 | Lecture: LLM Fundamentals & Setup

  • Theory: How LLMs work
    • Tokenization and next-token prediction
    • Context windows and attention mechanisms
    • Why hallucination happens
    • The need for external knowledge
  • Open-source vs. Proprietary: Cost, privacy, control, learning benefits
  • Lab Setup:
    • Colab GPU verification and configuration
    • Loading Llama 3.2 (3B parameters)
    • First generation test (observe hallucination)
  • Dataset Introduction: The Vaswani Corpus
    • What it contains: 11,429 IR research abstracts
    • Why it's perfect for learning: built-in, scientific, manageable
    • Loading with PyTerrier (one line of code!)
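To make the lab setup concrete, here is a minimal sketch of the Colab steps above. It assumes the meta-llama/Llama-3.2-3B-Instruct checkpoint on Hugging Face (the workshop notebooks may point at a different Llama 3.2 variant) and uses PyTerrier's built-in Vaswani dataset handle:

```python
# Sketch only: verify the Colab GPU, load Llama 3.2, and grab the Vaswani corpus.
import torch
import pyterrier as pt
from transformers import AutoModelForCausalLM, AutoTokenizer

print("GPU available:", torch.cuda.is_available())   # Colab GPU verification

model_id = "meta-llama/Llama-3.2-3B-Instruct"         # assumed checkpoint (gated; requires HF access)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# First generation test: no retrieval yet, so expect confident-sounding fabrication.
prompt = "List three papers that introduced retrieval-augmented generation, with authors and years."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=200)[0], skip_special_tokens=True))

# The Vaswani corpus ships with PyTerrier: one call returns the dataset handle.
if not pt.started():
    pt.init()
vaswani = pt.get_dataset("vaswani")
print(vaswani.get_topics().head())   # the 93 queries
print(vaswani.get_qrels().head())    # the relevance judgments
```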

10:30 - 11:00 | Break

11:00 - 12:30 | Lab 1: Vector Spaces & Embeddings

Learning Objective: Understand semantic similarity by working directly with vectors.

Activities:

  • Load the Vaswani dataset (11,429 abstracts, 93 queries)
  • Encode text into numerical vectors using sentence-transformers
  • Calculate cosine similarity manually with numpy
  • Visualize why semantically similar documents rank higher
  • Explore the vector space with concrete examples
  • See how "meaning" becomes "mathematics"
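A minimal sketch of the core idea in this lab, using the all-MiniLM-L6-v2 model from the Day 1 stack (the texts here are illustrative, not drawn from the Vaswani corpus):

```python
# Sketch: encode text, then compute cosine similarity by hand with numpy.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # produces 384-dimensional embeddings

docs = [
    "Methods for clustering documents by topical similarity.",
    "Measurement of thermal expansion in metal alloys.",
]
query = "grouping texts that discuss the same subject"

doc_vecs = model.encode(docs)     # shape (2, 384)
query_vec = model.encode(query)   # shape (384,)

def cosine(a, b):
    # dot product of the vectors divided by the product of their lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for doc, vec in zip(docs, doc_vecs):
    print(f"{cosine(query_vec, vec):.3f}  {doc}")
# The clustering text should score noticeably higher than the metallurgy one:
# semantically similar text ends up close together in the vector space.
```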

Key Concepts:

  • Embeddings as dense representations of meaning
  • Cosine similarity for measuring semantic closeness
  • High-dimensional vector spaces

Deliverable: Deep intuition of how text becomes searchable mathematics.

12:30 - 01:30 | Lunch

01:30 - 02:45 | Lab 2: Building the Dense Retriever

Learning Objective: Build a vector search engine from scratch and evaluate it.

Activities:

  • Encode all 11,429 Vaswani abstracts (~5 minutes)
  • Build a FAISS index manually—no frameworks, you see every step
    • Initialize index with correct dimensionality
    • Normalize vectors for cosine similarity
    • Add embeddings to index
  • Retrieve top-100 documents for all 93 queries
  • Evaluate your retriever using PyTerrier:
    • MRR@10 (Mean Reciprocal Rank): How quickly do you find relevant docs?
    • NDCG@10 (Normalized Discounted Cumulative Gain): Quality of ranking
    • MAP (Mean Average Precision): Overall precision across all queries
    • Recall@100: Coverage of relevant documents
  • Compare your scores to baseline systems
  • Analyze: Which queries work well? Which fail? Why?
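A sketch of the indexing and search steps above, with placeholder arrays standing in for the real Vaswani and query embeddings:

```python
# Sketch: build a transparent FAISS index and retrieve top-100 documents per query.
import numpy as np
import faiss

# Placeholders for the real sentence-transformer encodings from Lab 1:
doc_embeddings = np.random.rand(11429, 384).astype("float32")    # one row per abstract
query_embeddings = np.random.rand(93, 384).astype("float32")     # one row per query

dim = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)        # exact inner-product index with the correct dimensionality
faiss.normalize_L2(doc_embeddings)    # unit-length vectors: inner product == cosine similarity
index.add(doc_embeddings)             # add all document embeddings

faiss.normalize_L2(query_embeddings)
scores, ranked_ids = index.search(query_embeddings, 100)   # top-100 per query
print(ranked_ids.shape)   # (93, 100): row i holds the ranked document positions for query i
# These rankings are what you hand to PyTerrier for MRR@10, NDCG@10, MAP, and Recall@100.
```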

Checkpoint: You should achieve approximately:

  • MRR@10 ≈ 0.25
  • NDCG@10 ≈ 0.35
  • MAP ≈ 0.20

Key Understanding: See where dense retrieval excels (semantic/conceptual queries) and where it struggles (entity/name queries).

Deliverable: A transparent, working search engine with evaluation metrics—you built every component.

02:45 - 03:15 | Break

03:15 - 05:00 | Lab 3: The Manual RAG Pipeline

Learning Objective: Orchestrate the "Retrieve → Augment → Generate" flow manually.

Activities:

  • Connect your FAISS retriever to Llama 3.2
  • Format retrieved documents into context (manual string manipulation)
  • Experiment with four prompt strategies:
    1. Baseline prompt: No context provided
      • Result: Observe hallucination and fabrication
    2. Context injection: Add retrieved documents
      • Result: Better, but still sometimes drifts from sources
    3. Grounded prompting: "Based ONLY on the provided sources..."
      • Result: Much better adherence to sources
    4. With citations: Add explicit citation requirements
      • Result: Best—grounded AND attributed
  • Manually parse and validate citations from responses
  • Generate answers for 10 Vaswani queries
  • Compare quality across all four prompt strategies
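Here is a sketch of the grounded, citation-style prompting (strategies 3 and 4 above) wired to Llama 3.2 through the transformers pipeline. The prompt wording and the stand-in passages are illustrative; in the lab the passages come from your FAISS retriever, and the model id is the same assumed checkpoint as in the morning setup:

```python
# Sketch: manual Retrieve -> Augment -> Generate with a grounded, citation-style prompt.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct",
                     device_map="auto")

retrieved_docs = [   # illustrative stand-ins for the top FAISS hits
    {"docno": "1042", "text": "Probabilistic models rank documents by estimated relevance..."},
    {"docno": "8873", "text": "Term weighting schemes approximate these relevance estimates..."},
]
question = "How do probabilistic models rank documents?"

# Manual context formatting: number the sources so the model can cite them.
context = "\n\n".join(f"[{i + 1}] {d['text']}" for i, d in enumerate(retrieved_docs))
prompt = (
    "Answer the question based ONLY on the provided sources. "
    "Cite sources as [1], [2], ... after each claim. "
    "If the sources do not contain the answer, say so.\n\n"
    f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
)

answer = generator(prompt, max_new_tokens=300, return_full_text=False)[0]["generated_text"]
print(answer)

# Manual citation parsing: which numbered sources does the answer actually reference?
print("Cited sources:", sorted(set(re.findall(r"\[(\d+)\]", answer))))
```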

Analysis Activities:

  • Measure hallucination rate for each prompt strategy
  • Identify which prompt elements matter most
  • Understand the dramatic impact of prompt engineering on RAG quality

Key Learning: Prompt design is absolutely critical for RAG performance. Small changes in instructions produce dramatically different results.

Deliverable: A complete RAG system where you control every component and understand exactly how each piece affects output quality.

Day 2: Production Architecture (LlamaIndex)

Goal: Scale to production complexity using orchestration frameworks on a larger, more challenging dataset.

09:00 - 09:30 | Day 1 Recap + Scaling to Production

Review Yesterday:

  • Built retrieval manually on 11K documents (Vaswani)
  • Saw every component: embeddings, indexing, retrieval, prompting
  • Understand the mechanics deeply

Today's Challenge:

  • Scale from 11K → 32K documents
  • Shift from academic abstracts → real-world technical Q&A
  • Use production frameworks (LlamaIndex) to handle complexity
  • Build a system that answers actual programming questions

Introduction to BEIR:

  • BEIR = Benchmarking IR (a benchmark suite widely used in research)
  • 18 different datasets across domains
  • We'll use: CQADupStack/Programmers
    • 32,000 StackExchange programming posts
    • Real questions from programmers
    • 876 test queries with relevance judgments

Why This Dataset:

  • Relevant domain (you're likely programmers!)
  • Real-world content (not sanitized academic text)
  • Interesting questions you can relate to
  • Zero preparation needed (built into ir_datasets)

09:30 - 10:30 | Lecture: Introduction to LlamaIndex

The Core Question: "Why use frameworks when we can build manually?"

Comparing Approaches:

Yesterday's Manual Approach (50+ lines):

  • Load documents manually
  • Encode with sentence-transformers
  • Build FAISS index step-by-step
  • Write retrieval logic
  • Format context manually
  • Craft prompts
  • Parse responses

Today with LlamaIndex (10 lines):

  • Load documents
  • Create index (LlamaIndex handles embedding + indexing)
  • Create query engine (handles retrieval + context + prompting)
  • Query and get response
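Those roughly ten lines look something like the sketch below. It assumes recent llama-index releases with the llama-index-embeddings-huggingface and llama-index-llms-huggingface integrations installed; module paths differ across versions:

```python
# Sketch: the pipeline you built by hand yesterday, expressed through LlamaIndex.
from llama_index.core import Settings, VectorStoreIndex, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM

# Point LlamaIndex at local open-source models instead of its hosted defaults.
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Settings.llm = HuggingFaceLLM(model_name="meta-llama/Llama-3.2-3B-Instruct",
                              tokenizer_name="meta-llama/Llama-3.2-3B-Instruct")

documents = [   # illustrative documents; Lab 4 loads the real BEIR posts
    Document(text="An interface declares behavior only; an abstract class may also carry state."),
    Document(text="Prefer composition over inheritance for flexible designs."),
]

index = VectorStoreIndex.from_documents(documents)          # chunking + embedding + indexing
query_engine = index.as_query_engine(similarity_top_k=2)    # retrieval + context packing + prompting
print(query_engine.query("What is the difference between an abstract class and an interface?"))
```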

What LlamaIndex Abstracts:

  • Document loading and parsing
  • Automatic chunking strategies
  • Embedding and indexing orchestration
  • Retrieval with configurable parameters
  • Context window management
  • Response synthesis from multiple sources
  • Citation tracking

What You Still Control:

  • Choice of LLM and embedding model
  • Custom prompt templates
  • Retrieval parameters (top-k, similarity threshold)
  • Response modes (compact, tree_summarize, etc.)
  • Evaluation and optimization

When to Use Frameworks:

  • Standard RAG use cases (your own documents)
  • Rapid prototyping and iteration
  • Production deployments (tested, maintained code)
  • When you need features like streaming, callbacks, agents

When to Stay Low-Level:

  • Custom requirements (special formats, unusual workflows)
  • Performance-critical applications (need fine-grained control)
  • Research (exploring novel approaches)
  • Debugging complex issues (need to see everything)

Key Insight: After Day 1, you understand what LlamaIndex is doing. You're not using "magic"—you know the mechanics underneath.

10:30 - 11:00 | Break

11:00 - 12:30 | Lab 4: Building Retrieval with LlamaIndex

Learning Objective: Use LlamaIndex to build retrieval at scale, understanding what it abstracts from your Day 1 manual work.

Activities:

  1. Load BEIR CQADupStack/Programmers:

  • Use ir_datasets library (standard for IR research)
  • Load 32K StackExchange posts
  • Load 876 test queries
  • Load relevance judgments (qrels)
  2. Explore the data:

  • Examine sample posts (questions + answers)
  • Look at code snippets, technical terminology
  • Review sample queries (real programming questions)
  • Examples you'll see:
    • "How to debug memory leaks in C++?"
    • "What is the difference between abstract class and interface?"
    • "Best practices for REST API design"
  3. Convert to LlamaIndex format:

  • Transform StackExchange posts into LlamaIndex Documents
  • Preserve metadata (document IDs for evaluation)
  4. Build index with LlamaIndex (see the sketch after this list):

  • One function call handles: chunking, embedding, indexing
  • Takes ~10-15 minutes for 32K documents
  • Compare to Day 1: same result, much less code
  5. Test retrieval:

  • Query with sample programming questions
  • Examine retrieved posts
  • Verify relevant content is returned
  6. Evaluate retrieval quality:

  • Calculate MRR@10, NDCG@10 using official qrels
  • Compare to published baselines (if available)
  • Analyze which types of queries work well
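A sketch of steps 1, 3, and 4, using the dataset id as registered in ir_datasets; field names follow the BEIR CQADupStack schema exposed by ir_datasets, and the embedding/LLM settings are assumed to be configured as in the earlier LlamaIndex sketch:

```python
# Sketch: load the BEIR posts, convert them to LlamaIndex Documents, and build the index.
import ir_datasets
from llama_index.core import VectorStoreIndex, Document

dataset = ir_datasets.load("beir/cqadupstack/programmers")

# StackExchange posts -> LlamaIndex Documents, preserving the BEIR doc_id for evaluation.
documents = [
    Document(text=f"{doc.title}\n\n{doc.text}", doc_id=doc.doc_id)
    for doc in dataset.docs_iter()
]
queries = {q.query_id: q.text for q in dataset.queries_iter()}   # the 876 test queries
qrels = list(dataset.qrels_iter())                               # relevance judgments

index = VectorStoreIndex.from_documents(documents)   # ~10-15 minutes for the 32K posts
retriever = index.as_retriever(similarity_top_k=10)

sample_query = next(iter(queries.values()))
for hit in retriever.retrieve(sample_query)[:3]:
    print(round(hit.score, 3), hit.node.ref_doc_id)
```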

Reflection Questions:

  • How does LlamaIndex's retrieval compare to your Day 1 FAISS retriever?
  • What's different about retrieving from technical posts vs. academic abstracts?
  • Can you identify where LlamaIndex is doing what you did manually yesterday?

Checkpoint: Everyone should have:

  • Working LlamaIndex retriever over 32K documents
  • Evaluation metrics calculated
  • Understanding of retrieval quality on this dataset

Deliverable: Production-scale retriever built with framework, evaluated with standard metrics.

12:30 - 01:30 | Lunch

01:30 - 02:30 | Lecture: Generation with LlamaIndex

RAG for Technical Q&A:

Unique Challenges:

  • Code snippets: Need to preserve formatting, syntax
  • Technical terminology: Must be precise and accurate
  • Multiple answers: StackExchange posts have various solutions
  • Context length: Posts can be long, need smart selection
  • Synthesis: Combine insights from multiple posts

Prompt Engineering for Technical Content:

  • Clear instructions for handling code
  • Emphasis on accuracy and precision
  • Instructions for synthesizing multiple sources
  • Handling cases where posts contradict each other

LlamaIndex Query Engines:

  • Retriever mode: Just get documents
  • Query engine mode: Retrieve + synthesize answer
  • Response modes:
    • compact: Best for focused questions
    • tree_summarize: Best for synthesis across many documents
    • refine: Iterative refinement of answer

Configuring for Programming Q&A:

  • Custom prompt templates
  • Appropriate top-k (number of posts to retrieve)
  • Response mode selection
  • Citation configuration

Example Use Cases:

  • Debug help: "Why is my code throwing NullPointerException?"
  • Concept explanation: "What's the difference between processes and threads?"
  • Best practices: "How should I structure a REST API?"

02:30 - 03:00 | Lab 5: End-to-End RAG with LlamaIndex

Learning Objective: Build a complete RAG pipeline that answers programming questions using retrieved StackExchange posts.

Activities:

  1. Configure LlamaIndex with Llama 3.2:
  • Set up the LLM connection
  • Choose embedding model (same as Day 1 for consistency)
  2. Create custom prompt template:
  • Design prompt specifically for programming Q&A
  • Include instructions for:
    • Using retrieved StackExchange posts
    • Preserving code formatting
    • Synthesizing multiple answers
    • Being precise with technical terms
    • Citing source posts
  3. Build query engine (see the sketch after this list):
  • Connect retriever to LLM
  • Configure response mode
  • Set retrieval parameters (top-k)
  4. Generate answers for test queries:
  • Pick 10 diverse programming questions
  • Generate answers using your RAG system
  • Review outputs manually
  5. Quality assessment:
  • Accuracy: Is technical information correct?
  • Completeness: Does it address the full question?
  • Faithfulness: Does it stay grounded in retrieved posts?
  • Usefulness: Would this actually help a programmer?
  • Code handling: Are code snippets preserved correctly?
  6. Compare to source posts:
  • Look at what was retrieved
  • See how the answer synthesizes information
  • Identify cases of good vs. poor synthesis
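A sketch of steps 2 through 4 above: a programming-specific prompt template plugged into the query engine. It assumes the `index` built in Lab 4 and the model settings configured earlier; the template wording is ours, not a fixed standard:

```python
# Sketch: custom prompt template + query engine over the BEIR index from Lab 4.
from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(
    "You are answering a programming question using the StackExchange posts below.\n"
    "Use ONLY these posts, preserve code formatting, be precise with technical terms,\n"
    "synthesize across posts when they agree, and cite posts by number like [1].\n"
    "---------------------\n{context_str}\n---------------------\n"
    "Question: {query_str}\nAnswer: "
)

query_engine = index.as_query_engine(       # `index` from Lab 4
    similarity_top_k=5,                     # number of posts to retrieve
    response_mode="compact",                # or "tree_summarize" for broader synthesis
    text_qa_template=qa_prompt,
)

response = query_engine.query("What is the difference between a process and a thread?")
print(response)
for source in response.source_nodes:        # inspect what was retrieved behind the answer
    print(round(source.score, 3), source.node.ref_doc_id)
```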

Reflection:

  • How does generation quality depend on retrieval quality?
  • What makes a "good" answer for programming questions?
  • Where does the system struggle?

Deliverable: Working RAG system that can answer real programming questions from StackExchange knowledge.

03:00 - 03:15 | Break

03:15 - 04:30 | Lab 6 & 7: Evaluation & Iterative Optimization

Learning Objective: Build evaluation suite and systematically improve your RAG system through measured iteration.

Part 1: Comprehensive Evaluation (45 minutes)

Retrieval Evaluation:

  • Run your retriever on all 876 test queries
  • Calculate standard metrics using qrels:
    • MRR@10 (Mean Reciprocal Rank)
    • NDCG@10 (Normalized Discounted Cumulative Gain)
    • MAP (Mean Average Precision)
    • Recall@100 (Coverage)
  • Generate per-query breakdown
  • Identify queries with poor retrieval
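One way to compute these retrieval numbers is with the ir_measures package (a tool choice on our part; the notebooks may use PyTerrier instead), scoring your run against the official qrels:

```python
# Sketch: aggregate and per-query retrieval metrics against the official BEIR qrels.
import ir_datasets
import ir_measures
from ir_measures import RR, nDCG, AP, R

dataset = ir_datasets.load("beir/cqadupstack/programmers")
qrels = list(dataset.qrels_iter())

# run: {query_id: {doc_id: score}} produced by your retriever for all 876 queries.
run = {"some_query_id": {"some_doc_id": 12.3, "another_doc_id": 11.8}}   # placeholder shape

print(ir_measures.calc_aggregate([RR @ 10, nDCG @ 10, AP, R @ 100], qrels, run))

# Per-query breakdown: sort by nDCG@10 to find the queries with poor retrieval.
per_query = list(ir_measures.iter_calc([nDCG @ 10], qrels, run))
for m in sorted(per_query, key=lambda m: m.value)[:10]:
    print(m.query_id, round(m.value, 3))
```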

Generation Evaluation:

Automated metrics:

  • Response length distribution
  • Citation coverage (how many retrieved docs cited?)
  • Format validation (proper structure?)

Manual assessment (on 20 sample queries):

  • Faithfulness: Answer grounded in sources? (Score 1-5)
  • Accuracy: Technical information correct? (Score 1-5)
  • Completeness: Addresses full question? (Score 1-5)
  • Usefulness: Actually helpful for a programmer? (Score 1-5)

LLM-as-judge (optional):

  • Use Llama 3.2 to evaluate your system's responses
  • Automated scoring of larger query set
  • Compare to your manual assessments
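If you try the optional LLM-as-judge step, a sketch might look like the following; the rubric wording and the helper name are ours, not a fixed standard:

```python
# Sketch: reuse Llama 3.2 as a judge for faithfulness on a 1-5 scale.
from transformers import pipeline

judge = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct", device_map="auto")

def judge_faithfulness(question, sources, answer):
    prompt = (
        "Rate how faithful the answer is to the sources on a 1-5 scale "
        "(5 = every claim is supported, 1 = mostly unsupported). Reply with the number only.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\n\nAnswer: {answer}\n\nScore:"
    )
    return judge(prompt, max_new_tokens=5, return_full_text=False)[0]["generated_text"].strip()

score = judge_faithfulness(
    "What is a deadlock?",
    "[1] A deadlock occurs when two threads each wait on a lock the other holds.",
    "A deadlock is two threads waiting forever on each other's locks.",
)
print(score)   # compare these automated scores against your manual 1-5 assessments
```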

Generate Evaluation Report:

  • Summary statistics (averages, distributions)
  • Best-performing queries (what went right?)
  • Worst-performing queries (what went wrong?)
  • Patterns in failures (types of questions that struggle)

Analysis Questions:

  • Where does retrieval fail? (wrong posts retrieved)
  • Where does generation fail? (poor synthesis, hallucination)
  • What types of questions work best?
  • What types of questions need improvement?

Part 2: Iterative Optimization (45 minutes)

The Optimization Loop:

  1. Identify Target Queries:
  • Select bottom 20% by metrics
  • Focus on fixable failures (not data limitations)
  2. Diagnose Root Causes:
  • Retrieval failure?
    • Relevant posts exist but not retrieved
    • Wrong posts ranked higher
    • Query needs reformulation
  • Generation failure?
    • Retrieved posts are good but answer is poor
    • Hallucination or drift from sources
    • Poor synthesis of multiple posts
    • Code formatting lost
  3. Apply Targeted Fixes:
     For retrieval issues:
  • Adjust similarity_top_k (retrieve more/fewer)
  • Experiment with different embedding models
  • Try query reformulation or expansion
     For generation issues:
  • Refine prompt template (add specific instructions)
  • Change response mode (compact vs. tree_summarize)
  • Adjust temperature or other generation parameters
  • Add examples to prompt (few-shot)
  4. Re-evaluate:
  • Run evaluation again on your fixes
  • Measure improvement in metrics
  • Verify fixes didn't break other queries
  5. Document Changes:
  • Track what you tried
  • Record which improvements worked
  • Note unexpected side effects
  6. Iterate:
  • Repeat cycle until metrics plateau or time runs out
  • Aim for 15-20% improvement over baseline
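One pass through the loop might look like this sketch; it reuses the `index`, `qa_prompt`, and evaluation code from the earlier labs, and the configuration names are illustrative:

```python
# Sketch: try a couple of targeted configuration changes, then re-measure.
candidate_engines = {
    "top_k_10_compact": index.as_query_engine(similarity_top_k=10, response_mode="compact",
                                              text_qa_template=qa_prompt),
    "top_k_5_tree": index.as_query_engine(similarity_top_k=5, response_mode="tree_summarize",
                                          text_qa_template=qa_prompt),
}

target_queries = ["How to debug memory leaks in C++?"]   # replace with your bottom-20% queries

for name, engine in candidate_engines.items():
    for q in target_queries:
        print(name, "->", str(engine.query(q))[:120])

# After each change: re-run the retrieval metrics from Part 1, check nothing else regressed,
# and keep notes on what you tried and what actually moved the numbers.
```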

Collaborative Learning:

  • Share strategies with classmates
  • Discuss what worked and what didn't
  • Learn from each other's approaches

Goal: Achieve measurable, documented improvement in your RAG system through systematic optimization.

Deliverable:

  • Optimized RAG system with improved metrics
  • Documentation of optimization process
  • Understanding of what works and why

04:30 - 05:00 | Results Showcase & Wrap-Up

Class Showcase:

Lightning Demos (2 minutes each):

  • Share one query where your system excels
  • Show one challenging query and how you addressed it
  • Demonstrate your biggest improvement

Class Metrics Competition:

  • Best Retrieval: Highest MRR@10 or NDCG@10
  • Best Generation: Highest manual quality scores
  • Most Improved: Biggest gain from baseline
  • Best Technical Answer: Vote on most helpful programming response

Discussion: Production Considerations

Scaling Beyond the Bootcamp:

  • Handling larger document collections (millions of posts)
  • Serving queries with low latency (<500ms)
  • Caching strategies for repeated queries
  • Incremental index updates (new StackExchange posts)

Real-World Technical Q&A Systems:

  • GitHub Copilot Chat (code generation + search)
  • StackOverflow search enhancements
  • Internal developer documentation systems
  • IDE-integrated help systems

Cost and Performance:

  • Embedding costs (one-time vs. query-time)
  • LLM inference costs (per query)
  • Infrastructure requirements
  • Trade-offs: accuracy vs. speed vs. cost

When to Use What We Learned:

Use primitives (Day 1 approach) when:

  • Building novel research systems
  • Need fine-grained performance optimization
  • Debugging complex production issues
  • Custom requirements not supported by frameworks

Use frameworks (Day 2 approach) when:

  • Standard RAG use cases
  • Rapid prototyping
  • Production systems (tested, maintained)
  • Team projects (shared abstractions)

Next Steps & Resources:

Continue Learning:

  • Try other BEIR datasets (18 total, various domains)
  • Build RAG for your own documents
  • Explore LlamaIndex advanced features (agents, streaming)
  • Learn LangChain for multi-step workflows

Recommended Resources:

  • LlamaIndex documentation and tutorials
  • BEIR benchmark leaderboard (see state-of-the-art)
  • RAG research papers and best practices
  • Communities: Discord servers, GitHub discussions

Frequently Asked Questions

Q: I'm not a strong Python programmer. Can I keep up?
 A: Yes. We provide starter code with clear instructions. You'll modify working code, not write from scratch. If you can read Python, you'll be fine. We focus on concepts, not syntax.

Q: Will Colab's free GPU be enough?
 A: Yes. Everything is optimized for Colab's free tier. Llama 3.2 (3B parameters) runs comfortably. Indexing times are manageable (~5 min for Vaswani, ~15 min for BEIR).

Q: Why build from scratch on Day 1 instead of just using frameworks?
 A: Understanding primitives means you can debug when frameworks fail. You'll know what LlamaIndex is doing under the hood, making you far more effective. It's like learning to drive a manual transmission—you understand the car better.

Q: What if I get stuck during hands-on sessions?
 A: Every notebook has recovery checkpoints. The instructor provides real-time help. Collaborative debugging with peers is encouraged. We learn together.

Q: Do I need to know about StackExchange or programming Q&A for Day 2?
 A: Not in detail. The content is relatable if you've ever programmed. Questions range from beginner to advanced. You can focus on queries matching your expertise level.

Q: Can I use these techniques on my own documents?
 A: Absolutely! That's the goal. The techniques you practice on Vaswani and BEIR transfer directly to any document collection—company docs, research papers, legal documents, etc.

Q: Will we cover fine-tuning LLMs?
 A: No, this bootcamp focuses specifically on RAG. We'll discuss when fine-tuning is appropriate vs. RAG, but implementation is out of scope.

Q: What's the difference between BEIR and TREC?
 A: Both are IR research benchmarks. BEIR is a suite of 18 datasets (we use one). TREC is a long-running competition series. Both are prestigious and portfolio-worthy.

Q: Can I work on this after the bootcamp?
 A: Yes! BEIR has 18 total datasets to explore. Your code will run anywhere you have a GPU. We'll provide resources for continued learning.

Q: What if I want to learn more about LlamaIndex features not covered?
 A: We cover core RAG functionality. LlamaIndex has many advanced features (agents, streaming, multi-document queries) you can explore after. We'll provide tutorial links.

Q: How much does participation cost?
 A: Registration is $1,249 (see Registration above), which includes all materials, lunches, and certification. No hidden costs—everything runs on free open-source tools.

Q: What if I can't attend in person?
 A: The bootcamp is also delivered live online via Zoom (see Registration above), so you can participate remotely in real time. In-person attendance is encouraged for the collaborative, hands-on labs.

This workshop is offered through the Computer Science Department at Missouri University of Science and Technology. Developed and taught by Dr. Shubham Chatterjee, who leads the IRIS Lab and has published extensively in neural information retrieval and RAG systems at premier venues including SIGIR, EMNLP, and ECIR.

Ready to master RAG from the ground up? Join us for this intensive hands-on bootcamp where you'll build real systems on real benchmarks!