
A/B Testing LLM Models in Production: A Complete Guide

How to scientifically compare GPT-4 vs Claude vs Gemini in your actual use case. Includes statistical significance testing and code examples.

ReForge Team · Engineering · 6 min read

Benchmarks are great, but they don't tell you which LLM works best for **your** specific use case. The only way to know? A/B test in production with your actual users and data.

This guide will show you how to set up rigorous A/B tests for LLMs, measure what matters, and make data-driven decisions about which models to use.

Why A/B Test LLMs?

Benchmarks Don't Tell the Whole Story

Public benchmarks like MMLU and HumanEval are useful, but:

  • They test general capabilities, not your specific task
  • They don't account for your prompt engineering
  • They don't measure what your users care about
  • They're often gamed or overfitted

Real Example

A team chose Claude 3 Opus because it scored highest on benchmarks. After A/B testing, they found GPT-4 Turbo performed 15% better **for their specific use case** (customer support) and cost 5x less.

Setting Up A/B Tests

Define Your Hypothesis

    // Pin down the comparison before shipping it: models, primary
    // metric, and the effect size you expect to see.
    const hypothesis = {
      control: 'gpt-4-turbo',       // Current model
      treatment: 'claude-3-opus',   // New model to test
      metric: 'user_satisfaction',  // Primary metric
      expectedImprovement: 0.10,    // 10% relative improvement
      minSampleSize: 1000,          // Minimum requests per variant
    } as const

Implement Random Assignment

    import { createHash } from 'crypto'

    class ABTest {
      assignVariant(userId: string, testId: string): 'control' | 'treatment' {
        // Consistent hashing ensures the same user always gets the same variant
        const hash = createHash('md5')
          .update(`${testId}:${userId}`)
          .digest('hex')

        const num = parseInt(hash.substring(0, 8), 16)
        return num % 2 === 0 ? 'control' : 'treatment'
      }

      async handleRequest(userId: string, prompt: string) {
        const variant = this.assignVariant(userId, 'llm-comparison-test')

        const model = variant === 'control' ? 'gpt-4-turbo' : 'claude-3-opus'
        const response = await callLLM(model, prompt)

        // Log every request for later analysis
        await this.logResult({
          userId,
          variant,
          model,
          prompt,
          response,
          timestamp: new Date()
        })

        return response
      }
    }
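
Before routing real traffic, it's worth sanity-checking the assignment function: it should be deterministic per user and land close to a 50/50 split. A minimal sketch using the `ABTest` class above:

    const test = new ABTest()

    // Determinism: the same user always lands in the same bucket
    const a = test.assignVariant('user-123', 'llm-comparison-test')
    const b = test.assignVariant('user-123', 'llm-comparison-test')
    console.assert(a === b, 'assignment must be stable per user')

    // Balance: across many users the split should be close to 50/50
    let controlCount = 0
    for (let i = 0; i < 10_000; i++) {
      if (test.assignVariant(`user-${i}`, 'llm-comparison-test') === 'control') {
        controlCount++
      }
    }
    console.log(`control share: ${((controlCount / 10_000) * 100).toFixed(1)}%`)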

Defining Success Metrics

Quantitative Metrics

    interface Metrics {
      // Cost metrics
      costPerRequest: number  // USD
      monthlySpend: number    // USD

      // Performance metrics
      latency: number    // milliseconds
      errorRate: number  // 0–1

      // Quality metrics (computed against reference outputs)
      bleuScore?: number
      rougeScore?: number
      coherence?: number
    }
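
The experiment loop later in this guide calls a `calculateCost` helper without defining it. Here is a minimal sketch; the per-million-token prices are illustrative assumptions, so substitute your provider's current rates:

    // Illustrative per-million-token prices (assumptions, not current
    // list prices; check your provider's pricing page)
    const PRICES_PER_MILLION: Record<string, { input: number; output: number }> = {
      'gpt-4-turbo': { input: 10, output: 30 },
      'claude-3-opus': { input: 15, output: 75 },
    }

    function calculateCost(
      model: string,
      usage: { inputTokens: number; outputTokens: number }
    ): number {
      const price = PRICES_PER_MILLION[model]
      if (!price) throw new Error(`no pricing configured for ${model}`)

      return (
        (usage.inputTokens / 1_000_000) * price.input +
        (usage.outputTokens / 1_000_000) * price.output
      )
    }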

Qualitative Metrics

    interface UserFeedback {
      userId: string
      variant: 'control' | 'treatment'
      rating: 1 | 2 | 3 | 4 | 5  // Thumbs up/down or stars, normalized to 1–5
      timestamp: Date
    }

    // Track user satisfaction
    async function collectFeedback(requestId: string, rating: number) {
      const request = await db.requests.findUnique({ where: { id: requestId } })
      if (!request) throw new Error(`unknown request: ${requestId}`)

      await db.feedback.create({
        data: {
          userId: request.userId,
          variant: request.variant,
          rating,
          timestamp: new Date()
        }
      })
    }
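
To feed these ratings into the significance test in the next section, collapse them into a per-variant summary. A sketch assuming the same `db.feedback` table as above; it produces the per-variant shape that `TestResults` (defined below) expects:

    async function summarizeVariant(variant: 'control' | 'treatment') {
      const rows = await db.feedback.findMany({ where: { variant } })
      const ratings = rows.map((r) => r.rating)

      // Assumes at least two ratings per variant
      const n = ratings.length
      const mean = ratings.reduce((sum, r) => sum + r, 0) / n
      const variance =
        ratings.reduce((sum, r) => sum + (r - mean) ** 2, 0) / (n - 1)

      return { sampleSize: n, mean, stdDev: Math.sqrt(variance) }
    }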

Statistical Significance

Don't make decisions on insufficient data:

    interface TestResults {
      control: {
        sampleSize: number
        mean: number
        stdDev: number
      }
      treatment: {
        sampleSize: number
        mean: number
        stdDev: number
      }
    }

    function calculateSignificance(results: TestResults): {
      pValue: number
      significant: boolean
      improvement: number
    } {
      // Welch's t-test for unequal variances
      const { control, treatment } = results

      const meanDiff = treatment.mean - control.mean
      const standardError = Math.sqrt(
        (control.stdDev ** 2 / control.sampleSize) +
        (treatment.stdDev ** 2 / treatment.sampleSize)
      )

      const tStatistic = meanDiff / standardError

      // Two-sided p-value via the normal approximation (fine at the
      // sample sizes an A/B test needs anyway)
      const pValue = calculatePValue(tStatistic)

      return {
        pValue,
        significant: pValue < 0.05,  // 95% confidence
        improvement: meanDiff / control.mean
      }
    }

    function calculatePValue(t: number): number {
      return 2 * (1 - standardNormalCDF(Math.abs(t)))
    }

    // Abramowitz & Stegun 26.2.17 polynomial approximation of the
    // standard normal CDF
    function standardNormalCDF(x: number): number {
      const k = 1 / (1 + 0.2316419 * Math.abs(x))
      const poly = k * (0.319381530 + k * (-0.356563782 +
        k * (1.781477937 + k * (-1.821255978 + k * 1.330274429))))
      const tail = (Math.exp(-x * x / 2) / Math.sqrt(2 * Math.PI)) * poly
      return x >= 0 ? 1 - tail : tail
    }
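
For a binary outcome like thumbs-up rate, you don't need raw scores to get a standard deviation: for a Bernoulli variable it is √(p(1 − p)). A usage sketch with the satisfaction numbers from the example later in this guide:

    const satisfactionTest = calculateSignificance({
      control:   { sampleSize: 2000, mean: 0.82, stdDev: Math.sqrt(0.82 * 0.18) },
      treatment: { sampleSize: 2000, mean: 0.84, stdDev: Math.sqrt(0.84 * 0.16) },
    })

    console.log(satisfactionTest)
    // ≈ { pValue: 0.09, significant: false, improvement: 0.024 }
    // A 2-point lift at n = 2,000 per arm does NOT clear p < 0.05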

Sample Size Calculator

    function calculateRequiredSampleSize(
      baselineRate: number,  // e.g., 0.80 (80% satisfaction)
      minDetectableEffect: number,  // e.g., 0.05 (5% relative improvement)
      alpha: number = 0.05,  // Significance level
      power: number = 0.80   // Statistical power
    ): number {
      // Simplified two-proportion calculation; the z-scores are
      // hard-coded for alpha = 0.05 and power = 0.80
      const p1 = baselineRate
      const p2 = baselineRate * (1 + minDetectableEffect)

      const zAlpha = 1.96  // For 95% confidence
      const zBeta = 0.84   // For 80% power

      const numerator = (zAlpha + zBeta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
      const denominator = (p2 - p1) ** 2

      return Math.ceil(numerator / denominator)
    }

    // Example: ~1,443 samples per variant to detect a 5% relative improvement
    const required = calculateRequiredSampleSize(0.80, 0.05)
    console.log(`Required sample size: ${required} per variant`)

Running the Experiment

Complete A/B Test Framework

    interface ExperimentConfig {
      testId: string
      controlModel: string
      treatmentModel: string
      minSampleSize: number
    }

    type VariantMetrics = { cost: number; latency: number; quality: number }
    type Results = { control: VariantMetrics[]; treatment: VariantMetrics[] }

    // getNextRequest, evaluateQuality, compare*, summarize and
    // makeRecommendation are app-specific helpers
    class LLMABTest {
      async runExperiment(config: ExperimentConfig) {
        const results: Results = { control: [], treatment: [] }

        // Run until BOTH variants have enough samples
        while (
          results.control.length < config.minSampleSize ||
          results.treatment.length < config.minSampleSize
        ) {
          const request = await this.getNextRequest()
          const variant = this.assignVariant(request.userId, config.testId)

          const model = variant === 'control'
            ? config.controlModel
            : config.treatmentModel

          // Make request
          const startTime = Date.now()
          const response = await callLLM(model, request.prompt)
          const latency = Date.now() - startTime

          // Collect metrics
          const metrics = {
            cost: calculateCost(model, response.usage),
            latency,
            quality: await evaluateQuality(response, request.expectedOutput)
          }

          results[variant].push(metrics)

          // Log for analysis
          await this.logExperimentResult({
            testId: config.testId,
            variant,
            model,
            ...metrics
          })
        }

        // Analyze results
        return this.analyzeResults(results)
      }

      private async analyzeResults(results: Results) {
        // Cost analysis
        const costAnalysis = this.compareCosts(results)

        // Quality analysis
        const qualityAnalysis = this.compareQuality(results)

        // Latency analysis
        const latencyAnalysis = this.compareLatency(results)

        // Statistical significance
        const significance = calculateSignificance({
          control: this.summarize(results.control),
          treatment: this.summarize(results.treatment)
        })

        return {
          costAnalysis,
          qualityAnalysis,
          latencyAnalysis,
          significance,
          recommendation: this.makeRecommendation(
            costAnalysis,
            qualityAnalysis,
            significance
          )
        }
      }
    }
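
Wiring it together might look like the following (a sketch: it assumes the app-specific helpers referenced above, such as `getNextRequest` and `evaluateQuality`, are implemented):

    const experiment = new LLMABTest()

    const report = await experiment.runExperiment({
      testId: 'llm-comparison-test',
      controlModel: 'gpt-4-turbo',
      treatmentModel: 'claude-3-opus',
      minSampleSize: 1443,  // from calculateRequiredSampleSize(0.80, 0.05)
    })

    console.log(report.recommendation)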

Avoiding Common Pitfalls

Pitfall 1: Selection Bias

    // BAD: Only assigning variants to requests that already succeeded
    // (this hides each model's failure modes)
    if (request.successful) {
      testVariant = assignVariant(request.userId, testId)
    }

    // GOOD: Assign a variant to every request
    testVariant = assignVariant(request.userId, testId)

Pitfall 2: Novelty Effect

    // Run test for minimum duration to avoid novelty effects
    const MIN_TEST_DURATION = 7 * 24 * 60 * 60 * 1000  // 7 days
    
    async function shouldConcludeTest(testId: string): Promise<boolean> {
      const test = await getTest(testId)
      const duration = Date.now() - test.startTime
    
      // Must run for minimum duration AND have enough samples
      return duration >= MIN_TEST_DURATION && test.sampleSize >= test.minRequired
    }

Pitfall 3: Peeking at Results

    // BAD: Stopping test as soon as you see significance
    if (currentResults.pValue < 0.05) {
      return concludeTest()  // Multiple testing problem!
    }
    
    // GOOD: Pre-define sample size and stick to it
    if (currentSamples >= predefinedSampleSize) {
      return analyzeResults()
    }

Real-World Example

    // A/B test: GPT-4 Turbo vs Claude 3 Opus for customer support

    const testResults = {
      control: {  // GPT-4 Turbo
        sampleSize: 2000,
        avgCost: 0.012,         // USD per request
        avgLatency: 420,        // ms
        avgQuality: 0.87,
        userSatisfaction: 0.82
      },
      treatment: {  // Claude 3 Opus
        sampleSize: 2000,
        avgCost: 0.058,         // USD per request
        avgLatency: 650,        // ms
        avgQuality: 0.89,
        userSatisfaction: 0.84
      }
    }

    // Analysis
    const costDiff = testResults.treatment.avgCost - testResults.control.avgCost
    const qualityDiff = testResults.treatment.avgQuality - testResults.control.avgQuality

    console.log(`
    📊 A/B Test Results:

    Cost:
    - GPT-4: $${testResults.control.avgCost}/request
    - Claude: $${testResults.treatment.avgCost}/request
    - Difference: +${((costDiff / testResults.control.avgCost) * 100).toFixed(1)}% (Claude more expensive)

    Quality:
    - GPT-4: ${(testResults.control.avgQuality * 100).toFixed(1)}%
    - Claude: ${(testResults.treatment.avgQuality * 100).toFixed(1)}%
    - Difference: +${((qualityDiff / testResults.control.avgQuality) * 100).toFixed(1)}% (Claude better)

    User Satisfaction:
    - GPT-4: ${(testResults.control.userSatisfaction * 100).toFixed(1)}%
    - Claude: ${(testResults.treatment.userSatisfaction * 100).toFixed(1)}%
    - Difference: +2.4% relative (not significant at this sample size; p ≈ 0.09)

    💡 Recommendation: Stick with GPT-4 Turbo
    - Nearly 5x cheaper
    - The ~2% quality improvement with Claude doesn't justify a 383% cost increase
    - The user satisfaction lift (2.4%) doesn't reach statistical significance
    `)

Key Takeaways

  • **Don't trust benchmarks alone** - A/B test with your real data
  • **Define metrics upfront** - Know what success looks like
  • **Calculate required sample size** - Don't conclude too early
  • **Test statistical significance** - Avoid false positives
  • **Avoid common pitfalls** - Selection bias, novelty effects, peeking
  • **Consider total cost** - Quality improvements must justify price increases

A/B testing is the scientific way to choose an LLM. Trust the data, not the hype.

ab-testing · experimentation · quality-metrics · production
