A/B Testing LLM Models in Production: A Complete Guide
How to scientifically compare GPT-4 vs Claude vs Gemini in your actual use case. Includes statistical significance testing and code examples.
ReForge Team
Engineering
Benchmarks are great, but they don't tell you which LLM works best for **your** specific use case. The only way to know? A/B test in production with your actual users and data.
This guide will show you how to set up rigorous A/B tests for LLMs, measure what matters, and make data-driven decisions about which models to use.
Why A/B Test LLMs?
Benchmarks Don't Tell the Whole Story
Public benchmarks like MMLU and HumanEval are useful, but:
- They measure general capability, not performance on your domain, your prompts, and your users
- They say nothing about cost or latency at your traffic volumes
- Top-scoring models may have seen benchmark data during training, inflating their scores
- A few points of benchmark difference rarely predicts which model your users will actually prefer
Real Example
A team chose Claude 3 Opus because it scored highest on benchmarks. After A/B testing, they found GPT-4 Turbo performed 15% better **for their specific use case** (customer support) and cost 5x less.
Setting Up A/B Tests
Define Your Hypothesis
interface Hypothesis {
control: 'gpt-4-turbo' // Current model
treatment: 'claude-3-opus' // New model to test
metric: 'user_satisfaction' // Primary metric
expectedImprovement: 0.10 // 10% improvement
minSampleSize: 1000 // Minimum requests per variant
}

Implement Random Assignment
import { createHash } from 'crypto'

class ABTest {
assignVariant(userId: string, testId: string): 'control' | 'treatment' {
// Consistent hashing ensures same user always gets same variant
const hash = createHash('md5')
.update(`${testId}:${userId}`)
.digest('hex')
const num = parseInt(hash.substring(0, 8), 16)
return num % 2 === 0 ? 'control' : 'treatment'
}
async handleRequest(userId: string, prompt: string) {
const variant = this.assignVariant(userId, 'llm-comparison-test')
const model = variant === 'control' ? 'gpt-4-turbo' : 'claude-3-opus'
const response = await callLLM(model, prompt)
// Log for analysis
await this.logResult({
userId,
variant,
model,
prompt,
response,
timestamp: new Date()
})
return response
}
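// Optional extension (an assumption, not part of the original setup): a weighted
// split for gradual rollout. The same consistent hash can route only a small share
// of traffic to the treatment while you build confidence in the new model.
assignVariantWeighted(
userId: string,
testId: string,
treatmentShare = 0.1 // e.g. start with 10% of users on the new model
): 'control' | 'treatment' {
const hash = createHash('md5')
.update(`${testId}:${userId}`)
.digest('hex')
// Map the first 8 hex characters to a number in [0, 1)
const bucket = parseInt(hash.substring(0, 8), 16) / 0x100000000
return bucket < treatmentShare ? 'treatment' : 'control'
}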
}

Defining Success Metrics
Quantitative Metrics
interface Metrics {
// Cost metrics
costPerRequest: number
monthlySpend: number
// Performance metrics
latency: number
errorRate: number
// Quality metrics (computed)
bleuScore?: number
rougeScore?: number
coherence?: number
}

Qualitative Metrics
interface UserFeedback {
userId: string
variant: 'control' | 'treatment'
rating: 1 | 2 | 3 | 4 | 5 // Thumbs up/down or stars
timestamp: Date
}
// Track user satisfaction
async function collectFeedback(requestId: string, rating: number) {
const request = await db.requests.findUnique({ where: { id: requestId } })
await db.feedback.create({
data: {
userId: request.userId,
variant: request.variant,
rating,
timestamp: new Date()
}
})
}

Statistical Significance
Don't make decisions on insufficient data:
interface TestResults {
control: {
sampleSize: number
mean: number
stdDev: number
}
treatment: {
sampleSize: number
mean: number
stdDev: number
}
}
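The significance test below references two helpers that this post never defines. Here is a minimal sketch of both: summarize reduces raw per-request scores to the summary shape above, and calculatePValue uses a normal approximation to the t-distribution, which is reasonable once each variant has at least a few hundred samples. The names match the calls used later, but the implementations are illustrative assumptions, not canonical code; for small samples, use a stats library with the t-distribution CDF and Welch–Satterthwaite degrees of freedom.

// Hypothetical helper: reduce raw per-request scores to the summary shape above
function summarize(samples: number[]): TestResults['control'] {
  const n = samples.length
  const mean = samples.reduce((sum, x) => sum + x, 0) / n
  // Sample variance (n - 1 in the denominator)
  const variance = samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1)
  return { sampleSize: n, mean, stdDev: Math.sqrt(variance) }
}

// Hypothetical helper: two-sided p-value for the t-statistic, using the
// standard normal as a large-sample approximation
function calculatePValue(tStatistic: number): number {
  return 2 * (1 - standardNormalCdf(Math.abs(tStatistic)))
}

function standardNormalCdf(z: number): number {
  // Abramowitz & Stegun 7.1.26 approximation of erf (valid for non-negative input)
  const x = z / Math.SQRT2
  const t = 1 / (1 + 0.3275911 * x)
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))))
  const erf = 1 - poly * Math.exp(-x * x)
  return 0.5 * (1 + erf)
}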
function calculateSignificance(results: TestResults): {
pValue: number
significant: boolean
improvement: number
} {
// Welch's t-test for unequal variances
const { control, treatment } = results
const meanDiff = treatment.mean - control.mean
const pooledSE = Math.sqrt(
(control.stdDev ** 2 / control.sampleSize) +
(treatment.stdDev ** 2 / treatment.sampleSize)
)
const tStatistic = meanDiff / pooledSE
// Calculate p-value (simplified)
const pValue = calculatePValue(tStatistic)
return {
pValue,
significant: pValue < 0.05, // 95% confidence
improvement: meanDiff / control.mean
}
}

Sample Size Calculator
function calculateRequiredSampleSize(
baselineRate: number, // e.g., 0.80 (80% satisfaction)
minDetectableEffect: number, // e.g., 0.05 (5% improvement)
alpha: number = 0.05, // Significance level
power: number = 0.80 // Statistical power
): number {
// Simplified calculation
const p1 = baselineRate
const p2 = baselineRate * (1 + minDetectableEffect)
const zAlpha = 1.96 // For 95% confidence
const zBeta = 0.84 // For 80% power
const numerator = (zAlpha + zBeta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
const denominator = (p2 - p1) ** 2
return Math.ceil(numerator / denominator)
}
// Example: ≈1,443 samples per variant to detect a 5% relative improvement on an 80% baseline
const required = calculateRequiredSampleSize(0.80, 0.05)
console.log(`Required sample size: ${required} per variant`)

Running the Experiment
Complete A/B Test Framework
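The framework sketch below leans on a few pieces this post doesn't define: an ExperimentConfig shape, a cost calculator, and a quality scorer. The following is a minimal sketch under stated assumptions — the field names, the judge model, and the per-million-token prices are illustrative placeholders, and the response.text / response.usage shape is assumed from how callLLM is used elsewhere in this post.

interface ExperimentConfig {
  testId: string
  controlModel: string
  treatmentModel: string
  minSampleSize: number
}

// Illustrative per-million-token prices — substitute your provider's current pricing
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  'gpt-4-turbo': { inputPerM: 10, outputPerM: 30 },
  'claude-3-opus': { inputPerM: 15, outputPerM: 75 }
}

function calculateCost(
  model: string,
  usage: { promptTokens: number; completionTokens: number }
): number {
  const price = PRICING[model]
  return (
    (usage.promptTokens / 1_000_000) * price.inputPerM +
    (usage.completionTokens / 1_000_000) * price.outputPerM
  )
}

// Quality scoring via an LLM judge: grade the response against the expected
// output on a 0–1 scale, reusing the callLLM helper assumed throughout
async function evaluateQuality(
  response: { text: string },
  expectedOutput?: string
): Promise<number> {
  if (!expectedOutput) return NaN // nothing to grade against
  const judgePrompt =
    `Rate from 0 to 1 how well RESPONSE satisfies EXPECTED. Reply with only the number.\n\n` +
    `EXPECTED: ${expectedOutput}\nRESPONSE: ${response.text}`
  const judgement = await callLLM('gpt-4-turbo', judgePrompt)
  const score = parseFloat(judgement.text)
  return Number.isFinite(score) ? Math.min(Math.max(score, 0), 1) : NaN
}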
class LLMABTest {
async runExperiment(config: ExperimentConfig) {
const results = { control: [], treatment: [] }
// Run until both variants have enough samples
while (
results.control.length < config.minSampleSize ||
results.treatment.length < config.minSampleSize
) {
const request = await this.getNextRequest()
const variant = this.assignVariant(request.userId, config.testId)
const model = variant === 'control'
? config.controlModel
: config.treatmentModel
// Make request
const startTime = Date.now()
const response = await callLLM(model, request.prompt)
const latency = Date.now() - startTime
// Collect metrics
const metrics = {
cost: calculateCost(model, response.usage),
latency,
quality: await evaluateQuality(response, request.expectedOutput)
}
results[variant].push(metrics)
// Log for analysis
await this.logExperimentResult({
testId: config.testId,
variant,
model,
...metrics
})
}
// Analyze results
return this.analyzeResults(results)
}
private async analyzeResults(results: Results) {
// Cost analysis
const costAnalysis = this.compareCosts(results)
// Quality analysis
const qualityAnalysis = this.compareQuality(results)
// Latency analysis
const latencyAnalysis = this.compareLatency(results)
// Statistical significance
const significance = calculateSignificance({
control: this.summarize(results.control),
treatment: this.summarize(results.treatment)
})
return {
costAnalysis,
qualityAnalysis,
latencyAnalysis,
significance,
recommendation: this.makeRecommendation(
costAnalysis,
qualityAnalysis,
significance
)
}
}
}

Avoiding Common Pitfalls
Pitfall 1: Selection Bias
// BAD: Only testing on successful requests
if (request.successful) {
testVariant = assignVariant()
}
// GOOD: Test on all requests
testVariant = assignVariant()

Pitfall 2: Novelty Effect
// Run test for minimum duration to avoid novelty effects
const MIN_TEST_DURATION = 7 * 24 * 60 * 60 * 1000 // 7 days
async function shouldConcludeTest(testId: string): Promise<boolean> {
const test = await getTest(testId)
const duration = Date.now() - test.startTime
// Must run for minimum duration AND have enough samples
return duration >= MIN_TEST_DURATION && test.sampleSize >= test.minRequired
}

Pitfall 3: Peeking at Results
// BAD: Stopping test as soon as you see significance
if (currentResults.pValue < 0.05) {
return concludeTest() // Multiple testing problem!
}
// GOOD: Pre-define sample size and stick to it
if (currentSamples >= predefinedSampleSize) {
return analyzeResults()
}

Real-World Example
// A/B test: GPT-4 vs Claude 3 Opus for customer support
const testResults = {
control: { // GPT-4 Turbo
sampleSize: 2000,
avgCost: 0.012,
avgLatency: 420,
avgQuality: 0.87,
userSatisfaction: 0.82
},
treatment: { // Claude 3 Opus
sampleSize: 2000,
avgCost: 0.058,
avgLatency: 650,
avgQuality: 0.89,
userSatisfaction: 0.84
}
}
// Analysis
const costDiff = testResults.treatment.avgCost - testResults.control.avgCost
const qualityDiff = testResults.treatment.avgQuality - testResults.control.avgQuality
console.log(`
📊 A/B Test Results:

Cost:
- GPT-4: $${testResults.control.avgCost}/request
- Claude: $${testResults.treatment.avgCost}/request
- Difference: +${((costDiff / testResults.control.avgCost) * 100).toFixed(1)}% (Claude more expensive)

Quality:
- GPT-4: ${(testResults.control.avgQuality * 100).toFixed(1)}%
- Claude: ${(testResults.treatment.avgQuality * 100).toFixed(1)}%
- Difference: +${((qualityDiff / testResults.control.avgQuality) * 100).toFixed(1)}% (Claude better)

User Satisfaction:
- GPT-4: ${(testResults.control.userSatisfaction * 100).toFixed(1)}%
- Claude: ${(testResults.treatment.userSatisfaction * 100).toFixed(1)}%
- Difference: +2.4% (statistically significant, p < 0.01)

💡 Recommendation: Stick with GPT-4 Turbo
- Roughly 5x cheaper
- The 2% quality improvement with Claude doesn't justify a 383% cost increase
- The user satisfaction improvement (2.4%) is marginal
`)

Key Takeaways
A/B testing is the scientific way to choose LLMs. Trust the data, not the hype.