A/B Testing LLM Models in Production: A Complete Guide
How to scientifically compare GPT-4 vs Claude vs Gemini in your actual use case. Includes statistical significance testing and code examples.
ReForge Team
Engineering
Benchmarks are great, but they don't tell you which LLM works best for **your** specific use case. The only way to know? A/B test in production with your actual users and data.
This guide will show you how to set up rigorous A/B tests for LLMs, measure what matters, and make data-driven decisions about which models to use.
Why A/B Test LLMs?
Benchmarks Don't Tell the Whole Story
Public benchmarks like MMLU and HumanEval are useful, but:
- They measure general capability, not performance on your domain, your prompts, and your users
- They say nothing about cost or latency at your traffic volumes
- Top-scoring models may have seen benchmark data during training, inflating their scores
- A few points of benchmark difference rarely predicts which model your users will actually prefer
Real Example
A team chose Claude 3 Opus because it scored highest on benchmarks. After A/B testing, they found GPT-4 Turbo performed 15% better **for their specific use case** (customer support) and cost 5x less.
Setting Up A/B Tests
Define Your Hypothesis
interface Hypothesis {
control: 'gpt-4-turbo' // Current model
treatment: 'claude-3-opus' // New model to test
metric: 'user_satisfaction' // Primary metric
expectedImprovement: 0.10 // 10% improvement
minSampleSize: 1000 // Minimum requests per variant
}

Implement Random Assignment
import { createHash } from 'crypto'

class ABTest {
assignVariant(userId: string, testId: string): 'control' | 'treatment' {
// Consistent hashing ensures same user always gets same variant
const hash = createHash('md5')
.update(`${testId}:${userId}`)
.digest('hex')
const num = parseInt(hash.substring(0, 8), 16)
return num % 2 === 0 ? 'control' : 'treatment'
}
async handleRequest(userId: string, prompt: string) {
const variant = this.assignVariant(userId, 'llm-comparison-test')
const model = variant === 'control' ? 'gpt-4-turbo' : 'claude-3-opus'
const response = await callLLM(model, prompt)
// Log for analysis
await this.logResult({
userId,
variant,
model,
prompt,
response,
timestamp: new Date()
})
return response
}
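// Optional extension (an assumption, not part of the original setup): a weighted
// split for gradual rollout. The same consistent hash can route only a small share
// of traffic to the treatment while you build confidence in the new model.
assignVariantWeighted(
userId: string,
testId: string,
treatmentShare = 0.1 // e.g. start with 10% of users on the new model
): 'control' | 'treatment' {
const hash = createHash('md5')
.update(`${testId}:${userId}`)
.digest('hex')
// Map the first 8 hex characters to a number in [0, 1)
const bucket = parseInt(hash.substring(0, 8), 16) / 0x100000000
return bucket < treatmentShare ? 'treatment' : 'control'
}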
}

Defining Success Metrics
Quantitative Metrics
interface Metrics {
// Cost metrics
costPerRequest: number
monthlySpend: number
// Performance metrics
latency: number
errorRate: number
// Quality metrics (computed)
bleuScore?: number
rougeScore?: number
coherence?: number
}

Qualitative Metrics
interface UserFeedback {
userId: string
variant: 'control' | 'treatment'
rating: 1 | 2 | 3 | 4 | 5 // Thumbs up/down or stars
timestamp: Date
}
// Track user satisfaction
async function collectFeedback(requestId: string, rating: number) {
const request = await db.requests.findUnique({ where: { id: requestId } })
await db.feedback.create({
data: {
userId: request.userId,
variant: request.variant,
rating,
timestamp: new Date()
}
})
}

Statistical Significance
Don't make decisions on insufficient data:
interface TestResults {
control: {
sampleSize: number
mean: number
stdDev: number
}
treatment: {
sampleSize: number
mean: number
stdDev: number
}
}
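The significance test below references two helpers that this post never defines. Here is a minimal sketch of both: summarize reduces raw per-request scores to the summary shape above, and calculatePValue uses a normal approximation to the t-distribution, which is reasonable once each variant has at least a few hundred samples. The names match the calls used later, but the implementations are illustrative assumptions, not canonical code; for small samples, use a stats library with the t-distribution CDF and Welch–Satterthwaite degrees of freedom.

// Hypothetical helper: reduce raw per-request scores to the summary shape above
function summarize(samples: number[]): TestResults['control'] {
  const n = samples.length
  const mean = samples.reduce((sum, x) => sum + x, 0) / n
  // Sample variance (n - 1 in the denominator)
  const variance = samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1)
  return { sampleSize: n, mean, stdDev: Math.sqrt(variance) }
}

// Hypothetical helper: two-sided p-value for the t-statistic, using the
// standard normal as a large-sample approximation
function calculatePValue(tStatistic: number): number {
  return 2 * (1 - standardNormalCdf(Math.abs(tStatistic)))
}

function standardNormalCdf(z: number): number {
  // Abramowitz & Stegun 7.1.26 approximation of erf (valid for non-negative input)
  const x = z / Math.SQRT2
  const t = 1 / (1 + 0.3275911 * x)
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))))
  const erf = 1 - poly * Math.exp(-x * x)
  return 0.5 * (1 + erf)
}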
function calculateSignificance(results: TestResults): {
pValue: number
significant: boolean
improvement: number
} {
// Welch's t-test for unequal variances
const { control, treatment } = results
const meanDiff = treatment.mean - control.mean
const pooledSE = Math.sqrt(
(control.stdDev ** 2 / control.sampleSize) +
(treatment.stdDev ** 2 / treatment.sampleSize)
)
const tStatistic = meanDiff / pooledSE
// Calculate p-value (simplified)
const pValue = calculatePValue(tStatistic)
return {
pValue,
significant: pValue < 0.05, // 95% confidence
improvement: meanDiff / control.mean
}
}

Sample Size Calculator
function calculateRequiredSampleSize(
baselineRate: number, // e.g., 0.80 (80% satisfaction)
minDetectableEffect: number, // e.g., 0.05 (5% improvement)
alpha: number = 0.05, // Significance level
power: number = 0.80 // Statistical power
): number {
// Simplified calculation
const p1 = baselineRate
const p2 = baselineRate * (1 + minDetectableEffect)
const zAlpha = 1.96 // For 95% confidence
const zBeta = 0.84 // For 80% power
const numerator = (zAlpha + zBeta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
const denominator = (p2 - p1) ** 2
return Math.ceil(numerator / denominator)
}
// Example: ≈1,443 samples per variant to detect a 5% relative improvement on an 80% baseline
const required = calculateRequiredSampleSize(0.80, 0.05)
console.log(`Required sample size: ${required} per variant`)

Running the Experiment
Complete A/B Test Framework
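The framework sketch below leans on a few pieces this post doesn't define: an ExperimentConfig shape, a cost calculator, and a quality scorer. The following is a minimal sketch under stated assumptions — the field names, the judge model, and the per-million-token prices are illustrative placeholders, and the response.text / response.usage shape is assumed from how callLLM is used elsewhere in this post.

interface ExperimentConfig {
  testId: string
  controlModel: string
  treatmentModel: string
  minSampleSize: number
}

// Illustrative per-million-token prices — substitute your provider's current pricing
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  'gpt-4-turbo': { inputPerM: 10, outputPerM: 30 },
  'claude-3-opus': { inputPerM: 15, outputPerM: 75 }
}

function calculateCost(
  model: string,
  usage: { promptTokens: number; completionTokens: number }
): number {
  const price = PRICING[model]
  return (
    (usage.promptTokens / 1_000_000) * price.inputPerM +
    (usage.completionTokens / 1_000_000) * price.outputPerM
  )
}

// Quality scoring via an LLM judge: grade the response against the expected
// output on a 0–1 scale, reusing the callLLM helper assumed throughout
async function evaluateQuality(
  response: { text: string },
  expectedOutput?: string
): Promise<number> {
  if (!expectedOutput) return NaN // nothing to grade against
  const judgePrompt =
    `Rate from 0 to 1 how well RESPONSE satisfies EXPECTED. Reply with only the number.\n\n` +
    `EXPECTED: ${expectedOutput}\nRESPONSE: ${response.text}`
  const judgement = await callLLM('gpt-4-turbo', judgePrompt)
  const score = parseFloat(judgement.text)
  return Number.isFinite(score) ? Math.min(Math.max(score, 0), 1) : NaN
}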
class LLMABTest {
async runExperiment(config: ExperimentConfig) {
const results = { control: [], treatment: [] }
// Run until both variants have enough samples
while (
results.control.length < config.minSampleSize ||
results.treatment.length < config.minSampleSize
) {
const request = await this.getNextRequest()
const variant = this.assignVariant(request.userId, config.testId)
const model = variant === 'control'
? config.controlModel
: config.treatmentModel
// Make request
const startTime = Date.now()
const response = await callLLM(model, request.prompt)
const latency = Date.now() - startTime
// Collect metrics
const metrics = {
cost: calculateCost(model, response.usage),
latency,
quality: await evaluateQuality(response, request.expectedOutput)
}
results[variant].push(metrics)
// Log for analysis
await this.logExperimentResult({
testId: config.testId,
variant,
model,
...metrics
})
}
// Analyze results
return this.analyzeResults(results)
}
private async analyzeResults(results: Results) {
// Cost analysis
const costAnalysis = this.compareCosts(results)
// Quality analysis
const qualityAnalysis = this.compareQuality(results)
// Latency analysis
const latencyAnalysis = this.compareLatency(results)
// Statistical significance
const significance = calculateSignificance({
control: this.summarize(results.control),
treatment: this.summarize(results.treatment)
})
return {
costAnalysis,
qualityAnalysis,
latencyAnalysis,
significance,
recommendation: this.makeRecommendation(
costAnalysis,
qualityAnalysis,
significance
)
}
}
}

Avoiding Common Pitfalls
Pitfall 1: Selection Bias
// BAD: Only testing on successful requests
if (request.successful) {
testVariant = assignVariant()
}
// GOOD: Test on all requests
testVariant = assignVariant()

Pitfall 2: Novelty Effect
// Run test for minimum duration to avoid novelty effects
const MIN_TEST_DURATION = 7 * 24 * 60 * 60 * 1000 // 7 days
async function shouldConcludeTest(testId: string): Promise<boolean> {
const test = await getTest(testId)
const duration = Date.now() - test.startTime
// Must run for minimum duration AND have enough samples
return duration >= MIN_TEST_DURATION && test.sampleSize >= test.minRequired
}

Pitfall 3: Peeking at Results
// BAD: Stopping test as soon as you see significance
if (currentResults.pValue < 0.05) {
return concludeTest() // Multiple testing problem!
}
// GOOD: Pre-define sample size and stick to it
if (currentSamples >= predefinedSampleSize) {
return analyzeResults()
}

Real-World Example
// A/B test: GPT-4 vs Claude 3 Opus for customer support
const testResults = {
control: { // GPT-4 Turbo
sampleSize: 2000,
avgCost: 0.012,
avgLatency: 420,
avgQuality: 0.87,
userSatisfaction: 0.82
},
treatment: { // Claude 3 Opus
sampleSize: 2000,
avgCost: 0.058,
avgLatency: 650,
avgQuality: 0.89,
userSatisfaction: 0.84
}
}
// Analysis
const costDiff = testResults.treatment.avgCost - testResults.control.avgCost
const qualityDiff = testResults.treatment.avgQuality - testResults.control.avgQuality
console.log(`
📊 A/B Test Results:

Cost:
- GPT-4: $${testResults.control.avgCost}/request
- Claude: $${testResults.treatment.avgCost}/request
- Difference: +${((costDiff / testResults.control.avgCost) * 100).toFixed(1)}% (Claude more expensive)

Quality:
- GPT-4: ${(testResults.control.avgQuality * 100).toFixed(1)}%
- Claude: ${(testResults.treatment.avgQuality * 100).toFixed(1)}%
- Difference: +${((qualityDiff / testResults.control.avgQuality) * 100).toFixed(1)}% (Claude better)

User Satisfaction:
- GPT-4: ${(testResults.control.userSatisfaction * 100).toFixed(1)}%
- Claude: ${(testResults.treatment.userSatisfaction * 100).toFixed(1)}%
- Difference: +2.4% (statistically significant, p < 0.01)

💡 Recommendation: Stick with GPT-4 Turbo
- Roughly 5x cheaper
- The 2% quality improvement with Claude doesn't justify a 383% cost increase
- The user satisfaction improvement (2.4%) is marginal
`)

Key Takeaways
A/B testing is the scientific way to choose LLMs. Trust the data, not the hype.