Building Cost-Aware AI Features: Best Practices from Production
Learn how to design AI features with cost in mind from day one. Includes architecture patterns, code examples, and real-world case studies.
ReForge Team
Engineering
Building AI features without considering cost is a recipe for disaster. We've seen teams ship features that work great functionally but quietly drain the budget. Don't let that be you.
In this guide, we'll show you how to design cost-aware AI features from day one, with patterns and practices from production systems handling millions of requests.
Cost-Aware Architecture Principles
Principle 1: Tier-Based Routing
Different user tiers should get different service levels:
```typescript
enum UserTier {
  FREE = 'free',
  PRO = 'pro',
  ENTERPRISE = 'enterprise'
}

function routeByTier(tier: UserTier, prompt: string) {
  switch (tier) {
    case UserTier.FREE:
      return {
        model: 'gpt-3.5-turbo',
        maxTokens: 500,
        priority: 'low'
      }
    case UserTier.PRO:
      return {
        model: 'gpt-4-turbo',
        maxTokens: 2000,
        priority: 'normal'
      }
    case UserTier.ENTERPRISE:
      return {
        model: 'gpt-4-turbo',
        maxTokens: 4000,
        priority: 'high',
        sla: '99.9%'
      }
  }
}
```

Principle 2: Budget Enforcement
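The `checkBudget` call below expects a cost estimate before the request is made. Here is a minimal estimator sketch, assuming a rough 4-characters-per-token heuristic and illustrative per-1K-token prices (verify against your provider's current price sheet):

```typescript
// Illustrative prices in dollars per 1K input tokens; substitute real rates.
const PRICE_PER_1K_INPUT: Record<string, number> = {
  'gpt-3.5-turbo': 0.0005,
  'gpt-4-turbo': 0.01
}

// Rough heuristic: ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

function estimateCost(prompt: string, model: string, maxOutputTokens: number): number {
  const inputTokens = estimateTokens(prompt)
  const pricePerK = PRICE_PER_1K_INPUT[model] ?? 0.01
  const inputCost = (inputTokens / 1000) * pricePerK
  // Worst case: the model uses the full output budget; output tokens
  // typically cost roughly 3x input tokens (assumption, not a quote).
  const outputCost = (maxOutputTokens / 1000) * pricePerK * 3
  return inputCost + outputCost
}
```

Over-estimating slightly is the safe direction here: a pessimistic estimate can only reject a request early, never let one slip past the budget.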
Set hard limits to prevent cost overruns:
```typescript
class BudgetEnforcer {
  async checkBudget(userId: string, estimatedCost: number): Promise<boolean> {
    const usage = await this.getMonthlyUsage(userId)
    const limit = await this.getUserBudgetLimit(userId)

    if (usage.totalCost + estimatedCost > limit) {
      throw new BudgetExceededError(
        `Monthly budget exceeded: $${usage.totalCost.toFixed(2)} / $${limit}`
      )
    }
    return true
  }

  async trackUsage(userId: string, cost: number) {
    await db.llm_usage.create({
      user_id: userId,
      cost,
      timestamp: new Date()
    })

    // Alert the user at 80% of their monthly budget
    const usage = await this.getMonthlyUsage(userId)
    const limit = await this.getUserBudgetLimit(userId)
    if (usage.totalCost > limit * 0.8) {
      await this.sendBudgetAlert(userId, usage.totalCost, limit)
    }
  }
}
```

Principle 3: Rate Limiting
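Rate limiting is, at its core, a token bucket: each user holds a budget of requests that refills over time. For intuition, here is a minimal in-memory sketch of the idea (the library used below handles this for you, plus distributed backends):

```typescript
class TokenBucket {
  private tokens: number
  private lastRefill: number

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity
    this.lastRefill = Date.now()
  }

  // Refill based on elapsed time, then try to spend one token.
  tryConsume(): boolean {
    const now = Date.now()
    const elapsedSeconds = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond)
    this.lastRefill = now

    if (this.tokens >= 1) {
      this.tokens -= 1
      return true
    }
    return false
  }
}
```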
Prevent abuse and control costs:
```typescript
import { RateLimiterMemory } from 'rate-limiter-flexible'

const limiter = new RateLimiterMemory({
  points: 100,      // 100 requests
  duration: 3600,   // per hour
  blockDuration: 0
})

async function rateLimitedLLMCall(userId: string, prompt: string) {
  try {
    await limiter.consume(userId, 1)
    return await callLLM(prompt)
  } catch (error) {
    throw new RateLimitError('Rate limit exceeded. Try again in an hour.')
  }
}
```

Feature Flags for Cost Control
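Percentage rollouts need a stable per-user hash; the `hashUserId` call in the rollout check below is never defined in this post. A minimal FNV-1a sketch (any stable string hash works equally well):

```typescript
// FNV-1a: a simple, deterministic 32-bit string hash (illustrative choice).
function hashUserId(userId: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash >>> 0
}

// Deterministic bucketing: the same user always lands in the same bucket,
// so a user's feature access doesn't flap between requests.
function inRollout(userId: string, rolloutPercentage: number): boolean {
  return hashUserId(userId) % 100 < rolloutPercentage
}
```

Avoid `Math.random()` here: a non-deterministic check would enable and disable the feature for the same user on successive requests.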
Use feature flags to control rollout and costs:
```typescript
interface FeatureConfig {
  enabled: boolean
  rolloutPercentage: number
  maxCostPerDay: number
  fallbackBehavior: 'disable' | 'downgrade'
}

class FeatureManager {
  async shouldEnableFeature(
    featureId: string,
    userId: string
  ): Promise<boolean> {
    const config = await this.getFeatureConfig(featureId)

    // Feature disabled globally
    if (!config.enabled) return false

    // Check the daily cost budget
    const todayCost = await this.getFeatureCost(featureId, 'today')
    if (todayCost > config.maxCostPerDay) {
      if (config.fallbackBehavior === 'disable') {
        return false
      }
      // Downgrade to a cheaper alternative
      return this.useCheaperAlternative(featureId)
    }

    // Gradual rollout
    const userHash = this.hashUserId(userId)
    return userHash % 100 < config.rolloutPercentage
  }
}
```

Cost Attribution
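Attribution records need a dollar figure per request, and several snippets in this post call a `calculateCost` helper that is never shown. A minimal sketch converting a response's token usage into dollars, with illustrative prices (check your provider's current rates):

```typescript
interface TokenUsage {
  prompt_tokens: number
  completion_tokens: number
}

// Illustrative per-1K-token prices; substitute your provider's price sheet.
const PRICING: Record<string, { input: number; output: number }> = {
  'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
  'gpt-4-turbo': { input: 0.01, output: 0.03 }
}

function calculateCost(usage: TokenUsage, model = 'gpt-4-turbo'): number {
  const price = PRICING[model]
  if (!price) throw new Error(`No pricing configured for model: ${model}`)
  // Input and output tokens are priced separately.
  return (
    (usage.prompt_tokens / 1000) * price.input +
    (usage.completion_tokens / 1000) * price.output
  )
}
```

Keeping the price table in config rather than code is worth the extra plumbing: providers change prices, and a stale hardcoded table silently corrupts every attribution record.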
Track costs by feature, user, and tenant:
```typescript
interface CostAttribution {
  userId: string
  feature: string
  model: string
  cost: number
  timestamp: Date
}

class CostTracker {
  async trackCost(attribution: CostAttribution) {
    await db.cost_attribution.create(attribution)

    // Real-time aggregation
    await this.updateDailyCosts(attribution.feature)
    await this.updateUserCosts(attribution.userId)
  }

  async getFeatureCosts(
    feature: string,
    startDate: Date,
    endDate: Date
  ): Promise<{ totalCost: number; requestCount: number }> {
    const result = await db.cost_attribution.aggregate({
      where: {
        feature,
        timestamp: { gte: startDate, lte: endDate }
      },
      _sum: { cost: true },
      _count: true
    })

    return {
      totalCost: result._sum.cost || 0,
      requestCount: result._count
    }
  }
}
```

Monitoring and Alerting
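The anomaly check below compares the current hour against an average; one simple way to maintain that baseline is an exponential moving average, so recent hours weigh more and old spikes fade out. A minimal sketch (the smoothing factor is an assumption to tune):

```typescript
class HourlyBaseline {
  private average: number | null = null

  // alpha controls how fast the baseline adapts; 0.2 is an arbitrary starting point.
  constructor(private alpha = 0.2) {}

  // Fold each completed hour's cost into the moving average.
  record(hourlyCost: number): void {
    this.average = this.average === null
      ? hourlyCost
      : this.alpha * hourlyCost + (1 - this.alpha) * this.average
  }

  // Flag the current hour if it exceeds the baseline by `factor`.
  isSpike(hourlyCost: number, factor = 3): boolean {
    return this.average !== null && hourlyCost > this.average * factor
  }
}
```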
Set up proactive monitoring:
```typescript
class CostMonitor {
  async checkDailyBudget() {
    const today = new Date()
    const costs = await this.getDailyCosts(today)
    const budget = await this.getDailyBudget()

    if (costs > budget * 0.8) {
      await this.sendAlert({
        severity: 'warning',
        message: `Daily costs at 80%: $${costs.toFixed(2)} / $${budget}`
      })
    }

    if (costs > budget) {
      await this.sendAlert({
        severity: 'critical',
        message: `Daily budget exceeded: $${costs.toFixed(2)} / $${budget}`
      })
      // Auto-disable expensive features
      await this.disableExpensiveFeatures()
    }
  }

  async detectAnomalies() {
    const currentHour = await this.getHourlyCosts()
    const avgHourly = await this.getAverageHourlyCosts()

    if (currentHour > avgHourly * 3) {
      await this.sendAlert({
        severity: 'warning',
        message: `Cost spike detected: $${currentHour.toFixed(2)} (3x normal)`
      })
    }
  }
}
```

Best Practices
1. Design for Cost from Day One
Don't add cost controls as an afterthought:
```typescript
// BAD: No cost controls
async function generateContent(prompt: string) {
  return await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    messages: [{ role: 'user', content: prompt }]
  })
}

// GOOD: Cost-aware from the start
async function generateContent(prompt: string, userId: string) {
  // Route by user tier
  const config = routeByTier(await getUserTier(userId), prompt)

  // Check budget against an up-front estimate
  const estimatedCost = estimateCost(prompt, config.model, config.maxTokens)
  await budgetEnforcer.checkBudget(userId, estimatedCost)

  // Check rate limit
  await rateLimiter.consume(userId)

  // Call with limits
  const response = await openai.chat.completions.create({
    model: config.model,
    max_tokens: config.maxTokens,
    messages: [{ role: 'user', content: prompt }]
  })

  // Track actual costs
  await costTracker.trackCost({
    userId,
    feature: 'content_generation',
    model: config.model,
    cost: calculateCost(response.usage),
    timestamp: new Date()
  })

  return response
}
```

2. Fail Gracefully
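The fallback path below hands rate-limited work to a `queueForLater` helper that this post never defines. A minimal in-memory sketch of the idea (a production system would use a durable queue such as Redis or SQS):

```typescript
interface QueuedRequest {
  prompt: string
  userId: string
  enqueuedAt: Date
}

// In-memory stand-in for a durable queue.
const pending: QueuedRequest[] = []

function queueForLater(prompt: string, userId: string): { queued: true; position: number } {
  pending.push({ prompt, userId, enqueuedAt: new Date() })
  // Tell the caller where they are in line instead of silently dropping the request.
  return { queued: true, position: pending.length }
}

// Drain in FIFO order once capacity or budget frees up; returns items processed.
function drainQueue(handler: (req: QueuedRequest) => void): number {
  let processed = 0
  while (pending.length > 0) {
    handler(pending.shift() as QueuedRequest)
    processed++
  }
  return processed
}
```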
When budgets are exceeded, degrade gracefully:
```typescript
async function handleLLMRequest(prompt: string, userId: string) {
  try {
    return await callPremiumLLM(prompt, userId)
  } catch (error) {
    if (error instanceof BudgetExceededError) {
      // Offer a cheaper alternative
      return await callCheapLLM(prompt)
    }
    if (error instanceof RateLimitError) {
      // Return a cached response, or queue for later
      return (await getCachedResponse(prompt)) || queueForLater(prompt, userId)
    }
    throw error
  }
}
```

3. Monitor Everything
```typescript
const costMetrics = {
  dailyCost: new Gauge('llm_daily_cost_dollars'),
  requestCost: new Histogram('llm_request_cost_dollars'),
  requestCount: new Counter('llm_request_total'),
  budgetUtilization: new Gauge('llm_budget_utilization_percent')
}

// Update metrics on every request
costMetrics.dailyCost.set(await getDailyCost())
costMetrics.requestCost.observe(requestCost)
costMetrics.requestCount.inc()
```

Real-World Example
Here's how we built a cost-aware chat feature:
```typescript
class CostAwareChat {
  async sendMessage(
    userId: string,
    message: string,
    conversationId: string
  ): Promise<string> {
    // 1. Get user tier and remaining budget
    const tier = await getUserTier(userId)
    const budget = await getRemainingBudget(userId)

    // 2. Estimate cost before calling
    const history = await getConversationHistory(conversationId)
    const estimatedCost = estimateChatCost(message, history, tier)

    // 3. Check budget
    if (estimatedCost > budget) {
      throw new BudgetExceededError('Upgrade to continue chatting')
    }

    // 4. Route by tier
    const model = tier === 'enterprise' ? 'gpt-4-turbo' : 'gpt-3.5-turbo'

    // 5. Serve from cache when possible
    const cachedResponse = await cache.get(message, history)
    if (cachedResponse) return cachedResponse

    // 6. Make the request
    const response = await openai.chat.completions.create({
      model,
      messages: [...history, { role: 'user', content: message }],
      max_tokens: tier === 'free' ? 500 : 2000
    })

    // 7. Track costs
    await trackCost(userId, 'chat', calculateCost(response.usage))

    return response.choices[0].message.content
  }
}
```

Key Takeaways
Building cost-aware AI features isn't optional—it's essential for sustainable growth.