Best Practices

Building Cost-Aware AI Features: Best Practices from Production

Learn how to design AI features with cost in mind from day one. Includes architecture patterns, code examples, and real-world case studies.

ReForge Team

Engineering

6 min read

Building AI features without considering costs is a recipe for disaster. We've seen teams ship features that work great functionally but bankrupt the company financially. Don't let that be you.

In this guide, we'll show you how to design cost-aware AI features from day one, with patterns and practices from production systems handling millions of requests.

Cost-Aware Architecture Principles

Principle 1: Tier-Based Routing

Different user tiers should get different service levels:

enum UserTier {
  FREE = 'free',
  PRO = 'pro',
  ENTERPRISE = 'enterprise'
}

function routeByTier(tier: UserTier, prompt: string) {
  switch (tier) {
    case UserTier.FREE:
      return {
        model: 'gpt-3.5-turbo',
        maxTokens: 500,
        priority: 'low'
      }

    case UserTier.PRO:
      return {
        model: 'gpt-4-turbo',
        maxTokens: 2000,
        priority: 'normal'
      }

    case UserTier.ENTERPRISE:
      return {
        model: 'gpt-4-turbo',
        maxTokens: 4000,
        priority: 'high',
        sla: '99.9%'
      }
  }
}
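A useful sanity check on these tiers is the worst-case cost ceiling per request: the routed model's output price times its token budget. A minimal sketch (the per-1K-token prices below are illustrative placeholders, not current provider rates):

```typescript
// Hypothetical per-1K-output-token prices in USD -- illustrative only.
const OUTPUT_PRICE_PER_1K: Record<string, number> = {
  'gpt-3.5-turbo': 0.0015,
  'gpt-4-turbo': 0.03,
}

// Worst-case output cost if the model uses its full token budget.
function costCeiling(model: string, maxTokens: number): number {
  const price = OUTPUT_PRICE_PER_1K[model]
  if (price === undefined) throw new Error(`Unknown model: ${model}`)
  return (maxTokens / 1000) * price
}

// Free tier (500 tokens of gpt-3.5-turbo) vs. enterprise (4000 of gpt-4-turbo)
const freeCeiling = costCeiling('gpt-3.5-turbo', 500)
const enterpriseCeiling = costCeiling('gpt-4-turbo', 4000)
```

At these example prices, an enterprise request can cost over 100x a free-tier request, which is why the routing decision matters so much.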

Principle 2: Budget Enforcement

Set hard limits to prevent cost overruns:

class BudgetEnforcer {
  async checkBudget(userId: string, estimatedCost: number): Promise<boolean> {
    const usage = await this.getMonthlyUsage(userId)
    const limit = await this.getUserBudgetLimit(userId)

    if (usage.totalCost + estimatedCost > limit) {
      throw new BudgetExceededError(
        `Monthly budget exceeded: $${usage.totalCost.toFixed(2)} / $${limit}`
      )
    }

    return true
  }

  async trackUsage(userId: string, cost: number) {
    await db.llm_usage.create({
      user_id: userId,
      cost,
      timestamp: new Date()
    })

    // Alert user at 80% usage
    const usage = await this.getMonthlyUsage(userId)
    const limit = await this.getUserBudgetLimit(userId)

    if (usage.totalCost > limit * 0.8) {
      await this.sendBudgetAlert(userId, usage.totalCost, limit)
    }
  }
}
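The enforcer above leans on a `getMonthlyUsage` helper. Here's a minimal in-memory sketch of that aggregation; a real version would run an aggregate query against the `llm_usage` table, and the record shape here is an assumption:

```typescript
interface UsageRecord { userId: string; cost: number; timestamp: Date }

// In-memory stand-in for the llm_usage table.
const usageLog: UsageRecord[] = []

function getMonthlyUsage(userId: string, now: Date = new Date()) {
  const monthStart = new Date(now.getFullYear(), now.getMonth(), 1)
  const rows = usageLog.filter(
    r => r.userId === userId && r.timestamp >= monthStart
  )
  return {
    totalCost: rows.reduce((sum, r) => sum + r.cost, 0),
    requestCount: rows.length,
  }
}

// Usage: record two requests, then aggregate
usageLog.push({ userId: 'u1', cost: 0.02, timestamp: new Date() })
usageLog.push({ userId: 'u1', cost: 0.03, timestamp: new Date() })
```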

Principle 3: Rate Limiting

Prevent abuse and control costs:

import { RateLimiterMemory } from 'rate-limiter-flexible'

const limiter = new RateLimiterMemory({
  points: 100,      // 100 requests
  duration: 3600,   // per hour (seconds)
  blockDuration: 0  // no extra block once points are exhausted
})

async function rateLimitedLLMCall(userId: string, prompt: string) {
  try {
    await limiter.consume(userId, 1)
    return await callLLM(prompt)
  } catch (error) {
    throw new RateLimitError('Rate limit exceeded. Try again in an hour.')
  }
}
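If you'd rather avoid the dependency (or want something test-friendly for a single process), a fixed-window limiter takes only a few lines. This is a hedged sketch, not a drop-in replacement for `rate-limiter-flexible` -- it has no distributed state and resets counts per window:

```typescript
// Minimal fixed-window limiter: `points` requests per `durationMs` per user.
class SimpleRateLimiter {
  private counts = new Map<string, { windowStart: number; used: number }>()

  constructor(private points: number, private durationMs: number) {}

  tryConsume(userId: string, now: number = Date.now()): boolean {
    const entry = this.counts.get(userId)
    // No entry yet, or the window has elapsed: start a fresh window.
    if (!entry || now - entry.windowStart >= this.durationMs) {
      this.counts.set(userId, { windowStart: now, used: 1 })
      return true
    }
    if (entry.used >= this.points) return false
    entry.used += 1
    return true
  }
}

const memoryLimiter = new SimpleRateLimiter(100, 3600_000) // 100/hour
```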

Feature Flags for Cost Control

Use feature flags to control rollout and costs:

interface FeatureConfig {
  enabled: boolean
  rolloutPercentage: number
  maxCostPerDay: number
  fallbackBehavior: 'disable' | 'downgrade'
}

class FeatureManager {
  async shouldEnableFeature(
    featureId: string,
    userId: string
  ): Promise<boolean> {
    const config = await this.getFeatureConfig(featureId)

    // Feature disabled globally
    if (!config.enabled) return false

    // Check cost budget
    const todayCost = await this.getFeatureCost(featureId, 'today')
    if (todayCost > config.maxCostPerDay) {
      if (config.fallbackBehavior === 'disable') {
        return false
      }
      // Downgrade to cheaper alternative
      return this.useCheaperAlternative(featureId)
    }

    // Gradual rollout
    const userHash = this.hashUserId(userId)
    return userHash % 100 < config.rolloutPercentage
  }
}
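The rollout check depends on `hashUserId` returning a stable number per user, so each user lands in the same bucket on every request. One common choice is FNV-1a; a sketch (the helper names mirror the class above but are assumptions):

```typescript
// FNV-1a over the user id: fast, deterministic, well-distributed.
function hashUserId(userId: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // keep as unsigned 32-bit
  }
  return hash
}

// Same bucketing rule as shouldEnableFeature's gradual rollout.
function inRollout(userId: string, rolloutPercentage: number): boolean {
  return hashUserId(userId) % 100 < rolloutPercentage
}
```

Avoid `Math.random()` here: a user would flicker in and out of the feature between requests, and the daily cost of the feature would be harder to attribute.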

Cost Attribution

Track costs by feature, user, and tenant:

interface CostAttribution {
  userId: string
  feature: string
  model: string
  cost: number
  timestamp: Date
}

class CostTracker {
  async trackCost(attribution: CostAttribution) {
    await db.cost_attribution.create(attribution)

    // Real-time aggregation
    await this.updateDailyCosts(attribution.feature)
    await this.updateUserCosts(attribution.userId)
  }

  async getFeatureCosts(
    feature: string,
    startDate: Date,
    endDate: Date
  ): Promise<{ totalCost: number; requestCount: number }> {
    const result = await db.cost_attribution.aggregate({
      where: {
        feature,
        timestamp: { gte: startDate, lte: endDate }
      },
      _sum: { cost: true },
      _count: true
    })

    return {
      totalCost: result._sum.cost || 0,
      requestCount: result._count
    }
  }
}

Monitoring and Alerting

Set up proactive monitoring:

class CostMonitor {
  async checkDailyBudget() {
    const today = new Date()
    const costs = await this.getDailyCosts(today)
    const budget = await this.getDailyBudget()

    if (costs > budget * 0.8) {
      await this.sendAlert({
        severity: 'warning',
        message: `Daily costs at 80%: $${costs.toFixed(2)} / $${budget}`
      })
    }

    if (costs > budget) {
      await this.sendAlert({
        severity: 'critical',
        message: `Daily budget exceeded: $${costs.toFixed(2)} / $${budget}`
      })

      // Auto-disable expensive features
      await this.disableExpensiveFeatures()
    }
  }

  async detectAnomalies() {
    const currentHour = await this.getHourlyCosts()
    const avgHourly = await this.getAverageHourlyCosts()

    if (currentHour > avgHourly * 3) {
      await this.sendAlert({
        severity: 'warning',
        message: `Cost spike detected: $${currentHour.toFixed(2)} (3x normal)`
      })
    }
  }
}
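The 3x-average spike check is worth factoring into a small pure function, which makes the threshold easy to test and tune. A sketch under the same assumptions as `detectAnomalies` above:

```typescript
// Flags the current hour's cost if it exceeds `factor` times the
// average of the preceding hourly costs. Returns false with no history.
function isCostSpike(
  hourlyCosts: number[],
  current: number,
  factor = 3
): boolean {
  if (hourlyCosts.length === 0) return false
  const avg = hourlyCosts.reduce((a, b) => a + b, 0) / hourlyCosts.length
  return current > avg * factor
}
```

In practice you'd also want a minimum-history guard (say, 24 hours of data) so a single quiet hour doesn't make every normal hour look like a spike.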

Best Practices

1. Design for Cost from Day One

Don't add cost controls as an afterthought:

// BAD: No cost controls
async function generateContent(prompt: string) {
  return await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    messages: [{ role: 'user', content: prompt }]
  })
}

// GOOD: Cost-aware from the start
async function generateContent(prompt: string, userId: string) {
  // Route by user tier first, so we know which model to price
  const config = routeByTier(await getUserTier(userId), prompt)

  // Check budget against an estimate for this request
  const estimatedCost = estimateCost(prompt, config.model, config.maxTokens)
  await budgetEnforcer.checkBudget(userId, estimatedCost)

  // Check rate limit
  await rateLimiter.consume(userId)

  // Call with limits
  const response = await openai.chat.completions.create({
    model: config.model,
    max_tokens: config.maxTokens,
    messages: [{ role: 'user', content: prompt }]
  })

  // Track costs
  await costTracker.trackCost({
    userId,
    feature: 'content_generation',
    cost: calculateCost(response.usage)
  })

  return response
}
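The cost-aware version calls `calculateCost(response.usage)` without showing it. Here's a sketch of a token-based implementation, with the model passed explicitly; the per-1K-token prices are illustrative and should be replaced with your provider's current pricing:

```typescript
// Shape of the usage object returned by chat completion APIs.
interface Usage { prompt_tokens: number; completion_tokens: number }

// Illustrative prices only -- check your provider's pricing page.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4-turbo': { input: 0.01, output: 0.03 },
  'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
}

function calculateCost(model: string, usage: Usage): number {
  const p = PRICES[model]
  if (!p) throw new Error(`No pricing for model: ${model}`)
  return (
    (usage.prompt_tokens / 1000) * p.input +
    (usage.completion_tokens / 1000) * p.output
  )
}
```

Keeping the price table in config (rather than hardcoded) matters: providers change prices, and a stale table silently corrupts every downstream budget check.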

2. Fail Gracefully

When budgets are exceeded, degrade gracefully:

async function handleLLMRequest(prompt: string, userId: string) {
  try {
    return await callPremiumLLM(prompt, userId)
  } catch (error) {
    if (error instanceof BudgetExceededError) {
      // Offer cheaper alternative
      return await callCheapLLM(prompt)
    }
    if (error instanceof RateLimitError) {
      // Return cached response or queue for later
      return (await getCachedResponse(prompt)) ?? (await queueForLater(prompt, userId))
    }
    throw error
  }
}

3. Monitor Everything

const costMetrics = {
  dailyCost: new Gauge('llm_daily_cost_dollars'),
  requestCost: new Histogram('llm_request_cost_dollars'),
  requestCount: new Counter('llm_request_total'),
  budgetUtilization: new Gauge('llm_budget_utilization_percent')
}

// Update metrics on every request (metric updates themselves are synchronous)
costMetrics.dailyCost.set(await getDailyCost())
costMetrics.requestCost.observe(requestCost)
costMetrics.requestCount.inc()

Real-World Example

Here's how we built a cost-aware chat feature:

class CostAwareChat {
  async sendMessage(
    userId: string,
    message: string,
    conversationId: string
  ): Promise<string> {
    // 1. Get user tier and budget
    const tier = await getUserTier(userId)
    const budget = await getRemainingBudget(userId)

    // 2. Estimate cost before calling
    const history = await getConversationHistory(conversationId)
    const estimatedCost = estimateChatCost(message, history, tier)

    // 3. Check budget
    if (estimatedCost > budget) {
      throw new BudgetExceededError('Upgrade to continue chatting')
    }

    // 4. Route by tier
    const model = tier === 'free' ? 'gpt-3.5-turbo' : 'gpt-4-turbo'

    // 5. Call with caching
    const cachedResponse = await cache.get(message, history)
    if (cachedResponse) return cachedResponse

    // 6. Make request
    const response = await openai.chat.completions.create({
      model,
      messages: [...history, { role: 'user', content: message }],
      max_tokens: tier === 'free' ? 500 : 2000
    })

    // 7. Track costs
    await trackCost(userId, 'chat', calculateCost(response.usage))

    return response.choices[0].message.content
  }
}
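The `estimateChatCost` call in step 2 can be approximated with the rough ~4-characters-per-token heuristic; a real implementation would use a proper tokenizer such as tiktoken. A simplified sketch (it takes a price parameter instead of a tier, to keep it self-contained):

```typescript
interface ChatMessage { role: string; content: string }

// Rough token estimate: ~4 characters per token. A heuristic, not exact.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Estimated input cost for a new message plus conversation history,
// at an illustrative per-1K-input-token price.
function estimateChatCost(
  message: string,
  history: ChatMessage[],
  pricePer1kInput = 0.01
): number {
  const historyTokens = history.reduce(
    (sum, m) => sum + estimateTokens(m.content), 0
  )
  const tokens = historyTokens + estimateTokens(message)
  return (tokens / 1000) * pricePer1kInput
}
```

Note that the history term grows with every turn, which is exactly why long conversations get expensive: each new message pays for the entire transcript again.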

Key Takeaways

  • **Design for cost from day one** - Don't add controls as an afterthought
  • **Use tier-based routing** - Different users get different service levels
  • **Enforce budgets** - Prevent cost overruns with hard limits
  • **Track everything** - Cost attribution by user, feature, and tenant
  • **Monitor proactively** - Alert before budgets are exceeded
  • **Fail gracefully** - Degrade service instead of failing completely

Building cost-aware AI features isn't optional; it's essential for sustainable growth.

Tags: architecture, cost-awareness, design-patterns, production
