How to Track OpenAI API Spend per Feature: A Cost-Attribution Playbook

OpenAI's dashboard shows total spend, not where it came from. Tag every request, attribute cost per feature and customer, set caps, alert on runaway spend.

Ashley Innocent

12 May 2026

Your OpenAI invoice says you spent $4,237 last month. It does not tell you that $3,100 of that came from one runaway summarization endpoint, $700 from a customer who is paying you $50 a month, and $437 from a feature that nobody uses. The dashboard hides the picture you need to make any pricing, capacity, or roadmap decision.

This guide shows you how to do OpenAI API cost attribution the right way: tag every request with metadata, aggregate spend per feature, route, and customer, set per-key budget caps, and design prompts so cost stops being a mystery line item. If you ship LLM features, this is the work that turns AI from a runaway expense into a managed product cost.

💡
Apidog gives you the request-level visibility and scenario testing you need to verify your cost-tracking wrapper works before it ships to production. Download Apidog and use it to replay tagged requests, assert log shape, and validate that every call carries the metadata your warehouse expects.

TL;DR

Tag every OpenAI API call with structured metadata (feature, route, customer_id, environment), emit a structured log line per request that captures token counts and computed cost, then aggregate by tag in your warehouse. Set per-key budget caps in the OpenAI dashboard, alert on hourly spend anomalies, and validate the wrapper end-to-end with Apidog scenario tests before you trust the numbers.

Introduction

You ship a new AI feature on Tuesday. By Friday morning, your CFO is in your DMs asking why the OpenAI line jumped 40 percent. You open the dashboard. It shows total spend climbing. It does not show which feature, which customer, or which route is responsible. You start guessing.

This is the gap every team running production LLM workloads hits. OpenAI’s billing surface is built for accounts payable, not for engineering attribution. You see daily totals, model breakdowns, and that’s it. You do not see the request shape, the customer who triggered it, or the feature that called the model.

The fix is simple in concept and unforgiving in execution. You wrap every API call with a metadata layer. You log every request to a structured store. You aggregate by tag. You set caps. You alert on deltas. By the end of this post you will have a concrete data model, two working code samples, a verification workflow with Apidog, and a tooling comparison so you can decide whether to build, buy, or use the OpenAI usage API directly.

For pricing context that drives your cost math, see the GPT-5.5 pricing breakdown. For a related billing-attribution problem on the developer-tools side, see GitHub Copilot usage billing for API teams. For the OpenAI API basics, see the official OpenAI API reference.

Why OpenAI’s billing dashboard isn’t enough

Open the OpenAI billing page right now. You will see a daily-spend chart, a model breakdown, and a usage limit. That is the entire surface. It works fine if you have one application, one customer, and one feature. The moment you have multiple features in the same product, multiple customers, multiple environments, or multiple developers, the dashboard stops being useful.

Here is what is missing.

Total spend without context. The dashboard tells you that you spent $312 yesterday on GPT-5.5. It does not tell you whether that came from one customer hitting your support-chat endpoint 50,000 times or from a background job that re-summarized your entire knowledge base because someone left a flag flipped. Both look identical on the chart.

No per-feature breakdown. OpenAI tags requests by API key and model. It does not tag them by your feature, route, customer ID, or environment. If you want any of that, you build it yourself. There is no native dimension for product analytics in the dashboard.

Reporting lag. Usage data shows up with a delay measured in tens of minutes to a few hours. By the time a runaway loop appears in your dashboard, it has already burned a credit card. You need real-time tracking, not a historical chart.

No alert primitives. OpenAI gives you one budget cap per organization and one soft notification email. There is no per-key alert, no per-feature threshold, no anomaly detection. If you want “page me if the chat endpoint exceeds $50 in an hour,” you build it yourself.

No customer attribution. If you sell B2B SaaS with an AI feature, you need to know which customer drove which spend so you can price, throttle, or upsell. The dashboard cannot answer “What is customer X costing me this month?” Your finance team needs that number to compute gross margin per customer. Without it, your unit economics are guesswork.

Service accounts and project-level keys help, but only partially. OpenAI’s project keys let you split usage by project. That gets you one level of attribution. It does not get you per-feature, per-customer, or per-route. You still need application-level metadata to answer the questions that matter. The OpenAI usage API returns aggregated data per project, not per request.

The pattern repeats across every team that ships LLM features at scale. The Dev.to thread “OpenAI Tells You What You Spent. Not Where. So I Built a Dashboard” went viral because it named the problem out loud: you cannot manage what you cannot measure. The native dashboard answers a finance question. You need to answer a product question.

The cost-attribution data model

Cost attribution starts with one decision: every OpenAI request gets a tagged event written to your warehouse. That event is your unit of analysis. Get the schema right and the rest of the work (dashboards, alerts, caps) becomes a SQL query.

Here is the minimum schema you need.

| Column | Type | Example | Why it matters |
| --- | --- | --- | --- |
| request_id | uuid | 7a91... | Idempotency, deduplication, retries |
| timestamp | timestamptz | 2026-05-06T14:23:01Z | Time-series queries, anomaly detection |
| feature | text | support-chat | The product surface that triggered the call |
| route | text | /api/v1/chat/answer | The HTTP route or background job ID |
| customer_id | text | cust_4291 | Per-customer spend, gross margin |
| environment | text | prod, staging, dev | Keep dev cost out of customer attribution |
| model | text | gpt-5.5, gpt-5.4-mini | Pricing differs per model |
| prompt_tokens | int | 15234 | Input token count from the response |
| completion_tokens | int | 812 | Output token count from the response |
| reasoning_tokens | int | 4500 | Reasoning tokens, billed at the output rate |
| cached_tokens | int | 12000 | Prompt-cache hits, billed at 50 percent |
| latency_ms | int | 2341 | For correlating cost with user experience |
| cost_usd | numeric(10,6) | 0.045672 | Computed at write time from token counts |
| prompt_cache_key | text | system-v3 | Track cache hit rates per feature |
| error_code | text | null, 429 | So you do not double-count retries |

Compute cost at write time, not at query time. Pricing changes; you want a frozen number that reflects the rate you paid the day the request happened. The compute logic for GPT-5.5 looks like this:

PRICING = {  # USD per 1M tokens, as of May 2026
    "gpt-5.5":      {"input": 5.00,  "cached": 2.50,  "output": 30.00},
    "gpt-5.5-pro":  {"input": 30.00, "cached": 15.00, "output": 180.00},
    "gpt-5.4":      {"input": 2.50,  "cached": 1.25,  "output": 15.00},
    "gpt-5.4-mini": {"input": 0.25,  "cached": 0.125, "output": 2.00},
}

def compute_cost_usd(model, prompt_tokens, cached_tokens, completion_tokens, reasoning_tokens):
    rates = PRICING[model]
    uncached = max(0, prompt_tokens - cached_tokens)
    input_cost  = (uncached      * rates["input"])  / 1_000_000
    cache_cost  = (cached_tokens * rates["cached"]) / 1_000_000
    output_cost = ((completion_tokens + reasoning_tokens) * rates["output"]) / 1_000_000
    return round(input_cost + cache_cost + output_cost, 6)
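A quick sanity check, using assumed round numbers, confirms the cache handling: a 10,000-token prompt with an 8,000-token cache hit on gpt-5.4-mini bills 2,000 tokens at the input rate, 8,000 at the cached rate, and 500 completion tokens at the output rate.

# 2,000 uncached * $0.25/M + 8,000 cached * $0.125/M + 500 output * $2.00/M
assert compute_cost_usd("gpt-5.4-mini", 10_000, 8_000, 500, 0) == 0.0025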

Reasoning tokens count as output. The OpenAI API returns them in usage.completion_tokens_details.reasoning_tokens, but they are billed at the output rate. Miss this and you under-count cost on every Thinking-mode call. For the full pricing math see the GPT-5.5 pricing breakdown.

Now wrap the OpenAI client. Every call goes through one function. That function takes metadata, makes the request, and writes the event.

import time, uuid, json, logging
from openai import OpenAI

client = OpenAI()
logger = logging.getLogger("llm.cost")

def call_with_attribution(
    *, feature, route, customer_id, environment,
    model, messages, **openai_kwargs
):
    request_id = str(uuid.uuid4())
    started = time.time()
    error_code = None
    response = None

    try:
        response = client.chat.completions.create(
            model=model, messages=messages, **openai_kwargs
        )
    except Exception as e:
        error_code = getattr(e, "code", "unknown_error")
        raise
    finally:
        latency_ms = int((time.time() - started) * 1000)
        u = response.usage if response else None
        prompt_tokens     = getattr(u, "prompt_tokens", 0)
        completion_tokens = getattr(u, "completion_tokens", 0)
        cached_tokens     = getattr(getattr(u, "prompt_tokens_details", None), "cached_tokens", 0) or 0
        reasoning_tokens  = getattr(getattr(u, "completion_tokens_details", None), "reasoning_tokens", 0) or 0
        cost_usd = compute_cost_usd(model, prompt_tokens, cached_tokens, completion_tokens, reasoning_tokens)

        logger.info(json.dumps({
            "event": "openai.request",
            "request_id": request_id,
            "feature": feature,
            "route": route,
            "customer_id": customer_id,
            "environment": environment,
            "model": model,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "reasoning_tokens": reasoning_tokens,
            "cached_tokens": cached_tokens,
            "latency_ms": latency_ms,
            "cost_usd": cost_usd,
            "error_code": error_code,
        }))

    return response

That single wrapper is your cost-attribution surface. Every feature in your product calls it. The structured log line is your warehouse input. From here, ship the logs to BigQuery, ClickHouse, Snowflake, or Postgres via your existing log pipeline (Vector, Fluent Bit, Logstash, OTLP collector). No second pipeline. No extra service.

For Node.js teams, the shape is identical. Wrap the OpenAI SDK in a function that takes metadata, captures response.usage, computes cost, and writes a JSON line. If your platform already runs an event bus (Kafka, NATS, Pub/Sub), publish the event there instead of stdout and let downstream consumers fan it out to your warehouse and your alerting system.

Wire up cost tracking and test it with Apidog

You have the schema and the wrapper. Now turn it into something operational. Six steps.

1. Replace direct OpenAI calls with the wrapper. Grep your codebase for OpenAI( and client.chat.completions.create. Every hit becomes a call_with_attribution(...) call. Make feature and route mandatory arguments. Pass them at the call site, not from a global. If you forget to pass them, the function should raise, not default to “unknown.”

2. Emit structured logs. Log to stdout as JSON, one line per event. Set the logger level to INFO for these events specifically. Do not interleave them with debug noise. If you already use a structured logger (structlog, pino, winston), wire it into that.
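A minimal setup, assuming you ship stdout to your existing log pipeline; the logger name matches the llm.cost logger the wrapper uses:

import logging, sys

# Dedicated logger for cost events: JSON lines on stdout, INFO level.
# propagate=False keeps app-level debug noise out of the event stream.
logger = logging.getLogger("llm.cost")
logger.setLevel(logging.INFO)
logger.propagate = False

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))  # message is already JSON
logger.addHandler(handler)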

3. Aggregate per feature in your warehouse. Once events land in BigQuery or ClickHouse, the queries write themselves:

SELECT
  feature,
  DATE_TRUNC(timestamp, DAY) AS day,
  COUNT(*) AS requests,
  SUM(cost_usd) AS spend_usd,
  SUM(prompt_tokens + completion_tokens) AS tokens,
  AVG(latency_ms) AS avg_latency_ms,
  SUM(cached_tokens) / NULLIF(SUM(prompt_tokens), 0) AS cache_hit_rate
FROM openai_events
WHERE environment = 'prod'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY feature, day
ORDER BY day DESC, spend_usd DESC;

4. Chart spend per route. Point Grafana, Metabase, Looker, or Superset at the table. Build three views: spend per feature over time, spend per customer over time, and a top-20 route table sorted by yesterday’s spend. That is your daily ops dashboard.

5. Test the wrapper with Apidog before you ship it. This is the step most teams skip and regret. Your wrapper writes structured logs. If the schema is wrong, your warehouse is silently wrong, and the dashboards lie. Use Apidog to drive end-to-end tests against your service: replay tagged requests against each AI endpoint, assert the shape of the emitted log payload, and confirm that every call carries the metadata your warehouse expects.

For broader API testing approaches that fit this verification step, see API testing tools for QA engineers. For the contract-first approach that pairs with cost-attribution coverage, see contract-first API development.

6. Set per-key budget caps and alerts. OpenAI lets you create one project key per environment or per feature. Use it. Create a prod-support-chat key, a prod-summarization key, a staging-all key. Set hard caps in the OpenAI dashboard so a runaway loop on one feature cannot drain the entire org budget. Layer your own alerting on top: a SQL query that runs every 10 minutes and pages you if any feature exceeds 3x its 7-day rolling-average hourly spend. PagerDuty, Opsgenie, or a Slack webhook all work; the trigger comes from your warehouse, not from OpenAI.
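Here is a sketch of that warehouse-driven alert, assuming a Postgres-compatible warehouse holding the openai_events table from the schema above and a Slack incoming webhook (the URL is a placeholder). Adapt the SQL dialect to your warehouse:

import psycopg2
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder webhook URL

# Features whose last-hour spend exceeds 3x their 7-day average hourly spend.
# Run on a 10-minute schedule (cron, Airflow, a scheduled function).
ALERT_SQL = """
WITH hourly AS (
  SELECT feature, SUM(cost_usd) AS last_hour
  FROM openai_events
  WHERE environment = 'prod' AND timestamp >= now() - interval '1 hour'
  GROUP BY feature
),
baseline AS (
  SELECT feature, SUM(cost_usd) / (7 * 24) AS avg_hourly
  FROM openai_events
  WHERE environment = 'prod' AND timestamp >= now() - interval '7 days'
  GROUP BY feature
)
SELECT h.feature, h.last_hour, b.avg_hourly
FROM hourly h JOIN baseline b USING (feature)
WHERE h.last_hour > 3 * b.avg_hourly
"""

def check_spend_anomalies(conn):
    with conn.cursor() as cur:
        cur.execute(ALERT_SQL)
        for feature, last_hour, avg_hourly in cur.fetchall():
            requests.post(SLACK_WEBHOOK, json={
                "text": f"Spend alert: {feature} burned ${last_hour:.2f} in the "
                        f"last hour (7-day hourly average ${avg_hourly:.4f})."
            })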

The combination of native key caps plus warehouse-driven alerts gives you two layers of defense. The native caps are a backstop against catastrophic burn. The warehouse alerts catch the slow drift before the cap fires.

Advanced techniques and pro tips

Once the basic pipeline is running, the optimizations follow.

Prompt caching. GPT-5.5 charges 50 percent of the input rate for cached tokens. Structure your system prompt as a stable prefix and put per-request variables at the end. Cache hit rates above 70 percent on chat workloads are normal once you do this. Track cache_hit_rate per feature in your dashboard so you can see when a prompt change tanks your hit rate. The official OpenAI prompt caching docs cover the eligibility rules.
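The prefix discipline is easy to enforce in code. A minimal sketch, assuming your system prompt is the stable part and per-request context goes last (the prompt text and version tag here are illustrative):

# Stable prefix: byte-identical on every request, so it is cache-eligible.
SYSTEM_PROMPT = """You are the support assistant for Acme.
[several thousand tokens of policies, tone rules, and worked examples]
"""  # version this prompt (e.g. system-v3) and log it as prompt_cache_key

def build_messages(account_context: str, user_question: str):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # stable, cacheable prefix
        # Volatile, per-request content goes last so it never breaks the prefix.
        {"role": "user", "content": f"{account_context}\n\n{user_question}"},
    ]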

Batch API for offline work. Anything that does not need a synchronous response goes through the Batch API for a 50 percent discount. Nightly summarization, eval runs, embedding backfills, document re-processing. The cost wrapper still applies; call the Batch endpoint and tag the events with batch_job_id so you can attribute them back to the source workload.
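A sketch of the submission side, assuming the official Python SDK's Batch API. Batch-level metadata carries the attribution tags, and each line's custom_id links a result back to an event row:

import json, uuid
from openai import OpenAI

client = OpenAI()

def submit_tagged_batch(jobs, feature, environment):
    # One JSONL line per request; custom_id ties each result to an event row.
    lines = [json.dumps({
        "custom_id": str(uuid.uuid4()),
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": "gpt-5.4-mini", "messages": job["messages"]},
    }) for job in jobs]

    with open("batch.jsonl", "w") as f:
        f.write("\n".join(lines))

    batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
        metadata={"feature": feature, "environment": environment},  # attribution tags
    )

When the results file lands, write one event per output line and carry the batch ID through as batch_job_id so offline spend attributes back to its source workload.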

Reasoning effort tuning. GPT-5.5 Thinking is the same model at a higher reasoning-effort setting. Each effort level multiplies output tokens. Audit your features: are you running medium where low would pass quality bars? Run an A/B test, track quality and cost, and ship the cheaper option if quality holds. For deeper math see how to use the GPT-5.5 API.
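Because the wrapper forwards extra kwargs to the SDK, an effort A/B is a one-line change per arm. A sketch, assuming reasoning_effort is the parameter name your model and SDK version accept:

def answer_with_effort(customer_id, messages, effort):
    # Tag the arm in `feature` so the warehouse splits cost and quality by effort.
    return call_with_attribution(
        feature=f"support-chat-effort-{effort}",
        route="/api/v1/chat/answer",
        customer_id=customer_id,
        environment="prod",
        model="gpt-5.5",
        messages=messages,
        reasoning_effort=effort,  # assumed kwarg, forwarded via **openai_kwargs
    )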

Context-window discipline. Long prompts are expensive. RAG with a tight retrieval budget beats stuffing the whole knowledge base into the context window. Track average prompt_tokens per feature; if it climbs week over week without a feature change, your prompt is bloating.

Watch the GPT-5.5 272K-token cliff. OpenAI applies a 2x input multiplier and 1.5x output multiplier on requests above 272K tokens. Add a guard in your wrapper that logs a warning when prompt_tokens > 250000 so you can catch prompts that are about to hit the cliff. For pricing details see the GPT-5.5 pricing post.
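The guard is a few lines in the wrapper's finally block, right after the usage fields are read; 250,000 leaves headroom below the threshold:

# Inside call_with_attribution's finally block, after prompt_tokens is read:
TOKEN_CLIFF_WARN = 250_000  # warn before the 272K multiplier kicks in

if prompt_tokens > TOKEN_CLIFF_WARN:
    logger.warning(json.dumps({
        "event": "openai.token_cliff_warning",
        "request_id": request_id,
        "feature": feature,
        "prompt_tokens": prompt_tokens,
    }))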

Per-customer spend caps. If you sell B2B and your contract includes an LLM-backed feature, you need a per-customer cap. Compute rolling spend per customer_id from your warehouse and have your application check it before each call. If the cap is hit, return a 429 with a “monthly AI quota exceeded” message and a billing CTA. This is what turns AI features from a margin risk into a margin product.
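A sketch of the enforcement check, assuming a FastAPI service and the openai_events table from earlier; the cap value would come from the customer's plan:

from fastapi import HTTPException  # assumed framework; any HTTP stack works

def enforce_customer_cap(conn, customer_id: str, monthly_cap_usd: float):
    # Month-to-date spend for this customer, from the same events table.
    with conn.cursor() as cur:
        cur.execute(
            """SELECT COALESCE(SUM(cost_usd), 0) FROM openai_events
               WHERE customer_id = %s AND environment = 'prod'
                 AND timestamp >= date_trunc('month', now())""",
            (customer_id,),
        )
        spent = cur.fetchone()[0]
    if spent >= monthly_cap_usd:
        raise HTTPException(
            status_code=429,
            detail="Monthly AI quota exceeded. Upgrade your plan to continue.",
        )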

Common mistakes to avoid: defaulting missing tags to “unknown” instead of raising at the call site, computing cost at query time from a live rate table instead of freezing it at write time, forgetting that reasoning tokens bill at the output rate, logging application-layer retries as new events instead of reusing request_id, letting staging and dev traffic pollute per-customer attribution, and sampling requests to cut log volume, which breaks the per-customer and per-route math.

Alternatives and tooling

You do not have to build this yourself. Here is the honest comparison.

| Approach | What it does well | What it costs | When to use |
| --- | --- | --- | --- |
| OpenAI usage API | Native, no setup, accurate to the cent | Free | One project, one feature, no per-customer attribution |
| Helicone | Drop-in proxy, dashboards, caching, per-user costs | Free tier; paid from $20/mo | Want a hosted dashboard fast, OK with proxy in the path |
| Langfuse | Open source, self-host or cloud, traces + cost | Free self-hosted; cloud from $29/mo | Want traces and cost in one tool, prefer open source |
| LangSmith | Tight LangChain integration, eval + cost | Paid from $39/user/mo | Already on LangChain, want one vendor |
| Custom warehouse | Full control, fits your existing stack, no proxy | Engineering time | Big workload, custom dimensions, strict data residency |

Tradeoffs to keep in mind. A proxy (Helicone) puts a hop in your critical path; the latency cost is small but real, and you take a dependency on their availability. A self-hosted observability stack (Langfuse) gives you full control but you operate it. The custom-warehouse path is what most large teams end up at; it integrates with the rest of your data stack, but you write the queries and the alerts yourself. The native usage API is fine if your needs are simple, useless once they are not.

For a deeper read on what good LLM cost observability looks like in practice, the Helicone team’s guide on tracking LLM costs walks through the proxy-based approach. The Langfuse documentation on cost tracking covers the open-source path.

If you operate this at platform scale, the patterns generalize. See API platforms for microservices architecture for how cost-attribution wrappers fit into a service-mesh strategy.

Real-world use cases

B2B SaaS with per-customer LLM spend. A company sells a sales-intelligence product. Each customer triggers GPT-5.5 calls when they request a brief. Without attribution, the company knows it spends $80,000 a month on OpenAI. With per-customer attribution, it learns that 12 percent of customers drive 71 percent of spend. It introduces tiered pricing, soft quotas on the lowest tier, and a per-seat overage charge. Gross margin on the AI feature moves from 41 percent to 73 percent in one quarter.

Internal developer tool tracking. An engineering org gives every developer access to a private GPT-5.5 chat assistant. With per-developer tags (customer_id becomes dev_email), platform engineering sees that three developers account for 50 percent of internal spend. Two of them are running automated agent loops they forgot to turn off. Switching them off saves $1,800 a month. The third is doing legitimate work; the data justifies a higher org-wide quota for them.

AI feature spend forecasting. A product team wants to ship a new summarization feature. The PM does not know how to forecast cost. With historical per-feature data, the team builds a model: average prompt tokens per call, average output tokens, expected calls per active user, expected active users. The forecast comes back: $0.04 per active user per day, or $1.20 per month. The pricing team prices the feature at $5 per user per month. Finance signs off because the unit economics are visible.
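The forecast math fits in a few lines. A sketch with illustrative averages (substitute your own per-feature numbers from the warehouse; the rates are the gpt-5.4 row from the pricing table above):

PRICE = {"input": 2.50, "output": 15.00}  # gpt-5.4, USD per 1M tokens

# Illustrative per-feature averages pulled from historical events.
avg_prompt_tokens = 3_000
avg_output_tokens = 400
calls_per_active_user_per_day = 3

cost_per_call = (avg_prompt_tokens * PRICE["input"]
                 + avg_output_tokens * PRICE["output"]) / 1_000_000
per_user_day = cost_per_call * calls_per_active_user_per_day
per_user_month = per_user_day * 30
# -> roughly $0.04 per active user per day, about $1.20 per month, with these inputs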

Conclusion

You cannot manage what you cannot measure. OpenAI’s billing dashboard answers a finance question. Per-feature, per-customer, per-route attribution answers the product question. Build the wrapper, log the event, query the warehouse, and your AI line stops being a mystery.

Five takeaways:

1. Tag every OpenAI request with feature, route, customer_id, and environment, and make those tags mandatory at the call site.
2. Compute cost at write time from a pinned per-model rate table, and count reasoning tokens at the output rate.
3. Aggregate in your warehouse so spend per feature, per customer, and per route is one SQL query away.
4. Layer your defenses: per-key budget caps in the OpenAI dashboard plus warehouse-driven anomaly alerts.
5. Verify the wrapper end-to-end with Apidog scenario tests before you trust the numbers.

Download Apidog and use it to verify your cost-attribution wrapper end-to-end. Drive your AI endpoints with tagged requests, assert the log payload shape, and replay scenarios across environments so the data your warehouse trusts is the data your engineers wrote.

For related cost-management reading, see the GPT-5.5 pricing breakdown and GitHub Copilot usage billing for API teams.

FAQ

Do reasoning tokens count as input or output for billing?

Reasoning tokens are billed at the output rate. The OpenAI API returns them under usage.completion_tokens_details.reasoning_tokens. Add them to completion_tokens when you compute cost. For the per-effort multipliers, see the GPT-5.5 pricing breakdown.

How accurate is response.usage compared to the OpenAI dashboard?

Token counts in response.usage match the dashboard to the token. Pricing changes can cause small drift if you compute cost from a stale rate table; pin the rate per model and update it the day OpenAI ships a price change.

Can I do attribution with OpenAI project keys alone?

Project keys give you one dimension of attribution: per project. They do not give you per feature, per customer, or per route. Use project keys for environment splits and budget caps; use application-level metadata for everything else.

What about retries and rate-limit errors? Do they double-count cost?

A request that fails before the model runs (4xx, network error before completion) does not return a usage object, so no cost is logged. A request that succeeds and is then retried at the application layer will be logged twice unless you reuse request_id. Idempotent retries should pass the same request_id and your wrapper should dedupe on write.

How fast does the OpenAI usage API return data?

The usage API has a lag of tens of minutes. For real-time decisions (alerts, kill-switches), use your own warehouse. For monthly reconciliation, the usage API is fine and matches your billing invoice.

Should I sample requests to reduce log volume?

No. The data volume is small (one JSON line per request), and sampling breaks per-customer and per-route attribution. Log every request.

Can I use this approach for other LLM providers?

Yes. The schema generalizes. Add a provider column (openai, anthropic, google, deepseek) and one pricing table per provider. The wrapper is provider-specific; the warehouse and dashboards are not. For a comparison point, see DeepSeek V4 API pricing.

Does this work for embeddings and image-generation calls too?

Yes, with provider-specific cost math. Embeddings are billed per input token at a flat rate; images are billed per image at a per-resolution rate. Add endpoint (e.g. chat, embeddings, image) to the schema and branch your cost computation on it.

