Expose per-turn token usage and cost in the after_llm_call hook payload #2948

@kimizuka

Overview

Add per-turn token usage and cost to the after_llm_call hook payload. Today the payload carries only:

&hooks.Input{
    SessionID:       sess.ID,
    AgentName:       a.Name(),
    StopResponse:    responseContent,
    LastUserMessage: sess.GetLastUserMessageContent(),
}

This proposal adds the following fields without changing any existing ones:

  • usage — the per-call chat.Usage (input / cached input / cache write / output tokens), nested to avoid colliding with existing flat token fields
  • cost — the per-call cost in USD
  • model_id — the model actually used for this call (edit: already landed in fix(runtime): populate ModelID in after_llm_call hook payload #2911 — remaining scope is usage + cost only)
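To make the shape concrete, here is a minimal sketch of what the widened payload could look like on the wire. The `Payload` and `Usage` types, their field names, and every JSON key are hypothetical stand-ins chosen for illustration; they are not the real `hooks.Input` or `chat.Usage` definitions, and the type of `cost` is still one of the open questions below.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Usage holds the four per-call token counts the proposal lists.
// Field and JSON key names are hypothetical, not the real chat.Usage.
type Usage struct {
	InputTokens       int64 `json:"input_tokens"`
	CachedInputTokens int64 `json:"cached_input_tokens"`
	CacheWriteTokens  int64 `json:"cache_write_tokens"`
	OutputTokens      int64 `json:"output_tokens"`
}

// Payload is a hypothetical stand-in for the widened hooks.Input.
type Payload struct {
	SessionID string  `json:"session_id"`
	AgentName string  `json:"agent_name"`
	ModelID   string  `json:"model_id"`
	Usage     *Usage  `json:"usage,omitempty"` // nested: its keys stay out of the top-level namespace
	Cost      float64 `json:"cost"`            // USD for this call; final type is an open question
}

// encode serializes a payload the way a hook handler might receive it.
func encode(p Payload) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decode parses a payload on the handler side.
func decode(s string) (Payload, error) {
	var p Payload
	err := json.Unmarshal([]byte(s), &p)
	return p, err
}

func main() {
	out, err := encode(Payload{
		SessionID: "sess-1",
		AgentName: "root",
		ModelID:   "example-model",
		Usage:     &Usage{InputTokens: 1200, CachedInputTokens: 800, OutputTokens: 150},
		Cost:      0.0042,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Because the token counts live under a single `usage` key, a handler that only understands the existing flat fields can ignore the new object entirely.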

Motivation

There is currently no supported way to observe per-LLM-call usage or cost from outside the core. The only place spend is materialized is the session store, and the session store only records a sub-session when it completes successfully.

When a sub-session (delegated agent / skill / background agent) fails, its turns are dropped from persistence:

  • In pkg/runtime/agent_delegation.go, runForwarding returns early on the first ErrorEvent (~line 289), so parent.AddSubSession(s) and SubSessionCompletedEvent are never reached.
  • runCollecting has the same shape (returns on errMsg != "" before AddSubSession).
  • The persistence observer only writes a sub-session in response to SubSessionCompletedEvent, so a failed sub-session's turns never reach the DB.

This is reasonable for the session DB (a session is resumable conversation state, and a failed sub-session isn't part of it). But the API calls were already billed, so the cost the session reports can be far below the actual provider invoice.

Concretely: in a real run against Vertex AI with several failing sub-sessions, the actual invoice was ~9× the cost shown by the session. Each failed sub-session made multiple billed model calls that never appeared in any total. This makes spend impossible to reconcile and is a real safety concern when running against a metered provider.

The root cause is that cost recording is coupled to session lifecycle (SubSessionCompletedEvent). The cleanest place to expose spend is the layer where it actually happens — once per LLM call. executeAfterLLMCallHooks already fires inside the shared turn loop (pkg/runtime/loop.go:572) on every successful model call, for sub-sessions too, and before the sub-session failure handling in agent_delegation.go. A per-turn payload with usage and cost therefore captures failed sub-session spend automatically, with no coupling to session completion.
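The gap between the two accounting styles can be illustrated with a toy model. This is a sketch only: `subSession`, `completedTotal`, and `perCallTotal` are hypothetical names that do not exist in the runtime, and the costs are made-up integers in micro-USD so the totals compare exactly.

```go
package main

import "fmt"

// subSession is a toy stand-in for a delegated run. Costs are in
// micro-USD (integers) so that totals compare exactly.
type subSession struct {
	callCosts []int64 // one entry per billed model call
	failed    bool    // true if the run ended on an error
}

// completedTotal mimics completion-coupled accounting: a sub-session
// contributes only if it finished successfully.
func completedTotal(subs []subSession) int64 {
	var total int64
	for _, s := range subs {
		if s.failed {
			continue // never recorded, although its calls were billed
		}
		for _, c := range s.callCosts {
			total += c
		}
	}
	return total
}

// perCallTotal mimics a per-call hook: every billed call is counted
// when it happens, regardless of how the sub-session ends.
func perCallTotal(subs []subSession) int64 {
	var total int64
	for _, s := range subs {
		for _, c := range s.callCosts {
			total += c
		}
	}
	return total
}

func main() {
	subs := []subSession{
		{callCosts: []int64{1000, 2000}},
		{callCosts: []int64{4000, 5000}, failed: true},
		{callCosts: []int64{6000}, failed: true},
	}
	fmt.Println("session view (micro-USD):", completedTotal(subs))
	fmt.Println("per-call view (micro-USD):", perCallTotal(subs))
}
```

With these invented numbers the session view reports 3000 while the per-call view reports 18000; the difference is exactly the spend of the two failed sub-sessions.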

Use Cases

  1. Spend reconciliation / cost ledger — a sidecar (agent YAML + a shell / Node / Python script) appends every billed turn to its own store, so the total matches the provider invoice even when sub-sessions fail.
  2. Budget guard — a handler that warns or stops a run once cumulative cost crosses a threshold.
  3. Alerting / telemetry — forward per-turn usage and cost to an external monitoring system, independent of session persistence.
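As a rough illustration of the budget-guard use case, the sketch below accumulates per-turn cost and flags the first turn that pushes the total over a threshold. It assumes the handler receives each payload as a JSON document with `session_id` and `cost` keys; those key names, the `Guard` type, and the in-memory state are all assumptions. How a real handler persists state between invocations, or signals the runtime to stop, is outside this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// turn is the subset of the hook payload this handler reads.
// The JSON key names are assumptions, not a settled schema.
type turn struct {
	SessionID string  `json:"session_id"`
	Cost      float64 `json:"cost"` // USD for this single call
}

// Guard accumulates cost across turns and flags a budget overrun.
type Guard struct {
	BudgetUSD float64
	spent     float64
}

// Record adds one turn's cost and reports whether the cumulative
// total now exceeds the budget.
func (g *Guard) Record(payload []byte) (bool, error) {
	var t turn
	if err := json.Unmarshal(payload, &t); err != nil {
		return false, fmt.Errorf("decode payload: %w", err)
	}
	g.spent += t.Cost
	return g.spent > g.BudgetUSD, nil
}

// Spent returns the cumulative cost seen so far.
func (g *Guard) Spent() float64 { return g.spent }

func main() {
	g := &Guard{BudgetUSD: 0.60}
	payloads := []string{
		`{"session_id":"s1","cost":0.25}`,
		`{"session_id":"s1","cost":0.25}`,
		`{"session_id":"s1","cost":0.25}`,
	}
	for i, p := range payloads {
		over, err := g.Record([]byte(p))
		if err != nil {
			panic(err)
		}
		fmt.Printf("turn %d: spent=%.2f over=%v\n", i+1, g.Spent(), over)
	}
}
```

The same `Record` loop could append each turn to a file or database instead of summing in memory, which would give the cost-ledger variant of use case 1.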

Proposed Solution

Widen the after_llm_call payload at executeAfterLLMCallHooks to include usage and cost. The data already exists locally at the call site in the turn loop; this is plumbing, not new computation. With this primitive, a cost ledger lives entirely outside the core — no new core subsystem and no new storage owned by the core.

Because this widens a public, irreversible payload contract, the schema shape should be settled first. Open questions:

  1. Unpriced vs free. A flat cost: 0 can't distinguish a free call from a model with no pricing. Prefer a structural signal (e.g. nullable cost = unpriced) over a documented "check usage != nil" convention.
  2. Token field naming. Nesting under usage avoids collision with the existing compaction-related token fields — is this the preferred convention?
  3. Coverage. Main runtime and harness paths are straightforward; compaction sub-runtimes and the chatserver / a2a / acp paths likely need follow-up. Should the initial change scope to the main loop?
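For open question 1, one possible encoding of the structural signal is a nullable `cost`. The sketch below shows how a Go handler could tell the three cases apart with a pointer field. The `pricedTurn` type, the `classify` helper, and the `cost` key are illustrative assumptions rather than an agreed schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pricedTurn reads only the cost field. A pointer lets JSON null (or an
// absent key) stay distinct from an explicit 0. Key name is an assumption.
type pricedTurn struct {
	Cost *float64 `json:"cost"`
}

// classify reports how a handler could interpret the cost field.
func classify(payload string) (string, error) {
	var t pricedTurn
	if err := json.Unmarshal([]byte(payload), &t); err != nil {
		return "", err
	}
	switch {
	case t.Cost == nil:
		return "unpriced", nil // no pricing known for this model
	case *t.Cost == 0:
		return "free", nil // priced, and the price is zero
	default:
		return "priced", nil
	}
}

func main() {
	for _, p := range []string{`{"cost":null}`, `{"cost":0}`, `{"cost":0.0042}`} {
		kind, err := classify(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", p, kind)
	}
}
```

One consequence of this encoding is that a missing `cost` key decodes the same way as an explicit null, so older payloads without the field would read as unpriced, not free.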

I have a working implementation of the payload widening plus an example sidecar ledger, and am happy to open a PR once the schema is agreed.

Alternatives

  • Internal Observer + built-in cost ledger (a new pkg/costlog package with its own SQLite store). This works and needs no public API change, but it makes the core own a new subsystem, storage format, and config surface, and is Go-only. The hook approach keeps the core to a primitive and lets the ledger be any language. (I have prototyped both.)
  • Persist failed sub-sessions into the session DB. This conflicts with the intended meaning of the session store as resumable state, so it's a poorer fit.

Related Issues

  • #504 Token tracking when using subagents (closed)
  • #90 Cost tracking for multi-agent runs sometimes resets to zero (closed)
  • #1345 Stats on usage + interaction summary (open)

Additional Context

Line references are against main at the time of writing and may drift. The proposal also reconciles the existing doc that says after_llm_call populates the model id, which today it does not. (edit: that was fixed in #2911; model_id is already populated. The remaining scope of this issue is usage + cost.)

Labels

  • area/agent (work that has to do with the general agent loop / agentic features of the app)
  • kind/feat (PR adds a new feature; maps to the feat: commit prefix)
  • status/needs-design (requires architectural discussion or design review)