AI agent observability is the practice of exporting traces, metrics, and log events from an agent runtime to a backend so you can see which tools were called, how long each model request took, how many tokens were spent, and where failures happened. The Claude Agent SDK ships with OpenTelemetry instrumentation already wired into its CLI subprocess. To turn it on for Langfuse, you set five environment variables, point OTLP at the Langfuse ingestion endpoint, and the SDK starts emitting spans for every interaction, model call, and tool. Total code: under 30 lines.
How do you trace an AI agent?
You trace an AI agent by emitting an OpenTelemetry span for every unit of work: the agent loop turn, the model request inside it, each tool call, and any sub-agent spawned through delegation. The Claude Agent SDK records these automatically once telemetry is enabled.
The SDK runs the Claude Code CLI as a child process, and the CLI is what carries the instrumentation. Your application code does not produce telemetry directly: you pass configuration through environment variables, and the CLI exports OTLP over HTTP straight to your backend.
Four spans matter, per the official Claude Agent SDK observability docs:
- claude_code.interaction -- one full turn of the agent loop, prompt to response.
- claude_code.llm_request -- one call to the Anthropic API, with model name, latency, and token counts as attributes.
- claude_code.tool -- one tool invocation, with child spans claude_code.tool.blocked_on_user and claude_code.tool.execution.
- claude_code.hook -- one hook execution (requires ENABLE_BETA_TRACING_DETAILED=1).
When the agent uses the Task tool to spawn a sub-agent, the child agent's spans nest under the parent's claude_code.tool span, so the full delegation chain shows up as one trace in your backend.
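Put together, a single turn that makes one model call, runs one tool, and delegates once would land in your backend as a tree shaped roughly like this (an illustrative sketch of the hierarchy described above, not literal exporter output):
claude_code.interaction                      # one agent turn
├─ claude_code.llm_request                   # model, latency, token counts
├─ claude_code.tool (web_search)
│  └─ claude_code.tool.execution
└─ claude_code.tool (Task -> sub-agent)
   └─ claude_code.interaction                # sub-agent's turn nests under the parent's tool span
      └─ claude_code.llm_request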
What does OpenTelemetry give you for agent observability?
OpenTelemetry gives you three independent signals -- traces, metrics, and log events -- on a vendor-neutral wire format (OTLP). The Claude Agent SDK lets you flip each one on independently with its own exporter, so you only ship the data you need.
| Signal | What it contains | Enable with |
|---|---|---|
| Traces | Spans for each interaction, model request, tool call, hook | OTEL_TRACES_EXPORTER=otlp + CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1 |
| Metrics | Counters for tokens, cost, sessions, lines of code, tool decisions | OTEL_METRICS_EXPORTER=otlp |
| Log events | Structured records for each prompt, API request, API error, tool result | OTEL_LOGS_EXPORTER=otlp |
The GenAI semantic conventions in OpenTelemetry 1.37+ standardize attribute names like gen_ai.usage.input_tokens and gen_ai.request.model across vendors. According to Datadog (2026), these conventions mean you instrument once and switch backends with no SDK changes.
The stakes are not academic. The LLM observability platform market grew from $1.97B in 2025 to $2.69B in 2026 (36.3% CAGR), and Gartner forecasts that LLM observability will be part of 50% of GenAI deployments by 2028, up from 15% in early 2026.
How do you set up Claude Agent SDK observability with Langfuse?
Setup is seven steps and under 30 lines of configuration plus Python. You provision Langfuse credentials, set environment variables for the OTLP exporter, run a query() call, and the spans appear in Langfuse within five seconds.
Step 1: Get Langfuse credentials
Create a project at cloud.langfuse.com (free tier: 50,000 observations per month) or self-host. In Project Settings -- API Keys, generate a public key (pk-lf-...) and secret key (sk-lf-...).
Step 2: Base64-encode the keys for basic auth
echo -n "pk-lf-1234:sk-lf-5678" | base64
# -> cGstbGYtMTIzNDpzay1sZi01Njc4
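If you would rather produce the header value from Python (for example in a deploy script), the standard-library equivalent is a one-liner; the keys below are the same placeholders as above:
import base64

# Placeholder keys -- substitute your real pk-lf-... / sk-lf-... pair.
token = base64.b64encode(b"pk-lf-1234:sk-lf-5678").decode()
print(f"Authorization=Basic {token}")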
Step 3: Install the SDK
pip install claude-agent-sdk
Note: with the OTLP path you do not need the Langfuse Python SDK at all. The Claude Agent CLI handles the export.
Step 4: Configure environment variables
This is the entire integration. For Langfuse Cloud (EU), set:
# otel.env.yaml -- load these into os.environ at startup or pass them to ClaudeAgentOptions(env=...)
CLAUDE_CODE_ENABLE_TELEMETRY: "1"
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA: "1"
OTEL_TRACES_EXPORTER: "otlp"
OTEL_METRICS_EXPORTER: "otlp"
OTEL_LOGS_EXPORTER: "otlp"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
OTEL_EXPORTER_OTLP_ENDPOINT: "https://cloud.langfuse.com/api/public/otel"
OTEL_EXPORTER_OTLP_HEADERS: "Authorization=Basic cGstbGYtMTIzNDpzay1sZi01Njc4"
OTEL_SERVICE_NAME: "support-triage-agent"
OTEL_RESOURCE_ATTRIBUTES: "service.version=1.4.0,deployment.environment=production"
OTEL_TRACES_EXPORT_INTERVAL: "1000"
For self-hosted Langfuse, swap the endpoint to http://your-host:3000/api/public/otel (added in Langfuse v3.22.0).
Step 5: Wire it into the agent
import asyncio
import os

from claude_agent_sdk import query, ClaudeAgentOptions

OTEL_ENV = {k: os.environ[k] for k in os.environ if k.startswith(("OTEL_", "CLAUDE_CODE_"))}

async def main():
    options = ClaudeAgentOptions(env=OTEL_ENV)
    async for msg in query(
        prompt="Triage this Zendesk ticket and draft a reply.",
        options=options,
    ):
        print(msg)

asyncio.run(main())
That is the entire instrumentation: 12 lines of Python plus the YAML. The SDK injects W3C TRACEPARENT automatically, so if your app already runs inside an OpenTelemetry span, the agent run nests under it.
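For example, a minimal sketch of that nesting, assuming your service already configures the OpenTelemetry Python SDK and reusing OTEL_ENV from the snippet above; handle_ticket and the tracer name are illustrative:
from opentelemetry import trace

from claude_agent_sdk import query, ClaudeAgentOptions

tracer = trace.get_tracer("support-triage-app")

async def handle_ticket(ticket_text: str) -> None:
    # The agent run starts inside this span, so per the TRACEPARENT propagation
    # described above its claude_code.* spans should nest under it.
    with tracer.start_as_current_span("triage-ticket"):
        async for msg in query(prompt=ticket_text, options=ClaudeAgentOptions(env=OTEL_ENV)):
            print(msg)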
Step 6: Verify in the Langfuse UI
Open Langfuse -- Tracing. You should see a row for each claude_code.interaction span with nested llm_request and tool children. Click into a trace and you get the prompt, the model output, token counts, latency per call, and the tool inputs (if you opted in with OTEL_LOG_TOOL_DETAILS=1).
Step 7: Build dashboards
In Langfuse Dashboards, add four tiles:
- Cost per run -- sum of gen_ai.usage.cost grouped by session.id.
- Tool p95 latency -- 95th percentile of claude_code.tool.execution duration grouped by gen_ai.tool.name.
- Tool error rate -- count of claude_code.tool spans with status=error divided by total, grouped by tool name.
- Turns per run -- count of claude_code.interaction spans per session.id.
Can you use Datadog or Honeycomb for AI agents?
Yes. Both Datadog LLM Observability and Honeycomb accept OTLP HTTP and natively support the OpenTelemetry GenAI semantic conventions (OTel 1.37+). You change one environment variable -- the OTEL_EXPORTER_OTLP_ENDPOINT -- and the same Claude Agent SDK code ships traces to either backend.
Per Datadog's 2026 announcement, you can send LLM traces directly from OpenTelemetry-instrumented applications without the Datadog LLM Observability SDK or a Datadog Agent. For Honeycomb, point OTLP at https://api.honeycomb.io with x-honeycomb-team as the API key header.
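For instance, keeping every other variable from Step 4 the same, the Honeycomb variant of the two exporter settings would look roughly like this (YOUR_API_KEY is a placeholder for the Honeycomb ingest key):
OTEL_EXPORTER_OTLP_ENDPOINT: "https://api.honeycomb.io"
OTEL_EXPORTER_OTLP_HEADERS: "x-honeycomb-team=YOUR_API_KEY"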
When to pick which backend:
| Backend | OTLP-native | GenAI semantic conventions | Best for |
|---|---|---|---|
| Langfuse | Yes | Yes (via SDK v4) | LLM-specific UI, prompt linking, evals, free tier |
| Datadog LLM Observability | Yes | Yes (OTel 1.37+) | Enterprises already on Datadog APM |
| Honeycomb | Yes | Yes | High-cardinality trace querying, BubbleUp anomaly detection |
| Grafana Tempo + Loki | Yes | Manual mapping | Self-hosted, infra teams already on Grafana |
| SigNoz | Yes | Yes | Open-source full-stack OTLP backend |
| New Relic | Yes | Yes | Bundled APM + LLM in a single contract |
What metrics matter most for production AI agents?
Five metrics carry the diagnostic load. Latency alone is misleading for agents because token counts vary by 10x across runs. According to Groundcover's 2026 agent observability guide, a single LLM call generates 8-15 spans versus 2-3 for a typical API endpoint, so dashboards need to aggregate at the right level.
The minimum viable agent dashboard tracks:
- Cost per run -- input tokens times input price plus output tokens times output price, grouped by session.id and service.name. This is the metric that gets agents shut off in production. Track p50, p95, and p99.
- Tool p95 latency -- per-tool, because a slow web_search is a different problem than a slow bash. Filter on gen_ai.tool.name.
- Tool error rate -- failures per 100 invocations per tool. A tool failing 12% of the time is invisible at the run level but shows up here. From the Groundcover agent observability guide (2026), per-tool error rates are the single best leading indicator of user-reported issues.
- Turns per run -- count of claude_code.interaction spans per session. Sudden growth here means the agent is looping.
- Success rate -- the share of runs that complete the task per your offline evals. Pair this with agent eval frameworks and link traces back to eval scores in Langfuse.
Latency without cost is half a story. Cost without success rate is the other half. You need all five.
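As a quick worked example of the cost-per-run arithmetic above, a sketch in Python; the per-million-token prices are placeholders you would swap for your model's actual pricing:
def run_cost(input_tokens: int, output_tokens: int,
             input_price_per_mtok: float = 3.00,    # placeholder USD per 1M input tokens
             output_price_per_mtok: float = 15.00) -> float:  # placeholder USD per 1M output tokens
    """Dollar cost of one run from summed gen_ai.usage.* token counts."""
    return (input_tokens / 1_000_000) * input_price_per_mtok + (output_tokens / 1_000_000) * output_price_per_mtok

# 120,000 input + 8,000 output tokens -> 0.36 + 0.12 = 0.48 USD
print(run_cost(120_000, 8_000))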
How do you alert on agent cost blow-ups?
Set two alerts: one on per-session token totals to catch runaway loops, one on cost-per-minute to catch traffic spikes. Both run as queries against your OTLP backend.
Alert 1: Per-session token blow-up
Fire when a single session.id exceeds your max-tokens budget. In Langfuse, this is a metric-condition alert on the sum of gen_ai.usage.total_tokens grouped by session.id. A sensible default for production agents is 200,000 tokens per session.
# pseudo-config -- exact syntax depends on your alerting tool
name: agent-session-token-blowup
metric: sum(gen_ai.usage.total_tokens)
group_by: [session.id, service.name]
threshold: 200000
window: 15m
severity: page
Alert 2: Cost-per-minute anomaly
Fire when the rolling 5-minute spend exceeds 3x your historical p99. This catches a viral spike, a misbehaving cron, or a free-tier abuse incident.
name: agent-cost-rate-spike
metric: sum(gen_ai.usage.cost) / 5m
group_by: [service.name, deployment.environment]
threshold: 3 * historical_p99
window: 5m
severity: page
A third worth adding once you have baseline data: tool error rate above 5% sustained for 10 minutes, grouped by gen_ai.tool.name. Tools fail silently more often than models do, and a flaky tool inflates token cost because the agent retries.
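Sketched in the same pseudo-config style as the two alerts above:
# pseudo-config -- exact syntax depends on your alerting tool
name: agent-tool-error-rate
metric: count(claude_code.tool where status=error) / count(claude_code.tool)
group_by: [gen_ai.tool.name, service.name]
threshold: 0.05
window: 10m
severity: ticket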
Per OneUptime's 2026 analysis, agents are now the largest source of unplanned LLM spend in production. The cost alert above is non-optional.