AI agent observability is the practice of exporting traces, metrics, and log events from an agent runtime to a backend so you can see which tools were called, how long each model request took, how many tokens were spent, and where failures happened. The Claude Agent SDK ships with OpenTelemetry instrumentation already wired into its CLI subprocess. To turn it on for Langfuse, you set a handful of environment variables, point OTLP at the Langfuse ingestion endpoint, and the SDK starts emitting spans for every interaction, model call, and tool. Total code: under 30 lines.

How do you trace an AI agent?

You trace an AI agent by emitting an OpenTelemetry span for every unit of work: the agent loop turn, the model request inside it, each tool call, and any sub-agent spawned through delegation. The Claude Agent SDK records these automatically once telemetry is enabled.

The SDK runs the Claude Code CLI as a child process and the CLI is what carries the instrumentation. Your application code does not produce telemetry directly. You pass configuration through environment variables and the CLI exports OTLP HTTP straight to your backend.

Four spans matter, per the official Claude Agent SDK observability docs:

  • claude_code.interaction -- one full turn of the agent loop, prompt to response.
  • claude_code.llm_request -- one call to the Anthropic API, with model name, latency, and token counts as attributes.
  • claude_code.tool -- one tool invocation, with child spans claude_code.tool.blocked_on_user and claude_code.tool.execution.
  • claude_code.hook -- one hook execution (requires ENABLE_BETA_TRACING_DETAILED=1).

When the agent uses the Task tool to spawn a sub-agent, the child agent's spans nest under the parent's claude_code.tool span, so the full delegation chain shows up as one trace in your backend.

What does OpenTelemetry give you for agent observability?

OpenTelemetry gives you three independent signals -- traces, metrics, and log events -- on a vendor-neutral wire format (OTLP). The Claude Agent SDK lets you flip each one on independently with its own exporter, so you only ship the data you need.

  • Traces -- spans for each interaction, model request, tool call, and hook. Enable with OTEL_TRACES_EXPORTER=otlp plus CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1.
  • Metrics -- counters for tokens, cost, sessions, lines of code, and tool decisions. Enable with OTEL_METRICS_EXPORTER=otlp.
  • Log events -- structured records for each prompt, API request, API error, and tool result. Enable with OTEL_LOGS_EXPORTER=otlp.

The GenAI semantic conventions in OpenTelemetry 1.37+ standardize attribute names like gen_ai.usage.input_tokens and gen_ai.request.model across vendors. According to Datadog (2026), these conventions mean you instrument once and switch backends with no SDK changes.
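
As an illustration of what "instrument once" means in practice, here is a hand-built span-attribute record (the model name and token counts are made up, and real exporter output has many more fields). A query written against these keys reads the same whichever backend stores the span:

```python
# A span's attributes under the GenAI semantic conventions use the
# same keys regardless of which backend ingests them.
span_attributes = {
    "gen_ai.request.model": "claude-sonnet-4-5",
    "gen_ai.usage.input_tokens": 1843,
    "gen_ai.usage.output_tokens": 412,
}

def total_tokens(attrs: dict) -> int:
    # Works against Langfuse, Datadog, or Honeycomb data alike,
    # because the attribute names are standardized.
    return attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]

print(total_tokens(span_attributes))  # -> 2255
```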

The stakes are not academic. The LLM observability platform market grew from $1.97B in 2025 to $2.69B in 2026 (36.3% year-over-year), and Gartner forecasts that LLM observability will be part of 50% of GenAI deployments by 2028, up from 15% in early 2026.

AI agent observability adoption (2026 vs. Gartner 2028 forecast):

  • Orgs running agents in production (2026): 57%
  • GenAI deployments with LLM observability (Q1 2026): 15%
  • GenAI deployments with LLM observability (Gartner 2028 forecast): 50%

Source: Gartner via Demand Gen Report (2026)

How do you set up Claude Agent SDK observability with Langfuse?

Setup is seven steps and under 30 lines of configuration and Python combined. You provision Langfuse credentials, set environment variables for the OTLP exporter, run a query() call, and the spans appear in Langfuse within about five seconds.

Step 1: Get Langfuse credentials

Create a project at cloud.langfuse.com (free tier: 50,000 observations per month) or self-host. In Project Settings -- API Keys, generate a public key (pk-lf-...) and secret key (sk-lf-...).

Step 2: Base64-encode the keys for basic auth

echo -n "pk-lf-1234:sk-lf-5678" | base64
# -> cGstbGYtMTIzNDpzay1sZi01Njc4
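
The same encoding in Python, if you prefer to generate the header value programmatically (the keys here are the placeholder values from above, not real credentials):

```python
import base64

public_key = "pk-lf-1234"
secret_key = "sk-lf-5678"

# HTTP basic auth value: base64 of "public:secret".
token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
print(f"Authorization=Basic {token}")
# -> Authorization=Basic cGstbGYtMTIzNDpzay1sZi01Njc4
```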

Step 3: Install the SDK

pip install claude-agent-sdk

Note: with the OTLP path you do not need the Langfuse Python SDK at all. The Claude Agent CLI handles the export.

Step 4: Configure environment variables

This is the entire integration. For Langfuse Cloud (EU), set:

# otel.env -- load with python-dotenv, or export in your shell
CLAUDE_CODE_ENABLE_TELEMETRY=1
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic cGstbGYtMTIzNDpzay1sZi01Njc4"
OTEL_SERVICE_NAME=support-triage-agent
OTEL_RESOURCE_ATTRIBUTES=service.version=1.4.0,deployment.environment=production
OTEL_TRACES_EXPORT_INTERVAL=1000

For self-hosted Langfuse, swap the endpoint to http://your-host:3000/api/public/otel (added in Langfuse v3.22.0).
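
If you do not want a python-dotenv dependency, a minimal stdlib loader for a KEY=VALUE env file is easy to sketch (load_env is a hypothetical helper, not part of the SDK):

```python
import os

def load_env(path: str) -> None:
    """Parse simple KEY=VALUE lines and merge them into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            # partition at the FIRST "=", so values containing "=" survive
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

if os.path.exists("otel.env"):
    load_env("otel.env")
```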

Step 5: Wire it into the agent

import asyncio
import os

from claude_agent_sdk import ClaudeAgentOptions, query

# Forward only the telemetry variables; the CLI subprocess reads them.
OTEL_ENV = {
    k: v
    for k, v in os.environ.items()
    if k.startswith(("OTEL_", "CLAUDE_CODE_"))
}

async def main():
    # env is passed through to the Claude Code CLI child process, which
    # carries the OpenTelemetry instrumentation and exports OTLP HTTP.
    options = ClaudeAgentOptions(env=OTEL_ENV)
    async for msg in query(
        prompt="Triage this Zendesk ticket and draft a reply.",
        options=options,
    ):
        print(msg)

asyncio.run(main())

That is the entire instrumentation: one short script plus the env file. The SDK injects W3C TRACEPARENT automatically, so if your app already runs inside an OpenTelemetry span, the agent run nests under it.

Step 6: Verify in the Langfuse UI

Open Langfuse -- Tracing. You should see a row for each claude_code.interaction span with nested llm_request and tool children. Click into a trace and you get the prompt, the model output, token counts, latency per call, and the tool inputs (if you opted in with OTEL_LOG_TOOL_DETAILS=1).

Step 7: Build dashboards

In Langfuse Dashboards, add four tiles:

  1. Cost per run -- sum gen_ai.usage.cost grouped by session.id.
  2. Tool p95 latency -- 95th percentile of claude_code.tool.execution duration grouped by gen_ai.tool.name.
  3. Tool error rate -- count of claude_code.tool spans with status=error divided by total, grouped by tool name.
  4. Turns per run -- count of claude_code.interaction spans per session.id.

Default OTLP export intervals for the Claude Agent SDK: metrics every 60 seconds, traces every 5 seconds, logs every 5 seconds. Source: Claude Agent SDK observability docs (2026).

Can you use Datadog or Honeycomb for AI agents?

Yes. Both Datadog LLM Observability and Honeycomb accept OTLP HTTP and natively support the OpenTelemetry GenAI semantic conventions (OTel 1.37+). You change one environment variable -- the OTEL_EXPORTER_OTLP_ENDPOINT -- and the same Claude Agent SDK code ships traces to either backend.

Per Datadog's 2026 announcement, you can send LLM traces directly from OpenTelemetry-instrumented applications without the Datadog LLM Observability SDK or a Datadog Agent. For Honeycomb, point OTLP at https://api.honeycomb.io with x-honeycomb-team as the API key header.
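
To make the swap concrete, the only values that change between backends are the endpoint and the auth header; everything else stays identical. The header values below are placeholders, not real keys:

```python
import os

# Langfuse Cloud (EU): basic auth with the base64-encoded pk:sk pair.
langfuse = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://cloud.langfuse.com/api/public/otel",
    "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Basic cGstbGYtMTIzNDpzay1sZi01Njc4",
}

# Honeycomb: the API key goes in the x-honeycomb-team header instead.
honeycomb = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://api.honeycomb.io",
    "OTEL_EXPORTER_OTLP_HEADERS": "x-honeycomb-team=YOUR_API_KEY",
}

# Exporters, protocol, and service name are untouched by the switch.
os.environ.update(honeycomb)
```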

When to pick which backend: the comparison table at the end of this article summarizes the tradeoffs.

What metrics matter most for production AI agents?

Five metrics carry the diagnostic load. Latency alone is misleading for agents because token counts vary by 10x across runs. According to Groundcover's 2026 agent observability guide, a single LLM call generates 8-15 spans versus 2-3 for a typical API endpoint, so dashboards need to aggregate at the right level.

The minimum viable agent dashboard tracks:

  1. Cost per run -- input tokens times input price plus output tokens times output price, grouped by session.id and service.name. This is the metric that gets agents shut off in production. Track p50, p95, and p99.
  2. Tool p95 latency -- per-tool, because a slow web_search is a different problem than a slow bash. Filter on gen_ai.tool.name.
  3. Tool error rate -- failures per 100 invocations per tool. A tool failing 12% of the time is invisible at the run level but shows up here. From the Groundcover agent observability guide (2026), per-tool error rates are the single best leading indicator of user-reported issues.
  4. Turns per run -- count of claude_code.interaction spans per session. Sudden growth here means the agent is looping.
  5. Success rate -- the share of runs that complete the task per your offline evals. Pair this with agent eval frameworks and link traces back to eval scores in Langfuse.

Latency without cost is half a story. Cost without success rate is the other half. You need all five.
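
A back-of-envelope version of metric 1, using hypothetical per-token prices and made-up session data (substitute your model's current pricing):

```python
# Hypothetical prices in USD per million tokens -- substitute real ones.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

# Token counts aggregated from gen_ai.usage.* attributes, keyed by session.id.
runs = {
    "sess-a": {"input_tokens": 120_000, "output_tokens": 8_000},
    "sess-b": {"input_tokens": 45_000, "output_tokens": 3_500},
}

def cost_per_run(usage: dict) -> float:
    # input tokens * input price + output tokens * output price
    return (usage["input_tokens"] * INPUT_PRICE_PER_M
            + usage["output_tokens"] * OUTPUT_PRICE_PER_M) / 1_000_000

for session_id, usage in runs.items():
    print(session_id, round(cost_per_run(usage), 4))
```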

How do you alert on agent cost blow-ups?

Set two alerts: one on per-session token totals to catch runaway loops, one on cost-per-minute to catch traffic spikes. Both run as queries against your OTLP backend.

Alert 1: Per-session token blow-up

Fire when a single session.id exceeds your max-tokens budget. In Langfuse, this is a metric-condition alert on the sum of gen_ai.usage.total_tokens grouped by session.id. A sensible default for production agents is 200,000 tokens per session.

# pseudo-config -- exact syntax depends on your alerting tool
name: agent-session-token-blowup
metric: sum(gen_ai.usage.total_tokens)
group_by: [session.id, service.name]
threshold: 200000
window: 15m
severity: page
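
In code, the alert condition is a grouped sum checked against a threshold. A sketch with in-memory sample data (the real query runs inside your backend's alerting engine):

```python
from collections import defaultdict

SESSION_TOKEN_BUDGET = 200_000

# (session_id, gen_ai.usage.total_tokens) samples, as a metrics query
# over the 15-minute window might return them.
samples = [
    ("sess-a", 180_000),
    ("sess-a", 40_000),   # pushes sess-a over budget
    ("sess-b", 90_000),
]

totals = defaultdict(int)
for session_id, tokens in samples:
    totals[session_id] += tokens

# Sessions that should fire the page.
blown = [s for s, t in totals.items() if t > SESSION_TOKEN_BUDGET]
print(blown)  # -> ['sess-a']
```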

Alert 2: Cost-per-minute anomaly

Fire when the rolling 5-minute spend exceeds 3x your historical p99. This catches a viral spike, a misbehaving cron, or a free-tier abuse incident.

name: agent-cost-rate-spike
metric: sum(gen_ai.usage.cost) / 5m
group_by: [service.name, deployment.environment]
threshold: 3 * historical_p99
window: 5m
severity: page

A third worth adding once you have baseline data: tool error rate above 5% sustained for 10 minutes, grouped by gen_ai.tool.name. Tools fail silently more often than models do, and a flaky tool inflates token cost because the agent retries.
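
The same idea for the tool error-rate alert, over claude_code.tool span records. The tuple shape and status values here are illustrative, not the SDK's wire format:

```python
from collections import Counter

# (gen_ai.tool.name, span status) pairs from claude_code.tool spans.
spans = [
    ("web_search", "ok"), ("web_search", "error"),
    ("bash", "ok"), ("bash", "ok"), ("bash", "ok"),
]

calls, errors = Counter(), Counter()
for tool, status in spans:
    calls[tool] += 1
    if status == "error":
        errors[tool] += 1

# Alert when a tool's sustained error rate exceeds 5%.
for tool in calls:
    rate = errors[tool] / calls[tool]
    if rate > 0.05:
        print(f"{tool}: {rate:.0%} error rate")
```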

Per OneUptime's 2026 analysis, agents are now the largest source of unplanned LLM spend in production. The cost alert above is non-optional.

Backend comparison (OTLP-native / GenAI semantic conventions / best for):

  • Langfuse -- OTLP-native; GenAI conventions (via SDK v4); LLM-specific UI, prompt linking, evals, free tier.
  • Datadog LLM Observability -- OTLP-native; GenAI conventions (OTel 1.37+); enterprises already on Datadog APM.
  • Honeycomb -- OTLP-native; GenAI conventions; high-cardinality trace querying, BubbleUp anomaly detection.
  • Grafana Tempo + Loki -- OTLP-native; manual attribute mapping; self-hosted, infra teams already on Grafana.
  • SigNoz -- OTLP-native; GenAI conventions; open-source full-stack OTLP backend.
  • New Relic -- OTLP-native; GenAI conventions; bundled APM + LLM in a single contract.