AI agent observability is the practice of exporting traces, metrics, and log events from an agent runtime to a backend so you can see which tools were called, how long each model request took, how many tokens were spent, and where failures happened. The Claude Agent SDK ships with OpenTelemetry instrumentation already wired into its CLI subprocess. To turn it on for Langfuse, you set five environment variables, point OTLP at the Langfuse ingestion endpoint, and the SDK starts emitting spans for every interaction, model call, and tool. Total code: under 30 lines.
How do you trace an AI agent?
You trace an AI agent by emitting an OpenTelemetry span for every unit of work: the agent loop turn, the model request inside it, each tool call, and any sub-agent spawned through delegation. The Claude Agent SDK records these automatically once telemetry is enabled.
The SDK runs the Claude Code CLI as a child process, and the CLI is what carries the instrumentation. Your application code does not produce telemetry directly: you pass configuration through environment variables, and the CLI exports OTLP over HTTP straight to your backend.
Four spans matter, per the official Claude Agent SDK observability docs:
- claude_code.interaction -- one full turn of the agent loop, prompt to response.
- claude_code.llm_request -- one call to the Anthropic API, with model name, latency, and token counts as attributes.
- claude_code.tool -- one tool invocation, with child spans claude_code.tool.blocked_on_user and claude_code.tool.execution.
- claude_code.hook -- one hook execution (requires ENABLE_BETA_TRACING_DETAILED=1).
When the agent uses the Task tool to spawn a sub-agent, the child agent's spans nest under the parent's claude_code.tool span, so the full delegation chain shows up as one trace in your backend.
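Put together, a single turn that makes one model call, runs one tool, and delegates once would land in your backend as a tree shaped roughly like this (an illustrative sketch of the hierarchy described above, not literal exporter output):
claude_code.interaction                      # one agent turn
├─ claude_code.llm_request                   # model, latency, token counts
├─ claude_code.tool (web_search)
│  └─ claude_code.tool.execution
└─ claude_code.tool (Task -> sub-agent)
   └─ claude_code.interaction                # sub-agent's turn nests under the parent's tool span
      └─ claude_code.llm_request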
What does OpenTelemetry give you for agent observability?
OpenTelemetry gives you three independent signals -- traces, metrics, and log events -- on a vendor-neutral wire format (OTLP). The Claude Agent SDK lets you flip each one on independently with its own exporter, so you only ship the data you need.
| Signal | What it contains | Enable with |
|---|---|---|
| Traces | Spans for each interaction, model request, tool call, hook | OTEL_TRACES_EXPORTER=otlp + CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1 |
| Metrics | Counters for tokens, cost, sessions, lines of code, tool decisions | OTEL_METRICS_EXPORTER=otlp |
| Log events | Structured records for each prompt, API request, API error, tool result | OTEL_LOGS_EXPORTER=otlp |
The GenAI semantic conventions in OpenTelemetry 1.37+ standardize attribute names like gen_ai.usage.input_tokens and gen_ai.request.model across vendors. According to Datadog (2026), these conventions mean you instrument once and switch backends with no SDK changes.
The stakes are not academic. The LLM observability platform market grew from $1.97B in 2025 to $2.69B in 2026 (36.3% CAGR), and Gartner forecasts that LLM observability will be part of 50% of GenAI deployments by 2028, up from 15% in early 2026.
How do you set up Claude Agent SDK observability with Langfuse?
Setup is seven steps and under 30 lines of configuration plus Python. You provision Langfuse credentials, set environment variables for the OTLP exporter, run a query() call, and the spans appear in Langfuse within five seconds.
Step 1: Get Langfuse credentials
Create a project at cloud.langfuse.com (free tier: 50,000 observations per month) or self-host. In Project Settings -- API Keys, generate a public key (pk-lf-...) and secret key (sk-lf-...).
Step 2: Base64-encode the keys for basic auth
echo -n "pk-lf-1234:sk-lf-5678" | base64
# -> cGstbGYtMTIzNDpzay1sZi01Njc4
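If you would rather produce the header value from Python (for example in a deploy script), the standard-library equivalent is a one-liner; the keys below are the same placeholders as above:
import base64

# Placeholder keys -- substitute your real pk-lf-... / sk-lf-... pair.
token = base64.b64encode(b"pk-lf-1234:sk-lf-5678").decode()
print(f"Authorization=Basic {token}")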
Step 3: Install the SDK
pip install claude-agent-sdk
Note: with the OTLP path you do not need the Langfuse Python SDK at all. The Claude Agent CLI handles the export.
Step 4: Configure environment variables
This is the entire integration. For Langfuse Cloud (EU), set:
# otel.env.yaml -- load these into os.environ at startup or pass them to ClaudeAgentOptions(env=...)
CLAUDE_CODE_ENABLE_TELEMETRY: "1"
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA: "1"
OTEL_TRACES_EXPORTER: "otlp"
OTEL_METRICS_EXPORTER: "otlp"
OTEL_LOGS_EXPORTER: "otlp"
OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
OTEL_EXPORTER_OTLP_ENDPOINT: "https://cloud.langfuse.com/api/public/otel"
OTEL_EXPORTER_OTLP_HEADERS: "Authorization=Basic cGstbGYtMTIzNDpzay1sZi01Njc4"
OTEL_SERVICE_NAME: "support-triage-agent"
OTEL_RESOURCE_ATTRIBUTES: "service.version=1.4.0,deployment.environment=production"
OTEL_TRACES_EXPORT_INTERVAL: "1000"
For self-hosted Langfuse, swap the endpoint to http://your-host:3000/api/public/otel (added in Langfuse v3.22.0).
Step 5: Wire it into the agent
import asyncio
import os

from claude_agent_sdk import query, ClaudeAgentOptions

OTEL_ENV = {k: os.environ[k] for k in os.environ if k.startswith(("OTEL_", "CLAUDE_CODE_"))}

async def main():
    options = ClaudeAgentOptions(env=OTEL_ENV)
    async for msg in query(
        prompt="Triage this Zendesk ticket and draft a reply.",
        options=options,
    ):
        print(msg)

asyncio.run(main())
That is the entire instrumentation: 12 lines of Python plus the YAML. The SDK injects W3C TRACEPARENT automatically, so if your app already runs inside an OpenTelemetry span, the agent run nests under it.
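For example, a minimal sketch of that nesting, assuming your service already configures the OpenTelemetry Python SDK and reusing OTEL_ENV from the snippet above; handle_ticket and the tracer name are illustrative:
from opentelemetry import trace

from claude_agent_sdk import query, ClaudeAgentOptions

tracer = trace.get_tracer("support-triage-app")

async def handle_ticket(ticket_text: str) -> None:
    # The agent run starts inside this span, so per the TRACEPARENT propagation
    # described above its claude_code.* spans should nest under it.
    with tracer.start_as_current_span("triage-ticket"):
        async for msg in query(prompt=ticket_text, options=ClaudeAgentOptions(env=OTEL_ENV)):
            print(msg)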
Step 6: Verify in the Langfuse UI
Open Langfuse -- Tracing. You should see a row for each claude_code.interaction span with nested llm_request and tool children. Click into a trace and you get the prompt, the model output, token counts, latency per call, and the tool inputs (if you opted in with OTEL_LOG_TOOL_DETAILS=1).
Step 7: Build dashboards
In Langfuse Dashboards, add four tiles:
- Cost per run -- sum of gen_ai.usage.cost grouped by session.id.
- Tool p95 latency -- 95th percentile of claude_code.tool.execution duration grouped by gen_ai.tool.name.
- Tool error rate -- count of claude_code.tool spans with status=error divided by total, grouped by tool name.
- Turns per run -- count of claude_code.interaction spans per session.id.
Can you use Datadog or Honeycomb for AI agents?
Yes. Both Datadog LLM Observability and Honeycomb accept OTLP HTTP and natively support the OpenTelemetry GenAI semantic conventions (OTel 1.37+). You change one environment variable -- the OTEL_EXPORTER_OTLP_ENDPOINT -- and the same Claude Agent SDK code ships traces to either backend.
Per Datadog's 2026 announcement, you can send LLM traces directly from OpenTelemetry-instrumented applications without the Datadog LLM Observability SDK or a Datadog Agent. For Honeycomb, point OTLP at https://api.honeycomb.io with x-honeycomb-team as the API key header.
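For instance, keeping every other variable from Step 4 the same, the Honeycomb variant of the two exporter settings would look roughly like this (YOUR_API_KEY is a placeholder for the Honeycomb ingest key):
OTEL_EXPORTER_OTLP_ENDPOINT: "https://api.honeycomb.io"
OTEL_EXPORTER_OTLP_HEADERS: "x-honeycomb-team=YOUR_API_KEY"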
When to pick which backend:
| Backend | OTLP-native | GenAI semantic conventions | Best for |
|---|---|---|---|
| Langfuse | Yes | Yes (via SDK v4) | LLM-specific UI, prompt linking, evals, free tier |
| Datadog LLM Observability | Yes | Yes (OTel 1.37+) | Enterprises already on Datadog APM |
| Honeycomb | Yes | Yes | High-cardinality trace querying, BubbleUp anomaly detection |
| Grafana Tempo + Loki | Yes | Manual mapping | Self-hosted, infra teams already on Grafana |
| SigNoz | Yes | Yes | Open-source full-stack OTLP backend |
| New Relic | Yes | Yes | Bundled APM + LLM in a single contract |
What metrics matter most for production AI agents?
Five metrics carry the diagnostic load. Latency alone is misleading for agents because token counts vary by 10x across runs. According to Groundcover's 2026 agent observability guide, a single LLM call generates 8-15 spans versus 2-3 for a typical API endpoint, so dashboards need to aggregate at the right level.
The minimum viable agent dashboard tracks:
- Cost per run -- input tokens times input price plus output tokens times output price, grouped by session.id and service.name. This is the metric that gets agents shut off in production. Track p50, p95, and p99.
- Tool p95 latency -- per-tool, because a slow web_search is a different problem than a slow bash. Filter on gen_ai.tool.name.
- Tool error rate -- failures per 100 invocations per tool. A tool failing 12% of the time is invisible at the run level but shows up here. From the Groundcover agent observability guide (2026), per-tool error rates are the single best leading indicator of user-reported issues.
- Turns per run -- count of claude_code.interaction spans per session. Sudden growth here means the agent is looping.
- Success rate -- the share of runs that complete the task per your offline evals. Pair this with agent eval frameworks and link traces back to eval scores in Langfuse.
Latency without cost is half a story. Cost without success rate is the other half. You need all five.
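As a quick worked example of the cost-per-run arithmetic above, a sketch in Python; the per-million-token prices are placeholders you would swap for your model's actual pricing:
def run_cost(input_tokens: int, output_tokens: int,
             input_price_per_mtok: float = 3.00,    # placeholder USD per 1M input tokens
             output_price_per_mtok: float = 15.00) -> float:  # placeholder USD per 1M output tokens
    """Dollar cost of one run from summed gen_ai.usage.* token counts."""
    return (input_tokens / 1_000_000) * input_price_per_mtok + (output_tokens / 1_000_000) * output_price_per_mtok

# 120,000 input + 8,000 output tokens -> 0.36 + 0.12 = 0.48 USD
print(run_cost(120_000, 8_000))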
How do you alert on agent cost blow-ups?
Set two alerts: one on per-session token totals to catch runaway loops, one on cost-per-minute to catch traffic spikes. Both run as queries against your OTLP backend.
Alert 1: Per-session token blow-up
Fire when a single session.id exceeds your max-tokens budget. In Langfuse, this is a metric-condition alert on the sum of gen_ai.usage.total_tokens grouped by session.id. A sensible default for production agents is 200,000 tokens per session.
# pseudo-config -- exact syntax depends on your alerting tool
name: agent-session-token-blowup
metric: sum(gen_ai.usage.total_tokens)
group_by: [session.id, service.name]
threshold: 200000
window: 15m
severity: page
Alert 2: Cost-per-minute anomaly
Fire when the rolling 5-minute spend exceeds 3x your historical p99. This catches a viral spike, a misbehaving cron, or a free-tier abuse incident.
name: agent-cost-rate-spike
metric: sum(gen_ai.usage.cost) / 5m
group_by: [service.name, deployment.environment]
threshold: 3 * historical_p99
window: 5m
severity: page
A third worth adding once you have baseline data: tool error rate above 5% sustained for 10 minutes, grouped by gen_ai.tool.name. Tools fail silently more often than models do, and a flaky tool inflates token cost because the agent retries.
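Sketched in the same pseudo-config style as the two alerts above:
# pseudo-config -- exact syntax depends on your alerting tool
name: agent-tool-error-rate
metric: count(claude_code.tool where status=error) / count(claude_code.tool)
group_by: [gen_ai.tool.name, service.name]
threshold: 0.05
window: 10m
severity: ticket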
Per OneUptime's 2026 analysis, agents are now the largest source of unplanned LLM spend in production. The cost alert above is non-optional.