The Complete Guide to LLM Production Monitoring: Track AI Agent Costs, Quality & Hallucinations with Langfuse (2026)
Your AI agent is live. Features work, users are growing — then the monthly bill arrives. The $500/month pilot seemed reasonable, but production hit $15,000/month, and you have no idea which feature is burning the budget. Worse, users report "weird" AI responses, but you can't even identify which step caused the problem. This isn't unique to you — it's the wall every developer hits when pushing AI into production.
TL;DR: AI agent bill explosions (5-30x) and untraceable hallucinations are the two biggest production pain points. Langfuse (MIT open-source, free 50K units/month) provides a three-phase solution: cost control, quality tracking, and hallucination detection. Fastest path: AgentGateway zero-code integration in 10 minutes.
Why Is Your AI Bill 30x Higher Than the Pilot?
We run our own AI agent fleet at Shareuhack — events system, memory tracking, session logs, the full stack. From firsthand experience, agent-mode token consumption operates at an entirely different scale from standard chatbots.
The math is straightforward: an agentic task involves multi-turn reasoning, tool calls, and result verification, consuming 5-30x more tokens than a standard chatbot. Add RAG's "context tax" — every query ships with retrieved documents — and token counts inflate rapidly.
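A back-of-the-envelope sketch makes the multiplier concrete. All numbers here are illustrative assumptions (a 500-token base exchange, a 1,500-token RAG context, 5 agent turns), not measured values:

```python
def estimate_tokens(base_tokens: int, turns: int, rag_context_tokens: int) -> int:
    """Rough token estimate: each reasoning turn re-sends the growing
    context, and RAG attaches retrieved documents to every query."""
    total = 0
    context = base_tokens + rag_context_tokens
    for _ in range(turns):
        total += context          # input tokens billed for this turn
        context += base_tokens    # the conversation grows each turn
    return total

# A single-turn chatbot query vs. a 5-turn agentic task with RAG context
chatbot = estimate_tokens(base_tokens=500, turns=1, rag_context_tokens=0)
agent = estimate_tokens(base_tokens=500, turns=5, rag_context_tokens=1500)
print(agent / chatbot)  # 30.0 under these assumptions
```

Tweak the turn count and context size and the multiplier lands anywhere in the 5-30x range, which is exactly why per-feature tracking matters.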
Real cases aren't hard to find: in March 2026, a developer received an $82K Gemini API bill (The Register). The 30x pilot-to-production bill explosion is now an industry norm. Datadog's 2026 report confirms: 5% of LLM call spans have errors, 60% caused by rate limiting — your bill isn't just a token problem, it's also hidden costs from error retries.
The core issue isn't expensive APIs. It's that without per-feature breakdown, you can't answer the most basic question: "Which feature is burning money?"
How LLM Observability Differs from Traditional APM
If you're already using Datadog or New Relic for infrastructure monitoring, you might think "just add some logs." But LLM observability tracks entirely different dimensions:
- APM tracks infrastructure: CPU, memory, response time, error rate
- LLM observability tracks model behavior: token distribution, reasoning quality, hallucination rate, tool selection quality
The core concept is the span — the tracking unit for each "thinking step" of an LLM agent. If you know distributed tracing (Jaeger, Zipkin), it's the same idea: each LLM call, each tool invocation, each retrieval step is a span, and together they form a complete trace.
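The hierarchy can be sketched as a plain data model. This is an illustration of the trace/span relationship only, not the Langfuse SDK's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    kind: str                      # e.g. "llm", "tool", or "retrieval"
    children: list = field(default_factory=list)

@dataclass
class Trace:
    name: str
    spans: list = field(default_factory=list)

    def all_spans(self):
        """Flatten the tree: every thinking step is one span."""
        out = []
        stack = list(self.spans)
        while stack:
            span = stack.pop(0)
            out.append(span)
            stack.extend(span.children)
        return out

# One RAG query = one trace made of several spans
trace = Trace("answer-question", spans=[
    Span("vector-search", "retrieval"),
    Span("generate-answer", "llm", children=[Span("calculator", "tool")]),
])
print([s.name for s in trace.all_spans()])
# ['vector-search', 'generate-answer', 'calculator']
```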
Three dimensions of LLM observability:
- Cost: Which feature costs the most? How many tokens per user?
- Quality: How faithful and relevant are the responses?
- Reliability: Hallucination rate, error rate, latency distribution
Cross-analyzing these three dimensions is what production LLM environments need — not something logs can solve.
Langfuse's Market Position in 2026: Why Now?
On January 16, 2026, ClickHouse acquired Langfuse alongside a $400M Series D raise. This wasn't just a transaction — it reshaped the LLM observability competitive landscape.
Key post-acquisition commitments:
- MIT license remains unchanged: No new pricing gates, no feature lockdowns
- Most generous free tier in the industry: 50,000 units/month, 30-day data retention, 2 users
- ClickHouse analytics engine backing: Massive improvement in large-scale trace query performance
Compare this with LangSmith: its free tier offers only 5,000 traces/month (a tenth of Langfuse's) and 14-day data retention (half of Langfuse's). With 47K+ GitHub stars and visible community migration from LangSmith, now is the lowest-barrier entry point for Langfuse.
Competitor Comparison: Langfuse vs LangSmith vs AgentOps
| Dimension | Langfuse | LangSmith | AgentOps |
|---|---|---|---|
| License | MIT open-source | Commercial (partially open) | Commercial |
| Free tier | 50K units/month | 5K traces/month | Limited free |
| Data retention | 30 days (free) / 90 days (Core) | 14 days (free) | Plan-dependent |
| Framework lock-in | None (OpenAI/Anthropic/any) | LangChain-focused | Agent-focused |
| Self-host | Full support (Docker) | Not supported | Not supported |
| Core strength | Cost tracking + eval + tracing | Deep LangChain integration | Session replay |
Selection guide:
- Deeply invested in LangChain — LangSmith offers the smoothest integration
- Pure agent use case, need session replay — AgentOps is more focused
- Everything else — Langfuse is the safest choice: framework-agnostic, self-hostable, largest free tier, MIT license guarantees fork freedom
10-Minute Setup: Zero-Code vs SDK Path
Path 1: AgentGateway Zero-Code Integration
AgentGateway (released by Solo.io in February 2026) acts as an LLM proxy layer, intercepting all LLM calls and automatically sending them to Langfuse — no application code changes. Ideal for teams that don't want to modify existing code, or no-code/low-code developers.
Path 2: Direct SDK Integration (2-5 Lines)
```python
from langfuse.decorators import observe
from openai import OpenAI

client = OpenAI()

@observe()
def my_llm_function(user_input: str):
    # Your existing LLM call logic stays exactly the same
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content
```
Add the @observe() decorator, and Langfuse automatically tracks token usage, latency, and cost. Set LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY environment variables and you're done.
Path 3: LangChain/LlamaIndex Callback
```python
from langfuse.callback import CallbackHandler

handler = CallbackHandler()

# Add the handler to your existing chain invocation
chain.invoke({"input": "..."}, config={"callbacks": [handler]})
```
After setup, confirm your first trace appears in the Langfuse Dashboard — that's your LLM observability starting point.
Phase 1 — Cost Control: Find Which Feature Is Burning Money
This is the highest priority phase. You need to answer one question: which feature costs the most?
Set Up Semantic Traces
```python
from langfuse.decorators import observe, langfuse_context

@observe()
def generate_summary(user_id: str, document: str):
    # Attach semantic metadata to the trace created by @observe()
    langfuse_context.update_current_trace(
        name="summary-generation",
        user_id=user_id,
        session_id=f"session-{user_id}",
        metadata={"feature": "summarize", "doc_length": len(document)},
    )
    # ... LLM call
```
The key fields are user_id, session_id, and metadata — give each trace semantic meaning instead of anonymous API call records.
Set Up Cost Alerts
Best practice: investigate when week-over-week growth exceeds 20%. No need for a sophisticated alerting system — a weekly cost report is enough.
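The weekly check is simple enough to express as a few lines. How you obtain the two weekly totals is up to you (dashboard export, API, a spreadsheet); the 20% threshold matches the best practice above:

```python
def weekly_cost_alert(last_week_usd: float, this_week_usd: float,
                      threshold: float = 0.20) -> bool:
    """Return True when week-over-week cost growth exceeds the threshold."""
    if last_week_usd <= 0:
        return this_week_usd > 0  # any spend from zero is worth a look
    growth = (this_week_usd - last_week_usd) / last_week_usd
    return growth > threshold

print(weekly_cost_alert(100.0, 115.0))  # False: +15% is under the threshold
print(weekly_cost_alert(100.0, 135.0))  # True: +35% warrants investigation
```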
Real example: we discovered a "rewriting" feature consuming 8x the output tokens of other features — because the prompt requested "complete rewrite" instead of "suggested edits." Adjusting the prompt reduced that feature's cost by 60%.
Key insight: Output tokens are typically the main bill driver (3-4x more expensive than input). Check output token distribution first, then decide where to optimize.
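A quick cost breakdown shows why output tokens dominate. The per-token prices here are hypothetical placeholders at a 4x output/input ratio; substitute your provider's actual rates:

```python
INPUT_PRICE_PER_1K = 0.0025   # assumed USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.01    # assumed 4x the input price

def call_cost(input_tokens: int, output_tokens: int) -> dict:
    """Split a call's cost into input vs. output contributions."""
    input_cost = input_tokens / 1000 * INPUT_PRICE_PER_1K
    output_cost = output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    total = input_cost + output_cost
    return {
        "input_usd": round(input_cost, 4),
        "output_usd": round(output_cost, 4),
        "output_share": round(output_cost / total, 2),
    }

# A "complete rewrite" prompt that emits as many tokens as it reads in:
print(call_cost(input_tokens=2000, output_tokens=2000))
# output tokens are half the volume but 80% of the cost
```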
Phase 2 — Quality Tracking: LLM-as-Judge Automated Scoring
Manual spot-checking covers less than 1% of samples — meaningless for production. You need automated quality assessment.
LLM-as-Judge Setup
- Define scoring rubrics: faithfulness, relevance, completeness, each scored 0-1
- Choose the judge model: Use a cheaper model (e.g., GPT-4o-mini) as judge — costs only 1/10 of the evaluated call
- Run batch evals: Langfuse's Datasets feature lets you build golden datasets with baseline answers
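The judge loop can be sketched as prompt-building plus score-parsing. The rubric wording is an assumption, and the actual judge call (e.g. GPT-4o-mini via the OpenAI SDK) is left out so the parsing logic is the testable part:

```python
JUDGE_PROMPT = """Rate the answer on {criterion} from 0.0 to 1.0.
Question: {question}
Context: {context}
Answer: {answer}
Reply with only the number."""

def build_judge_prompt(criterion: str, question: str,
                       context: str, answer: str) -> str:
    return JUDGE_PROMPT.format(criterion=criterion, question=question,
                               context=context, answer=answer)

def parse_judge_score(raw: str) -> float:
    """Clamp the judge model's reply into the [0, 1] range."""
    try:
        score = float(raw.strip())
    except ValueError:
        return 0.0  # an unparseable reply counts as a failed eval
    return min(max(score, 0.0), 1.0)

print(parse_judge_score("0.85"))   # 0.85
print(parse_judge_score("1.7"))    # 1.0 (clamped)
print(parse_judge_score("n/a"))    # 0.0 (unparseable)
```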
Key Metrics
- Faithfulness: Is the response grounded in the provided context?
- Relevance: Does the response directly address the user's question?
- Tool Selection Quality: Did the agent pick the right tool?
Set Quality Gates
Auto-flag traces with eval scores below 0.7 for human review. Not perfect, but it focuses human attention from "random sampling" to "most problematic traces."
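The gate itself is a one-line filter. Trace records here are plain dicts for illustration; adapt the field names to whatever your export or API response actually contains:

```python
QUALITY_THRESHOLD = 0.7

def flag_for_review(traces: list[dict], threshold: float = QUALITY_THRESHOLD):
    """Return the traces a human should look at, worst first."""
    flagged = [t for t in traces if t["eval_score"] < threshold]
    return sorted(flagged, key=lambda t: t["eval_score"])

traces = [
    {"id": "t1", "eval_score": 0.92},
    {"id": "t2", "eval_score": 0.41},
    {"id": "t3", "eval_score": 0.65},
]
print([t["id"] for t in flag_for_review(traces)])  # ['t2', 't3']
```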
Langfuse Datasets also enable regression testing: run the golden dataset eval before every prompt change to ensure quality hasn't degraded.
Phase 3 — Hallucination Detection: Pinpoint Issues with Span Tracing
Hallucinations are the hardest production issue because they don't throw errors — the system appears to work normally, but the output is wrong.
Span-Level Hallucination Analysis
A RAG query trace contains three span layers:
- Retrieval span: Fetching documents from the vector database
- Generation span: LLM generating a response based on retrieved documents
- Post-processing span: Formatting, safety filtering
Hallucinations can occur at any layer. You need to know which one.
Two Diagnostic Patterns
Based on Datadog LLM Observability's practical experience, two clear diagnostic patterns emerge:
- Latency increasing + grounding score dropping = retrieval degradation. Usually chunk size configuration issues, embedding model changes, or stale indexes. Fix: adjust retrieval parameters.
- Latency stable + hallucination rate rising = prompt or model change. Usually model updates changing behavior, or prompt drift. Fix: pin model version, rollback prompt.
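The two patterns above can be encoded as a tiny decision helper. The trend inputs ("up" / "down" / "stable") and the suggested fixes are simplifications of these heuristics, not a Langfuse feature:

```python
def diagnose(latency_trend: str, grounding_trend: str,
             hallucination_trend: str) -> str:
    """Map observed trends to the most likely failure layer."""
    if latency_trend == "up" and grounding_trend == "down":
        return "retrieval degradation: check chunk size, embeddings, index"
    if latency_trend == "stable" and hallucination_trend == "up":
        return "prompt/model change: pin model version, roll back prompt"
    return "no clear pattern: keep collecting span-level scores"

print(diagnose("up", "down", "stable"))
print(diagnose("stable", "stable", "up"))
```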
Use Langfuse's Scores feature to tag each span's hallucination score, then track trends in the Dashboard — moving from "hallucinations seem worse lately" to "retrieval span grounding score dropped from 0.85 to 0.6 after last week's index update."
Self-Host vs Langfuse Cloud: Which Scenario Fits?
Choose Cloud When
- Team of fewer than 5, don't want to maintain infrastructure
- Monthly usage under 100K units
- Want the latest features first
Cloud pricing: Hobby free (50K units), Core $29/month (100K units, 90-day retention, unlimited users).
Choose Self-Host When
- Data compliance requirements (GDPR, local privacy laws)
- Need more than 30 days of data retention
- Monthly usage exceeds 100K units and you want cost control
Self-hosting requires Docker + PostgreSQL. A small VPS ($10-20/month) handles small-scale deployments — cheaper than Cloud Core. Post-acquisition, self-hosted query performance also benefits from the ClickHouse engine.
Recommendation for indie makers: Start with Cloud's free tier to validate observability value. When monthly usage exceeds 50K units, compare Core at $29/month vs self-hosted VPS cost, and pick whichever is more economical.
Risks and Practical Considerations
Post-Acquisition Dependency Risk
MIT licensing ensures you can always fork, but Langfuse's product direction will be influenced by ClickHouse's decisions. If you depend heavily on Langfuse Cloud, consider periodically exporting trace data. Self-host users face the lowest risk.
Observability Overhead
Each trace adds a small amount of latency (typically under 5ms), which can matter under strict P99 requirements. The Langfuse SDK sends trace data asynchronously in the background by default, so the overhead stays off the request path.
Data Security
Langfuse Cloud data is stored in Europe (AWS eu-west-1), GDPR-compliant. If your user data has local privacy law compliance requirements, self-hosting is the safer option.
Learning Curve
Span tracing requires understanding distributed tracing concepts. If your team lacks this background, start with Phase 1 (cost tracking) — don't jump to Phase 3.
Conclusion: You're Not Just Running an AI Product
The gap between "AI features work" and "AI product is operable" is observability. An AI product without monitoring is like a car without a dashboard — it drives, but you don't know how much fuel is left or whether the engine temperature is normal.
Three things to do today:
- Sign up for a free Langfuse Cloud account (or integrate with the @observe() decorator)
- Find your 3 most expensive features (Phase 1 cost control)
- Set up a >20% week-over-week cost alert
If you're still evaluating AI API selection and cost control, check out our 2026 AI API Cost Comparison Guide.
FAQ
Is Langfuse's 50K units/month the same as 50K API calls?
Not exactly. A Langfuse unit is an observation (a span or a single LLM call), not an API request. A simple LLM call = 1 unit; a RAG pipeline with retrieval + reranking + generation = 3-5 units. So 50K units translates to roughly 25,000 simple calls, or 8,000-10,000 RAG queries. Tracking pauses when you hit the cap — no overage charges.
How do I track per-user token consumption for user-level billing breakdown?
Pass a user_id parameter when creating traces. Langfuse's Dashboard has built-in user-level cost aggregation — you can see each user's token consumption and cost distribution in the Analytics page without any extra code. Add session_id to track the full cost of a single conversation.
I'm already using LangChain. Do I need to rewrite code to switch to Langfuse?
No. Langfuse provides a LangChain callback handler — just add one line of callback configuration to start tracing. If you also have non-LangChain LLM calls, use the @observe() decorator. Both approaches coexist without modifying your existing chain logic.
Has the ClickHouse acquisition affected Langfuse's MIT license? Is self-hosting still safe?
ClickHouse explicitly committed after the January 2026 acquisition: MIT license stays, no new pricing gates, no feature lockdowns. Self-hosted deployments remain fully functional, and large-scale query performance has actually improved with ClickHouse's analytics engine. MIT licensing guarantees you can always fork — stronger protection than any commercial-licensed tool.
Can Langfuse track across OpenAI + Anthropic + Gemini simultaneously?
Yes. Langfuse is framework-agnostic with native integrations for OpenAI SDK, Anthropic SDK, and Google GenAI SDK. You can compare cross-provider cost/latency/quality in a single Dashboard — seeing things like 'Claude costs $0.03 for this task at 0.9 quality, GPT-4o costs $0.05 at 0.85 quality.' This cross-provider view is harder to achieve with ecosystem-bound tools like LangSmith.


