Langfuse Observability
Overview
Langfuse is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (deprecated in OrchestKit), Langfuse can be self-hosted for free and is designed for production LLM applications.
When to use this skill:
- Setting up LLM observability from scratch
- Debugging slow or incorrect LLM responses
- Tracking token usage and costs
- Managing prompts in production
- Evaluating LLM output quality
- Migrating from LangSmith to Langfuse
OrchestKit Integration:
- Status: Migrated from LangSmith (Dec 2025)
- Location: backend/app/shared/services/langfuse/
- MCP Server: orchestkit-langfuse (optional)
Quick Start
Setup
# backend/app/shared/services/langfuse/client.py
from langfuse import Langfuse
from app.core.config import settings

# Shared client used by the rest of the backend
langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST,  # Self-hosted or cloud
)
Basic Tracing with @observe
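# Import path for the Langfuse Python SDK v2 decorator API
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str):
    # Attach metadata to the observation created by @observe
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    # llm is the application's async LLM client, defined elsewhere
    return await llm.generate(content)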
Session & User Tracking
# langfuse_client is the client created in Setup
langfuse_client.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"],
)
Core Features Summary
| Feature | Description | Reference |
|---|---|---|
| Distributed Tracing | Track LLM calls with parent-child spans | references/tracing-setup.md |
| Cost Tracking | Automatic token & cost calculation | references/cost-tracking.md |
| Prompt Management | Version control for prompts | references/prompt-management.md |
| LLM Evaluation | Custom scoring with G-Eval | references/evaluation-scores.md |
| Session Tracking | Group related traces | references/session-tracking.md |
| Experiments API | A/B testing & benchmarks | references/experiments-api.md |
| Multi-Judge Eval | Ensemble LLM evaluation | references/multi-judge-evaluation.md |
References
Tracing Setup
See: references/tracing-setup.md
Key topics covered:
- Initializing Langfuse client with @observe decorator
- Creating nested traces and spans
- Tracking LLM generations with metadata
- LangChain/LangGraph CallbackHandler integration
- Workflow integration patterns
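To make the parent-child idea concrete, here is a toy sketch of how a decorator can record nested spans. It is plain Python and does not use the Langfuse SDK; `traced`, `SPANS`, `retrieve`, and `answer` are hypothetical names chosen for illustration, and the real @observe decorator records far more (timings, inputs, outputs).

```python
import contextvars
import functools

# Name of the span currently executing, or None at the top level.
_current_span = contextvars.ContextVar("current_span", default=None)
SPANS = []  # collected span records, in start order


def traced(func):
    """Toy stand-in for @observe: records each call and its parent span."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span = {"name": func.__name__, "parent": _current_span.get()}
        SPANS.append(span)
        token = _current_span.set(func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            # Restore the previous span so siblings get the right parent.
            _current_span.reset(token)
    return wrapper


@traced
def retrieve(query):
    return [f"doc about {query}"]


@traced
def answer(query):
    docs = retrieve(query)  # recorded as a child of "answer"
    return f"{len(docs)} document(s) used"


result = answer("tracing")
```

After the call, `SPANS` holds one root span for `answer` and one child span for `retrieve` whose parent is `answer`.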
Cost Tracking
See: references/cost-tracking.md
Key topics covered:
- Automatic cost calculation from token usage
- Custom model pricing configuration
- Monitoring dashboard SQL queries
- Cost tracking per analysis/user
- Daily cost trend analysis
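The arithmetic behind cost calculation from token usage can be sketched without the SDK. This is a minimal illustration, assuming a hypothetical per-1,000-token price table; the model names and prices are placeholders rather than real rates, and `estimate_cost` is not a Langfuse function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Price in USD per 1,000 tokens (placeholder values, not real rates)."""
    input_per_1k: float
    output_per_1k: float


# Hypothetical pricing table; substitute the rates configured for your models.
PRICES = {
    "example-small": ModelPrice(input_per_1k=0.50, output_per_1k=1.50),
    "example-large": ModelPrice(input_per_1k=5.00, output_per_1k=15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one generation."""
    price = PRICES[model]
    return (
        input_tokens / 1000 * price.input_per_1k
        + output_tokens / 1000 * price.output_per_1k
    )
```

Input and output tokens are priced separately because providers typically charge different rates for each.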
Prompt Management
See: references/prompt-management.md
Key topics covered:
- Prompt versioning and labels (production/staging/draft)
- Template variables with Jinja2 syntax
- A/B testing prompt versions
- OrchestKit 4-level caching architecture (L1-L4)
- Linking prompts to generation spans
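Label-based version selection and variable substitution can be sketched in plain Python. This is an illustration only: the in-memory `PROMPTS` registry and the `get_prompt` and `compile_prompt` helpers are hypothetical and do not call the Langfuse SDK, and only simple double-brace placeholders are handled, not full Jinja2.

```python
import re

# Hypothetical in-memory registry: prompt name -> list of versions.
PROMPTS = {
    "summarize": [
        {"version": 1, "labels": ["staging"],
         "text": "Summarize: {{content}}"},
        {"version": 2, "labels": ["production"],
         "text": "Summarize {{content}} in {{style}} style."},
    ]
}

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def get_prompt(name: str, label: str = "production") -> dict:
    """Return the newest version of `name` that carries `label`."""
    candidates = [p for p in PROMPTS[name] if label in p["labels"]]
    if not candidates:
        raise LookupError(f"no version of {name!r} has label {label!r}")
    return max(candidates, key=lambda p: p["version"])


def compile_prompt(prompt: dict, **variables: str) -> str:
    """Substitute {{name}} placeholders; a missing variable raises KeyError."""
    return _PLACEHOLDER.sub(
        lambda m: str(variables[m.group(1)]), prompt["text"]
    )
```

Resolving by label rather than by version number lets a new version be promoted to production without changing application code.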
LLM Evaluation
See: references/evaluation-scores.md
Key topics covered:
- Custom scoring with numeric/categorical values
- G-Eval automated quality assessment
- Score trends and comparisons
- Filtering traces by score thresholds
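Threshold filtering and score averaging reduce to simple operations over scored traces. The sketch below assumes a hypothetical record shape (a trace id plus named numeric scores between 0 and 1); it works on local data and does not query Langfuse.

```python
from statistics import mean

# Hypothetical trace records with named scores in [0, 1].
traces = [
    {"id": "t1", "scores": {"relevance": 0.9, "coherence": 0.8}},
    {"id": "t2", "scores": {"relevance": 0.4, "coherence": 0.7}},
    {"id": "t3", "scores": {"relevance": 0.6}},
]


def below_threshold(records, score_name, threshold):
    """Return ids of traces that have `score_name` and score below `threshold`."""
    return [
        r["id"]
        for r in records
        if score_name in r["scores"] and r["scores"][score_name] < threshold
    ]


def average_score(records, score_name):
    """Mean of `score_name` over the traces that carry it."""
    return mean(
        r["scores"][score_name] for r in records if score_name in r["scores"]
    )
```

Traces without the requested score are skipped rather than treated as zero, so unscored traces do not drag the average down.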
Session Tracking
See: references/session-tracking.md
Key topics covered:
- Grouping traces by session_id
- Multi-turn conversation tracking
- User and metadata analytics
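Grouping by session_id can be pictured as a dictionary keyed on the session. This sketch uses hypothetical trace records held in memory; the field names mirror the trace parameters shown in Quick Start, but nothing here calls the Langfuse API.

```python
from collections import defaultdict

# Hypothetical trace records; in practice these would come from Langfuse.
traces = [
    {"id": "t1", "session_id": "session_abc", "user_id": "user_123"},
    {"id": "t2", "session_id": "session_abc", "user_id": "user_123"},
    {"id": "t3", "session_id": "session_xyz", "user_id": "user_456"},
]


def group_by_session(records):
    """Map each session_id to the list of traces recorded under it."""
    sessions = defaultdict(list)
    for record in records:
        sessions[record["session_id"]].append(record)
    return dict(sessions)


def turns_per_session(records):
    """Count traces per session, a rough proxy for conversation length."""
    return {sid: len(rs) for sid, rs in group_by_session(records).items()}
```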
Experiments API
See: references/experiments-api.md
Key topics covered:
- Creating test datasets in Langfuse
- Running automated evaluations
- Regression testing for LLMs
- Benchmarking prompt versions
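The shape of a dataset-driven regression test can be sketched as follows. Everything here is illustrative: the dataset is a local list, `fake_model` stands in for a real prompt and model call, exact match is only one possible scoring rule, and none of these helpers are part of the Langfuse Experiments API.

```python
from typing import Callable

# Hypothetical dataset of inputs with expected outputs.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "3*3", "expected": "9"},
    {"input": "10-7", "expected": "3"},
]


def run_experiment(task: Callable[[str], str], items: list) -> float:
    """Run `task` on every item and return the exact-match accuracy."""
    hits = sum(
        1 for item in items if task(item["input"]).strip() == item["expected"]
    )
    return hits / len(items)


def check_regression(candidate: float, baseline: float,
                     tolerance: float = 0.02) -> bool:
    """True if the candidate is at most `tolerance` below the baseline."""
    return candidate >= baseline - tolerance


def fake_model(question: str) -> str:
    """Stand-in for an LLM call; deliberately gets one answer wrong."""
    answers = {"2+2": "4", "3*3": "9", "10-7": "4"}
    return answers[question]


accuracy = run_experiment(fake_model, dataset)
```

Comparing against a stored baseline with a small tolerance keeps the check from failing on noise while still catching real quality drops.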
Multi-Judge Evaluation
See: references/multi-judge-evaluation.md
Key topics covered:
- Multiple LLM judges for quality assessment
- Weighted scoring across judges
- OrchestKit langfuse_evaluators.py integration
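Weighted scoring across judges amounts to a weighted mean. The sketch below is an assumption about how such an aggregate could be computed, not a description of langfuse_evaluators.py; the judge names and weights are hypothetical.

```python
def weighted_score(judge_scores: dict, weights: dict) -> float:
    """Weighted mean of judge scores, normalized over the judges present."""
    total_weight = sum(weights[judge] for judge in judge_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        score * weights[judge] for judge, score in judge_scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical judges: judge_a counts twice as much as the others.
scores = {"judge_a": 0.8, "judge_b": 0.6, "judge_c": 1.0}
weights = {"judge_a": 2.0, "judge_b": 1.0, "judge_c": 1.0}
combined = weighted_score(scores, weights)
```

Normalizing over the judges that actually returned a score keeps the result comparable when one judge fails or is skipped.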
Best Practices
- Always use the @observe decorator for automatic tracing
- Set user_id and session_id for better analytics
- Add meaningful metadata (content_type, analysis_id, etc.)
- Score all production traces for quality monitoring
- Use prompt management instead of hardcoded prompts
- Monitor costs daily to catch spikes early
- Create datasets for regression testing
- Tag production vs staging traces
LangSmith Migration Notes
Key Differences:
| Aspect | Langfuse | LangSmith |
|---|---|---|
| Hosting | Self-hostable, open-source | Managed cloud (self-hosting on enterprise plans), proprietary |
| Cost | Free when self-hosted | Paid plans (limited free tier) |
| Prompts | Built-in management | External storage needed |
| Decorator | @observe | @traceable |
Related Skills
- observability-monitoring - General observability patterns for metrics, logging, and alerting
- llm-evaluation - Evaluation patterns that integrate with Langfuse scoring
- llm-streaming - Streaming response patterns with trace instrumentation
- prompt-caching - Caching strategies that reduce costs tracked by Langfuse
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Observability platform | Langfuse (not LangSmith) | Open-source, self-hosted, free, built-in prompt management |
| Tracing approach | @observe decorator | Automatic, low-overhead instrumentation |
| Cost tracking | Automatic token counting | Built-in model pricing with custom overrides |
| Prompt management | Langfuse native | Version control, A/B testing, labels in one place |
Capability Details
distributed-tracing
Keywords: trace, tracing, observability, span, nested, parent-child, observe
Solves:
- How do I trace LLM calls across my application?
- How do I debug slow LLM responses?
- Track execution flow in multi-agent workflows
- Create nested trace spans
cost-tracking
Keywords: cost, token usage, pricing, budget, spend, expense
Solves:
- How do I track LLM costs?
- Calculate token usage and pricing
- Monitor AI budget and spending
- Track cost per user or session
prompt-management
Keywords: prompt version, prompt template, prompt control, prompt registry
Solves:
- How do I version control prompts?
- Manage prompts in production
- A/B test different prompt versions
- Link prompts to traces
llm-evaluation
Keywords: score, quality, evaluation, rating, assessment, g-eval
Solves:
- How do I evaluate LLM output quality?
- Score responses with custom metrics
- Track quality trends over time
- Compare prompt versions by quality
session-tracking
Keywords: session, user tracking, conversation, group traces
Solves:
- How do I group related traces?
- Track multi-turn conversations
- Monitor per-user performance
- Organize traces by session
langchain-integration
Keywords: langchain, callback, handler, langgraph integration
Solves:
- How do I integrate Langfuse with LangChain?
- Use CallbackHandler for tracing
- Automatic LangGraph workflow tracing
- LangChain observability setup
datasets-evaluation
Keywords: dataset, test set, evaluation dataset, benchmark
Solves:
- How do I create test datasets in Langfuse?
- Run automated evaluations
- Regression testing for LLMs
- Benchmark prompt versions
ab-testing
Keywords: a/b test, experiment, compare prompts, variant testing
Solves:
- How do I A/B test prompts?
- Compare two prompt versions
- Experimental prompt evaluation
- Statistical prompt testing
monitoring-dashboard
Keywords: dashboard, analytics, metrics, monitoring, queries
Solves:
- What are the most expensive traces?
- Average cost by agent type
- Quality score trends
- Custom monitoring queries
orchestkit-integration
Keywords: orchestkit, migration, setup, workflow integration
Solves:
- How does OrchestKit use Langfuse?
- Migrate from LangSmith to Langfuse
- OrchestKit workflow tracing patterns
- Cost tracking per analysis
multi-judge-evaluation
Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring
Solves:
- How do I use multiple LLM judges to evaluate quality?
- Set up G-Eval criteria evaluation
- Configure weighted scoring across judges
- Wire OrchestKit's existing langfuse_evaluators.py
experiments-api
Keywords: experiment, dataset, benchmark, regression test, prompt testing
Solves:
- How do I run experiments across datasets?
- A/B test models and prompts systematically
- Track quality regression over time
- Compare experiment results