$ ai-evals

MLflow

Open-source MLOps standard with LLM tracing, evaluation, and prompt management bolted on top.

Score: 6.6
Tags: MLOps, observability, LLM evals, open-source
Site: mlflow.org

Verdict

The right pick if your org already runs MLflow for ML: Databricks has invested seriously in LLM features, and adding a separate platform on top of an existing MLflow deployment is usually overkill. Outside that audience, the LLM-specialist tools deliver more LLM-specific value with a gentler learning curve.

What it is

MLflow is the de facto open-source standard for ML experiment tracking, model registry, and deployment — used by tens of thousands of teams. Databricks has extended it heavily into LLM territory: auto-tracing across major frameworks, LLM-as-judge built in, prompt versioning, and evaluation runs that integrate with the existing MLflow run/artifact model.

Free under Apache 2.0. Databricks-hosted MLflow comes with the Databricks platform.

Where it shines

  • Footprint. If you are in the target audience, MLflow is already running in your environment. That's not a feature, but it is the strongest argument for using it.
  • Auto-tracing. Turn on autologging for a supported framework with one line, or wrap your own functions with one decorator, and get spans automatically. The list of supported frameworks has grown faster than expected.
  • Databricks integration. If your data and ML lifecycle live in Databricks, MLflow extending into LLMs lets you keep that consolidation.
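
To make the auto-tracing point concrete, here is a rough sketch of what a tracing decorator does in principle: it records a span (name, inputs, output, error, duration) around each call. This is a standard-library illustration only, not MLflow's implementation. The `Span` class, the `SPANS` list, and `fake_llm_call` are hypothetical names invented for the example. In MLflow itself the equivalent entry points are the `mlflow.trace` decorator and the per-framework `autolog()` calls, which send spans to the tracking server rather than to an in-memory list.

```python
import functools
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Span:
    """Hypothetical span record: one entry per traced call."""
    name: str
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


# Hypothetical in-memory sink; a real tracer would ship spans to a backend.
SPANS = []


def trace(fn):
    """Record a Span for every call to fn, whether it returns or raises."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = Span(name=fn.__name__, inputs={"args": args, "kwargs": kwargs})
        start = time.perf_counter()
        try:
            span.output = fn(*args, **kwargs)
            return span.output
        except Exception as exc:
            span.error = repr(exc)
            raise
        finally:
            # Runs on both success and failure, so failed calls are traced too.
            span.duration_s = time.perf_counter() - start
            SPANS.append(span)
    return wrapper


@trace
def fake_llm_call(prompt: str) -> str:
    # Stand-in for a real model call (assumption for illustration).
    return prompt.upper()
```

Calling `fake_llm_call("hello")` appends one `Span` to `SPANS` without any change to the function body. That is the appeal of the decorator approach: instrumentation stays out of application code.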

Where it falls short

  • Origin shows. The data model is "runs and artifacts," which fits ML training elegantly and LLM workflows awkwardly. Prompt management and dataset workflows feel grafted on.
  • Setup overhead. Getting from "just install it" to a first prompt evaluation takes more steps than signing up for Braintrust's free tier.
  • Cross-functional gap. PMs do not open MLflow. The eval-first platforms have invested far more in non-engineer UX.

Bottom line

If your team already runs MLflow, extending it for LLM observability is the path of least resistance and a defensible choice. If you're starting fresh and your scope is LLM-only, the specialists (Braintrust, Langfuse, Opik) will deliver more value per dollar of attention.
