ai-evals.tools

Verdict

The right pick if your org already runs MLflow for ML: Databricks has invested seriously in LLM features, and adding a separate platform on top of an existing MLflow deployment is usually overkill. Outside that audience, the LLM-specialist tools deliver more LLM-specific value with a gentler learning curve.

What it is

MLflow is the de facto open-source standard for ML experiment tracking, model registry, and deployment — used by tens of thousands of teams. Databricks has extended it heavily into LLM territory: auto-tracing across major frameworks, LLM-as-judge built in, prompt versioning, and evaluation runs that integrate with the existing MLflow run/artifact model.

Free under Apache 2.0. Databricks-hosted MLflow comes with the Databricks platform.

Where it shines

Footprint. If you fit the profile above, MLflow is already running in your environment. That's not a feature, but it is the strongest argument for using it.
Auto-tracing. Turn on autologging for a supported framework with one call, or wrap your own functions with the mlflow.trace decorator, and spans are captured automatically. The list of supported frameworks has grown faster than expected.
Databricks integration. If your data and ML lifecycle live in Databricks, MLflow extending into LLMs lets you keep that consolidation.
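
To make the auto-tracing point concrete, here is a toy sketch of what decorator-based tracing does under the hood. It is not MLflow's implementation: the Span class, the SPANS list, and this trace function are hypothetical stand-ins written with the standard library only. In MLflow itself the analogous entry point is the mlflow.trace decorator, and its spans go to the tracking backend rather than an in-memory list.

```python
import functools
import time
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Span:
    """Hypothetical span record: one traced function call."""
    name: str
    inputs: dict
    parent: Optional[str] = None
    output: Any = None
    duration_s: float = 0.0


SPANS = []  # hypothetical in-memory sink; a real tracer ships spans to a backend
_current = ContextVar("current_span", default=None)  # tracks the enclosing span


def trace(fn):
    """Toy tracing decorator: records inputs, output, timing, and parent span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        parent = _current.get()
        span = Span(
            name=fn.__name__,
            inputs={"args": args, "kwargs": kwargs},
            parent=parent.name if parent else None,
        )
        SPANS.append(span)
        token = _current.set(span)  # nested calls will see this span as parent
        start = time.perf_counter()
        try:
            span.output = fn(*args, **kwargs)
            return span.output
        finally:
            span.duration_s = time.perf_counter() - start
            _current.reset(token)
    return wrapper


@trace
def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption: no network access here).
    return prompt.upper()


@trace
def answer(question: str) -> str:
    return call_model(f"Q: {question}")


if __name__ == "__main__":
    answer("hi")
    for s in SPANS:
        print(s.name, "parent:", s.parent)
```

The application code stays untouched apart from the decorator line, and the nesting of calls is recovered from context rather than passed around by hand. That is the property that makes a one-decorator setup practical.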

Where it falls short

Origin shows. The data model is "runs and artifacts," which fits ML training elegantly and LLM workflows awkwardly. Prompt management and dataset workflows feel grafted on.
Setup overhead. Getting from install to evaluating your first prompts takes more steps than it does on Braintrust's free tier.
Cross-functional gap. PMs do not open MLflow. The eval-first platforms have invested far more in non-engineer UX.

Bottom line

If your team already runs MLflow, extending it for LLM observability is the path of least resistance and a defensible choice. If you're starting fresh and your scope is LLM-only, the specialists (Braintrust, Langfuse, Opik) will deliver more value per dollar of attention.

Related

Arize AI

Braintrust

Comet (Opik)