Tools for evaluating functions and models on datasets. Includes evaluators, scoring utilities, and dataset management.
Results container for experiment data with stats and examples.
Represents the results of an evaluate_comparative() call.
Evaluation result.
Batch evaluation results.
Evaluator interface class.
A dynamic evaluator that wraps a function and transforms it into a RunEvaluator.
Feedback scores for the results of comparative evaluations.
Compare predictions (as traces) from 2 or more runs.
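The idea behind comparing predictions from two or more runs can be sketched in plain Python: given each run's outputs on the same examples, a comparison function decides which run did better per example. All names below (`compare_runs`, `prefer`, `closer_length`) are illustrative assumptions, not the library's actual API.

```python
# Illustrative sketch only: score two runs' predictions against a reference,
# example by example. Returns 1 where run A is preferred, 0 where run B is.
def compare_runs(outputs_a, outputs_b, reference, prefer):
    scores = []
    for a, b, ref in zip(outputs_a, outputs_b, reference):
        scores.append(1 if prefer(a, b, ref) else 0)
    return scores

# Toy preference: prefer the prediction whose length is closer to the reference.
def closer_length(a, b, ref):
    return abs(len(a) - len(ref)) <= abs(len(b) - len(ref))

outputs_a = ["ok", "a long detailed reply"]
outputs_b = ["short reply", "ok"]
reference = ["short answer", "a long detailed answer"]

print(compare_runs(outputs_a, outputs_b, reference, closer_length))
# → [0, 1]
```

A real comparative evaluator would typically emit feedback scores per run rather than a bare list, but the per-example pairwise judgment is the core of the pattern.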
Grades the run's string input, output, and optional answer.
A class for building LLM-as-a-judge evaluators.
Evaluate a target system on a given dataset.
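What "evaluate a target system on a dataset" amounts to can be shown with a minimal sketch: run the target on each example's inputs, then apply each evaluator to the inputs, output, and reference, collecting scores. The function signature and dataset field names (`inputs`, `reference`) here are assumptions for illustration, not the actual API.

```python
# Minimal sketch of dataset evaluation, assuming examples are dicts with
# "inputs" and an optional "reference" key.
def evaluate(target, dataset, evaluators):
    results = []
    for example in dataset:
        output = target(example["inputs"])  # run the system under test
        scores = {
            ev.__name__: ev(example["inputs"], output, example.get("reference"))
            for ev in evaluators
        }
        results.append({"inputs": example["inputs"], "output": output, "scores": scores})
    return results

# Toy target and evaluator for demonstration.
def target(inputs):
    return inputs["question"].upper()

def exact_match(inputs, output, reference):
    return 1.0 if output == reference else 0.0

dataset = [
    {"inputs": {"question": "hi"}, "reference": "HI"},
    {"inputs": {"question": "bye"}, "reference": "later"},
]

results = evaluate(target, dataset, [exact_match])
print([r["scores"]["exact_match"] for r in results])
# → [1.0, 0.0]
```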
Evaluate existing experiment runs.
Evaluate existing experiment runs against each other.
Evaluate an async target system on a given dataset.
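The async variant follows the same per-example loop as the synchronous case, except the target is awaited and examples can run concurrently. This is a sketch under assumed names (`aevaluate`, `atarget`), not the library's actual signatures.

```python
import asyncio

# Illustrative async evaluation loop: examples fan out concurrently via gather.
async def aevaluate(atarget, dataset, evaluators):
    async def run_one(example):
        output = await atarget(example["inputs"])
        scores = {ev.__name__: ev(output, example.get("reference")) for ev in evaluators}
        return {"output": output, "scores": scores}
    return await asyncio.gather(*(run_one(ex) for ex in dataset))

async def atarget(inputs):
    await asyncio.sleep(0)  # stand-in for an async model or API call
    return inputs["text"][::-1]

def exact_match(output, reference):
    return 1.0 if output == reference else 0.0

dataset = [{"inputs": {"text": "abc"}, "reference": "cba"}]
results = asyncio.run(aevaluate(atarget, dataset, [exact_match]))
print(results[0]["scores"]["exact_match"])
# → 1.0
```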
Evaluate existing experiment runs asynchronously.
Create a run evaluator from a function.
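Creating a run evaluator from a function is essentially a wrapper (often used as a decorator) that gives a plain scoring function a uniform evaluator interface. The class and method names below are illustrative, not the library's actual interface.

```python
# Illustrative sketch: wrap a scoring function so it exposes a standard
# evaluate_run method and a feedback key derived from the function name.
class FunctionRunEvaluator:
    def __init__(self, fn):
        self.fn = fn
        self.key = fn.__name__

    def evaluate_run(self, run_output, reference):
        score = self.fn(run_output, reference)
        return {"key": self.key, "score": score}

def run_evaluator(fn):
    """Decorator form: placing @run_evaluator above a function yields an evaluator."""
    return FunctionRunEvaluator(fn)

@run_evaluator
def contains_reference(run_output, reference):
    return 1.0 if reference in run_output else 0.0

print(contains_reference.evaluate_run("the cat sat", "cat"))
# → {'key': 'contains_reference', 'score': 1.0}
```

The decorator form keeps user code declarative: authors write ordinary functions, and the wrapper supplies the interface the evaluation loop expects.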
Create a comparison evaluator from a function.