experiments
patronus.experiments
adapters
BaseEvaluatorAdapter
Bases: ABC
Abstract base class for all evaluator adapters.
Evaluator adapters provide a standardized interface between the experiment framework and various types of evaluators (function-based, class-based, etc.).
All concrete adapter implementations must inherit from this class and implement the required abstract methods.
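For orientation, a custom adapter might look like the following minimal sketch. It assumes that `evaluate()` is the abstract method to implement (matching the `evaluate()` protocol documented for the adapters below); the keyword check itself is purely illustrative.

```python
from typing import Any, Optional

from patronus.datasets import Row
from patronus.evals import EvaluationResult
from patronus.experiments.adapters import BaseEvaluatorAdapter
from patronus.experiments.types import EvalParent, TaskResult


class KeywordAdapter(BaseEvaluatorAdapter):
    """Illustrative adapter that checks whether the output contains a keyword."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    async def evaluate(
        self,
        row: Row,
        task_result: Optional[TaskResult],
        parent: EvalParent,
        **kwargs: Any,
    ) -> EvaluationResult:
        # Prefer the task's output when a task ran; otherwise fall back to the dataset row.
        output = task_result.output if task_result else row.task_output
        passed = self.keyword in (output or "")
        return EvaluationResult(pass_=passed, score=int(passed))
```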
EvaluatorAdapter
Bases: BaseEvaluatorAdapter
Adapter for class-based evaluators conforming to the Evaluator or AsyncEvaluator protocol.
This adapter enables the use of evaluator classes that implement either the Evaluator or AsyncEvaluator interface within the experiment framework.
Attributes:

Name | Type | Description |
---|---|---|
evaluator | Union[Evaluator, AsyncEvaluator] | The evaluator instance to adapt. |
Examples:
```python
import typing
from typing import Optional

from patronus import datasets
from patronus.evals import Evaluator, EvaluationResult
from patronus.experiments import run_experiment
from patronus.experiments.adapters import EvaluatorAdapter
from patronus.experiments.types import TaskResult, EvalParent


class MatchEvaluator(Evaluator):
    def __init__(self, sanitizer=None):
        if sanitizer is None:
            sanitizer = lambda x: x
        self.sanitizer = sanitizer

    def evaluate(self, actual: str, expected: str) -> EvaluationResult:
        matched = self.sanitizer(actual) == self.sanitizer(expected)
        return EvaluationResult(pass_=matched, score=int(matched))


exact_match = MatchEvaluator()
fuzzy_match = MatchEvaluator(lambda x: x.strip().lower())


class MatchAdapter(EvaluatorAdapter):
    def __init__(self, evaluator: MatchEvaluator):
        super().__init__(evaluator)

    def transform(
        self,
        row: datasets.Row,
        task_result: Optional[TaskResult],
        parent: EvalParent,
        **kwargs
    ) -> tuple[list[typing.Any], dict[str, typing.Any]]:
        args = [row.task_output, row.gold_answer]
        kwargs = {}
        # Passing arguments via kwargs would also work in this case.
        # kwargs = {"actual": row.task_output, "expected": row.gold_answer}
        return args, kwargs


run_experiment(
    dataset=[{"task_output": "string ", "gold_answer": "string"}],
    evaluators=[MatchAdapter(exact_match), MatchAdapter(fuzzy_match)],
)
```
transform
transform(
row: Row,
task_result: Optional[TaskResult],
parent: EvalParent,
**kwargs: Any,
) -> tuple[list[typing.Any], dict[str, typing.Any]]
Transform experiment framework arguments to evaluation method arguments.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
row | Row | The data row being evaluated. | required |
task_result | Optional[TaskResult] | The result of the task execution, if available. | required |
parent | EvalParent | The parent evaluation context. | required |
**kwargs | Any | Additional keyword arguments from the experiment. | {} |
Returns:

Type | Description |
---|---|
list[Any] | A list of positional arguments to pass to the evaluator function. |
dict[str, Any] | A dictionary of keyword arguments to pass to the evaluator function. |
evaluate
async
evaluate(
row: Row,
task_result: Optional[TaskResult],
parent: EvalParent,
**kwargs: Any,
) -> EvaluationResult
Evaluate the given row and task result using the adapted evaluator function.
This method implements the BaseEvaluatorAdapter.evaluate() protocol.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
row | Row | The data row being evaluated. | required |
task_result | Optional[TaskResult] | The result of the task execution, if available. | required |
parent | EvalParent | The parent evaluation context. | required |
**kwargs | Any | Additional keyword arguments from the experiment. | {} |
Returns:

Type | Description |
---|---|
EvaluationResult | An EvaluationResult containing the evaluation outcome. |
StructuredEvaluatorAdapter
Bases: EvaluatorAdapter
Adapter for structured evaluators.
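A hedged sketch of its intended use, assuming `StructuredEvaluator` is importable from `patronus.evals` and that its `evaluate()` method accepts the standard Patronus record fields (such as `task_output` and `gold_answer`) as keyword arguments:

```python
from typing import Optional

from patronus.evals import EvaluationResult, StructuredEvaluator
from patronus.experiments import run_experiment
from patronus.experiments.adapters import StructuredEvaluatorAdapter


class ExactMatch(StructuredEvaluator):
    # Assumed signature: structured evaluators receive the standard record fields as keyword arguments.
    def evaluate(self, *, task_output: str, gold_answer: Optional[str] = None, **kwargs) -> EvaluationResult:
        matched = task_output == gold_answer
        return EvaluationResult(pass_=matched, score=int(matched))


run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    evaluators=[StructuredEvaluatorAdapter(ExactMatch())],
)
```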
FuncEvaluatorAdapter
Bases: BaseEvaluatorAdapter
Adapter class that allows using function-based evaluators with the experiment framework.
This adapter serves as a bridge between function-based evaluators decorated with @evaluator()
and the experiment framework's evaluation system.
It handles both synchronous and asynchronous evaluator functions.
Attributes:

Name | Type | Description |
---|---|---|
fn | Callable | The evaluator function to be adapted. |
Notes
- The function passed to this adapter must be decorated with @evaluator().
- The adapter automatically handles the conversion between function results and proper evaluation result objects.
Examples:
Direct usage with a compatible evaluator function:
```python
from patronus import evaluator
from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus.datasets import Row


@evaluator()
def exact_match(row: Row, **kwargs):
    return row.task_output == row.gold_answer


run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    evaluators=[FuncEvaluatorAdapter(exact_match)],
)
```
Customized usage by overriding the `transform()` method:
```python
import typing
from typing import Optional

from patronus import evaluator, datasets
from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus.experiments.types import TaskResult, EvalParent


@evaluator()
def exact_match(actual, expected):
    return actual == expected


class AdaptedExactMatch(FuncEvaluatorAdapter):
    def __init__(self):
        super().__init__(exact_match)

    def transform(
        self,
        row: datasets.Row,
        task_result: Optional[TaskResult],
        parent: EvalParent,
        **kwargs
    ) -> tuple[list[typing.Any], dict[str, typing.Any]]:
        args = [row.task_output, row.gold_answer]
        kwargs = {}
        # Alternative: passing arguments via kwargs instead of args
        # args = []
        # kwargs = {"actual": row.task_output, "expected": row.gold_answer}
        return args, kwargs


run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    evaluators=[AdaptedExactMatch()],
)
```
transform
transform(
row: Row,
task_result: Optional[TaskResult],
parent: EvalParent,
**kwargs: Any,
) -> tuple[list[typing.Any], dict[str, typing.Any]]
Transform experiment framework parameters to evaluator function parameters.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
row | Row | The data row being evaluated. | required |
task_result | Optional[TaskResult] | The result of the task execution, if available. | required |
parent | EvalParent | The parent evaluation context. | required |
**kwargs | Any | Additional keyword arguments from the experiment. | {} |
Returns:

Type | Description |
---|---|
list[Any] | A list of positional arguments to pass to the evaluator function. |
dict[str, Any] | A dictionary of keyword arguments to pass to the evaluator function. |
evaluate
async
evaluate(
row: Row,
task_result: Optional[TaskResult],
parent: EvalParent,
**kwargs: Any,
) -> EvaluationResult
Evaluate the given row and task result using the adapted evaluator function.
This method implements the BaseEvaluatorAdapter.evaluate() protocol.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
row | Row | The data row being evaluated. | required |
task_result | Optional[TaskResult] | The result of the task execution, if available. | required |
parent | EvalParent | The parent evaluation context. | required |
**kwargs | Any | Additional keyword arguments from the experiment. | {} |
Returns:

Type | Description |
---|---|
EvaluationResult | An EvaluationResult containing the evaluation outcome. |
experiment
Tags
module-attribute
Tags are key-value pairs applied to experiments, task results and evaluation results.
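For example, tags can be supplied when running an experiment (a sketch reusing the exact_match evaluator pattern from the adapter examples above; the tag values are placeholders):

```python
from patronus import evaluator
from patronus.experiments import FuncEvaluatorAdapter, run_experiment


@evaluator()
def exact_match(row, **kwargs):
    return row.task_output == row.gold_answer


run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    evaluators=[FuncEvaluatorAdapter(exact_match)],
    # These key-value pairs are attached to the experiment and to every evaluation it creates.
    tags={"dataset_version": "v1", "model": "example-model"},
)
```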
Task
module-attribute
Task = Union[
TaskProtocol[Union[TaskResult, str, None]],
TaskProtocol[Awaitable[Union[TaskResult, str, None]]],
]
A function that processes each dataset row and produces output for evaluation.
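A minimal sketch of a task function. Returning None skips the row; returning a plain string is assumed here to stand in for the task output, per the Union[TaskResult, str, None] alias above:

```python
from typing import Optional

from patronus.datasets import Row
from patronus.experiments.types import EvalParent


def uppercase_task(row: Row, parent: EvalParent, tags: dict) -> Optional[str]:
    # Returning None skips processing (and evaluation) for this row.
    if not row.task_input:
        return None
    # A plain string return value is used in place of a full TaskResult.
    return row.task_input.upper()
```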
ExperimentDataset
module-attribute
ExperimentDataset = Union[
Dataset,
DatasetLoader,
list[dict[str, Any]],
tuple[dict[str, Any], ...],
DataFrame,
Awaitable,
Callable[[], Awaitable],
]
Any object that the experiment framework can resolve into a Dataset.
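For instance, the following objects should all resolve into the same dataset (a sketch; pandas is only required for the DataFrame form):

```python
import pandas as pd

# A list of dictionaries...
records = [
    {"task_input": "2 + 2", "gold_answer": "4"},
    {"task_input": "3 + 3", "gold_answer": "6"},
]

# ...an equivalent pandas DataFrame...
frame = pd.DataFrame(records)


# ...or an async callable that produces the records lazily.
async def load_records():
    return records
```

Any of these can be passed as the dataset argument to run_experiment() or Experiment.create().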
TaskProtocol
Bases: Protocol[T]
Defines an interface for a task.
Task is a function that processes each dataset row and produces output for evaluation.
__call__
Processes a dataset row, using the provided context to produce task output.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
row | Row | The dataset row to process. | required |
parent | EvalParent | Reference to the parent task's output and evaluation results. | required |
tags | Tags | Key-value pairs. | required |
Returns:

Type | Description |
---|---|
T | Task output of type T, or None to skip the row processing. |
Example
```python
def simple_task(row: datasets.Row, parent: EvalParent, tags: Tags) -> TaskResult:
    # Process input from the dataset row
    input_text = row.task_input
    # Generate output
    output = f"Processed: {input_text}"
    # Return result
    return TaskResult(
        output=output,
        metadata={"processing_time_ms": 42},
        tags={"model": "example-model"},
    )
```
ChainLink
Bases: TypedDict
Represents a single stage in an experiment's processing chain.
Each ChainLink contains an optional task function that processes dataset rows and a list of evaluators that assess the task's output.
Attributes:

Name | Type | Description |
---|---|---|
task | Optional[Task] | Function that processes a dataset row and produces output. |
evaluators | list[AdaptableEvaluators] | List of evaluators to assess the task's output. |
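A sketch of a two-stage chain. It assumes that plain dictionaries are accepted as ChainLink entries (ChainLink is a TypedDict) and that the default FuncEvaluatorAdapter forwards task_result to the evaluator function as a keyword argument; the task and evaluator bodies are illustrative.

```python
from patronus import evaluator
from patronus.datasets import Row
from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus.experiments.types import EvalParent


def draft(row: Row, parent: EvalParent, tags: dict) -> str:
    return f"Draft answer for: {row.task_input}"


def refine(row: Row, parent: EvalParent, tags: dict) -> str:
    # The previous stage's output is reachable through the parent node.
    previous = parent.task.output if parent and parent.task else ""
    return previous.replace("Draft", "Refined")


@evaluator()
def non_empty(row: Row, task_result=None, **kwargs) -> bool:
    return bool(task_result and task_result.output)


run_experiment(
    dataset=[{"task_input": "What is 2 + 2?"}],
    chain=[
        {"task": draft, "evaluators": [FuncEvaluatorAdapter(non_empty)]},
        {"task": refine, "evaluators": [FuncEvaluatorAdapter(non_empty)]},
    ],
)
```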
Experiment
Experiment(
*,
dataset: Any,
task: Optional[Task] = None,
evaluators: Optional[list[AdaptableEvaluators]] = None,
chain: Optional[list[ChainLink]] = None,
tags: Optional[dict[str, str]] = None,
max_concurrency: int = 10,
project_name: Optional[str] = None,
experiment_name: Optional[str] = None,
service: Optional[str] = None,
api_key: Optional[str] = None,
api_url: Optional[str] = None,
otel_endpoint: Optional[str] = None,
ui_url: Optional[str] = None,
timeout_s: Optional[int] = None,
integrations: Optional[list[Any]] = None,
**kwargs,
)
Manages evaluation experiments across datasets using tasks and evaluators.
An experiment represents a complete evaluation pipeline that processes a dataset using defined tasks, applies evaluators to the outputs, and collects the results. Experiments track progress, create reports, and interface with the Patronus platform.
Create experiment instances using the create() class method or through the run_experiment() convenience function.
create
async
classmethod
create(
dataset: ExperimentDataset,
task: Optional[Task] = None,
evaluators: Optional[list[AdaptableEvaluators]] = None,
chain: Optional[list[ChainLink]] = None,
tags: Optional[Tags] = None,
max_concurrency: int = 10,
project_name: Optional[str] = None,
experiment_name: Optional[str] = None,
service: Optional[str] = None,
api_key: Optional[str] = None,
api_url: Optional[str] = None,
otel_endpoint: Optional[str] = None,
ui_url: Optional[str] = None,
timeout_s: Optional[int] = None,
integrations: Optional[list[Any]] = None,
**kwargs: Any,
) -> te.Self
Asynchronously creates and initializes an experiment instance with the specified parameters, preparing the dataset, task, evaluators, and chain, along with configuration for concurrency, project details, service information, API keys, timeouts, and integrations.
Use run_experiment() for a more convenient interface.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
dataset | ExperimentDataset | The dataset to run evaluations against. | required |
task | Optional[Task] | A function that processes each dataset row and produces output for evaluation. Mutually exclusive with the chain parameter. | None |
evaluators | Optional[list[AdaptableEvaluators]] | A list of evaluators to assess the task output. Mutually exclusive with the chain parameter. | None |
chain | Optional[list[ChainLink]] | A list of processing stages, each containing a task and associated evaluators. Use this for multi-stage evaluation pipelines. | None |
tags | Optional[Tags] | Key-value pairs. All evaluations created by the experiment will contain these tags. | None |
max_concurrency | int | Maximum number of concurrent task and evaluation operations. | 10 |
project_name | Optional[str] | Name of the project to create or use. Falls back to configuration or environment variables if not provided. | None |
experiment_name | Optional[str] | Custom name for this experiment run. A timestamp will be appended. | None |
service | Optional[str] | OpenTelemetry service name for tracing. Falls back to configuration or environment variables if not provided. | None |
api_key | Optional[str] | API key for Patronus services. Falls back to configuration or environment variables if not provided. | None |
api_url | Optional[str] | URL for the Patronus API. Falls back to configuration or environment variables if not provided. | None |
otel_endpoint | Optional[str] | OpenTelemetry collector endpoint. Falls back to configuration or environment variables if not provided. | None |
ui_url | Optional[str] | URL for the Patronus UI. Falls back to configuration or environment variables if not provided. | None |
timeout_s | Optional[int] | Timeout in seconds for API operations. Falls back to configuration or environment variables if not provided. | None |
integrations | Optional[list[Any]] | A list of OpenTelemetry instrumentors for additional tracing capabilities. | None |
**kwargs | Any | Additional keyword arguments passed to the experiment. | {} |
Returns:

Name | Type | Description |
---|---|---|
Experiment | Self | The created and initialized experiment instance. |
run
async
Executes the experiment by processing all dataset items.
Runs the experiment's task chain on each dataset row, applying evaluators to the results and collecting metrics. Progress is displayed with a progress bar and results are logged to the Patronus platform.
Returns:

Type | Description |
---|---|
Self | The experiment instance. |
to_dataframe
Converts experiment results to a pandas DataFrame.
Creates a tabular representation of all evaluation results with dataset identifiers, task information, evaluation scores, and metadata.
Returns:

Type | Description |
---|---|
DataFrame | A pandas DataFrame containing all experiment results. |
to_csv
Saves experiment results to a CSV file.
Converts experiment results to a DataFrame and saves them as a CSV file.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
path_or_buf | Union[str, Path, IO[AnyStr]] | String path or file-like object where the CSV will be saved. | required |
**kwargs | Any | Additional arguments passed to pandas.DataFrame.to_csv(). | {} |
Returns:

Type | Description |
---|---|
Optional[str] | String path if a path was specified and return_path is True, otherwise None. |
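For example, after a run completes (a sketch; `experiment` is the object returned by run_experiment()):

```python
# Inspect results in memory...
df = experiment.to_dataframe()
print(df.head())

# ...or persist them; extra keyword arguments are forwarded to pandas.DataFrame.to_csv().
experiment.to_csv("experiment_results.csv", index=False)
```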
run_experiment
run_experiment(
dataset: ExperimentDataset,
task: Optional[Task] = None,
evaluators: Optional[list[AdaptableEvaluators]] = None,
chain: Optional[list[ChainLink]] = None,
tags: Optional[Tags] = None,
max_concurrency: int = 10,
project_name: Optional[str] = None,
experiment_name: Optional[str] = None,
service: Optional[str] = None,
api_key: Optional[str] = None,
api_url: Optional[str] = None,
otel_endpoint: Optional[str] = None,
ui_url: Optional[str] = None,
timeout_s: Optional[int] = None,
integrations: Optional[list[Any]] = None,
**kwargs,
) -> Union[Experiment, typing.Awaitable[Experiment]]
Create and run an experiment.
This function creates an experiment with the specified configuration and runs it to completion. The execution handling is context-aware:
- When called from an asynchronous context (with a running event loop), it returns an awaitable that must be awaited.
- When called from a synchronous context (no running event loop), it blocks until the experiment completes and returns the Experiment object.
Examples:
Synchronous execution:
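A minimal sketch (using the same placeholders as the asynchronous example below):

```python
experiment = run_experiment(dataset, task=some_task)
# Blocks until the run completes and returns the Experiment object.
```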
Asynchronous execution (e.g., in a Jupyter Notebook):
```python
experiment = await run_experiment(dataset, task=some_task)
# Must be awaited within an async function or event loop.
```
Parameters:
See Experiment.create for the list of arguments.
Returns:

Name | Type | Description |
---|---|---|
Experiment | Experiment | In a synchronous context: the completed Experiment object. |
Experiment | Awaitable[Experiment] | In an asynchronous context: an awaitable that resolves to the Experiment object. |
Notes
For manual control of the event loop, you can create and run the experiment as follows:
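A minimal sketch of that flow, assuming only the create() and run() methods documented above and that Experiment is importable from patronus.experiments (`dataset` and `some_task` are placeholders):

```python
import asyncio

from patronus.experiments import Experiment


async def main():
    experiment = await Experiment.create(dataset=dataset, task=some_task)
    return await experiment.run()


experiment = asyncio.run(main())
```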
types
EvalParent
module-attribute
Type alias representing an optional reference to an evaluation parent, used to track the hierarchy of evaluations and their results
TaskResult
Bases: BaseModel
Represents the result of a task with optional output, metadata, and tags.
This class is used to encapsulate the result of a task, including optional fields for the output of the task, metadata related to the task, and any tags that can provide additional information or context about the task.
Attributes:

Name | Type | Description |
---|---|---|
output | Optional[str] | The output of the task, if any. |
metadata | Optional[dict[str, Any]] | Additional information or metadata associated with the task. |
tags | Optional[dict[str, str]] | Key-value pairs used to tag and describe the task. |
EvalsMap
Bases: dict
A specialized dictionary for storing evaluation results with flexible key handling.
This class extends dict to provide automatic key normalization for evaluation results, allowing lookup by evaluator objects, strings, or any object with a canonical_name attribute.
_EvalParent
Bases: BaseModel
Represents a node in the evaluation parent-child hierarchy, tracking task results and evaluations.
Attributes:

Name | Type | Description |
---|---|---|
task | Optional[TaskResult] | The task result associated with this evaluation node. |
evals | Optional[EvalsMap] | A mapping of evaluator IDs to their evaluation results. |
parent | Optional[_EvalParent] | Optional reference to a parent evaluation node, forming a linked list. |
find_eval_result
find_eval_result(
evaluator_or_name: Union[str, Evaluator],
) -> typing.Union[
api_types.EvaluationResult, EvaluationResult, None
]
Recursively searches for an evaluation result by evaluator ID or name.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
evaluator_or_name | Union[str, Evaluator] | The evaluator ID, name, or object to search for. | required |
Returns:

Type | Description |
---|---|
Union[api_types.EvaluationResult, EvaluationResult, None] | The matching evaluation result, or None if not found. |
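For illustration, a downstream chain task might consult an earlier result through its parent node (a sketch; "relevance_check" is a hypothetical evaluator name registered in an earlier stage):

```python
from patronus.datasets import Row
from patronus.experiments.types import EvalParent


def gated_task(row: Row, parent: EvalParent, tags: dict):
    # Look up an earlier evaluation by evaluator name (an evaluator object also works).
    previous = parent.find_eval_result("relevance_check") if parent else None
    if previous is None or not previous.pass_:
        return None  # Skip the row when the earlier check is missing or failed.
    return f"Continuing with: {row.task_input}"
```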