experiments

patronus.experiments

adapters

BaseEvaluatorAdapter

Bases: ABC

Abstract base class for all evaluator adapters.

Evaluator adapters provide a standardized interface between the experiment framework and various types of evaluators (function-based, class-based, etc.).

All concrete adapter implementations must inherit from this class and implement the required abstract methods.
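
For orientation, a minimal sketch of a custom adapter is shown below. It assumes that `evaluate(row, task_result, parent, **kwargs)` is the abstract method concrete adapters must provide (as the built-in adapters documented below do) and skips wrapping an evaluator object entirely; the class name, attribute, and length check are purely illustrative.

```python
import typing
from typing import Optional

from patronus import datasets
from patronus.evals import EvaluationResult
from patronus.experiments.adapters import BaseEvaluatorAdapter
from patronus.experiments.types import EvalParent, TaskResult


class MaxLengthAdapter(BaseEvaluatorAdapter):
    """Illustrative adapter that checks output length without wrapping an evaluator class."""

    def __init__(self, max_chars: int = 100):
        self.max_chars = max_chars

    async def evaluate(
        self,
        row: datasets.Row,
        task_result: Optional[TaskResult],
        parent: EvalParent,
        **kwargs: typing.Any,
    ) -> EvaluationResult:
        # Prefer the task output when a task ran, otherwise fall back to the dataset column.
        output = (task_result.output if task_result else row.task_output) or ""
        within_limit = len(output) <= self.max_chars
        return EvaluationResult(pass_=within_limit, score=int(within_limit))
```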

EvaluatorAdapter

EvaluatorAdapter(evaluator: Evaluator)

Bases: BaseEvaluatorAdapter

Adapter for class-based evaluators conforming to the Evaluator or AsyncEvaluator protocol.

This adapter enables the use of evaluator classes that implement either the Evaluator or AsyncEvaluator interface within the experiment framework.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `evaluator` | `Union[Evaluator, AsyncEvaluator]` | The evaluator instance to adapt. |

Examples:

import typing
from typing import Optional

from patronus import datasets
from patronus.evals import Evaluator, EvaluationResult
from patronus.experiments import run_experiment
from patronus.experiments.adapters import EvaluatorAdapter
from patronus.experiments.types import TaskResult, EvalParent


class MatchEvaluator(Evaluator):
    def __init__(self, sanitizer=None):
        if sanitizer is None:
            sanitizer = lambda x: x
        self.sanitizer = sanitizer

    def evaluate(self, actual: str, expected: str) -> EvaluationResult:
        matched = self.sanitizer(actual) == self.sanitizer(expected)
        return EvaluationResult(pass_=matched, score=int(matched))


exact_match = MatchEvaluator()
fuzzy_match = MatchEvaluator(lambda x: x.strip().lower())


class MatchAdapter(EvaluatorAdapter):
    def __init__(self, evaluator: MatchEvaluator):
        super().__init__(evaluator)

    def transform(
        self,
        row: datasets.Row,
        task_result: Optional[TaskResult],
        parent: EvalParent,
        **kwargs
    ) -> tuple[list[typing.Any], dict[str, typing.Any]]:
        args = [row.task_output, row.gold_answer]
        kwargs = {}
        # Passing arguments via kwargs would also work in this case.
        # kwargs = {"actual": row.task_output, "expected": row.gold_answer}
        return args, kwargs


run_experiment(
    dataset=[{"task_output": "string        ", "gold_answer": "string"}],
    evaluators=[MatchAdapter(exact_match), MatchAdapter(fuzzy_match)],
)
Source code in src/patronus/experiments/adapters.py
def __init__(self, evaluator: evals.Evaluator):
    if not isinstance(evaluator, evals.Evaluator):
        raise TypeError(f"{evaluator} is not {evals.Evaluator.__name__}.")
    self.evaluator = evaluator
transform
transform(
    row: Row,
    task_result: Optional[TaskResult],
    parent: EvalParent,
    **kwargs: Any,
) -> tuple[list[typing.Any], dict[str, typing.Any]]

Transform experiment framework arguments to evaluation method arguments.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `row` | `Row` | The data row being evaluated. | required |
| `task_result` | `Optional[TaskResult]` | The result of the task execution, if available. | required |
| `parent` | `EvalParent` | The parent evaluation context. | required |
| `**kwargs` | `Any` | Additional keyword arguments from the experiment. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `list[Any]` | A list of positional arguments to pass to the evaluator function. |
| `dict[str, Any]` | A dictionary of keyword arguments to pass to the evaluator function. |

Source code in src/patronus/experiments/adapters.py
def transform(
    self,
    row: datasets.Row,
    task_result: Optional[TaskResult],
    parent: EvalParent,
    **kwargs: typing.Any,
) -> tuple[list[typing.Any], dict[str, typing.Any]]:
    """
    Transform experiment framework arguments to evaluation method arguments.

    Args:
        row: The data row being evaluated.
        task_result: The result of the task execution, if available.
        parent: The parent evaluation context.
        **kwargs: Additional keyword arguments from the experiment.

    Returns:
        A list of positional arguments to pass to the evaluator function.
        A dictionary of keyword arguments to pass to the evaluator function.
    """

    return ([], {"row": row, "task_result": task_result, "prent": parent, **kwargs})
evaluate async
evaluate(
    row: Row,
    task_result: Optional[TaskResult],
    parent: EvalParent,
    **kwargs: Any,
) -> EvaluationResult

Evaluate the given row and task result using the adapted evaluator function.

This method implements the BaseEvaluatorAdapter.evaluate() protocol.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `row` | `Row` | The data row being evaluated. | required |
| `task_result` | `Optional[TaskResult]` | The result of the task execution, if available. | required |
| `parent` | `EvalParent` | The parent evaluation context. | required |
| `**kwargs` | `Any` | Additional keyword arguments from the experiment. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `EvaluationResult` | An EvaluationResult containing the evaluation outcome. |

Source code in src/patronus/experiments/adapters.py
async def evaluate(
    self, row: datasets.Row, task_result: Optional[TaskResult], parent: EvalParent, **kwargs: typing.Any
) -> EvaluationResult:
    """
    Evaluate the given row and task result using the adapted evaluator function.

    This method implements the BaseEvaluatorAdapter.evaluate() protocol.

    Args:
        row: The data row being evaluated.
        task_result: The result of the task execution, if available.
        parent: The parent evaluation context.
        **kwargs: Additional keyword arguments from the experiment.

    Returns:
        An EvaluationResult containing the evaluation outcome.
    """
    ev_args, ev_kwargs = self.transform(row, task_result, parent, **kwargs)
    return await self._evaluate(*ev_args, **ev_kwargs)

StructuredEvaluatorAdapter

StructuredEvaluatorAdapter(
    evaluator: Union[
        StructuredEvaluator, AsyncStructuredEvaluator
    ],
)

Bases: EvaluatorAdapter

Adapter for structured evaluators.

Source code in src/patronus/experiments/adapters.py
def __init__(self, evaluator: Union[evals.StructuredEvaluator, evals.AsyncStructuredEvaluator]):
    if not isinstance(evaluator, (evals.StructuredEvaluator, evals.AsyncStructuredEvaluator)):
        raise TypeError(
            f"{type(evaluator)} is not "
            f"{evals.AsyncStructuredEvaluator.__name__} nor {evals.StructuredEvaluator.__name__}."
        )
    super().__init__(evaluator)
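
A minimal usage sketch follows. It assumes a structured evaluator is invoked with dataset fields such as `task_output` and `gold_answer` as keyword arguments; the exact field names, and the `ExactMatch` class itself, are illustrative rather than part of the SDK.

```python
from typing import Optional

from patronus.evals import EvaluationResult, StructuredEvaluator
from patronus.experiments import run_experiment
from patronus.experiments.adapters import StructuredEvaluatorAdapter


class ExactMatch(StructuredEvaluator):
    # Assumption: structured evaluators receive dataset fields
    # (e.g. task_output, gold_answer) as keyword arguments.
    def evaluate(
        self,
        *,
        task_output: Optional[str] = None,
        gold_answer: Optional[str] = None,
        **kwargs,
    ) -> EvaluationResult:
        matched = task_output == gold_answer
        return EvaluationResult(pass_=matched, score=int(matched))


run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    evaluators=[StructuredEvaluatorAdapter(ExactMatch())],
)
```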

FuncEvaluatorAdapter

FuncEvaluatorAdapter(fn: Callable[..., Any])

Bases: BaseEvaluatorAdapter

Adapter class that allows using function-based evaluators with the experiment framework.

This adapter serves as a bridge between function-based evaluators decorated with @evaluator() and the experiment framework's evaluation system. It handles both synchronous and asynchronous evaluator functions.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `fn` | `Callable` | The evaluator function to be adapted. |

Notes
  • The function passed to this adapter must be decorated with @evaluator().
  • The adapter automatically handles the conversion between function results and proper evaluation result objects.

Examples:

Direct usage with a compatible evaluator function:

```python
from patronus import evaluator
from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus.datasets import Row


@evaluator()
def exact_match(row: Row, **kwargs):
    return row.task_output == row.gold_answer

run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    evaluators=[FuncEvaluatorAdapter(exact_match)]
)
```

Customized usage by overriding the `transform()` method:

```python
from typing import Optional
import typing

from patronus import evaluator, datasets
from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus.experiments.types import TaskResult, EvalParent


@evaluator()
def exact_match(actual, expected):
    return actual == expected


class AdaptedExactMatch(FuncEvaluatorAdapter):
    def __init__(self):
        super().__init__(exact_match)

    def transform(
        self,
        row: datasets.Row,
        task_result: Optional[TaskResult],
        parent: EvalParent,
        **kwargs
    ) -> tuple[list[typing.Any], dict[str, typing.Any]]:
        args = [row.task_output, row.gold_answer]
        kwargs = {}

        # Alternative: passing arguments via kwargs instead of args
        # args = []
        # kwargs = {"actual": row.task_output, "expected": row.gold_answer}

        return args, kwargs


run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    evaluators=[AdaptedExactMatch()],
)
```
Source code in src/patronus/experiments/adapters.py
def __init__(self, fn: typing.Callable[..., typing.Any]):
    if not hasattr(fn, "_pat_evaluator"):
        raise ValueError(
            f"Passed function {fn.__qualname__} is not an evaluator. "
            "Hint: add @evaluator decorator to the function."
        )
    self.fn = fn
transform
transform(
    row: Row,
    task_result: Optional[TaskResult],
    parent: EvalParent,
    **kwargs: Any,
) -> tuple[list[typing.Any], dict[str, typing.Any]]

Transform experiment framework parameters to evaluator function parameters.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `row` | `Row` | The data row being evaluated. | required |
| `task_result` | `Optional[TaskResult]` | The result of the task execution, if available. | required |
| `parent` | `EvalParent` | The parent evaluation context. | required |
| `**kwargs` | `Any` | Additional keyword arguments from the experiment. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `list[Any]` | A list of positional arguments to pass to the evaluator function. |
| `dict[str, Any]` | A dictionary of keyword arguments to pass to the evaluator function. |

Source code in src/patronus/experiments/adapters.py
def transform(
    self, row: datasets.Row, task_result: Optional[TaskResult], parent: EvalParent, **kwargs: typing.Any
) -> tuple[list[typing.Any], dict[str, typing.Any]]:
    """
    Transform experiment framework parameters to evaluator function parameters.

    Args:
        row: The data row being evaluated.
        task_result: The result of the task execution, if available.
        parent: The parent evaluation context.
        **kwargs: Additional keyword arguments from the experiment.

    Returns:
        A list of positional arguments to pass to the evaluator function.
        A dictionary of keyword arguments to pass to the evaluator function.
    """

    return ([], {"row": row, "task_result": task_result, "prent": parent, **kwargs})
evaluate async
evaluate(
    row: Row,
    task_result: Optional[TaskResult],
    parent: EvalParent,
    **kwargs: Any,
) -> EvaluationResult

Evaluate the given row and task result using the adapted evaluator function.

This method implements the BaseEvaluatorAdapter.evaluate() protocol.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `row` | `Row` | The data row being evaluated. | required |
| `task_result` | `Optional[TaskResult]` | The result of the task execution, if available. | required |
| `parent` | `EvalParent` | The parent evaluation context. | required |
| `**kwargs` | `Any` | Additional keyword arguments from the experiment. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `EvaluationResult` | An EvaluationResult containing the evaluation outcome. |

Source code in src/patronus/experiments/adapters.py
async def evaluate(
    self,
    row: datasets.Row,
    task_result: Optional[TaskResult],
    parent: EvalParent,
    **kwargs: typing.Any,
) -> EvaluationResult:
    """
    Evaluate the given row and task result using the adapted evaluator function.

    This method implements the BaseEvaluatorAdapter.evaluate() protocol.

    Args:
        row: The data row being evaluated.
        task_result: The result of the task execution, if available.
        parent: The parent evaluation context.
        **kwargs: Additional keyword arguments from the experiment.

    Returns:
        An EvaluationResult containing the evaluation outcome.
    """
    ev_args, ev_kwargs = self.transform(row, task_result, parent, **kwargs)
    return await self._evaluate(*ev_args, **ev_kwargs)

experiment

Tags module-attribute

Tags = dict[str, str]

Tags are key-value pairs applied to experiments, task results and evaluation results.

Task module-attribute

Task = Union[
    TaskProtocol[Union[TaskResult, str, None]],
    TaskProtocol[Awaitable[Union[TaskResult, str, None]]],
]

A function that processes each dataset row and produces output for evaluation.
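
The simplest task is a plain function that returns a string (or `None` to skip the row). A short sketch follows, assuming the dataset provides a `task_input` column and that a bare string return value is treated as the task output:

```python
from typing import Optional

from patronus import datasets
from patronus.experiments.types import EvalParent


def uppercase_task(row: datasets.Row, parent: EvalParent, tags: dict[str, str]) -> Optional[str]:
    # Returning None skips processing of this row.
    if not row.task_input:
        return None
    # A plain string is the lightweight alternative to returning a full TaskResult.
    return row.task_input.upper()
```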

ExperimentDataset module-attribute

ExperimentDataset = Union[
    Dataset,
    DatasetLoader,
    list[dict[str, Any]],
    tuple[dict[str, Any], ...],
    DataFrame,
    Awaitable,
    Callable[[], Awaitable],
]

Any object that would "resolve" into Dataset.
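
For illustration, each of the following forms can be passed as `dataset` (the column names here are only examples):

```python
import pandas as pd

# A list of dicts, one per row:
rows = [{"task_input": "2 + 2", "gold_answer": "4"}]

# A pandas DataFrame with the same columns:
df = pd.DataFrame(rows)

# An async callable (or awaitable) that produces the data lazily:
async def load_rows():
    return rows
```

Any of these can be handed to `run_experiment(dataset=...)` or `Experiment.create(dataset=...)`.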

TaskProtocol

Bases: Protocol[T]

Defines an interface for a task.

Task is a function that processes each dataset row and produces output for evaluation.

__call__
__call__(*, row: Row, parent: EvalParent, tags: Tags) -> T

Processes a dataset row, using the provided context to produce task output.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `row` | `Row` | The dataset row to process. | required |
| `parent` | `EvalParent` | Reference to the parent task's output and evaluation results. | required |
| `tags` | `Tags` | Key-value pairs. | required |

Returns:

| Type | Description |
|------|-------------|
| `T` | Task output of type `T`, or `None` to skip the row processing. |

Example
def simple_task(row: datasets.Row, parent: EvalParent, tags: Tags) -> TaskResult:
    # Process input from the dataset row
    input_text = row.task_input

    # Generate output
    output = f"Processed: {input_text}"

    # Return result
    return TaskResult(
        output=output,
        metadata={"processing_time_ms": 42},
        tags={"model": "example-model"}
    )
Source code in src/patronus/experiments/experiment.py
def __call__(self, *, row: datasets.Row, parent: EvalParent, tags: Tags) -> T:
    """
    Processes a dataset row, using the provided context to produce task output.

    Args:
        row: The dataset row to process.
        parent: Reference to the parent task's output and evaluation results.
        tags: Key-value pairs.

    Returns:
        Task output of type T or None to skip the row processing.

    Example:
        ```python
        def simple_task(row: datasets.Row, parent: EvalParent, tags: Tags) -> TaskResult:
            # Process input from the dataset row
            input_text = row.task_input

            # Generate output
            output = f"Processed: {input_text}"

            # Return result
            return TaskResult(
                output=output,
                metadata={"processing_time_ms": 42},
                tags={"model": "example-model"}
            )
        ```
    """

ChainLink

Bases: TypedDict

Represents a single stage in an experiment's processing chain.

Each ChainLink contains an optional task function that processes dataset rows and a list of evaluators that assess the task's output.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `task` | `Optional[Task]` | Function that processes a dataset row and produces output. |
| `evaluators` | `list[AdaptableEvaluators]` | List of evaluators to assess the task's output. |
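
A sketch of a two-stage chain is shown below. It assumes each stage's task can read the previous stage's output via `parent.task`; the task, evaluator, and column names are illustrative.

```python
from typing import Optional

from patronus import datasets, evaluator
from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus.experiments.types import EvalParent


@evaluator()
def not_empty(row: datasets.Row, task_result, **kwargs) -> bool:
    return bool(task_result and task_result.output)


def draft(row: datasets.Row, parent: EvalParent, tags) -> str:
    return f"Draft for: {row.task_input}"


def refine(row: datasets.Row, parent: EvalParent, tags) -> Optional[str]:
    # Assumption: the previous stage's output is exposed as parent.task.
    if parent is None or parent.task is None:
        return None
    return parent.task.output.strip()


run_experiment(
    dataset=[{"task_input": "example"}],
    chain=[
        {"task": draft, "evaluators": [FuncEvaluatorAdapter(not_empty)]},
        {"task": refine, "evaluators": [FuncEvaluatorAdapter(not_empty)]},
    ],
)
```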

Experiment

Experiment(
    *,
    dataset: Any,
    task: Optional[Task] = None,
    evaluators: Optional[list[AdaptableEvaluators]] = None,
    chain: Optional[list[ChainLink]] = None,
    tags: Optional[dict[str, str]] = None,
    max_concurrency: int = 10,
    project_name: Optional[str] = None,
    experiment_name: Optional[str] = None,
    service: Optional[str] = None,
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    otel_endpoint: Optional[str] = None,
    ui_url: Optional[str] = None,
    timeout_s: Optional[int] = None,
    integrations: Optional[list[Any]] = None,
    **kwargs,
)

Manages evaluation experiments across datasets using tasks and evaluators.

An experiment represents a complete evaluation pipeline that processes a dataset using defined tasks, applies evaluators to the outputs, and collects the results. Experiments track progress, create reports, and interface with the Patronus platform.

Create experiment instances using the create() class method or through the run_experiment() convenience function.

Source code in src/patronus/experiments/experiment.py
def __init__(
    self,
    *,
    dataset: typing.Any,
    task: Optional[Task] = None,
    evaluators: Optional[list[AdaptableEvaluators]] = None,
    chain: Optional[list[ChainLink]] = None,
    tags: Optional[dict[str, str]] = None,
    max_concurrency: int = 10,
    project_name: Optional[str] = None,
    experiment_name: Optional[str] = None,
    service: Optional[str] = None,
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    otel_endpoint: Optional[str] = None,
    ui_url: Optional[str] = None,
    timeout_s: Optional[int] = None,
    integrations: Optional[list[typing.Any]] = None,
    **kwargs,
):
    if chain and evaluators:
        raise ValueError("Cannot specify both chain and evaluators")

    self._raw_dataset = dataset

    if not chain:
        chain = [{"task": task, "evaluators": evaluators}]
    self._chain = [
        {"task": _trace_task(link["task"]), "evaluators": _adapt_evaluators(link["evaluators"])} for link in chain
    ]
    self._started = False
    self._finished = False

    self._project_name = project_name
    self.project = None

    self._experiment_name = experiment_name
    self.experiment = None

    self.tags = tags or {}

    self.max_concurrency = max_concurrency

    self._service = service
    self._api_key = api_key
    self._api_url = api_url
    self._otel_endpoint = otel_endpoint
    self._ui_url = ui_url
    self._timeout_s = timeout_s

    self._prepared = False

    self.reporter = Reporter()

    self._integrations = integrations
create async classmethod
create(
    dataset: ExperimentDataset,
    task: Optional[Task] = None,
    evaluators: Optional[list[AdaptableEvaluators]] = None,
    chain: Optional[list[ChainLink]] = None,
    tags: Optional[Tags] = None,
    max_concurrency: int = 10,
    project_name: Optional[str] = None,
    experiment_name: Optional[str] = None,
    service: Optional[str] = None,
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    otel_endpoint: Optional[str] = None,
    ui_url: Optional[str] = None,
    timeout_s: Optional[int] = None,
    integrations: Optional[list[Any]] = None,
    **kwargs: Any,
) -> te.Self

Creates an instance of the class asynchronously with the specified parameters while performing necessary preparations. This method initializes various attributes including dataset, task, evaluators, chain, and additional configurations for managing concurrency, project details, service information, API keys, timeout settings, and integrations.

Use run_experiment for more convenient usage.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset` | `ExperimentDataset` | The dataset to run evaluations against. | required |
| `task` | `Optional[Task]` | A function that processes each dataset row and produces output for evaluation. Mutually exclusive with the `chain` parameter. | `None` |
| `evaluators` | `Optional[list[AdaptableEvaluators]]` | A list of evaluators to assess the task output. Mutually exclusive with the `chain` parameter. | `None` |
| `chain` | `Optional[list[ChainLink]]` | A list of processing stages, each containing a task and associated evaluators. Use this for multi-stage evaluation pipelines. | `None` |
| `tags` | `Optional[Tags]` | Key-value pairs. All evaluations created by the experiment will contain these tags. | `None` |
| `max_concurrency` | `int` | Maximum number of concurrent task and evaluation operations. | `10` |
| `project_name` | `Optional[str]` | Name of the project to create or use. Falls back to configuration or environment variables if not provided. | `None` |
| `experiment_name` | `Optional[str]` | Custom name for this experiment run. A timestamp will be appended. | `None` |
| `service` | `Optional[str]` | OpenTelemetry service name for tracing. Falls back to configuration or environment variables if not provided. | `None` |
| `api_key` | `Optional[str]` | API key for Patronus services. Falls back to configuration or environment variables if not provided. | `None` |
| `api_url` | `Optional[str]` | URL for the Patronus API. Falls back to configuration or environment variables if not provided. | `None` |
| `otel_endpoint` | `Optional[str]` | OpenTelemetry collector endpoint. Falls back to configuration or environment variables if not provided. | `None` |
| `ui_url` | `Optional[str]` | URL for the Patronus UI. Falls back to configuration or environment variables if not provided. | `None` |
| `timeout_s` | `Optional[int]` | Timeout in seconds for API operations. Falls back to configuration or environment variables if not provided. | `None` |
| `integrations` | `Optional[list[Any]]` | A list of OpenTelemetry instrumentors for additional tracing capabilities. | `None` |
| `**kwargs` | `Any` | Additional keyword arguments passed to the experiment. | `{}` |

Returns:

| Name | Type | Description |
|------|------|-------------|
| `Experiment` | `Self` | The created and prepared experiment instance. |

Source code in src/patronus/experiments/experiment.py
@classmethod
async def create(
    cls,
    dataset: ExperimentDataset,
    task: Optional[Task] = None,
    evaluators: Optional[list[AdaptableEvaluators]] = None,
    chain: Optional[list[ChainLink]] = None,
    tags: Optional[Tags] = None,
    max_concurrency: int = 10,
    project_name: Optional[str] = None,
    experiment_name: Optional[str] = None,
    service: Optional[str] = None,
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    otel_endpoint: Optional[str] = None,
    ui_url: Optional[str] = None,
    timeout_s: Optional[int] = None,
    integrations: Optional[list[typing.Any]] = None,
    **kwargs: typing.Any,
) -> te.Self:
    """
    Creates an instance of the class asynchronously with the specified parameters while performing
    necessary preparations. This method initializes various attributes including dataset, task,
    evaluators, chain, and additional configurations for managing concurrency, project details,
    service information, API keys, timeout settings, and integrations.

    Use [run_experiment][patronus.experiments.experiment.run_experiment] for more convenient usage.

    Args:
        dataset: The dataset to run evaluations against.
        task: A function that processes each dataset row and produces output for evaluation.
            Mutually exclusive with the `chain` parameter.
        evaluators: A list of evaluators to assess the task output. Mutually exclusive with
            the `chain` parameter.
        chain: A list of processing stages, each containing a task and associated evaluators.
            Use this for multi-stage evaluation pipelines.
        tags: Key-value pairs.
            All evaluations created by the experiment will contain these tags.
        max_concurrency: Maximum number of concurrent task and evaluation operations.
        project_name: Name of the project to create or use. Falls back to configuration or
            environment variables if not provided.
        experiment_name: Custom name for this experiment run. A timestamp will be appended.
        service: OpenTelemetry service name for tracing. Falls back to configuration or
            environment variables if not provided.
        api_key: API key for Patronus services. Falls back to configuration or environment
            variables if not provided.
        api_url: URL for the Patronus API. Falls back to configuration or environment
            variables if not provided.
        otel_endpoint: OpenTelemetry collector endpoint. Falls back to configuration or
            environment variables if not provided.
        ui_url: URL for the Patronus UI. Falls back to configuration or environment
            variables if not provided.
        timeout_s: Timeout in seconds for API operations. Falls back to configuration or
            environment variables if not provided.
        integrations: A list of OpenTelemetry instrumentors for additional tracing capabilities.
        **kwargs: Additional keyword arguments passed to the experiment.

    Returns:
        Experiment: ...

    """
    ex = cls(
        dataset=dataset,
        task=task,
        evaluators=evaluators,
        chain=chain,
        tags=tags,
        max_concurrency=max_concurrency,
        project_name=project_name,
        experiment_name=experiment_name,
        service=service,
        api_key=api_key,
        api_url=api_url,
        otel_endpoint=otel_endpoint,
        ui_url=ui_url,
        timeout_s=timeout_s,
        integrations=integrations,
        **kwargs,
    )
    ex._ctx = await ex._prepare()

    return ex
run async
run() -> te.Self

Executes the experiment by processing all dataset items.

Runs the experiment's task chain on each dataset row, applying evaluators to the results and collecting metrics. Progress is displayed with a progress bar and results are logged to the Patronus platform.

Returns:

| Type | Description |
|------|-------------|
| `Self` | The experiment instance. |

Source code in src/patronus/experiments/experiment.py
async def run(self) -> te.Self:
    """
    Executes the experiment by processing all dataset items.

    Runs the experiment's task chain on each dataset row, applying evaluators
    to the results and collecting metrics. Progress is displayed with a progress
    bar and results are logged to the Patronus platform.

    Returns:
        The experiment instance.
    """
    if self._started:
        raise RuntimeError("Experiment already started")
    if self._prepared is False:
        raise ValueError(
            "Experiment must be prepared before starting. "
            "Seems that Experiment was not created using Experiment.create() classmethod."
        )
    self._started = True

    with context._CTX_PAT.using(self._ctx):
        await self._run()
        self._finished = True
        self.reporter.summary()

    await asyncio.to_thread(self._ctx.exporter.force_flush)
    await asyncio.to_thread(self._ctx.tracer_provider.force_flush)

    return self
to_dataframe
to_dataframe() -> pd.DataFrame

Converts experiment results to a pandas DataFrame.

Creates a tabular representation of all evaluation results with dataset identifiers, task information, evaluation scores, and metadata.

Returns:

| Type | Description |
|------|-------------|
| `DataFrame` | A pandas DataFrame containing all experiment results. |

Source code in src/patronus/experiments/experiment.py
def to_dataframe(self) -> pd.DataFrame:
    """
    Converts experiment results to a pandas DataFrame.

    Creates a tabular representation of all evaluation results with
    dataset identifiers, task information, evaluation scores, and metadata.

    Returns:
        A pandas DataFrame containing all experiment results.
    """
    if self._finished is not True:
        raise RuntimeError("Experiment has to be in finished state")
    return self.reporter.to_dataframe()
to_csv
to_csv(
    path_or_buf: Union[str, Path, IO[AnyStr]], **kwargs: Any
) -> Optional[str]

Saves experiment results to a CSV file.

Converts experiment results to a DataFrame and saves them as a CSV file.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `path_or_buf` | `Union[str, Path, IO[AnyStr]]` | String path or file-like object where the CSV will be saved. | required |
| `**kwargs` | `Any` | Additional arguments passed to pandas.DataFrame.to_csv(). | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Optional[str]` | None when a path or buffer is provided (the data is written to it); pandas.DataFrame.to_csv() only returns the CSV as a string when no path is given. |

Source code in src/patronus/experiments/experiment.py
def to_csv(
    self, path_or_buf: Union[str, pathlib.Path, typing.IO[typing.AnyStr]], **kwargs: typing.Any
) -> Optional[str]:
    """
    Saves experiment results to a CSV file.

    Converts experiment results to a DataFrame and saves them as a CSV file.

    Args:
        path_or_buf: String path or file-like object where the CSV will be saved.
        **kwargs: Additional arguments passed to pandas.DataFrame.to_csv().

    Returns:
        None when a path or buffer is provided (the data is written to it); the CSV is returned
        as a string only when no path is given, mirroring pandas.DataFrame.to_csv().

    """
    return self.to_dataframe().to_csv(path_or_buf, **kwargs)
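
Putting the export helpers together, a typical synchronous flow might look like the sketch below (the evaluator and file name are illustrative):

```python
from patronus import evaluator
from patronus.datasets import Row
from patronus.experiments import FuncEvaluatorAdapter, run_experiment


@evaluator()
def exact_match(row: Row, **kwargs) -> bool:
    return row.task_output == row.gold_answer


experiment = run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    evaluators=[FuncEvaluatorAdapter(exact_match)],
)

df = experiment.to_dataframe()               # tabular view of all evaluation results
experiment.to_csv("experiment_results.csv")  # the same data written to disk
```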

run_experiment

run_experiment(
    dataset: ExperimentDataset,
    task: Optional[Task] = None,
    evaluators: Optional[list[AdaptableEvaluators]] = None,
    chain: Optional[list[ChainLink]] = None,
    tags: Optional[Tags] = None,
    max_concurrency: int = 10,
    project_name: Optional[str] = None,
    experiment_name: Optional[str] = None,
    service: Optional[str] = None,
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    otel_endpoint: Optional[str] = None,
    ui_url: Optional[str] = None,
    timeout_s: Optional[int] = None,
    integrations: Optional[list[Any]] = None,
    **kwargs,
) -> Union[Experiment, typing.Awaitable[Experiment]]

Create and run an experiment.

This function creates an experiment with the specified configuration and runs it to completion. The execution handling is context-aware:

  • When called from an asynchronous context (with a running event loop), it returns an awaitable that must be awaited.
  • When called from a synchronous context (no running event loop), it blocks until the experiment completes and returns the Experiment object.

Examples:

Synchronous execution:

experiment = run_experiment(dataset, task=some_task)
# Blocks until the experiment finishes.

Asynchronous execution (e.g., in a Jupyter Notebook):

experiment = await run_experiment(dataset, task=some_task)
# Must be awaited within an async function or event loop.

Parameters:

See Experiment.create for the list of arguments.

Returns:

| Name | Type | Description |
|------|------|-------------|
| `Experiment` | `Experiment` | In a synchronous context: the completed Experiment object. |
| `Experiment` | `Awaitable[Experiment]` | In an asynchronous context: an awaitable that resolves to the Experiment object. |

Notes

For manual control of the event loop, you can create and run the experiment as follows:

experiment = await Experiment.create(...)
await experiment.run()
Source code in src/patronus/experiments/experiment.py
def run_experiment(
    dataset: ExperimentDataset,
    task: Optional[Task] = None,
    evaluators: Optional[list[AdaptableEvaluators]] = None,
    chain: Optional[list[ChainLink]] = None,
    tags: Optional[Tags] = None,
    max_concurrency: int = 10,
    project_name: Optional[str] = None,
    experiment_name: Optional[str] = None,
    service: Optional[str] = None,
    api_key: Optional[str] = None,
    api_url: Optional[str] = None,
    otel_endpoint: Optional[str] = None,
    ui_url: Optional[str] = None,
    timeout_s: Optional[int] = None,
    integrations: Optional[list[typing.Any]] = None,
    **kwargs,
) -> Union["Experiment", typing.Awaitable["Experiment"]]:
    """
    Create and run an experiment.

    This function creates an experiment with the specified configuration and runs it to completion.
    The execution handling is context-aware:

    - When called from an asynchronous context (with a running event loop), it returns an
      awaitable that must be awaited.
    - When called from a synchronous context (no running event loop), it blocks until the
      experiment completes and returns the Experiment object.


    **Examples:**

    Synchronous execution:

    ```python
    experiment = run_experiment(dataset, task=some_task)
    # Blocks until the experiment finishes.
    ```

    Asynchronous execution (e.g., in a Jupyter Notebook):

    ```python
    experiment = await run_experiment(dataset, task=some_task)
    # Must be awaited within an async function or event loop.
    ```

    **Parameters:**

    See [Experiment.create][patronus.experiments.experiment.Experiment.create] for list of arguments.

    Returns:
        Experiment (Experiment): In a synchronous context: the completed Experiment object.
        Experiment (Awaitable[Experiment]): In an asynchronous context:
            an awaitable that resolves to the Experiment object.

    Notes:
        For manual control of the event loop, you can create and run the experiment as follows:

        ```python
        experiment = await Experiment.create(...)
        await experiment.run()
        ```

    """

    async def _run_experiment() -> Union[Experiment, typing.Awaitable[Experiment]]:
        ex = await Experiment.create(
            dataset=dataset,
            task=task,
            evaluators=evaluators,
            chain=chain,
            tags=tags,
            max_concurrency=max_concurrency,
            project_name=project_name,
            experiment_name=experiment_name,
            service=service,
            api_key=api_key,
            api_url=api_url,
            otel_endpoint=otel_endpoint,
            ui_url=ui_url,
            timeout_s=timeout_s,
            integrations=integrations,
            **kwargs,
        )
        return await ex.run()

    return run_until_complete(_run_experiment())

types

EvalParent module-attribute

EvalParent = Optional[_EvalParent]

Type alias representing an optional reference to an evaluation parent, used to track the hierarchy of evaluations and their results.

TaskResult

Bases: BaseModel

Represents the result of a task with optional output, metadata, and tags.

This class is used to encapsulate the result of a task, including optional fields for the output of the task, metadata related to the task, and any tags that can provide additional information or context about the task.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `output` | `Optional[str]` | The output of the task, if any. |
| `metadata` | `Optional[dict[str, Any]]` | Additional information or metadata associated with the task. |
| `tags` | `Optional[dict[str, str]]` | Key-value pairs used to tag and describe the task. |

EvalsMap

Bases: dict

A specialized dictionary for storing evaluation results with flexible key handling.

This class extends dict to provide automatic key normalization for evaluation results, allowing lookup by evaluator objects, strings, or any object with a canonical_name attribute.

_EvalParent

Bases: BaseModel

Represents a node in the evaluation parent-child hierarchy, tracking task results and evaluations.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `task` | `Optional[TaskResult]` | The task result associated with this evaluation node. |
| `evals` | `Optional[EvalsMap]` | A mapping of evaluator IDs to their evaluation results. |
| `parent` | `Optional[_EvalParent]` | Optional reference to a parent evaluation node, forming a linked list. |

find_eval_result
find_eval_result(
    evaluator_or_name: Union[str, Evaluator],
) -> typing.Union[
    api_types.EvaluationResult, EvaluationResult, None
]

Recursively searches for an evaluation result by evaluator ID or name.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluator_or_name` | `Union[str, Evaluator]` | The evaluator ID, name, or object to search for. | required |

Returns:

| Type | Description |
|------|-------------|
| `Union[api_types.EvaluationResult, EvaluationResult, None]` | The matching evaluation result, or None if not found. |

Source code in src/patronus/experiments/types.py
def find_eval_result(
    self, evaluator_or_name: typing.Union[str, Evaluator]
) -> typing.Union[api_types.EvaluationResult, EvaluationResult, None]:
    """
    Recursively searches for an evaluation result by evaluator ID or name.

    Args:
        evaluator_or_name: The evaluator ID, name, or object to search for

    Returns:
        The matching evaluation result, or None if not found
    """
    if not self.evals and self.parent:
        return self.parent.find_eval_result(evaluator_or_name)
    if evaluator_or_name in self.evals:
        return self.evals[evaluator_or_name]
    return None
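
As an illustration, a downstream task in a chain can use `find_eval_result` (via `parent`) to gate its work on an upstream evaluation. This sketch assumes the decorated function's name, `"exact_match"`, is a valid lookup key and that the first stage's results are reachable through `parent`; the task and column names are illustrative.

```python
from typing import Optional

from patronus import datasets, evaluator
from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus.experiments.types import EvalParent


@evaluator()
def exact_match(row: datasets.Row, **kwargs) -> bool:
    return row.task_output == row.gold_answer


def second_stage(row: datasets.Row, parent: EvalParent, tags) -> Optional[str]:
    # Assumption: the function name works as the evaluator name/key.
    result = parent.find_eval_result("exact_match") if parent else None
    if result is None or not result.pass_:
        return None  # skip rows whose first-stage evaluation failed
    return "second stage ran because exact_match passed"


run_experiment(
    dataset=[{"task_output": "string", "gold_answer": "string"}],
    chain=[
        {"task": None, "evaluators": [FuncEvaluatorAdapter(exact_match)]},
        {"task": second_stage, "evaluators": []},
    ],
)
```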