
Datasets

patronus.datasets

datasets

Attachment

Bases: TypedDict

Represents an attachment entry, typically used in the context of multimodal evaluation.

Fields

Bases: TypedDict

A TypedDict describing the fields of a single dataset record.

Attributes:

sid (NotRequired[Optional[str]]):
    An optional unique identifier for the sample.
system_prompt (NotRequired[Optional[str]]):
    An optional string representing the system prompt associated with the task.
task_context (NotRequired[Union[str, list[str], None]]):
    Optional contextual information for the task, as a string or a list of strings.
task_attachments (NotRequired[Optional[list[Attachment]]]):
    Optional list of attachments associated with the task.
task_input (NotRequired[Optional[str]]):
    An optional string representing the input data for the task, usually a user message sent to an LLM.
task_output (NotRequired[Optional[str]]):
    An optional string representing the output of the task, usually a response from an LLM.
gold_answer (NotRequired[Optional[str]]):
    An optional string representing the correct or expected answer, used for evaluation.
task_metadata (NotRequired[Optional[dict[str, Any]]]):
    Optional dictionary containing metadata associated with the task.
tags (NotRequired[Optional[dict[str, str]]]):
    Optional dictionary of key-value tags relevant to the task.
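
For illustration, a record using these fields might look like the following minimal sketch. It uses only the keys documented above; the import path is an assumption, since Fields is defined in patronus.datasets.datasets and presumed re-exported.

from patronus.datasets import Fields  # assumed import path

record: Fields = {
    "sid": "sample-001",
    "system_prompt": "You are a concise assistant.",
    "task_input": "What is the capital of France?",
    "task_output": "The capital of France is Paris.",
    "gold_answer": "Paris",
    "tags": {"split": "dev"},
}

All keys are NotRequired, so any subset of fields is a valid record.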

Row dataclass

Row(_row: Series)

Represents a single data row, wrapping a pandas Series.

Provides attribute-based access to underlying pandas Series data with properties that ensure compatibility with structured evaluators through consistent field naming and type handling.
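
A minimal sketch of that access pattern, assuming the Row properties mirror the Fields keys (the property names are not listed here, so this is an assumption):

import pandas as pd
from patronus.datasets import Row  # assumed import path

# Properties assumed to mirror the Fields keys (task_input, gold_answer, ...).
row = Row(pd.Series({"task_input": "What is 2 + 2?", "gold_answer": "4"}))
print(row.task_input)   # "What is 2 + 2?"
print(row.gold_answer)  # "4"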

Dataset dataclass

Dataset(dataset_id: Optional[str], df: DataFrame)

Represents a dataset.

from_records classmethod
from_records(
    records: Union[
        Iterable[Fields], Iterable[dict[str, Any]]
    ],
    dataset_id: Optional[str] = None,
) -> te.Self

Creates an instance of the class by processing and sanitizing provided records and optionally associating them with a specific dataset ID.

Parameters:

records (Union[Iterable[Fields], Iterable[dict[str, Any]]], required):
    A collection of records to initialize the instance. Each record can either be an instance of Fields or a dictionary containing corresponding data.
dataset_id (Optional[str], default None):
    An optional identifier for associating the data with a specific dataset.

Returns:

Self: A new instance of the class with the processed and sanitized data.

Source code in src/patronus/datasets/datasets.py
@classmethod
def from_records(
    cls,
    records: Union[typing.Iterable[Fields], typing.Iterable[dict[str, typing.Any]]],
    dataset_id: Optional[str] = None,
) -> te.Self:
    """
    Creates an instance of the class by processing and sanitizing provided records
    and optionally associating them with a specific dataset ID.

    Args:
        records:
            A collection of records to initialize the instance. Each record can either
            be an instance of `Fields` or a dictionary containing corresponding data.
        dataset_id:
            An optional identifier for associating the data with a specific dataset.

    Returns:
        te.Self: A new instance of the class with the processed and sanitized data.
    """
    df = pd.DataFrame.from_records(records)
    df = cls.__sanitize_df(df, dataset_id)
    return cls(df=df, dataset_id=dataset_id)
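
A typical call, sketched with the documented record keys (the import path and dataset ID are illustrative assumptions):

from patronus.datasets import Dataset  # assumed import path

dataset = Dataset.from_records(
    [
        {"task_input": "What is 2 + 2?", "gold_answer": "4"},
        {"task_input": "Name the largest planet.", "gold_answer": "Jupiter"},
    ],
    dataset_id="demo-qa",
)
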
to_csv
to_csv(
    path_or_buf: Union[str, Path, IO[AnyStr]], **kwargs: Any
) -> Optional[str]

Saves the dataset to a CSV file.

Parameters:

path_or_buf (Union[str, Path, IO[AnyStr]], required):
    String path or file-like object where the CSV will be saved.
**kwargs (Any):
    Additional arguments passed to pandas.DataFrame.to_csv().

Returns:

Optional[str]: None when a path or buffer is given; otherwise the CSV data as a string (the behavior of pandas.DataFrame.to_csv).

Source code in src/patronus/datasets/datasets.py
def to_csv(
    self, path_or_buf: Union[str, pathlib.Path, typing.IO[typing.AnyStr]], **kwargs: typing.Any
) -> Optional[str]:
    """
    Saves dataset to a CSV file.

    Args:
        path_or_buf: String path or file-like object where the CSV will be saved.
        **kwargs: Additional arguments passed to pandas.DataFrame.to_csv().

    Returns:
        None when a path or buffer is given; otherwise the CSV data as a string
        (mirrors pandas.DataFrame.to_csv).
    """
    return self.df.to_csv(path_or_buf, **kwargs)
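
For example, writing the dataset built above to disk, with index=False forwarded to pandas (a minimal sketch; the filename is illustrative):

dataset.to_csv("demo_qa.csv", index=False)  # index=False is passed through to pandas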

DatasetLoader

DatasetLoader(
    loader: Union[
        Awaitable[Dataset], Callable[[], Awaitable[Dataset]]
    ],
)

Encapsulates asynchronous loading of a dataset.

This class lazily loads a dataset exactly once, using the provided loader, which may be an awaitable or an async callable that returns a Dataset.

Source code in src/patronus/datasets/datasets.py
def __init__(self, loader: Union[typing.Awaitable[Dataset], typing.Callable[[], typing.Awaitable[Dataset]]]):
    self.__lock = asyncio.Lock()
    self.__loader = loader
    self.dataset: Optional[Dataset] = None
load async
load() -> Dataset

Loads the dataset. Repeated calls return the already loaded dataset.

Source code in src/patronus/datasets/datasets.py
async def load(self) -> Dataset:
    """
    Load the dataset. Repeated calls return the already loaded dataset.
    """
    async with self.__lock:
        if self.dataset is not None:
            return self.dataset
        if inspect.iscoroutinefunction(self.__loader):
            self.dataset = await self.__loader()
        else:
            self.dataset = await self.__loader
        return self.dataset
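
A minimal sketch of wrapping an async loader function; the function name and record contents are illustrative, and the import paths are assumptions:

import asyncio

from patronus.datasets import Dataset, DatasetLoader  # assumed import paths

async def fetch_dataset() -> Dataset:
    # In practice this might call an API or read a file.
    return Dataset.from_records([{"task_input": "Hi", "gold_answer": "Hello"}])

async def main() -> None:
    loader = DatasetLoader(fetch_dataset)
    first = await loader.load()   # performs the actual load
    second = await loader.load()  # returns the cached Dataset
    assert first is second

asyncio.run(main())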

read_csv

read_csv(
    filename_or_buffer: Union[str, Path, IO[AnyStr]],
    *,
    dataset_id: Optional[str] = None,
    sid_field: str = "sid",
    system_prompt_field: str = "system_prompt",
    task_input_field: str = "task_input",
    task_context_field: str = "task_context",
    task_attachments_field: str = "task_attachments",
    task_output_field: str = "task_output",
    gold_answer_field: str = "gold_answer",
    task_metadata_field: str = "task_metadata",
    tags_field: str = "tags",
    **kwargs: Any,
) -> Dataset

Reads a CSV file and converts it into a Dataset object. Each column of the CSV maps to a field of the dataset schema via the field-name arguments, so you can adapt the mapping to your dataset's structure; additional keyword arguments are passed directly to the underlying pandas.read_csv function.

Parameters:

filename_or_buffer (Union[str, Path, IO[AnyStr]], required):
    Path to the CSV file or a file-like object containing the dataset to be read.
dataset_id (Optional[str], default None):
    Optional identifier for the dataset being read.
sid_field (str, default 'sid'):
    Name of the column containing unique sample identifiers.
system_prompt_field (str, default 'system_prompt'):
    Name of the column representing the system prompts.
task_input_field (str, default 'task_input'):
    Name of the column containing the main input for the task.
task_context_field (str, default 'task_context'):
    Name of the column describing the broader task context.
task_attachments_field (str, default 'task_attachments'):
    Name of the column with supplementary attachments related to the task.
task_output_field (str, default 'task_output'):
    Name of the column containing responses or outputs for the task.
gold_answer_field (str, default 'gold_answer'):
    Name of the column detailing the expected or correct answer to the task.
task_metadata_field (str, default 'task_metadata'):
    Name of the column storing metadata attributes associated with the task.
tags_field (str, default 'tags'):
    Name of the column containing tags or annotations related to each sample.
**kwargs (Any):
    Additional keyword arguments passed to pandas.read_csv for fine-tuning the CSV parsing behavior, such as delimiters or encoding.

Returns:

Dataset: The parsed dataset object containing structured data from the input CSV file.

Source code in src/patronus/datasets/datasets.py
def read_csv(
    filename_or_buffer: Union[str, pathlib.Path, typing.IO[typing.AnyStr]],
    *,
    dataset_id: Optional[str] = None,
    sid_field: str = "sid",
    system_prompt_field: str = "system_prompt",
    task_input_field: str = "task_input",
    task_context_field: str = "task_context",
    task_attachments_field: str = "task_attachments",
    task_output_field: str = "task_output",
    gold_answer_field: str = "gold_answer",
    task_metadata_field: str = "task_metadata",
    tags_field: str = "tags",
    **kwargs: typing.Any,
) -> Dataset:
    """
    Reads a CSV file and converts it into a Dataset object. The CSV file is transformed
    into a structured dataset where each field maps to a specific aspect of the dataset
    schema provided via function arguments. You may specify custom field mappings as per
    your dataset structure, while additional keyword arguments are passed directly to the
    underlying 'pd.read_csv' function.

    Args:
        filename_or_buffer: Path to the CSV file or a file-like object containing the
            dataset to be read.
        dataset_id: Optional identifier for the dataset being read. Default is None.
        sid_field: Name of the column containing unique sample identifiers.
        system_prompt_field: Name of the column representing the system prompts.
        task_input_field: Name of the column containing the main input for the task.
        task_context_field: Name of the column describing the broader task context.
        task_attachments_field: Name of the column with supplementary attachments
            related to the task.
        task_output_field: Name of the column containing responses or outputs for the
            task.
        gold_answer_field: Name of the column detailing the expected or correct
            answer to the task.
        task_metadata_field: Name of the column storing metadata attributes
            associated with the task.
        tags_field: Name of the column containing tags or annotations related to each
            sample.
        **kwargs: Additional keyword arguments passed to 'pandas.read_csv' for fine-tuning
            the CSV parsing behavior, such as delimiters, encoding, etc.

    Returns:
        Dataset: The parsed dataset object containing structured data from the input
            CSV file.
    """
    return _read_dataframe(
        pd.read_csv,
        filename_or_buffer,
        dataset_id=dataset_id,
        sid_field=sid_field,
        system_prompt_field=system_prompt_field,
        task_context_field=task_context_field,
        task_attachments_field=task_attachments_field,
        task_input_field=task_input_field,
        task_output_field=task_output_field,
        gold_answer_field=gold_answer_field,
        task_metadata_field=task_metadata_field,
        tags_field=tags_field,
        **kwargs,
    )
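
For example, loading a CSV whose columns use different names, with an extra pandas option passed through (a sketch; the filename, column names, and import path are illustrative assumptions):

from patronus.datasets import read_csv  # assumed import path

dataset = read_csv(
    "qa_pairs.csv",
    dataset_id="qa-pairs",
    task_input_field="question",   # map the "question" column to task_input
    gold_answer_field="answer",    # map the "answer" column to gold_answer
    sep=";",                       # forwarded to pandas.read_csv
)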

read_jsonl

read_jsonl(
    filename_or_buffer: Union[str, Path, IO[AnyStr]],
    *,
    dataset_id: Optional[str] = None,
    sid_field: str = "sid",
    system_prompt_field: str = "system_prompt",
    task_input_field: str = "task_input",
    task_context_field: str = "task_context",
    task_attachments_field: str = "task_attachments",
    task_output_field: str = "task_output",
    gold_answer_field: str = "gold_answer",
    task_metadata_field: str = "task_metadata",
    tags_field: str = "tags",
    **kwargs: Any,
) -> Dataset

Reads a JSONL (JSON Lines) file and transforms it into a Dataset object. The input file or buffer is parsed line by line, extracting the specified fields and any additional metadata for use in downstream tasks. Field mappings and additional keyword arguments can be customized to fit application-specific requirements.

Parameters:

filename_or_buffer (Union[str, Path, IO[AnyStr]], required):
    The path to the file or a file-like object containing the JSONL data to be read.
dataset_id (Optional[str], default None):
    An optional identifier for the dataset being read.
sid_field (str, default 'sid'):
    The field name representing the unique identifier for a sample.
system_prompt_field (str, default 'system_prompt'):
    The field name for the system prompt.
task_input_field (str, default 'task_input'):
    The field name for the task input data.
task_context_field (str, default 'task_context'):
    The field name for the task context data.
task_attachments_field (str, default 'task_attachments'):
    The field name for any task attachments.
task_output_field (str, default 'task_output'):
    The field name for the task output data.
gold_answer_field (str, default 'gold_answer'):
    The field name for the gold (ground truth) answer.
task_metadata_field (str, default 'task_metadata'):
    The field name for metadata associated with the task.
tags_field (str, default 'tags'):
    The field name for tags.
**kwargs (Any):
    Additional keyword arguments passed to pd.read_json. The "lines" parameter defaults to True if not provided.

Returns:

Dataset: A Dataset object containing the parsed and structured data.

Source code in src/patronus/datasets/datasets.py
def read_jsonl(
    filename_or_buffer: Union[str, pathlib.Path, typing.IO[typing.AnyStr]],
    *,
    dataset_id: Optional[str] = None,
    sid_field: str = "sid",
    system_prompt_field: str = "system_prompt",
    task_input_field: str = "task_input",
    task_context_field: str = "task_context",
    task_attachments_field: str = "task_attachments",
    task_output_field: str = "task_output",
    gold_answer_field: str = "gold_answer",
    task_metadata_field: str = "task_metadata",
    tags_field: str = "tags",
    **kwargs: typing.Any,
) -> Dataset:
    """
    Reads a JSONL (JSON Lines) file and transforms it into a Dataset object. This function
    parses the input data file or buffer in JSON Lines format into a structured format,
    extracting specified fields and additional metadata for usage in downstream tasks. The
    field mappings and additional keyword arguments can be customized to accommodate
    application-specific requirements.

    Args:
        filename_or_buffer: The path to the file or a file-like object containing the JSONL
            data to be read.
        dataset_id: An optional identifier for the dataset being read. Defaults to None.
        sid_field: The field name in the JSON lines representing the unique identifier for
            a sample. Defaults to "sid".
        system_prompt_field: The field name for the system prompt in the JSON lines file.
            Defaults to "system_prompt".
        task_input_field: The field name for the task input data in the JSON lines file.
            Defaults to "task_input".
        task_context_field: The field name for the task context data in the JSON lines file.
            Defaults to "task_context".
        task_attachments_field: The field name for any task attachments in the JSON lines
            file. Defaults to "task_attachments".
        task_output_field: The field name for task output data in the JSON lines file.
            Defaults to "task_output".
        gold_answer_field: The field name for the gold (ground truth) answer in the JSON
            lines file. Defaults to "gold_answer".
        task_metadata_field: The field name for metadata associated with the task in the
            JSON lines file. Defaults to "task_metadata".
        tags_field: The field name for tags in the parsed JSON lines file. Defaults to
            "tags".
        **kwargs: Additional keyword arguments to be passed to `pd.read_json` for
            customization. The "lines" parameter defaults to True if not
            provided.

    Returns:
        Dataset: A Dataset object containing the parsed and structured data.

    """
    kwargs.setdefault("lines", True)
    return _read_dataframe(
        pd.read_json,
        filename_or_buffer,
        dataset_id=dataset_id,
        sid_field=sid_field,
        system_prompt_field=system_prompt_field,
        task_context_field=task_context_field,
        task_attachments_field=task_attachments_field,
        task_input_field=task_input_field,
        task_output_field=task_output_field,
        gold_answer_field=gold_answer_field,
        task_metadata_field=task_metadata_field,
        tags_field=tags_field,
        **kwargs,
    )
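
Analogously for JSON Lines input, where each line of the file is one JSON object (a sketch; the filename and import path are illustrative assumptions):

from patronus.datasets import read_jsonl  # assumed import path

# Each line of the file is one JSON object, e.g.:
# {"sid": "1", "task_input": "What is 2 + 2?", "gold_answer": "4"}
dataset = read_jsonl("qa_pairs.jsonl", dataset_id="qa-pairs-jsonl")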

remote

DatasetNotFoundError

Bases: Exception

Raised when a dataset with the specified ID or name is not found.

RemoteDatasetLoader

RemoteDatasetLoader(
    by_name: Optional[str] = None,
    *,
    by_id: Optional[str] = None,
)

Bases: DatasetLoader

A loader for datasets stored remotely on the Patronus platform.

This class provides functionality to asynchronously load a dataset from the remote API by its name or identifier, handling the fetch operation lazily and ensuring it's only performed once. You can specify either the dataset name or ID, but not both.

Parameters:

by_name (Optional[str], default None):
    The name of the dataset to load.
by_id (Optional[str], default None):
    The ID of the dataset to load.
Source code in src/patronus/datasets/remote.py
def __init__(self, by_name: Optional[str] = None, *, by_id: Optional[str] = None):
    """
    Initializes a new RemoteDatasetLoader instance.

    Args:
        by_name: The name of the dataset to load.
        by_id: The ID of the dataset to load.
    """
    if not (bool(by_name) ^ bool(by_id)):
        raise ValueError("Either by_name or by_id must be provided, but not both.")

    self._dataset_name = by_name
    self._dataset_id = by_id
    super().__init__(self._load)
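
Typical usage, sketched for loading by name (the dataset name is illustrative and the import path is an assumption):

import asyncio

from patronus.datasets import RemoteDatasetLoader  # assumed import path

async def main() -> None:
    # Pass by_id=... instead to load by ID; exactly one of the two is allowed.
    loader = RemoteDatasetLoader(by_name="my-remote-dataset")
    dataset = await loader.load()  # fetched once, cached on repeated calls

asyncio.run(main())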