# User-Defined Evaluators
Evaluators are the core building blocks of Patronus's evaluation system. This page covers how to create and use your own custom evaluators to assess LLM outputs according to your specific criteria.
## Creating Basic Evaluators
The simplest way to create an evaluator is with the `@evaluator()` decorator:
```python
from patronus import evaluator

@evaluator()
def keyword_match(text: str, keywords: list[str]) -> float:
    """
    Evaluates whether the text contains the specified keywords.
    Returns a score between 0.0 and 1.0 based on the percentage of matched keywords.
    """
    matches = sum(keyword.lower() in text.lower() for keyword in keywords)
    return matches / len(keywords) if keywords else 0.0
```
This decorator automatically:

- Integrates with Patronus tracing
- Exports evaluation results to the Patronus Platform (sketched below)
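For example, a minimal sketch of the export flow, assuming the SDK's `patronus.init()` setup call (the `project_name` value here is illustrative):

```python
import patronus
from patronus import evaluator

# Initialize the SDK first so evaluator calls are traced and their
# results exported ("docs-example" is an illustrative project name).
patronus.init(project_name="docs-example")

@evaluator()
def exact_match(actual: str, expected: str) -> bool:
    return actual.strip().lower() == expected.strip().lower()

# Recorded as an evaluation and exported to the Patronus Platform
exact_match("Paris", "Paris")
```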
## Flexible Input and Output
User-defined evaluators can accept any parameters and return several types of results:
```python
# Boolean evaluator (pass/fail)
@evaluator()
def contains_answer(text: str, answer: str) -> bool:
    return answer.lower() in text.lower()

# Numeric evaluator (score)
@evaluator()
def semantic_similarity(text1: str, text2: str) -> float:
    # Simple example - in practice use proper semantic similarity
    words1, words2 = set(text1.lower().split()), set(text2.lower().split())
    intersection = words1.intersection(words2)
    union = words1.union(words2)
    return len(intersection) / len(union) if union else 0.0

# String evaluator
@evaluator()
def tone_classifier(text: str) -> str:
    positive = ['good', 'excellent', 'great', 'helpful']
    negative = ['bad', 'poor', 'unhelpful', 'wrong']
    pos_count = sum(word in text.lower() for word in positive)
    neg_count = sum(word in text.lower() for word in negative)
    if pos_count > neg_count:
        return "positive"
    elif neg_count > pos_count:
        return "negative"
    else:
        return "neutral"
```
## Return Types
Evaluators can return different types, which are automatically converted to `EvaluationResult` objects:

- Boolean: `True`/`False` indicating pass/fail
- Float/Integer: Numerical scores (typically between 0 and 1)
- String: Text output categorizing the result
- EvaluationResult: Complete evaluation with scores, explanations, etc.
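The conversion itself happens inside the SDK. As an illustrative sketch only (not the SDK's actual code), the mapping described above could look roughly like this:

```python
from patronus.evals import EvaluationResult

def coerce_to_result(value) -> EvaluationResult:
    """Illustrative only: one plausible mapping from raw evaluator
    return values to EvaluationResult, per the list above."""
    if isinstance(value, EvaluationResult):
        return value
    if isinstance(value, bool):  # check bool before int/float (bool is an int subclass)
        return EvaluationResult(pass_=value)
    if isinstance(value, (int, float)):
        return EvaluationResult(score=float(value))
    if isinstance(value, str):
        return EvaluationResult(text_output=value)
    raise TypeError(f"Unsupported evaluator return type: {type(value)!r}")
```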
## Using EvaluationResult
For more detailed evaluations, return an `EvaluationResult` object:
```python
from patronus import evaluator
from patronus.evals import EvaluationResult

@evaluator()
def comprehensive_evaluation(response: str, reference: str) -> EvaluationResult:
    # Example implementation - replace with actual logic
    has_keywords = all(word in response.lower() for word in ["important", "key", "concept"])
    accuracy = 0.85  # Calculated accuracy score

    return EvaluationResult(
        score=accuracy,  # Numeric score (typically 0-1)
        pass_=accuracy >= 0.7,  # Boolean pass/fail
        text_output="Satisfactory" if accuracy >= 0.7 else "Needs improvement",  # Category
        explanation=f"Response {'contains' if has_keywords else 'is missing'} key terms. Accuracy: {accuracy:.2f}",
        metadata={  # Additional structured data
            "has_required_keywords": has_keywords,
            "response_length": len(response),
            "accuracy": accuracy,
        },
    )
```
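Calling the evaluator returns the full object, so individual fields are directly accessible:

```python
result = comprehensive_evaluation(
    "This response covers the important key concept in detail.",
    "reference answer",
)
print(result.pass_)  # True
print(result.explanation)  # Response contains key terms. Accuracy: 0.85
```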
The `EvaluationResult` object can include:
- `score`: Numerical assessment (typically 0-1)
- `pass_`: Boolean pass/fail status
- `text_output`: Categorical or textual result
- `explanation`: Human-readable explanation of the result
- `metadata`: Additional structured data for analysis
- `tags`: Key-value pairs for filtering and organization (see the sketch below)
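For example, `tags` does not appear in the example above but attaches the same way (the tag values here are illustrative):

```python
from patronus.evals import EvaluationResult

result = EvaluationResult(
    score=0.92,
    pass_=True,
    text_output="Satisfactory",
    explanation="High overlap with the reference answer.",
    tags={"dataset": "smoke-test", "version": "v1"},  # illustrative tag values
)
```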
## Using Evaluators
Once defined, evaluators can be used directly:
```python
# Use evaluators as normal functions
result = keyword_match("The capital of France is Paris", ["capital", "France", "Paris"])
print(f"Score: {result}")  # Output: Score: 1.0

# Using a class-based evaluator (see the sketch below)
safety_check = ContentSafetyEvaluator()
result = safety_check.evaluate(
    task_output="This is a helpful and safe response."
)
print(f"Safety check passed: {result.pass_}")  # Output: Safety check passed: True
```