Overview
Braintrust offers three types of scorers:

- Autoevals - Pre-built, battle-tested scorers for common evaluation tasks like factuality checking, semantic similarity, and format validation. Best for standard evaluation needs where reliable scorers already exist.
- LLM-as-a-judge - Use language models to evaluate outputs based on natural language criteria and instructions. Best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in deterministic code.
- Custom code - Write custom evaluation logic in TypeScript or Python with full control over the scoring algorithm. Best for specific business rules, pattern matching, or calculations unique to your use case.
You can create scorers in three ways:

- Inline in SDK code - Define scorers directly in your evaluation scripts for local development, access to complex dependencies, or application-specific logic that’s tightly coupled to your codebase.
- Pushed via CLI - Define scorers in code files and push them to Braintrust for version control in Git, team-wide sharing across projects, and automatic evaluation of production logs.
- Created in UI - Build scorers in the Braintrust web interface for non-technical users to create evaluations, rapid prototyping of scoring ideas, and simple LLM-as-a-judge scorers.
Score with autoevals
The autoevals library provides pre-built, battle-tested scorers for common evaluation tasks like factuality checking, semantic similarity, and format validation. Autoevals are open-source, deterministic (where possible), and optimized for speed and reliability. They can evaluate individual spans, but not entire traces.
Available scorers include:
- Factuality: Check if output contains factual information
- Semantic: Measure semantic similarity to expected output
- Levenshtein: Calculate edit distance from expected output
- JSON: Validate JSON structure and content
- SQL: Validate SQL query syntax and semantics
Use scorers inline in your evaluation code. Autoevals automatically receive these parameters when used in evaluations:

- input: The input to your task
- output: The output from your task
- expected: The expected output (optional)
- metadata: Custom metadata from the test case
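For instance, a minimal sketch of passing an autoeval into an evaluation (the project name, test data, and task are illustrative):

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("My project", {
  // Each test case supplies input and, optionally, expected and metadata.
  data: () => [{ input: "Which country is Paris in?", expected: "France" }],
  // The task produces the output that the scorer receives.
  task: async (input: string) => `The answer to "${input}" is France.`,
  // Autoevals like Factuality are passed directly in the scores array.
  scores: [Factuality],
});
```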
Score with LLMs
LLM-as-a-judge scorers use a language model to evaluate outputs based on natural language criteria. They are best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in code. They can evaluate individual spans, but not entire traces. Your prompt template can reference these variables:

- {{input}}: The input to your task
- {{output}}: The output from your task
- {{expected}}: The expected output (optional)
- {{metadata}}: Custom metadata from the test case
Use scorers inline in your evaluation code:
llm_scorer.eval.ts
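The contents of llm_scorer.eval.ts are not reproduced above. As a rough sketch, assuming autoevals' LLMClassifierFromTemplate helper (the rubric, project name, and test data are illustrative), an inline LLM-as-a-judge scorer can look like:

```typescript
import { Eval } from "braintrust";
import { LLMClassifierFromTemplate } from "autoevals";

// An LLM-as-a-judge scorer defined from a natural-language rubric.
const helpfulness = LLMClassifierFromTemplate({
  name: "Helpfulness",
  promptTemplate: `Rate how helpful the response is to the user's question.
Question: {{input}}
Response: {{output}}
Answer (a) if the response is helpful, or (b) if it is not.`,
  choiceScores: { a: 1, b: 0 },
  useCoT: true,
});

Eval("My project", {
  data: () => [{ input: "How do I reset my password?" }],
  task: async () => "Select 'Forgot password' on the sign-in page.",
  scores: [helpfulness],
});
```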
Score with custom code
Write custom evaluation logic in TypeScript or Python. Custom code scorers give you full control over the evaluation logic and can use any packages you need. They are best when you have specific rules, patterns, or calculations to implement. Custom code scorers can evaluate individual spans or entire traces.

Score spans
Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score.
Define scorers in code and push them to Braintrust. Your handler function receives these parameters:

- input: The input to your task
- output: The output from your task
- expected: The expected output (optional)
- metadata: Custom metadata from the test case

The handler returns a score and optional metadata. Define the scorer in a file such as code_scorer.ts, then push it to Braintrust.
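The original code_scorer.ts listing isn't shown here. A minimal sketch, assuming the SDK's project/scorer registration interface (the project name, slug, and matching logic are illustrative):

```typescript
import * as braintrust from "braintrust";

// Register the scorer under a project so it can be pushed and reused.
const project = braintrust.projects.create({ name: "My project" });

project.scorers.create({
  name: "Exact match",
  slug: "exact-match",
  // Handler receives input, output, expected, and metadata; returns a score.
  handler: ({ output, expected }: { output: string; expected?: string }) =>
    output === expected ? 1 : 0,
});
```

Then push the file with the CLI, for example `npx braintrust push code_scorer.ts`.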
Important notes for Python scorers:
- Scorers must be pushed from within their directory (e.g., braintrust push scorer.py); pushing with relative paths (e.g., braintrust push path/to/scorer.py) is unsupported and will cause import errors.
- Scorers using local imports must be defined at the project root.
- Braintrust uses uv to cross-bundle dependencies to Linux. This works for most binary dependencies, except libraries that require on-demand compilation.
TypeScript bundling
In TypeScript, Braintrust uses esbuild to bundle your code and dependencies. This works for most dependencies but does not support native (compiled) libraries like SQLite. If you have trouble bundling dependencies, file an issue in the braintrust-sdk repo.
Python external dependencies
Python scorers created via the CLI have these default packages:

- autoevals
- braintrust
- openai
- pydantic
- requests

For scorers with external dependencies (for example, scorer-with-deps.py), create a requirements file and push it with the --requirements flag.
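As an illustration (the package names are placeholders, and the exact argument form of the --requirements flag is an assumption):

```bash
# requirements.txt lists the extra packages the scorer imports, e.g.:
#   numpy
#   scikit-learn

# Push the scorer together with its requirements file:
braintrust push scorer-with-deps.py --requirements requirements.txt
```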
Score traces
Trace-level scorers evaluate entire execution traces, including all spans and conversation history. Use these for assessing multi-turn conversation quality, overall workflow completion, or when your scorer needs access to the full execution context. The scorer runs once per trace. Your handler function receives the trace parameter, which provides two methods for accessing execution data:

- trace.getThread() / trace.get_thread(): Returns an array of conversation messages extracted from LLM spans. Use for evaluating conversation quality and multi-turn interactions.
- trace.getSpans({ spanType: ["llm"] }) / trace.get_spans(span_type=["llm"]): Returns spans matching the filter. Each span includes input, output, metadata, span_id, and span_attributes. Omit the filter to get all spans, or pass multiple types like ["llm", "tool"].
Use scorers inline in your evaluation code:
trace_code_scorer.eval.ts
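The trace_code_scorer.eval.ts listing isn't shown above. A rough sketch, assuming inline scorers receive the trace parameter described earlier (the topicality rule, project name, and message shape are illustrative):

```typescript
import { Eval } from "braintrust";

// A trace-level scorer: runs once per trace and inspects the full execution.
async function staysOnTopic(args: any) {
  const { trace } = args;
  // Conversation messages extracted from LLM spans.
  const thread = await trace.getThread();
  // Only the LLM spans; omit the filter to get every span.
  const llmSpans = await trace.getSpans({ spanType: ["llm"] });
  if (llmSpans.length === 0) return 0;

  // Illustrative rule: fraction of assistant turns that mention "Acme".
  const assistantTurns = thread.filter((m: any) => m.role === "assistant");
  if (assistantTurns.length === 0) return 0;
  const onTopic = assistantTurns.filter((m: any) =>
    String(m.content).toLowerCase().includes("acme")
  );
  return onTopic.length / assistantTurns.length;
}

Eval("My project", {
  data: () => [{ input: "Tell me about Acme's pricing tiers." }],
  task: async (input: string) => `Here is what I know about "${input}": Acme offers three pricing tiers...`,
  scores: [staysOnTopic],
});
```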
Set pass thresholds
Define minimum acceptable scores to automatically mark results as passing or failing. When configured, scores that meet or exceed the threshold are marked as passing (green highlighting with checkmark), while scores below are marked as failing (red highlighting).
Add __pass_threshold to the scorer’s metadata (a value between 0 and 1). Example with a custom code scorer:
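A minimal sketch, assuming a scorer that returns a score object with a metadata field (the scorer name and length rule are illustrative):

```typescript
// Results with a score at or above __pass_threshold are marked as passing.
function conciseness({ output }: { output: string }) {
  return {
    name: "conciseness",
    score: output.length <= 280 ? 1 : 0,
    metadata: { __pass_threshold: 0.8 },
  };
}
```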
Create reusable scorers

Scorers pushed via the CLI or created in the UI are reusable: they can be shared across evaluations and projects rather than living in a single script.
Test scorers
Scorers need to be developed iteratively against real data. When creating or editing a scorer in the UI, use the Run section to test your scorer with data from different sources. Each variable source populates the scorer’s input parameters (like input, output, expected, metadata) from a different location.
Test with manual input
Best for initial development when you have a specific example in mind. Use this to quickly prototype and verify basic scorer logic before testing on larger datasets.

- Select Editor in the Run section.
- Enter values for the input, output, expected, and metadata fields.
- Click Test to see how your scorer evaluates the example.
- Iterate on your scorer logic based on the results.
Test with a dataset
Best for testing specific scenarios, edge cases, or regression testing. Use this when you want controlled, repeatable test cases or need to ensure your scorer handles specific situations correctly.

- Select Dataset in the Run section.
- Choose a dataset from your project.
- Select a record to test with.
- Click Test to see how your scorer evaluates the example.
- Review results to identify patterns and edge cases.
Test with logs
Best for testing against actual usage patterns and debugging real-world edge cases. Use this when you want to see how your scorer performs on data your system is actually generating.

- Select Logs in the Run section.
- Select the project containing the logs you want to test against.
- Filter logs to find relevant examples:
- Click Add filter and choose just root spans, specific span names, or a more advanced filter based on specific input, output, metadata, or other values.
- Select a timeframe.
- Click Test to see how your scorer evaluates real production data.
- Identify cases where the scorer needs adjustment for real-world scenarios.
Scorer permissions
Both LLM-as-a-judge scorers and custom code scorers automatically receive a BRAINTRUST_API_KEY environment variable that allows them to:
- Make LLM calls using organization and project AI secrets
- Access attachments from the current project
- Read and write logs to the current project
- Read prompts from the organization
To provide additional environment variables to your scorers, use the PUT /v1/env_var endpoint.
Optimize with Loop
Generate and improve scorers using Loop. Example queries:

- “Write an LLM-as-a-judge scorer for a chatbot that answers product questions”
- “Generate a code-based scorer based on project logs”
- “Optimize the Helpfulness scorer”
- “Adjust the scorer to be more lenient”
Best practices
Start with autoevals: Use pre-built scorers when they fit your needs. They’re well-tested and reliable.
Be specific: Define clear evaluation criteria in your scorer prompts or code.
Use multiple scorers: Measure different aspects (factuality, helpfulness, tone) with separate scorers.
Choose the right scope: Use trace scorers (custom code with the trace parameter) for multi-step workflows and agents. Use output scorers for simple quality checks.
Test scorers: Run scorers on known examples to verify they behave as expected.
Version scorers: Like prompts, scorers are versioned automatically. Track what works.
Balance cost and quality: LLM-as-a-judge scorers are more flexible but cost more and take longer than custom code scorers.
Next steps
- Run evaluations using your scorers
- Interpret results to understand scores
- Write prompts to guide model behavior
- Use playgrounds to test scorers interactively