Evaluations
Add automated quality assessment to your AI applications using LLM-based evaluation
LLM Judge
Learn how to add automated quality assessment to your AI applications using LLM-based evaluation. The judge works in two phases:
- Define your evaluation structure and criteria
- Use the judge to assess AI responses
Install Dependencies
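The exact packages depend on your stack. A typical setup for the sketches below (a hedged assumption: the OpenAI Python client for the judge and SQLAlchemy for persistent storage) would be:

```bash
pip install openai sqlalchemy
```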
Define Your Evaluation Structure
Create `table.py`:
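A minimal sketch of what `table.py` might contain, assuming SQLAlchemy with a local SQLite database. The table and column names (`evaluations`, `prompt`, `response`, `criteria`, `score`, `feedback`) are illustrative choices, not a required schema:

```python
# table.py -- a minimal sketch, assuming SQLAlchemy and a local SQLite database.
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Evaluation(Base):
    """One row per judged response."""
    __tablename__ = "evaluations"

    id = Column(Integer, primary_key=True, autoincrement=True)
    prompt = Column(Text, nullable=False)      # prompt sent to the model under test
    response = Column(Text, nullable=False)    # model answer being judged
    criteria = Column(Text, nullable=False)    # evaluation criteria for this prompt
    score = Column(Integer, nullable=False)    # 1-10 score assigned by the judge
    feedback = Column(Text, nullable=False)    # judge's explanation of the score
    created_at = Column(DateTime, default=datetime.utcnow)


# SQLite keeps the example self-contained; swap the URL for your own database.
engine = create_engine("sqlite:///evaluations.db")
SessionLocal = sessionmaker(bind=engine)
Base.metadata.create_all(engine)
```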
Use Your Judge
Create `app.py`:
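A sketch of `app.py` under the same assumptions (the OpenAI Python client for both models, and the `table.py` sketch above for storage). The model names, judge prompt template, and JSON output format are illustrative, not a prescribed API:

```python
# app.py -- a minimal sketch, assuming the OpenAI Python client and table.py above.
import json

from openai import OpenAI

from table import Evaluation, SessionLocal

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Evaluate the response against the criteria.
Criteria: {criteria}
Prompt: {prompt}
Response: {response}
Reply with JSON: {{"score": <integer 1-10>, "feedback": "<short explanation>"}}"""


def generate_response(prompt: str) -> str:
    """Ask the model under test to answer the prompt."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content


def judge_response(prompt: str, response: str, criteria: str) -> dict:
    """Ask the judge model for a 1-10 score and feedback."""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            criteria=criteria, prompt=prompt, response=response)}],
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)


if __name__ == "__main__":
    prompt = "Explain what an index does in a relational database."
    criteria = "Accuracy, clarity, and a concrete example."

    response = generate_response(prompt)
    verdict = judge_response(prompt, response, criteria)

    # Persist the evaluation so scores can be tracked over time.
    with SessionLocal() as session:
        session.add(Evaluation(prompt=prompt, response=response, criteria=criteria,
                               score=int(verdict["score"]), feedback=verdict["feedback"]))
        session.commit()

    print(f"Score: {verdict['score']}/10\n{verdict['feedback']}")
```

Running the script prints the judge's score and feedback and appends a row to the evaluations table.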
Key Features
- Structured Evaluation: Define specific criteria for each prompt to ensure consistent evaluation standards
- Numerical Scoring: Get quantitative scores (1-10) along with qualitative feedback
- Detailed Feedback: Receive detailed explanations for each evaluation score
- Persistent Storage: Automatically store all evaluations for analysis and tracking
Customization Options
Evaluation Criteria
Customize the evaluation criteria based on your needs:
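For example (an illustrative pattern, not required by any particular library), you can pair each prompt with its own criteria so the judge always scores against the right rubric:

```python
# Illustrative: pair each prompt with the criteria the judge should apply to it.
from app import generate_response, judge_response  # helpers from the app.py sketch above

EVAL_SET = [
    {
        "prompt": "Summarize the key points of the attached support ticket.",
        "criteria": "Completeness, factual accuracy, and a neutral tone.",
    },
    {
        "prompt": "Write a SQL query that returns the ten most recent orders.",
        "criteria": "Correct syntax, sensible use of ORDER BY and LIMIT, stated assumptions.",
    },
]

for case in EVAL_SET:
    response = generate_response(case["prompt"])
    verdict = judge_response(case["prompt"], response, case["criteria"])
    print(case["prompt"], "->", verdict["score"])
```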
Scoring System
Modify the scoring system by updating the judge prompt template:
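As a sketch, switching from a 1-10 scale to a 1-5 rubric only requires changing the template text while keeping the output format parseable (the wording below is an illustrative variant of the `JUDGE_PROMPT` from the `app.py` sketch):

```python
# Illustrative variant of the judge prompt using a 1-5 rubric instead of 1-10.
JUDGE_PROMPT = """You are an impartial judge. Score the response against the criteria.
Criteria: {criteria}
Prompt: {prompt}
Response: {response}
Use this scale: 1 = unusable, 2 = poor, 3 = acceptable, 4 = good, 5 = excellent.
Reply with JSON: {{"score": <integer 1-5>, "feedback": "<short explanation>"}}"""
```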
Model Selection
Choose different models for response generation and evaluation:
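For instance (the model names are placeholders for whatever provider you use), a common pattern is a cheaper model for generating responses and a stronger model for judging them:

```python
# Illustrative: keep model choices in one place so they are easy to swap.
GENERATION_MODEL = "gpt-4o-mini"  # model whose answers are being evaluated
JUDGE_MODEL = "gpt-4o"            # typically a stronger model acts as the judge
```

In the `app.py` sketch, pass `GENERATION_MODEL` to the generation call and `JUDGE_MODEL` to the judging call so the two roles can be tuned independently.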
Best Practices
- Clear Criteria: Define specific, measurable criteria for each prompt
- Consistent Scale: Use a consistent scoring scale across all evaluations
- Detailed Feedback: Request specific explanations for scores
- Regular Monitoring: Track scores over time to identify patterns (see the sketch after this list)
- Iterative Improvement: Use feedback to refine prompts and criteria
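As a small sketch of the monitoring practice above (assuming the SQLAlchemy table from `table.py`), you can compute the average judge score per day to spot regressions:

```python
# Illustrative: average judge score per day, using the table.py sketch above.
from sqlalchemy import func

from table import Evaluation, SessionLocal

with SessionLocal() as session:
    rows = (
        session.query(func.date(Evaluation.created_at), func.avg(Evaluation.score))
        .group_by(func.date(Evaluation.created_at))
        .order_by(func.date(Evaluation.created_at))
        .all()
    )

for day, avg_score in rows:
    print(f"{day}: {avg_score:.1f}/10")
```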