Evaluation of Large Language Models
Learn how to evaluate large language models for accuracy, safety, alignment, and performance using human and automated metrics to ensure reliable, ethical, and high-quality AI systems.
Overview
Rationale for LLM and Agent Evaluation
Components of LLM Evaluation
Tasks and Benchmark Datasets for Evaluation
Challenges in LLM Evaluation
Quiz: LLM Evaluation Fundamentals
Classic and Contextual Embedding Approaches
BLEU, ROUGE, and BERTScore
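As a quick preview of the n-gram metrics covered in this lesson, the sketch below scores one candidate sentence against one reference with sentence-level BLEU and ROUGE. It assumes the `nltk` and `rouge-score` packages are installed; the sentences are illustrative only.

```python
# Minimal sketch of classic n-gram overlap metrics (assumes nltk and rouge-score are installed).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

# Sentence-level BLEU with smoothing, since short texts often have no higher-order n-gram matches.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L between the reference and the candidate.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```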
Evaluating RAG-Based Applications
Faithfulness
Answer Relevancy
Context Precision
Context Recall
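As a rough summary of how RAGAS defines these four metrics (paraphrased from the RAGAS documentation; exact formulations vary across versions):

```latex
\text{Faithfulness} = \frac{\text{claims in the answer supported by the retrieved context}}{\text{total claims in the answer}}

\text{Answer Relevancy} = \frac{1}{N}\sum_{i=1}^{N}\cos\big(E(q_i),\, E(q)\big),
\quad q_i:\ \text{questions generated from the answer},\ q:\ \text{the original question}

\text{Context Precision@}K = \frac{\sum_{k=1}^{K}\text{Precision@}k \cdot v_k}{\text{number of relevant chunks in the top }K},
\quad v_k \in \{0,1\}\ \text{indicates whether chunk }k\ \text{is relevant}

\text{Context Recall} = \frac{\text{ground-truth sentences attributable to the retrieved context}}{\text{total sentences in the ground truth}}
```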
Evaluation of RAG Applications Using RAGAS
Evaluating a RAG Application Using RAGAS
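The RAGAS lessons can be previewed with a short script. A minimal sketch, assuming the `ragas` and `datasets` packages are installed and an `OPENAI_API_KEY` is set (RAGAS calls an LLM and an embedding model behind the scenes); column names such as `ground_truth` differ across RAGAS versions, and the sample data below is invented:

```python
# Minimal RAGAS evaluation sketch (column names and API details vary by ragas version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

data = {
    "question": ["Who developed the theory of relativity?"],
    "answer": ["Albert Einstein developed the theory of relativity."],
    "contexts": [["Albert Einstein published the special theory of relativity in 1905."]],
    "ground_truth": ["Albert Einstein developed the theory of relativity."],
}

dataset = Dataset.from_dict(data)

# Runs LLM- and embedding-based judgments for each metric; requires OPENAI_API_KEY by default.
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```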
LLM-as-a-Judge Evaluation
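LLM-as-a-judge evaluation asks a strong model to grade another model's output against a rubric. A minimal sketch, assuming the `openai` Python SDK with an `OPENAI_API_KEY` in the environment; the judge model name and rubric below are placeholders, not the course's exact setup:

```python
# Minimal LLM-as-a-judge sketch using the OpenAI chat completions API.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are an impartial judge. Rate the answer to the question on a 1-5 scale for "
    "correctness and helpfulness. Respond with JSON: {\"score\": <int>, \"reason\": <string>}."
)

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    """Ask a judge model to grade an answer and return its parsed JSON verdict."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

print(judge("What is the capital of France?", "Paris is the capital of France."))
```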
Classic Evaluation Metrics
Semantic Similarity With BERTScore
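The semantic-similarity lesson can be reproduced with the `bert-score` package. A small sketch (the sentence pair is illustrative; the default English model is downloaded on first use):

```python
# Minimal BERTScore sketch (assumes the bert-score package is installed).
from bert_score import score

candidates = ["The weather is freezing today."]
references = ["It is very cold outside today."]

# Returns precision, recall, and F1 tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```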
Slide Deck
OpenAI API Key Setup
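The hands-on lessons that call OpenAI models expect an API key to be available before anything runs. One common pattern, shown as a Python check (the shell export in the comment is the usual way to set the key; the key value is a placeholder):

```python
# Verify the OpenAI API key is configured before running the evaluation notebooks.
import os
from openai import OpenAI

# Typically exported beforehand in the shell, e.g.  export OPENAI_API_KEY="sk-..."
assert "OPENAI_API_KEY" in os.environ, "Set OPENAI_API_KEY before running the notebooks."

client = OpenAI()                       # picks up OPENAI_API_KEY automatically
print(client.models.list().data[0].id)  # quick connectivity check
```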