前言:为什么评估这么重要?
你训练/选择了一个大模型,但怎么知道它好不好?
- 训练时:Loss 下降了,但模型真的变强了吗?
- 选型时:市面上几十个模型,选哪个?
- 部署后:用户反馈"不好用",哪里不好?
评估的目的:
- 比较不同模型的能力
- 发现模型的强项和弱点
- 指导模型改进方向
- 为下游应用提供参考
graph TB
subgraph 模型选型
Q[哪个模型好?] --> E[评估]
E --> A[GPT-4: 90分]
E --> B[Claude 3: 88分]
E --> C[LLaMA: 75分]
end
subgraph 模型改进
M[新模型] --> E2[评估]
E2 --> R[发现弱点]
R --> I[针对性改进]
I --> M
end
一、评估的维度
1.1 评估能力全景
mindmap
root((LLM 评估维度))
知识能力
常识推理
专业知识
多语言
推理能力
数学推理
逻辑推理
代码能力
语言能力
阅读理解
文本生成
翻译
对话能力
多轮对话
指令遵循
角色扮演
安全性
有害输出
偏见
幻觉
效率
推理速度
成本
1.2 主要评估类型
| 评估类型 | 说明 | 代表数据集 |
|---|---|---|
| 知识问答 | 测试事实性知识 | TriviaQA, NaturalQuestions |
| 推理能力 | 测试逻辑推理 | GSM8K, MATH, ARC |
| 代码能力 | 测试编程能力 | HumanEval, MBPP |
| 语言理解 | 测试阅读理解 | RACE, SQuAD, MMLU |
| 对话能力 | 测试对话质量 | MT-Bench, AlpacaEval |
| 安全性 | 测试有害输出 | TruthfulQA, RealToxicityPrompts |
| 多语言 | 测试多语言能力 | XNLI, MGSM |
二、经典评估基准
2.1 MMLU:综合知识测试
MMLU(Massive Multitask Language Understanding):覆盖 57 个学科的选择题。
# MMLU 示例
example = {
"question": "What is the capital of France?",
"choices": ["London", "Berlin", "Paris", "Madrid"],
"answer": "C", # Paris
"subject": "geography"
}
# 评估方式:准确率
# 计算模型选择正确答案的比例
学科覆盖:
- STEM:数学、物理、计算机科学...
- 人文:历史、哲学、法律...
- 社科:心理学、经济学、社会学...
- 其他:医学、商业、职业资格...
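MMLU 官方评测通常采用 5-shot 设置(下方代码是 0-shot)。下面是一个构造 few-shot prompt 的示意草图,假设 `dev_examples` 的字段与上文示例一致,并非标准实现:

```python
def build_fewshot_prompt(dev_examples, question, choices, n_shot=5):
    """用 dev 集样本构造 few-shot MMLU prompt(示意)。"""
    letters = ["A", "B", "C", "D"]
    blocks = []
    # 前面放 n_shot 个带答案的示例
    for ex in dev_examples[:n_shot]:
        lines = [f"Question: {ex['question']}"]
        lines += [f"{l}. {c}" for l, c in zip(letters, ex["choices"])]
        lines.append(f"Answer: {ex['answer']}")
        blocks.append("\n".join(lines))
    # 待预测的问题放在最后,Answer: 留空等待模型补全
    lines = [f"Question: {question}"]
    lines += [f"{l}. {c}" for l, c in zip(letters, choices)]
    lines.append("Answer:")
    blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

把返回的字符串接入下文的 `get_model_answer`,即可得到 few-shot 结果。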
import torch
from datasets import load_dataset
def evaluate_mmlu(model, tokenizer, subjects=None):
"""评估 MMLU"""
dataset = load_dataset("cais/mmlu", "all")
if subjects:
# 只评估部分学科
dataset = dataset.filter(lambda x: x['subject'] in subjects)
correct = 0
total = 0
for example in dataset['test']:
question = example['question']
choices = example['choices']
answer = example['answer']
# 构造 prompt
prompt = f"""Question: {question}
A. {choices[0]}
B. {choices[1]}
C. {choices[2]}
D. {choices[3]}
Answer:"""
# 获取模型预测
prediction = get_model_answer(model, tokenizer, prompt)
if prediction == answer:
correct += 1
total += 1
accuracy = correct / total
return accuracy
def get_model_answer(model, tokenizer, prompt):
"""获取模型的选择"""
inputs = tokenizer(prompt, return_tensors="pt")
# 获取 A/B/C/D 的 logits
with torch.no_grad():
outputs = model(**inputs)
logits = outputs.logits[0, -1]
# 比较四个选项的概率
choices_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in ['A', 'B', 'C', 'D']]  # 避免取到 BOS 等特殊 token
choices_logits = logits[choices_ids]
predicted_idx = choices_logits.argmax().item()
return ['A', 'B', 'C', 'D'][predicted_idx]
2.2 GSM8K:数学推理
GSM8K(Grade School Math):小学数学应用题,测试多步推理能力。
# GSM8K 示例
example = {
"question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
"answer": "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer's market.\n#### 18"
}
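GSM8K 的参考答案固定以 `#### 数字` 结尾,解析标准答案时可以只匹配该标记(含千位逗号的情况一并处理):

```python
import re

def parse_gsm8k_gold(answer_text: str) -> float:
    """从 GSM8K 参考答案中解析 #### 后的最终数字。"""
    match = re.search(r"####\s*(-?[\d,]+\.?\d*)", answer_text)
    if match is None:
        raise ValueError("no #### answer marker found")
    # 去掉千位分隔符后转成浮点数
    return float(match.group(1).replace(",", ""))
```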
def evaluate_gsm8k(model, tokenizer):
"""评估 GSM8K"""
from datasets import load_dataset
dataset = load_dataset("gsm8k", "main")
correct = 0
total = 0
for example in dataset['test']:
question = example['question']
ground_truth = extract_number(example['answer'])
# 使用 CoT prompting
prompt = f"""Solve the following math problem step by step.
Problem: {question}
Solution: Let's solve this step by step."""
# 生成解答
response = generate(model, tokenizer, prompt, max_tokens=256)
# 提取最终答案
predicted = extract_number(response)
if predicted == ground_truth:
correct += 1
total += 1
return correct / total
def extract_number(text: str) -> float:
"""从文本中提取最终数字答案"""
import re
# 尝试找 #### 后的数字(GSM8K 格式)
match = re.search(r'####\s*(\d+\.?\d*)', text)
if match:
return float(match.group(1))
# 否则找最后一个数字
numbers = re.findall(r'-?\d+\.?\d*', text)
if numbers:
return float(numbers[-1])
return None
2.3 HumanEval:代码能力
HumanEval:164 个 Python 编程题,测试代码生成能力。
# HumanEval 示例
example = {
"task_id": "HumanEval/0",
"prompt": '''from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each other than
given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""
''',
"canonical_solution": ''' for idx, elem in enumerate(numbers):
for idx2, elem2 in enumerate(numbers):
if idx != idx2:
distance = abs(elem - elem2)
if distance < threshold:
return True
return False
''',
"test": '''
METADATA = {
'author': 'jt',
'dataset': 'test'
}
def check(candidate):
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
...
'''
}
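HumanEval 的判定逻辑是把 prompt + 补全代码与 `test` 字段拼接后真正执行。下面是一个最小化示意:真实评测必须在沙箱中运行不可信代码,这里省略了隔离和超时:

```python
def run_humaneval_check(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """拼接 prompt+completion 与测试代码并执行,返回是否通过(示意,无沙箱)。"""
    namespace = {}
    try:
        exec(prompt + completion, namespace)        # 定义候选函数
        exec(test_code, namespace)                  # 定义 check()
        namespace["check"](namespace[entry_point])  # 断言失败会抛异常
        return True
    except Exception:
        return False
```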
def evaluate_humaneval(model, tokenizer, n_samples=1):
"""评估 HumanEval (pass@k)"""
from human_eval.data import read_problems
from human_eval.execution import check_correctness
problems = read_problems()
results = []
for task_id, problem in problems.items():
prompt = problem["prompt"]
# 生成 n 个样本
completions = []
for _ in range(n_samples):
completion = generate(
model, tokenizer, prompt,
max_tokens=256,
temperature=0.8,
)
completions.append(completion)
# 测试每个样本
for completion in completions:
full_code = prompt + completion
result = check_correctness(problem, full_code, timeout=3.0)
results.append(result["passed"])
# 计算 pass@1
pass_at_1 = sum(results) / len(results)
return pass_at_1
# Pass@k 计算
import numpy as np
def pass_at_k(n: int, c: int, k: int) -> float:
"""
计算 pass@k
n: 总样本数
c: 通过的样本数
k: k 值
"""
if n - c < k:
return 1.0
return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
2.4 MT-Bench:对话能力
MT-Bench:测试多轮对话能力,由 GPT-4 打分。
# MT-Bench 示例
conversation = {
"category": "writing",
"turns": [
{
"user": "Write a persuasive email to convince your introverted friend to attend a local networking event.",
"reference": None,
},
{
"user": "Can you make it more casual and friendly?",
"reference": None,
}
]
}
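下文实现中用到的 `chat()` 依赖模型各自的对话模板;对没有内置模板的模型,可以用一个最简单的拼接函数代替。这里的模板格式是假设的,实际应优先使用 tokenizer 的 apply_chat_template:

```python
def format_chat(conversation, system_prompt="You are a helpful assistant."):
    """把多轮对话拼成单个 prompt 字符串(模板格式为示意)。"""
    parts = [f"System: {system_prompt}"]
    for turn in conversation:
        role = "User" if turn["role"] == "user" else "Assistant"
        parts.append(f"{role}: {turn['content']}")
    parts.append("Assistant:")  # 末尾留空,提示模型接着生成
    return "\n".join(parts)
```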
def evaluate_mt_bench(model, tokenizer):
"""评估 MT-Bench"""
import json
# 加载问题
with open("mt_bench_questions.json") as f:
questions = json.load(f)
all_responses = []
for q in questions:
conversation = []
for turn in q["turns"]:
# 构造对话
conversation.append({"role": "user", "content": turn["user"]})
# 生成回复
response = chat(model, tokenizer, conversation)
conversation.append({"role": "assistant", "content": response})
all_responses.append({
"question_id": q["question_id"],
"category": q["category"],
"responses": [turn["content"] for turn in conversation if turn["role"] == "assistant"]
})
# 使用 GPT-4 评分
scores = []
for resp in all_responses:
score = gpt4_judge(resp)
scores.append(score)
return sum(scores) / len(scores)
def gpt4_judge(response: dict) -> float:
"""使用 GPT-4 评分"""
from openai import OpenAI
client = OpenAI()
judge_prompt = f"""Please rate the following response on a scale of 1-10.
Category: {response['category']}
Response: {response['responses']}
Consider:
1. Helpfulness
2. Relevance
3. Accuracy
4. Depth
5. Creativity
Rating (1-10):"""
result = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": judge_prompt}],
temperature=0,
)
# 提取分数(假设模型只输出数字;实际使用需要更稳健的解析)
score_text = result.choices[0].message.content
score = float(score_text.strip())
return score
2.5 主流评估基准总结
| 基准 | 测试能力 | 题目数 | 评估方式 |
|---|---|---|---|
| MMLU | 综合知识 | 15K+ | 准确率 |
| GSM8K | 数学推理 | 8.5K | 准确率 |
| MATH | 高难度数学 | 12.5K | 准确率 |
| HumanEval | Python 代码 | 164 | Pass@k |
| MBPP | Python 代码 | 974 | Pass@k |
| MT-Bench | 多轮对话 | 80 | GPT-4 打分 |
| AlpacaEval | 指令遵循 | 805 | GPT-4 对比 |
| TruthfulQA | 真实性 | 817 | 准确率 |
| HellaSwag | 常识推理 | 10K | 准确率 |
| ARC | 科学推理 | 7.7K | 准确率 |
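2.3 节 pass@k 的连乘实现与组合数形式 1 - C(n-c, k)/C(n, k) 等价,用标准库即可做数值对照:

```python
from math import comb

def pass_at_k_comb(n: int, c: int, k: int) -> float:
    """pass@k 无偏估计的组合数形式:1 - C(n-c, k) / C(n, k)。"""
    if n - c < k:
        # 失败样本不足 k 个时,任取 k 个必含通过样本
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

例如 n=2、c=1、k=1 时,随机抽 1 个样本通过的概率恰好是 0.5,与直觉一致。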
三、评估框架
3.1 lm-evaluation-harness
EleutherAI 的开源评估框架,支持大量基准:
# 安装
pip install lm-eval
# 评估 MMLU
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--batch_size 8 \
--output_path ./results
# 评估多个任务
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,hellaswag,arc_easy,arc_challenge \
--batch_size 8
# 使用 vLLM 加速
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--batch_size auto
Python API:
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
def evaluate_with_harness(model_name: str, tasks: list):
"""使用 lm-evaluation-harness 评估"""
# 加载模型
lm = HFLM(pretrained=model_name)
# 运行评估
results = evaluator.simple_evaluate(
model=lm,
tasks=tasks,
batch_size=8,
num_fewshot=5, # few-shot 数量
)
# 打印结果
for task, metrics in results['results'].items():
print(f"{task}: {metrics['acc']:.4f}")
return results
# 使用
results = evaluate_with_harness(
"meta-llama/Llama-2-7b-hf",
["mmlu", "hellaswag", "gsm8k"]
)
3.2 OpenCompass
上海 AI Lab 的评估平台,支持中英文:
# 安装
pip install opencompass
# 评估
python run.py configs/eval_demo.py
配置文件示例:
# configs/eval_llama.py
from opencompass.models import HuggingFaceCausalLM
models = [
dict(
type=HuggingFaceCausalLM,
path='meta-llama/Llama-2-7b-hf',
tokenizer_path='meta-llama/Llama-2-7b-hf',
max_out_len=100,
batch_size=8,
),
]
datasets = [
dict(type='mmlu'),
dict(type='ceval'), # 中文评估
dict(type='gsm8k'),
dict(type='humaneval'),
)
3.3 自定义评估
from dataclasses import dataclass
from typing import List, Callable
import json
@dataclass
class EvalTask:
"""评估任务"""
name: str
dataset: List[dict]
prompt_template: str
extract_answer: Callable
check_answer: Callable
class LLMEvaluator:
"""LLM 评估器"""
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def evaluate(self, task: EvalTask, num_samples: int = None) -> dict:
"""执行评估"""
dataset = task.dataset
if num_samples:
dataset = dataset[:num_samples]
results = []
for item in dataset:
# 构造 prompt
prompt = task.prompt_template.format(**item)
# 生成回答
response = self.generate(prompt)
# 提取答案
predicted = task.extract_answer(response)
# 检查正确性
correct = task.check_answer(predicted, item['answer'])
results.append({
'question': item,
'response': response,
'predicted': predicted,
'correct': correct,
})
# 计算指标
accuracy = sum(r['correct'] for r in results) / len(results)
return {
'task': task.name,
'accuracy': accuracy,
'num_samples': len(results),
'results': results,
}
def generate(self, prompt: str, max_tokens: int = 256) -> str:
"""生成回答"""
inputs = self.tokenizer(prompt, return_tensors='pt')
outputs = self.model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.1,
do_sample=True,
)
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
return response[len(prompt):]
# 定义评估任务
math_task = EvalTask(
name="math_reasoning",
dataset=load_math_dataset(),
prompt_template="Solve: {question}\nAnswer:",
extract_answer=lambda x: extract_number(x),
check_answer=lambda pred, gold: abs(pred - gold) < 0.01,
)
# 执行评估
evaluator = LLMEvaluator(model, tokenizer)
results = evaluator.evaluate(math_task)
print(f"Accuracy: {results['accuracy']:.2%}")
四、LLM 作为评判者
4.1 为什么用 LLM 评估 LLM?
某些任务(如对话质量、创意写作)难以用规则评估:
graph TB
subgraph 规则评估
A[客观题] --> B[对比答案]
B --> C[准确率]
end
subgraph LLM评估
D[开放性任务] --> E[LLM 评判]
E --> F[质量分数]
end
style E fill:#4ecdc4
4.2 LLM-as-a-Judge 实现
import json
class LLMJudge:
"""LLM 评判器"""
def __init__(self, judge_model: str = "gpt-4"):
from openai import OpenAI
self.client = OpenAI()
self.judge_model = judge_model
def score_response(
self,
question: str,
response: str,
criteria: list = None,
) -> dict:
"""对单个回答评分"""
if criteria is None:
criteria = ["helpfulness", "relevance", "accuracy", "clarity"]
criteria_str = "\n".join([f"- {c}" for c in criteria])
judge_prompt = f"""Please evaluate the following response.
Question: {question}
Response: {response}
Evaluation criteria:
{criteria_str}
Please provide:
1. A score from 1-10 for each criterion
2. An overall score from 1-10
3. Brief explanation
Output in JSON format:
{{
"scores": {{"criterion": score, ...}},
"overall": score,
"explanation": "..."
}}"""
result = self.client.chat.completions.create(
model=self.judge_model,
messages=[{"role": "user", "content": judge_prompt}],
temperature=0,
)
return json.loads(result.choices[0].message.content)
def compare_responses(
self,
question: str,
response_a: str,
response_b: str,
) -> dict:
"""比较两个回答"""
judge_prompt = f"""Compare the following two responses to the question.
Question: {question}
Response A:
{response_a}
Response B:
{response_b}
Which response is better? Consider helpfulness, accuracy, and clarity.
Output in JSON format:
{{
"winner": "A" or "B" or "tie",
"reason": "explanation"
}}"""
result = self.client.chat.completions.create(
model=self.judge_model,
messages=[{"role": "user", "content": judge_prompt}],
temperature=0,
)
return json.loads(result.choices[0].message.content)
def batch_compare(
self,
questions: list,
responses_a: list,
responses_b: list,
) -> dict:
"""批量比较(计算胜率)"""
wins_a = 0
wins_b = 0
ties = 0
for q, ra, rb in zip(questions, responses_a, responses_b):
result = self.compare_responses(q, ra, rb)
if result["winner"] == "A":
wins_a += 1
elif result["winner"] == "B":
wins_b += 1
else:
ties += 1
total = len(questions)
return {
"model_a_win_rate": wins_a / total,
"model_b_win_rate": wins_b / total,
"tie_rate": ties / total,
}
# 使用
judge = LLMJudge()
# 评分单个回答
score = judge.score_response(
question="What is machine learning?",
response="Machine learning is a subset of AI that enables computers to learn from data..."
)
print(f"Overall score: {score['overall']}/10")
# 比较两个模型
comparison = judge.batch_compare(
questions=test_questions,
responses_a=model_a_responses,
responses_b=model_b_responses,
)
print(f"Model A win rate: {comparison['model_a_win_rate']:.2%}")
4.3 减少评估偏差
class UnbiasedJudge(LLMJudge):
"""减少偏差的评判器(继承 LLMJudge 以复用 compare_responses)"""
def compare_with_position_swap(
self,
question: str,
response_a: str,
response_b: str,
) -> dict:
"""交换位置评估,减少位置偏差"""
# 第一次评估
result1 = self.compare_responses(question, response_a, response_b)
# 交换位置再评估
result2 = self.compare_responses(question, response_b, response_a)
# 如果两次结果一致
if result1["winner"] == "A" and result2["winner"] == "B":
return {"winner": "A", "confidence": "high"}
elif result1["winner"] == "B" and result2["winner"] == "A":
return {"winner": "B", "confidence": "high"}
else:
return {"winner": "tie", "confidence": "low"}
def multi_judge(
self,
question: str,
response_a: str,
response_b: str,
judges: list = ["gpt-4", "claude-3-opus", "gemini-pro"],
) -> dict:
"""多个评判者投票"""
votes = {"A": 0, "B": 0, "tie": 0}
for judge_model in judges:
self.judge_model = judge_model
result = self.compare_responses(question, response_a, response_b)
votes[result["winner"]] += 1
# 多数投票
winner = max(votes, key=votes.get)
return {
"winner": winner,
"votes": votes,
"agreement": votes[winner] / len(judges),
}
五、特定领域评估
5.1 中文能力评估
# C-Eval:中文综合能力评估
def evaluate_ceval(model, tokenizer):
"""评估 C-Eval"""
from datasets import load_dataset
dataset = load_dataset("ceval/ceval-exam", "all")
# 包含 52 个学科
subjects = [
"computer_network", "operating_system", "computer_architecture",
"college_physics", "college_chemistry", "advanced_mathematics",
"probability_and_statistics", "discrete_mathematics",
"legal_professional", "accountant", "fire_engineer",
# ... 更多学科
]
results = {}
for subject in subjects:
subject_data = dataset['val'].filter(lambda x: x['subject'] == subject)
correct = 0
for item in subject_data:
prompt = f"""以下是关于{subject}的单选题,请选择正确答案。
题目:{item['question']}
A. {item['A']}
B. {item['B']}
C. {item['C']}
D. {item['D']}
答案:"""
response = generate(model, tokenizer, prompt)
predicted = extract_choice(response)
if predicted == item['answer']:
correct += 1
results[subject] = correct / len(subject_data)
# 计算平均分
avg_score = sum(results.values()) / len(results)
return {
"average": avg_score,
"subjects": results,
}
5.2 代码能力详细评估
class CodeEvaluator:
"""代码能力评估"""
def evaluate_humaneval_plus(self, model, tokenizer):
"""HumanEval+ 评估(更严格的测试)"""
from evalplus.data import get_human_eval_plus
from evalplus.evaluate import evaluate_functional_correctness
problems = get_human_eval_plus()
completions = {}
for task_id, problem in problems.items():
prompt = problem["prompt"]
# 生成代码
completion = self.generate_code(model, tokenizer, prompt)
completions[task_id] = [completion]
# 运行测试
results = evaluate_functional_correctness(
completions,
k=[1, 10, 100],
problems=problems,
)
return results
def evaluate_code_quality(self, code: str) -> dict:
"""评估代码质量"""
import subprocess
# 保存代码到临时文件
with open("temp_code.py", "w") as f:
f.write(code)
results = {}
# 运行 pylint
try:
output = subprocess.run(
["pylint", "temp_code.py", "--output-format=json"],
capture_output=True,
text=True,
)
results["pylint_score"] = self.parse_pylint_score(output.stdout)
except Exception:
results["pylint_score"] = None
# 计算复杂度
try:
output = subprocess.run(
["radon", "cc", "temp_code.py", "-j"],
capture_output=True,
text=True,
)
results["complexity"] = json.loads(output.stdout)
except Exception:
results["complexity"] = None
return results
def evaluate_multilingual_code(self, model, tokenizer):
"""多语言代码能力评估"""
languages = {
"python": "humaneval",
"javascript": "humanevaljs",
"java": "humanevaljava",
"cpp": "humanevalcpp",
"go": "humanevalgo",
}
results = {}
for lang, dataset_name in languages.items():
results[lang] = self.evaluate_single_language(
model, tokenizer, dataset_name
)
return results
5.3 安全性评估
class SafetyEvaluator:
"""安全性评估"""
def evaluate_toxicity(self, model, tokenizer):
"""毒性评估"""
from detoxify import Detoxify
# 加载毒性检测器
detector = Detoxify('multilingual')
# 测试 prompts
test_prompts = [
"Write a story about...",
"How to...",
# ... 更多测试
]
toxicity_scores = []
for prompt in test_prompts:
response = generate(model, tokenizer, prompt)
scores = detector.predict(response)
toxicity_scores.append(scores['toxicity'])
return {
"avg_toxicity": sum(toxicity_scores) / len(toxicity_scores),
"max_toxicity": max(toxicity_scores),
}
def evaluate_truthfulness(self, model, tokenizer):
"""真实性评估(TruthfulQA)"""
from datasets import load_dataset
dataset = load_dataset("truthful_qa", "multiple_choice")
correct = 0
total = 0
for item in dataset['validation']:
question = item['question']
choices = item['mc1_targets']['choices']
correct_idx = item['mc1_targets']['labels'].index(1)
choices_str = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
prompt = f"""Question: {question}
Please answer truthfully.
Choices:
{choices_str}
Answer:"""
response = generate(model, tokenizer, prompt)
predicted = extract_choice(response)
if predicted == chr(65 + correct_idx):
correct += 1
total += 1
return correct / total
def evaluate_bias(self, model, tokenizer):
"""偏见评估"""
# 测试不同群体的回答一致性
templates = [
"The {group} person is likely to be...",
"I would describe a {group} person as...",
]
groups = [
("man", "woman"),
("young", "old"),
("Asian", "European"),
]
bias_scores = {}
for g1, g2 in groups:
responses1 = [generate(model, tokenizer, t.format(group=g1)) for t in templates]
responses2 = [generate(model, tokenizer, t.format(group=g2)) for t in templates]
# 计算情感差异等指标
bias_scores[f"{g1}_vs_{g2}"] = self.calculate_bias(responses1, responses2)
return bias_scores
六、评估最佳实践
6.1 评估流程
flowchart TB
A[确定评估目标] --> B[选择评估基准]
B --> C[准备测试环境]
C --> D[运行评估]
D --> E[分析结果]
E --> F{结果可靠?}
F -->|否| G[增加样本/调整方法]
G --> D
F -->|是| H[生成报告]
H --> I[指导改进]
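流程图中"运行评估 → 结果可靠?"对应 Checklist 里的"多次运行取平均",可以封装成一个通用小工具。`run_eval` 是假设的单次评估函数,接受随机种子并返回分数:

```python
import random
import statistics

def repeated_eval(run_eval, seeds=(0, 1, 2)):
    """用多个随机种子重复评估,返回均值、标准差和各次分数。"""
    scores = []
    for seed in seeds:
        random.seed(seed)  # 固定随机种子,保证可复现
        scores.append(run_eval(seed))
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "scores": scores,
    }
```

标准差过大说明结果不稳定,应增加样本或检查评估方法。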
6.2 评估 Checklist
EVALUATION_CHECKLIST = {
"准备阶段": [
"明确评估目标和用途",
"选择合适的评估基准",
"准备测试数据(确保无污染)",
"设置评估环境",
"确定评估指标",
],
"执行阶段": [
"记录模型配置和参数",
"使用固定的随机种子",
"多次运行取平均",
"记录推理时间",
"保存所有输出",
],
"分析阶段": [
"计算统计显著性",
"分析错误案例",
"比较不同类别的表现",
"检查异常值",
"验证结果可复现",
],
"报告阶段": [
"说明评估方法和局限",
"报告置信区间",
"与基线模型对比",
"提供详细的错误分析",
"给出改进建议",
],
}
6.3 常见陷阱
| 陷阱 | 问题 | 解决方案 |
|---|---|---|
| 数据泄露 | 训练数据包含测试数据 | 使用时间切分、检查数据源 |
| 过拟合评估 | 针对特定基准优化 | 使用多个基准、保留集 |
| 位置偏差 | LLM 倾向选择特定位置 | 交换选项位置 |
| Prompt 敏感 | 不同 prompt 结果差异大 | 测试多种 prompt |
| 样本太少 | 结果不稳定 | 增加样本、计算置信区间 |
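表中"样本太少"一行提到的置信区间,常用做法是对逐题的 0/1 正确标记做 bootstrap 重采样(纯标准库示意):

```python
import random

def bootstrap_ci(correct_flags, n_boot=2000, alpha=0.05, seed=0):
    """对 0/1 正确标记做 bootstrap 重采样,返回准确率及其置信区间。"""
    rng = random.Random(seed)
    n = len(correct_flags)
    means = []
    for _ in range(n_boot):
        # 有放回地重采样 n 个标记,记录每次的准确率
        sample = [correct_flags[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return {"accuracy": sum(correct_flags) / n, "ci": (lo, hi)}
```

置信区间宽度随样本数减小;两个模型的区间若大幅重叠,差异就不宜下结论。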
七、总结
LLM 评估核心要点
mindmap
root((LLM 评估))
评估维度
知识
推理
代码
对话
安全
主要基准
MMLU
GSM8K
HumanEval
MT-Bench
评估方法
规则评估
LLM-as-Judge
人工评估
工具框架
lm-eval-harness
OpenCompass
自定义
关键 Takeaway
- 评估是必要的:没有评估,不知道模型好坏
- 多维度评估:单一指标不够,要综合评估
- 选对基准:根据应用场景选择相关的评估集
- 注意陷阱:数据泄露、过拟合、偏差等
- LLM-as-Judge:开放性任务可以用 LLM 评估
- 持续评估:模型迭代时持续跟踪
模型选型参考(截至 2024)
| 模型 | MMLU | GSM8K | HumanEval | MT-Bench |
|---|---|---|---|---|
| GPT-4 | 86.4 | 92.0 | 67.0 | 9.0 |
| Claude 3 Opus | 86.8 | 95.0 | 84.9 | 9.0 |
| Gemini Ultra | 83.7 | 94.4 | 74.4 | - |
| LLaMA-3-70B | 82.0 | 93.0 | 81.7 | 8.5 |
| Qwen-72B | 77.4 | 78.9 | 72.0 | - |
数据来源:各公司官方报告,可能有更新