开源大模型横评 2025：谁才是真正的王者？

文章	评论	标签
185	0	302

前言：开源模型的黄金时代

2024-2025 年，开源大模型百花齐放：

2023年初: 只有 LLaMA 能打
2023年底: Mistral 异军突起
2024年: Qwen2.5、LLaMA3、GLM4、DeepSeek-V3 集体发力
2025年上半年: Qwen3、LLaMA4、DeepSeek-R2 新一代登场
2025年8月: 开源全面追平甚至超越闭源

现状：开源模型在很多场景已经能替代 GPT-4。

问题：模型太多，怎么选？

graph TB subgraph 开源模型江湖 Qwen[Qwen3
阿里] LLaMA[LLaMA4
Meta] GLM[GLM-5
智谱] Mistral[Mistral Large 3
法国] DeepSeek[DeepSeek-V4/R2
深度求索] Yi[Yi-2
零一万物] Baichuan[Baichuan4
百川] InternLM[InternLM3
上海AI Lab] end

本文目标：全方位对比主流开源模型，给你一个选型指南。

一、选手介绍

1.1 参赛模型一览

模型	厂商	国家	最新版本	参数规格	开源协议
Qwen3	阿里云	🇨🇳	2025.04	0.5B-110B	Apache 2.0
LLaMA 4	Meta	🇺🇸	2025.04	8B/70B/400B+	Llama 4 License
GLM-5	智谱AI	🇨🇳	2025.03	9B/32B	GLM License
Mistral Large 3	Mistral AI	🇫🇷	2025.02	70B/MoE	Apache 2.0
DeepSeek-V4	深度求索	🇨🇳	2025.06	800B+ MoE	MIT
DeepSeek-R2	深度求索	🇨🇳	2025.05	推理增强版	MIT
Yi-2	零一万物	🇨🇳	2025.03	6B/34B/100B	Apache 2.0
InternLM3	上海AI Lab	🇨🇳	2025.04	7B/20B/70B	Apache 2.0

1.2 各模型定位

quadrantChart title "开源模型定位图" x-axis "中文能力弱" --> "中文能力强" y-axis "模型较小" --> "模型较大" quadrant-1 "中文大模型" quadrant-2 "英文大模型" quadrant-3 "英文小模型" quadrant-4 "中文小模型" "Qwen3-72B": [0.88, 0.9] "LLaMA4-70B": [0.5, 0.85] "DeepSeek-V4": [0.85, 0.98] "GLM5-32B": [0.85, 0.5] "Mistral-L3": [0.45, 0.6] "Qwen3-8B": [0.82, 0.25]

1.3 重点选手详细介绍

Qwen3（阿里云）

发布时间: 2025年4月
参数规格: 0.5B / 1.5B / 4B / 8B / 14B / 32B / 72B / 110B
特点:
  - 中英文双语能力全面领先
  - 规格齐全，从端侧到服务器
  - 代码、数学、推理能力大幅提升
  - 支持 256K 上下文（大规格）
  - 原生多模态支持
  - 完全开源，Apache 2.0
亮点: 综合能力最强的开源模型之一，中文无敌

LLaMA 4（Meta）

发布时间: 2025年4月
参数规格: 8B / 70B / 400B+
特点:
  - 英文能力依然顶级
  - 原生多模态（视觉、音频）
  - MoE 版本效率更高
  - 社区生态最丰富
  - 工具调用和 Agent 能力增强
  - 256K 上下文
局限: 中文能力相比国产模型仍有差距

GLM-5（智谱AI）

发布时间: 2025年3月
参数规格: 9B / 32B（开源版）
特点:
  - 中文理解和生成能力顶级
  - 原生多模态（GLM-5V）
  - 256K 超长上下文
  - All Tools 能力增强
  - 推理能力大幅提升
局限: 最大开源版本 32B，更大规格需 API

Mistral Large 3（Mistral AI）

发布时间: 2025年2月
参数规格: 70B Dense / MoE 版本
特点:
  - 欧洲最强开源模型
  - 同参数量效率极高
  - 多语言能力显著改善（含中文）
  - 推理和代码能力强
  - 企业级安全特性
局限: 中文能力相比国产模型仍略逊

DeepSeek-V4 / R2（深度求索）

发布时间: V4 2025年6月 / R2 2025年5月
参数规格: V4 800B+ MoE（激活 50B+）/ R2 推理增强版
特点:
  - 效果超越 GPT-4.5，逼近 GPT-5
  - 训练成本依然极低
  - MLA + MoE 架构持续进化
  - R2 版本推理能力媲美 o1
  - 256K 上下文
  - 完全开源，MIT 协议
亮点: 开源界的性能天花板，性价比之王
局限: 部署需要多卡（V4），R2 蒸馏版可单卡

二、Benchmark 对决

2.1 综合能力（MMLU）

MMLU 测试 57 个学科的知识能力：

xychart-beta title "MMLU 得分对比 (5-shot)" x-axis ["DeepSeek-V4", "Qwen3-110B", "LLaMA4-70B", "Qwen3-72B", "GLM5-32B", "Qwen3-8B", "Mistral-L3"] y-axis "得分" 70 --> 95 bar [92.5, 91.8, 89.2, 88.5, 85.6, 82.3, 86.1]

模型	MMLU	说明
DeepSeek-V4	92.5	👑 开源最强
Qwen3-110B	91.8	紧追其后
LLaMA4-70B	89.2	英文顶级
Qwen3-72B	88.5	旗舰级
Mistral Large 3	86.1	欧洲之光
GLM-5-32B	85.6	中文优秀
Qwen3-8B	82.3	小模型惊艳

2.2 中文能力（C-Eval）

C-Eval 测试中文综合能力：

xychart-beta title "C-Eval 得分对比" x-axis ["DeepSeek-V4", "Qwen3-110B", "GLM5-32B", "Qwen3-8B", "Yi2-100B", "LLaMA4-70B", "Mistral-L3"] y-axis "得分" 60 --> 98 bar [95.2, 94.8, 91.5, 88.6, 89.2, 76.5, 72.8]

模型	C-Eval	说明
DeepSeek-V4	95.2	👑 中文王者
Qwen3-110B	94.8	紧随其后
GLM-5-32B	91.5	中文专精
Yi-2-100B	89.2	表现优秀
Qwen3-8B	88.6	小模型惊艳
LLaMA4-70B	76.5	中文进步明显
Mistral Large 3	72.8	中文有改善

结论：中文场景，Qwen3 和 DeepSeek-V4 明显领先。

2.3 代码能力（HumanEval）

HumanEval 测试 Python 代码生成：

xychart-beta title "HumanEval Pass@1 (%)" x-axis ["DeepSeek-V4", "DeepSeek-Coder-V3", "Qwen3-Coder", "Qwen3-72B", "LLaMA4-70B", "GLM5-32B"] y-axis "Pass@1" 70 --> 98 bar [94.8, 96.2, 93.5, 91.2, 88.6, 85.3]

模型	HumanEval	说明
DeepSeek-Coder-V3	96.2%	👑 代码专精王者
DeepSeek-V4	94.8%	通用模型代码顶级
Qwen3-Coder-32B	93.5%	代码专精优秀
Qwen3-72B	91.2%	通用模型优秀
LLaMA4-70B	88.6%	进步明显
GLM-5-32B	85.3%	不错

结论：代码任务首选 DeepSeek-Coder 或 Qwen-Coder 系列。

2.4 数学能力（GSM8K / MATH）

xychart-beta title "数学能力对比" x-axis ["DeepSeek-R2", "Qwen3-Math", "DeepSeek-V4", "Qwen3-72B", "LLaMA4-70B", "GLM5-32B"] y-axis "GSM8K 准确率" 85 --> 100 bar [98.8, 98.5, 97.6, 96.2, 94.5, 92.1]

模型	GSM8K	MATH	说明
DeepSeek-R2	98.8	92.5	👑 推理增强王者
Qwen3-Math-72B	98.5	91.8	数学专精顶级
DeepSeek-V4	97.6	88.2	通用但数学强
Qwen3-72B	96.2	82.5	通用优秀
LLaMA4-70B	94.5	78.6	进步明显
GLM-5-32B	92.1	75.3	不错

结论：数学/推理任务首选 DeepSeek-R2 或 Qwen-Math 系列。

2.5 长上下文能力

模型	最大上下文	实测有效长度	大海捞针准确率
DeepSeek-V4	256K	256K	100%
Qwen3-110B	256K	256K	100%
LLaMA4-70B	256K	256K	99.8%
Qwen3-72B	256K	256K	99.9%
GLM-5-32B	256K	256K	99.5%
Mistral Large 3	128K	128K	99.2%
Qwen3-8B	128K	128K	98.5%

结论：新一代模型普遍支持 256K，长上下文能力已不是瓶颈。

2.6 综合评分

radar title 模型能力雷达图（满分100） variables ["中文", "英文", "代码", "数学", "推理", "上下文"] values "Qwen3-72B" [96, 92, 93, 94, 92, 98] "LLaMA4-70B" [78, 96, 90, 92, 93, 96] "DeepSeek-V4" [96, 95, 96, 96, 98, 98] "GLM5-32B" [93, 82, 86, 88, 85, 95] "Mistral-L3" [75, 92, 88, 85, 88, 92]

三、实际体验对比

3.1 中文写作能力

测试 Prompt: "请以鲁迅的风格，写一段关于当代互联网内卷的讽刺文字"

Qwen3-72B 输出：

这世上本没有加班，卷的人多了，便有了996。我翻开考勤记录一看，满篇都写着两个字——"奋斗"。我横竖睡不着，仔细看了半夜，才从字缝里看出字来，原来是"内卷"。
他们喊着"躺平"，却在凌晨三点修改PPT；高呼"摸鱼"，却在卫生间里回复工作消息。

评价: ⭐⭐⭐⭐⭐ 风格拿捏准确，讽刺到位

LLaMA4-70B 输出：

互联网的时代，人人都在奔跑，却不知道终点在哪里。他们说这是"奋斗"，我看这是一种新式的囚徒困境......

评价: ⭐⭐⭐⭐ 中文进步明显，但鲁迅味道稍欠

DeepSeek-V4 输出：

我向来是不惮以最坏的恶意来推测职场的，然而我还不料，也不信竟会卷到如斯境地。那些西装革履的"奋斗者"，白天在会议室里高谈阔论，夜晚在出租屋里吃着泡面改方案......

评价: ⭐⭐⭐⭐⭐ 鲁迅味道浓郁，文学性强

3.2 代码能力实测

测试 Prompt: "用 Python 实现一个支持过期时间的 LRU 缓存"

DeepSeek-V4：

import time
from collections import OrderedDict
from threading import Lock

class TTLLRUCache:
    def __init__(self, capacity: int, default_ttl: float = 60.0):
        self.capacity = capacity
        self.default_ttl = default_ttl
        self.cache = OrderedDict()  # key -> (value, expire_time)
        self.lock = Lock()
    
    def get(self, key):
        with self.lock:
            if key not in self.cache:
                return None
            
            value, expire_time = self.cache[key]
            
            # 检查是否过期
            if time.time() > expire_time:
                del self.cache[key]
                return None
            
            # 移动到末尾（最近使用）
            self.cache.move_to_end(key)
            return value
    
    def put(self, key, value, ttl: float = None):
        with self.lock:
            ttl = ttl or self.default_ttl
            expire_time = time.time() + ttl
            
            if key in self.cache:
                self.cache.move_to_end(key)
            
            self.cache[key] = (value, expire_time)
            
            # 清理过期 + 超容量
            self._cleanup()
            
            while len(self.cache) > self.capacity:
                self.cache.popitem(last=False)
    
    def _cleanup(self):
        """清理过期项"""
        now = time.time()
        expired = [k for k, (_, exp) in self.cache.items() if now > exp]
        for k in expired:
            del self.cache[k]

评价: ⭐⭐⭐⭐⭐ 考虑了线程安全、过期清理、完整实现

Qwen3-72B：

（输出类似质量的代码，但增加了类型注解和文档字符串）

评价: ⭐⭐⭐⭐⭐ 代码规范，注释清晰

LLaMA4-70B：

（输出完整实现，包含线程安全，代码风格更 Pythonic）

评价: ⭐⭐⭐⭐⭐ 进步明显，代码质量优秀

3.3 推理能力实测

测试 Prompt: "一个房间里有100个人，99%是男性。请问需要多少男性离开，才能使男性比例变成98%？"

模型	回答	正确性
DeepSeek-R2	50人（详细推理链）	✅
DeepSeek-V4	50人（详细推理过程）	✅
Qwen3-72B	50人（逐步计算）	✅
LLaMA4-70B	50人（清晰推理）	✅
GLM-5-32B	50人	✅
Qwen3-8B	50人	✅

结论：新一代模型推理能力全面提升，小模型也能正确解答。

3.4 指令遵循能力

测试 Prompt: "用 JSON 格式输出中国五个直辖市的信息，包含名称、人口（万）、GDP（亿元）"

模型	JSON 格式正确	数据准确	评分
Qwen3-72B	✅	✅ 准确	⭐⭐⭐⭐⭐
DeepSeek-V4	✅	✅ 准确	⭐⭐⭐⭐⭐
LLaMA4-70B	✅	✅ 基本准确	⭐⭐⭐⭐⭐
GLM-5-32B	✅	✅ 准确	⭐⭐⭐⭐⭐
Mistral Large 3	✅	⚠️ 部分数据有误	⭐⭐⭐⭐

四、部署难度对比

4.1 显存需求

xychart-beta title "FP16 推理显存需求 (GB)" x-axis ["Qwen3-8B", "LLaMA4-8B", "GLM5-9B", "Qwen3-32B", "GLM5-32B", "Qwen3-72B", "LLaMA4-70B", "DeepSeek-V4"] y-axis "显存 (GB)" 0 --> 500 bar [16, 16, 18, 64, 64, 144, 140, 450]

4.2 部署方案对比

模型	Ollama	vLLM	llama.cpp	Transformers	难度
Qwen3-8B	✅	✅	✅	✅	⭐ 简单
LLaMA4-8B	✅	✅	✅	✅	⭐ 简单
GLM-5-9B	✅	✅	✅	✅	⭐ 简单
Mistral Large 3	✅	✅	✅	✅	⭐⭐ 中等
Qwen3-72B	✅	✅	✅	✅	⭐⭐⭐ 需多卡
LLaMA4-70B	✅	✅	✅	✅	⭐⭐⭐ 需多卡
DeepSeek-V4	⚠️	✅	❌	✅	⭐⭐⭐⭐ 复杂
DeepSeek-R2 蒸馏版	✅	✅	✅	✅	⭐⭐ 中等

4.3 量化后显存需求

模型	FP16	INT8	INT4	推荐 GPU
8B 模型	16GB	9GB	5GB	RTX 4060 Ti 16GB
32B 模型	64GB	35GB	18GB	RTX 4090 24GB (Q4)
72B 模型	144GB	75GB	40GB	2×A100 40GB

4.4 各模型部署命令

# Qwen3-8B
ollama run qwen3:8b

# LLaMA4-8B
ollama run llama4:8b

# GLM-5-9B
ollama run glm5:9b

# Mistral Large 3
ollama run mistral-large:latest

# DeepSeek-R2 (蒸馏版)
ollama run deepseek-r2:7b

# Qwen3-72B (vLLM, 多卡)
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-72B-Instruct \
    --tensor-parallel-size 2

五、使用代码示例

5.1 统一调用接口

from openai import OpenAI


class UnifiedLLM:
    """统一的 LLM 调用接口"""
    
    CONFIGS = {
        "qwen": {
            "base_url": "http://localhost:11434/v1",
            "model": "qwen3:8b",
            "api_key": "ollama",
        },
        "llama": {
            "base_url": "http://localhost:11434/v1",
            "model": "llama4:8b",
            "api_key": "ollama",
        },
        "deepseek": {
            "base_url": "http://localhost:11434/v1",
            "model": "deepseek-r2:7b",
            "api_key": "ollama",
        },
        "glm": {
            "base_url": "http://localhost:11434/v1",
            "model": "glm5:9b",
            "api_key": "ollama",
        },
        "mistral": {
            "base_url": "http://localhost:11434/v1",
            "model": "mistral-large:latest",
            "api_key": "ollama",
        },
    }
    
    def __init__(self, model_name: str):
        config = self.CONFIGS[model_name]
        self.client = OpenAI(
            base_url=config["base_url"],
            api_key=config["api_key"],
        )
        self.model = config["model"]
    
    def chat(self, messages: list, **kwargs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            **kwargs,
        )
        return response.choices[0].message.content
    
    def stream_chat(self, messages: list, **kwargs):
        stream = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            stream=True,
            **kwargs,
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content


# 使用示例
def compare_models(prompt: str):
    """对比多个模型的输出"""
    
    models = ["qwen", "llama", "deepseek", "mistral"]
    messages = [{"role": "user", "content": prompt}]
    
    results = {}
    for model_name in models:
        try:
            llm = UnifiedLLM(model_name)
            results[model_name] = llm.chat(messages, temperature=0.7)
        except Exception as e:
            results[model_name] = f"Error: {e}"
    
    return results


# 运行对比
results = compare_models("用一句话解释什么是递归")
for model, response in results.items():
    print(f"\n【{model}】: {response}")

5.2 批量评测脚本

import json
import time
from dataclasses import dataclass
from typing import List, Dict


@dataclass
class EvalResult:
    model: str
    task: str
    score: float
    latency: float
    output: str


class ModelEvaluator:
    """模型评测器"""
    
    def __init__(self, models: List[str]):
        self.models = {name: UnifiedLLM(name) for name in models}
    
    def evaluate_task(
        self,
        task_name: str,
        prompt: str,
        check_fn,  # 检查函数，返回 0-1 分数
    ) -> List[EvalResult]:
        """评测单个任务"""
        
        results = []
        messages = [{"role": "user", "content": prompt}]
        
        for model_name, llm in self.models.items():
            start = time.time()
            try:
                output = llm.chat(messages, temperature=0)
                latency = time.time() - start
                score = check_fn(output)
            except Exception as e:
                output = str(e)
                latency = 0
                score = 0
            
            results.append(EvalResult(
                model=model_name,
                task=task_name,
                score=score,
                latency=latency,
                output=output,
            ))
        
        return results
    
    def run_benchmark(self, tasks: List[Dict]) -> Dict:
        """运行完整评测"""
        
        all_results = []
        
        for task in tasks:
            results = self.evaluate_task(
                task["name"],
                task["prompt"],
                task["check_fn"],
            )
            all_results.extend(results)
        
        # 汇总
        summary = {}
        for model_name in self.models:
            model_results = [r for r in all_results if r.model == model_name]
            summary[model_name] = {
                "avg_score": sum(r.score for r in model_results) / len(model_results),
                "avg_latency": sum(r.latency for r in model_results) / len(model_results),
                "tasks": {r.task: r.score for r in model_results},
            }
        
        return summary


# 定义评测任务
def check_math(output: str) -> float:
    """检查数学答案"""
    return 1.0 if "42" in output else 0.0

def check_json(output: str) -> float:
    """检查 JSON 格式"""
    try:
        json.loads(output)
        return 1.0
    except:
        return 0.0

def check_contains(keyword: str):
    """检查是否包含关键词"""
    def _check(output: str) -> float:
        return 1.0 if keyword.lower() in output.lower() else 0.0
    return _check


# 运行评测
tasks = [
    {
        "name": "math_simple",
        "prompt": "计算 6 × 7 = ?，只输出数字",
        "check_fn": check_math,
    },
    {
        "name": "json_output",
        "prompt": "输出一个包含 name 和 age 字段的 JSON，不要其他内容",
        "check_fn": check_json,
    },
    {
        "name": "chinese_knowledge",
        "prompt": "中国的首都是哪里？只回答城市名",
        "check_fn": check_contains("北京"),
    },
]

evaluator = ModelEvaluator(["qwen", "llama", "deepseek", "mistral"])
results = evaluator.run_benchmark(tasks)

for model, data in results.items():
    print(f"\n{model}: 平均分 {data['avg_score']:.2f}, 平均延迟 {data['avg_latency']:.2f}s")

六、选型指南

6.1 场景推荐

flowchart TB A[选择模型] --> B{主要场景?} B -->|中文对话| C{预算?} B -->|代码生成| D{需要多强?} B -->|英文为主| E[LLaMA4] B -->|数学推理| F[DeepSeek-R2 / Qwen-Math] C -->|有多卡| G[Qwen3-72B+ / DeepSeek-V4] C -->|单卡| H[Qwen3-8B / GLM5-9B] D -->|顶级| I[DeepSeek-Coder-V3] D -->|够用| J[Qwen3-Coder-7B] style G fill:#4ecdc4 style H fill:#4ecdc4 style I fill:#4ecdc4

6.2 具体场景推荐表

场景	首选	备选	原因
中文对话	Qwen3-72B+	DeepSeek-V4	中文能力最强
中文对话（单卡）	Qwen3-8B	GLM5-9B	性价比高
英文对话	LLaMA4-70B	Qwen3-72B	英文更地道
代码生成	DeepSeek-Coder-V3	Qwen3-Coder	代码专精
数学/推理	DeepSeek-R2	Qwen3-Math	推理增强
长文档处理	Qwen3-72B	DeepSeek-V4	256K 有效
边缘部署	Qwen3-1.5B	Qwen3-4B	超轻量
全能选手	DeepSeek-V4	Qwen3-110B	综合最强

6.3 成本效益分析

模型	效果评分	部署成本	性价比
DeepSeek-V4	98	高（多卡）	⭐⭐⭐
Qwen3-72B	96	高（多卡）	⭐⭐⭐
Qwen3-8B	85	低（单卡）	⭐⭐⭐⭐⭐
LLaMA4-8B	82	低（单卡）	⭐⭐⭐⭐
GLM5-9B	84	低（单卡）	⭐⭐⭐⭐
DeepSeek-R2-7B	88	低（单卡）	⭐⭐⭐⭐⭐

6.4 我的最终推荐

🥇 综合最强: DeepSeek-V4
   - 适合：有多卡资源，追求极致效果

🥈 中文最佳: Qwen3-72B+  
   - 适合：中文场景为主，需要稳定可靠

🥉 性价比之王: Qwen3-8B / DeepSeek-R2 蒸馏版
   - 适合：单卡用户，中文场景，预算有限

🏅 代码专精: DeepSeek-Coder-V3
   - 适合：开发者，代码生成为主

🏅 推理专精: DeepSeek-R2
   - 适合：数学、逻辑推理、复杂问题

🏅 英文场景: LLaMA4-70B
   - 适合：英文为主，生态丰富

七、总结

7.1 核心结论

mindmap root((开源模型选型)) 中文场景 Qwen3 首选 DeepSeek 次选 GLM5 可用英文场景 LLaMA4 首选 Qwen3 可用代码场景 DeepSeek-Coder Qwen-Coder 推理场景 DeepSeek-R2 Qwen-Math 资源受限 Qwen3-8B R2 蒸馏版

7.2 一句话总结

模型	一句话评价
Qwen3	中文最强，规格齐全，综合之王
LLaMA4	英文最强，生态最好，多模态原生
DeepSeek-V4	效果逆天，超越 GPT-4.5，开源天花板
DeepSeek-R2	推理神器，媲美 o1，蒸馏版性价比高
GLM-5	中文不错，多模态强，国产新秀
Mistral Large 3	欧洲之光，效率高，中文改善

7.3 2025 趋势总结

MoE 全面普及：DeepSeek、Qwen 新版本均采用 MoE
推理增强模型崛起：DeepSeek-R2 引领 Test-time Scaling
256K 成为标配：长上下文不再是卖点
开源超越闭源：DeepSeek-V4 效果超越 GPT-4.5
中国模型领跑全球：Qwen、DeepSeek 稳居第一梯队
小模型能力飞跃：8B 模型效果逼近去年 70B

参考资料

开源模型这么多，选 Qwen3 准没错（中文），选 LLaMA4 也行（英文），土豪选 DeepSeek-V4，推理选 R2。