DeepSeek Local Deployment and Training: A Poor Man's GPT-4 Alternative


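Preface: Why Is DeepSeek So Popular?

In early 2025, DeepSeek burst onto the scene:

GPT-4-class capability + open source + commercial use allowed + remarkably low training cost
= DeepSeek

What makes DeepSeek stand out: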

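Dimension	GPT-4	DeepSeek-V3
Parameters	~1.8T (rumored)	671B (MoE, 37B active per token)
Training cost	~$100M+ (reported)	~$5.5M (reported, final training run only)
Open source	❌	✅
Commercial use	API only	✅ Free for commercial use
Local deployment	❌	✅
Quality	Top tier	Top tier (ahead on some benchmarks)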

Goal of this article: walk you step by step through running DeepSeek locally, and even fine-tuning it.

graph LR
    A[DeepSeek models] --> B{Deployment option}
    B --> C[Ollama: simplest]
    B --> D[vLLM: high performance]
    B --> E[llama.cpp: CPU / low VRAM]
    B --> F[Transformers: fine-tuning]
    style C fill:#4ecdc4

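1. The DeepSeek Model Family

1.1 Model Versions at a Glance

mindmap
  root((DeepSeek))
    Base models
      DeepSeek-V3
        671B MoE
        37B active
        Strongest open model
      DeepSeek-V2
        236B MoE
        21B active
      DeepSeek-V2-Lite
        16B
        Runs on a single GPU
    Chat models
      DeepSeek-Chat
      DeepSeek-R1
        Reasoning-enhanced
        Chain of thought
    Code models
      DeepSeek-Coder-V2
        236B MoE
        Code-specialized
    Math models
      DeepSeek-Math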

1.2 Model Selection Guide

Model	Parameters	VRAM required	Best for
DeepSeek-V3	671B	400GB+ (with 4-bit quantization)	Multi-GPU cluster
DeepSeek-V2	236B	150GB+ (with 4-bit quantization)	Multi-GPU server
DeepSeek-V2-Lite	16B	32GB	Single A100
DeepSeek-Coder-V2-Lite	16B	32GB	Coding tasks
DeepSeek-R1-Distill-Qwen-7B	7B	16GB	Reasoning tasks, consumer GPUs
DeepSeek-R1-Distill-Qwen-1.5B	1.5B	4GB	Lightweight, runs on CPU

For most people:

RTX 4090 (24GB) → DeepSeek-R1-Distill-7B / DeepSeek-Coder-V2-Lite (quantized)
RTX 3090 (24GB) → DeepSeek-R1-Distill-7B (quantized)
CPU only → DeepSeek-R1-Distill-1.5B
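
The selection table above can be turned into a small lookup. The sketch below is only an illustration: the VRAM figures are copied from the table, and the helper name `largest_model_that_fits` is made up for this example. It ignores quantization of the smaller models, which would let a bigger model fit.

```python
from typing import Optional

# (model name, approximate VRAM in GB) copied from the table, smallest first.
MODEL_VRAM_GB = [
    ("DeepSeek-R1-Distill-Qwen-1.5B", 4),
    ("DeepSeek-R1-Distill-Qwen-7B", 16),
    ("DeepSeek-V2-Lite", 32),
    ("DeepSeek-V2", 150),
    ("DeepSeek-V3", 400),
]


def largest_model_that_fits(vram_gb: float) -> Optional[str]:
    """Return the largest listed model whose VRAM figure fits, or None."""
    best = None
    for name, required_gb in MODEL_VRAM_GB:
        if required_gb <= vram_gb:
            best = name  # the list is sorted, so later matches are larger
    return best


print(largest_model_that_fits(24))
```

With 24GB this picks the 7B distilled model, which matches the recommendation for a 3090 or 4090.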

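1.3 What Makes DeepSeek-R1 Special

DeepSeek-R1 is a reasoning-enhanced model: it thinks a problem through before it answers:

A typical model:
User: Which is larger, 9.11 or 9.9?
Model: 9.11 is larger (wrong! It was misled by the number of decimal digits)

DeepSeek-R1:
User: Which is larger, 9.11 or 9.9?
Model: <think>
Let me compare these two numbers carefully...
9.11 = 9 + 0.11
9.9 = 9 + 0.9
0.9 > 0.11
So 9.9 > 9.11
</think>
9.9 is larger.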
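
In an application you usually want to show the final answer and hide (or log) the reasoning. Here is a minimal sketch of one way to separate the two. It assumes the response contains both the `<think>` and `</think>` tags as in the example above; the function name `split_reasoning` is made up for this illustration.

```python
import re
from typing import Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from an R1-style response."""
    match = THINK_RE.search(text)
    if match is None:
        # No think block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>\n0.9 > 0.11\n</think>\n9.9 is larger.")
print(answer)
```

Some serving stacks put the opening `<think>` tag in the prompt rather than in the output. In that case the regular expression will not match and the function falls back to returning the full text as the answer.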

2. Deploying with Ollama (Simplest)

2.1 Installing Ollama

# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download the installer: https://ollama.com/download

# Verify the installation
ollama --version

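2.2 Running DeepSeek

# Run the DeepSeek-R1 distilled model (recommended, 7B)
ollama run deepseek-r1:7b

# Or the lightweight 1.5B version (also runs on CPU)
ollama run deepseek-r1:1.5b

# Code model
ollama run deepseek-coder-v2:16b

# Browse the available DeepSeek models at https://ollama.com/library
# (the CLI has no search command); list the models already downloaded with:
ollama list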

2.3 Calling the API

import ollama


def chat_with_deepseek(prompt: str) -> str:
    """Call DeepSeek through Ollama."""
    
    response = ollama.chat(
        model='deepseek-r1:7b',
        messages=[
            {'role': 'user', 'content': prompt}
        ]
    )
    
    return response['message']['content']


def stream_chat(prompt: str):
    """Stream the output as it is generated."""
    
    stream = ollama.chat(
        model='deepseek-r1:7b',
        messages=[{'role': 'user', 'content': prompt}],
        stream=True,
    )
    
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)


# Usage
response = chat_with_deepseek("Write a quicksort in Python")
print(response)

# Streaming output
stream_chat("Explain what a Transformer is")
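
The two helpers above send a single user message, so the model has no memory of earlier turns. Below is a minimal sketch of keeping the history yourself. The `Conversation` class and its `send` parameter are made up for this illustration: `send` is any function that takes the message list and returns the reply text, so the class can be tried without a running server.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the running message list for a multi-turn chat."""

    def __init__(self, send: Callable[[List[Message]], str], system: str = ""):
        self.send = send
        self.messages: List[Message] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, prompt: str) -> str:
        # Append the user turn, call the backend, then record the reply
        # so the next call sees the whole history.
        self.messages.append({"role": "user", "content": prompt})
        reply = self.send(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Stand-in backend for demonstration: reports how many messages it received.
chat = Conversation(lambda msgs: f"received {len(msgs)} messages")
print(chat.ask("Hello"))
print(chat.ask("And again"))
```

To use it with Ollama, pass something like `lambda msgs: ollama.chat(model='deepseek-r1:7b', messages=msgs)['message']['content']` as `send`.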

2.4 OpenAI-Compatible API

from openai import OpenAI


# Ollama exposes an OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works; the client requires one but Ollama ignores it
)

response = client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

2.5 Using a Custom GGUF Model

# 1. Download a GGUF model
# from HuggingFace, for example:
# https://huggingface.co/TheBloke/deepseek-coder-6.7B-instruct-GGUF

# 2. Create a Modelfile
# The template must match the prompt format on the model card; the one below
# is the Instruction/Response format used by the deepseek-coder instruct models
cat > Modelfile << 'EOF'
FROM ./deepseek-coder-6.7b-instruct.Q4_K_M.gguf

TEMPLATE """{{ if .System }}{{ .System }}
{{ end }}### Instruction:
{{ .Prompt }}
### Response:
"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|EOT|>"
EOF

# 3. Create the model
ollama create my-deepseek -f Modelfile

# 4. Run it
ollama run my-deepseek

3. Deploying with vLLM (High Performance)

3.1 Installing vLLM

pip install vllm

3.2 Deploying DeepSeek-V2-Lite

from vllm import LLM, SamplingParams


def deploy_deepseek_v2_lite():
    """Deploy DeepSeek-V2-Lite."""
    
    llm = LLM(
        model="deepseek-ai/DeepSeek-V2-Lite-Chat",
        tensor_parallel_size=1,  # number of GPUs
        dtype="float16",
        max_model_len=8192,
        trust_remote_code=True,  # allows loading DeepSeek's custom model code
    )
    
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=512,
    )
    
    prompts = [
        "Hello, please introduce yourself.",
    ]
    
    outputs = llm.generate(prompts, sampling_params)
    
    for output in outputs:
        print(output.outputs[0].text)
    
    return llm


# Start the API server
# python -m vllm.entrypoints.openai.api_server \
#     --model deepseek-ai/DeepSeek-V2-Lite-Chat \
#     --trust-remote-code \
#     --port 8000

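3.3 Deploying the DeepSeek-R1 Distilled Model

from vllm import LLM, SamplingParams


def deploy_deepseek_r1():
    """Deploy the DeepSeek-R1 distilled model."""
    
    # 7B version
    llm = LLM(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        dtype="float16",
        max_model_len=32768,  # R1 supports long contexts
        trust_remote_code=True,
    )
    
    # Let vLLM apply the model's own chat template instead of writing the
    # special tokens by hand, which is easy to get wrong
    messages = [
        {"role": "user", "content": "Which is larger, 9.11 or 9.9? Think step by step."},
    ]
    
    sampling_params = SamplingParams(
        temperature=0.6,
        max_tokens=2048,
    )
    
    outputs = llm.chat(messages, sampling_params)
    print(outputs[0].outputs[0].text)


# Quantized deployment (saves VRAM)
def deploy_deepseek_r1_quantized(awq_model: str):
    """Quantized deployment.

    awq_model must point to a checkpoint that has already been quantized with
    AWQ (a local path or a Hugging Face repo id). The original FP16 weights
    will not load with quantization="awq".
    """
    
    llm = LLM(
        model=awq_model,
        quantization="awq",  # or "gptq" for a GPTQ checkpoint
        dtype="float16",
        max_model_len=16384,
        trust_remote_code=True,
    )
    
    return llm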

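3.4 Multi-GPU Deployment of DeepSeek-V2

# DeepSeek-V2 needs multiple GPUs: the unquantized 236B weights take roughly 470GB,
# so set --tensor-parallel-size to the number of GPUs (for example 8x 80GB)
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V2-Chat \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 32768 \
    --port 8000

# Client call (Python)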
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Chat",
    messages=[
        {"role": "user", "content": "Write a poem about programming"}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

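4. Deploying with llama.cpp (CPU / Low VRAM)

4.1 Installing llama.cpp

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (CPU). Current llama.cpp builds with CMake; the old Makefile build is deprecated
cmake -B build
cmake --build build --config Release -j

# Build with CUDA (if you have an NVIDIA GPU)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# macOS: Metal is enabled by default, so the plain build commands above are enough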

4.2 Downloading a GGUF Model

# Download a pre-converted GGUF from HuggingFace
# Recommended sources:
# - https://huggingface.co/TheBloke
# - https://huggingface.co/QuantFactory

# Example: download the DeepSeek-R1 distilled GGUF
huggingface-cli download \
    QuantFactory/DeepSeek-R1-Distill-Qwen-7B-GGUF \
    DeepSeek-R1-Distill-Qwen-7B.Q4_K_M.gguf \
    --local-dir ./models

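4.3 Running Inference

# Interactive chat (the binary is llama-cli in current builds; older builds call it ./main,
# and flags can differ between versions, so check --help)
./build/bin/llama-cli -m ./models/DeepSeek-R1-Distill-Qwen-7B.Q4_K_M.gguf \
    -n 512 \
    -t 8 \
    --interactive \
    --color \
    -c 4096

# Single-shot inference
./build/bin/llama-cli -m ./models/DeepSeek-R1-Distill-Qwen-7B.Q4_K_M.gguf \
    -n 256 \
    -p "Implement binary search in Python"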

4.4 Starting the API Server

# Start the server (llama-server in current builds; older builds call it ./server)
./build/bin/llama-server -m ./models/DeepSeek-R1-Distill-Qwen-7B.Q4_K_M.gguf \
    -c 4096 \
    --host 0.0.0.0 \
    --port 8080

# Call it
curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "What is deep learning?",
        "n_predict": 256
    }'
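
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above. The helper names `build_completion_request` and `complete` are made up for this illustration, and it assumes the server is listening on localhost:8080 and returns the generated text in the `content` field of the JSON response.

```python
import json
import urllib.request


def build_completion_request(prompt: str, n_predict: int = 256,
                             base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build the POST request for the llama.cpp server's /completion endpoint."""
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, n_predict: int = 256) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, n_predict)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["content"]
```

Building the request is kept separate from sending it so the payload can be inspected or logged without a server running.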

4.5 Calling from Python

from llama_cpp import Llama


def use_deepseek_gguf():
    """Call DeepSeek through llama-cpp-python."""
    
    llm = Llama(
        model_path="./models/DeepSeek-R1-Distill-Qwen-7B.Q4_K_M.gguf",
        n_ctx=4096,        # context length
        n_threads=8,       # CPU threads
        n_gpu_layers=35,   # layers offloaded to the GPU (0 = CPU only, -1 = all)
        verbose=False,
    )
    
    # Raw text completion; generation stops at the model's end-of-sequence token
    output = llm(
        "Explain what the Transformer architecture is",
        max_tokens=512,
        temperature=0.7,
    )
    
    print(output["choices"][0]["text"])
    
    return llm


def chat_with_gguf(llm, messages: list) -> str:
    """Chat mode."""
    
    # create_chat_completion applies the chat template stored in the GGUF
    # metadata, so the special tokens do not have to be written by hand
    output = llm.create_chat_completion(
        messages=messages,
        max_tokens=512,
        temperature=0.7,
    )
    
    return output["choices"][0]["message"]["content"]


# Usage
llm = use_deepseek_gguf()

messages = [
    {"role": "user", "content": "Hello!"}
]
response = chat_with_gguf(llm, messages)
print(response)

5. Loading and Fine-Tuning with Transformers

5.1 Loading the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


def load_deepseek():
    """Load a DeepSeek model."""
    
    model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
    
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )
    
    return model, tokenizer


def generate(model, tokenizer, prompt: str) -> str:
    """Generate a reply to a single user message."""
    
    # Format the prompt with the model's own chat template
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    
    # Decode only the newly generated tokens, not the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Usage
model, tokenizer = load_deepseek()
response = generate(model, tokenizer, "What is machine learning?")
print(response)

5.2 Quantized Loading (Saves VRAM)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch


def load_deepseek_quantized():
    """Load the model with 4-bit quantization."""
    
    model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
    
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    
    return model, tokenizer


# VRAM comparison
# FP16: ~14GB
# INT4: ~5GB

5.3 LoRA Fine-Tuning

from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch


def finetune_deepseek():
    """Fine-tune DeepSeek with QLoRA."""
    
    model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
    
    # 1. Quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    
    # 2. Load the model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
    )
    tokenizer.pad_token = tokenizer.eos_token
    
    # 3. Prepare the quantized model for training
    model = prepare_model_for_kbit_training(model)
    
    # 4. LoRA config
    lora_config = LoraConfig(
        r=64,
        lora_alpha=128,
        lora_dropout=0.05,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        bias="none",
        task_type="CAUSAL_LM",
    )
    
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    
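    # 5. Load the dataset ("your-dataset" is a placeholder)
    dataset = load_dataset("your-dataset", split="train")
    
    # 6. Formatting function: build the training text with the model's own
    # chat template (which also appends the end-of-sequence token) instead of
    # hand-written special tokens
    def format_prompt(example):
        return tokenizer.apply_chat_template(
            [
                {"role": "user", "content": example["instruction"]},
                {"role": "assistant", "content": example["output"]},
            ],
            tokenize=False,
        )
    
    # 7. Training config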
    training_args = TrainingArguments(
        output_dir="./deepseek-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
        optim="paged_adamw_8bit",
        gradient_checkpointing=True,
    )
    
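    # 8. Trainer (this is the older TRL API; recent TRL releases move
    # max_seq_length into SFTConfig and rename tokenizer to processing_class)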
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer,
        formatting_func=format_prompt,
        max_seq_length=2048,
    )
    
    # 9. Start training
    trainer.train()
    
    # 10. Save
    trainer.save_model()
    
    return model


# Run the fine-tuning
model = finetune_deepseek()
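
To get a feel for what `r=64` means in the config above, the number of trainable LoRA parameters can be worked out by hand: each adapted linear layer of shape (d_in, d_out) gains two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below does that arithmetic. The layer dimensions are an assumption based on the Qwen2 7B architecture (hidden size 3584, key/value projection width 512, feed-forward width 18944, 28 layers), so check them against the model's `config.json` before relying on the result.

```python
from typing import Iterable, Tuple


def lora_param_count(rank: int, shapes: Iterable[Tuple[int, int]]) -> int:
    """Each adapted (d_in, d_out) layer adds A (rank x d_in) and B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)


# Assumed Qwen2-style 7B dimensions (verify against config.json).
HIDDEN, KV_DIM, FFN, LAYERS = 3584, 512, 18944, 28

PER_LAYER_SHAPES = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV_DIM),  # k_proj
    (HIDDEN, KV_DIM),  # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, FFN),     # gate_proj
    (HIDDEN, FFN),     # up_proj
    (FFN, HIDDEN),     # down_proj
]

trainable = LAYERS * lora_param_count(64, PER_LAYER_SHAPES)
# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 4 * 4

print(f"LoRA parameters: {trainable / 1e6:.1f}M")
print(f"Effective batch size: {effective_batch}")
```

Under these assumptions the adapters add roughly 160 million parameters, a small fraction of the 7B base model. `model.print_trainable_parameters()` in the training script reports the actual figure.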

5.4 Custom Dataset Format

def prepare_dataset():
    """Prepare a fine-tuning dataset."""
    
    # Example of the data format
    data = [
        {
            "instruction": "Write a bubble sort in Python",
            "output": """def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr"""
        },
        {
            "instruction": "Explain what recursion is",
            "output": "Recursion is a programming technique in which a function calls itself within its own definition..."
        },
        # ... more data
    ]
    
    # Save as JSON
    import json
    with open("train_data.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    
    # Load as a Dataset
    from datasets import Dataset
    dataset = Dataset.from_list(data)
    
    return dataset


# Load from a JSON file
def load_custom_dataset(file_path: str):
    """Load a custom dataset."""
    from datasets import load_dataset
    
    dataset = load_dataset("json", data_files=file_path, split="train")
    return dataset
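
Since the formatting function in 5.3 reads the `instruction` and `output` fields of every record, a missing or empty field can fail the run or add useless samples. Here is a minimal sketch of a pre-flight check; the function name `find_invalid_records` is made up for this illustration.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("instruction", "output")


def find_invalid_records(records: List[Dict[str, str]]) -> List[int]:
    """Return the indices of records with a missing or empty required field."""
    bad = []
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                bad.append(index)
                break  # one problem is enough to flag the record
    return bad


sample = [
    {"instruction": "Explain what recursion is", "output": "Recursion is..."},
    {"instruction": "", "output": "orphan answer"},
    {"output": "no instruction field"},
]
print(find_invalid_records(sample))
```

Run it on the list before calling `Dataset.from_list`, and fix or drop the flagged records.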

6. Complete Deployment Setups

6.1 Docker Deployment

# Dockerfile
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3 python3-pip git curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
RUN pip3 install vllm torch transformers

# Copy the startup script
COPY start.sh /start.sh
RUN chmod +x /start.sh

EXPOSE 8000

CMD ["/start.sh"]

# start.sh
#!/bin/bash
python3 -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

# Build and run
docker build -t deepseek-server .
docker run --gpus all -p 8000:8000 deepseek-server

6.2 Complete Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  deepseek:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
      --trust-remote-code
      --host 0.0.0.0
      --port 8000
    restart: unless-stopped

  # Optional: Web UI
  webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OPENAI_API_BASE_URL=http://deepseek:8000/v1
      - OPENAI_API_KEY=not-needed
    depends_on:
      - deepseek
    restart: unless-stopped

6.3 Performance Tuning

# High-performance deployment config
VLLM_CONFIG = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "tensor_parallel_size": 1,
    "gpu_memory_utilization": 0.9,
    "max_model_len": 16384,
    "trust_remote_code": True,
    
    # Performance tuning
    "enable_prefix_caching": True,  # prefix caching
    "max_num_seqs": 256,            # max concurrent sequences
    "max_num_batched_tokens": 8192, # max tokens per batch
}

# Launch command
# python -m vllm.entrypoints.openai.api_server \
#     --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
#     --trust-remote-code \
#     --gpu-memory-utilization 0.9 \
#     --enable-prefix-caching \
#     --max-num-seqs 256 \
#     --port 8000
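
The config dictionary and the launch command above carry the same settings in two forms. Below is a minimal sketch of deriving the flags from the dictionary so the two cannot drift apart. The helper name `config_to_cli_args` is made up for this illustration, and it assumes every key maps to a flag by swapping underscores for dashes, which holds for the keys used above.

```python
from typing import Any, Dict, List


def config_to_cli_args(config: Dict[str, Any]) -> List[str]:
    """Turn {"max_num_seqs": 256, "trust_remote_code": True} into CLI flags."""
    args: List[str] = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # Boolean options are bare flags; False means "leave it out".
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


example = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "gpu_memory_utilization": 0.9,
    "trust_remote_code": True,
    "max_num_seqs": 256,
}
print(" ".join(config_to_cli_args(example)))
```

The resulting list can be appended to `["python", "-m", "vllm.entrypoints.openai.api_server"]` and passed to `subprocess.run`.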

7. Hardware Recommendations

7.1 Setups for Different Budgets

graph TB
    subgraph entry [Entry level]
        A1[RTX 3060 12GB or RTX 4060 Ti 16GB] --> B1[DeepSeek-R1-1.5B or 7B with Q4 quantization]
    end
    subgraph advanced [Advanced]
        A2[RTX 4090 24GB] --> B2[DeepSeek-R1-7B FP16 or V2-Lite Q4]
    end
    subgraph pro [Professional]
        A3[2-4x A100 40GB] --> B3[DeepSeek-V2 or Coder-V2, quantized]
    end
    subgraph deluxe [Money is no object]
        A4[8x H100 80GB] --> B4[DeepSeek-V3 full model, quantized to fit]
    end

7.2 Detailed Configuration Table

Budget	GPU	Recommended model	Quantization	Quality
¥3000	RTX 3060 12GB	R1-Distill-1.5B	FP16	Basic chat
¥5000	RTX 4060 Ti 16GB	R1-Distill-7B	Q4	Decent
¥15000	RTX 4090 24GB	R1-Distill-7B	FP16	Very good
¥50000	2x RTX 4090	V2-Lite 16B	FP16	Excellent
Cloud server	A100 40GB	R1-Distill-7B + V2-Lite	FP16	Professional

7.3 Memory and Storage Requirements

def estimate_requirements(model_params_b: float, quantization: str = "fp16"):
    """Estimate hardware requirements (rough rules of thumb)."""
    
    bytes_per_param = {
        "fp32": 4,
        "fp16": 2,
        "int8": 1,
        "int4": 0.5,
    }
    
    # Model size
    model_size_gb = model_params_b * bytes_per_param[quantization]
    
    # VRAM for inference
    # model + KV cache + activations
    vram_inference = model_size_gb * 1.2
    
    # VRAM for full fine-tuning
    # model + gradients + optimizer state (Adam: 2x)
    vram_training = model_size_gb * 6
    
    # VRAM for QLoRA
    vram_qlora = model_size_gb * 1.5
    
    print(f"Model: {model_params_b}B parameters")
    print(f"Quantization: {quantization}")
    print(f"Model size: {model_size_gb:.1f} GB")
    print(f"Inference VRAM: ~{vram_inference:.1f} GB")
    print(f"Full fine-tuning VRAM: ~{vram_training:.1f} GB")
    print(f"QLoRA fine-tuning VRAM: ~{vram_qlora:.1f} GB")


# DeepSeek-R1-7B
estimate_requirements(7, "fp16")
# Model: 7B parameters
# Quantization: fp16
# Model size: 14.0 GB
# Inference VRAM: ~16.8 GB
# Full fine-tuning VRAM: ~84.0 GB
# QLoRA fine-tuning VRAM: ~21.0 GB

# DeepSeek-R1-7B, quantized
estimate_requirements(7, "int4")
# Model size: 3.5 GB
# Inference VRAM: ~4.2 GB
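
The flat 1.2x factor above hides how much the KV cache grows with context length. The sketch below estimates the cache separately: each token stores one key and one value vector per layer per KV head. The dimensions used (28 layers, 4 KV heads, head dimension 128) are an assumption based on the Qwen2 7B architecture, so check the model's `config.json`; the function name `kv_cache_bytes` is made up for this illustration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2, batch_size: int = 1) -> int:
    """KV cache size: 2 (keys and values) x layers x KV heads x head dim x tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Assumed Qwen2-style 7B dimensions, FP16 cache, one sequence.
for context in (4096, 32768):
    size = kv_cache_bytes(num_layers=28, num_kv_heads=4, head_dim=128, seq_len=context)
    print(f"{context} tokens: {size / 1024**3:.2f} GiB")
```

The cache scales linearly with both context length and the number of concurrent sequences, which is why lowering `max_model_len` or the batch size frees memory.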

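8. Common Problems

8.1 Troubleshooting

Problem	Cause	Fix
CUDA OOM	Not enough VRAM	Use quantization, reduce batch_size
Slow model loading	Network issues	Use a mirror, download ahead of time
Garbled output	Tokenizer mismatch	Make sure you use the tokenizer that ships with the model
Slow inference	GPU not in use	Check device_map
trust_remote_code error	Security restriction	Add trust_remote_code=True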

8.2 Using Mirrors in China

import os

# Set the HuggingFace mirror (do this before importing transformers or huggingface_hub)
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Or use ModelScope
from modelscope import snapshot_download

model_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-7B')

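8.3 Fixes for Running Out of VRAM

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from vllm import LLM

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# Option 1: quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit quantization
    device_map="auto",
)

# Option 2: CPU offload
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    offload_folder="offload",  # weights that do not fit on the GPU spill to CPU RAM, then to this folder
)

# Option 3: shorten the context length
llm = LLM(
    model=model_name,
    max_model_len=2048,  # smaller context
)

# Option 4: use a smaller model
# R1-7B → R1-1.5B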

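9. Summary

[Figure: comparison of DeepSeek deployment options]

Key Takeaways

Ollama is the simplest: one command gets you running
vLLM has the best performance: the first choice for production
llama.cpp is the lightest on resources: it runs even on CPU
Quantization is key: 4-bit quantization lets a 7B model run in 8GB of VRAM
Fine-tune with QLoRA: 24GB of VRAM is enough to fine-tune a 7B model
The R1 distilled models are great value: 7B in size, with strong reasoning ability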