前言:从"读文字"到"看世界"
GPT-3 时代的大模型只能处理文本:
用户: [发送一张猫的图片]
GPT-3: 抱歉,我无法查看图片。请用文字描述一下。
GPT-4V 时代的多模态大模型:
用户: [发送一张猫的图片] 这是什么品种的猫?
GPT-4V: 这是一只英国短毛猫(British Shorthair),特征是圆圆的脸、
铜色的眼睛和蓝灰色的毛发。这只猫看起来大约2-3岁...
多模态(Multimodal) = 多种信息形式:文本、图像、音频、视频...
graph TB
subgraph 单模态时代
T1[文本输入] --> LLM1[语言模型]
LLM1 --> T2[文本输出]
end
subgraph 多模态时代
I[图像] --> MLLM[多模态大模型]
A[音频] --> MLLM
V[视频] --> MLLM
T[文本] --> MLLM
MLLM --> O1[文本回答]
MLLM --> O2[图像生成]
MLLM --> O3[语音合成]
end
style MLLM fill:#4ecdc4
一、多模态大模型全景
1.1 多模态的类型
mindmap
root((多模态))
理解型
图像理解
GPT-4V
Claude 3
Gemini
Qwen-VL
视频理解
Gemini 1.5
GPT-4o
音频理解
Whisper
GPT-4o
生成型
文生图
DALL-E 3
Midjourney
Stable Diffusion
文生视频
Sora
Runway
Pika
文生音频
Suno
ElevenLabs
双向型
GPT-4o
Gemini
1.2 主流多模态模型对比
| 模型 | 图像理解 | 图像生成 | 音频理解 | 视频理解 | 厂商 |
|---|---|---|---|---|---|
| GPT-4o | ✅ | ✅ | ✅ | ✅ | OpenAI |
| Claude 3 | ✅ | ❌ | ❌ | ❌ | Anthropic |
| Gemini 1.5 | ✅ | ✅ | ✅ | ✅ | Google |
| Qwen-VL | ✅ | ❌ | ❌ | ❌ | 阿里 |
| GLM-4V | ✅ | ❌ | ❌ | ❌ | 智谱 |
| LLaVA | ✅ | ❌ | ❌ | ❌ | 开源 |
1.3 多模态的技术路线
flowchart TB
    subgraph 技术路线
        A[图像编码器 Vision Encoder]
        B[投影层 Projector]
        C[语言模型 LLM]
        A --> B --> C
    end
    subgraph 代表模型
        A1[CLIP ViT]
        A2[SigLIP]
        A3[EVA-CLIP]
        B1[MLP]
        B2[Q-Former]
        B3[Perceiver]
        C1[LLaMA]
        C2[Vicuna]
        C3[Qwen]
    end
    style B fill:#ffe66d
核心思路(三步的衔接可见下方示意代码):
- 视觉编码器:把图像转换为向量序列
- 投影层:将视觉向量对齐到语言空间
- 语言模型:统一处理视觉和文本信息
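把这三步串起来,张量形状的流转可以用下面这段极简代码示意(其中 patch 数 256、视觉维度 1024、LLM 隐藏维度 4096 只是假设的典型取值,并非某个具体模型的精确参数):
import torch
import torch.nn as nn

# 1. 视觉编码器输出:一张图被切成 N 个 patch token(假设 N=256,维度 1024)
image_features = torch.randn(1, 256, 1024)
# 2. 投影层:把视觉特征对齐到 LLM 的隐藏维度(假设 4096)
projector = nn.Linear(1024, 4096)
visual_tokens = projector(image_features)           # (1, 256, 4096)
# 3. 与文本 token 的嵌入拼成一个序列,交给语言模型统一处理
text_embeds = torch.randn(1, 32, 4096)              # 假设问题有 32 个 token
inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)  # (1, 288, 4096)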
二、视觉语言模型(VLM)原理
2.1 CLIP:连接视觉和语言
CLIP(Contrastive Language-Image Pre-training)是多模态的基石。
核心思想:让图像和对应的文本描述在向量空间中距离更近。
graph LR
subgraph CLIP训练
I[图像] --> VE[Vision Encoder]
VE --> IV[图像向量]
T[文本描述] --> TE[Text Encoder]
TE --> TV[文本向量]
IV --> Sim[对比学习]
TV --> Sim
end
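对比学习的训练目标可以用下面的极简代码理解:一个 batch 内有 B 个图文对,第 i 张图只和第 i 条文本互为正样本,其余都是负样本。这只是示意实现,logit_scale 固定为 100,真实 CLIP 中它是一个可学习参数。
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale=100.0):
    """CLIP 对称对比损失(InfoNCE)的示意实现"""
    # 先做 L2 归一化,使点积等价于余弦相似度
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (B, B) 相似度矩阵:对角线是正样本对
    logits = logit_scale * image_embeds @ text_embeds.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    # 图像->文本、文本->图像 两个方向的交叉熵取平均
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.t(), labels)
    return (loss_i2t + loss_t2i) / 2
训练完成后,推理端的基础用法如下面的代码所示。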
import torch
import torch.nn.functional as F
from transformers import CLIPProcessor, CLIPModel
def clip_demo():
"""CLIP 基础使用"""
# 加载模型
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
# 准备输入
from PIL import Image
image = Image.open("cat.jpg")
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# 处理输入
inputs = processor(
text=texts,
images=image,
return_tensors="pt",
padding=True,
)
# 获取特征
outputs = model(**inputs)
image_embeds = outputs.image_embeds # (1, 512)
text_embeds = outputs.text_embeds # (3, 512)
# 计算相似度
similarity = F.cosine_similarity(
image_embeds.unsqueeze(1),
text_embeds.unsqueeze(0),
dim=-1
)
# 输出概率
    probs = F.softmax(similarity * 100, dim=-1)  # 乘以 logit_scale(约 100,相当于温度 0.01)
for text, prob in zip(texts, probs[0]):
print(f"{text}: {prob:.2%}")
# 输出:
# a photo of a cat: 95.23%
# a photo of a dog: 4.12%
# a photo of a car: 0.65%
# 零样本图像分类
def zero_shot_classification(image, labels, model, processor):
    """零样本分类:把候选类别写成文本提示,取相似度最高者"""
    texts = [f"a photo of a {label}" for label in labels]
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)
    return {label: prob.item() for label, prob in zip(labels, probs[0])}
2.2 视觉语言模型架构
以 LLaVA 为例:
flowchart TB
    subgraph LLaVA架构
        Image[输入图像] --> ViT[CLIP ViT 视觉编码器]
        ViT --> Features[图像特征 N × D]
        Features --> Projector[MLP 投影层]
        Projector --> VisualTokens[视觉 Tokens]
        Text[用户问题] --> Tokenizer[Tokenizer]
        Tokenizer --> TextTokens[文本 Tokens]
        VisualTokens --> Concat[拼接]
        TextTokens --> Concat
        Concat --> LLM[LLaMA/Vicuna]
        LLM --> Answer[回答]
    end
关键设计:
- 视觉编码器:通常使用预训练的 CLIP ViT
- 投影层:将视觉特征映射到语言空间(维度对齐)
- 统一序列:图像 token + 文本 token 一起送入 LLM
2.3 LLaVA 代码解析
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaForCausalLM
class LLaVA(nn.Module):
"""简化版 LLaVA 架构"""
def __init__(
self,
vision_model_name: str = "openai/clip-vit-large-patch14",
llm_model_name: str = "meta-llama/Llama-2-7b-hf",
vision_hidden_size: int = 1024,
llm_hidden_size: int = 4096,
):
super().__init__()
# 1. 视觉编码器(冻结)
self.vision_encoder = CLIPVisionModel.from_pretrained(vision_model_name)
for param in self.vision_encoder.parameters():
param.requires_grad = False
# 2. 投影层(可训练)
self.projector = nn.Sequential(
nn.Linear(vision_hidden_size, llm_hidden_size),
nn.GELU(),
nn.Linear(llm_hidden_size, llm_hidden_size),
)
# 3. 语言模型
self.llm = LlamaForCausalLM.from_pretrained(llm_model_name)
def encode_image(self, images: torch.Tensor) -> torch.Tensor:
"""编码图像为视觉 token"""
# 获取视觉特征
vision_outputs = self.vision_encoder(images)
image_features = vision_outputs.last_hidden_state # (B, N, D_v)
# 投影到语言空间
visual_tokens = self.projector(image_features) # (B, N, D_l)
return visual_tokens
def forward(
self,
images: torch.Tensor,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
labels: torch.Tensor = None,
):
"""前向传播"""
# 1. 编码图像
visual_tokens = self.encode_image(images) # (B, N_img, D)
# 2. 获取文本嵌入
text_embeds = self.llm.get_input_embeddings()(input_ids) # (B, N_txt, D)
# 3. 拼接视觉和文本 token
# 假设格式: [视觉tokens] [文本tokens]
inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        # 4. 扩展 attention mask(视觉 token 全部可见,dtype 与文本 mask 保持一致)
        visual_attention = torch.ones(
            visual_tokens.shape[:2],
            dtype=attention_mask.dtype,
            device=attention_mask.device,
        )
        attention_mask = torch.cat([visual_attention, attention_mask], dim=1)
        # 5. 对齐 labels:视觉 token 位置填 -100,不参与损失计算
        if labels is not None:
            visual_labels = torch.full(
                visual_tokens.shape[:2], -100,
                dtype=labels.dtype, device=labels.device,
            )
            labels = torch.cat([visual_labels, labels], dim=1)
        # 6. 前向 LLM
        outputs = self.llm(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            labels=labels,
        )
        return outputs
@torch.no_grad()
def generate(
self,
images: torch.Tensor,
input_ids: torch.Tensor,
**generate_kwargs,
):
"""生成回答"""
# 编码图像
visual_tokens = self.encode_image(images)
# 获取文本嵌入
text_embeds = self.llm.get_input_embeddings()(input_ids)
# 拼接
inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
# 生成
outputs = self.llm.generate(
inputs_embeds=inputs_embeds,
**generate_kwargs,
)
        return outputs
2.4 训练策略
LLaVA 的两阶段训练:
flowchart LR
    subgraph Stage1["阶段1: 预训练"]
        D1[图文对数据 595K] --> T1[只训练投影层]
        T1 --> M1[对齐视觉和语言]
    end
    subgraph Stage2["阶段2: 指令微调"]
        D2[多模态指令数据 158K] --> T2[训练投影层+LLM]
        T2 --> M2[学会对话]
    end
    Stage1 --> Stage2
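落到代码上,两个阶段的区别主要在于哪些参数的 requires_grad 被打开。下面是基于前文简化版 LLaVA 类的一个假设性写法,仅作示意:
def configure_trainable_params(model: "LLaVA", stage: int):
    """按训练阶段设置可训练参数(示意)"""
    # 视觉编码器在两个阶段都保持冻结
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    # 投影层在两个阶段都参与训练
    for p in model.projector.parameters():
        p.requires_grad = True
    # 阶段 1 冻结 LLM,只对齐投影层;阶段 2 连同 LLM 一起指令微调
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)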
三、使用多模态模型
3.1 OpenAI GPT-4 Vision
from openai import OpenAI
import base64
def encode_image(image_path: str) -> str:
"""将图片编码为 base64"""
with open(image_path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")
def analyze_image(image_path: str, question: str) -> str:
"""使用 GPT-4V 分析图片"""
client = OpenAI()
base64_image = encode_image(image_path)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": question,
},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{base64_image}",
"detail": "high", # low, high, auto
},
},
],
}
],
max_tokens=1000,
)
return response.choices[0].message.content
# 使用示例
result = analyze_image(
"chart.png",
"请分析这张图表,总结主要数据趋势。"
)
print(result)
# 多图分析
def analyze_multiple_images(image_paths: list, question: str) -> str:
"""分析多张图片"""
client = OpenAI()
content = [{"type": "text", "text": question}]
for path in image_paths:
base64_image = encode_image(path)
content.append({
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{base64_image}",
},
})
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": content}],
max_tokens=2000,
)
return response.choices[0].message.content
# 比较两张图片
result = analyze_multiple_images(
["before.jpg", "after.jpg"],
"请比较这两张图片,描述主要的变化。"
)
3.2 Claude 3 Vision
import anthropic
import base64
def analyze_with_claude(image_path: str, question: str) -> str:
"""使用 Claude 3 分析图片"""
client = anthropic.Anthropic()
# 读取图片
with open(image_path, "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
# 确定媒体类型
if image_path.endswith(".png"):
media_type = "image/png"
elif image_path.endswith(".gif"):
media_type = "image/gif"
else:
media_type = "image/jpeg"
message = client.messages.create(
model="claude-3-opus-20240229",
max_tokens=1024,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": media_type,
"data": image_data,
},
},
{
"type": "text",
"text": question,
}
],
}
],
)
return message.content[0].text
# 使用
result = analyze_with_claude(
"document.png",
"请提取这份文档中的所有文字内容。"
)
3.3 本地部署 LLaVA
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import torch
def load_llava():
"""加载 LLaVA 模型"""
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto",
)
return model, processor
def chat_with_image(model, processor, image_path: str, question: str) -> str:
"""与图片对话"""
image = Image.open(image_path)
# 构造对话
conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": question},
],
},
]
# 处理输入
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(prompt, image, return_tensors="pt").to(model.device)
# 生成
output = model.generate(
**inputs,
max_new_tokens=500,
do_sample=True,
temperature=0.7,
)
# 解码
response = processor.decode(output[0], skip_special_tokens=True)
# 提取回答部分
answer = response.split("[/INST]")[-1].strip()
return answer
# 使用
model, processor = load_llava()
answer = chat_with_image(
model, processor,
"photo.jpg",
"描述这张图片中的场景。"
)
print(answer)
3.4 Qwen-VL 使用
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
def use_qwen_vl():
"""使用 Qwen-VL"""
model_name = "Qwen/Qwen-VL-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
trust_remote_code=True,
).eval()
# 单张图片问答
query = tokenizer.from_list_format([
{'image': 'image.jpg'},
{'text': '这张图片里有什么?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 多轮对话
response2, history = model.chat(
tokenizer,
query='图片中的人在做什么?',
history=history,
)
print(response2)
# 多图理解
query = tokenizer.from_list_format([
{'image': 'image1.jpg'},
{'image': 'image2.jpg'},
{'text': '比较这两张图片的异同。'},
])
response, _ = model.chat(tokenizer, query=query, history=None)
    print(response)
四、多模态应用场景
4.1 文档理解与提取
import json

class DocumentAnalyzer:
"""文档分析器"""
def __init__(self, model_client):
self.client = model_client
def extract_text(self, image_path: str) -> str:
"""OCR 提取文字"""
return self.analyze(image_path, "请提取图片中的所有文字内容,保持原有格式。")
def extract_table(self, image_path: str) -> str:
"""提取表格"""
return self.analyze(
image_path,
"请将图片中的表格转换为 Markdown 格式。"
)
def summarize_document(self, image_path: str) -> str:
"""文档摘要"""
return self.analyze(image_path, "请总结这份文档的主要内容。")
def extract_key_info(self, image_path: str, fields: list) -> dict:
"""提取关键信息"""
fields_str = "、".join(fields)
response = self.analyze(
image_path,
f"请从文档中提取以下信息:{fields_str}。以 JSON 格式返回。"
)
return json.loads(response)
def analyze(self, image_path: str, prompt: str) -> str:
"""通用分析"""
# 调用多模态模型
return analyze_image(image_path, prompt)
# 使用示例
analyzer = DocumentAnalyzer(client)
# 发票信息提取
invoice_info = analyzer.extract_key_info(
"invoice.jpg",
["发票号码", "开票日期", "金额", "购买方", "销售方"]
)
print(invoice_info)
# {'发票号码': '12345678', '开票日期': '2024-01-15', ...}
# 简历解析
resume_text = analyzer.extract_text("resume.png")
4.2 图像内容审核
class ContentModerator:
"""内容审核"""
MODERATION_PROMPT = """请分析这张图片是否包含以下不当内容:
1. 暴力内容
2. 色情内容
3. 仇恨言论
4. 危险行为
5. 虚假信息
请以 JSON 格式返回,包含以下字段:
- is_safe: 是否安全 (true/false)
- categories: 检测到的问题类别列表
- confidence: 置信度 (0-1)
- reason: 判断理由
"""
def moderate(self, image_path: str) -> dict:
"""审核图片"""
response = analyze_image(image_path, self.MODERATION_PROMPT)
return json.loads(response)
def batch_moderate(self, image_paths: list) -> list:
"""批量审核"""
results = []
for path in image_paths:
result = self.moderate(path)
result["image_path"] = path
results.append(result)
return results
# 使用
moderator = ContentModerator()
result = moderator.moderate("user_upload.jpg")
if not result["is_safe"]:
print(f"检测到不当内容: {result['categories']}")
print(f"原因: {result['reason']}")4.3 图表分析
class ChartAnalyzer:
"""图表分析器"""
def analyze_chart(self, image_path: str) -> dict:
"""分析图表"""
prompt = """请分析这张图表,提供以下信息:
1. 图表类型(折线图/柱状图/饼图等)
2. 数据主题
3. 主要数据点
4. 趋势分析
5. 关键洞察
请以 JSON 格式返回。"""
response = analyze_image(image_path, prompt)
return json.loads(response)
def extract_data(self, image_path: str) -> list:
"""提取图表数据"""
prompt = """请从这张图表中提取数据,以 JSON 数组格式返回。
例如:[{"x": "2020", "y": 100}, {"x": "2021", "y": 150}]"""
response = analyze_image(image_path, prompt)
return json.loads(response)
def generate_insights(self, image_path: str) -> str:
"""生成业务洞察"""
prompt = """作为数据分析师,请分析这张图表并提供:
1. 3个关键发现
2. 可能的原因分析
3. 建议的行动项
请用简洁的商业语言表达。"""
return analyze_image(image_path, prompt)
# 使用
chart_analyzer = ChartAnalyzer()
# 分析销售图表
analysis = chart_analyzer.analyze_chart("sales_chart.png")
print(f"图表类型: {analysis['chart_type']}")
print(f"趋势: {analysis['trend']}")
# 生成报告
insights = chart_analyzer.generate_insights("quarterly_report.png")
print(insights)
4.4 多模态 RAG
class MultimodalRAG:
"""多模态 RAG 系统"""
def __init__(self, text_retriever, image_retriever, vlm_client):
self.text_retriever = text_retriever
self.image_retriever = image_retriever
self.vlm = vlm_client
def query(self, question: str, image: str = None) -> str:
"""多模态查询"""
context_parts = []
# 1. 文本检索
text_docs = self.text_retriever.retrieve(question)
context_parts.append("相关文档:\n" + "\n".join(text_docs))
# 2. 图像检索(如果问题涉及图像)
if self.needs_image_search(question):
relevant_images = self.image_retriever.retrieve(question)
for img_path in relevant_images:
# 用 VLM 描述图片
description = self.vlm.describe(img_path)
context_parts.append(f"相关图片描述:{description}")
# 3. 如果用户提供了图片
if image:
image_analysis = self.vlm.analyze(image, question)
context_parts.append(f"用户图片分析:{image_analysis}")
# 4. 生成回答
context = "\n\n".join(context_parts)
prompt = f"""基于以下信息回答问题:
{context}
问题:{question}
回答:"""
return self.vlm.generate(prompt)
def needs_image_search(self, question: str) -> bool:
"""判断是否需要图像检索"""
image_keywords = ["图", "照片", "看起来", "外观", "样子", "图表"]
return any(kw in question for kw in image_keywords)
# 图像检索器
import torch
import torch.nn.functional as F
from transformers import CLIPProcessor, CLIPModel

class ImageRetriever:
    """基于 CLIP 的图像检索"""
def __init__(self, image_embeddings, image_paths):
self.embeddings = image_embeddings
self.paths = image_paths
self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
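    # 补充示意(假设性的辅助方法,原文未给出):离线用 CLIP 批量编码图片,
    # 构建 __init__ 所需的 image_embeddings 索引
    @torch.no_grad()
    def encode_images(self, image_paths: list) -> torch.Tensor:
        from PIL import Image
        images = [Image.open(p).convert("RGB") for p in image_paths]
        inputs = self.processor(images=images, return_tensors="pt")
        return self.model.get_image_features(**inputs)  # (N, 512)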
def retrieve(self, query: str, top_k: int = 3) -> list:
"""检索相关图片"""
# 编码查询
inputs = self.processor(text=[query], return_tensors="pt", padding=True)
text_embeds = self.model.get_text_features(**inputs)
# 计算相似度
similarities = F.cosine_similarity(
text_embeds.unsqueeze(1),
torch.tensor(self.embeddings).unsqueeze(0),
dim=-1
)
# 获取 top-k
top_indices = similarities[0].argsort(descending=True)[:top_k]
        return [self.paths[i] for i in top_indices]
五、图像生成
5.1 DALL-E 3
from openai import OpenAI
def generate_image(prompt: str, size: str = "1024x1024") -> str:
"""使用 DALL-E 3 生成图片"""
client = OpenAI()
response = client.images.generate(
model="dall-e-3",
prompt=prompt,
size=size, # 1024x1024, 1792x1024, 1024x1792
quality="hd", # standard, hd
n=1,
)
image_url = response.data[0].url
revised_prompt = response.data[0].revised_prompt
print(f"优化后的提示词: {revised_prompt}")
return image_url
def edit_image(image_path: str, mask_path: str, prompt: str) -> str:
"""编辑图片"""
client = OpenAI()
response = client.images.edit(
model="dall-e-2", # 目前只有 DALL-E 2 支持编辑
image=open(image_path, "rb"),
mask=open(mask_path, "rb"),
prompt=prompt,
size="1024x1024",
n=1,
)
return response.data[0].url
def create_variation(image_path: str) -> str:
"""创建图片变体"""
client = OpenAI()
response = client.images.create_variation(
model="dall-e-2",
image=open(image_path, "rb"),
size="1024x1024",
n=1,
)
return response.data[0].url
# 使用示例
image_url = generate_image(
"一只橘猫坐在窗台上看着窗外的雨,水彩画风格,柔和的光线"
)
print(f"生成的图片: {image_url}")5.2 Stable Diffusion(本地部署)
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
def setup_stable_diffusion():
"""设置 Stable Diffusion"""
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
)
# 使用更快的调度器
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
# 启用内存优化
pipe.enable_attention_slicing()
return pipe
def generate_with_sd(
pipe,
prompt: str,
negative_prompt: str = None,
num_steps: int = 25,
guidance_scale: float = 7.5,
seed: int = None,
) -> "Image":
"""使用 Stable Diffusion 生成图片"""
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
image = pipe(
prompt=prompt,
negative_prompt=negative_prompt or "blurry, bad quality, distorted",
num_inference_steps=num_steps,
guidance_scale=guidance_scale,
generator=generator,
).images[0]
return image
# 使用
pipe = setup_stable_diffusion()
image = generate_with_sd(
pipe,
prompt="a beautiful sunset over mountains, photorealistic, 8k, detailed",
negative_prompt="cartoon, anime, drawing",
num_steps=30,
guidance_scale=7.5,
seed=42,
)
image.save("generated_image.png")
5.3 提示词工程(图像生成)
class ImagePromptEngineer:
"""图像生成提示词工程"""
# 风格修饰词
STYLES = {
"写实": "photorealistic, 8k, highly detailed, professional photography",
"油画": "oil painting, impressionist style, textured brushstrokes",
"水彩": "watercolor painting, soft colors, flowing, artistic",
"动漫": "anime style, vibrant colors, detailed illustration",
"赛博朋克": "cyberpunk, neon lights, futuristic, dark atmosphere",
"极简": "minimalist, clean, simple, white background",
}
# 质量修饰词
QUALITY_BOOSTERS = [
"masterpiece",
"best quality",
"highly detailed",
"sharp focus",
"professional",
]
# 负面提示词
NEGATIVE_PROMPTS = {
"通用": "blurry, bad quality, distorted, ugly, deformed",
"人物": "bad anatomy, bad hands, missing fingers, extra limbs",
"风景": "oversaturated, unrealistic colors, poor composition",
}
def enhance_prompt(
self,
base_prompt: str,
style: str = None,
add_quality: bool = True,
) -> str:
"""增强提示词"""
parts = [base_prompt]
# 添加风格
if style and style in self.STYLES:
parts.append(self.STYLES[style])
# 添加质量词
if add_quality:
parts.extend(self.QUALITY_BOOSTERS[:3])
return ", ".join(parts)
def get_negative_prompt(self, category: str = "通用") -> str:
"""获取负面提示词"""
return self.NEGATIVE_PROMPTS.get(category, self.NEGATIVE_PROMPTS["通用"])
# 使用
engineer = ImagePromptEngineer()
base = "一只猫坐在窗台上"
enhanced = engineer.enhance_prompt(base, style="水彩", add_quality=True)
negative = engineer.get_negative_prompt("通用")
print(f"增强后: {enhanced}")
print(f"负面提示: {negative}")六、总结
多模态核心要点
mindmap
root((多模态))
理解
CLIP 基础
VLM 架构
图像编码器+投影+LLM
模型选择
GPT-4o 商业首选
LLaVA 开源首选
Qwen-VL 中文首选
应用场景
文档理解
图表分析
内容审核
多模态RAG
图像生成
DALL-E 3
Stable Diffusion
提示词工程
关键 Takeaway
- 多模态 = 视觉 + 语言的融合:让 AI 真正"看懂"世界
- VLM 架构:视觉编码器 + 投影层 + LLM
- API 使用简单:GPT-4V、Claude 3 都支持图片输入
- 本地部署可行:LLaVA、Qwen-VL 可以在消费级 GPU 运行
- 应用场景丰富:文档理解、图表分析、内容审核、多模态 RAG
- 图像生成成熟:DALL-E 3、Stable Diffusion 效果优秀
模型选择建议
| 场景 | 推荐模型 | 备注 |
|---|---|---|
| 商业应用 | GPT-4o | 效果最好,成本高 |
| 开源部署 | LLaVA 1.6 | 7B 即可,效果不错 |
| 中文场景 | Qwen-VL | 中文理解更好 |
| 图像生成 | DALL-E 3 | 质量高,提示词自动优化 |
| 本地生成 | SD XL | 开源,可定制 |
下一步学习
- [ ] 视频理解:Gemini 1.5, GPT-4o
- [ ] 语音交互:Whisper, TTS
- [ ] 多模态 Agent:自动操作界面
参考资料
- CLIP 论文 - Learning Transferable Visual Models From Natural Language Supervision
- LLaVA 论文 - Visual Instruction Tuning
- GPT-4V 技术报告
- Qwen-VL - 阿里多模态模型
- Stable Diffusion - 开源图像生成