Scikit-learn: From Getting Started to Giving Up


Chapter 1: First Encounter - "This Is Way Too Easy!"

Every machine learning beginner has lived through this honeymoon phase.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X_test)

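Three lines of code, one model. You start to feel like the chosen one, and machine learning seems like no big deal. You even start writing "proficient in machine learning" on your résumé.

At this point, your smile is bright and your hair is thick.

Chapter 2: Falling in Love - "Sklearn Is a Treasure Trove!"

You discover that Sklearn is like a treasure chest: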


# Data preprocessing? Got it!
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

# Feature selection? Got it!
from sklearn.feature_selection import SelectKBest, RFE

# Models? Everything you could wish for!
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Cross-validation? Of course!
from sklearn.model_selection import cross_val_score, GridSearchCV

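You start hoarding models like you're collecting Pokémon. Your daily catchphrase becomes "let's try RandomForest on this one".

Chapter 3: The Adjustment Period - "Wait, Why Is It Throwing Errors?"

3.1 Dimension Hell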


model.fit(X, y)
# ValueError: Expected 2D array, got 1D array instead

You stare at the screen and think: but my X clearly is an array?

Then you learn:

X = X.reshape(-1, 1)  # from now on, reshape is your closest friend
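
Here is a minimal sketch of what that reshape actually does. The toy data (one feature, y = 2x + 1) is made up for illustration; the point is only that estimators expect X with shape (n_samples, n_features), even when there is a single feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: a single feature with y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

print(x.shape)          # 1D: passing this to fit() raises the ValueError above

# -1 tells NumPy to infer the number of rows; 1 means "one feature column"
X = x.reshape(-1, 1)
print(X.shape)          # 2D: (n_samples, n_features)

model = LinearRegression().fit(X, y)

# predict() wants a 2D array as well, even for a single sample
pred = model.predict(np.array([[5.0]]))
print(pred)
```

Note that only X needs to be 2D; the target y stays one-dimensional.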

3.2 The Type Trap


# You thought this would be enough
df['category'] = df['category'].astype('category')

# Sklearn sneers
# ValueError: could not convert string to float

It turns out Sklearn is allergic to strings, so you have to:


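# The classic tools (LabelEncoder is meant for target labels, not input features)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Or the more modern way
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# categorical_columns and numerical_columns are lists of DataFrame column names
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_columns),
        ('num', StandardScaler(), numerical_columns)
    ])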
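
To see the preprocessor in action, here is a small sketch on a made-up DataFrame. The column names (`city`, `age`) and the values are assumptions for illustration only, and `handle_unknown="ignore"` is an optional extra so that unseen categories at prediction time do not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one string column, one numeric column
df = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Beijing", "Shenzhen"],
    "age": [25, 32, 47, 51],
})
categorical_columns = ["city"]
numerical_columns = ["age"]

preprocessor = ColumnTransformer(
    transformers=[
        # Strings become one 0/1 column per distinct category
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
        # Numbers are rescaled to zero mean and unit variance
        ("num", StandardScaler(), numerical_columns),
    ]
)

X_encoded = preprocessor.fit_transform(df)

# 3 distinct cities -> 3 one-hot columns, plus 1 scaled numeric column
print(X_encoded.shape)
```

The output columns follow the order of the `transformers` list, not the order of the columns in the DataFrame.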

At this point you start to feel a twinge of unease.

3.3 Missing-Value Phobia


model.fit(X, y)
# ValueError: Input contains NaN

Sklearn's attitude toward NaN is like your mom's attitude toward you being single: zero tolerance.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
X_clean = imputer.fit_transform(X)
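
A tiny worked example makes the 'mean' strategy concrete. The 3×2 array below is invented for illustration: each NaN is replaced by the mean of the non-missing values in its own column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")
X_clean = imputer.fit_transform(X)

# The learned fill values, one per column:
# column 0 -> (1 + 3) / 2, column 1 -> (10 + 20) / 2
print(imputer.statistics_)
print(X_clean)
```

As with scalers, the fill values should be learned on the training set and then reused on the test set with `transform`.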

Chapter 4: The Fighting Phase - "What the Heck Is a Pipeline?"

As your preprocessing steps pile up, your code starts turning into spaghetti:


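# Your code
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Careful! transform, not fit_transform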

selector = SelectKBest(k=10)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

model = RandomForestClassifier()
model.fit(X_train_selected, y_train)
predictions = model.predict(X_test_selected)

Then someone tells you that you should be using Pipeline:


from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=10)),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

"Wow, so elegant!" you think.

Until you need to tune hyperparameters:


from sklearn.model_selection import GridSearchCV

param_grid = {
    'selector__k': [5, 10, 15],
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [5, 10, None]
}

# What on earth is this double-underscore naming convention???
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
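
The double underscore is less mysterious than it looks: a key has the form `<step name>__<parameter name>`, where the step name is whatever label you gave the step in the Pipeline. A short sketch, rebuilding the same pipeline so the snippet stands alone:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(k=10)),
    ("classifier", RandomForestClassifier()),
])

# get_params() lists every tunable name; nested ones contain "__"
nested_keys = sorted(k for k in pipeline.get_params() if "__" in k)
print(nested_keys[:5])

# The same names work with set_params(), which is what GridSearchCV
# relies on when it tries each combination in param_grid
pipeline.set_params(selector__k=5, classifier__n_estimators=200)
print(pipeline.named_steps["selector"].k)
print(pipeline.named_steps["classifier"].n_estimators)
```

If a key in `param_grid` is misspelled, checking it against `pipeline.get_params()` is usually the fastest way to find the typo.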

You start questioning your life choices.

Chapter 5: The Cold War - "Tuning Makes Me Happy (Not Really)"

5.1 GridSearchCV: The Aesthetics of Brute Force


param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 5 × 5 × 3 × 3 combinations × 5-fold cross-validation = 1,125 fits
# Estimated time of completion: next year
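
Before pressing Enter on a grid like this, it can help to count the fits first. One way to do that, sketched here with the same grid, is `ParameterGrid`, which enumerates the combinations a grid search would try:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [5, 10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid))   # 5 * 5 * 3 * 3
n_folds = 5
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With the default `refit=True`, GridSearchCV also trains the best candidate once more on the full training data at the end, so the true total is one fit higher.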

You kick off GridSearchCV, then grab a coffee, eat lunch, watch a movie, take a nap... and it's still running.

5.2 RandomizedSearchCV: Tuning by Superstition


from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

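param_distributions = {
    'n_estimators': randint(100, 500),        # integers in [100, 500)
    'max_depth': randint(5, 50),              # integers in [5, 50)
    'min_samples_split': uniform(0.01, 0.1)   # floats in [0.01, 0.11], read as a fraction of the samples
}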

random_search = RandomizedSearchCV(model, param_distributions, n_iter=100, cv=5)

Essentially: I have no idea which parameters are good, so let's just try some at random.
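
In fairness to the superstition, the cost is at least predictable: `n_iter` fixes the number of candidates no matter how large the search space is. Below is a small runnable sketch; the synthetic dataset and the deliberately small ranges, `n_iter=5`, and `cv=3` are assumptions chosen so it finishes in seconds.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for your real data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_distributions = {
    "n_estimators": randint(10, 50),            # integers in [10, 50)
    "max_depth": randint(2, 10),                # integers in [2, 10)
    "min_samples_split": uniform(0.01, 0.1),    # floats in [0.01, 0.11]
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # exactly 5 sampled candidates, regardless of space size
    cv=3,
    random_state=0,    # makes the sampled candidates reproducible
)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))   # number of candidates tried
```

The trade-off is the obvious one: 5 candidates × 3 folds is 15 fits, but nothing guarantees the best region of the space was sampled.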

5.3 A Classic Scene

You: Is GridSearchCV done yet?
Computer: 47 hours to go.
You: What are the best parameters?
Computer: n_estimators=100 (the default)
You: ...

Chapter 6: On the Verge of Breaking Up - "Why Won't the Accuracy Go Up?"

6.1 Overfitting: King of the Training Set


print(f"Training accuracy: {model.score(X_train, y_train):.4f}")  # 0.9999
print(f"Test accuracy: {model.score(X_test, y_test):.4f}")   # 0.6543

# On the training set, your model is a straight-A student
# On the test set, it's flunking

6.2 Underfitting: Well-Rounded Mediocrity


print(f"Training accuracy: {model.score(X_train, y_train):.4f}")  # 0.6012
print(f"Test accuracy: {model.score(X_test, y_test):.4f}")   # 0.5987

# Equally bad on the training set and the test set
# Mediocrity with no favorites

6.3 Data Leakage: Sweet Lies


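# The wrong way (but you definitely did this at first)
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit on ALL the data: a cardinal sin!

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
model.fit(X_train, y_train)

# 99% accuracy! I'm a genius!
# In reality, you cheated: the scaler already saw the test set's statistics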
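
For contrast, here is one leak-free way to write the same thing: split first, then let a pipeline fit the scaler on the training portion only. The synthetic data and the choice of LogisticRegression are assumptions for illustration; any estimator would do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for your real data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 1. Split BEFORE any preprocessing touches the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. The pipeline fits the scaler on X_train only ...
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# 3. ... and merely applies it to X_test when scoring
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")

# Sanity check: the scaler's mean matches the training data, not the full data
scaler = model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

The number you get this way may be less flattering, but it is the one you can trust.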

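Chapter 7: Getting Back Together - "Turns Out I Just Didn't Understand You"

After countless late nights of tuning, it finally clicks:

7.1 Understanding fit, transform, and fit_transform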


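# fit: learn statistics from the data
# transform: apply what was learned
# fit_transform: learn and apply in one step

# Training set: fit_transform (learn and apply)
# Test set: transform (apply only, using parameters learned from the training set)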
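
The comments above can be checked on a three-row toy example. The numbers are made up; the thing to watch is that the test value is scaled with the training mean and standard deviation, never its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# fit_transform on the training set: learn mean/std, then apply them
X_train_scaled = scaler.fit_transform(X_train)
print(scaler.mean_, scaler.scale_)   # statistics learned from X_train only

# transform on the test set: reuse the training statistics
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)                 # (4 - train mean) / train std

# fit_transform(X) is equivalent to fit(X) followed by transform(X)
same = np.allclose(X_train_scaled, StandardScaler().fit(X_train).transform(X_train))
print(same)
```

Calling `fit_transform` on the test set instead would silently replace the training statistics with test statistics, which is exactly the leak from section 6.3 in miniature.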

7.2 Cross-Validation Done Right


from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    model, X, y, 
    cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)

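# Don't look at accuracy alone!
# (the plain 'precision', 'recall', and 'f1' scorers assume a binary target)
print(f"Accuracy: {cv_results['test_accuracy'].mean():.4f}")
print(f"Precision: {cv_results['test_precision'].mean():.4f}")
print(f"Recall: {cv_results['test_recall'].mean():.4f}")
print(f"F1 score: {cv_results['test_f1'].mean():.4f}")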
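
Since `return_train_score=True` is already switched on, the training scores are worth a look too: the gap between training and validation scores is a quick overfitting check. A sketch on synthetic data, using an unpruned decision tree as an assumed stand-in for "a model that memorizes":

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

cv_results = cross_validate(
    DecisionTreeClassifier(random_state=0),   # no depth limit: prone to overfit
    X, y,
    cv=5,
    scoring=["accuracy", "f1"],
    return_train_score=True,
)

train_acc = cv_results["train_accuracy"].mean()
test_acc = cv_results["test_accuracy"].mean()

# A large positive gap is the section 6.1 symptom, measured per fold
print(f"Train accuracy:      {train_acc:.4f}")
print(f"Validation accuracy: {test_acc:.4f}")
print(f"Gap:                 {train_acc - test_acc:.4f}")
```

With multiple scorers, the result keys follow the pattern `train_<name>` and `test_<name>`, one array entry per fold.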

7.3 Learning Curves: Seeing Into the Model's Soul


import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training set')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Validation set')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Learning curve: how the model grows up')
plt.show()

Chapter 8: Maturity - "Giving Up Is Out of the Question"

In the end, you settle on your own set of best practices:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 1. Spell out the numerical and categorical columns
numerical_features = ['age', 'income', 'score']
categorical_features = ['gender', 'city', 'category']

# 2. Build the preprocessor
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# 3. Build the full Pipeline
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 4. Cross-validate
cv_scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='f1_macro')
print(f"Cross-validated F1 score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# 5. Final training and evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

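Epilogue: The Pitfalls We Fell Into Over the Years

Pitfall	Symptom	Remedy
Forgetting to split the dataset	99.9% accuracy	train_test_split
Tuning on the test set	Inflated accuracy	Use a validation set or cross-validation
Forgetting to handle missing values	ValueError	SimpleImputer
Features not standardized	SVM/KNN perform poorly	StandardScaler
Class imbalance	Minority class always predicted wrong	class_weight='balanced'
Random seed not fixed	Results not reproducible	random_state=42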
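
Of the remedies in the table, `class_weight='balanced'` is the least self-explanatory. As a rough sketch, it weights each class by n_samples / (n_classes × class count), so rare classes count for more. The 90/10 label split below is invented for illustration:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same formula written out by hand: n_samples / (n_classes * count per class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
print(np.allclose(weights, manual))
```

Passing `class_weight='balanced'` to an estimator that supports it (RandomForestClassifier and SVC both do) applies these weights during training, so mistakes on the minority class cost more.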

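The Ultimate Takeaway

Sklearn isn't there to make you give up; it's there to keep you hopping back and forth on the edge of giving up.

After you've changed your parameters for the 108th time, refactored your Pipeline for the 67th time, and questioned your life choices for the 23rd time, you'll realize:

The essence of machine learning isn't making the machine learn; it's making you learn.

And Sklearn is the strict teacher who hands you candy and spanks you in the same breath.

"Tuning once feels great; tuning forever feels great forever."

To all the data scientists who lost themselves in Sklearn and found themselves again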