Feature Engineering

Feature Selection

1. Filter Methods

Filter methods select features based on their statistical properties alone, without depending on any particular machine learning algorithm.

# Filter-based feature selection
1. Variance thresholding
   - Drops features whose variance is close to zero
   - Suited to numerical features
   - Simple and efficient

2. Correlation analysis
   - Measures the correlation between each feature and the target variable
   - Keeps the features with the strongest correlation
   - Suited to numerical features

3. Information gain
   - Scores each feature by its information gain (mutual information) with the target
   - Keeps the features with the largest gain
   - Suited to classification problems

4. Chi-square test
   - Computes the chi-square statistic between each feature and the target variable
   - Keeps the features with the largest chi-square values
   - Suited to classification problems (requires non-negative feature values)

# Code example
import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, f_classif,
    mutual_info_classif, chi2
)

# 1. Variance thresholding: drop features whose variance falls below the threshold
def variance_selection(X, threshold=0.01):
    selector = VarianceThreshold(threshold=threshold)
    return selector.fit_transform(X)

# 2. Correlation analysis: f_classif scores each feature with the ANOVA F-statistic
def correlation_selection(X, y, k=10):
    selector = SelectKBest(f_classif, k=k)
    return selector.fit_transform(X, y)

# 3. Information gain: mutual information between each feature and the target
def information_gain_selection(X, y, k=10):
    selector = SelectKBest(mutual_info_classif, k=k)
    return selector.fit_transform(X, y)

# 4. Chi-square test (note: chi2 requires non-negative feature values)
def chi2_selection(X, y, k=10):
    selector = SelectKBest(chi2, k=k)
    return selector.fit_transform(X, y)

# Dispatcher over the filter methods above
def filter_feature_selection(X, y, method='variance', **kwargs):
    if method == 'variance':
        return variance_selection(X, **kwargs)
    elif method == 'correlation':
        return correlation_selection(X, y, **kwargs)
    elif method == 'information_gain':
        return information_gain_selection(X, y, **kwargs)
    elif method == 'chi2':
        return chi2_selection(X, y, **kwargs)
    else:
        raise ValueError("Invalid filter method")
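
For reference, here is a minimal usage sketch on synthetic data; the dataset, its shape, and k=5 are illustrative assumptions, not part of the original example. Note that fit_transform returns a bare array, so to recover which columns were kept you would fit the selector object directly and call its get_support() method.

# Minimal usage sketch (synthetic data; k=5 is an arbitrary choice)
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=42
)

X_selected = filter_feature_selection(X_demo, y_demo, method='correlation', k=5)
print(X_selected.shape)  # (200, 5): only the 5 top-scoring features remain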

2. Wrapper Methods

Wrapper methods tie feature selection to a specific machine learning algorithm: candidate feature subsets are scored by the performance of a model trained on them.

# Wrapper-based feature selection
1. Recursive feature elimination (RFE)
   - Removes features recursively
   - Ranks features by model-derived importance
   - Works with any estimator that exposes coefficients or feature importances

2. Forward selection
   - Adds features one at a time
   - Evaluates the impact of each candidate feature
   - Practical when the number of features is small

3. Backward elimination
   - Removes features one at a time
   - Evaluates the impact of each candidate feature
   - Suited to cases that start from a larger feature set

4. Genetic algorithms
   - Mimic the process of natural selection
   - Evolve feature subsets toward better performance
   - Suited to large-scale feature selection (a sketch follows the code example below)

# Code example
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 1. Recursive feature elimination: repeatedly fit the model and drop the
#    least important features until the target count is reached
def recursive_feature_elimination(X, y, n_features_to_select=10):
    estimator = RandomForestClassifier(n_estimators=100)
    selector = RFE(estimator, n_features_to_select=n_features_to_select)
    return selector.fit_transform(X, y)

# 2. Forward selection: greedily add the feature that most improves
#    cross-validated accuracy until n_features_to_select are chosen
def forward_selection(X, y, n_features_to_select=10):
    n_features = X.shape[1]
    selected_features = []

    for _ in range(n_features_to_select):
        best_score = -np.inf
        best_feature = None

        # Try each remaining feature; keep the one with the best CV score
        for feature in range(n_features):
            if feature not in selected_features:
                current_features = selected_features + [feature]
                X_subset = X[:, current_features]
                score = cross_val_score(
                    LogisticRegression(max_iter=1000),
                    X_subset,
                    y,
                    cv=5
                ).mean()

                if score > best_score:
                    best_score = score
                    best_feature = feature

        selected_features.append(best_feature)

    return X[:, selected_features]

# 3. Backward elimination: starting from all features, repeatedly drop the
#    feature whose removal hurts cross-validated accuracy the least
#    (the original version dropped the feature whose removal hurt the most,
#    i.e., it discarded the most important feature each round)
def backward_elimination(X, y, n_features_to_select=10):
    n_features = X.shape[1]
    selected_features = list(range(n_features))

    while len(selected_features) > n_features_to_select:
        best_score = -np.inf
        feature_to_drop = None

        # Score the subset obtained by removing each remaining feature
        for feature in selected_features:
            current_features = selected_features.copy()
            current_features.remove(feature)
            X_subset = X[:, current_features]
            score = cross_val_score(
                LogisticRegression(max_iter=1000),
                X_subset,
                y,
                cv=5
            ).mean()

            # Highest score after removal means the dropped feature
            # contributed the least
            if score > best_score:
                best_score = score
                feature_to_drop = feature

        selected_features.remove(feature_to_drop)

    return X[:, selected_features]

# Dispatcher over the wrapper methods above
def wrapper_feature_selection(X, y, method='rfe', **kwargs):
    if method == 'rfe':
        return recursive_feature_elimination(X, y, **kwargs)
    elif method == 'forward':
        return forward_selection(X, y, **kwargs)
    elif method == 'backward':
        return backward_elimination(X, y, **kwargs)
    else:
        raise ValueError("Invalid wrapper method")
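
The dispatcher above covers the first three wrapper methods; genetic algorithms, mentioned in the list, are not implemented. Below is a minimal sketch, assuming a boolean-mask encoding of feature subsets; the population size, generation count, and mutation rate are arbitrary illustrative choices, and a dedicated library such as DEAP would be a more complete option in practice.

# Hypothetical sketch of genetic-algorithm feature selection.
# All hyperparameters here are illustrative assumptions, not tuned values.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def genetic_selection(X, y, n_generations=10, population_size=20,
                      mutation_rate=0.1, random_state=42):
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]

    # Each individual is a boolean mask over the features
    population = rng.random((population_size, n_features)) < 0.5

    def fitness(mask):
        if not mask.any():  # empty subsets are invalid
            return -np.inf
        return cross_val_score(
            LogisticRegression(max_iter=1000), X[:, mask], y, cv=3
        ).mean()

    for _ in range(n_generations):
        scores = np.array([fitness(ind) for ind in population])
        # Selection: keep the better-scoring half as parents
        parents = population[np.argsort(scores)[-(population_size // 2):]]
        children = []
        while len(children) < population_size:
            # Crossover: combine two random parents gene-by-gene
            p1, p2 = parents[rng.integers(len(parents), size=2)]
            cross = rng.random(n_features) < 0.5
            child = np.where(cross, p1, p2)
            # Mutation: flip each gene with probability mutation_rate
            flip = rng.random(n_features) < mutation_rate
            children.append(np.logical_xor(child, flip))
        population = np.array(children)

    # Return the columns selected by the fittest surviving mask
    scores = np.array([fitness(ind) for ind in population])
    best_mask = population[np.argmax(scores)]
    return X[:, best_mask]

The fitness function reuses the cross-validated LogisticRegression score from forward and backward selection above, which keeps the sketch consistent with the rest of the section.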

3. Embedded Methods

Embedded methods fold feature selection into the model training process itself.

# Embedded feature selection
1. Lasso regularization
   - Uses an L1 penalty
   - Drives some coefficients exactly to zero, selecting features automatically
   - Suited to linear models

2. Ridge regularization
   - Uses an L2 penalty
   - Shrinks coefficients without zeroing them, so selection must threshold on their magnitude
   - Suited to linear models

3. Decision tree feature importance
   - Based on each feature's contribution to impurity reduction
   - Suited to decision tree models
   - Can capture nonlinear relationships

4. Random forest feature importance
   - Based on each feature's contribution to the ensemble's predictions
   - Suited to random forest models
   - Handles high-dimensional feature spaces

# Code example
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# 1. Lasso regularization: the L1 penalty drives some coefficients to zero,
#    so SelectFromModel keeps only features with non-negligible weights.
#    (Lasso is a regression model; for classification targets, consider
#    LogisticRegression with an L1 penalty instead.)
def lasso_selection(X, y, alpha=0.01):
    selector = SelectFromModel(Lasso(alpha=alpha))
    return selector.fit_transform(X, y)

# 2. Ridge regularization: the L2 penalty shrinks coefficients without
#    zeroing them, so SelectFromModel thresholds on coefficient magnitude
def ridge_selection(X, y, alpha=1.0):
    selector = SelectFromModel(Ridge(alpha=alpha))
    return selector.fit_transform(X, y)

# 3. Decision tree feature importance (impurity-based; the original code
#    mistakenly reused a random forest here)
def tree_importance_selection(X, y, threshold='median'):
    selector = SelectFromModel(
        DecisionTreeClassifier(),
        threshold=threshold
    )
    return selector.fit_transform(X, y)

# 4. Random forest feature importance (impurity-based, averaged over trees)
def forest_importance_selection(X, y, threshold='median'):
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100),
        threshold=threshold
    )
    return selector.fit_transform(X, y)

# Dispatcher over the embedded methods above
def embedded_feature_selection(X, y, method='lasso', **kwargs):
    if method == 'lasso':
        return lasso_selection(X, y, **kwargs)
    elif method == 'ridge':
        return ridge_selection(X, y, **kwargs)
    elif method == 'tree':
        return tree_importance_selection(X, y, **kwargs)
    elif method == 'forest':
        return forest_importance_selection(X, y, **kwargs)
    else:
        raise ValueError("Invalid embedded method")
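
As a closing illustration, a minimal usage sketch for the embedded dispatcher; the synthetic dataset and the alpha value are illustrative assumptions, not recommended settings.

# Minimal usage sketch (synthetic data; alpha=0.05 is an arbitrary choice)
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

X_lasso = embedded_feature_selection(X_demo, y_demo, method='lasso', alpha=0.05)
X_forest = embedded_feature_selection(X_demo, y_demo, method='forest')
print(X_lasso.shape, X_forest.shape)  # kept-column counts depend on the fitted models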