Feature Engineering
Feature Selection
1. Filter Methods
Filter methods select features based on their statistical properties and do not depend on any specific machine learning algorithm.
# Feature Selection with Filter Methods

1. Variance thresholding
   - Remove features whose variance is close to 0
   - Suitable for numerical features
   - Simple and efficient
2. Correlation analysis
   - Compute the correlation between each feature and the target variable
   - Keep the most strongly correlated features
   - Suitable for numerical features
3. Information gain
   - Compute each feature's information gain (mutual information)
   - Keep the features with the largest gain
   - Suitable for classification problems
4. Chi-square test
   - Compute the chi-square statistic between each feature and the target variable
   - Keep the features with the largest chi-square values
   - Suitable for classification problems (features must be non-negative)

# Code Example

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, f_classif, mutual_info_classif, chi2
)

# 1. Variance thresholding: drop near-constant features
def variance_selection(X, threshold=0.01):
    selector = VarianceThreshold(threshold=threshold)
    return selector.fit_transform(X)

# 2. Correlation analysis: ANOVA F-test between each feature and the target
def correlation_selection(X, y, k=10):
    selector = SelectKBest(f_classif, k=k)
    return selector.fit_transform(X, y)

# 3. Information gain: mutual information between each feature and the target
def information_gain_selection(X, y, k=10):
    selector = SelectKBest(mutual_info_classif, k=k)
    return selector.fit_transform(X, y)

# 4. Chi-square test: requires non-negative feature values
def chi2_selection(X, y, k=10):
    selector = SelectKBest(chi2, k=k)
    return selector.fit_transform(X, y)

# Dispatch to one of the filter methods above
def filter_feature_selection(X, y, method='variance', **kwargs):
    if method == 'variance':
        return variance_selection(X, **kwargs)
    elif method == 'correlation':
        return correlation_selection(X, y, **kwargs)
    elif method == 'information_gain':
        return information_gain_selection(X, y, **kwargs)
    elif method == 'chi2':
        return chi2_selection(X, y, **kwargs)
    else:
        raise ValueError("Invalid filter method")
```
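To make the helpers above concrete, here is a minimal usage sketch on synthetic data. The dataset from `make_classification`, the parameter values, and the `MinMaxScaler` rescaling step are illustrative assumptions rather than part of the original example; the rescaling is there because `chi2` only accepts non-negative inputs.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler

# Illustrative synthetic dataset: 200 samples, 20 features, 5 of them informative
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=42)

X_var = filter_feature_selection(X, y, method='variance', threshold=0.01)
X_corr = filter_feature_selection(X, y, method='correlation', k=5)
X_mi = filter_feature_selection(X, y, method='information_gain', k=5)

# chi2 requires non-negative inputs, so rescale the features to [0, 1] first
X_nonneg = MinMaxScaler().fit_transform(X)
X_chi2 = filter_feature_selection(X_nonneg, y, method='chi2', k=5)

print(X.shape, X_var.shape, X_corr.shape, X_mi.shape, X_chi2.shape)
```

Because filter methods score each feature independently of any downstream model, they are cheap and make a reasonable first pass before the wrapper and embedded methods below.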
2. Wrapper Methods
Wrapper methods select features with the help of a specific machine learning algorithm, evaluating the performance of candidate feature subsets to decide which features to keep.
# Feature Selection with Wrapper Methods

1. Recursive feature elimination (RFE)
   - Recursively removes features
   - Ranks features by model-based importance
   - Works with any estimator that exposes coefficients or feature importances
2. Forward selection
   - Adds features one at a time
   - Evaluates the contribution of each candidate feature
   - Practical when the number of features is small
3. Backward elimination
   - Removes features one at a time
   - Evaluates the impact of dropping each feature
   - Typically used when starting from a larger feature set
4. Genetic algorithms
   - Simulate natural selection
   - Search for a good feature subset
   - Suitable for large-scale feature selection (not included in the code example below)

# Code Example

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 1. Recursive feature elimination
def recursive_feature_elimination(X, y, n_features_to_select=10):
    estimator = RandomForestClassifier(n_estimators=100)
    selector = RFE(estimator, n_features_to_select=n_features_to_select)
    return selector.fit_transform(X, y)

# 2. Forward selection: greedily add the feature that improves the CV score most
def forward_selection(X, y, n_features_to_select=10):
    n_features = X.shape[1]
    selected_features = []
    for _ in range(n_features_to_select):
        best_score = -np.inf
        best_feature = None
        for feature in range(n_features):
            if feature not in selected_features:
                current_features = selected_features + [feature]
                X_subset = X[:, current_features]
                score = cross_val_score(
                    LogisticRegression(max_iter=1000), X_subset, y, cv=5
                ).mean()
                if score > best_score:
                    best_score = score
                    best_feature = feature
        selected_features.append(best_feature)
    return X[:, selected_features]

# 3. Backward elimination: greedily drop the feature whose removal hurts the
#    CV score the least (i.e. leaves the highest score)
def backward_elimination(X, y, n_features_to_select=10):
    n_features = X.shape[1]
    selected_features = list(range(n_features))
    while len(selected_features) > n_features_to_select:
        best_score = -np.inf
        feature_to_drop = None
        for feature in selected_features:
            current_features = [f for f in selected_features if f != feature]
            X_subset = X[:, current_features]
            score = cross_val_score(
                LogisticRegression(max_iter=1000), X_subset, y, cv=5
            ).mean()
            if score > best_score:
                best_score = score
                feature_to_drop = feature
        selected_features.remove(feature_to_drop)
    return X[:, selected_features]

# Dispatch to one of the wrapper methods above
def wrapper_feature_selection(X, y, method='rfe', **kwargs):
    if method == 'rfe':
        return recursive_feature_elimination(X, y, **kwargs)
    elif method == 'forward':
        return forward_selection(X, y, **kwargs)
    elif method == 'backward':
        return backward_elimination(X, y, **kwargs)
    else:
        raise ValueError("Invalid wrapper method")
```
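A minimal usage sketch of the wrapper helpers, again on an assumed synthetic dataset (`make_classification` and the parameter values are illustrative choices). The feature set is kept small on purpose: forward selection and backward elimination retrain the model with cross-validation for every candidate feature at every step, so their cost grows quickly with the number of features.

```python
from sklearn.datasets import make_classification

# Small illustrative dataset: wrapper methods refit the model many times
X, y = make_classification(n_samples=200, n_features=15,
                           n_informative=5, random_state=42)

X_rfe = wrapper_feature_selection(X, y, method='rfe', n_features_to_select=5)
X_fwd = wrapper_feature_selection(X, y, method='forward', n_features_to_select=5)
X_bwd = wrapper_feature_selection(X, y, method='backward', n_features_to_select=5)

print(X_rfe.shape, X_fwd.shape, X_bwd.shape)  # (200, 5) for each
```

Recent versions of scikit-learn also provide `SequentialFeatureSelector` in `sklearn.feature_selection`, which implements forward and backward selection directly and is usually preferable to the hand-rolled loops above.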
3. Embedded Methods
Embedded methods build feature selection into the model training process itself.
# Feature Selection with Embedded Methods

1. Lasso regularization
   - Uses an L1 penalty
   - Drives some coefficients exactly to zero, selecting features automatically
   - Suitable for linear models
2. Ridge regularization
   - Uses an L2 penalty
   - Shrinks coefficients without zeroing them out
   - Suitable for linear models
3. Decision tree feature importance
   - Based on each feature's contribution to impurity reduction
   - Suitable for tree models
   - Handles non-linear relationships
4. Random forest feature importance
   - Based on each feature's contribution to the ensemble's predictions
   - Suitable for random forest models
   - Handles high-dimensional feature sets

# Code Example

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# 1. Lasso (L1) regularization: zeroed coefficients drop the corresponding features.
#    Note: Lasso treats the target as continuous; for classification targets,
#    LogisticRegression(penalty='l1', solver='liblinear') is the usual L1 alternative.
def lasso_selection(X, y, alpha=0.01):
    selector = SelectFromModel(Lasso(alpha=alpha))
    return selector.fit_transform(X, y)

# 2. Ridge (L2) regularization: coefficients are shrunk but not zeroed, so
#    SelectFromModel keeps features whose |coefficient| exceeds a threshold
def ridge_selection(X, y, alpha=1.0):
    selector = SelectFromModel(Ridge(alpha=alpha))
    return selector.fit_transform(X, y)

# 3. Decision tree feature importance (impurity-based)
def tree_importance_selection(X, y, threshold='median'):
    selector = SelectFromModel(
        DecisionTreeClassifier(),
        threshold=threshold
    )
    return selector.fit_transform(X, y)

# 4. Random forest feature importance (impurity-based, averaged over the trees)
def forest_importance_selection(X, y, threshold='median'):
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100),
        threshold=threshold
    )
    return selector.fit_transform(X, y)

# Dispatch to one of the embedded methods above
def embedded_feature_selection(X, y, method='lasso', **kwargs):
    if method == 'lasso':
        return lasso_selection(X, y, **kwargs)
    elif method == 'ridge':
        return ridge_selection(X, y, **kwargs)
    elif method == 'tree':
        return tree_importance_selection(X, y, **kwargs)
    elif method == 'forest':
        return forest_importance_selection(X, y, **kwargs)
    else:
        raise ValueError("Invalid embedded method")
```
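A minimal usage sketch for the embedded helpers. The synthetic dataset, the `StandardScaler` step, and the final `get_support` inspection are illustrative additions: L1/L2 penalties are sensitive to feature scale, so the features are standardized before Lasso, and fitting `SelectFromModel` directly is an easy way to see which columns were kept.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Illustrative synthetic dataset
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=42)

# L1/L2 penalties depend on feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

X_lasso = embedded_feature_selection(X_scaled, y, method='lasso', alpha=0.01)
X_forest = embedded_feature_selection(X, y, method='forest', threshold='median')
print(X_lasso.shape, X_forest.shape)

# To see *which* features survived, fit the selector directly and inspect it
selector = SelectFromModel(Lasso(alpha=0.01)).fit(X_scaled, y)
print(selector.get_support(indices=True))  # indices of the selected columns
```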