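This project uses Shapley values to select the top n features and is compatible with scikit-learn pipelines.
Project description
Zoish
Zoish is a package built to simplify machine learning development. One of its main components is a class that uses SHAP (SHapley Additive exPlanations) for better feature selection. It is compatible with scikit-learn pipelines. The package uses FastTreeSHAP to compute SHAP values and SHAP for plotting.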
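To make the idea behind SHAP concrete, here is a minimal sketch of an exact Shapley value computation on a toy value function. It is not how FastTreeSHAP or SHAP work internally (they use much faster tree-specific algorithms); the function names and the toy contributions are assumptions made up for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, players):
    """Exact Shapley values: average each player's marginal
    contribution over every possible ordering of the players."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            before = value_fn(frozenset(coalition))
            coalition.add(p)
            after = value_fn(frozenset(coalition))
            totals[p] += after - before
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def toy_value(subset):
    """Hypothetical model output when only `subset` of the features is known:
    additive contributions plus one interaction between "age" and "hours"."""
    contrib = {"age": 2.0, "hours": 1.0, "fnlwgt": 0.0}
    value = sum(contrib[f] for f in subset)
    if "age" in subset and "hours" in subset:
        value += 1.0  # interaction term, shared equally by both features
    return value


phi = shapley_values(toy_value, ["age", "hours", "fnlwgt"])
print(phi)
```

The values sum to the difference between the full-coalition output and the empty-coalition output, and a feature that never changes the output receives zero, which is why low-ranked features are natural candidates for removal.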
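Introduction
The ScallyShapFeatureSelector class of the Zoish package accepts a range of parameters: a tree-based estimator class and its tuning parameters, plus the search method (grid search, random search, or Optuna) and that method's own parameters. The samples are split into training and validation sets, and the optimization then estimates the best values for the tuning parameters.
The features with the highest SHAP values are then returned as the selected subset, which can feed the next step of a scikit-learn pipeline.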
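The search step can be pictured with a small random search, one of the options listed above. This is a sketch under assumptions, not Zoish's implementation: it assumes each parameter is given as a two-element [low, high] range, and `toy_validation_score` is a hypothetical stand-in for "fit on the training split, score on the validation split".

```python
import random


def random_search(param_ranges, score_fn, n_trials, seed=0):
    """Sample each parameter from its [low, high] range and keep the best score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {}
        for name, (low, high) in param_ranges.items():
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)  # inclusive integer draw
            else:
                params[name] = rng.uniform(low, high)  # continuous draw
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical validation score that peaks at max_depth=6, reg_alpha=0.3."""
    return -((params["max_depth"] - 6) ** 2) - (params["reg_alpha"] - 0.3) ** 2


# Float bounds are used for reg_alpha here so that it is sampled continuously.
ranges = {"max_depth": [4, 9], "reg_alpha": [0.0, 1.0]}
best_params, best_score = random_search(ranges, toy_validation_score, n_trials=50)
print(best_params, best_score)
```

Optuna replaces the blind sampling with an adaptive sampler and can prune unpromising trials early, but the loop structure (propose parameters, score on validation data, keep the best) is the same.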
Installation
The Zoish package is available on PyPI and can be installed with pip:
pip install zoish
Supported estimators
- XGBRegressor XGBoost
- XGBClassifier XGBoost
- RandomForestClassifier
- RandomForestRegressor
- CatBoostClassifier
- CatBoostRegressor
- BalancedRandomForestClassifier
- LGBMClassifier LightGBM
- LGBMRegressor LightGBM
- XGBSEKaplanNeighbors XGBoost Survival Embeddings
- XGBSEDebiasedBCE XGBoost Survival Embeddings
- XGBSEBootstrapEstimator XGBoost Survival Embeddings
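Usage
- Find the features with the highest SHAP values, using a specific tree-based model after hyperparameter optimization
- Plot the SHAP summary plot for the selected features
- Return a sorted two-column pandas DataFrame listing the features and their SHAP values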
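The sorted two-column table mentioned in the last item can be sketched with plain pandas. The SHAP values below are invented, the column names are hypothetical, and ranking by mean absolute SHAP value is a common convention assumed here rather than something the package description confirms.

```python
import pandas as pd

# Hypothetical per-sample SHAP values (rows = samples, columns = features).
shap_values = pd.DataFrame(
    {
        "age": [0.40, -0.30, 0.50],
        "education-num": [0.10, 0.20, -0.15],
        "hours-per-week": [-0.05, 0.02, 0.03],
    }
)

# Aggregate to one importance score per feature and sort in descending order.
importance = (
    shap_values.abs()
    .mean()
    .sort_values(ascending=False)
    .rename_axis("feature")
    .reset_index(name="mean_abs_shap")
)

# Keep the names of the two highest-ranked features.
top_2 = importance.head(2)["feature"].tolist()
print(importance)
print(top_2)
```

Taking the absolute value before averaging matters: a feature that pushes predictions strongly in both directions would otherwise average out near zero and look unimportant.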
Example
Import the required libraries
from zoish.feature_selectors.optunashap import OptunaShapFeatureSelector
import xgboost
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler
from sklearn.model_selection import KFold,train_test_split
import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.imputation import (
    CategoricalImputer,
    MeanMedianImputer,
)
from category_encoders import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
)
import lightgbm
import matplotlib.pyplot as plt
import optuna
Adult data set (a classification problem)
urldata= "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# column names
col_names = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week",
    "native-country", "label",
]
# read data
data = pd.read_csv(urldata,header=None,names=col_names,sep=',')
data.head()
data.loc[data['label']=='<=50K','label']=0
data.loc[data['label']==' <=50K','label']=0
data.loc[data['label']=='>50K','label']=1
data.loc[data['label']==' >50K','label']=1
data['label']=data['label'].astype(int)
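The four assignments above handle labels with and without a leading space. An equivalent, more compact approach is to strip whitespace first and then map the two classes; the sketch below uses a tiny hand-made frame named `raw` (a hypothetical name) rather than the downloaded data.

```python
import pandas as pd

# Small stand-in for the label column, including the leading-space variants.
raw = pd.DataFrame({"label": [" <=50K", ">50K", " >50K", "<=50K"]})

# Strip surrounding whitespace, then map each class name to 0 or 1.
# Any label missing from the mapping would become NaN, and the integer
# cast would then fail, which surfaces unexpected values early.
raw["label"] = raw["label"].str.strip().map({"<=50K": 0, ">50K": 1}).astype(int)
print(raw["label"].tolist())
```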
Train-test split
X = data.loc[:, data.columns != "label"]
y = data.loc[:, data.columns == "label"]
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.33, stratify=y['label'], random_state=42)
Find feature types for later use
int_cols = X_train.select_dtypes(include=['int']).columns.tolist()
float_cols = X_train.select_dtypes(include=['float']).columns.tolist()
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()
Define the feature selector and set its arguments
optuna_classification_lgb = OptunaShapFeatureSelector(
    # general argument setting
    verbose=1,
    random_state=0,
    logging_basicConfig=None,
    # feature selection arguments
    n_features=4,
    list_of_obligatory_features_that_must_be_in_model=[],
    list_of_features_to_drop_before_any_selection=[],
    # estimator and its search ranges
    estimator=lightgbm.LGBMClassifier(),
    estimator_params={
        "max_depth": [4, 9],
        "reg_alpha": [0, 1],
    },
    # shap arguments
    model_output="raw",
    feature_perturbation="interventional",
    algorithm="auto",
    shap_n_jobs=-1,
    memory_tolerance=-1,
    feature_names=None,
    approximate=False,
    shortcut=False,
    plot_shap_summary=False,
    save_shap_summary_plot=True,
    path_to_save_plot="./summary_plot.png",
    shap_fig=plt.figure(),
    # optuna params
    test_size=0.33,
    with_stratified=False,
    performance_metric="f1",
    # optuna study init params
    study=optuna.create_study(
        storage=None,
        sampler=TPESampler(),
        pruner=HyperbandPruner(),
        study_name=None,
        direction="maximize",
        load_if_exists=False,
        directions=None,
    ),
    study_optimize_objective_n_trials=10,
)
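The selector above takes a list of obligatory features and a list of features to drop, both left empty in this example. The package description does not spell out how they interact with n_features, so the following is one plausible reading, written as a hypothetical helper: dropped features are removed before ranking is applied, and obligatory features are kept in addition to the top n.

```python
def choose_features(ranked, n_features, obligatory=(), to_drop=()):
    """Pick the top `n_features` from an importance-ordered list.

    Assumed semantics (not confirmed by the package description):
    - names in `to_drop` are removed before the top n are taken;
    - names in `obligatory` are appended if they did not make the cut.
    """
    dropped = set(to_drop)
    candidates = [f for f in ranked if f not in dropped]
    selected = candidates[:n_features]
    for f in obligatory:
        if f not in selected:
            selected.append(f)
    return selected


# Features already ordered from most to least important (invented ranking).
ranked = ["age", "capital-gain", "education-num", "hours-per-week", "fnlwgt"]
selected = choose_features(
    ranked, n_features=2, obligatory=["fnlwgt"], to_drop=["capital-gain"]
)
print(selected)
```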
Build the scikit-learn pipeline
pipeline = Pipeline([
    # int missing values imputers
    ('intimputer', MeanMedianImputer(
        imputation_method='median', variables=int_cols)),
    # category missing values imputers
    ('catimputer', CategoricalImputer(variables=cat_cols)),
    # categorical encoder
    ('catencoder', OrdinalEncoder()),
    # feature selection
    ('optuna_classification_lgb', optuna_classification_lgb),
    # classification model
    ('logistic', LogisticRegression())
])
pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)
print('F1 score : ')
print(f1_score(y_test,y_pred))
print('Classification report : ')
print(classification_report(y_test,y_pred))
print('Confusion matrix : ')
print(confusion_matrix(y_test,y_pred))
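For readers unfamiliar with the F1 score printed above, here is a small pure-Python sketch of how it follows from true positives, false positives, and false negatives for a binary problem. The helper name and the sample labels are made up for illustration; scikit-learn's f1_score remains the function used in the example.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Binary F1 score: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_from_labels(y_true, y_pred)
print(score)
```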
More examples are available in the project's examples.
License
Licensed under the BSD 2-Clause license.