
This project uses Shapley values to select the top n features and is compatible with scikit-learn pipelines.

Project description

Zoish

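Zoish is a package designed to simplify machine learning development. One of its main components is a class that uses SHAP (SHapley Additive exPlanations) for better feature selection. It is compatible with scikit-learn pipelines. The package uses FastTreeSHAP to compute SHAP values and SHAP for plotting.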
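To make the idea of a Shapley value concrete, here is a minimal brute-force sketch. It is not how FastTreeSHAP or SHAP compute values for tree models (they use tree-specific algorithms); it simply averages each feature's marginal contribution over every ordering. The `toy_value` function and its payoffs are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = set()
        for f in order:
            before = value_fn(frozenset(coalition))
            coalition.add(f)
            after = value_fn(frozenset(coalition))
            totals[f] += after - before
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: two useful features with an interaction, one useless."""
    payoff = 0.0
    if "age" in coalition:
        payoff += 2.0
    if "hours" in coalition:
        payoff += 1.0
    if "age" in coalition and "hours" in coalition:
        payoff += 1.0  # interaction bonus, shared equally by both features
    return payoff


phi = shapley_values(toy_value, ["age", "hours", "race"])
print(phi)
```

The values sum to the payoff of the full feature set, and the feature that never changes the payoff receives zero, which is why ranking by Shapley value is a reasonable basis for feature selection.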

Introduction

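The ScallyShapFeatureSelector class of the Zoish package accepts a range of arguments: a tree-based estimator class and its tuning parameters, plus a search strategy (grid search, random search, or Optuna) with its own parameters. The samples are split into training and validation sets, and the optimization then estimates the best hyperparameters.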
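The search step can be pictured with a small, library-free grid search. This is only a sketch of the general technique, not Zoish's implementation: `toy_validation_score` is a hypothetical stand-in for "fit the estimator on the training split and score it on the validation split", and the candidate values mirror the `estimator_params` used in this document's example. The sketch treats each list as explicit candidates; how Zoish interprets these lists for each search strategy is not stated here.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination of candidate values and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    # Hypothetical score; a real objective would train and validate a model.
    return 0.80 + 0.01 * params["max_depth"] - 0.05 * params["reg_alpha"]


best_params, best_score = grid_search(
    {"max_depth": [4, 9], "reg_alpha": [0, 1]},
    toy_validation_score,
)
print(best_params, best_score)
```

Random search and Optuna follow the same outline but replace the exhaustive loop with sampled trials.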

After that, the subset of features with the highest SHAP values is returned. This subset can be used as the next step of a scikit-learn pipeline.

Installation

The Zoish package is available on PyPI and can be installed with pip:

pip install zoish

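Supported estimators

Usage

  • Find the features with the highest SHAP values for a given tree-based model after hyperparameter optimization.
  • Plot the SHAP summary plot for the selected features.
  • Return a sorted two-column Pandas data frame listing the features and their SHAP values.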
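The ranking behind such a sorted table can be sketched without any libraries. The snippet below assumes that per-sample SHAP values are aggregated by their mean absolute value, which is the usual convention for SHAP summary plots. Whether Zoish aggregates in exactly this way is an assumption, and the function name and the numbers are hypothetical.

```python
def rank_features(shap_matrix, feature_names, n_features):
    """Rank features by mean absolute SHAP value and keep the top n."""
    n_rows = len(shap_matrix)
    importance = [
        (name, sum(abs(row[j]) for row in shap_matrix) / n_rows)
        for j, name in enumerate(feature_names)
    ]
    # Sorted (feature, value) pairs play the role of the two-column table.
    ranking = sorted(importance, key=lambda pair: pair[1], reverse=True)
    selected = [name for name, _ in ranking[:n_features]]
    return ranking, selected


# Hypothetical SHAP values: one row per sample, one column per feature.
shap_matrix = [
    [0.5, -0.1, 0.0],
    [-0.3, 0.2, 0.1],
    [0.4, -0.3, 0.0],
]
ranking, selected = rank_features(
    shap_matrix, ["age", "education-num", "fnlwgt"], n_features=2
)
for name, value in ranking:
    print(f"{name:15s} {value:.3f}")
print(selected)
```

Taking absolute values before averaging matters: a feature that pushes predictions strongly in both directions would otherwise cancel out and look unimportant.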

Examples

Import required libraries

from zoish.feature_selectors.optunashap import OptunaShapFeatureSelector
import xgboost
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler
from sklearn.model_selection import KFold,train_test_split
import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.imputation import (
    CategoricalImputer,
    MeanMedianImputer
    )
from category_encoders import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score)
import lightgbm
import matplotlib.pyplot as plt
import optuna

Adult dataset (a classification problem)

urldata= "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# column names
col_names=["age", "workclass", "fnlwgt" , "education" ,"education-num",
"marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week",
"native-country","label"
]
# read data
data = pd.read_csv(urldata,header=None,names=col_names,sep=',')
data.head()

data.loc[data['label']=='<=50K','label']=0
data.loc[data['label']==' <=50K','label']=0

data.loc[data['label']=='>50K','label']=1
data.loc[data['label']==' >50K','label']=1

data['label']=data['label'].astype(int)
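Each label is matched twice above because the raw file separates fields with a comma followed by a space, so values can carry leading whitespace. A compact alternative is to strip the whitespace first and then map through a dictionary. The sketch below shows the idea on plain strings; `LABEL_MAP` and `encode_label` are illustrative names, not part of Zoish.

```python
LABEL_MAP = {"<=50K": 0, ">50K": 1}


def encode_label(raw):
    """Strip surrounding whitespace, then map the income bracket to 0 or 1."""
    return LABEL_MAP[raw.strip()]


labels = [encode_label(value) for value in [" <=50K", ">50K", " >50K", "<=50K"]]
print(labels)
```

With pandas, the same idea would read `data['label'].str.strip().map(LABEL_MAP)`, which replaces the four assignments and the final cast.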

Train test split

X = data.loc[:, data.columns != "label"]
y = data.loc[:, data.columns == "label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y['label'], random_state=42
)


Find feature types for later use

int_cols =  X_train.select_dtypes(include=['int']).columns.tolist()
float_cols =  X_train.select_dtypes(include=['float']).columns.tolist()
cat_cols =  X_train.select_dtypes(include=['object']).columns.tolist()

Define the feature selector and set its arguments

optuna_classification_lgb = OptunaShapFeatureSelector(
        # general argument setting        
        verbose=1,
        random_state=0,
        logging_basicConfig = None,
        # feature selection arguments
        n_features=4,
        list_of_obligatory_features_that_must_be_in_model=[],
        list_of_features_to_drop_before_any_selection=[],
        # estimator and its hyperparameter search space
        estimator=lightgbm.LGBMClassifier(),
        estimator_params={
        "max_depth": [4, 9],
        "reg_alpha": [0, 1],

        },
        # shap arguments
        model_output="raw", 
        feature_perturbation="interventional", 
        algorithm="auto", 
        shap_n_jobs=-1, 
        memory_tolerance=-1, 
        feature_names=None, 
        approximate=False, 
        shortcut=False, 
        plot_shap_summary=False,
        save_shap_summary_plot=True,
        path_to_save_plot = './summary_plot.png',
        shap_fig = plt.figure(),
        ## optuna params
        test_size=0.33,
        with_stratified = False,
        performance_metric = 'f1',
        # optuna study init params
        study = optuna.create_study(
            storage = None,
            sampler = TPESampler(),
            pruner= HyperbandPruner(),
            study_name  = None,
            direction = "maximize",
            load_if_exists = False,
            directions  = None,
            ),
        study_optimize_objective_n_trials=10, 

)

Build the sklearn pipeline



pipeline =Pipeline([
            # int missing values imputers
            ('intimputer', MeanMedianImputer(
                imputation_method='median', variables=int_cols)),
            # category missing values imputers
            ('catimputer', CategoricalImputer(variables=cat_cols)),
            # categorical encoder
            ('catencoder', OrdinalEncoder()),
            # feature selection
            ('optuna_classification_lgb', optuna_classification_lgb),
            # classification model
            ('logistic', LogisticRegression())


 ])


pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)


print('F1 score : ')
print(f1_score(y_test,y_pred))
print('Classification report : ')
print(classification_report(y_test,y_pred))
print('Confusion matrix : ')
print(confusion_matrix(y_test,y_pred))
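The F1 score printed above can be derived directly from the confusion-matrix counts. The following sketch recomputes it by hand for the positive class (label 1), which is what `f1_score` reports by default for binary targets. The sample labels are made up for illustration.

```python
def binary_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives for label 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and predictions.
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 1, 1]
tp, fp, fn = binary_counts(y_true_demo, y_pred_demo)
f1_demo = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, f1_demo)
```

Because the Adult labels are imbalanced, F1 is a more informative target than accuracy here, which is consistent with the selector being configured with `performance_metric = 'f1'`.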

More examples are available in the project's examples.

License

Licensed under the BSD 2-Clause License.
