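This project uses Shapley values to select the top n features and is compatible with scikit-learn pipelines.

Project description

Zoish

Zoish is a package intended to simplify machine learning development. One of its main components is a class that uses SHAP (SHapley Additive exPlanations) for better feature selection. It is compatible with scikit-learn pipelines. The package uses FastTreeSHAP to compute SHAP values and SHAP for plotting.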

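Introduction

The ScallyShapFeatureSelector of the Zoish package accepts a range of parameters: a tree-based estimator class and its tuning parameters, plus the choice of grid search, random search, or Optuna together with their own parameters. The samples are split into a training set and a validation set, and the optimization then estimates the best values for those parameters.

After that, the best subset of features, those with the highest SHAP values, is returned. This subset can be used in the following steps of a scikit-learn pipeline.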
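To be usable as an intermediate step of a scikit-learn pipeline, a selector has to expose `fit` and `transform`. The sketch below shows that contract in plain Python, without scikit-learn. It is not Zoish code: `TopNVarianceSelector` is a hypothetical class, and column variance is only a stand-in for the SHAP-based importance that Zoish computes.

```python
from statistics import pvariance


class TopNVarianceSelector:
    """Hypothetical selector that keeps the n highest-scoring columns."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.selected_ = None

    def fit(self, rows, y=None):
        # Score every column; variance stands in for SHAP importance here.
        columns = list(rows[0].keys())
        scores = {col: pvariance([row[col] for row in rows]) for col in columns}
        ranked = sorted(columns, key=lambda col: scores[col], reverse=True)
        self.selected_ = ranked[: self.n_features]
        return self

    def transform(self, rows):
        # Keep only the columns chosen during fit.
        return [{col: row[col] for col in self.selected_} for row in rows]


rows = [
    {"age": 25, "hours-per-week": 40, "constant": 1},
    {"age": 52, "hours-per-week": 45, "constant": 1},
    {"age": 38, "hours-per-week": 20, "constant": 1},
]
selector = TopNVarianceSelector(n_features=2).fit(rows)
reduced = selector.transform(rows)
print(selector.selected_)
```

Because `transform` returns data with fewer columns, whatever step follows in the pipeline only ever sees the selected features.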

Installation

The Zoish package is available on PyPI and can be installed with pip:

pip install zoish

Supported estimators

XGBRegressor XGBoost
XGBClassifier XGBoost
RandomForestClassifier
RandomForestRegressor
CatBoostClassifier
CatBoostRegressor
BalancedRandomForestClassifier
LGBMClassifier LightGBM
LGBMRegressor LightGBM
XGBSEKaplanNeighbors XGBoost 生存嵌入
XGBSEDebiasedBCE XGBoost 生存嵌入
XGBSEBootstrapEstimator XGBoost 生存嵌入

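Usage

Find the features with the highest SHAP values, using a specific tree-based model after hyperparameter optimization.
Plot a SHAP summary plot for the selected features.
Return a sorted two-column Pandas DataFrame listing the features and their SHAP values.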
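The ranking behind these steps can be pictured with a small sketch. It assumes that features are ordered by their mean absolute SHAP value, a common convention but not something this description confirms for Zoish. The function name `rank_features_by_shap` and the numbers are made up for illustration.

```python
def rank_features_by_shap(shap_values, feature_names, n_features):
    """Return the top (feature, mean |SHAP|) pairs, largest first."""
    n_rows = len(shap_values)
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_rows
        for j in range(len(feature_names))
    ]
    ranked = sorted(
        zip(feature_names, importance), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:n_features]


# Illustrative SHAP matrix: one row per sample, one column per feature.
shap_values = [
    [0.10, -0.40, 0.02],
    [-0.20, 0.30, 0.01],
    [0.15, -0.50, -0.03],
]
feature_names = ["age", "education-num", "fnlwgt"]
top = rank_features_by_shap(shap_values, feature_names, n_features=2)
for name, value in top:
    print(name, round(value, 3))
```

The list of pairs has the same shape as the two-column table mentioned above: a feature name next to its aggregated SHAP value.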

Example

Import the required libraries

from zoish.feature_selectors.optunashap import OptunaShapFeatureSelector
import xgboost
from optuna.pruners import HyperbandPruner
from optuna.samplers._tpe.sampler import TPESampler
from sklearn.model_selection import KFold,train_test_split
import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.imputation import (
    CategoricalImputer,
    MeanMedianImputer
    )
from category_encoders import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score)
import lightgbm
import matplotlib.pyplot as plt
import optuna

Adult data set (a classification problem)

urldata= "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# column names
col_names=["age", "workclass", "fnlwgt" , "education" ,"education-num",
"marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week",
"native-country","label"
]
# read data
data = pd.read_csv(urldata,header=None,names=col_names,sep=',')
data.head()

data.loc[data['label']=='<=50K','label']=0
data.loc[data['label']==' <=50K','label']=0

data.loc[data['label']=='>50K','label']=1
data.loc[data['label']==' >50K','label']=1

data['label']=data['label'].astype(int)
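The four assignments above cover the label both with and without a leading space. The same mapping can be written as "strip, then look up", sketched here in plain Python. `encode_income_label` is a hypothetical helper, not part of the example or of Zoish.

```python
def encode_income_label(raw_label):
    """Map an income label to 0 or 1, ignoring surrounding whitespace."""
    mapping = {"<=50K": 0, ">50K": 1}
    cleaned = raw_label.strip()
    if cleaned not in mapping:
        raise ValueError(f"unexpected label: {raw_label!r}")
    return mapping[cleaned]


print([encode_income_label(label) for label in [" <=50K", ">50K", " >50K"]])
```

Raising on unknown values makes a malformed row visible immediately instead of failing later at the integer conversion.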

Train test split

X = data.loc[:, data.columns != "label"]
y = data.loc[:, data.columns == "label"]

X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.33, stratify=y['label'], random_state=42)
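The `stratify` argument keeps the class proportions roughly equal in both parts. As a rough illustration of the idea, and not scikit-learn's actual algorithm, the sketch below splits indices class by class. `stratified_split` is a hypothetical helper, and the 75/25 label list is made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.33, seed=42):
    """Split indices so each class contributes about test_size to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 75 + [1] * 25
train_idx, test_idx = stratified_split(labels)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_positive_share, 2))
```

With an imbalanced target such as the income label, this keeps the minority class from being under-represented in the test set by chance.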

Find feature types for later use

int_cols =  X_train.select_dtypes(include=['int']).columns.tolist()
float_cols =  X_train.select_dtypes(include=['float']).columns.tolist()
cat_cols =  X_train.select_dtypes(include=['object']).columns.tolist()

Define the feature selector and set its parameters

optuna_classification_lgb = OptunaShapFeatureSelector(
        # general argument setting        
        verbose=1,
        random_state=0,
        logging_basicConfig = None,
        # feature selection arguments
        n_features=4,
        list_of_obligatory_features_that_must_be_in_model=[],
        list_of_features_to_drop_before_any_selection=[],
        # estimator and its tuning parameters
        estimator=lightgbm.LGBMClassifier(),
        estimator_params={
        "max_depth": [4, 9],
        "reg_alpha": [0, 1],

        },
        # shap arguments
        model_output="raw", 
        feature_perturbation="interventional", 
        algorithm="auto", 
        shap_n_jobs=-1, 
        memory_tolerance=-1, 
        feature_names=None, 
        approximate=False, 
        shortcut=False, 
        plot_shap_summary=False,
        save_shap_summary_plot=True,
        path_to_save_plot = './summary_plot.png',
        shap_fig = plt.figure(),
        ## optuna params
        test_size=0.33,
        with_stratified = False,
        performance_metric = 'f1',
        # optuna study init params
        study = optuna.create_study(
            storage = None,
            sampler = TPESampler(),
            pruner= HyperbandPruner(),
            study_name  = None,
            direction = "maximize",
            load_if_exists = False,
            directions  = None,
            ),
        study_optimize_objective_n_trials=10, 

)
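The `estimator_params` ranges and the trial count above together describe a search budget. As a rough illustration only, the sketch below runs a plain random search over the same two ranges, treating each pair as an inclusive integer [low, high] interval (an assumption about how the selector reads them). Optuna's TPE sampler is more sophisticated than this; `random_search` and `toy_validation_score` are hypothetical names, and the score is a made-up stand-in for a real validation metric.

```python
import random


def random_search(param_space, score_fn, n_trials=10, seed=0):
    """Sample parameter sets at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one integer per parameter from its inclusive [low, high] range.
        params = {
            name: rng.randint(low, high)
            for name, (low, high) in param_space.items()
        }
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    # Made-up objective that peaks at max_depth=6 and reg_alpha=0.
    return -abs(params["max_depth"] - 6) - params["reg_alpha"]


space = {"max_depth": (4, 9), "reg_alpha": (0, 1)}
best_params, best_score = random_search(space, toy_validation_score, n_trials=10)
print(best_params, best_score)
```

A larger trial count explores more of the space at the cost of more model fits, which is the trade-off the trial count controls.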

Build the sklearn pipeline



pipeline =Pipeline([
            # int missing values imputers
            ('intimputer', MeanMedianImputer(
                imputation_method='median', variables=int_cols)),
            # category missing values imputers
            ('catimputer', CategoricalImputer(variables=cat_cols)),
            #
            ('catencoder', OrdinalEncoder()),
            # feature selection
            ('optuna_classification_lgb', optuna_classification_lgb),
            # classification model
            ('logistic', LogisticRegression())


 ])


pipeline.fit(X_train,y_train)
y_pred = pipeline.predict(X_test)


print('F1 score : ')
print(f1_score(y_test,y_pred))
print('Classification report : ')
print(classification_report(y_test,y_pred))
print('Confusion matrix : ')
print(confusion_matrix(y_test,y_pred))
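For reference, the binary F1 score reported above is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on made-up predictions; `f1_from_predictions` is a hypothetical helper, not a scikit-learn function.

```python
def f1_from_predictions(y_true, y_pred, positive=1):
    """Compute the binary F1 score for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_from_predictions(y_true, y_pred)
print(round(score, 4))
```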

More examples are available in the project's examples.

License

Licensed under the BSD 2-Clause license.

Project details

Release history

1.62.0 (this version): September 7, 2022
1.61.0: August 28, 2022
1.60.0: August 28, 2022
1.59.0: August 16, 2022
1.58.0: August 10, 2022
1.57.0: July 27, 2022
1.56.0: July 27, 2022
1.55.0: July 19, 2022
1.54.0: July 19, 2022
1.52.0: July 19, 2022
1.51.0: July 9, 2022
1.30.0: July 8, 2022
1.24.0: July 8, 2022
0.1.3: June 26, 2022
0.1.0: June 24, 2022

Download files

Source distribution

zoish-1.62.0.tar.gz (342.3 kB), uploaded September 7, 2022, source

Built distribution

zoish-1.62.0-py3-none-any.whl (345.5 kB), uploaded September 7, 2022, py3

Hashes for zoish-1.62.0.tar.gz

Algorithm	Hash digest
SHA256	`b5cf02417df8a3627efde217627f1bb88c26933acfa67928bac344bad5138353`
MD5	`9469029622b0d60d920a477c83d9ad9d`
BLAKE2-256	`38bd23083af5585899d2a9f07d84a8e1599a7db02b1d07d75716fcf215f3c330`

Hashes for zoish-1.62.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`bcdfee641bde4e8cc60bd709d4e6c3ebe95cbca6d22d4897def6bf863686b658`
MD5	`243f3fb86ee92cfbc2bb3273b3f632d2`
BLAKE2-256	`038ef7903bac6d65d127fe70b3428c5b402562bdcebca682c29cb50ffbf6c1ef`
