
Quickly compose single-machine analysis pipelines.

Project description

CoRelAy – Composing Relevance Analyses



CoRelAy is a tool to compose small-scale (single-machine) analysis pipelines. Pipelines are designed with a number of steps (Tasks), each of which has a default operation (Processor). Any step of a pipeline can then be changed individually by assigning a new operator (Processor). Processors have parameters that define their operation.
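The idea can be illustrated with a minimal plain-Python sketch. This is not CoRelAy's actual implementation, only an analogy: a pipeline is a fixed sequence of named steps, each with a default operation that can be swapped out per instance.

```python
# Conceptual sketch of the pipeline/task/processor pattern.
# NOT CoRelAy's actual API, just an illustration of the idea:
# fixed steps (tasks) with default operations that can be replaced.

class Pipeline:
    # default operation for each step (task)
    defaults = {
        'preprocess': lambda data: data,
        'embed': lambda data: data,
        'cluster': lambda data: data,
    }

    def __init__(self, **overrides):
        # any step may be individually replaced by a new operator
        self.steps = {**self.defaults, **overrides}

    def __call__(self, data):
        for name in ('preprocess', 'embed', 'cluster'):
            data = self.steps[name](data)
        return data


# replace only the 'preprocess' step; the other steps keep their defaults
pipeline = Pipeline(preprocess=lambda data: [x * 2 for x in data])
print(pipeline([1, 2, 3]))  # [2, 4, 6]
```

In CoRelAy itself, the steps are `Task` attributes of a `Pipeline` subclass and the operations are `Processor` instances, as the full example below shows.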

CoRelAy was created to quickly implement pipelines that produce analysis data, which can then be visualized using ViRelAy.

If you find CoRelAy useful for your research, why not cite our related paper:

@article{anders2021software,
      author  = {Anders, Christopher J. and
                 Neumann, David and
                 Samek, Wojciech and
                 Müller, Klaus-Robert and
                 Lapuschkin, Sebastian},
      title   = {Software for Dataset-wide XAI: From Local Explanations to Global Insights with {Zennit}, {CoRelAy}, and {ViRelAy}},
      journal = {CoRR},
      volume  = {abs/2106.13200},
      year    = {2021},
}

Documentation

The latest documentation is hosted at corelay.readthedocs.io.

Installation

CoRelAy can be installed using pip with

$ pip install corelay

To install with optional HDBSCAN and UMAP support, use

$ pip install corelay[umap,hdbscan]

Usage

Examples highlighting some of the features of CoRelAy can be found in example/.

We mainly use HDF5 files to store results. The structure expected by ViRelAy is documented in the ViRelAy repository under docs/database_specification.md. An example that creates HDF5 files which can be used with ViRelAy is shown in example/hdf5_structure.py.
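As a minimal illustration of working with such files, the sketch below writes and then inspects an HDF5 file with h5py. The file and dataset names here are placeholders, not the actual layout defined in docs/database_specification.md.

```python
# Sketch: write a dataset into an HDF5 file and list its contents.
# 'demo.analysis.h5' and 'embedding/spectral' are placeholder names,
# not the structure required by ViRelAy.
import h5py
import numpy as np

with h5py.File('demo.analysis.h5', 'w') as fd:
    # intermediate groups ('embedding') are created implicitly
    fd.create_dataset('embedding/spectral', data=np.random.rand(8, 2))

with h5py.File('demo.analysis.h5', 'r') as fd:
    # walk the file and print every group and dataset
    fd.visititems(lambda name, obj: print(name, obj))
```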

A more advanced script to perform a full SpRAy analysis, the result of which can be visualized with ViRelAy, can be found in example/virelay_analysis.py.

The following shows the contents of example/memoize_spectral_pipeline.py:

'''Example using memoization to store (intermediate) results.'''
import time

import h5py
import numpy as np

from corelay.base import Param
from corelay.processor.base import Processor
from corelay.processor.flow import Sequential, Parallel
from corelay.pipeline.spectral import SpectralClustering
from corelay.processor.clustering import KMeans
from corelay.processor.embedding import TSNEEmbedding, EigenDecomposition
from corelay.io.storage import HashedHDF5


# custom processors can be implemented by defining a function attribute
class Flatten(Processor):
    def function(self, data):
        return data.reshape(data.shape[0], np.prod(data.shape[1:]))


class SumChannel(Processor):
    # parameters can be assigned by defining a class-owned Param instance
    axis = Param(int, 1)
    def function(self, data):
        return data.sum(self.axis)


class Normalize(Processor):
    def function(self, data):
        data = data / data.sum((1, 2), keepdims=True)
        return data


def main():
    np.random.seed(0xDEADBEEF)
    fpath = 'test.analysis.h5'
    with h5py.File(fpath, 'a') as fd:
        # HashedHDF5 is an io-object that stores outputs of Processors based on hashes in hdf5
        iobj = HashedHDF5(fd.require_group('proc_data'))

        # generate some exemplary data
        data = np.random.normal(size=(64, 3, 32, 32))
        n_clusters = range(2, 20)

        # SpectralClustering is an Example for a pre-defined Pipeline
        pipeline = SpectralClustering(
            # processors, such as EigenDecomposition, can be assigned to pre-defined tasks
            embedding=EigenDecomposition(n_eigval=8, io=iobj),
            # flow-based Processors, such as Parallel, can combine multiple Processors
            # broadcast=True copies the input as many times as there are Processors
            # broadcast=False instead attempts to match each input to a Processor
            clustering=Parallel([
                Parallel([
                    KMeans(n_clusters=k, io=iobj) for k in n_clusters
                ], broadcast=True),
                # io-objects will be used during computation when supplied to Processors
                # if a corresponding output value (here identified by hashes) already exists,
                # the value is not computed again but instead loaded from the io object
                TSNEEmbedding(io=iobj)
            ], broadcast=True, is_output=True)
        )
        # Processors (and Params) can be updated by simply assigning corresponding attributes
        pipeline.preprocessing = Sequential([
            SumChannel(),
            Normalize(),
            Flatten()
        ])

        start_time = time.perf_counter()

        # Processors flagged with "is_output=True" will be accumulated in the output
        # the output will be a tree of tuples, with the same hierarchy as the pipeline
        # (i.e. clusterings here contains a tuple of the k-means outputs)
        clusterings, tsne = pipeline(data)

        # since we memoize our results in a hdf5 file, subsequent calls will not compute
        # the values (for the same inputs), but rather load them from the hdf5 file
        # try running the script multiple times
        duration = time.perf_counter() - start_time
        print(f'Pipeline execution time: {duration:.4f} seconds')


if __name__ == '__main__':
    main()
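The hash-based memoization performed by HashedHDF5 can be illustrated with a plain-Python sketch. Here an in-memory dict stands in for the HDF5 group; this is not CoRelAy's actual implementation, only the underlying idea: results are keyed by a hash of the operation and its input, so a repeated call with the same input is loaded instead of recomputed.

```python
# Illustration of hash-based memoization (the idea behind HashedHDF5).
# An in-memory dict replaces the HDF5 group used by CoRelAy.
import hashlib
import pickle


class HashedStore:
    def __init__(self):
        self.data = {}
        self.computed = 0  # counts actual computations (not cache hits)

    def memoize(self, func):
        def wrapper(*args):
            # key results by a hash of the function name and its inputs
            key = hashlib.sha256(
                pickle.dumps((func.__name__, args))
            ).hexdigest()
            if key not in self.data:
                self.computed += 1
                self.data[key] = func(*args)
            return self.data[key]
        return wrapper


store = HashedStore()

@store.memoize
def expensive(x):
    return x ** 2

print(expensive(12), expensive(12))  # 144 144 -- second call is a cache hit
print(store.computed)  # 1
```

In the example above, this is why running the script a second time is much faster: the outputs identified by matching hashes are read back from test.analysis.h5 instead of being recomputed.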
