针对离散和有界数据优化的离散 DBSCAN 算法。
项目描述
这是针对离散、有界数据优化的DBSCAN聚类算法版本,我们称之为离散 DBSCAN (DDBSCAN) 的原因。当前实现的基础来自此来源。算法代码在文件ddbscan/ddbscan.py中,很容易阅读。主要算法本身在方法compute()中,可以通过上面的链接或阅读描述它的论文来理解。
这个实现的另一个特点是它是为在线学习而设计的。因此,当我们向 DDBSCAN 对象添加点时,我们必须每次将一个点传递给方法add_point。请参阅下面的用法。
离散和有界数据的优化
我们对上述链接中描述的普通算法的主要优化是基于这样一个事实,即对于离散和有界数据,我们希望看到同一点出现多次,因此我们可以跟踪该点出现的次数并优化我们的算法来使用该信息。
为了加快新点的插入和簇的计算,每个 DDBSCAN 对象为每个点保留其邻居的索引和邻居大小(邻居点的计数总和)。因此,当我们插入一个新点时,我们会查看它是否是一个已经存在的对,然后只增加它的计数器和它的邻居的邻域大小。我们用这些点重新计算一个 KDTree,以防插入新的对,更新其邻居的点数据。
参数
DBSCAN 模型有两个参数:
min_pts:创建集群的点的最小邻居数量。
eps : 寻找邻居的半径。
通过调整这两个参数,我们实际上是在设置异常(离群值)检测灵敏度。min_pts的较大值意味着要将新模式识别为集群,而不是异常,我们必须看到具有该模式的大量点。eps的值越大,意味着更大的集群更容易形成,因此在给定这个大的eps的情况下,可以将不太密集区域中的点识别为集群成员。鉴于调整这些参数的重要性,我们有一个方法来设置它们,称为 set_params(),它会相应地更新模型的内部状态。
安装
要安装软件包,最简单的方法是使用 pip:
$ pip install ddbscan
另一种选择是克隆这个 repo 并运行
$ python setup.py install
要运行测试:
$ python setup.py test
用法
一个典型的例子如下:
import ddbscan
# Create a DDBSCAN model with eps = 2 and min_pts = 5
scan = ddbscan.DDBSCAN(2, 5)
# Add points to model
data = [[1, 2], [2, 2], [1, 3], [2, 3], [3, 3], [8, 9],
[7, 6], [9, 7], [6, 9], [6, 8], [5, 5], [7, 8]]
for point in data:
scan.add_point(point=point, count=1, desc="")
# Compute clusters
scan.compute()
print 'Clusters found and its members points index:'
cluster_number = 0
for cluster in scan.clusters:
print '=== Cluster %d ===' % cluster_number
print 'Cluster points index: %s' % list(cluster)
cluster_number += 1
print '\nCluster assigned to each point:'
for i in xrange(len(scan.points)):
print '=== Point: %s ===' % scan.points[i]
print 'Cluster: %2d' % scan.points_data[i].cluster,
# If a point cluster is -1, it's an anomaly
if scan.points_data[i].cluster == -1:
print '\t <== Anomaly found!'
else:
print
执照
The MIT License (MIT) Copyright (c) 2014 CloudWalk, Inc. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.