使用精确的安全模式推断收缩 Pandas 数据帧。
项目描述
熊猫垂头丧气
使用精确的安全模式推断
收缩Pandas数据帧。pandas-downcast
找到每列的最小可行类型,确保结果值在原始值的容差范围内。
安装
pip install pandas-downcast
依赖项
- 蟒蛇> = 3.6
- 熊猫
- 麻木的
执照
用法
import pdcast as pdc
import numpy as np
import pandas as pd
data = {
"integers": np.linspace(1, 100, 100),
"floats": np.linspace(1, 1000, 100).round(2),
"booleans": np.random.choice([1, 0], 100),
"categories": np.random.choice(["foo", "bar", "baz"], 100),
}
df = pd.DataFrame(data)
# Downcast DataFrame to minimum viable schema.
df_downcast = pdc.downcast(df)
# Infer minimum schema for DataFrame.
schema = pdc.infer_schema(df)
# Coerce DataFrame to schema - required if converting float to Pandas Integer.
df_new = pdc.coerce_df(df, schema)
更小的数据类型 -> 更小的内存占用。
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 integers 100 non-null float64
# 1 floats 100 non-null float64
# 2 booleans 100 non-null int64
# 3 categories 100 non-null object
# dtypes: float64(2), int64(1), object(1)
# memory usage: 3.2+ KB
df_downcast.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 integers 100 non-null uint8
# 1 floats 100 non-null float32
# 2 booleans 100 non-null bool
# 3 categories 100 non-null category
# dtypes: bool(1), category(1), float32(1), uint8(1)
# memory usage: 932.0 bytes
如果结果值在原始值的容差范围内,则数值数据类型将被向下转换。有关数值比较容差的详细信息,请参阅 上的注释np.allclose
。
print(df.head())
# integers floats booleans categories
# 0 1.0 1.00 1 foo
# 1 2.0 11.09 0 baz
# 2 3.0 21.18 1 bar
# 3 4.0 31.27 0 bar
# 4 5.0 41.36 0 foo
print(df_downcast.head())
# integers floats booleans categories
# 0 1 1.000000 True foo
# 1 2 11.090000 False baz
# 2 3 21.180000 True bar
# 3 4 31.270000 False bar
# 4 5 41.360001 False foo
print(pdc.options.ATOL)
# >>> 1e-08
print(pdc.options.RTOL)
# >>> 1e-05
容差可以在模块级别设置或传入函数参数。
pdc.options.ATOL = 1e-10
pdc.options.RTOL = 1e-10
df_downcast_new = pdc.downcast(df)
或者
infer_dtype_kws = {
"ATOL": 1e-10,
"RTOL": 1e-10
}
df_downcast_new = pdc.downcast(df, infer_dtype_kws=infer_dtype_kws)
现在保持该floats
列float64
以满足公差要求。列中的值integers
仍然安全地转换为uint8
.
df_downcast_new.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 100 entries, 0 to 99
# Data columns (total 4 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 integers 100 non-null uint8
# 1 floats 100 non-null float64
# 2 booleans 100 non-null bool
# 3 categories 100 non-null category
# dtypes: bool(1), category(1), float64(1), uint8(1)
# memory usage: 1.3 KB
推断模式只能限制为 Numpy 数据类型。
# Downcast DataFrame to minimum viable Numpy schema.
df_downcast = pdc.downcast(df, numpy_dtypes_only=True)
# Infer minimum Numpy schema for DataFrame.
schema = pdc.infer_schema(df, numpy_dtypes_only=True)
例子
以下示例显示了向下转换数据通常如何导致大小减少超过 70%,具体取决于原始类型。
import pdcast as pdc
import pandas as pd
import seaborn as sns
df_dict = {df: sns.load_dataset(df) for df in sns.get_dataset_names()}
results = []
for name, df in df_dict.items():
size_pre = df.memory_usage(deep=True).sum()
df_post = pdc.downcast(df)
size_post = df_post.memory_usage(deep=True).sum()
shrinkage = int((1 - (size_post / size_pre)) * 100)
results.append(
{"dataset": name, "size_pre": size_pre, "size_post": size_post, "shrink_pct": shrinkage}
)
results_df = pd.DataFrame(results).sort_values("shrink_pct", ascending=False).reset_index(drop=True)
print(results_df)
dataset size_pre size_post shrink_pct
0 fmri 213232 14776 93
1 titanic 321240 28162 91
2 attention 5888 696 88
3 penguins 75711 9131 87
4 dots 122240 17488 85
5 geyser 21172 3051 85
6 gammas 500128 108386 78
7 anagrams 2048 456 77
8 planets 112663 30168 73
9 anscombe 3428 964 71
10 iris 14728 5354 63
11 exercise 3302 1412 57
12 flights 3616 1888 47
13 mpg 75756 43842 42
14 tips 7969 6261 21
15 diamonds 3184588 2860948 10
16 brain_networks 4330642 4330642 0
17 car_crashes 5993 5993 0
项目详情
关
pandas_downcast -1.2.1-py3-none-any.whl 的哈希值
算法 | 哈希摘要 | |
---|---|---|
SHA256 | 7992ef18b48889bd6044f833b477bb1f8a11f4e8c38a77018f7c8430380a01d5 |
|
MD5 | 28e9083902856d7a1932467b64b21218 |
|
布莱克2-256 | 8595c17db9c7b85fc7e67e700499e50af0bd29c8bb4f2e753562fadb51eaeb7d |