如何找到真实数据的概率分布和参数？(Python 3)

Question

如何找到真实数据的概率分布和参数？(Python 3)

O.r*_*rka 22 python statistics distribution machine-learning data-fitting

我有一个数据集sklearn,我绘制了load_diabetes.target数据的分布(即load_diabetes.data用于预测的回归值).

我使用它是因为它具有最少数量的回归变量/属性sklearn.datasets.

使用Python 3,我如何获得最接近类似的分布类型和分布参数？

我所知道的target价值都是积极的和倾斜的(假定倾斜/右倾斜)...Python中是否有一种方法可以提供一些分布,然后最适合target数据/向量？或者,根据给出的数据实际建议拟合？对于那些具有理论统计知识但很少将其应用于"真实数据"的人来说,这将是非常有用的.

奖金使用这种方法来确定你的后验分布对"真实数据"的影响是否合理？如果不是,为什么不呢？

from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd

#Get Data
data = load_diabetes()
X, y_ = data.data, data.target

#Organize Data
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")

#Plot Data
fig, ax = plt.subplots()
sns.distplot(SR_y, bins=25, color="g", ax=ax)
plt.show()

Run Code Online (Sandbox Code Playgroud)

Answer 1

小智 19

据我所知,没有自动获取样本的分布类型和参数的方法(因为推断样本的分布本身就是一个统计问题).

在我看来,你能做的最好的事情是:

(对于每个属性)

尝试将每个属性拟合到一个相当大的可能分布列表中(例如,参见使用Scipy(Python)将经验分布拟合到理论分布？对于Scipy的示例)
评估所有适合并选择最好的.这可以通过在样本和拟合的每个分布之间执行Kolmogorov-Smirnov检验来完成(您再次在Scipy中实现),并选择最小化D的那个,即测试统计量(也就是测试统计量之间的差异)样品和适合).

奖励:这是有道理的 - 因为你会在为每个变量选择一个变量时为每个变量建立一个模型 - 尽管你的预测的好处取决于你的数据的质量和你用于的分布配件.毕竟,你正在建立一个模型.

本文概述了如何使用 Fitter 库自动识别哪个 scipy 发行版最适合数据 https://towardsdatascience.com/finding-the-best-distribution-that-fits-your-data-using-pythons-fitter-库-319a5a0972e9 (2认同)

Answer 2

Ale*_*eau 12

您可以使用该代码(根据最大可能性)使用您的数据拟合不同的分布:

import matplotlib.pyplot as plt
import scipy
import scipy.stats

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    # here's the parameters of your distribution, scale, location

Run Code Online (Sandbox Code Playgroud)

您可以看到有关如何使用此处获得的参数的示例代码段:使用Scipy(Python)将经验分布拟合到理论分布？

然后,您可以选择具有最佳对数似然的分布(还有其他标准匹配"最佳"分布,例如贝叶斯后验概率,AIC,BIC或BICc值,......).

对于你的奖金问题,我认为没有通用答案.如果您的数据集很重要并且在与真实数据数据相同的条件下获得,则可以执行此操作.

Answer 3

Pas*_*age 12

使用这种方法

import scipy.stats as st
def get_best_distribution(data):
    dist_names = ["norm", "exponweib", "weibull_max", "weibull_min", "pareto", "genextreme"]
    dist_results = []
    params = {}
    for dist_name in dist_names:
        dist = getattr(st, dist_name)
        param = dist.fit(data)

        params[dist_name] = param
        # Applying the Kolmogorov-Smirnov test
        D, p = st.kstest(data, dist_name, args=param)
        print("p value for "+dist_name+" = "+str(p))
        dist_results.append((dist_name, p))

    # select the best fitted distribution
    best_dist, best_p = (max(dist_results, key=lambda item: item[1]))
    # store the name of the best fit and its p value

    print("Best fitting distribution: "+str(best_dist))
    print("Best p value: "+ str(best_p))
    print("Parameters for the best fit: "+ str(params[best_dist]))

    return best_dist, best_p, params[best_dist]

Run Code Online (Sandbox Code Playgroud)

在这种方法中，您正在寻找最好的最大 P。不应该选择min(p)吗？ (3认同)
它是上述答案中代码的完整版本。他们为所有可能适合数据的分布创建了一个项目列表。然后他们使用 p 分数创建一个假设，以确定该分布与数据的匹配程度。具有最高 p-score 的被认为是最准确的。这是因为较高的 p-score 意味着假设最接近现实。 (2认同)

Answer 4

小智 7

fitter提供了可能的拟合分布的迭代过程。
它还输出带有统计值的绘图和汇总表。

fitter包提供了一个简单的类来识别生成数据样本的分布。它使用 80 个分布，Scipy并允许您绘制结果以检查最可能的分布和最佳参数。

因此，迭代拟合测试过程与其他答案中描述的基本相同，但由模块方便地执行。

您的SR_y系列结果：

代码：

from sklearn.datasets import load_diabetes
from fitter import Fitter, get_common_distributions

#Get Data - from question
data = load_diabetes()
X, y_ = data.data, data.target
#Organize Data - from question
SR_y = pd.Series(y_, name="y_ (Target Vector Distribution)")


# fitter
distributions_set = get_common_distributions()
distributions_set.extend(['arcsine', 'cosine', 'expon', 'weibull_max', 'weibull_min', 
                          'dweibull', 't', 'pareto', 'exponnorm', 'lognorm',
                         "norm", "exponweib", "weibull_max", "weibull_min", "pareto", "genextreme"])  

f = Fitter(SR_y, distributions = distributions_set) 
f.fit()
f.summary()

Run Code Online (Sandbox Code Playgroud)

这些拟合分布的参数为dict：

f.fitted_param

Run Code Online (Sandbox Code Playgroud)

{'expon': (25.0, 127.13348416289594),
 'cauchy': (132.95536663886972, 52.62243313109789),
 'gamma': (2.496376511103246, 20.737715299081657, 52.63462302106953),
 'norm': (152.13348416289594, 77.00574586945044),
 'chi2': (4.9927545799818525, 20.737731375230684, 26.317289176495912),
 'rayleigh': (14.700761411215545, 111.3948791009951),
 'uniform': (25.0, 321.0),
 'powerlaw': (1.0864390359784966, -6.82376066691087, 352.82376073752073),
 'cosine': (159.01669793410446, 65.6033963343604),
 'arcsine': (-6.99037533558757, 352.9903753355876),
 'exponpow': (0.15440493125261756, 24.999999999999996, 16.00571403929016),
 'weibull_max': (0.168196678837625, 346.0000000000001, 1.6686318895897978),
 'weibull_min': (0.2750237375428041, 24.999999999999996, 6.998090013988461),
 'dweibull': (1.6343449438402855, 157.0247145542748, 73.64165822064473),
 'pareto': (0.6022461735477798, -0.06169932009129858, 25.06169863339018),
 'exponnorm': (6.298770105099791, 53.6065309642624, 15.642251691931591),
 't': (127967.50529392948, 152.12481045573628, 76.98521783304597),
 'exponweib': (0.9662752277542657, 1.6900600238468133, 24.142487003378918, 150.25955880342326),
 'lognorm': (0.44469088248930166, -29.00650970868123, 164.71283014005542),
 'genextreme': (0.029317901766728702, 116.52312667345038, 63.454691756821106)}

Run Code Online (Sandbox Code Playgroud)

要获取所有可用发行版的列表：

from fitter import get_distributions

get_distributions()

Run Code Online (Sandbox Code Playgroud)

对所有这些内容进行测试需要很长时间，因此最好使用已实现的内容get_common_distributions()，并可能通过可能的分布来扩展它们，如上面的代码中所做的那样。

确保安装了当前fitter版本（1.4.1 或更高版本）：

import fitter

print(fitter.version)

Run Code Online (Sandbox Code Playgroud)

我在前一个日志中出现了日志错误，对于我的 conda 环境，我需要：

conda install -c bioconda fitter

Run Code Online (Sandbox Code Playgroud)

归档时间：	9 年，7 月前
查看次数：	24293 次
最近记录：	7 年前