Python chi-square goodness-of-fit test to find the best distribution

Question

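Given a set of data values, I am trying to find the theoretical distribution that best describes the data. After several days of research, I came up with the following Python code.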

import numpy as np
import csv
import pandas as pd
import scipy.stats as st
import math
import sys
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def fit_to_all_distributions(data):
    dist_names = ['fatiguelife', 'invgauss', 'johnsonsu', 'johnsonsb', 'lognorm', 'norminvgauss', 'powerlognorm', 'exponweib','genextreme', 'pareto']

    params = {}
    for dist_name in dist_names:
        try:
            dist = getattr(st, dist_name)
            param = dist.fit(data)

            params[dist_name] = param
        except Exception:
            print("Error occurred in fitting")
            params[dist_name] = "Error"

    return params 


def get_best_distribution_using_chisquared_test(data, params):

    # `normed` was deprecated and later removed from np.histogram; raw counts are the default
    histo, bin_edges = np.histogram(data, bins='auto')
    number_of_bins = len(bin_edges) - 1
    observed_values = histo

    dist_names = ['fatiguelife', 'invgauss', 'johnsonsu', 'johnsonsb', 'lognorm', 'norminvgauss', 'powerlognorm', 'exponweib','genextreme', 'pareto']

    dist_results = []

    for dist_name in dist_names:

        param = params[dist_name]
        if (param != "Error"):
            # Applying the chi-squared test
            arg = param[:-2]
            loc = param[-2]
            scale = param[-1]
            cdf = getattr(st, dist_name).cdf(bin_edges, loc=loc, scale=scale, *arg)
            expected_values = len(data) * np.diff(cdf)
            c , p = st.chisquare(observed_values, expected_values, ddof=number_of_bins-len(param))
            dist_results.append([dist_name, c, p])


    # select the best fitted distribution
    best_dist, best_c, best_p = None, sys.maxsize, 0

    for item in dist_results:
        name = item[0]
        c = item[1]
        p = item[2]
        if (not math.isnan(c)):
            if (c < best_c):
                best_c = c
                best_dist = name
                best_p = p

    # print the name of the best fit and its p value

    print("Best fitting distribution: " + str(best_dist))
    print("Best c value: " + str(best_c))
    print("Best p value: " + str(best_p))
    print("Parameters for the best fit: " + str(params[best_dist]))

    return best_dist, best_c, params[best_dist], dist_results

Then I tested the code:

a, m = 3., 2.
values = (np.random.pareto(a, 1000) + 1) * m
data = pd.Series(values)
params = fit_to_all_distributions(data)
best_dist_chi, best_chi, params_chi, dist_results_chi = get_best_distribution_using_chisquared_test(values, params)

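Since the data points are generated from a Pareto distribution, the test should return pareto as the best-fitting distribution with a sufficiently large p-value (p > 0.05).

But this is the output I get.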

Best fitting distribution: genextreme
Best c value: 106.46087793622216
Best p value: 7.626303538461713e-24
Parameters for the best fit: (-0.7664124294696955, 2.3217378846757164, 0.3711562696710188)

Is there anything wrong with my implementation of the chi-square goodness-of-fit test?

Answer 1

Pas*_*age 4

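The documentation for the Python chi-square goodness-of-fit test (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) says of ddof: "'Delta degrees of freedom': adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. The default value of ddof is 0."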
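To make the effect of ddof concrete, here is a minimal sketch with made-up bin counts (the numbers and the name `n_fitted` are illustrative assumptions, not taken from the question). It checks that the p-value returned by `st.chisquare` matches the one computed by hand from a chi-squared distribution with k - 1 - ddof degrees of freedom.

```python
import numpy as np
import scipy.stats as st

# Hypothetical counts over k = 6 bins; both arrays sum to 120.
observed = np.array([18, 22, 25, 20, 17, 18])
expected = np.full(6, 20.0)

# Assumed number of distribution parameters estimated from the data.
n_fitted = 2

# ddof is the number of degrees of freedom to subtract, not the number to keep.
stat, p = st.chisquare(observed, expected, ddof=n_fitted)

# Same p-value computed directly from the chi-squared survival function.
k = len(observed)
dof = k - 1 - n_fitted
p_manual = st.chi2.sf(stat, dof)

print(stat, dof, p, p_manual)
```

Passing `ddof=number_of_bins - len(param)`, as in the question, would instead leave only `len(param) - 1` degrees of freedom, which makes the p-value far too small.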

So your code should be corrected as follows.

c, p = st.chisquare(observed_values, expected_values, ddof=len(param))
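For context, the following is a rough end-to-end sketch of the corrected call on Pareto data like the question's. Two choices here are my own assumptions rather than part of the answer: the bins are taken from sample quantiles (so each bin holds a similar number of observations instead of the sparse tail bins that `bins='auto'` can produce), and the outer bins are treated as open-ended so the expected counts sum to the sample size, which recent SciPy versions require of `chisquare`. The helper name `chi_square_gof` and the choice of 20 bins are hypothetical.

```python
import numpy as np
import scipy.stats as st

def chi_square_gof(data, dist, n_bins=20):
    """Chi-square goodness-of-fit of `dist` (a scipy.stats distribution) to `data`."""
    data = np.asarray(data)
    params = dist.fit(data)  # (shape params..., loc, scale)

    # Assumption: quantile-based edges, so observed counts are roughly equal per bin.
    edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
    observed, _ = np.histogram(data, bins=edges)

    # Treat the outer bins as open-ended so the bin probabilities sum to 1.
    cdf = dist.cdf(edges, *params)
    cdf[0], cdf[-1] = 0.0, 1.0
    expected = len(data) * np.diff(cdf)

    # ddof = number of fitted parameters, as in the answer's correction.
    stat, p = st.chisquare(observed, expected, ddof=len(params))
    dof = n_bins - 1 - len(params)
    return stat, p, dof

# Same kind of data as in the question, with a fixed seed for repeatability.
rng = np.random.default_rng(0)
values = (rng.pareto(3.0, 1000) + 1) * 2.0

stat, p, dof = chi_square_gof(values, st.pareto)
print("chi2 =", stat, "dof =", dof, "p =", p)
```

The same helper can be called with any of the other candidates, for example `chi_square_gof(values, st.genextreme)`, to compare p-values on equal footing.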
