使用Scipy（Python）使经验分布适合理论分布吗？

简介：我列出了30,000多个整数值，范围从0到47（含0和47），例如[0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...]从一些连续分布中采样。列表中的值不一定按顺序排列，但是对于这个问题而言，顺序并不重要。

问题：根据我的分布，我想计算p值（看到更大值的概率））的任何给定值。例如，您可以看到0的p值接近1，更高数字的p值趋向于0。

我不知道我是否正确，但是要确定概率我认为我需要使我的数据适合最适合描述我的数据的理论分布。我假设需要某种拟合优度检验来确定最佳模型。

是否可以在Python中实现这种分析（Scipy或Numpy）？
您能提出一下吗？有任何例子吗？

谢谢！

您只有离散的经验值，但要连续分布吗？我理解正确吗？

似乎荒谬。这些数字代表什么？测量精度有限吗？

迈克尔，我解释了上一个问题中的数字代表什么：stackoverflow.com/questions/6615489/…

这就是计数数据。它不是连续分布。

检查此问题的可接受答案stackoverflow.com/questions/48455018/…

#1 楼

具有平方误差和（SSE）的分布拟合

这是Saullo答案的更新和修改，它使用当前scipy.stats分布的完整列表，并返回分布的直方图之间SSE最小的分布

示例拟合

使用来自statsmodels的ElNiño数据集，拟合分布并确定误差。返回错误最少的分布。

所有发行版

最佳拟合发行版

示例代码

%matplotlib inline

import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
    """Model data by finding best fit distribution to data"""
    # Get histogram of original data
    y, x = np.histogram(data, bins=bins, density=True)
    x = (x + np.roll(x, -1))[:-1] / 2.0

    # Distributions to check
    DISTRIBUTIONS = [        
        st.alpha,st.anglit,st.arcsine,st.beta,st.betaprime,st.bradford,st.burr,st.cauchy,st.chi,st.chi2,st.cosine,
        st.dgamma,st.dweibull,st.erlang,st.expon,st.exponnorm,st.exponweib,st.exponpow,st.f,st.fatiguelife,st.fisk,
        st.foldcauchy,st.foldnorm,st.frechet_r,st.frechet_l,st.genlogistic,st.genpareto,st.gennorm,st.genexpon,
        st.genextreme,st.gausshyper,st.gamma,st.gengamma,st.genhalflogistic,st.gilbrat,st.gompertz,st.gumbel_r,
        st.gumbel_l,st.halfcauchy,st.halflogistic,st.halfnorm,st.halfgennorm,st.hypsecant,st.invgamma,st.invgauss,
        st.invweibull,st.johnsonsb,st.johnsonsu,st.ksone,st.kstwobign,st.laplace,st.levy,st.levy_l,st.levy_stable,
        st.logistic,st.loggamma,st.loglaplace,st.lognorm,st.lomax,st.maxwell,st.mielke,st.nakagami,st.ncx2,st.ncf,
        st.nct,st.norm,st.pareto,st.pearson3,st.powerlaw,st.powerlognorm,st.powernorm,st.rdist,st.reciprocal,
        st.rayleigh,st.rice,st.recipinvgauss,st.semicircular,st.t,st.triang,st.truncexpon,st.truncnorm,st.tukeylambda,
        st.uniform,st.vonmises,st.vonmises_line,st.wald,st.weibull_min,st.weibull_max,st.wrapcauchy
    ]

    # Best holders
    best_distribution = st.norm
    best_params = (0.0, 1.0)
    best_sse = np.inf

    # Estimate distribution parameters from data
    for distribution in DISTRIBUTIONS:

        # Try to fit the distribution
        try:
            # Ignore warnings from data that can't be fit
            with warnings.catch_warnings():
                warnings.filterwarnings('ignore')

                # fit dist to data
                params = distribution.fit(data)

                # Separate parts of parameters
                arg = params[:-2]
                loc = params[-2]
                scale = params[-1]

                # Calculate fitted PDF and error with fit in distribution
                pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
                sse = np.sum(np.power(y - pdf, 2.0))

                # if axis pass in add to plot
                try:
                    if ax:
                        pd.Series(pdf, x).plot(ax=ax)
                    end
                except Exception:
                    pass

                # identify if this distribution is better
                if best_sse > sse > 0:
                    best_distribution = distribution
                    best_params = params
                    best_sse = sse

        except Exception:
            pass

    return (best_distribution.name, best_params)

def make_pdf(dist, params, size=10000):
    """Generate distributions's Probability Distribution Function """

    # Separate parts of parameters
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]

    # Get sane start and end points of distribution
    start = dist.ppf(0.01, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.01, loc=loc, scale=scale)
    end = dist.ppf(0.99, *arg, loc=loc, scale=scale) if arg else dist.ppf(0.99, loc=loc, scale=scale)

    # Build PDF and turn into pandas Series
    x = np.linspace(start, end, size)
    y = dist.pdf(x, loc=loc, scale=scale, *arg)
    pdf = pd.Series(y, x)

    return pdf

# Load data from statsmodels datasets
data = pd.Series(sm.datasets.elnino.load_pandas().data.set_index('YEAR').values.ravel())

# Plot for comparison
plt.figure(figsize=(12,8))
ax = data.plot(kind='hist', bins=50, normed=True, alpha=0.5, color=plt.rcParams['axes.color_cycle'][1])
# Save plot limits
dataYLim = ax.get_ylim()

# Find best fit distribution
best_fit_name, best_fit_params = best_fit_distribution(data, 200, ax)
best_dist = getattr(st, best_fit_name)

# Update plots
ax.set_ylim(dataYLim)
ax.set_title(u'El Niño sea temp.\n All Fitted Distributions')
ax.set_xlabel(u'Temp (°C)')
ax.set_ylabel('Frequency')

# Make PDF with best params 
pdf = make_pdf(best_dist, best_fit_params)

# Display
plt.figure(figsize=(12,8))
ax = pdf.plot(lw=2, label='PDF', legend=True)
data.plot(kind='hist', bins=50, normed=True, alpha=0.5, label='Data', legend=True, ax=ax)

param_names = (best_dist.shapes + ', loc, scale').split(', ') if best_dist.shapes else ['loc', 'scale']
param_str = ', '.join(['{}={:0.2f}'.format(k,v) for k,v in zip(param_names, best_fit_params)])
dist_str = '{}({})'.format(best_fit_name, param_str)

ax.set_title(u'El Niño sea temp. with best fit distribution \n' + dist_str)
ax.set_xlabel(u'Temp. (°C)')
ax.set_ylabel('Frequency')

太棒了考虑在np.histogram（）中使用density = True而不是normed = True。 ^^

– Peque
17年4月28日在16:13

@tmthydvnprt也许您可以撤消.plot（）方法中的更改，以避免将来造成混乱。 ^^

– Peque
17年6月30日在8:45

要获取分发名称：从scipy.stats._continuous_distns导入_distn_names。然后，您可以对_distn_names中的每个distname使用类似getattr（scipy.stats，distname）的名称。有用，因为发行版使用不同的SciPy版本进行了更新。

–布拉德·所罗门
17年8月29日在19:49

您能否解释一下为什么此代码仅检查连续分布的最佳拟合，而不能检查离散或多元分布。谢谢。

–亚当·施罗德（Adam Schroeder）
18 Mar 16 '18 at 20:56

很酷。我必须更新颜色参数-ax = data.plot（kind ='hist'，bins = 50，normed = True，alpha = 0.5，color = list（matplotlib.rcParams ['axes.prop_cycle']）[1] ['颜色']）

–basswaves
18年5月10日在19:43

#2 楼

SciPy 0.12.0中有82个已实现的分发功能。您可以使用其fit()方法测试其中的一些如何适合您的数据。请检查以下代码以获取更多详细信息：

import matplotlib.pyplot as plt
import scipy
import scipy.stats
size = 30000
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5,size=size)*47))
h = plt.hist(y, bins=range(48))

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']

for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
    plt.xlim(0,47)
plt.legend(loc='upper right')
plt.show()

参考文献：

-拟合分布，良好拟合，p值。

-使用Scipy进行分布拟合

，这里列出了Scipy 0.12.0中可用的所有分布函数的名称（VI）：

dist_names = [ 'alpha', 'anglit', 'arcsine', 'beta', 'betaprime', 'bradford', 'burr', 'cauchy', 'chi', 'chi2', 'cosine', 'dgamma', 'dweibull', 'erlang', 'expon', 'exponweib', 'exponpow', 'f', 'fatiguelife', 'fisk', 'foldcauchy', 'foldnorm', 'frechet_r', 'frechet_l', 'genlogistic', 'genpareto', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma', 'genhalflogistic', 'gilbrat', 'gompertz', 'gumbel_r', 'gumbel_l', 'halfcauchy', 'halflogistic', 'halfnorm', 'hypsecant', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu', 'ksone', 'kstwobign', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke', 'nakagami', 'ncx2', 'ncf', 'nct', 'norm', 'pareto', 'pearson3', 'powerlaw', 'powerlognorm', 'powernorm', 'rdist', 'reciprocal', 'rayleigh', 'rice', 'recipinvgauss', 'semicircular', 't', 'triang', 'truncexpon', 'truncnorm', 'tukeylambda', 'uniform', 'vonmises', 'wald', 'weibull_min', 'weibull_max', 'wrapcauchy']

如果在绘制直方图时normed = True怎么办？您不会将pdf_fitted乘以大小，对吗？

–阿罗哈
2015年3月23日在21:00

如果您想查看所有发行版的外观或有关如何访问所有发行版的想法，请参见此答案。

–tmthydvnprt
16年6月1日在12:54

@SaulloCastro在dist.fit的输出中，param中的3个值代表什么

– shaifali Gupta
17年8月12日在16:40

我会从代码中删除color ='w'，否则不会显示直方图。

–伊朗
19年5月9日在19:58

#3 楼

@Saullo Castro提到的fit()方法提供了最大似然估计（MLE）。数据的最佳分布是可以通过几种不同的方法确定的，从而可以为您提供最高的数据：例如

1，它可以为您提供最高的对数可能性。

2，即为您提供最小的AIC，BIC或BICc值（请参阅Wiki：http：//en.wikipedia.org/wiki/Akaike_information_criterion），基本上可以看作是针对参数数量调整后的对数似然，与期望更多的参数更合适）

3，这是使贝叶斯后验概率最大化的参数。（请参阅Wiki：http：//zh.wikipedia.org/wiki/Posterior_probability）

当然，如果您已经拥有一个可以描述您数据的分布（基于特定领域的理论）并且要坚持下去，您将跳过确定最佳拟合分布的步骤。

scipy并不具有计算对数似然的功能（尽管提供了MLE方法），但是使用了硬代码1很容易：请参见“ scipy.stat.distributions”的内置概率密度函数是否比用户提供的函数慢？

我如何将这种方法应用于已对数据进行分箱的情况-已经是直方图，而不是从数据生成直方图？

– Pete
17年3月15日在15:45

@pete，这将是间隔检查数据的情况，有最大似然方法，但目前尚未在scipy中实现

–朱骏
17 Mar 16 '17 at 0:08

不要忘记证据

– jtlz2
4月30日20:10

#4 楼

尝试使用distfit库。

pip install distfit

# Create 1000 random integers, value between [0-50]
X = np.random.randint(0, 50,1000)

# Retrieve P-value for y
y = [0,10,45,55,100]

# From the distfit library import the class distfit
from distfit import distfit

# Initialize.
# Set any properties here, such as alpha.
# The smoothing can be of use when working with integers. Otherwise your histogram
# may be jumping up-and-down, and getting the correct fit may be harder.
dist = distfit(alpha=0.05, smooth=10)

# Search for best theoretical fit on your empirical data
dist.fit_transform(X)

> [distfit] >fit..
> [distfit] >transform..
> [distfit] >[norm      ] [RSS: 0.0037894] [loc=23.535 scale=14.450] 
> [distfit] >[expon     ] [RSS: 0.0055534] [loc=0.000 scale=23.535] 
> [distfit] >[pareto    ] [RSS: 0.0056828] [loc=-384473077.778 scale=384473077.778] 
> [distfit] >[dweibull  ] [RSS: 0.0038202] [loc=24.535 scale=13.936] 
> [distfit] >[t         ] [RSS: 0.0037896] [loc=23.535 scale=14.450] 
> [distfit] >[genextreme] [RSS: 0.0036185] [loc=18.890 scale=14.506] 
> [distfit] >[gamma     ] [RSS: 0.0037600] [loc=-175.505 scale=1.044] 
> [distfit] >[lognorm   ] [RSS: 0.0642364] [loc=-0.000 scale=1.802] 
> [distfit] >[beta      ] [RSS: 0.0021885] [loc=-3.981 scale=52.981] 
> [distfit] >[uniform   ] [RSS: 0.0012349] [loc=0.000 scale=49.000] 

# Best fitted model
best_distr = dist.model
print(best_distr)

# Uniform shows best fit, with 95% CII (confidence intervals), and all other parameters
> {'distr': <scipy.stats._continuous_distns.uniform_gen at 0x16de3a53160>,
>  'params': (0.0, 49.0),
>  'name': 'uniform',
>  'RSS': 0.0012349021241149533,
>  'loc': 0.0,
>  'scale': 49.0,
>  'arg': (),
>  'CII_min_alpha': 2.45,
>  'CII_max_alpha': 46.55}

# Ranking distributions
dist.summary

# Plot the summary of fitted distributions
dist.plot_summary()

# Make prediction on new datapoints based on the fit
dist.predict(y)

# Retrieve your pvalues with 
dist.y_pred
# array(['down', 'none', 'none', 'up', 'up'], dtype='<U4')
dist.y_proba
array([0.02040816, 0.02040816, 0.02040816, 0.        , 0.        ])

# Or in one dataframe
dist.df

# The plot function will now also include the predictions of y
dist.plot()

请注意，在这种情况下，由于均匀分布，所有点都是有效的。如果需要，可以使用dist.y_pred进行过滤。

#5 楼

AFAICU，您的分布是离散的（除了离散之外什么都没有）。因此，仅计算不同值的频率并对它们进行归一化就足以满足您的目的。因此，有一个示例来证明这一点：

In []: values= [0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
In []: counts= asarray(bincount(values), dtype= float)
In []: cdf= counts.cumsum()/ counts.sum()

因此，看到高于1的值的概率很简单（根据互补累积分布函数（ccdf））：

In []: 1- cdf[1]
Out[]: 0.40000000000000002

请注意，ccdf与生存函数（sf）密切相关，但它也用离散分布定义，而sf仅针对连续分布定义。

#6 楼

对我来说，这听起来像是概率密度估计问题。

from scipy.stats import gaussian_kde
occurences = [0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47]
values = range(0,48)
kde = gaussian_kde(map(float, occurences))
p = kde(values)
p = p/sum(p)
print "P(x>=1) = %f" % sum(p[1:])

另请参阅http://jpktd.blogspot.com/2009/03/using-gaussian-kernel-density .html。

对于未来的读者：此解决方案（或至少是该主意）为OP问题（“ p值是什么”）提供了最简单的答案-知道如何将其与一些更合适的方法进行比较会很有趣已知的分布。

–格雷格
16-10-10在20:58

高斯核回归适用于所有分布吗？

–user7345804
17年4月1日在9:57

@mikey作为一般规则，对于所有发行版，都不能进行回归。他们还不错

–环境主义者
17年7月22日在9:42

#7 楼

如果我不理解您的需要，请原谅我，但是将数据存储在字典中，该字典中的键将是0到47之间的数字，并且将其在原始列表中的相关键的出现次数视为数值，请原谅我吗？
可能性p（x）将是大于x的键的所有值的总和除以30000。

在这种情况下，对于任何大于47的值，p（x）都将相同（等于0）。我需要连续的概率分布。

– s_sherly
2011年7月8日在6:15

@s_sherly-如果您可以更好地编辑和澄清问题，那将是一件好事，因为正如您所说的那样，“看到更大的值的概率”确实为零，高于池中最高值的值。

–mac
2011年7月8日在7:00

#8 楼

尽管上述许多答案是完全有效的，但似乎没有人可以完全回答您的问题，特别是以下部分：

我不知道我是否正确，但是要确定概率，我认为我需要使我的数据符合最适合描述我的数据的理论分布。我假设需要某种拟合优度检验才能确定最佳模型。

参数化方法
这是您正在描述的使用一些理论分布并拟合参数的过程对您的数据进行处理，并有一些很好的答案。
非参数方法
但是，也可以对问题使用非参数方法，这意味着您无需承担任何潜在的责任
通过所谓的经验分布函数，等于：
Fn（x）= SUM（I [X <= x]）/ n。因此，低于x的值所占的比例。
如上面的回答之一所指出的那样，您感兴趣的是CDF的倒数（累积分布函数），它等于1-F（x）
可以证明，经验分布函数将收敛到生成数据的任何“真实” CDF。
此外，通过以下方法构造1-alpha置信区间很简单：

L(X) = max{Fn(x)-en, 0}
U(X) = min{Fn(x)+en, 0}
en = sqrt( (1/2n)*log(2/alpha)

然后P（L（X）<= F（X）<= U（X））> = 1-alpha对于所有x。
我很惊讶PierrOz的答案是0票，而这是使用非参数方法估算F（x）的问题的完全有效答案。
请注意，对于任何x> 47，您提到的P（X> = x）= 0都是一个个人问题偏好可能会导致您选择非参数方法之上的参数方法。但是，这两种方法都是解决您问题的完全有效的方法。
有关上述声明的更多详细信息和证据，我建议您看一下
“所有统计：Larry A. Wasserman的统计推断简明课程”。关于参数和非参数推理的优秀书籍。
编辑：
由于您专门询问了一些python示例，因此可以使用numpy来完成：

import numpy as np

def empirical_cdf(data, x):
    return np.sum(x<=data)/len(data)

def p_value(data, x):
    return 1-empirical_cdf(data, x)

# Generate some data for demonstration purposes
data = np.floor(np.random.uniform(low=0, high=48, size=30000))

print(empirical_cdf(data, 20))
print(p_value(data, 20)) # This is the value you're interested in

#9 楼

使用OpenTURNS，我将使用BIC标准来选择适合此类数据的最佳分布。这是因为此标准不会给具有更多参数的分布带来太多优势。确实，如果分布具有更多参数，则拟合的分布更容易接近数据。此外，在这种情况下，Kolmogorov-Smirnov可能没有意义，因为测量值的微小误差将对p值产生巨大影响。
为了说明该过程，我加载了El-Nino数据，其中包含1950年至2010年的732次每月温度测量值：

import statsmodels.api as sm
dta = sm.datasets.elnino.load_pandas().data
dta['YEAR'] = dta.YEAR.astype(int).astype(str)
dta = dta.set_index('YEAR').T.unstack()
data = dta.values

使用GetContinuousUniVariateFactories静态方法很容易获得30个内置的单变量分布工厂。完成后，BestModelBIC静态方法将返回最佳模型和相应的BIC得分。

sample = ot.Sample([[p] for p in data]) # data reshaping
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model, best_bic = ot.FittingTest.BestModelBIC(sample,
                                                   tested_factories)
print("Best=",best_model)

打印：

Best= Beta(alpha = 1.64258, beta = 2.4348, a = 18.936, b = 29.254)

为了以图形方式将拟合度与直方图进行比较，我使用最佳分布的drawPDF方法。

import openturns.viewer as otv
graph = ot.HistogramFactory().build(sample).drawPDF()
bestPDF = best_model.drawPDF()
bestPDF.setColors(["blue"])
graph.add(bestPDF)
graph.setTitle("Best BIC fit")
name = best_model.getImplementation().getClassName()
graph.setLegends(["Histogram",name])
graph.setXTitle("Temperature (°C)")
otv.View(graph)

这将产生：

有关此主题的更多详细信息，请参见BestModelBIC文档。可以将Scipy分布包括在SciPyDistribution中，甚至可以将ChaosPy分布与ChaosPyDistribution一起包含，但我想当前脚本可以满足大多数实际目的。

您可能应该宣布有兴趣？

– jtlz2
4月30日20:13

编程黑洞网