Statistical package in Python based on Pandas



Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. For a full list of available functions, please refer to the API documentation.

  1. ANOVAs: N-way, repeated measures, mixed, ANCOVA
  2. Pairwise post-hoc tests (parametric and non-parametric) and pairwise correlations
  3. Robust, partial, distance and repeated measures correlations
  4. Linear/logistic regression and mediation analysis
  5. Bayes Factors
  6. Multivariate tests
  7. Reliability and consistency
  8. Effect sizes and power analysis
  9. Parametric/bootstrapped confidence intervals around an effect size or a correlation coefficient
  10. Circular statistics
  11. Chi-squared tests
  12. Plotting: Bland-Altman plot, Q-Q plot, paired plot, robust correlation...

Pingouin is designed for users who want simple yet exhaustive statistical functions.

For example, the ttest_ind function of SciPy returns only the T-value and the p-value. By contrast, the ttest function of Pingouin returns the T-value, the p-value, the degrees of freedom, the effect size (Cohen's d), the 95% confidence interval of the difference in means, the statistical power and the Bayes Factor (BF10) of the test.
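
For instance, compare the two on the same data (a minimal side-by-side sketch; the simulated data below are only illustrative):

import numpy as np
from scipy import stats
import pingouin as pg

np.random.seed(123)
x = np.random.normal(loc=4, scale=1, size=30)
y = np.random.normal(loc=5, scale=1, size=30)

# SciPy: returns only the T-value and the p-value
print(stats.ttest_ind(x, y))

# Pingouin: a single DataFrame with T, dof, p-value, Cohen's d, CI95%, BF10 and power
print(pg.ttest(x, y))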

Documentation

Chat

If you have questions, please ask them in the public Gitter chat.


Installation

Dependencies

The main dependencies of Pingouin are:

In addition, some functions require:

Pingouin is a Python 3 package and is currently tested on Python 3.6 and 3.7. Pingouin does not work with Python 2.7.

User installation

Pingouin can be easily installed using pip:

pip install pingouin

or conda:

conda install -c conda-forge pingouin

New releases are frequent so always make sure that you have the latest version:

pip install --upgrade pingouin

Quick start

Click on the link below and navigate to the notebooks/ folder to run a collection of interactive Jupyter notebooks showing the main functionalities of Pingouin. There is no need to install Pingouin beforehand; the notebooks run in a Binder environment.

10 minutes to Pingouin

1. T-test

import numpy as np
import pingouin as pg

np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T

# T-test
pg.ttest(x, y)
Output
     T  dof       tail  p-val          CI95%  cohen-d    BF10  power
-3.401   58  two-sided  0.001  [-1.68 -0.43]    0.878  26.155  0.917

2. Pearson's correlation

pg.corr(x, y)
Output
 n      r       CI95%     r2  adj_r2  p-val    BF10  power
30  0.595  [0.3 0.79]  0.354   0.306  0.001  69.723   0.95

3. Robust correlation

# Introduce an outlier
x[5] = 18
# Use the robust Shepherd's pi correlation
pg.corr(x, y, method="shepherd")
Output
 n  outliers      r        CI95%     r2  adj_r2  p-val  power
30         1  0.561  [0.25 0.77]  0.315   0.264  0.002  0.917

4. Test the normality of the data

The pingouin.normality function works with lists, arrays, or pandas DataFrames in wide- or long-format.

print(pg.normality(x))                                    # Univariate normality
print(pg.multivariate_normality(np.column_stack((x, y)))) # Multivariate normality
Output
    W   pval  normal
0.615  0.000   False
(False, 0.00018)
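
As an illustration of the long-format interface mentioned above (a sketch assuming the dv/group keywords of pingouin.normality; the grouped DataFrame below is only an example):

import pandas as pd

# Build a long-format DataFrame from the x and y arrays defined earlier
df_long = pd.DataFrame({'value': np.concatenate([x, y]),
                        'grp': ['x'] * 30 + ['y'] * 30})

# Univariate normality of 'value' within each level of 'grp'
print(pg.normality(df_long, dv='value', group='grp'))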

5. One-way ANOVA using a pandas DataFrame

# Read an example dataset
df = pg.read_dataset('mixed_anova')

# Run the ANOVA
aov = pg.anova(data=df, dv='Scores', between='Group', detailed=True)
print(aov)
Output
Source       SS   DF     MS      F  p-unc    np2
Group     5.460    1  5.460  5.244  0.023  0.029
Within  185.343  178  1.041    nan    nan    nan

6. Repeated measures ANOVA

pg.rm_anova(data=df, dv='Scores', within='Time', subject='Subject', detailed=True)
Output
Source       SS   DF     MS      F  p-unc    np2    eps
Time      7.628    2  3.814  3.913  0.023  0.062  0.999
Error   115.027  118  0.975    nan    nan    nan    nan

7. Post-hoc tests corrected for multiple comparisons

# FDR-corrected post hocs with Hedges'g effect size
posthoc = pg.pairwise_ttests(data=df, dv='Scores', within='Time', subject='Subject',
                             parametric=True, padjust='fdr_bh', effsize='hedges')

# Pretty printing of table
pg.print_table(posthoc, floatfmt='.3f')
Output
Contrast        A        B  Paired  Parametric       T     dof       Tail  p-unc  p-corr  p-adjust   BF10  hedges
    Time   August  January    True        True  -1.740  59.000  two-sided  0.087   0.131    fdr_bh  0.582  -0.328
    Time   August     June    True        True  -2.743  59.000  two-sided  0.008   0.024    fdr_bh  4.232  -0.485
    Time  January     June    True        True  -1.024  59.000  two-sided  0.310   0.310    fdr_bh  0.232  -0.170

8. Two-way mixed ANOVA

# Compute the two-way mixed ANOVA
aov = pg.mixed_anova(data=df, dv='Scores', between='Group', within='Time',
                     subject='Subject', correction=False, effsize="np2")
pg.print_table(aov)
Output
Source          SS  DF1  DF2     MS      F  p-unc    np2    eps
Group        5.460    1   58  5.460  5.052  0.028  0.080    nan
Time         7.628    2  116  3.814  4.027  0.020  0.065  0.999
Interaction  5.167    2  116  2.584  2.728  0.070  0.045    nan

9. Pairwise correlations between columns of a dataframe

import pandas as pd
np.random.seed(123)
z = np.random.normal(5, 1, 30)
data = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
pg.pairwise_corr(data, columns=['X', 'Y', 'Z'], method='pearson')
Output
X  Y   method       tail   n      r         CI95%     r2  adj_r2      z  p-unc   BF10  power
X  Y  pearson  two-sided  30  0.366   [0.01 0.64]  0.134   0.070  0.384  0.047  1.500  0.525
X  Z  pearson  two-sided  30  0.251  [-0.12 0.56]  0.063  -0.006  0.257  0.181  0.534  0.272
Y  Z  pearson  two-sided  30  0.020  [-0.34 0.38]  0.000  -0.074  0.020  0.916  0.228  0.051

10. Convert between effect sizes

# Convert from Cohen's d to Hedges' g
pg.convert_effsize(0.4, 'cohen', 'hedges', nx=10, ny=12)
0.384
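
For reference, this agrees with the usual small-sample bias correction applied to Cohen's d (an assumption about the exact formula used by convert_effsize, but it reproduces the value shown above):

# Hedges' g = Cohen's d multiplied by a small-sample correction factor
d, nx, ny = 0.4, 10, 12
print(d * (1 - 3 / (4 * (nx + ny) - 9)))  # ≈ 0.3848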

11. Multiple linear regression

pg.linear_regression(data[['X', 'Z']], data['Y'])
Linear regression summary
names       coef     se       T   pval     r2  adj_r2  CI[2.5%]  CI[97.5%]
Intercept  4.650  0.841   5.530  0.000  0.139   0.076     2.925      6.376
X          0.143  0.068   2.089  0.046  0.139   0.076     0.003      0.283
Z         -0.069  0.167  -0.416  0.681  0.139   0.076    -0.412      0.273

12. Mediation analysis

pg.mediation_analysis(data=data, x='X', m='Z', y='Y', seed=42, n_boot=1000)
Mediation summary
path        coef     se   pval  CI[2.5%]  CI[97.5%]  sig
Z ~ X      0.103  0.075  0.181    -0.051      0.256   No
Y ~ Z      0.018  0.171  0.916    -0.332      0.369   No
Total      0.136  0.065  0.047     0.002      0.269  Yes
Direct     0.143  0.068  0.046     0.003      0.283  Yes
Indirect  -0.007  0.025  0.898    -0.069      0.029   No
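
As a quick consistency check on the summary above: in a simple linear mediation model, the total effect decomposes into the direct plus the indirect effect (up to rounding):

total, direct, indirect = 0.136, 0.143, -0.007   # coefficients from the table above
print(abs(total - (direct + indirect)) < 0.001)  # True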

13. Contingency analysis

data = pg.read_dataset('chi2_independence')
expected, observed, stats = pg.chi2_independence(data, x='sex', y='target')
stats
Chi-squared tests summary
test                lambda    chi2    dof      p  cramer  power
pearson              1.000  22.717  1.000  0.000   0.274  0.997
cressie-read         0.667  22.931  1.000  0.000   0.275  0.998
log-likelihood       0.000  23.557  1.000  0.000   0.279  0.998
freeman-tukey       -0.500  24.220  1.000  0.000   0.283  0.998
mod-log-likelihood  -1.000  25.071  1.000  0.000   0.288  0.999
neyman              -2.000  27.458  1.000  0.000   0.301  0.999
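
As a quick sanity check of the cramer column (assuming the usual definition of Cramér's V; the column names used below are taken from the table above):

import numpy as np

n = data.shape[0]                                              # number of observations
chi2 = stats.loc[stats['test'] == 'pearson', 'chi2'].iloc[0]   # Pearson chi-squared
k = min(observed.shape)                                        # smaller dimension of the contingency table
print(round(np.sqrt(chi2 / (n * (k - 1))), 3))                 # should match the cramer value above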

Integration with Pandas

Several functions of Pingouin can be used directly as pandas DataFrame methods. Try for yourself with the code below:

import pingouin as pg

# Example 1 | ANOVA
df = pg.read_dataset('mixed_anova')
df.anova(dv='Scores', between='Group', detailed=True)

# Example 2 | Pairwise correlations
data = pg.read_dataset('mediation')
data.pairwise_corr(columns=['X', 'M', 'Y'], covar=['Mbin'])

# Example 3 | Partial correlation matrix
data.pcorr()

The functions that are currently supported as pandas methods are:

Development

Pingouin was created and is maintained by Raphael Vallat, mostly during his spare time. Contributions are more than welcome so feel free to contact me, open an issue or submit a pull request!

To see the code or report a bug, please visit the GitHub repository.

Note that this program is provided with NO WARRANTY OF ANY KIND. If you can, always double-check the results with other statistical software.

Contributors

How to cite Pingouin?

If you want to cite Pingouin, please use the publication in JOSS:

Acknowledgement

Several functions of Pingouin were inspired by R or Matlab toolboxes, including:

Owner
Raphael Vallat
French research scientist specialized in sleep and dreaming | Strong interest in stats and signal processing | Python lover
Comments
  • ValueError with large dataset

    I have a large dataset of 500k rows and 74 columns. Whenever I try to use a pairwise partial correlation I get the following error:

    ----> 1 data2.pairwise_corr(covar = 'SEX')
    
    ~/.local/lib/python3.7/site-packages/pandas_flavor/register.py in __call__(self, *args, **kwargs)
         27             @wraps(method)
         28             def __call__(self, *args, **kwargs):
    ---> 29                 return method(self._obj, *args, **kwargs)
         30 
         31         register_dataframe_accessor(method.__name__)(AccessorMethod)
    
    ~/.local/lib/python3.7/site-packages/pingouin/pairwise.py in pairwise_corr(data, columns, covar, tail, method, padjust, nan_policy)
       1229         else:
       1230             cor_st = partial_corr(data=data, x=col1, y=col2, covar=covar,
    -> 1231                                   tail=tail, method=method)
       1232         cor_st_keys = cor_st.columns.tolist()
       1233 
    
    ~/.local/lib/python3.7/site-packages/pingouin/correlation.py in partial_corr(data, x, y, covar, x_covar, y_covar, tail, method, **kwargs)
        783         # PARTIAL CORRELATION
        784         cvar = np.atleast_2d(C[covar].to_numpy())
    --> 785         beta_x = np.linalg.lstsq(cvar, C[x].to_numpy(), rcond=None)[0]
        786         beta_y = np.linalg.lstsq(cvar, C[y].to_numpy(), rcond=None)[0]
        787         res_x = C[x].to_numpy() - cvar @ beta_x
    
    <__array_function__ internals> in lstsq(*args, **kwargs)
    
    /curc/sw/anaconda3/2019.07/envs/jupyterlab2/lib/python3.7/site-packages/numpy/linalg/linalg.py in lstsq(a, b, rcond)
       2257         # lapack can't handle n_rhs = 0 - so allocate the array one larger in that axis
       2258         b = zeros(b.shape[:-2] + (m, n_rhs + 1), dtype=b.dtype)
    -> 2259     x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)
       2260     if m == 0:
       2261         x[...] = 0
    
    ValueError: On entry to DLASCL parameter number 4 had an illegal value
    

    This happens even with 12 cores running. Is there any way to resolve this, or is the pairwise correlation with a covariate simply not compatible with a large dataset? pcorr() and pairwise_corr() do work as methods on the dataframe.

  • Unexpected results in semi-partial correlation with Spearman/Kendall

    As a test, I generated correlated variables x and y. I then set x_test equal to x and ran a semi-partial correlation of x_test against y, controlling for the effect of x on y. As one would expect, the Pearson semi-partial correlation is zero. However, with Spearman and Kendall, the result is non-zero.

    The Spearman and Kendall results were 0.14 and 0.11, respectively. Although these values are low for the dummy dataset below, the data I am actually working with produces a very high semi-partial correlation of around 0.8, which does not make sense.

    import pandas as pd
    from pingouin import partial_corr

    df = pd.DataFrame({"y": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                       "x": [1, 3, 5, 8, 7, 10, 13, 14, 18, 21]})
    print(df.corr())
    df['x_test'] = df['x']
    partial_corr(data=df, x='x_test', y='y', y_covar=['x'],
                 method='spearman')['r'].values
    
    Out:
    
              y         x
    y  1.000000  0.985318
    x  0.985318  1.000000
    
    array([0.13939394])
    
  • Shapiro-Wilk test should have option to return W statistic

    Currently pingouin.normality() silently discards the W statistic returned by scipy.stats.shapiro(). I would love to have an option to retrieve this statistic, as I need to report it together with the p-values when drafting a scientific publication. Could the function be adjusted to return a DataFrame with this information, like most of the other pingouin functions?
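
    In the meantime, a minimal workaround is to call SciPy directly, which returns both values (assuming x is the array being tested):

    from scipy.stats import shapiro

    W, p = shapiro(x)   # SciPy returns the W statistic together with the p-value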

  • Add axis title to plot_shift

    The entire library is absolutely amazing; thank you for developing it. I came up with a minor improvement for the "Shift plot". No matter what data arrays we provide, this graphic labels them X and Y. There may be disagreement about its use in academic papers, and I believe that letting the user define the X and Y labels would address this. I created an edit where the user can configure the X and Y variable names as well as the plot title. Because putting the labels on the Y axis creates problems with the size of the figure, I placed them as a legend in the plot.

    I have ideas for other modules and would like to add them when I have the time. My initial contribution would be adding the odds ratio and CI-range values to the chi-squared module.

    Here is my test code: plotting_test.zip

  • multiple regression: catch rank deficient design matrices and warn

    closes #130

    For now I opted to only return the coefficients when a rank-deficient design matrix is detected.

    The underlying problem is that pingouin cannot handle the residues return value of lstsq when the design matrix is rank deficient. In that case, residues is an empty array (array([], dtype=float64)) instead of a float.

    One option, apart from warning and simply returning only the coefficients, could be to "revive" the currently commented-out line from line 413 (old) / 422 (new): # ss_res = (resid ** 2).sum()

    Something like:

        # FIT (WEIGHTED) LEAST SQUARES REGRESSION USING SCIPY.LINALG.LSTSQ
        coef, ss_res, rank, _ = lstsq(Xw, yw)
        if coef_only:
            return coef
        calc_ss_res = False
        if rank < Xw.shape[1]:
            warnings.warn('Design matrix supplied with `X` parameter is rank '
                          f'deficient (rank {rank} with {Xw.shape[1]} columns). '
                          'That means that one or more of the columns in `X` '
                          'are a linear combination of one or more of the '
                          'other columns.')
            calc_ss_res = True
            calc_ss_res = True
    
        # Degrees of freedom
        df_model = rank - constant
        df_resid = n - p
        # Calculate predicted values and (weighted) residuals
        pred = Xw @ coef
        resid = yw - pred
        if calc_ss_res:
            ss_res = (resid ** 2).sum()
    

    I played around with that, and it seems to produce the same results as statsmodels. Try it yourself if you'd like:

    import numpy as np 
    import pingouin 
    import statsmodels.api as sm 
    
    # make up some random data and a design matrix for multiple linear regression 
    n = 100  
    y = np.random.randn(n) 
    X = np.vstack([np.random.permutation([0, 1, 0, 0, 0]) for i in range(n)]) 
    
    # try with pingouin (use patch from code example above)
    results_pingouin = pingouin.linear_regression(X, y, add_intercept=True) 
    print(results_pingouin)
    
    # try with statsmodels
    X_with_intercept = sm.add_constant(X) 
    model = sm.OLS(endog=y, exog=X_with_intercept) 
    results_sm = model.fit() 
    print(results_sm.summary())
    
    # compare residuals
    np.allclose(results_pingouin.residuals_, results_sm.resid)  # True
    np.allclose(results_pingouin.coef.to_numpy(), results_sm.params)  # True
    

    Let me know what you think :-)

  • pairwise_corr doc updated and extended functionality to automatically control for all remaining covariates

    I updated the documentation of pairwise_corr to state more clearly that a partial correlation is only calculated if covariates are specified.

    Furthermore, I implemented that combinations of columns which are also present in covar are no longer dropped. Instead, the x and y columns are removed from covar only for that specific combination. This makes it possible to automatically control for all other columns in the dataframe while calculating the pairwise partial correlation for every combination.

    If this means that there are no covars left for a combination, an empty list is returned in stats['covar'] and, via partial_corr line 744, a standard non-partial correlation is returned. Is this clear enough for the user, or should "empty" covar rows be dropped instead?
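
    In other words, the per-combination behaviour described above amounts to something like the following sketch (the helper name and variables are illustrative, not the actual code in pairwise.py):

    from pingouin import corr, partial_corr

    def corr_with_remaining_covars(data, col1, col2, covar, method='pearson'):
        """Illustrative only: drop col1/col2 from the covariates for this pair."""
        covar_xy = [c for c in covar if c not in (col1, col2)]
        if covar_xy:
            return partial_corr(data=data, x=col1, y=col2, covar=covar_xy, method=method)
        # No covariates left for this pair: fall back to a standard correlation
        return corr(data[col1], data[col2], method=method)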

    Closes #123

    Some basic examples of the functionality (tests are included in the respective test file):

    import numpy as np
    import pandas as pd
    
    import pingouin as pg
    
    df = pd.DataFrame(
        data=np.random.rand(200, 4), columns=['a', 'b', 'c', 'd'])
    df['e'] = df['a']**2
    df['f'] = df['a'] * df['b']
    df['g'] = df['c'] + np.random.randint(0, 5, 200)
    df['h'] = np.sqrt(df['d'] + 5) - df['b']**3
    
    # Get pairwise corr for all available covars per combination, also try nan policy
    pwc = pg.pairwise_corr(data=df, covar=df.columns)
    pwc_lwnan = pg.pairwise_corr(data=df, covar=df.columns, nan_policy='listwise')
    # assert that both are equal
    assert (
        pwc.drop(columns=['CI95%']) == pwc_lwnan.drop(columns=['CI95%'])
    ).all().all()
    
    # Now the same with a MultiIndex:
    mi = pd.MultiIndex.from_product((
        ('a', 'b', 'c', 'd'), ('u', 'v'),
    ))
    df_mi = df.copy()
    df_mi.columns = mi
    
    pwc_mi = pg.pairwise_corr(data=df_mi, covar=df_mi.columns)
    # Assert that the results are equal to non-multiindex tests
    assert (
        pwc_mi.drop(
            columns=['X', 'Y', 'covar', 'CI95%']) == pwc.drop(
                columns=['X', 'Y', 'covar', 'CI95%'])
    ).all().all()
    # and compare to nan policy listwise
    pwc_mi_lwnan = pg.pairwise_corr(
        data=df_mi, covar=df_mi.columns, nan_policy='listwise'
    )
    assert (
        pwc_mi.drop(
            columns=['X', 'Y', 'covar', 'CI95%']) == pwc_mi_lwnan.drop(
                columns=['X', 'Y', 'covar', 'CI95%'])
    ).all().all()
    

    And to show what I mean by "empty covars" (taken from test_pairwise.py):

    import numpy as np

    from pingouin import pairwise_corr, read_dataset
    
    data = read_dataset('pairwise_corr').iloc[:, 1:]
    n = data.shape[0]
    data['Age'] = np.random.randint(18, 65, n)
    data['IQ'] = np.random.normal(105, 1, n)
    data['One'] = 1
    data['Gender'] = np.repeat(['M', 'F'], int(n / 2))
    
    # see row 4 for empty covar
    pwc_empty = pairwise_corr(data, columns='Neuroticism', covar='Age')
    print(pwc_empty.covar)
    # or fully empty covar
    pwc_fully_empty = pairwise_corr(data, columns=['Neuroticism', 'Age'], covar='Age')
    
  • Assertion failure (-1 <= r <= 1) in power.py when running rm_corr

    Hi there, thanks for providing this awesome library. I picked it up to run an rm_corr test on some data, but I'm having issues.

    I have 1000 individuals for each of whom I have 8 paired measurements. I have uploaded a pickled dataframe that causes the issue to GDrive (https://drive.google.com/open?id=1gkvGKBth2J8EhmhblsWRdK-eCDlk8zJk).

    Here's a snapshot of what the dataframe looks like (screenshot omitted).

    You can reproduce the issue by running:

    import pandas as pd
    from pingouin import rm_corr

    df = pd.read_pickle('bad_df.pkl')
    rm_corr(data=df, x='m2', y='m1', subject='id')
    

    which yields the error

    AssertionError                            Traceback (most recent call last)
    <ipython-input-156-86b16d57ef4b> in <module>
    ----> 1 rm_corr(data=df, x='m2', y='m1', subject='id')
    
    .../lib/python3.7/site-packages/pingouin/correlation.py in rm_corr(data, x, y, subject, tail)
        981     pval = pval * 0.5 if tail == 'one-sided' else pval
        982     ci = compute_esci(stat=rm, nx=n, eftype='pearson').tolist()
    --> 983     pwr = power_corr(r=rm, n=n, tail=tail)
        984     # Convert to Dataframe
        985     stats = pd.DataFrame({"r": round(rm, 3), "dof": int(dof),
    
    .../lib/python3.7/site-packages/pingouin/power.py in power_corr(r, n, power, alpha, tail)
        902     # Safety checks
        903     if r is not None:
    --> 904         assert -1 <= r <= 1
        905         r = abs(r)
        906     if alpha is not None:
    
    

    If I drop into a debugger I can see that r = 1.0000184067367273; perhaps this is due to numerical stability issues?

  • Partial correlations out of range when variance = 0

    The following reproducer:

    from pandas import DataFrame
    from random import seed, sample
    
    import pingouin as pg
    
    seed(0, 1)
    sample_size = 100
    random_columns = 60
    
    n = [0.0000001 for _ in range(sample_size)]
    df = DataFrame(dict(
        n1b=n, n2b=n, n3b=n,
        **{
            f'x{i}': sample(range(2000), sample_size)
            for i in range(random_columns)
        }
    ))
    
    pg.pcorr(df).max().max()
    

    gives a value of 6.754645467173336 with pingouin 0.4.0. I encountered this behaviour (out-of-range coefficients) in the wild when I accidentally forgot to remove variables with no variance.

    To ensure that incorrect results like this don't get published, I would:

    • add a check on the data verifying that var() != 0 prior to running the computations; I'm not sure if this should raise or warn (a rough sketch is given below)
    • add a check on the results; if any of the coefficients are out of range, warn, provide the names of the columns they come from, and give a helpful message like "your data may have close-to-zero variance variables or be otherwise ill-conditioned".

    It might also be worth investigating whether there is a way to work around these numerical issues, but the two points above should address the most urgent worries.
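
    A rough sketch of the first suggestion, warning rather than raising (the helper name is hypothetical, not part of pingouin):

    import warnings

    def warn_zero_variance(df, tol=1e-12):
        """Warn about (near-)zero-variance columns before computing partial correlations."""
        variances = df.var()
        bad = variances[variances < tol].index.tolist()
        if bad:
            warnings.warn(f"Columns with (near-)zero variance: {bad}; "
                          "partial correlation coefficients may be ill-conditioned.")
        return bad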

  • Skipped correlation gives different results than the Pernet implementation in Matlab

    I think this line in correlation.py:

    dis[i, :] = np.linalg.norm(B * B2[i, :] / bot[i], axis=1)
    

    is functionally different from the Matlab implementation and the equation in Wilcox, R. (2004).

  • adding option to return stats as float, rather than str

    This PR is a work in progress.

    I'm not sure whether there is a general guideline about returning statistics as formatted strings rather than as floats, but I find it very useful to work with the raw float values rather than strings. This PR adds that functionality, at least to the bayesian module (I'm not sure what the situation is in other modules).

    This PR leaves the default behaviour of returning strings untouched. It could be argued that returning the raw float values should be the default, with formatting left to the user (or to some utility function); I'm interested to hear thoughts on this as well.

    If you think in principle this PR is a good idea, I'll update and add proper documentation etc.

  • TestParametric::test_pandas fails with dtype mismatch on 32-bit platforms

    I’m helping to update the python-pingouin package in Fedora Linux. One test, TestParametric::test_pandas, is failing on 32-bit platforms (i686 and armv7hl). I can’t easily tell if the actual problem here is in pingouin, pandas, or somewhere else (scipy, pandas-flavor, …).

    =================================== FAILURES ===================================
    __________________________ TestParametric.test_pandas __________________________
    self = <pingouin.tests.test_pandas.TestParametric testMethod=test_pandas>
        def test_pandas(self):
            """Test pandas method.
            """
            # Test the ANOVA (Pandas)
            aov = df.anova(dv='Scores', between='Group', detailed=True)
            assert aov.equals(pg.anova(dv='Scores', between='Group', detailed=True,
                                       data=df))
            aov3_ss1 = df_aov3.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                                     ss_type=1)
            aov3_ss2 = df_aov3.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                                     ss_type=2)
            aov3_ss2_pg = pg.anova(dv='Cholesterol', between=['Sex', 'Drug'],
                                   data=df_aov3, ss_type=2)
            assert not aov3_ss1.equals(aov3_ss2)
            assert aov3_ss2.round(3).equals(aov3_ss2_pg.round(3))
        
            # Test the Welch ANOVA (Pandas)
            aov = df.welch_anova(dv='Scores', between='Group')
            assert aov.equals(pg.welch_anova(dv='Scores', between='Group',
                                             data=df))
        
            # Test the ANCOVA
            aov = df_anc.ancova(dv='Scores', covar='Income',
                                between='Method').round(3)
            assert aov.equals(pg.ancova(data=df_anc, dv='Scores', covar='Income',
                              between='Method').round(3))
        
            # Test the repeated measures ANOVA (Pandas)
            aov = df.rm_anova(dv='Scores', within='Time', subject='Subject',
                              detailed=True)
            assert aov.equals(pg.rm_anova(dv='Scores', within='Time',
                                          subject='Subject',
                                          detailed=True, data=df))
        
            # FDR-corrected post hocs with Hedges'g effect size
            ttests = df.pairwise_ttests(dv='Scores', within='Time',
                                        subject='Subject', padjust='fdr_bh',
                                        effsize='hedges')
            assert ttests.equals(pg.pairwise_ttests(dv='Scores', within='Time',
                                                    subject='Subject',
                                                    padjust='fdr_bh',
                                                    effsize='hedges', data=df))
        
            # Pairwise Tukey
            tukey = df.pairwise_tukey(dv='Scores', between='Group')
            assert tukey.equals(pg.pairwise_tukey(data=df, dv='Scores',
                                                  between='Group'))
        
            # Test two-way mixed ANOVA
            aov = df.mixed_anova(dv='Scores', between='Group', within='Time',
                                 subject='Subject', correction=False)
            assert aov.equals(pg.mixed_anova(dv='Scores', between='Group',
                                             within='Time',
                                             subject='Subject', correction=False,
                                             data=df))
        
            # Test parwise correlations
            corrs = data.pairwise_corr(columns=['X', 'M', 'Y'], method='spearman')
            corrs2 = pg.pairwise_corr(data=data, columns=['X', 'M', 'Y'],
                                      method='spearman')
            assert corrs['r'].equals(corrs2['r'])
        
            # Test partial correlation
            corrs = data.partial_corr(x='X', y='Y', covar='M', method='spearman')
            corrs2 = pg.partial_corr(x='X', y='Y', covar='M', method='spearman',
                                     data=data)
            assert corrs['r'].equals(corrs2['r'])
        
            # Test partial correlation matrix (compare with the ppcor package)
            corrs = data.iloc[:, :5].pcorr().round(3)
            np.testing.assert_array_equal(corrs.iloc[0, :].to_numpy(),
                                          [1, 0.392, 0.06, -0.014, -0.149])
            # Now compare against Pingouin's own partial_corr function
            corrs = data[['X', 'Y', 'M']].pcorr()
            corrs2 = data.partial_corr(x='X', y='Y', covar='M')
            assert np.isclose(corrs.at['X', 'Y'], corrs2.at['pearson', 'r'])
        
            # Test rcorr (correlation matrix with p-values)
            # We compare against Pingouin pairwise_corr function
            corrs = df_corr.rcorr(padjust='holm', decimals=4)
            corrs2 = df_corr.pairwise_corr(padjust='holm').round(4)
            assert corrs.at['Neuroticism', 'Agreeableness'] == '*'
            assert (corrs.at['Agreeableness', 'Neuroticism'] ==
                    str(corrs2.at[2, 'r']))
            corrs = df_corr.rcorr(padjust='holm', stars=False, decimals=4)
            assert (corrs.at['Neuroticism', 'Agreeableness'] ==
                    str(corrs2.at[2, 'p-corr'].round(4)))
            corrs = df_corr.rcorr(upper='n', decimals=5)
            corrs2 = df_corr.pairwise_corr().round(5)
            assert corrs.at['Extraversion', 'Openness'] == corrs2.at[4, 'n']
            assert corrs.at['Openness', 'Extraversion'] == str(corrs2.at[4, 'r'])
            # Method = spearman does not work with Python 3.5 on Travis?
            # Instead it seems to return the Pearson correlation!
    >       df_corr.rcorr(method='spearman')
    aov        =         Source        SS  DF1  DF2  ...         F     p-unc       np2       eps
    0        Group  5.459963    1   58  ......064929  0.998751
    2  Interaction  5.167192    2  116  ...  2.727996  0.069545  0.044922       NaN
    [3 rows x 9 columns]
    aov3_ss1   =        Source         SS    DF        MS         F     p-unc       np2
    0         Sex   3.362769   1.0  3.362769  3.461...62   2.0  0.561181  0.577636  0.564168  0.018007
    3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
    aov3_ss2   =        Source         SS    DF        MS         F     p-unc       np2
    0         Sex   3.419789   1.0  3.419789  3.520...62   2.0  0.561181  0.577636  0.564168  0.018007
    3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
    aov3_ss2_pg =        Source         SS    DF        MS         F     p-unc       np2
    0         Sex   3.419789   1.0  3.419789  3.520...62   2.0  0.561181  0.577636  0.564168  0.018007
    3    Residual  61.205378  63.0  0.971514       NaN       NaN       NaN
    corrs      =                   Neuroticism Extraversion  ... Agreeableness Conscientiousness
    Neuroticism                 -         ...              500
    Conscientiousness    -0.36801      0.06459  ...       0.15867                 -
    [5 rows x 5 columns]
    corrs2     =                X                  Y   method  ...    p-unc       BF10    power
    0    Neuroticism       Extraversion  pe...  0.059  0.06035
    9  Agreeableness  Conscientiousness  pearson  ...  0.00037     31.243  0.94638
    [10 rows x 10 columns]
    self       = <pingouin.tests.test_pandas.TestParametric testMethod=test_pandas>
    ttests     =   Contrast        A        B  Paired  ...    p-corr  p-adjust   BF10    hedges
    0     Time   August  January    True  ....  4.232 -0.482547
    2     Time  January     June    True  ...  0.310194    fdr_bh  0.232 -0.169520
    [3 rows x 13 columns]
    tukey      =          A           B   mean(A)  ...         T   p-tukey    hedges
    0  Control  Meditation  5.567851  ... -2.289903  0.023202 -0.339918
    [1 rows x 9 columns]
    pingouin/tests/test_pandas.py:114: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    /usr/lib/python3.10/site-packages/pandas_flavor/register.py:29: in __call__
        return method(self._obj, *args, **kwargs)
            args       = ()
            kwargs     = {'method': 'spearman'}
            method     = <function rcorr at 0xbb3880b8>
            self       = <pandas_flavor.register.register_dataframe_method.<locals>.inner.<locals>.AccessorMethod object at 0xa8881f10>
    pingouin/correlation.py:1050: in rcorr
        mat = self.corr(method=method).round(decimals)
            decimals   = 3
            ffp        = <function format_float_positional at 0xc3ab3028>
            method     = 'spearman'
            padjust    = None
            pearsonr   = <function pearsonr at 0xbbb59e38>
            pval_stars = {0.001: '***', 0.01: '**', 0.05: '*'}
            self       =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
    0        2.47917       4.20833   3.93750   ...3            3.39583
    499      2.54167       3.56250   3.14583        3.45833            2.89583
    [500 rows x 5 columns]
            spearmanr  = <function spearmanr at 0xbbb5f388>
            stars      = True
            tif        = <function triu_indices_from at 0xc39efcd0>
            upper      = 'pval'
    /usr/lib/python3.10/site-packages/pandas/core/frame.py:9376: in corr
        correl = libalgos.nancorr_spearman(mat, minp=min_periods)
            cols       = Index(['Neuroticism', 'Extraversion', 'Openness', 'Agreeableness',
           'Conscientiousness'],
          dtype='object')
            idx        = Index(['Neuroticism', 'Extraversion', 'Openness', 'Agreeableness',
           'Conscientiousness'],
          dtype='object')
            mat        = array([[2.47917, 4.20833, 3.9375 , 3.95833, 3.45833],
           [2.60417, 3.1875 , 3.95833, 3.39583, 3.22917],
           [2.... 3.     ],
           [2.08333, 3.66667, 3.58333, 3.45833, 3.39583],
           [2.54167, 3.5625 , 3.14583, 3.45833, 2.89583]])
            method     = 'spearman'
            min_periods = 1
            numeric_df =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
    0        2.47917       4.20833   3.93750   ...3            3.39583
    499      2.54167       3.56250   3.14583        3.45833            2.89583
    [500 rows x 5 columns]
            self       =      Neuroticism  Extraversion  Openness  Agreeableness  Conscientiousness
    0        2.47917       4.20833   3.93750   ...3            3.39583
    499      2.54167       3.56250   3.14583        3.45833            2.89583
    [500 rows x 5 columns]
    pandas/_libs/algos.pyx:415: in pandas._libs.algos.nancorr_spearman
        ???
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    >   ???
    E   ValueError: Buffer dtype mismatch, expected 'const intp_t' but got 'long long'
    pandas/_libs/algos.pyx:938: ValueError
    

    On 32-bit Linux platforms, long long is normally 64-bit but intp_t is 32-bit, which explains what is happening but not why it is happening.

    I can easily do any tests that are needed on these platforms to try to figure out where the root issue is, but some ideas regarding what to look for would be very helpful.

  • FutureWarning followed by sserror = grp.apply(lambda x: (x - x.mean()) ** 2).sum()

    I was just trying to understand some generated data in pingouin using ANOVA, but I got a FutureWarning and an sserror message that I can't seem to figure out. For the sake of completeness, I will post what I was doing.

    # the libraries
    import pandas as pd
    import numpy as np
    
    import pingouin as pg
    import itertools
    
    rng = np.random.default_rng()
    
    # the generated data
    year = pd.DataFrame([rng.choice(a=[2000, 2001, 2002], size=450, replace=True, shuffle=True) 
            for _ in range(3)]).stack().reset_index(drop=["level_0", "level_1"])
    
    dv1 = pd.DataFrame([rng.normal(loc=9.83, scale=189.34, size=450) for _ in range(3)]).stack()\
            .reset_index(drop=["level_0", "level_1"])
    
    dv2 = pd.DataFrame([rng.normal(loc=196.52, scale=1091.10, size=450) for _ in range(3)]).stack()\
            .reset_index(drop=["level_0", "level_1"])
    
    iv1 = pd.DataFrame([rng.choice(a=range(5), size=450) for _ in range(3)]).stack()\
            .reset_index(drop=["level_0", "level_1"])
    
    iv2 = pd.DataFrame([rng.choice(a=range(5), size=450) for _ in range(3)]).stack()\
            .reset_index(drop=["level_0", "level_1"])
    
    iv3 = pd.DataFrame([rng.choice(a=range(5), size=450) for _ in range(3)]).stack()\
            .reset_index(drop=["level_0", "level_1"])
    
    iv4 = pd.DataFrame([rng.choice(a=range(5), size=450) for _ in range(3)]).stack()\
            .reset_index(drop=["level_0", "level_1"])
    
    dfa = pd.concat([year, dv1, dv2, iv1, iv2, iv3, iv4], axis=1).rename({0: "year",
                                                                   1: 'dv1', 2: 'dv2',
                                                                   3: 'iv1', 4: 'iv2',
                                                                   5: 'iv3', 6: 'iv4'},
                                                                  axis="columns")
    print(dfa)
    
    # the dvs and ivs
    dvs = ["dv1", "dv2"] # continuous variables
    ivs = ["iv1", "iv2", "iv3", "iv4"] # categorical variables
    
    # the anova model
    pgaov = {}
    for idx, df in dfa.groupby('year'):
        # get the cartesian product
        for dv, iv in itertools.product(dvs, ivs):
            pgaov[idx, dv, iv] = pg.anova(data=df, dv=dv, between=iv, detailed=True, effsize='n2')
    

    When I ran the code above for the anova model, I got the following message:

    C:\Python39\lib\site-packages\pingouin\parametric.py:992: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
    To preserve the previous behavior, use
    
    	>>> .groupby(..., group_keys=False)
    
    To adopt the future behavior and silence this warning, use 
    
    	>>> .groupby(..., group_keys=True)
      sserror = grp.apply(lambda x: (x - x.mean()) ** 2).sum()
    

    My questions are two-fold: (1) how do I keep the FutureWarning from showing when using my groupby statement? and (2) what is the source of sserror = grp.apply(lambda x: (x - x.mean()) ** 2).sum() and how can I fix it?

  • non consistent post-hoc output

    Hi, many thanks for your work, Pingouin makes a real difference in the Python environment! I noticed that the output data frame headings from the pairwise_gameshowell, pairwise_tests and pairwise_tukey functions are not consistent. To some extent this is understandable, but discrepancies such as:

    • different names of columns containing p-values
    • lack of mean(A) and mean(B) in pairwise_tests output
    • lack of diff in pairwise_tests output
    • effect size values that differ between pairwise_gameshowell and pairwise_tukey

    make automation more difficult. My question is: would it be possible to standardize the output of these functions?

  • Reporting Extra GG Corrected Values in rm_anova

    I believe it would be a great addition to the package! Generally, when reporting GG-corrected p-values, the corrected F value and corrected degrees of freedom are also reported. It is also a relatively easy addition:

        (...)
        if correction:
            (...)
            corr_fval = f(corr_ddof1, corr_ddof2).ppf(1 - p_corr)

    then during the formation of the dataframe:

        if not detailed:
            (....)
            aov["F"] = corr_fval
            aov["DF"] = [corr_ddof1, corr_ddof2]
        else:
            (...)
            aov["DF"] = [corr_ddof1, corr_ddof2]
            aov["F"] = [corr_fval, np.nan]
            (...)

  • Pearson's r from compute_effsize =|= Pearson's r from convert_effsize

    I noticed that Pearson's r calculated with compute_effsize differs from the value converted from Cohen's d via convert_effsize:

    import numpy as np
    import pandas as pd
    import pingouin as pg

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))
    d = pg.compute_effsize(df['A'], df['B'], paired=False, eftype='cohen')
    r_calc = pg.compute_effsize(df['A'], df['B'], paired=False, eftype='r')
    r_conv = pg.convert_effsize(d, 'cohen', 'r')
    
    

    -> r_calc is not equal to r_conv, not even close

    Am I missing something in how the formula works or is this a bug?

  • Add Z tests (proportions & means)

    Hi,

    1. Intro

    As mentioned in discussion #296, it would be convenient to have support for both proportion and mean z tests within pingouin.

    statsmodels already provides several methods for these tests, but in a scattered and sometimes confusing way, so you often end up writing your own wrapper. Moreover, those methods sometimes have odd signatures, making them not really straightforward to use. A single, richer pandas output would be far more helpful. We are talking about basics here: in my opinion it would strengthen pingouin's position as a user-friendly but powerful and complete statistical package (a rough sketch of the core computation is given at the end of this comment).

    2. Resources

    • statsmodels

      • Proportions
        • z_stat and p_value: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportions_ztest.html
        • diff CI: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.confint_proportions_2indep.html
        • power: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.power_proportions_2indep.html
        • effect size: https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportion_effectsize.html
      • Means
        • z_stat and p_value: https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html?highlight=ztest#statsmodels.stats.weightstats.ztest OR https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.CompareMeans.ztest_ind.html#statsmodels.stats.weightstats.CompareMeans.ztest_ind
        • diff CI: https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.zconfint.html?highlight=ztest OR https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.CompareMeans.zconfint_diff.html#statsmodels.stats.weightstats.CompareMeans.zconfint_diff
    • R

      • Means
        • Paired Z test: https://rpubs.com/nguyenminhsang/paired_z-test

    3. Feature

    • proportions_ztest()

      • parameters

        • x1: 2-column array_like, 1st column: number of trials, 2nd column: number of successes
        • x2: same or proportion value
        • alternative
        • paired
        • r
        • confidence
        • method: method for computing confidence interval, ‘newcomb’ (default), ‘wald’, ‘agresti-caffo’, ‘score’
      • returns

        • Z
        • alternative
        • p_val
        • CI95%: diff in proportion
        • cohen-d
        • BF10
        • power
    • means_ztest()

      • parameters
        • x1: array_like
        • x2: same or mean value
        • alternative
        • paired
        • r
        • confidence
      • returns
        • Z
        • alternative
        • p_val
        • CI95%: diff in means
        • cohen-d
        • BF10
        • power
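
    To make the proposal concrete, here is a minimal sketch of what the core of proportions_ztest() might compute for two independent samples (pooled standard error only; the CI methods, BF10 and power listed above are left out, and the function name is illustrative):

    import numpy as np
    from scipy.stats import norm

    def proportions_ztest_sketch(success1, n1, success2, n2, alternative='two-sided'):
        """Two-sample proportion z-test with a pooled standard error (illustration only)."""
        p1, p2 = success1 / n1, success2 / n2
        p_pool = (success1 + success2) / (n1 + n2)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        if alternative == 'two-sided':
            pval = 2 * norm.sf(abs(z))
        elif alternative == 'greater':
            pval = norm.sf(z)
        else:  # 'less'
            pval = norm.cdf(z)
        return z, pval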

    Thanks! Aurélien
