!pip install statsmodels
Requirement already satisfied: statsmodels in /root/venv/lib/python3.7/site-packages (0.12.2)
Requirement already satisfied: scipy>=1.1 in /shared-libs/python3.7/py/lib/python3.7/site-packages (from statsmodels) (1.6.0)
Requirement already satisfied: numpy>=1.15 in /shared-libs/python3.7/py/lib/python3.7/site-packages (from statsmodels) (1.19.5)
Requirement already satisfied: patsy>=0.5 in /root/venv/lib/python3.7/site-packages (from statsmodels) (0.5.1)
Requirement already satisfied: pandas>=0.21 in /shared-libs/python3.7/py/lib/python3.7/site-packages (from statsmodels) (1.2.1)
Requirement already satisfied: six in /shared-libs/python3.7/py-core/lib/python3.7/site-packages (from patsy>=0.5->statsmodels) (1.15.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /shared-libs/python3.7/py-core/lib/python3.7/site-packages (from pandas>=0.21->statsmodels) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /shared-libs/python3.7/py/lib/python3.7/site-packages (from pandas>=0.21->statsmodels) (2021.1)
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
from statsmodels.stats.anova import AnovaRM
from statsmodels.multivariate.manova import MANOVA
help(AnovaRM)
Help on class AnovaRM in module statsmodels.stats.anova:
class AnovaRM(builtins.object)
| AnovaRM(data, depvar, subject, within=None, between=None, aggregate_func=None)
|
| Repeated measures Anova using least squares regression
|
| The full model regression residual sum of squares is
| used to compare with the reduced model for calculating the
| within-subject effect sum of squares [1].
|
| Currently, only fully balanced within-subject designs are supported.
| Calculation of between-subject effects and corrections for violation of
| sphericity are not yet implemented.
|
| Parameters
| ----------
| data : DataFrame
| depvar : str
| The dependent variable in `data`
| subject : str
| Specify the subject id
| within : list[str]
| The within-subject factors
| between : list[str]
| The between-subject factors, this is not yet implemented
| aggregate_func : {None, 'mean', callable}
| If the data set contains more than a single observation per subject
| and cell of the specified model, this function will be used to
| aggregate the data before running the Anova. `None` (the default) will
| not perform any aggregation; 'mean' is s shortcut to `numpy.mean`.
| An exception will be raised if aggregation is required, but no
| aggregation function was specified.
|
| Returns
| -------
| results : AnovaResults instance
|
| Raises
| ------
| ValueError
| If the data need to be aggregated, but `aggregate_func` was not
| specified.
|
| Notes
| -----
| This implementation currently only supports fully balanced designs. If the
| data contain more than one observation per subject and cell of the design,
| these observations need to be aggregated into a single observation
| before the Anova is calculated, either manually or by passing an aggregation
| function via the `aggregate_func` keyword argument.
| Note that if the input data set was not balanced before performing the
| aggregation, the implied heteroscedasticity of the data is ignored.
|
| References
| ----------
| .. [*] Rutherford, Andrew. Anova and ANCOVA: a GLM approach. John Wiley & Sons, 2011.
|
| Methods defined here:
|
| __init__(self, data, depvar, subject, within=None, between=None, aggregate_func=None)
| Initialize self. See help(type(self)) for accurate signature.
|
| fit(self)
| estimate the model and compute the Anova table
|
| Returns
| -------
| AnovaResults instance
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
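The help text above says that AnovaRM needs exactly one observation per subject and cell, and that `aggregate_func` can collapse repeated observations. A minimal sketch of that behaviour on synthetic data follows; the frame `toy` and its column names (`subject`, `cond`, `score`) are made up for illustration and are not part of run_results.csv.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
n_subjects, conditions, n_repeats = 10, ['a', 'b', 'c'], 2

# Hypothetical balanced data: every subject sees every condition twice.
rows = [
    {'subject': s, 'cond': c, 'score': rng.normal()}
    for s in range(n_subjects)
    for c in conditions
    for _ in range(n_repeats)
]
toy = pd.DataFrame(rows)

# Without aggregate_func, duplicated (subject, cond) cells raise a ValueError.
try:
    AnovaRM(toy, depvar='score', subject='subject', within=['cond'])
    raised = False
except ValueError:
    raised = True

# With aggregate_func='mean', the duplicates are averaged before fitting.
toy_result = AnovaRM(toy, depvar='score', subject='subject',
                     within=['cond'], aggregate_func='mean').fit()
print(toy_result.anova_table)
```

The fitted result exposes the table as a DataFrame through `anova_table`, which is easier to post-process than the printed summary.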
df = pd.read_csv('run_results.csv')
df.head()
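# Reshape the wide results (one row per run, value columns '0'..'6') into long form.
# The hard-coded sizes assume 72 rows in df: 24 names x 3 consecutive runs each.
cleaned_df = pd.DataFrame()
cleaned_df['name'] = df['name'].loc[df['name'].index.repeat(7)]  # repeat each name 7 times
cleaned_df = cleaned_df.reset_index()
cleaned_df['within'] = np.tile([0, 1, 2, 3, 4, 5, 6], 72)  # value-column id for every long row
cleaned_df['response'] = pd.Series(df[['0','1','2','3','4','5','6']].values.reshape(1, -1)[0])  # row-major flatten
# The 'index' column left by reset_index() is overwritten below with the run number.
cleaned_df['name'] = cleaned_df['name'] + '-' + cleaned_df['within'].astype(str)  # subject id: name-column
idx = np.concatenate((np.tile([0], 7), np.tile([1], 7), np.tile([2], 7)))
cleaned_df['index'] = np.tile([idx], 24).reshape(1,-1)[0]  # run number (0, 1, 2) within each name
# Perform the repeated measures ANOVA: subject = name-column pair, within factor = run number
result = AnovaRM(data=cleaned_df, depvar='response', subject='name', within=['index']).fit()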
print(result)
Anova
====================================
F Value Num DF Den DF Pr > F
------------------------------------
index 2.5914 2.0000 334.0000 0.0764
====================================
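The long-form table used for the ANOVA above can also be built without hard-coded row counts. The sketch below uses `groupby(...).cumcount()` and `DataFrame.melt` on a small synthetic frame. It assumes the same layout as the original (a `name` column plus value columns '0' to '6', three runs per name); the names `cfg_a` to `cfg_d` and the columns `run` and `subject` are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
value_cols = [str(i) for i in range(7)]

# Hypothetical wide table: 4 names x 3 runs, 7 value columns per run.
names = np.repeat(['cfg_a', 'cfg_b', 'cfg_c', 'cfg_d'], 3)
wide = pd.DataFrame(rng.normal(size=(len(names), 7)), columns=value_cols)
wide.insert(0, 'name', names)

# Number the runs within each name (0, 1, 2) instead of tiling by hand.
wide['run'] = wide.groupby('name').cumcount()

# One long row per (name, run, value column).
long_df = wide.melt(id_vars=['name', 'run'], value_vars=value_cols,
                    var_name='within', value_name='response')

# As in the original cell, each name/value-column pair acts as a subject.
long_df['subject'] = long_df['name'] + '_' + long_df['within']

melt_result = AnovaRM(long_df, depvar='response', subject='subject',
                      within=['run']).fit()
print(melt_result)
```

With this layout the row counts follow from the data, so the reshape keeps working if the number of names or runs changes, as long as the design stays balanced.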
help(MANOVA)
Help on class MANOVA in module statsmodels.multivariate.manova:
class MANOVA(statsmodels.base.model.Model)
| MANOVA(endog, exog, missing='none', hasconst=None, **kwargs)
|
| Multivariate Analysis of Variance
|
| The implementation of MANOVA is based on multivariate regression and does
| not assume that the explanatory variables are categorical. Any type of
| variables as in regression is allowed.
|
| Parameters
| ----------
| endog : array_like
| Dependent variables. A nobs x k_endog array where nobs is
| the number of observations and k_endog is the number of dependent
| variables.
| exog : array_like
| Independent variables. A nobs x k_exog array where nobs is the
| number of observations and k_exog is the number of independent
| variables. An intercept is not included by default and should be added
| by the user. Models specified using a formula include an intercept by
| default.
|
| Attributes
| ----------
| endog : ndarray
| See Parameters.
| exog : ndarray
| See Parameters.
|
| Notes
| -----
| MANOVA is used though the `mv_test` function, and `fit` is not used.
|
| The ``from_formula`` interface is the recommended method to specify
| a model and simplifies testing without needing to manually configure
| the contrast matrices.
|
| References
| ----------
| .. [*] ftp://public.dhe.ibm.com/software/analytics/spss/documentation/
| statistics/20.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf
|
| Method resolution order:
| MANOVA
| statsmodels.base.model.Model
| builtins.object
|
| Methods defined here:
|
| __init__(self, endog, exog, missing='none', hasconst=None, **kwargs)
| Initialize self. See help(type(self)) for accurate signature.
|
| fit(self)
| Fit a model to data.
|
| mv_test(self, hypotheses=None)
| Linear hypotheses testing
|
| Parameters
| ----------
| hypotheses : list[tuple]
| Hypothesis `L*B*M = C` to be tested where B is the parameters in
| regression Y = X*B. Each element is a tuple of length 2, 3, or 4:
|
| * (name, contrast_L)
| * (name, contrast_L, transform_M)
| * (name, contrast_L, transform_M, constant_C)
|
| containing a string `name`, the contrast matrix L, the transform
| matrix M (for transforming dependent variables), and right-hand side
| constant matrix constant_C, respectively.
|
| contrast_L : 2D array or an array of strings
| Left-hand side contrast matrix for hypotheses testing.
| If 2D array, each row is an hypotheses and each column is an
| independent variable. At least 1 row
| (1 by k_exog, the number of independent variables) is required.
| If an array of strings, it will be passed to
| patsy.DesignInfo().linear_constraint.
|
| transform_M : 2D array or an array of strings or None, optional
| Left hand side transform matrix.
| If `None` or left out, it is set to a k_endog by k_endog
| identity matrix (i.e. do not transform y matrix).
| If an array of strings, it will be passed to
| patsy.DesignInfo().linear_constraint.
|
| constant_C : 2D array or None, optional
| Right-hand side constant matrix.
| if `None` or left out it is set to a matrix of zeros
| Must has the same number of rows as contrast_L and the same
| number of columns as transform_M
|
| If `hypotheses` is None: 1) the effect of each independent variable
| on the dependent variables will be tested. Or 2) if model is created
| using a formula, `hypotheses` will be created according to
| `design_info`. 1) and 2) is equivalent if no additional variables
| are created by the formula (e.g. dummy variables for categorical
| variables and interaction terms)
|
|
| Returns
| -------
| results: MultivariateTestResults
|
| Notes
| -----
| Testing the linear hypotheses
|
| L * params * M = 0
|
| where `params` is the regression coefficient matrix for the
| linear model y = x * params
|
| If the model is not specified using the formula interfact, then the
| hypotheses test each included exogenous variable, one at a time. In
| most applications with categorical variables, the ``from_formula``
| interface should be preferred when specifying a model since it
| provides knowledge about the model when specifying the hypotheses.
|
| ----------------------------------------------------------------------
| Methods inherited from statsmodels.base.model.Model:
|
| predict(self, params, exog=None, *args, **kwargs)
| After a model has been fit predict returns the fitted values.
|
| This is a placeholder intended to be overwritten by individual models.
|
| ----------------------------------------------------------------------
| Class methods inherited from statsmodels.base.model.Model:
|
| from_formula(formula, data, subset=None, drop_cols=None, *args, **kwargs) from builtins.type
| Create a Model from a formula and dataframe.
|
| Parameters
| ----------
| formula : str or generic Formula object
| The formula specifying the model.
| data : array_like
| The data for the model. See Notes.
| subset : array_like
| An array-like object of booleans, integers, or index values that
| indicate the subset of df to use in the model. Assumes df is a
| `pandas.DataFrame`.
| drop_cols : array_like
| Columns to drop from the design matrix. Cannot be used to
| drop terms involving categoricals.
| *args
| Additional positional argument that are passed to the model.
| **kwargs
| These are passed to the model with one exception. The
| ``eval_env`` keyword is passed to patsy. It can be either a
| :class:`patsy:patsy.EvalEnvironment` object or an integer
| indicating the depth of the namespace to use. For example, the
| default ``eval_env=0`` uses the calling namespace. If you wish
| to use a "clean" environment set ``eval_env=-1``.
|
| Returns
| -------
| model
| The model instance.
|
| Notes
| -----
| data must define __getitem__ with the keys in the formula terms
| args and kwargs are passed on to the model instantiation. E.g.,
| a numpy structured or rec array, a dictionary, or a pandas DataFrame.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from statsmodels.base.model.Model:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| endog_names
| Names of endogenous variables.
|
| exog_names
| Names of exogenous variables.
n_samples = 20
n_dim = 5
n_classes = 3
X = np.random.randn(n_samples, n_dim)
y = np.random.randint(n_classes, size=n_samples)
print(X.shape)
print(y.shape)
manova = MANOVA(endog=X, exog=y)
print(manova.mv_test())
(20, 5)
(20,)
Multivariate linear model
============================================================
------------------------------------------------------------
x0 Value Num DF Den DF F Value Pr > F
------------------------------------------------------------
Wilks' lambda 0.8411 5.0000 15.0000 0.5666 0.7244
Pillai's trace 0.1589 5.0000 15.0000 0.5666 0.7244
Hotelling-Lawley trace 0.1889 5.0000 15.0000 0.5666 0.7244
Roy's greatest root 0.1889 5.0000 15.0000 0.5666 0.7244
============================================================
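In the call above, `y` holds integer class labels but enters the model as a single numeric regressor with no intercept, which is why the output shows one term, `x0`. If the labels are meant as groups, one option is the `from_formula` interface recommended in the help text, with the label wrapped in `C(...)`. This is a sketch on synthetic data; the column names `x0` to `x4` and `label` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
n_samples, n_dim, n_classes = 30, 5, 3

# Synthetic responses plus a label column that cycles through the classes,
# so every class is guaranteed to appear.
data = pd.DataFrame(rng.normal(size=(n_samples, n_dim)),
                    columns=[f'x{i}' for i in range(n_dim)])
data['label'] = np.arange(n_samples) % n_classes

# All response columns on the left, the categorical label on the right.
formula = ' + '.join(data.columns[:n_dim]) + ' ~ C(label)'
formula_res = MANOVA.from_formula(formula, data=data).mv_test()
print(formula_res)
```

The formula adds an intercept by default and dummy-codes the label, so the `C(label)` block of the output tests the group effect as a whole.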
avg_df = df.groupby('name',axis=0, as_index=False).mean()
avg_df['dataset'] = avg_df['name'].apply(lambda x: x.split('-')[0])
avg_df['method'] = avg_df['name'].apply(lambda x: x.split('-')[1])
avg_df['sensitive_attr'] = avg_df['name'].apply(lambda x: x.split('-')[2])
X = avg_df[['0','1','2','3','4','5','6']].to_numpy()
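# BUG: the trailing [0] keeps only the first method name (a single str) instead of the
# whole column, so MANOVA receives a str as exog and both calls below fail.
y = avg_df['method'].to_numpy()[0]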
manova = MANOVA(endog=X, exog=y)
print(manova.mv_test())
ValueError: unrecognized data structures: <class 'numpy.ndarray'> / <class 'str'>
print(X.shape, y.shape)
AttributeError: 'str' object has no attribute 'shape'
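Both errors above trace back to `y` being a single string rather than an array. Dropping the `[0]` is not enough on its own, because the array interface expects a numeric `exog`. One possible fix, sketched here on a synthetic stand-in for `avg_df` (run_results.csv is not available here, so the method names `m_a`, `m_b`, `m_c` and all values are made up), is to dummy-code the method column, add an intercept, and test all dummies jointly with an explicit contrast matrix.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
value_cols = [str(i) for i in range(7)]

# Hypothetical stand-in for avg_df: 24 rows and 3 made-up method names.
toy_avg = pd.DataFrame(rng.normal(size=(24, 7)), columns=value_cols)
toy_avg['method'] = np.tile(['m_a', 'm_b', 'm_c'], 8)

X = toy_avg[value_cols].to_numpy()

# Keep the whole column (no [0]) and dummy-code it; the first level is the baseline.
dummies = pd.get_dummies(toy_avg['method'], drop_first=True, dtype=float)
exog = np.column_stack([np.ones(len(toy_avg)), dummies.to_numpy()])

# Contrast matrix: one row per dummy column, zero weight on the intercept,
# so the hypothesis tests all method dummies at once.
n_dummies = dummies.shape[1]
L = np.hstack([np.zeros((n_dummies, 1)), np.eye(n_dummies)])

mv = MANOVA(endog=X, exog=exog).mv_test(hypotheses=[('method', L)])
print(mv)
print(X.shape, exog.shape)
```

Without the explicit hypothesis, `mv_test()` on the array interface tests each column separately, which would report the intercept and each dummy as its own term rather than a single method effect.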