/shared-libs/python3.7/py-core/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3173: DtypeWarning: Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 956638 entries, 0 to 956637
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YearStart 956638 non-null int64
1 YearEnd 956638 non-null int64
2 LocationAbbr 956638 non-null object
3 LocationDesc 956638 non-null object
4 DataSource 956638 non-null object
5 Topic 956638 non-null object
6 Question 956638 non-null object
7 Response 0 non-null float64
8 DataValueUnit 830938 non-null object
9 DataValueType 956638 non-null object
10 DataValue 646040 non-null object
11 DataValueAlt 644213 non-null float64
12 DataValueFootnoteSymbol 323640 non-null object
13 DatavalueFootnote 323640 non-null object
14 LowConfidenceLimit 548944 non-null float64
15 HighConfidenceLimit 548944 non-null float64
16 StratificationCategory1 956638 non-null object
17 Stratification1 956638 non-null object
18 StratificationCategory2 0 non-null float64
19 Stratification2 0 non-null float64
20 StratificationCategory3 0 non-null float64
21 Stratification3 0 non-null float64
22 GeoLocation 948566 non-null object
23 ResponseID 0 non-null float64
24 LocationID 956638 non-null int64
25 TopicID 956638 non-null object
26 QuestionID 956638 non-null object
27 DataValueTypeID 956638 non-null object
28 StratificationCategoryID1 956638 non-null object
29 StratificationID1 956638 non-null object
30 StratificationCategoryID2 0 non-null float64
31 StratificationID2 0 non-null float64
32 StratificationCategoryID3 0 non-null float64
33 StratificationID3 0 non-null float64
dtypes: float64(13), int64(3), object(18)
memory usage: 248.2+ MB
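The DtypeWarning at the top of this output refers to column 10, which the summary above lists as `DataValue` with dtype `object`. One possible way to get a numeric column is sketched below. It assumes the source is a CSV file and that non-numeric entries in `DataValue` can be treated as missing; the in-memory sample and its contents are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: DataValue mixes numbers and text.
sample_csv = io.StringIO(
    "YearStart,DataValue\n"
    "2010,12.5\n"
    "2011,No data\n"
    "2012,7\n"
)

# Read the mixed column as strings so pandas does not guess a dtype per chunk,
# then coerce to numbers; entries that cannot be parsed become NaN.
df = pd.read_csv(sample_csv, dtype={"DataValue": str})
df["DataValue"] = pd.to_numeric(df["DataValue"], errors="coerce")

print(df.dtypes)
```

Coercing with `errors="coerce"` silently discards text entries, so it is worth counting the resulting NaN values before modelling.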
Figure A
Figure B
The figure below shows disease rates grouped by gender over 2008-2019. We want to explore the prominence of specific diseases, such as asthma and diabetes, and how their prevalence differs between the two genders.
Figure C
The figure shows the prevalence of diseases by gender. It helps us identify diseases that merit further research and narrow our scope not only by disease but also by the questions associated with each disease.
Figure D
The figure shows the prevalence of diseases by race/ethnicity. It helps us identify diseases that merit further research and narrow our scope not only by disease but also by the questions associated with each disease. We can also determine which diseases are more prevalent within a race/ethnicity group we may focus on.
Research Question Results
– Summarize and interpret your models.
– Estimate any uncertainty in your GLM predictions, providing clear quantitative statements of the uncertainty in plain English.
Checkpoint
Comparing GLM and non-parametric methods: DT/RF/OLS
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: DataValue No. Observations: 1885
Model: GLM Df Residuals: 1879
Model Family: Poisson Df Model: 5
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -17943.
Date: Fri, 03 Dec 2021 Deviance: 22647.
Time: 23:58:27 Pearson chi2: 2.26e+04
No. Iterations: 4 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
===================================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
const 19.6886 1.076 18.293 0.000 17.579 21.798
Stratification_American Indian or Alaska Native 4.0466 0.215 18.792 0.000 3.625 4.469
Stratification_Asian or Pacific Islander 3.5735 0.215 16.597 0.000 3.152 3.996
Stratification_Black, non-Hispanic 4.3510 0.215 20.212 0.000 3.929 4.773
Stratification_Hispanic 3.6022 0.215 16.732 0.000 3.180 4.024
Stratification_White, non-Hispanic 4.1153 0.215 19.119 0.000 3.693 4.537
YearStart -0.0091 0.001 -14.254 0.000 -0.010 -0.008
===================================================================================================================
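Because this model uses a log link, a coefficient is easier to read after exponentiating it. The sketch below converts the `YearStart` row of the summary above into a rate ratio and a percent change per year. The numbers are copied from the table; the helper name `pct_change` is introduced here for illustration, and the interval is the Wald interval as printed by the summary, so it is only as trustworthy as the Poisson standard errors.

```python
import math

# Values copied from the YearStart row of the Poisson summary above.
coef, ci_low, ci_high = -0.0091, -0.010, -0.008

def pct_change(log_coef):
    """Percent change in the expected outcome per one-unit increase, under a log link."""
    return (math.exp(log_coef) - 1.0) * 100.0

rate_ratio = math.exp(coef)
print(f"rate ratio per year: {rate_ratio:.4f}")
print(
    f"percent change per year: {pct_change(coef):.2f}% "
    f"(95% CI {pct_change(ci_low):.2f}% to {pct_change(ci_high):.2f}%)"
)
```

In plain English: holding the stratification group fixed, this fit estimates that the expected value falls by a little under 1% per year.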
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: DataValue No. Observations: 1885
Model: GLM Df Residuals: 1880
Model Family: Poisson Df Model: 4
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -18045.
Date: Fri, 03 Dec 2021 Deviance: 22850.
Time: 23:58:08 Pearson chi2: 2.28e+04
No. Iterations: 100 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
===================================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
const 4.3467 0.001 2974.357 0.000 4.344 4.350
Stratification_American Indian or Alaska Native 0.9776 0.004 264.129 0.000 0.970 0.985
Stratification_Asian or Pacific Islander 0.5049 0.004 124.518 0.000 0.497 0.513
Stratification_Black, non-Hispanic 1.2827 0.003 444.475 0.000 1.277 1.288
Stratification_Hispanic 0.5340 0.004 138.829 0.000 0.526 0.542
Stratification_White, non-Hispanic 1.0475 0.003 355.788 0.000 1.042 1.053
===================================================================================================================
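A quick diagnostic that can be read off the two Poisson summaries above is the Pearson chi-square divided by the residual degrees of freedom. A Poisson model expects this ratio to be close to 1; values well above 1 are commonly taken as a sign of overdispersion. The sketch below only redoes that arithmetic with the printed (rounded) values; the labels are ours.

```python
# Pearson chi-square and residual degrees of freedom, copied from the
# two Poisson summaries above.
fits = {
    "with YearStart": (2.26e4, 1879),
    "without YearStart": (2.28e4, 1880),
}

# Ratio of Pearson chi-square to residual degrees of freedom for each fit.
dispersion = {name: chi2 / df_resid for name, (chi2, df_resid) in fits.items()}

for name, value in dispersion.items():
    print(f"{name}: Pearson chi2 / df = {value:.1f}")
```

Both ratios come out near 12, which is one reason to compare against a negative binomial family.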
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: DataValue No. Observations: 1885
Model: GLM Df Residuals: 1879
Model Family: NegativeBinomial Df Model: 5
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -11728.
Date: Sat, 04 Dec 2021 Deviance: 126.36
Time: 00:00:26 Pearson chi2: 122.
No. Iterations: 5 Pseudo R-squ. (CS): 0.08798
Covariance Type: nonrobust
===================================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
const 20.4320 15.011 1.361 0.173 -8.989 49.853
Stratification_American Indian or Alaska Native 4.1949 3.003 1.397 0.162 -1.691 10.081
Stratification_Asian or Pacific Islander 3.7218 3.003 1.239 0.215 -2.163 9.607
Stratification_Black, non-Hispanic 4.4999 3.002 1.499 0.134 -1.385 10.384
Stratification_Hispanic 3.7512 3.002 1.249 0.212 -2.133 9.636
Stratification_White, non-Hispanic 4.2642 3.002 1.420 0.155 -1.620 10.148
YearStart -0.0096 0.009 -1.072 0.284 -0.027 0.008
===================================================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: DataValue No. Observations: 1885
Model: GLM Df Residuals: 1880
Model Family: NegativeBinomial Df Model: 4
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -11728.
Date: Sat, 04 Dec 2021 Deviance: 127.52
Time: 00:01:09 Pearson chi2: 123.
No. Iterations: 100 Pseudo R-squ. (CS): 0.08742
Covariance Type: nonrobust
===================================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
const 4.3467 0.020 222.861 0.000 4.308 4.385
Stratification_American Indian or Alaska Native 0.9776 0.053 18.535 0.000 0.874 1.081
Stratification_Asian or Pacific Islander 0.5049 0.047 10.726 0.000 0.413 0.597
Stratification_Black, non-Hispanic 1.2828 0.046 27.901 0.000 1.193 1.373
Stratification_Hispanic 0.5340 0.046 11.707 0.000 0.445 0.623
Stratification_White, non-Hispanic 1.0474 0.043 24.422 0.000 0.963 1.131
===================================================================================================================
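With a log link, the fitted mean for a group is the exponential of the intercept plus that group's coefficient. The sketch below applies this to the summary directly above, assuming each row has exactly one stratification indicator set to 1 (an assumption about the design matrix, not something the output states). Coefficients are copied from the table as printed.

```python
import math

# Coefficients copied from the negative binomial summary without YearStart.
const = 4.3467
log_coefs = {
    "American Indian or Alaska Native": 0.9776,
    "Asian or Pacific Islander": 0.5049,
    "Black, non-Hispanic": 1.2828,
    "Hispanic": 0.5340,
    "White, non-Hispanic": 1.0474,
}

# Fitted mean on the original scale: exp(intercept + group coefficient).
predicted = {group: math.exp(const + b) for group, b in log_coefs.items()}

for group, value in predicted.items():
    print(f"{group}: {value:.1f}")
```

An interval for these fitted means would need the covariance between the intercept and each group coefficient, which the summary table does not print, so only point estimates are shown here.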
The glm module is deprecated and will be removed in version 4.0
We recommend to instead use Bambi https://bambinos.github.io/bambi/
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:6: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu, White, Hispanic, Black, Asian, AI_AN, Intercept]
/root/venv/lib/python3.7/site-packages/arviz/stats/density_utils.py:481: UserWarning: Your data appears to have a single value or no finite values
warnings.warn("Your data appears to have a single value or no finite values")
/shared-libs/python3.7/py/lib/python3.7/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)
Decision Trees
Training set error for decision tree: 49.60958958718125
Test set error for decision tree: 46.247496077127046
Training set error for decision tree: 65.14973499797053
Test set error for decision tree: 98.09461339452832
Training set error for decision tree: 0.0
Test set error for decision tree: 22.635548123897024
OLS Regression Results
==============================================================================
Dep. Variable: DataValue R-squared: 0.581
Model: OLS Adj. R-squared: 0.580
Method: Least Squares F-statistic: 651.3
Date: Fri, 03 Dec 2021 Prob (F-statistic): 0.00
Time: 06:32:14 Log-Likelihood: -10005.
No. Observations: 1885 AIC: 2.002e+04
Df Residuals: 1880 BIC: 2.005e+04
Df Model: 4
Covariance Type: nonrobust
===================================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
const 160.5836 0.951 168.813 0.000 158.718 162.449
Stratification_American Indian or Alaska Native 44.6673 2.573 17.359 0.000 39.621 49.714
Stratification_Asian or Pacific Islander -32.6409 2.294 -14.229 0.000 -37.140 -28.142
Stratification_Black, non-Hispanic 117.9050 2.244 52.537 0.000 113.504 122.306
Stratification_Hispanic -28.8671 2.223 -12.985 0.000 -33.227 -24.507
Stratification_White, non-Hispanic 59.5193 2.093 28.441 0.000 55.415 63.624
==============================================================================
Omnibus: 117.233 Durbin-Watson: 2.066
Prob(Omnibus): 0.000 Jarque-Bera (JB): 346.210
Skew: 0.291 Prob(JB): 6.63e-76
Kurtosis: 5.017 Cond. No. 3.54e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.81e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
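Note [2] is what one would expect if the design matrix contains an intercept together with an indicator column for every group, since the indicators then sum to the intercept column. Whether that is the cause here is an assumption; the toy example below, with hypothetical data for three groups, only illustrates the mechanism and the usual remedy of dropping one reference group.

```python
import numpy as np

# Hypothetical toy design: 6 rows, 3 groups, one indicator column per group.
groups = np.array([0, 0, 1, 1, 2, 2])
dummies = np.eye(3)[groups]                   # shape (6, 3), one-hot rows
const = np.ones((len(groups), 1))

full = np.hstack([const, dummies])            # intercept + all indicators
reduced = np.hstack([const, dummies[:, 1:]])  # intercept + all but a reference group

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("full design: columns =", full.shape[1], "rank =", rank_full)
print("reduced design: columns =", reduced.shape[1], "rank =", rank_reduced)
```

When the rank is smaller than the number of columns, the individual coefficients are not uniquely determined, although sums such as intercept plus a group coefficient can still be.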
The estimated causal effect on 'Mortality from total cardiovascular diseases' of 'Stratification_American Indian or Alaska Native' is 44.67
The estimated causal effect on 'Mortality from total cardiovascular diseases' of 'Stratification_Asian or Pacific Islander' is -32.64
OLS Regression Results
=======================================================================================
Dep. Variable: DataValue R-squared (uncentered): 0.145
Model: OLS Adj. R-squared (uncentered): 0.144
Method: Least Squares F-statistic: 318.6
Date: Fri, 03 Dec 2021 Prob (F-statistic): 5.92e-66
Time: 06:41:01 Log-Likelihood: -12585.
No. Observations: 1885 AIC: 2.517e+04
Df Residuals: 1884 BIC: 2.518e+04
Df Model: 1
Covariance Type: nonrobust
===================================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
Stratification_American Indian or Alaska Native 205.2509 11.500 17.848 0.000 182.697 227.805
==============================================================================
Omnibus: 101.234 Durbin-Watson: 0.568
Prob(Omnibus): 0.000 Jarque-Bera (JB): 119.185
Skew: -0.563 Prob(JB): 1.32e-26
Kurtosis: 3.500 Cond. No. 1.00
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Random Forest
Training set error for random forest: 50.21067356920616
Test set error for random forest: 45.58664310748402
Training set error for random forest: 70.34108554613819
Test set error for random forest: 75.01382729772637
Training set error for random forest: 17.58136891470574
Test set error for random forest: 20.667707505506552
Training set error for random forest: 9.626999111711388
Test set error for random forest: 23.730663840279597
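The output does not say which features or error metric each run used, so the runs can only be compared by position. As a rough sketch, the snippet below takes the four random forest pairs in the order printed above and computes the gap between test and training error, a simple indicator of overfitting. The run indices are our own labels.

```python
# (train error, test error) pairs copied from the random forest output above,
# in printed order.
runs = [
    (50.21067356920616, 45.58664310748402),
    (70.34108554613819, 75.01382729772637),
    (17.58136891470574, 20.667707505506552),
    (9.626999111711388, 23.730663840279597),
]

# Positive gap: the model does worse on unseen data than on the training data.
gaps = [test - train for train, test in runs]

# Index of the run with the lowest test error.
best_run = min(range(len(runs)), key=lambda i: runs[i][1])

for i, ((train, test), gap) in enumerate(zip(runs, gaps)):
    print(f"run {i}: train={train:.2f} test={test:.2f} gap={gap:.2f}")
print("lowest test error: run", best_run)
```

The last run has the lowest training error but the largest gap, while the third run has the lowest test error.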