In the USA Drug Overdose Analysis, we presented a comprehensive analysis of the crude death rate resulting from drug overdose in the US, comparing how the death rate changed across states and years. In this regression analysis, our goal is to build a regression model that explains the variation in drug overdose death rates among states. The candidate predictors come from socioeconomic indices and behavioral statistics for residents of the fifty states and Washington D.C.. We only use drug overdose death rates from 2015 - 2019 to guarantee data availability for the candidate predictors.
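The code in this section assumes the following packages are attached. This is a minimal setup sketch inferred from the functions called below, not the original setup chunk; plyr is attached before the tidyverse so it does not mask dplyr's verbs (the original code also calls dplyr::summarize() explicitly).
## packages assumed by the code in this section (inferred from the calls below)
library(plyr)                 # join_all()
library(tidyverse)            # read_csv, dplyr verbs, purrr, ggplot2
library(readxl)               # read_excel()
library(janitor)              # clean_names()
library(leaps)                # regsubsets()
library(PerformanceAnalytics) # chart.Correlation()
library(modelr)               # crossv_mc(), rmse()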
Poverty Rate: The poverty data is from the United States Census Bureau. We imported and cleaned it to obtain the poverty rate for the fifty states and Washington D.C. from 2015 - 2019, and then averaged the poverty rate over the five years.
## load the poverty rate data from 2015-2019
poverty_2015 =
read_csv("data_regression_us/povety/poverty_us_2015.csv") %>%
select(NAME, S1701_C03_001E)
poverty_2015 =
poverty_2015[-1,] %>%
mutate(
poverty_population_2015 = as.numeric(S1701_C03_001E),
state = NAME
) %>%
select(-c(NAME, S1701_C03_001E)) %>%
relocate(state)
poverty_2016 =
read_csv("data_regression_us/povety/poverty_us_2016.csv") %>%
select(NAME, S1701_C03_001E)
poverty_2016 =
poverty_2016[-1,] %>%
mutate(
poverty_population_2016 = as.numeric(S1701_C03_001E),
state = NAME
) %>%
select(-c(NAME, S1701_C03_001E)) %>%
relocate(state)
poverty_2017 =
read_csv("data_regression_us/povety/poverty_us_2017.csv") %>%
select(NAME, S1701_C03_001E)
poverty_2017 =
poverty_2017[-1,] %>%
mutate(
poverty_population_2017 = as.numeric(S1701_C03_001E),
state = NAME
) %>%
select(-c(NAME, S1701_C03_001E)) %>%
relocate(state)
poverty_2018 =
read_csv("data_regression_us/povety/poverty_us_2018.csv") %>%
select(NAME, S1701_C03_001E)
poverty_2018 =
poverty_2018[-1,] %>%
mutate(
poverty_population_2018 = as.numeric(S1701_C03_001E),
state = NAME
) %>%
select(-c(NAME, S1701_C03_001E)) %>%
relocate(state)
poverty_2019 =
read_csv("data_regression_us/povety/poverty_us_2019.csv") %>%
select(NAME, S1701_C03_001E)
poverty_2019 =
poverty_2019[-1,] %>%
mutate(
poverty_population_2019 = as.numeric(S1701_C03_001E),
state = NAME
) %>%
select(-c(NAME, S1701_C03_001E)) %>%
relocate(state)
## combine the datasets, take average
poverty_overall =
join_all(list(poverty_2019, poverty_2018, poverty_2017, poverty_2016, poverty_2015))
poverty_overall =
poverty_overall %>%
mutate(sum_of_rows = rowSums((poverty_overall[,2:6]), na.rm = TRUE),
mean_poverty_rate = 0.01 * sum_of_rows/5) %>%
select(state, mean_poverty_rate)
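As a side note, the five per-year import blocks above follow an identical pattern; the sketch below shows how they could be collapsed with purrr (names such as poverty_overall_alt are hypothetical and the result is not used later). It assumes every yearly file has the same layout; note that the original divides the row sum by 5, so a missing year counts as zero, whereas the mean below would skip it.
## sketch only: collapse the repeated per-year imports into one purrr pipeline
poverty_overall_alt =
  map_dfr(2015:2019, function(yr) {
    read_csv(str_c("data_regression_us/povety/poverty_us_", yr, ".csv")) %>%
      slice(-1) %>%
      transmute(state = NAME, year = yr,
                poverty_rate = as.numeric(S1701_C03_001E))
  }) %>%
  group_by(state) %>%
  dplyr::summarize(mean_poverty_rate = 0.01 * mean(poverty_rate, na.rm = TRUE))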
Population: The population data is from the United States Census Bureau. We imported the total population for the fifty states and Washington D.C. from the Census Bureau's 5-year estimates (table B01003). Population is not itself a predictor; it is only used to compute other predictors from the available counts.
population_us =
read_csv("data_regression_us/population/ACSDT5Y2019.B01003_data_with_overlays_2021-11-16T170855.csv")
population_us =
population_us[-1,] %>%
mutate(
state = NAME,
population = as.numeric(B01003_001E)
) %>%
select(state, population)
Education Rate: The education data is from the United States Census Bureau. We imported the data for the fifty states and Washington D.C. from 2015 - 2019; it contains the number of people at each educational attainment level. We treat a bachelor’s degree as the threshold for “high education”, so we sum the number of people holding a bachelor’s degree or above, average that count over the five years, and calculate the education rate by dividing the highly educated population by the total population. The education rate therefore means the proportion of the population with a bachelor’s degree or higher.
## load the education data from 2015-2019
education_2015 =
read_csv("data_regression_us/education/education_us_2015.csv")
education_2015 =
education_2015[-1,] %>%
select(B15003_022E, B15003_023E, B15003_024E, B15003_025E, NAME) %>%
mutate(
number_bachelor = as.numeric(B15003_022E),
number_master = as.numeric(B15003_023E),
number_profesional = as.numeric(B15003_024E),
number_doctor = as.numeric(B15003_025E),
state = NAME
) %>%
select(state, number_bachelor, number_master, number_profesional, number_doctor) %>%
mutate(sum_high_educ_2015 = number_bachelor + number_master + number_profesional + number_doctor) %>%
select(state, sum_high_educ_2015)
education_2016 =
read_csv("data_regression_us/education/education_us_2016.csv")
education_2016 =
education_2016[-1,] %>%
select(B15003_022E, B15003_023E, B15003_024E, B15003_025E, NAME) %>%
mutate(
number_bachelor = as.numeric(B15003_022E),
number_master = as.numeric(B15003_023E),
number_profesional = as.numeric(B15003_024E),
number_doctor = as.numeric(B15003_025E),
state = NAME
) %>%
select(state, number_bachelor, number_master, number_profesional, number_doctor) %>%
mutate(sum_high_educ_2016 = number_bachelor + number_master + number_profesional + number_doctor) %>%
select(state, sum_high_educ_2016)
education_2017 =
read_csv("data_regression_us/education/education_us_2017.csv")
education_2017 =
education_2017[-1,] %>%
select(B15003_022E, B15003_023E, B15003_024E, B15003_025E, NAME) %>%
mutate(
number_bachelor = as.numeric(B15003_022E),
number_master = as.numeric(B15003_023E),
number_profesional = as.numeric(B15003_024E),
number_doctor = as.numeric(B15003_025E),
state = NAME
) %>%
select(state, number_bachelor, number_master, number_profesional, number_doctor) %>%
mutate(sum_high_educ_2017 = number_bachelor + number_master + number_profesional + number_doctor) %>%
select(state, sum_high_educ_2017)
education_2018 =
read_csv("data_regression_us/education/education_us_2018.csv")
education_2018 =
education_2018[-1,] %>%
select(B15003_022E, B15003_023E, B15003_024E, B15003_025E, NAME) %>%
mutate(
number_bachelor = as.numeric(B15003_022E),
number_master = as.numeric(B15003_023E),
number_profesional = as.numeric(B15003_024E),
number_doctor = as.numeric(B15003_025E),
state = NAME
) %>%
select(state, number_bachelor, number_master, number_profesional, number_doctor) %>%
mutate(sum_high_educ_2018 = number_bachelor + number_master + number_profesional + number_doctor) %>%
select(state, sum_high_educ_2018)
education_2019 =
read_csv("data_regression_us/education/education_us_2019.csv")
education_2019 =
education_2019[-1,] %>%
select(B15003_022E, B15003_023E, B15003_024E, B15003_025E, NAME) %>%
mutate(
number_bachelor = as.numeric(B15003_022E),
number_master = as.numeric(B15003_023E),
number_profesional = as.numeric(B15003_024E),
number_doctor = as.numeric(B15003_025E),
state = NAME
) %>%
select(state, number_bachelor, number_master, number_profesional, number_doctor) %>%
mutate(sum_high_educ_2019 = number_bachelor + number_master + number_profesional + number_doctor) %>%
select(state, sum_high_educ_2019)
## combine to produce overall education data, take average for 5 years
education_overall =
join_all(list(education_2019, education_2018, education_2017, education_2016, education_2015))
education_overall =
education_overall %>%
mutate(sum_of_rows = rowSums((education_overall[,2:6]), na.rm = TRUE),
mean_educ = sum_of_rows/5) %>%
select(state, mean_educ)
## compute education rate
education_rate =
left_join(education_overall, population_us) %>%
mutate(education_rate = mean_educ / population) %>%
select(state, education_rate)
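For reference only, each per-year education block above could also convert the four attainment columns in one step with across(); a sketch for a single year, using the same file and column codes (education_2019_alt is a hypothetical name, not used later):
## sketch only: same 2019 computation, with across() doing the numeric conversion
education_2019_alt =
  read_csv("data_regression_us/education/education_us_2019.csv") %>%
  slice(-1) %>%
  mutate(across(c(B15003_022E, B15003_023E, B15003_024E, B15003_025E), as.numeric)) %>%
  transmute(
    state = NAME,
    sum_high_educ_2019 = B15003_022E + B15003_023E + B15003_024E + B15003_025E
  )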
Unemployment Rate: The unemployment data is from the United States Census Bureau. We imported and cleaned it to obtain the unemployment rate for the fifty states and Washington D.C. from 2015 - 2019, and then averaged the unemployment rate over the five years.
## load the unemployment 2015-2019
unemployment_2015 =
read_csv("data_regression_us/unemployment/unemployment_2015.csv")
unemployment_2015 =
unemployment_2015[-1,] %>%
select(NAME, S2301_C04_001E) %>%
mutate(
unemployment_rate_2015 = as.numeric(S2301_C04_001E),
state = NAME
) %>%
select(state, unemployment_rate_2015)
unemployment_2016 =
read_csv("data_regression_us/unemployment/unemployment_2016.csv")
unemployment_2016 =
unemployment_2016[-1,] %>%
select(NAME, S2301_C04_001E) %>%
mutate(
unemployment_rate_2016 = as.numeric(S2301_C04_001E),
state = NAME
) %>%
select(state, unemployment_rate_2016)
unemployment_2017 =
read_csv("data_regression_us/unemployment/unemployment_2017.csv")
unemployment_2017 =
unemployment_2017[-1,] %>%
select(NAME, S2301_C04_001E) %>%
mutate(
unemployment_rate_2017 = as.numeric(S2301_C04_001E),
state = NAME
) %>%
select(state, unemployment_rate_2017)
unemployment_2018 =
read_csv("data_regression_us/unemployment/unemployment_2018.csv")
unemployment_2018 =
unemployment_2018[-1,] %>%
select(NAME, S2301_C04_001E) %>%
mutate(
unemployment_rate_2018 = as.numeric(S2301_C04_001E),
state = NAME
) %>%
select(state, unemployment_rate_2018)
unemployment_2019 =
read_csv("data_regression_us/unemployment/unemployment_2019.csv")
unemployment_2019 =
unemployment_2019[-1,] %>%
select(NAME, S2301_C04_001E) %>%
mutate(
unemployment_rate_2019 = as.numeric(S2301_C04_001E),
state = NAME
) %>%
select(state, unemployment_rate_2019)
## combine to overall unemployment df, take average of unemployment rate for 5 years
unemployment_overall =
join_all(list(unemployment_2019, unemployment_2018, unemployment_2017, unemployment_2016, unemployment_2015))
unemployment_overall =
unemployment_overall %>%
mutate(sum_of_rows = rowSums((unemployment_overall[,2:6]), na.rm = TRUE),
mean_unemployment_rate = 0.01 * sum_of_rows/5) %>%
select(state, mean_unemployment_rate)
Crude Divorce Rate: The divorce data is from the United States Census Bureau. We imported the data for the fifty states and Washington D.C. from 2015 - 2019; it contains the number of people who are divorced. We average that count over the five years, and the crude divorce rate is calculated by dividing the divorced population by the total population.
## load the divorce count data from 2015-2019
divorce_2015 =
read_csv("data_regression_us/divorce/divorce_number_2015.csv")
divorce_2015 =
divorce_2015[-1,] %>%
select(NAME, B12503_004E, B12503_009E) %>%
mutate(
B12503_004E = as.numeric(B12503_004E),
B12503_009E = as.numeric(B12503_009E),
num_divorce = B12503_004E + B12503_009E,
divorce_num_2015 = as.numeric(num_divorce),
state = NAME
) %>%
select(state, divorce_num_2015)
divorce_2016 =
read_csv("data_regression_us/divorce/divorce_number_2016.csv")
divorce_2016 =
divorce_2016[-1,] %>%
select(NAME, B12503_004E, B12503_009E) %>%
mutate(
B12503_004E = as.numeric(B12503_004E),
B12503_009E = as.numeric(B12503_009E),
num_divorce = B12503_004E + B12503_009E,
divorce_num_2016 = as.numeric(num_divorce),
state = NAME
) %>%
select(state, divorce_num_2016)
divorce_2017 =
read_csv("data_regression_us/divorce/divorce_number_2017.csv")
divorce_2017 =
divorce_2017[-1,] %>%
select(NAME, B12503_004E, B12503_009E) %>%
mutate(
B12503_004E = as.numeric(B12503_004E),
B12503_009E = as.numeric(B12503_009E),
num_divorce = B12503_004E + B12503_009E,
divorce_num_2017 = as.numeric(num_divorce),
state = NAME
) %>%
select(state, divorce_num_2017)
divorce_2018 =
read_csv("data_regression_us/divorce/divorce_number_2018.csv")
divorce_2018 =
divorce_2018[-1,] %>%
select(NAME, B12503_004E, B12503_009E) %>%
mutate(
B12503_004E = as.numeric(B12503_004E),
B12503_009E = as.numeric(B12503_009E),
num_divorce = B12503_004E + B12503_009E,
divorce_num_2018 = as.numeric(num_divorce),
state = NAME
) %>%
select(state, divorce_num_2018)
divorce_2019 =
read_csv("data_regression_us/divorce/divorce_number_2019.csv")
divorce_2019 =
divorce_2019[-1,] %>%
select(NAME, B12503_004E, B12503_009E) %>%
mutate(
B12503_004E = as.numeric(B12503_004E),
B12503_009E = as.numeric(B12503_009E),
num_divorce = B12503_004E + B12503_009E,
divorce_num_2019 = as.numeric(num_divorce),
state = NAME
) %>%
select(state, divorce_num_2019)
## combine 5 df to a overall divorce count df, take average for 5 years
divorce_overall =
join_all(list(divorce_2019, divorce_2018, divorce_2017, divorce_2016, divorce_2015))
divorce_overall =
divorce_overall %>%
mutate(sum_of_rows = rowSums((divorce_overall[,2:6]), na.rm = TRUE),
mean_divorce_num = sum_of_rows/5) %>%
select(state, mean_divorce_num)
## compute crude divorce rate
divorce_rate =
left_join(divorce_overall, population_us) %>%
mutate(divorce_rate = mean_divorce_num / population) %>%
select(state, divorce_rate)
Smoke Rate: The smoke rate data is from America’s Health Rankings. The smoke rate is defined as “the percentage of adults who reported smoking at least 100 cigarettes in their lifetime and currently smoke daily or some days”. We imported and cleaned it to obtain the smoke rate for the fifty states and Washington D.C. from 2015 - 2019, and then averaged the smoke rate over the five years.
smoke_rate =
read_excel("data_regression_us/smoke/smoke_rate.xlsx")
smoke_rate =
smoke_rate %>%
mutate(sum_of_rows = rowSums((smoke_rate[,2:6]), na.rm = TRUE),
mean_smoke_rate = sum_of_rows/5) %>%
select(state, mean_smoke_rate)
Binge Drinking: The binge drinking data is collected from Statista and gives the 2019 binge drinking prevalence for the 51 jurisdictions (the fifty states and Washington D.C.). The prevalence of binge drinking is defined as the number of binge drinkers divided by the whole population within each jurisdiction.
state_level = c(state.name[1:8], "District of Columbia", state.name[9:50])

drinking_19 =
  read_excel("./data/statistic_id378966_us-binge-drinking-among-adults-by-state-2019.xlsx", range = "Data!B3:C57") %>%
  janitor::clean_names() %>%
  slice(-c(1, 2)) %>%
  mutate(binge_drink_rate = 0.01 * as.numeric(x2)) %>%
  select(state = u_s_binge_drinking_among_adults_by_state_2019, binge_drink_rate)
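The state_level vector defined above is not otherwise used in this chunk; one plausible use is sketched below, restricting the Statista table to the fifty states plus D.C. (drinking_19_states is a hypothetical name, and the sketch assumes the state names in the spreadsheet match R's state.name spellings exactly).
## sketch only: keep just the fifty states and the District of Columbia
drinking_19_states =
  drinking_19 %>%
  filter(state %in% state_level)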
Crude Death Rate: This is the response, also described in the USA overview section. In a few words, it is the number of deaths due to drug overdose per 100,000 people; note that the code below applies an extra 0.01 factor, so the values reported in the tables are deaths per 1,000 population.
crudedeath_rate =
  read.csv("./data/agegroup_race_state_year_99-19.csv") %>%
  janitor::clean_names() %>%
  select(state, year = year_code, age = ten_year_age_groups_code, race, deaths, population) %>%
  drop_na() %>%
  filter(year %in% c("2015", "2016", "2017", "2018", "2019")) %>%
  group_by(state) %>%
  dplyr::summarize(total_deaths = sum(deaths),
                   total_population = sum(population)) %>%
  ## (total_deaths / total_population) * 100000 is the rate per 100,000;
  ## the extra 0.01 factor rescales it, so the stored values are deaths per 1,000
  mutate(crudedeath_rate = 0.01 * (total_deaths/total_population) * 100000) %>%
  select(state, crudedeath_rate)
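Because of the 0.01 factor noted above, the stored crudedeath_rate is deaths per 1,000 population; a quick sketch for converting back to the conventional per-100,000 scale when interpreting the tables (deaths_per_100k is a hypothetical column, not used later):
## sketch only: 0.01 * (deaths / population) * 100000 = deaths per 1,000,
## so multiplying by 100 recovers deaths per 100,000
crudedeath_rate %>%
  mutate(deaths_per_100k = 100 * crudedeath_rate) %>%
  head()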
The overall regression data is produced by combining the datasets generated above. For each state, it contains the response – the crude death rate for drug overdose – and all candidate predictors – binge drinking rate, divorce rate, unemployment rate, education rate, poverty rate, and smoke rate. The following tables describe this data:
overall_regression_view =
  join_all(list(crudedeath_rate, drinking_19, divorce_rate, unemployment_overall, education_rate, poverty_overall, smoke_rate))

overall_regression =
  overall_regression_view %>%
  select(-state)

overall_regression_view %>% head(10) %>% knitr::kable(caption = "The First 10 Rows of Regression Overall Data")
state | crudedeath_rate | binge_drink_rate | divorce_rate | mean_unemployment_rate | education_rate | mean_poverty_rate | mean_smoke_rate |
---|---|---|---|---|---|---|---|
Alabama | 0.2270304 | 0.136 | 0.5606409 | 0.0598 | 0.1718737 | 0.1696 | 0.2082 |
Alaska | 0.2663021 | 0.182 | 0.5173449 | 0.0722 | 0.1932690 | 0.1046 | 0.1962 |
Arizona | 0.2960310 | 0.165 | 0.5365614 | 0.0594 | 0.1959374 | 0.1524 | 0.1496 |
Arkansas | 0.2089063 | 0.149 | 0.5776714 | 0.0516 | 0.1530968 | 0.1722 | 0.2364 |
California | 0.1640786 | 0.183 | 0.5099043 | 0.0606 | 0.2273319 | 0.1350 | 0.1160 |
Colorado | 0.2309491 | 0.196 | 0.5524384 | 0.0434 | 0.2791094 | 0.1034 | 0.1520 |
Connecticut | 0.4042932 | 0.177 | 0.5388083 | 0.0604 | 0.2712121 | 0.1006 | 0.1342 |
Delaware | 0.6017141 | 0.189 | 0.4362088 | 0.0542 | 0.2215250 | 0.1230 | 0.1770 |
District of Columbia | 0.8842317 | 0.272 | 0.3632256 | 0.0692 | 0.4124074 | 0.1644 | 0.1508 |
Florida | 0.2746527 | 0.180 | 0.5690009 | 0.0564 | 0.2120040 | 0.1414 | 0.1590 |
summary(overall_regression_view) %>% knitr::kable(caption = "Summary of The Response and The Predictors")
state | crudedeath_rate | binge_drink_rate | divorce_rate | mean_unemployment_rate | education_rate | mean_poverty_rate | mean_smoke_rate | |
---|---|---|---|---|---|---|---|---|
Length:51 | Min. :0.1182 | Min. :0.1200 | Min. :0.3632 | Min. :0.02740 | Min. :0.1449 | Min. :0.0762 | Min. :0.0910 | |
Class :character | 1st Qu.:0.2057 | 1st Qu.:0.1653 | 1st Qu.:0.5256 | 1st Qu.:0.04270 | 1st Qu.:0.1873 | 1st Qu.:0.1092 | 1st Qu.:0.1511 | |
Mode :character | Median :0.2710 | Median :0.1865 | Median :0.5497 | Median :0.05320 | Median :0.2061 | Median :0.1306 | Median :0.1718 | |
NA | Mean :0.3015 | Mean :0.1858 | Mean :0.5441 | Mean :0.05128 | Mean :0.2153 | Mean :0.1321 | Mean :0.1752 | |
NA | 3rd Qu.:0.3561 | 3rd Qu.:0.2010 | 3rd Qu.:0.5649 | 3rd Qu.:0.05900 | 3rd Qu.:0.2357 | 3rd Qu.:0.1502 | 3rd Qu.:0.1961 | |
NA | Max. :0.8842 | Max. :0.2720 | Max. :0.6018 | Max. :0.07460 | Max. :0.4124 | Max. :0.2038 | Max. :0.2568 | |
NA | NA | NA’s :1 | NA | NA | NA | NA | NA |
corr_matrix =
  overall_regression %>%
  chart.Correlation(histogram = TRUE, method = "pearson")
Comments:
For pairs of variables that are strongly correlated (***) with each other, we add their interaction to the regression model for further analysis.
The two-way interactions we consider are:
binge_drink_rate * education_rate
divorce_rate * education_rate
mean_unemployment_rate * mean_poverty_rate
education_rate * mean_poverty_rate
education_rate * mean_smoke_rate
mean_poverty_rate * mean_smoke_rate
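The interaction terms above were read off the correlation chart; as a programmatic cross-check, strongly correlated predictor pairs could also be listed directly from the correlation matrix. A minimal sketch, using a hypothetical absolute-correlation cutoff of 0.6 rather than the significance stars:
## sketch only: list predictor pairs whose pairwise Pearson correlation is large
strong_pairs =
  overall_regression %>%
  cor(use = "pairwise.complete.obs") %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  filter(var1 < var2, abs(r) >= 0.6,
         var1 != "crudedeath_rate", var2 != "crudedeath_rate")

strong_pairs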
Method:
The initial full model for crude death rate consists of all candidate predictors and the reasonable interactions listed above.
Model_full: crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate + binge_drink_rate * education_rate + mean_unemployment_rate * mean_poverty_rate + education_rate * mean_poverty_rate + education_rate * mean_smoke_rate + mean_poverty_rate * mean_smoke_rate + divorce_rate * education_rate
Here we use two modeling methods, stepwise selection with step() and best-subset selection with regsubsets(), and try to generate a better regression model for crude death rate.
stepwise =
step(lm(crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate + binge_drink_rate * education_rate + mean_unemployment_rate * mean_poverty_rate + education_rate * mean_poverty_rate + education_rate * mean_smoke_rate + mean_poverty_rate * mean_smoke_rate + divorce_rate * education_rate, data = overall_regression),direction = "both", trace = 0)
Based on the stepwise results, the selected model is Model_1.
Model_1: crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate + mean_poverty_rate:mean_smoke_rate + divorce_rate:education_rate
The fitted parameters of this model are shown below:
summary(stepwise) %>% broom::tidy() %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.1823514 | 0.7988419 | -0.2282697 | 0.8205712 |
binge_drink_rate | -0.9643137 | 0.6718818 | -1.4352432 | 0.1588030 |
divorce_rate | 1.1058355 | 1.4478464 | 0.7637795 | 0.4493718 |
mean_unemployment_rate | 3.1693281 | 2.1249877 | 1.4914571 | 0.1434962 |
education_rate | 6.4472365 | 2.3587499 | 2.7333277 | 0.0092123 |
mean_poverty_rate | -5.7853504 | 3.0285863 | -1.9102478 | 0.0631129 |
mean_smoke_rate | -1.0370218 | 2.3260266 | -0.4458340 | 0.6580627 |
mean_poverty_rate:mean_smoke_rate | 23.4829679 | 15.8225164 | 1.4841488 | 0.1454173 |
divorce_rate:education_rate | -9.4598001 | 4.9459427 | -1.9126384 | 0.0627972 |
summary(stepwise) %>% broom::glance() %>% knitr::kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.6163078 | 0.541441 | 0.0981217 | 8.23206 | 1.4e-06 | 8 | 41 | 50 |
reg_subsets =
regsubsets(crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate + binge_drink_rate * education_rate + mean_unemployment_rate * mean_poverty_rate + education_rate * mean_poverty_rate + education_rate * mean_smoke_rate + mean_poverty_rate * mean_smoke_rate + divorce_rate * education_rate, data = overall_regression, nvmax = 12)
summary(reg_subsets)
Method: The table below reports several selection criteria for the best model of each size. For each number of predictors, regsubsets() identifies the best model of that size, and the corresponding statistics are listed below.
cri_measure =
  cbind(
    cp = summary(reg_subsets)$cp,
    r2 = summary(reg_subsets)$rsq,
    adj_r2 = summary(reg_subsets)$adjr2,
    BIC = summary(reg_subsets)$bic
  )

cri_measure
## cp r2 adj_r2 BIC
## [1,] 24.568548 0.3055733 0.2911061 -10.4093868
## [2,] 14.151243 0.4277652 0.4034148 -16.1742284
## [3,] 6.317893 0.5245299 0.4935210 -21.5244707
## [4,] 3.981957 0.5671975 0.5287261 -22.3135710
## [5,] 3.236676 0.5942123 0.5481001 -21.6241191
## [6,] 3.301372 0.6132566 0.5592924 -20.1155275
## [7,] 4.689757 0.6192751 0.5558210 -16.9877323
## [8,] 6.326699 0.6228478 0.5492571 -13.5471160
## [9,] 7.034735 0.6355613 0.5535626 -11.3496172
## [10,] 9.007389 0.6358304 0.5424536 -7.4745279
## [11,] 11.001814 0.6358853 0.5304836 -3.5700381
## [12,] 13.000000 0.6359031 0.5178176 0.3395343
Based on Cp, the “best” model contains five predictors.
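For reference, the terms entering the best model of a given size, and the size favored by each criterion, can be read off the regsubsets() summary directly; a brief sketch:
## which terms enter the best five-predictor model (the one favored by Cp)
summary(reg_subsets)$which[5, ]

## model sizes favored by each criterion: Cp and BIC are minimized, adjusted R^2 maximized
which.min(summary(reg_subsets)$cp)     # 5 terms -> Model_2
which.max(summary(reg_subsets)$adjr2)  # 6 terms -> Model_3
which.min(summary(reg_subsets)$bic)    # 4 terms -> Model_4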
Model_2: crudedeath_rate ~ binge_drink_rate + education_rate + mean_poverty_rate + mean_poverty_rate:mean_smoke_rate + divorce_rate:education_rate
The fitted parameters of this model are shown below:
model_2 = lm(data = overall_regression, formula = crudedeath_rate ~ binge_drink_rate + education_rate + mean_poverty_rate + mean_poverty_rate:mean_smoke_rate + divorce_rate:education_rate)
summary(model_2) %>% broom::tidy() %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.3219639 | 0.2096655 | 1.535607 | 0.1317951 |
binge_drink_rate | -1.4348310 | 0.5377575 | -2.668175 | 0.0106366 |
education_rate | 5.2963056 | 0.7760553 | 6.824650 | 0.0000000 |
mean_poverty_rate | -4.1019627 | 1.2793286 | -3.206340 | 0.0025056 |
mean_poverty_rate:mean_smoke_rate | 18.0807105 | 4.0391521 | 4.476363 | 0.0000532 |
education_rate:divorce_rate | -6.7196736 | 1.8057116 | -3.721344 | 0.0005592 |
summary(model_2) %>% broom::glance() %>% knitr::kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.5942123 | 0.5481001 | 0.0974067 | 12.88622 | 1e-07 | 5 | 44 | 50 |
Based on adjusted R-squared, we get Model_3 as the “best” model.
Model_3: crudedeath_rate ~ binge_drink_rate + education_rate + mean_poverty_rate + mean_poverty_rate:mean_smoke_rate + divorce_rate:education_rate + mean_unemployment_rate:mean_poverty_rate
The fitted parameters of this model are shown below:
model_3 = lm(data = overall_regression, formula = crudedeath_rate ~ binge_drink_rate + education_rate + mean_poverty_rate + mean_poverty_rate:mean_smoke_rate + divorce_rate:education_rate + mean_unemployment_rate:mean_poverty_rate)
summary(model_3) %>% broom::tidy() %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.3801708 | 0.2108814 | 1.802771 | 0.0784354 |
binge_drink_rate | -1.0853390 | 0.5828431 | -1.862146 | 0.0694250 |
education_rate | 4.7397334 | 0.8565289 | 5.533653 | 0.0000017 |
mean_poverty_rate | -5.9904343 | 1.8111909 | -3.307456 | 0.0019082 |
mean_poverty_rate:mean_smoke_rate | 17.9418971 | 3.9899597 | 4.496761 | 0.0000516 |
education_rate:divorce_rate | -5.9249642 | 1.8649682 | -3.176979 | 0.0027542 |
mean_poverty_rate:mean_unemployment_rate | 22.6088897 | 15.5372534 | 1.455141 | 0.1528923 |
summary(model_3) %>% broom::glance() %>% knitr::kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.6132566 | 0.5592924 | 0.0961929 | 11.36414 | 1e-07 | 6 | 43 | 50 |
Based on BIC, we get Model_4 as the “best” model.
Model_4: crudedeath_rate ~ binge_drink_rate + education_rate + mean_smoke_rate + divorce_rate:education_rate
The fitted parameters of this model are shown below:
model_4 = lm(data = overall_regression, formula = crudedeath_rate ~ binge_drink_rate + education_rate + mean_smoke_rate + divorce_rate:education_rate)
summary(model_4) %>% broom::tidy() %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.214375 | 0.1768821 | -1.211965 | 0.2318510 |
binge_drink_rate | -1.565518 | 0.5413276 | -2.891997 | 0.0058775 |
education_rate | 4.874933 | 0.6976179 | 6.987970 | 0.0000000 |
mean_smoke_rate | 2.212329 | 0.5296644 | 4.176850 | 0.0001339 |
education_rate:divorce_rate | -5.427036 | 1.4483395 | -3.747074 | 0.0005074 |
summary(model_4) %>% broom::glance() %>% knitr::kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.5671975 | 0.5287261 | 0.0994728 | 14.74338 | 1e-07 | 4 | 45 | 50 |
Method:
The initial full model for crude death rate consists of all candidate predictors, this time without interaction terms.
Model_full: crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate
We again apply the two modeling methods to search for a better regression model for crude death rate.
stepwise_no_inter =
step(lm(crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate, data = overall_regression),direction = "both", trace = 0)
Based on the stepwise results, the selected model is Model_5.
Model_5: crudedeath_rate ~ binge_drink_rate + divorce_rate + education_rate + mean_smoke_rate
Comment:
Since the model selected by regsubsets() without interactions is identical to Model_5, we do not discuss the details of that method here.
The fitted parameters of this model are shown below:
summary(stepwise_no_inter) %>% broom::tidy() %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.4646221 | 0.3226392 | 1.440067 | 0.1567697 |
binge_drink_rate | -1.5395071 | 0.5520224 | -2.788849 | 0.0077258 |
divorce_rate | -1.5061409 | 0.4402422 | -3.421165 | 0.0013370 |
education_rate | 2.2751208 | 0.4591017 | 4.955592 | 0.0000107 |
mean_smoke_rate | 2.5809509 | 0.5356159 | 4.818660 | 0.0000168 |
summary(stepwise_no_inter) %>% broom::glance() %>% knitr::kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.5493662 | 0.5093098 | 0.1015012 | 13.71484 | 2e-07 | 4 | 45 | 50 |
We now have five candidate models for the “best” model of the drug overdose crude death rate:
Model_1: crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate + binge_drink_rate:education_rate.
Model_2: crudedeath_rate ~ mean_poverty_rate + mean_unemployment_rate:mean_poverty_rate + education_rate:mean_poverty_rate + mean_poverty_rate:mean_smoke_rate
Model_3: crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_poverty_rate + binge_drink_rate:education_rate +mean_unemployment_rate:mean_poverty_rate + mean_poverty_rate:mean_smoke_rate
Model_4: crudedeath_rate ~ binge_drink_rate + education_rate + mean_smoke_rate + divorce_rate:education_rate
Model_5: crudedeath_rate ~ binge_drink_rate + divorce_rate + education_rate + mean_smoke_rate
In the cross-validation process, we compare the cross-validated prediction error (RMSE) of these five models, and of the initial full model, across repeated training/testing splits.
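As an optional refactor, the five candidate formulas listed above could be collected in one named list and mapped over inside the cross-validation step, rather than spelling out each lm() call by hand; a sketch (candidate_formulas is a hypothetical name, not used by the code below):
## sketch only: the candidate formulas as a named list, e.g. for purrr::map()
candidate_formulas = list(
  model_1 = crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate +
    education_rate + mean_poverty_rate + mean_smoke_rate + binge_drink_rate:education_rate,
  model_2 = crudedeath_rate ~ mean_poverty_rate + mean_unemployment_rate:mean_poverty_rate +
    education_rate:mean_poverty_rate + mean_poverty_rate:mean_smoke_rate,
  model_3 = crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_poverty_rate +
    binge_drink_rate:education_rate + mean_unemployment_rate:mean_poverty_rate +
    mean_poverty_rate:mean_smoke_rate,
  model_4 = crudedeath_rate ~ binge_drink_rate + education_rate + mean_smoke_rate +
    divorce_rate:education_rate,
  model_5 = crudedeath_rate ~ binge_drink_rate + divorce_rate + education_rate + mean_smoke_rate
)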
## an illustrative 80/20 split (41 of 51 states); not used by the Monte Carlo CV below
train_df = sample_n(overall_regression, 41)
test_df = anti_join(overall_regression, train_df)

## 5000 Monte Carlo training/testing splits with modelr::crossv_mc()
set.seed(1)
cv_modeling_df =
  crossv_mc(overall_regression, 5000) %>%
  mutate(
    train = map(train, as_tibble),
    test = map(test, as_tibble)
  ) %>%
  mutate(
    model_1 = map(.x = train, ~lm(crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate + binge_drink_rate:education_rate, data = .x)),
    model_2 = map(.x = train, ~lm(crudedeath_rate ~ mean_poverty_rate + mean_unemployment_rate:mean_poverty_rate + education_rate:mean_poverty_rate + mean_poverty_rate:mean_smoke_rate, data = .x)),
    model_3 = map(.x = train, ~lm(crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_poverty_rate + binge_drink_rate:education_rate + mean_unemployment_rate:mean_poverty_rate + mean_poverty_rate:mean_smoke_rate, data = .x)),
    model_4 = map(.x = train, ~lm(crudedeath_rate ~ binge_drink_rate + education_rate + mean_smoke_rate + divorce_rate:education_rate, data = .x)),
    model_5 = map(.x = train, ~lm(crudedeath_rate ~ binge_drink_rate + divorce_rate + education_rate + mean_smoke_rate, data = .x)),
    model_full = map(.x = train, ~lm(crudedeath_rate ~ binge_drink_rate + divorce_rate + mean_unemployment_rate + education_rate + mean_poverty_rate + mean_smoke_rate + binge_drink_rate * education_rate + mean_unemployment_rate * mean_poverty_rate + education_rate * mean_poverty_rate + education_rate * mean_smoke_rate + mean_poverty_rate * mean_smoke_rate, data = .x))
  ) %>%
  mutate(
    rmse_model_1 = map2_dbl(.x = model_1, .y = test, ~rmse(model = .x, data = .y)),
    rmse_model_2 = map2_dbl(.x = model_2, .y = test, ~rmse(model = .x, data = .y)),
    rmse_model_3 = map2_dbl(.x = model_3, .y = test, ~rmse(model = .x, data = .y)),
    rmse_model_4 = map2_dbl(.x = model_4, .y = test, ~rmse(model = .x, data = .y)),
    rmse_model_5 = map2_dbl(.x = model_5, .y = test, ~rmse(model = .x, data = .y)),
    rmse_model_full = map2_dbl(.x = model_full, .y = test, ~rmse(model = .x, data = .y))
  )
cv_modeling_output =
  cv_modeling_df %>%
  select(.id, starts_with("rmse")) %>%
  pivot_longer(
    rmse_model_1:rmse_model_full,
    names_to = "model",
    values_to = "rmse",
    names_prefix = "rmse_"
  ) %>%
  ggplot(aes(x = model, y = rmse)) +
  geom_violin(aes(fill = model), alpha = 0.5)

cv_modeling_output
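As a numeric companion to the violin plot, the average cross-validated RMSE of each model could be tabulated; a minimal sketch:
## sketch only: mean cross-validated RMSE per candidate model, smallest first
cv_modeling_df %>%
  select(starts_with("rmse")) %>%
  pivot_longer(everything(), names_to = "model", values_to = "rmse", names_prefix = "rmse_") %>%
  group_by(model) %>%
  dplyr::summarize(mean_rmse = mean(rmse)) %>%
  arrange(mean_rmse)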
Comments:
All five models we selected perform better than the initial full model. Model_5 may be slightly better than the other four.
The model that best explains the variation (from 2015 - 2019) in the crude death rate of drug overdose across the fifty states and Washington D.C. is therefore Model_5.
Model_5: crudedeath_rate ~ binge_drink_rate + divorce_rate + education_rate + mean_smoke_rate
summary(stepwise_no_inter) %>% broom::tidy() %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.4646221 | 0.3226392 | 1.440067 | 0.1567697 |
binge_drink_rate | -1.5395071 | 0.5520224 | -2.788849 | 0.0077258 |
divorce_rate | -1.5061409 | 0.4402422 | -3.421165 | 0.0013370 |
education_rate | 2.2751208 | 0.4591017 | 4.955592 | 0.0000107 |
mean_smoke_rate | 2.5809509 | 0.5356159 | 4.818660 | 0.0000168 |
Predictor evaluation: Binge drinking rate and divorce rate are negatively associated with the crude death rate of drug overdose; states with higher values of these predictors tend to have lower drug overdose death rates. Education rate and smoke rate are positively associated with the crude death rate; states with a higher proportion of residents holding a bachelor’s degree or above, or with a higher prevalence of adult smokers, tend to have higher drug overdose death rates. All four predictors are significant in this model.
summary(stepwise_no_inter) %>% broom::glance() %>% knitr::kable()
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.5493662 | 0.5093098 | 0.1015012 | 13.71484 | 2e-07 | 4 | 45 | 50 |
Model description: Overall, this model explains around 50% of the variation in the crude death rate of drug overdose among the fifty states and Washington D.C.. However, since the sample size is quite small (50 + 1 = 51), this model might not be very reliable.
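A natural next step, not included in the original write-up, would be to check the usual linear-model diagnostics for the selected model; a brief sketch:
## sketch only: residual, Q-Q, scale-location, and leverage plots for Model_5
par(mfrow = c(2, 2))
plot(stepwise_no_inter)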