In this example we will see how to select a set of variables for our model. Recall that we need to include enough variables to explain the phenomenon, but not so many that the variance of the estimates becomes too large.
First, let us see how to use measures such as Mallows' \(C_p\).
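For reference, the definition below is the usual textbook form of the statistic. It is stated here as an assumption, since these notes do not write it out. For a submodel with \(p\) parameters (intercept included) fitted to \(n\) observations:

```latex
C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2p
```

Here \(SSE_p\) is the residual sum of squares of the submodel and \(\hat{\sigma}^2\) is the mean square error of the full model. A submodel with little bias has \(C_p \approx p\), which is why candidate models are usually compared against the line \(C_p = p\).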
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(leaps)
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
##
## cement
cemento<-read.table("cement.txt", header=TRUE, skip=5)
# Scatter plots and correlations for each pair of variables
ggpairs(cemento)
model1<-lm(y~., cemento, x=TRUE, y=TRUE)
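# Compute C_p and plot it as a function of p
# (model1$x already contains the intercept column, so with int = FALSE
# leaps() treats that column as one more candidate)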
outs <- leaps(model1$x, cemento$y, int = FALSE)
plot(outs$size, outs$Cp, log = "y", xlab = "p", ylab = expression(C[p]), cex=0.5, pch=16)
# Line C_p = p
lines(outs$size, outs$size)
# Label each point with its row number in outs$which so we can tell
# which variables each point corresponds to
text(outs$size, outs$Cp, labels = seq_len(nrow(outs$which)), cex = 0.5, pos = 4)
Using this measure, which model would you choose?
Next we will see how to use the algorithms covered in class.
For example, to try all possible models:
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
# Try all possible subsets;
# this function reports R^2, adjusted R^2, and C_p for each model
ols_step_all_possible(model)
## Index N Predictors R-Square Adj. R-Square Mallow's Cp
## 3 1 1 wt 0.7528328 0.7445939 12.480939
## 1 2 1 disp 0.7183433 0.7089548 18.129607
## 2 3 1 hp 0.6024373 0.5891853 37.112642
## 4 4 1 qsec 0.1752963 0.1478062 107.069616
## 8 5 2 hp wt 0.8267855 0.8148396 2.369005
## 10 6 2 wt qsec 0.8264161 0.8144448 2.429492
## 6 7 2 disp wt 0.7809306 0.7658223 9.879096
## 5 8 2 disp hp 0.7482402 0.7308774 15.233115
## 7 9 2 disp qsec 0.7215598 0.7023571 19.602810
## 9 10 2 hp qsec 0.6368769 0.6118339 33.472150
## 14 11 3 hp wt qsec 0.8347678 0.8170643 3.061665
## 11 12 3 disp hp wt 0.8268361 0.8082829 4.360702
## 13 13 3 disp wt qsec 0.8264170 0.8078189 4.429343
## 12 14 3 disp hp qsec 0.7541953 0.7278591 16.257790
## 15 15 4 disp hp wt qsec 0.8351443 0.8107212 5.000000
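The Cp column in the table above can be checked by hand. The following is a minimal sketch in base R, assuming the convention \(C_p = SSE_p/\hat{\sigma}^2 - n + 2p\), with \(\hat{\sigma}^2\) taken from the full model and \(p\) counting the intercept. The helper name `mallows_cp` is hypothetical and is not part of olsrr.

```r
# Mallows' C_p by hand: C_p = SSE_p / sigma2_full - n + 2p,
# where p counts the intercept (assumed convention).
mallows_cp <- function(sub_fit, full_fit) {
  n <- length(residuals(full_fit))
  p <- length(coef(sub_fit))
  sigma2_full <- sum(residuals(full_fit)^2) / df.residual(full_fit)
  sum(residuals(sub_fit)^2) / sigma2_full - n + 2 * p
}

full <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
sub  <- lm(mpg ~ hp + wt, data = mtcars)

cp_sub  <- mallows_cp(sub, full)   # compare with the "hp wt" row
cp_full <- mallows_cp(full, full)  # the full model always gives C_p = p
print(c(hp_wt = cp_sub, full = cp_full))
```

Under this convention the full model gives exactly \(C_p = p = 5\), which agrees with the last row of the table, so that row carries no information for choosing among models.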
If we do not want every possible model, since there may be too many, we can look for the best model of each size:
sub.fit<-regsubsets(y~.,cemento)
summary(sub.fit)
## Subset selection object
## Call: regsubsets.formula(y ~ ., cemento)
## 4 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## x4 FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## x1 x2 x3 x4
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) "*" "*" " " " "
## 3 ( 1 ) "*" "*" " " "*"
## 4 ( 1 ) "*" "*" "*" "*"
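To make the idea of "best model of each size" concrete, here is a minimal base-R sketch of an exhaustive search. It uses the built-in `mtcars` data rather than the cement file, and it picks the winner of each size by smallest residual sum of squares. It is only an illustration of the idea, not how `regsubsets` is implemented internally.

```r
predictors <- c("disp", "hp", "wt", "qsec")

# For each subset size k, fit every model with k predictors and keep
# the one with the smallest residual sum of squares.
best_by_size <- lapply(seq_along(predictors), function(k) {
  subsets <- combn(predictors, k, simplify = FALSE)
  rss <- vapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
    sum(residuals(fit)^2)
  }, numeric(1))
  subsets[[which.min(rss)]]
})

for (k in seq_along(best_by_size)) {
  cat(k, ":", paste(best_by_size[[k]], collapse = " + "), "\n")
}
```

Within a fixed size all models have the same number of parameters, so ranking by residual sum of squares, \(R^2\), or \(C_p\) gives the same winner. The criteria only disagree when comparing models of different sizes.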
par(mfrow=c(1,2))
plot(sub.fit,scale = "Cp")
How do you interpret the plot above?
Another option is to use the functions in the MASS package for the stepwise selection algorithms:
full.model <- lm(Fertility ~., data = swiss)
step.model <- stepAIC(full.model, direction = "both",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
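To see what a single step of this search compares, the sketch below uses `drop1()` and `step()` from base R's stats package in place of `MASS::stepAIC`. This substitution is made here only to keep the example free of extra packages. The names `drops`, `first_out`, `step.base`, and `kept` are illustrative.

```r
full.model <- lm(Fertility ~ ., data = swiss)

# AIC (up to an additive constant) after deleting each term in turn;
# the row "<none>" is the current model.
drops <- drop1(full.model)
print(drops)
first_out <- rownames(drops)[which.min(drops$AIC)]

# step() repeats this kind of comparison, also trying re-additions,
# until no single move lowers the AIC.
step.base <- step(full.model, direction = "both", trace = 0)
kept <- attr(terms(step.base), "term.labels")
print(kept)
```

The term whose removal gives the lowest AIC is the one dropped first, and the search stops when `<none>` has the lowest AIC in the table.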
Or the olsrr package, which gives more detail about how the procedure unfolds:
model <- lm(y ~ ., data = surgical)
ols_step_forward_p(model)
##
## Selection Summary
## ------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------
## 1 liver_test 0.4545 0.4440 62.5119 771.8753 296.2992
## 2 alc_heavy 0.5667 0.5498 41.3681 761.4394 266.6484
## 3 enzyme_test 0.6590 0.6385 24.3379 750.5089 238.9145
## 4 pindex 0.7501 0.7297 7.5373 735.7146 206.5835
## 5 bcs 0.7809 0.7581 3.1925 730.6204 195.4544
## ------------------------------------------------------------------------------
# To display the procedure step by step
ols_step_forward_p(model, details = TRUE)
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. bcs
## 2. pindex
## 3. enzyme_test
## 4. liver_test
## 5. age
## 6. gender
## 7. alc_mod
## 8. alc_heavy
##
## We are selecting variables based on p value...
##
##
## Forward Selection: Step 1
##
## - liver_test
##
## Model Summary
## -----------------------------------------------------------------
## R 0.674 RMSE 296.299
## R-Squared 0.455 Coef. Var 42.202
## Adj. R-Squared 0.444 MSE 87793.232
## Pred R-Squared 0.386 MAE 212.857
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 3804272.477 1 3804272.477 43.332 0.0000
## Residual 4565248.060 52 87793.232
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------
## (Intercept) 15.191 111.869 0.136 0.893 -209.290 239.671
## liver_test 250.305 38.025 0.674 6.583 0.000 174.003 326.607
## -------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 2
##
## - alc_heavy
##
## Model Summary
## -----------------------------------------------------------------
## R 0.753 RMSE 266.648
## R-Squared 0.567 Coef. Var 37.979
## Adj. R-Squared 0.550 MSE 71101.387
## Pred R-Squared 0.487 MAE 187.393
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 4743349.776 2 2371674.888 33.356 0.0000
## Residual 3626170.761 51 71101.387
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) -5.069 100.828 -0.050 0.960 -207.490 197.352
## liver_test 234.597 34.491 0.632 6.802 0.000 165.353 303.841
## alc_heavy 342.183 94.156 0.338 3.634 0.001 153.157 531.208
## --------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 3
##
## - enzyme_test
##
## Model Summary
## -----------------------------------------------------------------
## R 0.812 RMSE 238.914
## R-Squared 0.659 Coef. Var 34.029
## Adj. R-Squared 0.639 MSE 57080.128
## Pred R-Squared 0.567 MAE 170.603
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 5515514.136 3 1838504.712 32.209 0.0000
## Residual 2854006.401 50 57080.128
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------
## (Intercept) -344.559 129.156 -2.668 0.010 -603.976 -85.141
## liver_test 183.844 33.845 0.495 5.432 0.000 115.865 251.823
## alc_heavy 319.662 84.585 0.315 3.779 0.000 149.769 489.555
## enzyme_test 6.263 1.703 0.335 3.678 0.001 2.843 9.683
## ---------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 4
##
## - pindex
##
## Model Summary
## -----------------------------------------------------------------
## R 0.866 RMSE 206.584
## R-Squared 0.750 Coef. Var 29.424
## Adj. R-Squared 0.730 MSE 42676.744
## Pred R-Squared 0.669 MAE 146.473
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6278360.060 4 1569590.015 36.779 0.0000
## Residual 2091160.477 49 42676.744
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## -----------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -----------------------------------------------------------------------------------------------
## (Intercept) -789.012 153.372 -5.144 0.000 -1097.226 -480.799
## liver_test 125.474 32.358 0.338 3.878 0.000 60.448 190.499
## alc_heavy 359.875 73.754 0.355 4.879 0.000 211.660 508.089
## enzyme_test 7.548 1.503 0.404 5.020 0.000 4.527 10.569
## pindex 7.876 1.863 0.335 4.228 0.000 4.133 11.620
## -----------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 5
##
## - bcs
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
##
##
## No more variables to be added.
##
## Variables Entered:
##
## + liver_test
## + alc_heavy
## + enzyme_test
## + pindex
## + bcs
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -----------------------------------------------------------------
## R 0.884 RMSE 195.454
## R-Squared 0.781 Coef. Var 27.839
## Adj. R-Squared 0.758 MSE 38202.426
## Pred R-Squared 0.700 MAE 137.656
## -----------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 6535804.090 5 1307160.818 34.217 0.0000
## Residual 1833716.447 48 38202.426
## Total 8369520.537 53
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------
## (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
## liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
## alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
## enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
## pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
## bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
## ------------------------------------------------------------------------------------------------
##
## Selection Summary
## ------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------
## 1 liver_test 0.4545 0.4440 62.5119 771.8753 296.2992
## 2 alc_heavy 0.5667 0.5498 41.3681 761.4394 266.6484
## 3 enzyme_test 0.6590 0.6385 24.3379 750.5089 238.9145
## 4 pindex 0.7501 0.7297 7.5373 735.7146 206.5835
## 5 bcs 0.7809 0.7581 3.1925 730.6204 195.4544
## ------------------------------------------------------------------------------
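The forward procedure traced above can be sketched in a few lines of base R with `add1()`. This is only an illustration under stated assumptions: it uses `mtcars` instead of the `surgical` data, and the entry threshold `p_enter = 0.05` is a hypothetical choice that need not match the default used by olsrr.

```r
candidates <- c("disp", "hp", "wt", "qsec")
scope <- reformulate(candidates)        # ~ disp + hp + wt + qsec
fit <- lm(mpg ~ 1, data = mtcars)       # start from the intercept-only model
p_enter <- 0.05                         # hypothetical entry threshold
selected <- character(0)

repeat {
  if (length(selected) == length(candidates)) break
  # Partial F test for adding each remaining candidate to the current model
  tab <- add1(fit, scope = scope, test = "F")
  pvals <- tab[["Pr(>F)"]][-1]          # drop the "<none>" row
  names(pvals) <- rownames(tab)[-1]
  best <- names(which.min(pvals))
  if (pvals[[best]] >= p_enter) break   # no candidate is significant enough
  fit <- update(fit, as.formula(paste(". ~ . +", best)))
  selected <- c(selected, best)
}

print(selected)
```

Once a variable enters it is never removed, which is the main difference from the stepwise ("both") variant.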
# Other options
ols_step_backward_p(model)
##
##
## Elimination Summary
## --------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------
## 1 alc_mod 0.7818 0.7486 7.0141 734.4068 199.2637
## 2 gender 0.7814 0.7535 5.0870 732.4942 197.2921
## 3 age 0.7809 0.7581 3.1925 730.6204 195.4544
## --------------------------------------------------------------------------
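Backward elimination works in the opposite direction: start from the full model and repeatedly remove the least significant term. A minimal base-R sketch with `drop1()` follows, again on `mtcars` and with a hypothetical removal threshold `p_remove = 0.05` that is not necessarily the one olsrr uses.

```r
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
p_remove <- 0.05                        # hypothetical removal threshold
removed <- character(0)

repeat {
  terms_left <- attr(terms(fit), "term.labels")
  if (length(terms_left) == 0) break
  # Partial F test for deleting each term from the current model
  tab <- drop1(fit, test = "F")
  pvals <- tab[["Pr(>F)"]][-1]          # drop the "<none>" row
  names(pvals) <- rownames(tab)[-1]
  worst <- names(which.max(pvals))
  if (pvals[[worst]] <= p_remove) break # every remaining term is significant
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  removed <- c(removed, worst)
}

print(removed)
print(attr(terms(fit), "term.labels"))
```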
ols_step_both_p(model)
##
## Stepwise Selection Summary
## ------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------------------
## 1 liver_test addition 0.455 0.444 62.5120 771.8753 296.2992
## 2 alc_heavy addition 0.567 0.550 41.3680 761.4394 266.6484
## 3 enzyme_test addition 0.659 0.639 24.3380 750.5089 238.9145
## 4 pindex addition 0.750 0.730 7.5370 735.7146 206.5835
## 5 bcs addition 0.781 0.758 3.1920 730.6204 195.4544
## ------------------------------------------------------------------------------------------