Ejemplo: Validación del supuesto de linealidad

Estadística II - Modelos no paramétricos y de regresión

2018-1

Descripción de los datos

Considerar el conjunto de datos de fecundidad de la ONU (disponible aquí), que contiene la 193 observaciones de tres variables, cada una corresponde a un país. Las variables son:

El objetivo es construir verificar el supuesto de linealidad del modelo RLM de fecundidad sobre el producto y la población en zonas urbanas.

file_url <- 'http://users.stat.umn.edu/~sandy/alr3ed/website/data/UN2.txt'
datos <- read.table(file_url, T)
rownames(datos) <- datos$Locality
datos <- datos[,-4]

Linealidad en logPPgdp

Métodos gráficos

modelo_s1 <- lm(logFertility ~ Purban, data = datos)
res_par1  <- residuals(modelo_s1)

Métodos estadísticos: prueba de falta de ajuste (lack-of-fit)

\[ \text{logFertility} = \beta_0 + \beta_1 \text{logPPgdp} + \beta_2 \text{Purbana} + \gamma_1 \text{logPPgdp}^2 + u \]

modelo_loc1 <- lm(logFertility ~ logPPgdp + Purban + I(logPPgdp^2), data = datos)
summary(modelo_loc1)
## 
## Call:
## lm(formula = logFertility ~ logPPgdp + Purban + I(logPPgdp^2), 
##     data = datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08171 -0.23352  0.05329  0.24583  0.94550 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.603987   0.641007   7.182 1.53e-11 ***
## logPPgdp      -0.507502   0.120142  -4.224 3.73e-05 ***
## Purban        -0.002711   0.001857  -1.460  0.14595    
## I(logPPgdp^2)  0.016951   0.005266   3.219  0.00152 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3843 on 189 degrees of freedom
## Multiple R-squared:  0.4965, Adjusted R-squared:  0.4885 
## F-statistic: 62.13 on 3 and 189 DF,  p-value: < 2.2e-16

Linealidad en Purban

Métodos gráficos

modelo_s2 <- lm(logFertility ~ logPPgdp, data = datos)
res_par2  <- residuals(modelo_s2)

Métodos estadísticos: prueba de falta de ajuste (lack-of-fit)

\[ \text{logFertility} = \beta_0 + \beta_1 \text{logPPgdp} + \beta_2 \text{Purbana} + \gamma_2 \text{Purban}^2 + u \]

modelo_loc2 <- lm(logFertility ~ logPPgdp + Purban + I(Purban^2), data = datos)
summary(modelo_loc2)
## 
## Call:
## lm(formula = logFertility ~ logPPgdp + Purban + I(Purban^2), 
##     data = datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99912 -0.23736  0.06394  0.25862  0.94249 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.010e+00  1.892e-01  15.912  < 2e-16 ***
## logPPgdp    -1.291e-01  1.863e-02  -6.930 6.45e-11 ***
## Purban      -2.082e-02  5.452e-03  -3.818 0.000182 ***
## I(Purban^2)  1.593e-04  4.728e-05   3.368 0.000916 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3833 on 189 degrees of freedom
## Multiple R-squared:  0.499,  Adjusted R-squared:  0.491 
## F-statistic: 62.75 on 3 and 189 DF,  p-value: < 2.2e-16

No linealidad en la respuesta logFertility

Métodos gráficos

modelo <- lm(logFertility ~ logPPgdp + Purban, data = datos)
residuos <- residuals(modelo)
ajustados <- fitted(modelo)

Métodos estadisticos: prueba de no aditividad

\[ \text{logFertility} = \beta_0 + \beta_1 \text{logPPgdp} + \beta_2 \text{Purbana} + \gamma Z + u \] donde \(Z\) se calcula como \(\hat{y}^2/2\bar{y}_n\).

datos$Z <- ajustados^2/(2*mean(datos$logFertility))
modelo_nat <- lm(logFertility ~ logPPgdp + Purban + Z, data = datos)
summary(modelo_nat)
## 
## Call:
## lm(formula = logFertility ~ logPPgdp + Purban + Z, data = datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06422 -0.23936  0.04417  0.24320  0.95933 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.834426   0.949077  -0.879 0.380411    
## logPPgdp     0.070907   0.056860   1.247 0.213922    
## Purban       0.002642   0.002486   1.063 0.289304    
## Z            1.610369   0.440880   3.653 0.000336 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3814 on 189 degrees of freedom
## Multiple R-squared:  0.5039, Adjusted R-squared:  0.4961 
## F-statistic:    64 on 3 and 189 DF,  p-value: < 2.2e-16

Solución al problema de no linealidad

Regresión polinomial

\[ \text{logFertility} = \beta_0 + \beta_1 \text{logPPgdp} + \beta_2 \text{Purban} + \beta_{11} \text{logPPgdp}^2 + \beta_{22} \text{Purban}^2 \]

modelo_pol <- lm(logFertility ~ logPPgdp + Purban + I(logPPgdp^2) + I(Purban^2), data = datos)
summary(modelo_pol)
## 
## Call:
## lm(formula = logFertility ~ logPPgdp + Purban + I(logPPgdp^2) + 
##     I(Purban^2), data = datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03354 -0.24410  0.06117  0.25147  0.97594 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.184e+00  6.668e-01   6.274 2.36e-09 ***
## logPPgdp      -3.748e-01  1.352e-01  -2.773  0.00612 ** 
## Purban        -1.511e-02  6.248e-03  -2.419  0.01653 *  
## I(logPPgdp^2)  1.095e-02  5.967e-03   1.835  0.06809 .  
## I(Purban^2)    1.115e-04  5.371e-05   2.077  0.03917 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3809 on 188 degrees of freedom
## Multiple R-squared:  0.5078, Adjusted R-squared:  0.4973 
## F-statistic: 48.49 on 4 and 188 DF,  p-value: < 2.2e-16
modelo_pol2 <- lm(logFertility ~ logPPgdp + Purban + I(Purban^2), data = datos)
summary(modelo_pol2)
## 
## Call:
## lm(formula = logFertility ~ logPPgdp + Purban + I(Purban^2), 
##     data = datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99912 -0.23736  0.06394  0.25862  0.94249 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.010e+00  1.892e-01  15.912  < 2e-16 ***
## logPPgdp    -1.291e-01  1.863e-02  -6.930 6.45e-11 ***
## Purban      -2.082e-02  5.452e-03  -3.818 0.000182 ***
## I(Purban^2)  1.593e-04  4.728e-05   3.368 0.000916 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3833 on 189 degrees of freedom
## Multiple R-squared:  0.499,  Adjusted R-squared:  0.491 
## F-statistic: 62.75 on 3 and 189 DF,  p-value: < 2.2e-16

Trasfromación a la variable respuesta

datos$logFertility_trans <- 1/sqrt(datos$logFertility+1)
modelo_trans <- lm(logFertility_trans ~ logPPgdp + Purban, datos)
summary(modelo_trans)
## 
## Call:
## lm(formula = logFertility_trans ~ logPPgdp + Purban, data = datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15922 -0.04969 -0.02266  0.03833  0.23703 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.4526117  0.0305630  14.809  < 2e-16 ***
## logPPgdp    0.0210704  0.0039737   5.302 3.15e-07 ***
## Purban      0.0007235  0.0003922   1.845   0.0666 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08192 on 190 degrees of freedom
## Multiple R-squared:  0.3855, Adjusted R-squared:  0.379 
## F-statistic:  59.6 on 2 and 190 DF,  p-value: < 2.2e-16