Skip to content

Variance and Regression

Question 1.

Conduct ANOVA (analysis of variance) and Regression coefficients to the data from data (” cystfibr ”) database. You can choose any variable you like to interpret. In the report, you need to state the result of coefficients and significance to any variables you like both under ANOVA and multivariate analysis. Please provide a specific interpretation of R results.

Clue:

Import library(ISwR) The model code example:

lm(pemax ~ age + weight + bmp + fev1, data = cystfibr)
anova(lm(pemax ~ age + weight + bmp + fev1, data=cystfibr))
> library(ISwR)
> data("cystfibr")
> model <-lm(pemax ~ age + weight + bmp + fev1, data = cystfibr)
> anova_results <- anova(model)
> anova_results
Analysis of Variance Table
Response: pemax
Df Sum Sq Mean Sq
age 1 10098.5 10098.5
weight 1 945.2 945.2
bmp 1 2379.7 2379.7
fev1 1 2455.6 2455.6
Residuals 20 10953.7 547.7
F value Pr(>F)
age 18.4385 0.0003538 ***
weight 1.7258 0.2038195
bmp 4.3450 0.0501483 .
fev1 4.4836 0.0469468 *
Residuals
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
0.05 ‘.’ 0.1 ‘ ’ 1
> summary(model)
Call:
lm(formula = pemax ~ age + weight + bmp + fev1, data = cystfibr)
Residuals:
Min 1Q Median 3Q
-42.521 -10.885 3.003 15.488
Max
41.767
Coefficients:
Estimate Std. Error
(Intercept) 179.2957 61.8855
age -3.4181 3.3086
weight 2.6882 1.1727
bmp -2.0657 0.8198
fev1 1.0882 0.5139
t value Pr(>|t|)
(Intercept) 2.897 0.00891 **
age -1.033 0.31389
weight 2.292 0.03287 *
bmp -2.520 0.02036 *
fev1 2.117 0.04695 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05
‘.’ 0.1 ‘ ’ 1
Residual standard error: 23.4 on 20 degrees of freedom
Multiple R-squared: 0.5918, Adjusted R-squared: 0.5101
F-statistic: 7.248 on 4 and 20 DF, p-value: 0.0008891

This ANOVA test was performed to test if the coefficients at all contributed to the variation of pemax. Weight, bpm, and fev1 are significant because they are < 0.05. Age in this case is .31389 and is not significant in this test.

  • Weight has a positive impact on pemax.
  • bpm has a negative impact on pemax (bpm increases pemax decreases)
  • fev1 and weight have a positive impact (fev1 and weight increase pemax increases)

Question 2.

The secher data(“secher”) are best analyzed after log-transforming birth weight as well as the abdominal and biparietal diameters. Fit a prediction weight as well as abdominal and biparietal diameters. For a prediction equation for birth weight. How much is gained by using both diameters in a prediction equation? The sum of the two regression coefficients is almost identical and equal to 3. Can this be given a nice interpretation to our analysis? Please provide step by step on your analysis and code you use to find out the result.

Clue:

Model with only abdominal diameter: model_ad <- lm(log(bwt) ~ I(log(ad)), data=secher)
Model with only biparietal diameter: model_bpd <- lm(log(bwt) ~ I(log(bpd)), data=secher)
Combine both models: model_combined <- lm(log(bwt) ~ I(log(ad)) + I(log(bpd)), data=secher)
summary(model_combined)
> data("secher")
> model_ad <- lm(log(bwt) ~ I(log(ad)), data=secher)
> summary(model_ad)
Call:
lm(formula = log(bwt) ~ I(log(ad)), data = secher)
Residuals:
Min 1Q Median 3Q Max
-0.58560 -0.06609 0.00184 0.07479 0.48435
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.4446 0.5103 -4.791 5.49e-06 ***
I(log(ad)) 2.2365 0.1105 20.238 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1275 on 105 degrees of freedom
Multiple R-squared: 0.7959, Adjusted R-squared: 0.794
F-statistic: 409.6 on 1 and 105 DF, p-value: < 2.2e-16
> model_bpd <- lm(log(bwt) ~ I(log(bpd)), data=secher)
> summary(model_bpd)
Call:
lm(formula = log(bwt) ~ I(log(bpd)), data = secher)
Residuals:
Min 1Q Median 3Q Max
-0.36478 -0.09725 0.01251 0.07703 0.51154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.0862 0.9062 -7.819 4.35e-12 ***
I(log(bpd)) 3.3320 0.2017 16.516 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1488 on 105 degrees of freedom
Multiple R-squared: 0.7221, Adjusted R-squared: 0.7194
F-statistic: 272.8 on 1 and 105 DF, p-value: < 2.2e-16
> model_combined <- lm(log(bwt) ~ I(log(ad)) + I(log(bpd)), data=secher)
> summary(model_combined)
Call:
lm(formula = log(bwt) ~ I(log(ad)) + I(log(bpd)), data = secher)
Residuals:
Min 1Q Median 3Q Max
-0.35074 -0.06741 -0.00792 0.05750 0.36360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.8615 0.6617 -8.859 2.36e-14 ***
I(log(ad)) 1.4667 0.1467 9.998 < 2e-16 ***
I(log(bpd)) 1.5519 0.2294 6.764 8.09e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1068 on 104 degrees of freedom
Multiple R-squared: 0.8583, Adjusted R-squared: 0.8556
F-statistic: 314.9 on 2 and 104 DF, p-value: < 2.2e-16

By looking aat model_ad, model_bpd, and model_combined 85% is increased when using both diameters. As r^2 increases, the variability becomes more and more obvious.

model_combined r^2 = 0.85 model_bpd r^2 = 0.72 model_ad r^2 = 0.79

When the abdominal and biparietal model are added together, we get a value close to 3 showing that the prediction is accurate.

Just an additional question (this will not be graded). When should we consider “log-transforming” a dataset? This is a very common practice in data science.

Using log-transforming can help reduce variance and outliers.