1- Import the 'CENTENARI e BIOCHIMICA' database. Remove samples with missing data, save a new dataset called 'DATI', and reclassify the factor variables that were erroneously interpreted as numerical.
I import the dataset and save a new dataset 'DATI' without missing data.
DATI<-na.omit(CENTENARI_BIOCHIMICA)
I reclassify the categorical variables erroneously interpreted as numerical by R.
DATI$uo<-as.factor(DATI$uo)
DATI$FUMO<-as.factor(DATI$FUMO)
DATI$INFARTO<-as.factor(DATI$INFARTO)
DATI$INSUFF_RENE<-as.factor(DATI$INSUFF_RENE)
DATI$DIABETE<-as.factor(DATI$DIABETE)
DATI$TEST_1_DIABETE<-as.factor(DATI$TEST_1_DIABETE)
DATI$TEST_2_DIABETE<-as.factor(DATI$TEST_2_DIABETE)
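The conversions above can also be written as a single step. The following is only a minimal sketch on a small made-up data frame: `esempio`, `pulito`, and `categoriche` are hypothetical names, and the values are invented rather than taken from the real dataset.

```r
# Toy data frame standing in for the imported dataset (values are invented)
esempio <- data.frame(
  uo      = c(1, 2, 2, 3, NA),
  FUMO    = c(0, 1, 0, 1, 1),
  DIABETE = c(0, 0, 1, 1, 0),
  BMI     = c(22.5, 27.1, NA, 24.0, 30.2)
)

# Drop every row that contains at least one missing value
pulito <- na.omit(esempio)

# Columns that are categorical but were read as numeric
categoriche <- c("uo", "FUMO", "DIABETE")

# Convert all of them to factors in a single assignment
pulito[categoriche] <- lapply(pulito[categoriche], as.factor)

# Inspect the resulting column types
str(pulito)
```

Listing the column names once in a vector keeps the conversion in one place, which helps when the list of categorical variables grows.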
2- How many centenarian subjects are there with a BMI<25?
CENTENARI<-subset(DATI,DATI$Gruppo=="CENT" & DATI$BMI<25)
dim(CENTENARI)
dim(CENTENARI) returns 49 and 36: 49 is the number of rows, and 36 is the number of columns.
The number of rows indicates how many centenarian subjects with a BMI<25 are present in the dataset.
You can also obtain the same information with the following command:
table(DATI$Gruppo,DATI$BMI<25)
There are 49 centenarians with a BMI<25.
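The same count can be obtained without creating an intermediate dataset. This is a minimal sketch on invented data: the data frame `esempio` and the group label "CTRL" are assumptions made for illustration, and only "CENT" comes from the text.

```r
# Toy data (invented values); "CENT" marks centenarians as in the text
esempio <- data.frame(
  Gruppo = c("CENT", "CENT", "CTRL", "CENT", "CTRL"),
  BMI    = c(22.0, 26.5, 23.1, 24.9, 28.0)
)

# The condition yields TRUE for each row meeting both criteria;
# sum() counts the TRUE values
n_cent <- sum(esempio$Gruppo == "CENT" & esempio$BMI < 25)
n_cent

# Same count via nrow() on the subset
nrow(subset(esempio, Gruppo == "CENT" & BMI < 25))

# Cross-tabulation: the CENT row / TRUE column holds the same number
tabella <- table(esempio$Gruppo, esempio$BMI < 25)
tabella
```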
3- Show with a graph whether the distribution of the GLICEMIA variable differs between diabetics and non-diabetics.
boxplot(DATI$GLICEMIA~DATI$DIABETE)
4- Display with a graph the percentages of subjects belonging to the various operational units.
percentuali <-round(table(DATI$uo)/sum(table(DATI$uo))*100)
labels<-paste(levels(DATI$uo), percentuali,"%", sep="_")
pie(table(DATI$uo),labels)
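The percentages computed above can also be obtained with prop.table(), which divides each count by the total. A minimal sketch, assuming an invented factor `uo` with three made-up unit codes (the object names `conteggi` and `etichette` are also hypothetical):

```r
# Toy operational-unit codes (invented)
uo <- factor(c("A", "A", "B", "C", "C", "C", "C", "B"))

# Absolute counts per unit
conteggi <- table(uo)

# prop.table() turns counts into proportions; multiply by 100 for percentages
percentuali <- round(prop.table(conteggi) * 100)

# Build readable labels such as "A 25%"
etichette <- paste0(names(conteggi), " ", percentuali, "%")
etichette
```

The labels can then be passed to the plot, for example with pie(conteggi, labels = etichette) or barplot(percentuali).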
5- Display a graph depicting the relationship between GLICEMIA and HOMA-IR.
scatter.smooth(DATI$GLICEMIA,DATI$`HOMA-IR`)
6- Is the number of subjects who have had a heart attack significantly different between smokers and non-smokers? Choose the most appropriate test and indicate the p-value.
chisq.test(table(DATI$INFARTO,DATI$FUMO))
Pearson's Chi-squared test with Yates' continuity correction
data: table(DATI$INFARTO, DATI$FUMO)
X-squared = 16.056, df = 1, p-value = 6.15e-05
The number of subjects who have had a heart attack is significantly different between smokers and non-smokers, as the p-value is <0.05.
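Before relying on the chi-squared test, it can be useful to look at the expected counts, since a common rule of thumb asks for all of them to be at least 5. The sketch below uses an invented 2x2 table (`tab` and `risultato` are hypothetical names, and the counts are not those of the real dataset) and falls back to Fisher's exact test when that rule is not met.

```r
# Toy 2x2 table of counts (invented): rows = INFARTO, columns = FUMO
tab <- matrix(c(30, 10, 12, 18), nrow = 2,
              dimnames = list(INFARTO = c("0", "1"), FUMO = c("0", "1")))

risultato <- chisq.test(tab)

# Expected counts under the hypothesis of independence
attesi <- risultato$expected
attesi

# With small expected counts, Fisher's exact test is the usual alternative
if (any(attesi < 5)) {
  risultato <- fisher.test(tab)
}
risultato$p.value
```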
7- Is there a difference in the values of Total Cholesterol (COL_TOT) between subjects with and without renal insufficiency (INSUFF_RENE)? Choose the test you deem appropriate and interpret the result.
I check for normality and homoscedasticity with the following tests:
shapiro.test(DATI$COL_TOT)
Shapiro-Wilk normality test
data: DATI$COL_TOT
W = 0.99263, p-value = 0.05678
bartlett.test(DATI$COL_TOT~DATI$INSUFF_RENE)
Bartlett test of homogeneity of variances
data: DATI$COL_TOT by DATI$INSUFF_RENE
Bartlett's K-squared = 1.2554, df = 1, p-value = 0.2625
Both of them have a p-value >0.05, so I can apply a parametric test.
In this specific case, I have a binary categorical variable and a continuous one, so I can apply the t-test.
t.test(DATI$COL_TOT~DATI$INSUFF_RENE, paired = FALSE)
Welch Two Sample t-test
data: DATI$COL_TOT by DATI$INSUFF_RENE
t = 1.256, df = 8.2289, p-value = 0.2436
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-17.22163 58.86178
sample estimates:
mean in group 0 mean in group 1
200.5979 179.7778
There is no significant difference in the values of Total Cholesterol (COL_TOT) between subjects with and without renal insufficiency, as the p-value is >0.05 (p-value = 0.2436).
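The output above is headed "Welch Two Sample t-test" because t.test() uses var.equal = FALSE by default. Since Bartlett's test did not reject the homogeneity of variances, Student's t-test with pooled variance could also be requested. The sketch below compares the two calls on simulated data: `col_tot` and `gruppo` are hypothetical objects with invented values, not the real cholesterol measurements.

```r
set.seed(1)  # reproducible toy data (invented)
col_tot <- c(rnorm(40, mean = 200, sd = 40), rnorm(10, mean = 185, sd = 40))
gruppo  <- factor(rep(c("0", "1"), times = c(40, 10)))

# Default call: Welch's test, which does not assume equal variances
welch <- t.test(col_tot ~ gruppo)

# Student's test: pooled variance, suitable when variances are homogeneous
student <- t.test(col_tot ~ gruppo, var.equal = TRUE)

# Student's test has n1 + n2 - 2 degrees of freedom;
# Welch's degrees of freedom are never larger than that
student$parameter
welch$parameter

# Compare the two p-values
c(welch = welch$p.value, student = student$p.value)
```

With very unbalanced groups, as in this exercise, the two tests can give noticeably different degrees of freedom, which is why it is worth stating which one was used.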
8- Evaluate the relationship between GLICEMIA and BMI in a regression model. Assess if the relationship is significant and interpret the result by explaining what the beta value and intercept indicate.
summary(glm(DATI$BMI~DATI$GLICEMIA, family = gaussian))
Call:
glm(formula = DATI$BMI ~ DATI$GLICEMIA, family = gaussian)
Deviance Residuals:
Min 1Q Median 3Q Max
-13.0860 -2.7545 -0.2448 2.3224 22.7330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.047938 0.758165 29.081 < 2e-16 ***
DATI$GLICEMIA 0.046394 0.007826 5.928 6.89e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 20.28338)
Null deviance: 8420.4 on 381 degrees of freedom
Residual deviance: 7707.7 on 380 degrees of freedom
AIC: 2237.8
Number of Fisher Scoring iterations: 2
The relationship between GLICEMIA and BMI is found to be significant as the p-value is <0.05.
The intercept indicates the value of Y when X is zero. In this specific case, the predicted BMI is 22.05 when blood glucose is 0 (an extrapolation, since a blood glucose of 0 is not an observable value).
The beta value, on the other hand, indicates how much Y changes for a one-unit increase in X. In this specific case, BMI increases by about 0.046 units for each one-unit increase in blood glucose.
Because this is a Gaussian model, the beta is read directly on the BMI scale; taking the exponential of the beta (with the 'exp()' command) gives an Odds Ratio only in a logistic model (family = binomial).
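A glm() with family = gaussian fits the same model as lm(), so the coefficients can be read in the same way. The sketch below checks this on simulated data: `glicemia`, `bmi`, and the model names are hypothetical, and the values are invented.

```r
set.seed(2)  # reproducible toy data (invented)
glicemia <- runif(50, min = 70, max = 180)
bmi      <- 22 + 0.05 * glicemia + rnorm(50, sd = 4)

# Ordinary linear regression
modello_lm <- lm(bmi ~ glicemia)

# Generalized linear model with Gaussian family (identity link)
modello_glm <- glm(bmi ~ glicemia, family = gaussian)

# Both fits return the same intercept and slope
coef(modello_lm)
coef(modello_glm)

# Predicted BMI at a glucose value inside the observed range
previsione <- predict(modello_lm, newdata = data.frame(glicemia = 100))
previsione
```

Predicting at a value inside the observed range is usually more informative than reading the intercept, which refers to a blood glucose of zero.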
9- Evaluate the relationship between INFARTO = Y and GLICEMIA = X in a regression model. Assess if the relationship is significant and interpret the result by explaining what the beta value and intercept indicate.
summary(glm(DATI$INFARTO~DATI$GLICEMIA, family = binomial))
Call:
glm(formula = DATI$INFARTO ~ DATI$GLICEMIA, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6856 -0.5646 -0.5369 -0.5026 2.0879
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.776554 0.419783 -6.614 3.73e-11 ***
DATI$GLICEMIA 0.010860 0.004036 2.691 0.00712 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 321.91 on 381 degrees of freedom
Residual deviance: 314.75 on 380 degrees of freedom
AIC: 318.75
Number of Fisher Scoring iterations: 4
The relationship between heart attack and blood glucose is found to be significant (p-value 0.007).
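The intercept (-2.7766) is the log-odds of heart attack when blood glucose is zero; it is not a heart attack value in itself.
The beta is 0.0109 and is positive, so the log-odds of heart attack increase by 0.0109 for each one-unit increase in blood glucose. Taking the exponential gives the Odds Ratio: exp(0.010860) is about 1.011, that is, the odds of heart attack increase by about 1.1% per unit of blood glucose.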
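In a logistic model the coefficients are on the log-odds scale, so they are usually converted before being interpreted. The following sketch uses the two estimates copied from the summary() output above; the object names (`intercetta`, `beta`, `odds_ratio`, `p_zero`) are hypothetical.

```r
# Coefficients copied from the summary() output above
intercetta <- -2.776554
beta       <- 0.010860

# Odds ratio for a one-unit increase in GLICEMIA
odds_ratio <- exp(beta)
odds_ratio

# Odds ratio for a 10-unit increase: log-odds add up, so the odds multiply
exp(10 * beta)

# plogis() is the inverse logit: it turns log-odds into a probability.
# Predicted probability of INFARTO at GLICEMIA = 0 (an extrapolation)
p_zero <- plogis(intercetta)
p_zero
```

When the fitted model object is available, exp(coef(model)) returns the same odds ratios directly.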
10- Singh et al. studied immune abnormalities in autistic children, and data related to the measurement of serum antigen concentration in units/ml are reported for three groups of children under the age of 10: autistic, normal, and delayed. Are there differences between these three groups of children in terms of antigen concentration?
autistici<-c(755,383,380,215,400,343,415,360,345,450,410,435,460,360,225,900,365,440,820,400,170,300,325,345,230,370,285,315,195,270,305,375,220)
normali<-c(165,390,290,435,235,320,330,205,375,345,305,220,270,355,360,335,305,360,335,305,325,245,285,370,345)
ritardati<-c(380,510,315,565,715,380,390,245,155,335,295,200,105,105,245)
I create a vector with all the values of serum concentrations.
concentrazionesierica<-c(755,383,380,215,400,343,415,360,345,450,410,435,460,360,225,900,365,440,820,400,170,300,325,345,230,370,285,315,195,270,305,375,220,165,390,290,435,235,320,330,205,375,345,305,220,270,355,360,335,305,360,335,305,325,245,285,370,345,380,510,315,565,715,380,390,245,155,335,295,200,105,105,245)
I create a vector that contains the categories corresponding to the values of serum concentrations.
GRUPPI<-c(rep("AUT",length(autistici)),rep("NORM",length(normali)),rep("RIT",length(ritardati)))
Since 'GRUPPI' is created as a character vector, I convert it to a factor.
GRUPPI<-as.factor(GRUPPI)
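The values vector and the grouping factor can also be built together, which avoids retyping the numbers. A minimal sketch with short invented vectors (`a`, `b`, `d`, and `dati_lunghi` are hypothetical names) using stack() from base R:

```r
# Short invented vectors standing in for the three groups
a <- c(10, 12, 11)
b <- c(9, 8)
d <- c(14, 15, 13, 16)

# stack() on a named list returns a data frame with two columns:
# 'values' (the numbers) and 'ind' (a factor holding the list names)
dati_lunghi <- stack(list(AUT = a, NORM = b, RIT = d))

str(dati_lunghi)

# Number of observations per group
table(dati_lunghi$ind)
```

The resulting columns can be used directly in a formula, for example kruskal.test(values ~ ind, data = dati_lunghi).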
I check for normality.
shapiro.test(concentrazionesierica)
Shapiro-Wilk normality test
data: concentrazionesierica
W = 0.84468, p-value = 2.934e-07
Since I don't have a normal distribution, I apply a non-parametric test, specifically the Kruskal-Wallis test (the non-parametric counterpart of the parametric ANOVA test).
I should set the continuous variable as Y and the categorical variable as X.
kruskal.test(concentrazionesierica~GRUPPI)
Kruskal-Wallis rank sum test
data: concentrazionesierica by GRUPPI
Kruskal-Wallis chi-squared = 3.808, df = 2, p-value = 0.149
The p-value is >0.05, so there are no significant differences between the three groups of autistic, normal, and delayed children.
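The Shapiro-Wilk test above was run on the pooled values, while the normality assumption of ANOVA concerns each group. As a complementary check, the sketch below runs the test within each group and reports the group medians. The data are the three vectors given in the exercise; the names `valori` and `gruppi` are hypothetical.

```r
# Data copied from the exercise
autistici <- c(755,383,380,215,400,343,415,360,345,450,410,435,460,360,225,900,
               365,440,820,400,170,300,325,345,230,370,285,315,195,270,305,375,220)
normali   <- c(165,390,290,435,235,320,330,205,375,345,305,220,270,355,360,335,
               305,360,335,305,325,245,285,370,345)
ritardati <- c(380,510,315,565,715,380,390,245,155,335,295,200,105,105,245)

# One vector of values and one factor of group labels, in the same order
valori <- c(autistici, normali, ritardati)
gruppi <- factor(rep(c("AUT", "NORM", "RIT"),
                     times = c(length(autistici), length(normali), length(ritardati))))

# Shapiro-Wilk p-value computed separately within each group
p_normalita <- tapply(valori, gruppi, function(x) shapiro.test(x)$p.value)
p_normalita

# Group medians, a natural summary to report alongside a rank-based test
mediane <- tapply(valori, gruppi, median)
mediane
```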