Exercises with R
1. Load the alcoholism dataset
alcolismo <- read.csv2("/Users/rebeccacavagnola/Desktop/tutor/alcolismo.csv", sep=";")
2. Proceed with checking the dataset and assign the correct class to the variables if necessary
str(alcolismo)
alcolismo$drinks <- as.numeric(alcolismo$drinks)
alcolismo$sesso <- as.factor(alcolismo$sesso)
# Convert the 'drinks' variable to numeric since it was incorrectly classified as a character, and the 'sesso' variable to a factor since it was incorrectly classified as numeric.
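As a minimal sketch of this class check, the snippet below uses a small invented data frame (toy_df and its values are hypothetical, not taken from alcolismo.csv). It compares the column classes before and after the same two conversions, and shows that as.numeric() silently produces NA for strings it cannot parse, which is worth checking after converting.

```r
# Hypothetical toy data: 'drinks' read as character, 'sesso' read as integer
toy_df <- data.frame(
  drinks = c("0.5", "3", "6"),
  sesso  = c(1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Class of every column before converting
classes_before <- sapply(toy_df, class)

# Same conversions as in the exercise
toy_df$drinks <- as.numeric(toy_df$drinks)
toy_df$sesso  <- as.factor(toy_df$sesso)

# Class of every column after converting
classes_after <- sapply(toy_df, class)
print(classes_before)
print(classes_after)

# as.numeric() returns NA (with a warning) for strings it cannot parse,
# for example a value written with a comma as decimal separator
n_unparsed <- sum(is.na(suppressWarnings(as.numeric(c("2", "3,5")))))
print(n_unparsed)
```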
3. Check for the presence of missing data and impute if necessary
summary(alcolismo)
table(is.na(alcolismo))
# There are no missing data
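The real dataset has no missing values, so no imputation is needed here. Purely as an illustration of what the "impute if necessary" step could look like, the sketch below counts missing values per column and applies median imputation to an invented data frame (toy_df is hypothetical; median imputation is only one of several possible choices).

```r
# Hypothetical toy data with two missing values in 'drinks'
toy_df <- data.frame(
  drinks = c(1, NA, 4, 7, NA),
  sesso  = factor(c(1, 2, 2, 1, 2))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_df))
print(na_per_column)

# Replace missing 'drinks' values with the median of the observed ones
drinks_median <- median(toy_df$drinks, na.rm = TRUE)
toy_df$drinks[is.na(toy_df$drinks)] <- drinks_median

# Confirm that no missing values remain
print(anyNA(toy_df))
```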
4. How many men (sesso = 2) drink more than 2 pints per day?
# subset() already returns a data frame; nrow() gives the number of men selected
uomini_pinte <- subset(alcolismo, sesso == "2" & drinks > 2)
nrow(uomini_pinte)
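The same count can also be obtained without creating a subset, by summing a logical condition. This is a small sketch on invented data (toy_df is hypothetical) that checks both approaches give the same number.

```r
# Hypothetical toy data: sex coded as a factor (2 = men) and pints per day
toy_df <- data.frame(
  sesso  = factor(c(1, 2, 2, 2, 1)),
  drinks = c(3, 1, 2.5, 6, 4)
)

# TRUE for men who drink more than 2 pints per day
is_heavy_man <- toy_df$sesso == "2" & toy_df$drinks > 2

# Summing a logical vector counts the TRUE values
n_by_sum <- sum(is_heavy_man)

# Equivalent count through subset() and nrow()
n_by_subset <- nrow(subset(toy_df, sesso == "2" & drinks > 2))

print(c(n_by_sum, n_by_subset))
```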
5. Create a new variable called RangeDrinks based on the Drinks variable, categorizing its values into the following ranges: (0-4, 5-8, 9-20) and naming them respectively: "low", "medium", "alcoholism"
# Intervals are [0,4], (4,8], (8,20]; the labels are assigned directly in cut()
alcolismo$RangeDrinks <- cut(alcolismo$drinks, breaks = c(0, 4, 8, 20), labels = c("low", "medium", "alcoholism"), include.lowest = TRUE)
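To see how cut() treats the interval boundaries, the sketch below applies the same breaks to a few invented values (drinks_example is hypothetical). With include.lowest = TRUE the intervals are [0,4], (4,8] and (8,20], so a non-integer value such as 4.5 falls in "medium" and anything above 20 becomes NA.

```r
# Hypothetical values chosen to sit on or near the interval boundaries
drinks_example <- c(0, 4, 4.5, 5, 8, 9, 20, 25)

range_example <- cut(
  drinks_example,
  breaks = c(0, 4, 8, 20),
  labels = c("low", "medium", "alcoholism"),
  include.lowest = TRUE  # without this, 0 would be classified as NA
)

# Show each value next to its assigned category
print(data.frame(drinks = drinks_example, RangeDrinks = range_example))
```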
6. Calculate the mean of aspartate and represent the median values of alanine aminotransferase for the newly created categories
# Mean
mean(alcolismo$aspartato.amminotrasferasi)
# 24.64
# To represent the median values, use a boxplot
boxplot(alcolismo$aspartato.amminotrasferasi ~ alcolismo$RangeDrinks, main="Aspartate Levels in the Three RangeDrinks Categories", xlab="Groups", ylab="Aspartate", ylim=c(0,50))
# A more polished boxplot with ggpubr
library(ggplot2)
library(ggpubr)
ggboxplot(alcolismo, x = "RangeDrinks", y = "aspartato.amminotrasferasi",
color = "RangeDrinks", palette = "jco",
add = "jitter", ylim=c(0,50)) + ggtitle("Aspartate levels in the three RangeDrinks classes")
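A boxplot shows the medians graphically; to read the numeric values as well, one option is tapply(). The sketch below uses invented data (toy_df and the column name aspartate are hypothetical stand-ins for the real variables).

```r
# Hypothetical toy data: three observations per RangeDrinks category
group_levels <- c("low", "medium", "alcoholism")
toy_df <- data.frame(
  aspartate   = c(20, 22, 24, 25, 30, 35, 28, 40, 45),
  RangeDrinks = factor(rep(group_levels, each = 3), levels = group_levels)
)

# Median of 'aspartate' within each category
group_medians <- tapply(toy_df$aspartate, toy_df$RangeDrinks, median)
print(group_medians)
```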
7. Verify with the most appropriate test (justifying the choice) if there are significant differences between the three groups of RangeDrinks in aspartate aminotransferase levels.
# One-way ANOVA is the most suitable test, provided that the assumptions of normality and homoscedasticity are met
Let's check for normality using the Shapiro-Wilk test
shapiro.test(alcolismo$aspartato.amminotrasferasi)
Since the variable is not normally distributed (p-value < 0.05, so the null hypothesis of normality is rejected), I apply a non-parametric test, specifically the Kruskal-Wallis test (the non-parametric counterpart of one-way ANOVA). Checking homoscedasticity is not strictly necessary, because the normality assumption already fails; for completeness, Bartlett's test would be:
bartlett.test(aspartato.amminotrasferasi ~ RangeDrinks, data=alcolismo)
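Strictly speaking, the normality assumption of ANOVA concerns the residuals of the model rather than the pooled variable. As a variation on the check above (not the approach used in this solution), the sketch below fits the ANOVA on simulated data and tests its residuals; toy_df, the column name aspartate and the simulation parameters are all invented for illustration.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical simulated data: 30 observations per RangeDrinks category
group_levels <- c("low", "medium", "alcoholism")
toy_df <- data.frame(
  aspartate   = c(rnorm(30, 22, 4), rnorm(30, 26, 4), rnorm(30, 30, 4)),
  RangeDrinks = factor(rep(group_levels, each = 30), levels = group_levels)
)

# Fit the one-way ANOVA, then test the normality of its residuals
anova_fit      <- aov(aspartate ~ RangeDrinks, data = toy_df)
residual_check <- shapiro.test(residuals(anova_fit))

# Homogeneity of variances across the three groups
variance_check <- bartlett.test(aspartate ~ RangeDrinks, data = toy_df)

print(residual_check$p.value)
print(variance_check$p.value)
```

If both p-values are above 0.05, the ANOVA assumptions are not rejected and the parametric test can be used; otherwise the Kruskal-Wallis test remains the safer choice.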
# I apply the non-parametric Kruskal-Wallis test:
kruskal.test(aspartato.amminotrasferasi ~ RangeDrinks, data = alcolismo)
The p-value is < 0.05, so there are significant differences between the three groups of RangeDrinks: low, medium, and alcoholism. Let's determine which groups differ by applying the post-hoc Dunn's Test.
library(FSA)
dunnTest(aspartato.amminotrasferasi ~ RangeDrinks,
data=alcolismo,
method="bonferroni")
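If the FSA package is not available, base R offers pairwise Wilcoxon rank-sum tests with a Bonferroni correction. This is a different procedure from Dunn's test and can give different p-values, so the sketch below is only an alternative, run on simulated data (toy_df, the column name aspartate and the simulation parameters are invented).

```r
set.seed(123)  # make the simulated data reproducible

# Hypothetical skewed data: 25 observations per RangeDrinks category
group_levels <- c("low", "medium", "alcoholism")
toy_df <- data.frame(
  aspartate   = c(rexp(25, 1 / 20), rexp(25, 1 / 25), rexp(25, 1 / 35)),
  RangeDrinks = factor(rep(group_levels, each = 25), levels = group_levels)
)

# Overall test for differences among the three groups
kw <- kruskal.test(aspartate ~ RangeDrinks, data = toy_df)

# Pairwise Wilcoxon rank-sum tests with Bonferroni-adjusted p-values
pw <- pairwise.wilcox.test(toy_df$aspartate, toy_df$RangeDrinks,
                           p.adjust.method = "bonferroni")

print(kw$p.value)
print(pw$p.value)  # lower-triangular matrix of adjusted p-values
```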
8. Represent the corpuscular volume in a graph for the three groups.
boxplot(alcolismo$volume.corpuscolare ~ alcolismo$RangeDrinks, main="Corpuscular Volume in the Three RangeDrinks Categories", xlab="Groups", ylab="Corpuscular volume", ylim=c(80,110))
# prettier
library(ggplot2)
library(ggpubr)
ggboxplot(alcolismo, x = "RangeDrinks", y = "volume.corpuscolare",
color = "RangeDrinks", palette = "jco",
add = "jitter", ylim=c(80,110)) + ggtitle("Corpuscular volume in the three groups")
9. Evaluate the relationship between corpuscular volume and the number of pints consumed per day.
summary(modello<- glm(volume.corpuscolare ~ drinks, data= alcolismo, family= "gaussian"))
p-value = 2.92e-09 (< 0.05): there is a statistically significant linear relationship between corpuscular volume and the number of pints consumed per day.
B0 = 88.72: the intercept, i.e. the expected corpuscular volume (Y) when the number of pints (X) is 0.
B1 = 0.42: the slope, i.e. the expected change in corpuscular volume (Y) for each additional pint per day (a one-unit increase in X).
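A glm() with family = "gaussian" fits the same model as lm(), so either can be used for this exercise. The sketch below checks this on simulated data and illustrates the reading of B1 as the difference between the predictions at X = 1 and X = 0; toy_df, the column name volume and the simulation parameters are invented.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical simulated data with a linear relationship plus noise
toy_df <- data.frame(drinks = runif(50, 0, 20))
toy_df$volume <- 88 + 0.4 * toy_df$drinks + rnorm(50, sd = 3)

# The same linear model fitted in two ways
fit_lm  <- lm(volume ~ drinks, data = toy_df)
fit_glm <- glm(volume ~ drinks, data = toy_df, family = gaussian)

# The estimated coefficients coincide
print(all.equal(coef(fit_lm), coef(fit_glm)))

# The slope is the change in the prediction for one extra pint per day
pred <- predict(fit_lm, newdata = data.frame(drinks = c(0, 1)))
print(diff(pred))

# 95% confidence intervals for intercept and slope
print(confint(fit_lm))
```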
10. Evaluate the change in corpuscular volume among the three groups of RangeDrinks using a linear regression model.
summary(modello<- glm(volume.corpuscolare ~ RangeDrinks, data= alcolismo, family= "gaussian"))
• 89.15: The intercept, i.e. the mean corpuscular volume in the reference group RangeDrinks low (the dummy variables for medium and alcoholism are both 0).
• 3.3024: The difference in mean corpuscular volume between RangeDrinks medium and RangeDrinks low (p-value < 0.05, so the difference is significantly different from 0).
• 3.3738: The difference in mean corpuscular volume between RangeDrinks alcoholism and RangeDrinks low (p-value < 0.05, so the difference is significantly different from 0).
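The interpretation of these coefficients can be checked against the group means. In the sketch below, built on invented data (toy_df and the column name volume are hypothetical), the intercept equals the mean of the reference group and each remaining coefficient equals the difference between a group mean and the reference mean.

```r
# Hypothetical toy data: three observations per RangeDrinks category
group_levels <- c("low", "medium", "alcoholism")
toy_df <- data.frame(
  volume      = c(88, 89, 90, 91, 92, 93, 92, 93, 94),
  RangeDrinks = factor(rep(group_levels, each = 3), levels = group_levels)
)

# Linear model with a categorical predictor ("low" is the reference level)
fit <- lm(volume ~ RangeDrinks, data = toy_df)
model_coefs <- coef(fit)

# Group means for comparison
group_means <- tapply(toy_df$volume, toy_df$RangeDrinks, mean)

print(model_coefs)
print(group_means)
print(group_means[-1] - group_means["low"])  # differences from the reference
```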
11. Evaluate the association between corpuscular volume and aspartate aminotransferase and represent the association with the appropriate graph.
# First, I check for normality with the Shapiro-Wilk test:
shapiro.test(alcolismo$aspartato.amminotrasferasi)
shapiro.test(alcolismo$volume.corpuscolare)
# p-value < 0.05: the null hypothesis of normality is rejected, so I use Spearman's rank correlation
cor.test(alcolismo$aspartato.amminotrasferasi, alcolismo$volume.corpuscolare, method = "spearman")
# p-value = 0.049 < 0.05 (barely): there is a statistically significant association between aspartate and corpuscular volume
# Scatterplot of the two variables
plot(alcolismo$aspartato.amminotrasferasi, alcolismo$volume.corpuscolare, xlab="Aspartate", ylab="Corpuscular volume")
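Spearman's coefficient is the Pearson correlation computed on the ranks of the two variables, which is why it does not require normality. The sketch below verifies this on simulated, skewed data (x, y and the simulation parameters are invented for illustration).

```r
set.seed(7)  # make the simulated data reproducible

# Hypothetical skewed data with a monotonic, non-linear relationship
x <- rexp(40)
y <- x^2 + rnorm(40, sd = 0.5)

# Spearman's rho computed directly
rho_spearman <- cor(x, y, method = "spearman")

# Pearson correlation of the ranks gives the same value
rho_ranks <- cor(rank(x), rank(y))

print(c(rho_spearman, rho_ranks))
```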