1. Load the alcoholism dataset

`alcolismo <- read.csv2("/Users/rebeccacavagnola/Desktop/tutor/alcolismo.csv", sep=";")`

2. Proceed with checking the dataset and assign the correct class to the variables if necessary

```
str(alcolismo)
# 'drinks' was read as character, so convert it to numeric
alcolismo$drinks <- as.numeric(alcolismo$drinks)
# 'sesso' was read as numeric, so convert it to a factor
alcolismo$sesso <- as.factor(alcolismo$sesso)
```
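Coercing a character column with `as.numeric()` turns any entry it cannot parse into `NA` and only emits a warning, so it can be worth counting how many values were lost in the conversion. A minimal sketch on a made-up vector (the values and the names `raw_drinks`, `converted` and `n_lost` are hypothetical, not taken from `alcolismo.csv`):

```r
# Hypothetical raw values: the last entry is not a valid number
raw_drinks <- c("0.5", "2", "6", "n/a")

# as.numeric() returns NA for entries it cannot parse (warning silenced here)
converted <- suppressWarnings(as.numeric(raw_drinks))

# Count the NAs created by the conversion itself
n_lost <- sum(is.na(converted) & !is.na(raw_drinks))
n_lost
```

If the count is greater than zero, the offending raw entries can be inspected with `raw_drinks[is.na(converted)]` before deciding how to handle them.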

3. Check for the presence of missing data and impute if necessary

```
summary(alcolismo)
table(is.na(alcolismo))
# There are no missing data
```
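`table(is.na(...))` gives a single overall count; when some values are missing it usually helps to know which column they belong to. A small sketch on a toy data frame (the data frame `toy` is invented for illustration):

```r
# Hypothetical toy data frame with one missing value per column
toy <- data.frame(
  drinks = c(1, NA, 3),
  sesso  = factor(c("1", "2", NA))
)

# is.na() on a data frame returns a logical matrix; colSums() counts TRUE per column
na_per_column <- colSums(is.na(toy))
na_per_column
```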

4. How many men (sesso = 2) drink more than 2 pints per day?

```
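# subset() already returns a data frame and evaluates column names inside it
uomini_pinte <- subset(alcolismo, sesso == "2" & drinks > 2)
# The number of rows is the number of men drinking more than 2 pints per day
nrow(uomini_pinte)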
```
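The same count can be obtained without building an intermediate data frame, by summing the logical condition directly. A sketch on invented data (the data frame `toy` is hypothetical):

```r
# Hypothetical toy data: sesso "2" codes men, as in the exercise
toy <- data.frame(
  sesso  = factor(c("1", "2", "2", "2")),
  drinks = c(3, 1, 2.5, 4)
)

# TRUE counts as 1, so sum() returns the number of rows meeting both conditions
n_men <- sum(toy$sesso == "2" & toy$drinks > 2)
n_men
```

If either column contained missing values, `sum(..., na.rm = TRUE)` would be needed to avoid an `NA` result.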

5. Create a new variable called RangeDrinks based on the `drinks` variable, categorizing its values into the ranges 0-4, 5-8 and 9-20, named "low", "medium" and "alcoholism" respectively

```
# Intervals are [0,4], (4,8] and (8,20]; the labels are assigned directly
alcolismo$RangeDrinks <- cut(alcolismo$drinks, breaks = c(0, 4, 8, 20),
                             labels = c("low", "medium", "alcoholism"),
                             include.lowest = TRUE)
```
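With these breaks, `cut()` builds right-closed intervals, so the boundary values 4 and 8 stay in the lower class and a non-integer value such as 4.5 falls into "medium". A short sketch with hand-picked values (the vector `x` is made up for illustration):

```r
# Hypothetical values chosen to sit on and around the class boundaries
x <- c(0, 4, 4.5, 8, 9, 20)

groups <- cut(x, breaks = c(0, 4, 8, 20),
              labels = c("low", "medium", "alcoholism"),
              include.lowest = TRUE)

# Show each value next to the class it was assigned to
data.frame(x, groups)
```

Without `include.lowest = TRUE`, the value 0 would fall outside the first interval and become `NA`.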

6. Calculate the mean of aspartate aminotransferase and represent its median values for the newly created categories

```
# Mean
mean(alcolismo$aspartato.amminotrasferasi)
# Result: 24.64
# To represent the median values, use a boxplot
boxplot(alcolismo$aspartato.amminotrasferasi ~ alcolismo$RangeDrinks, main="Aspartate Levels in the Three RangeDrinks Categories", xlab="Groups", ylab="Aspartate", ylim=c(0,50))
```

```
# prettier Boxplot
library(ggplot2)
library(ggpubr)
ggboxplot(alcolismo, x = "RangeDrinks", y = "aspartato.amminotrasferasi",
          color = "RangeDrinks", palette = "jco",
          add = "jitter", ylim=c(0,50)) + ggtitle("Aspartate levels in the three RangeDrinks categories")
```
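The boxplot shows the medians graphically; the numeric values per group can also be computed directly. A minimal sketch on invented data (the data frame `toy` and its column `ast` are hypothetical stand-ins for the real dataset):

```r
# Hypothetical toy data with two of the three classes
toy <- data.frame(
  RangeDrinks = factor(c("low", "low", "low", "medium", "medium", "medium"),
                       levels = c("low", "medium")),
  ast = c(18, 20, 25, 22, 30, 41)
)

# tapply() applies median() to the values of each factor level
group_medians <- tapply(toy$ast, toy$RangeDrinks, median)
group_medians
```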

7. Verify with the most appropriate test (justifying the choice) if there are significant differences between the three groups of RangeDrinks in aspartate aminotransferase levels.

```
# ANOVA is the most suitable test provided that the assumptions of normality and homoscedasticity are met
# Let's check for normality using the Shapiro test
shapiro.test(alcolismo$aspartato.amminotrasferasi)
```

Since the variable is not normally distributed (p-value < 0.05, so the null hypothesis of normality is rejected), I apply a non-parametric test, specifically the Kruskal-Wallis test (the non-parametric counterpart of ANOVA). I could also check for homoscedasticity, but it is not necessary because normality has already been rejected.

`bartlett.test(aspartato.amminotrasferasi ~ RangeDrinks, data=alcolismo)`
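Strictly speaking, the ANOVA normality assumption concerns the model residuals rather than the raw variable, so a more targeted check fits the model first and tests its residuals. A sketch on simulated data (the data frame `toy`, its columns and the seed are assumptions made for illustration, not the exercise data):

```r
set.seed(1)

# Simulated example: three groups with different means and normal errors
toy <- data.frame(
  group = factor(rep(c("low", "medium", "alcoholism"), each = 30)),
  y     = c(rnorm(30, 20, 3), rnorm(30, 25, 3), rnorm(30, 30, 3))
)

# Fit the one-way ANOVA and test the normality of its residuals
fit <- aov(y ~ group, data = toy)
res_test <- shapiro.test(residuals(fit))
res_test$p.value
```

A p-value below 0.05 here would point towards a non-parametric alternative such as Kruskal-Wallis.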

```
# I apply the non-parametric Kruskal-Wallis test:
kruskal.test(aspartato.amminotrasferasi ~ RangeDrinks, data = alcolismo)
```

The p-value is < 0.05, so there are significant differences between the three groups of RangeDrinks: low, medium, and alcoholism. Let's determine which groups differ by applying the post-hoc Dunn's Test.

```
library(FSA)
dunnTest(aspartato.amminotrasferasi ~ RangeDrinks,
data=alcolismo,
method="bonferroni")
```
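If the FSA package is not available, a different but related post-hoc procedure can be run with base R: pairwise Wilcoxon rank-sum tests with Bonferroni correction. This is not Dunn's test and may give somewhat different p-values; the sketch below uses simulated data (the data frame `toy` and the seed are assumptions):

```r
set.seed(2)

# Simulated example with three groups
toy <- data.frame(
  group = factor(rep(c("low", "medium", "alcoholism"), each = 20),
                 levels = c("low", "medium", "alcoholism")),
  y     = c(rnorm(20, 20, 3), rnorm(20, 24, 3), rnorm(20, 30, 3))
)

# All pairwise comparisons, with Bonferroni-adjusted p-values
pw <- pairwise.wilcox.test(toy$y, toy$group, p.adjust.method = "bonferroni")

# Lower-triangular matrix of adjusted p-values (rows vs columns)
pw$p.value
```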

8. Represent the corpuscular volume in a graph for the three groups.

`boxplot(alcolismo$volume.corpuscolare ~ alcolismo$RangeDrinks, main="Corpuscular Volume in the Three RangeDrinks Categories", xlab="Groups", ylab="Corpuscular volume", ylim=c(80,110))`

```
# prettier
library(ggplot2)
library(ggpubr)
ggboxplot(alcolismo, x = "RangeDrinks", y = "volume.corpuscolare",
          color = "RangeDrinks", palette = "jco",
          add = "jitter", ylim=c(80,110)) + ggtitle("Corpuscular volume in the three groups")
```

9. Evaluate the relationship between corpuscular volume and the number of pints consumed per day.

`summary(modello<- glm(volume.corpuscolare ~ drinks, data= alcolismo, family= "gaussian"))`

p-value = 2.92e-09 (< 0.05): there is a significant linear relationship between corpuscular volume and the number of pints consumed per day.

B0 = 88.72: the intercept, i.e. the expected corpuscular volume (Y) when the number of pints (X) is 0.

B1 = 0.42: the slope, i.e. the expected change in corpuscular volume for each additional pint per day.
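A `glm()` with `family = "gaussian"` and the default identity link is an ordinary linear regression, so `lm()` returns the same coefficients. A sketch on simulated data (the variables `drinks` and `mcv`, the true coefficients and the seed are invented for illustration):

```r
set.seed(3)

# Simulated predictor and response with a known linear relationship
drinks <- runif(50, 0, 20)
mcv    <- 89 + 0.4 * drinks + rnorm(50, sd = 4)

fit_glm <- glm(mcv ~ drinks, family = gaussian)
fit_lm  <- lm(mcv ~ drinks)

# The two fits agree up to numerical tolerance
all.equal(coef(fit_glm), coef(fit_lm))
```

One practical difference is that `summary()` on the `lm` fit also reports R-squared, which the `glm` summary does not.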

10. Evaluate the change in corpuscular volume among the three groups of RangeDrinks using a linear regression model.

`summary(modello<- glm(volume.corpuscolare ~ RangeDrinks, data= alcolismo, family= "gaussian"))`

89.15: the intercept, i.e. the mean corpuscular volume in the reference group RangeDrinks low (when the dummy variables for medium and alcoholism are both 0).

3.3024: the difference in mean corpuscular volume between RangeDrinks medium and RangeDrinks low (p-value < 0.05, so the difference is significantly different from 0).

3.3738: the difference in mean corpuscular volume between RangeDrinks alcoholism and RangeDrinks low (p-value < 0.05, so the difference is significantly different from 0).
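This reading of the coefficients follows from R's default treatment (dummy) coding of factors: the intercept equals the mean of the first level and every other coefficient equals that group's mean minus the reference mean. A sketch that verifies this on simulated data (the data frame `toy`, its values and the seed are assumptions):

```r
set.seed(4)

# Simulated example with "low" as the first (reference) level
toy <- data.frame(
  RangeDrinks = factor(rep(c("low", "medium", "alcoholism"), each = 15),
                       levels = c("low", "medium", "alcoholism")),
  mcv = c(rnorm(15, 89, 4), rnorm(15, 92, 4), rnorm(15, 93, 4))
)

fit <- lm(mcv ~ RangeDrinks, data = toy)

# Group means computed by hand
group_means <- sapply(split(toy$mcv, toy$RangeDrinks), mean)
diffs <- group_means[c("medium", "alcoholism")] - group_means[["low"]]

coef(fit)                        # intercept, medium - low, alcoholism - low
c(group_means[["low"]], diffs)   # the same three numbers
```

Changing the reference group with `relevel()` would change which differences the model reports, but not the fitted group means.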

11. Evaluate the association between corpuscular volume and aspartate aminotransferase and represent the association with the appropriate graph.

```
# First, I check for normality with the Shapiro test:
shapiro.test(alcolismo$aspartato.amminotrasferasi)
shapiro.test(alcolismo$volume.corpuscolare)
# p-value < 0.05: the null hypothesis of normality is rejected, so I use the Spearman method
cor.test(alcolismo$aspartato.amminotrasferasi, alcolismo$volume.corpuscolare, method = "spearman")
# p-value = 0.049 < 0.05 (barely): there is a statistically significant association between aspartate and corpuscular volume
# A scatter plot is the appropriate graph for two continuous variables
plot(alcolismo$aspartato.amminotrasferasi, alcolismo$volume.corpuscolare)
```
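Spearman's coefficient is the Pearson correlation computed on the ranks of the two variables, which is why it does not require normality and captures any monotonic relationship. A sketch that checks this equivalence on simulated data (the vectors `x` and `y` and the seed are invented for illustration):

```r
set.seed(5)

# Simulated skewed predictor and a monotonic, non-linear response
x <- rexp(40)
y <- x^2 + rnorm(40, sd = 0.5)

rho_spearman <- cor(x, y, method = "spearman")
rho_ranks    <- cor(rank(x), rank(y), method = "pearson")

# The two values coincide
c(spearman = rho_spearman, pearson_on_ranks = rho_ranks)
```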