1- Load the dataset "CENTENARI BIOCHIMICA.xls" from the website.
Save the data in tab-delimited text format (.txt), then import it with the read.table function.
Specify that the file has a header and that the columns are separated by tabs, using sep="\t".
Note: If you import the data using a graphical interface, be careful about MISSING DATA. They should be declared in the import options screen. In the dropdown menu for NA, choose how the missing data are indicated. In this example they are indicated as "NA", but in other cases they may be represented as empty cells. The command below treats both as missing.
DATA <- read.table("DIRECTORY/DATI/CENTENARI_BIOCHIMICA.txt", header = TRUE, sep = "\t", na.strings = c("NA", ""))
Note: In this example, a DATA object has been created to hold the data. If you import the CENTENARI_BIOCHIMICA file using the graphical interface of R, an object named CENTENARI_BIOCHIMICA will be created instead of DATA. In that case, either modify the subsequent commands accordingly or copy CENTENARI_BIOCHIMICA into an object named DATA.
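As a minimal sketch of how na.strings behaves, the following writes a small tab-delimited file to a temporary location and reads it back. The file contents and the object names (tmp, toy, na_counts) are invented for illustration and are not part of the exercise dataset.

```r
# Hypothetical toy file: one explicit "NA" and one empty cell
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tFUMO\tGLICEMIA",
             "1\tSI\t90",
             "2\tNO\tNA",
             "3\t\t85"), tmp)

# Treat both "NA" and empty cells as missing values
toy <- read.table(tmp, header = TRUE, sep = "\t", na.strings = c("NA", ""))

# Count the missing values found in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

If the empty string were left out of na.strings, the empty FUMO cell would be read as an empty text value instead of NA.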
2- Assign the appropriate types "factor" and "numeric" to the variables.
First, check the variable types using:
str(DATA)
Most of the variables are numeric or integer, which is correct. The only ones with the wrong type are Gruppo, uo, pid, FUMO, INFARTO, INSUFF_RENE, DIABETE, TEST_1_DIABETE, TEST_2_DIABETE.
I convert them to factors (categorical variables). This is an important step, because R will then handle them appropriately during the analysis.
DATA$Gruppo <- as.factor(DATA$Gruppo)
DATA$uo <- as.factor(DATA$uo)
DATA$pid <- as.factor(DATA$pid)
DATA$FUMO <- as.factor(DATA$FUMO)
DATA$INFARTO <- as.factor(DATA$INFARTO)
DATA$INSUFF_RENE <- as.factor(DATA$INSUFF_RENE)
DATA$DIABETE <- as.factor(DATA$DIABETE)
DATA$TEST_1_DIABETE <- as.factor(DATA$TEST_1_DIABETE)
DATA$TEST_2_DIABETE <- as.factor(DATA$TEST_2_DIABETE)
# Check that everything is correct by typing again
str(DATA)
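The same conversion can also be written more compactly by listing the column names once and applying as.factor to all of them. This is only a sketch on a toy data frame with invented values; cat_vars is a hypothetical helper name.

```r
# Toy data frame standing in for DATA (values are invented)
toy <- data.frame(Gruppo = c("CENT", "FIGLI", "CTRL", "CENT"),
                  FUMO = c(0, 1, 0, 1),
                  age = c(101, 70, 68, 100),
                  stringsAsFactors = FALSE)

# Names of the columns that should be categorical
cat_vars <- c("Gruppo", "FUMO")

# Convert all of them to factors in a single step
toy[cat_vars] <- lapply(toy[cat_vars], as.factor)

# Gruppo and FUMO are now factors, age is still numeric
str(toy)
```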
3- Proceed with the analysis of missing data, and if necessary, impute the data or choose an alternative approach while providing a justification.
To visualize and impute missing data, I load the VIM and mice packages.
I restructure the dataset so that the categorical variables come first, followed by the numeric ones.
# Work on a copy of the imported data
DATI <- DATA
colnames(DATI)
DATI2 <- data.frame(DATI[, 1:3], DATI[, 31:36], DATI[, 4:30])
head(DATI2)
library(VIM)
library(mice)
# Plot the proportion of missing values per variable and the missingness patterns
aggr_plot <- aggr(DATI2, col = c('navyblue', 'red'), numbers = TRUE,
sortVars = TRUE, labels = names(DATI2), cex.axis = 0.7, gap = 3,
ylab = c("Histogram of missing data", "Pattern"))
# Impute the numeric variables with predictive mean matching (one imputed dataset)
tempData <- mice(DATI[, 4:30], m = 1, method = 'pmm', seed = 100)
summary(tempData)
completedData <- complete(tempData, 1)
The plot shows which variables are most affected by missing data. Since the number of missing values is not high, I can comfortably proceed with imputing them.
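To back the statement that the number of missing values is not high with numbers, the missing values can also be counted directly in base R. The sketch below uses a toy data frame with invented values; on the real data, the same calls would be applied to DATI2.

```r
# Toy data frame with a few missing values (invented numbers)
toy <- data.frame(GLICEMIA = c(90, NA, 85, 110),
                  COL_TOT = c(200, 180, NA, NA),
                  BMI = c(22, 25, 27, 24))

# Number of missing values per variable
n_missing <- colSums(is.na(toy))

# Percentage of missing values per variable
pct_missing <- round(100 * colMeans(is.na(toy)), 1)

# Number of rows without any missing value
complete_rows <- sum(complete.cases(toy))

print(data.frame(n_missing, pct_missing))
print(complete_rows)
```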
4- Calculate the mean, median, maximum, and minimum for each numeric variable in the dataset, and calculate the relative frequencies for categorical variables.
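If the variables have been correctly reclassified, the summary function reports the minimum, quartiles, median, mean, and maximum for each numeric variable, and the absolute counts for each factor. Relative frequencies can then be obtained by passing a table of counts to prop.table. The R commands will therefore be:
summary(DATI2)
prop.table(table(DATI2$Gruppo))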
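As a small worked example of relative frequencies, the sketch below uses an invented factor with the three group labels. The vector gruppo and its values are made up for illustration.

```r
# Invented group labels, for illustration only
gruppo <- factor(c("CENT", "CENT", "FIGLI", "CTRL",
                   "FIGLI", "CENT", "CTRL", "FIGLI"))

# Absolute frequencies (counts per level)
abs_freq <- table(gruppo)

# Relative frequencies: each count divided by the total
rel_freq <- prop.table(abs_freq)

print(abs_freq)
print(round(rel_freq, 3))
```

The relative frequencies always sum to 1, which is a quick check that no level has been lost.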
5- Choose the most suitable graph to visualize the relationship between the 'age' and 'BMI' variables.
As a first step, consider the nature of the variables: both are continuous numeric variables. Therefore, the most appropriate graph to visualize the relationship between them is the scatterplot. You can create a scatterplot using the basic command:
plot(DATI$age,DATI$BMI)
However, you can achieve higher-quality graphs using the 'ggplot2' package.
The basic syntax could be:
library(ggplot2)
# Using the DATI data frame, with a linear regression line and its standard error band
ggplot(DATI, aes(x = age, y = BMI)) +
geom_point() +
geom_smooth(method = "lm")
# Same plot without the standard error band
ggplot(DATI, aes(x = age, y = BMI)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Let's say, hypothetically, you want to color the subjects based on a categorical variable, such as the 'Gruppo' variable. The syntax would become:
library(ggplot2)
ggplot(DATI, aes(x = age, y = BMI, color = Gruppo)) +
geom_point() +
geom_smooth(method = "lm")
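The regression line drawn by geom_smooth can also be reproduced in base R, together with a correlation coefficient that summarizes the relationship. This is a sketch on simulated data: the age and BMI vectors below are randomly generated and say nothing about the real dataset.

```r
# Simulated data, for illustration only
set.seed(1)
age <- round(runif(50, min = 60, max = 105))
BMI <- 30 - 0.05 * age + rnorm(50, sd = 2)

# Fit a simple linear regression of BMI on age
fit <- lm(BMI ~ age)

# Scatterplot with the fitted regression line
plot(age, BMI, xlab = "age", ylab = "BMI")
abline(fit, col = "blue")

# Pearson correlation, ignoring incomplete pairs if there are any
r <- cor(age, BMI, use = "complete.obs")
print(coef(fit))
print(r)
```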
6- Choose the most suitable graph to visualize the distribution of the variable 'GLICEMIA'.
The most appropriate graph to visualize the distribution of a continuous variable is either a histogram or a density plot.
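hist(DATI$GLICEMIA)
# density() stops on missing values unless na.rm = TRUE
plot(density(DATI$GLICEMIA, na.rm = TRUE))
Using ggplot2:
p <- ggplot(DATI, aes(x = GLICEMIA)) + geom_density()
p
# Add a dashed line at the mean (in ggplot2 versions before 3.4.0, use size instead of linewidth)
p + geom_vline(aes(xintercept = mean(GLICEMIA, na.rm = TRUE)), color = "blue", linetype = "dashed", linewidth = 1)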
7- Does the variable 'GLICEMIA' have a similar distribution across different operating units (uo)?
One option is sm.density.compare() from the 'sm' package.
library(sm)
sm.density.compare(DATI$GLICEMIA, as.factor(DATI$uo))
Please note that if the grouping variable is already a factor, there's no need to use the as.factor function.
With ggplot2:
p <- ggplot(DATI, aes(x = GLICEMIA, color = uo)) + geom_density()
p
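If neither package is available, the same comparison can be sketched in base R by estimating one density per group and overlaying the curves. The data below are simulated, and the unit labels A, B, and C are hypothetical.

```r
# Simulated data with three hypothetical operating units
set.seed(2)
toy <- data.frame(uo = factor(rep(c("A", "B", "C"), each = 30)),
                  GLICEMIA = c(rnorm(30, 90, 10),
                               rnorm(30, 100, 12),
                               rnorm(30, 95, 8)))
toy$GLICEMIA[c(5, 40)] <- NA  # a couple of missing values

# One density estimate per unit, ignoring missing values
dens <- lapply(split(toy$GLICEMIA, toy$uo), density, na.rm = TRUE)

# Common axis limits so that all curves fit in the same plot
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

# Draw the first curve, then add the others on top
plot(dens[[1]], xlim = xlim, ylim = ylim, col = 1, lwd = 2,
     main = "GLICEMIA by uo", xlab = "GLICEMIA")
for (i in seq_along(dens)[-1]) lines(dens[[i]], col = i, lwd = 2)
legend("topright", legend = names(dens), col = seq_along(dens), lwd = 2)
```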
8- Choose the best graph to visualize how subjects are distributed based on operating units (uo).
A natural choice for visualizing how subjects are split into percentages is the pie chart.
pie(table(DATI$uo))
# If you want to show percentages, create the 'percentuali' object by calculating them
percentuali <- round(prop.table(table(DATI$uo)) * 100)
# To add labels, combine the level names of DATI$uo with 'percentuali' using paste0()
labels <- paste0(names(percentuali), " ", percentuali, "%")
# Now create the pie chart again, this time adding labels
pie(table(DATI$uo), labels = labels)
9- Depict the average levels of 'GLICEMIA' among the groups of subjects CENT, FIGLI, and CTRL using a barplot.
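barplot(tapply(DATI$GLICEMIA, DATI$Gruppo, mean, na.rm = TRUE))
With ggplot2, compute the group means first and then plot them with geom_col(). Plotting the raw values with geom_bar(stat = "identity") would stack the observations and show sums rather than means.
Z <- levels(as.factor(DATI$Gruppo))
Y <- tapply(DATI$GLICEMIA, DATI$Gruppo, mean, na.rm = TRUE)
df <- data.frame(X = Z,
Y = as.numeric(Y))
ggplot(df, aes(x = X, y = Y)) +
geom_col()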
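An alternative way to build the table of group means is aggregate(), which returns a data frame directly. This is a sketch on invented values; medie is a hypothetical object name.

```r
# Toy data with one missing value (invented numbers)
toy <- data.frame(Gruppo = factor(c("CENT", "CENT", "FIGLI",
                                    "FIGLI", "CTRL", "CTRL")),
                  GLICEMIA = c(90, 100, 95, NA, 110, 120))

# Mean of GLICEMIA per group; the formula interface drops rows with NA by default
medie <- aggregate(GLICEMIA ~ Gruppo, data = toy, FUN = mean)
print(medie)

# Barplot of the group means
barplot(medie$GLICEMIA, names.arg = as.character(medie$Gruppo),
        ylab = "Mean GLICEMIA")
```

The resulting data frame can be passed to ggplot() with geom_col() in the same way as df above.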
10- Compare the median cholesterol levels among the three groups of subjects using the most suitable graph.
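The boxplot is the most suitable graph here, because the thick line inside each box marks the group median.
boxplot(COL_TOT ~ Gruppo, data = DATI)
With ggplot2:
ggplot(DATI, aes(x = Gruppo, y = COL_TOT)) + geom_boxplot()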