1- Import the database “CENTENARI e BIOCHIMICA”. Remove samples with missing data, save the result as a new dataset called DATI, and reclassify the factor variables mistakenly interpreted by R as numeric.
I use the read.table() function after saving the data as a tab-delimited text file (.txt). I specify that the file has a header and that the column separator is a tab with sep="\t".
Note: If you import the data through the graphical interface, be careful about MISSING DATA. They MUST BE INDICATED in the "import options" drop-down menu, under NA. Choose how missing data are coded: in this example they are marked as "NA", but in other cases they may be empty cells.
DATA <- read.table("DIRECTORY/DATI/CENTENARI_BIOCHIMICA.txt", header=TRUE, sep="\t", na.strings=c("NA", ""))
Note: In this example, an object called DATA was created that contains the data. If you import the CENTENARI_BIOCHIMICA file using R’s graphical interface, the object will be called CENTENARI_BIOCHIMICA instead of DATA. Of course, the subsequent commands would then need to be adjusted, or you can create a copy called DATA that contains CENTENARI_BIOCHIMICA.
I first check the variable types with:
str(DATA)
Most of the variables are numeric or integer. The only incorrectly typed ones are Gruppo, uo, pid, FUMO, INFARTO, INSUFF_RENE, DIABETE, TEST_1_DIABETE, and TEST_2_DIABETE. I convert these to factors, so that R treats them as categorical during analysis.
DATA$Gruppo <- as.factor(DATA$Gruppo)
DATA$uo <- as.factor(DATA$uo)
DATA$pid <- as.factor(DATA$pid)
DATA$FUMO <- as.factor(DATA$FUMO)
DATA$INFARTO <- as.factor(DATA$INFARTO)
DATA$INSUFF_RENE <- as.factor(DATA$INSUFF_RENE)
DATA$DIABETE <- as.factor(DATA$DIABETE)
DATA$TEST_1_DIABETE <- as.factor(DATA$TEST_1_DIABETE)
DATA$TEST_2_DIABETE <- as.factor(DATA$TEST_2_DIABETE)
Check again:
str(DATA)
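As an aside, the nine separate as.factor() calls above can also be collapsed into a single step with lapply(); a minimal sketch, using the column names listed earlier:

```r
# Convert all the categorical columns to factors in one step
factor_cols <- c("Gruppo", "uo", "pid", "FUMO", "INFARTO",
                 "INSUFF_RENE", "DIABETE", "TEST_1_DIABETE", "TEST_2_DIABETE")
DATA[factor_cols] <- lapply(DATA[factor_cols], as.factor)
str(DATA)  # verify that the columns are now factors
```

This is equivalent to the line-by-line conversions and less error-prone when there are many columns.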
Now, remove missing data:
DATI <- na.omit(DATA)
dim(DATI)
2- Display a graph to show if the distribution of the variable calcium is different between centenarians, their children, and controls.
There are at least three types of graphs we can use to visualize the distribution of a variable: the density plot, the boxplot, and the histogram. We will use the density plot.
The challenge is that I want to visualize the distributions of the variable “Calcio” for the three groups at the same time. To do this, I can use the sm.density.compare() function from the sm package. I install and load the package:
install.packages("sm")
library(sm)
Generate the graph:
sm.density.compare(DATI$CALCIO, DATI$Gruppo)
The graph shows how the calcium variable is distributed among the three groups: Centenarians, their children, and Controls.
However, the legend is missing, which makes it hard to understand the graph. I also want to refine the graph so that all necessary information is included.
What’s missing:
Titles for the axes
A title for the graph
Choice of colors
A legend
1 - Add axis labels
We can add axis labels with xlab="Calcium Levels" and ylab="Density". These are appended to the previous command, separated by commas.
sm.density.compare(DATI$CALCIO, DATI$Gruppo, xlab="Calcium Levels", ylab="Density")
2 - Add a title
To add a title, run the command title(main="Density-Plot of Calcium and Groups") on a new line immediately after the plotting call.
sm.density.compare(DATI$CALCIO, DATI$Gruppo, xlab="Calcium Levels", ylab="Density")
title(main="Density-Plot of Calcium and Groups")
3 - Choose colors
I need to determine how many colors to use by checking how many groups we have, i.e., the levels of the factor variable DATI$Gruppo. The command:
levels(DATI$Gruppo)
Returns "CENT", "CTRL", "FIGLI", so there are three groups. I can now decide to color Centenarians, their children, and Controls in black, red, and blue, respectively. I add the option col=c("red","black","blue") to the sm.density.compare() command:
sm.density.compare(DATI$CALCIO, DATI$Gruppo, xlab="Calcium Levels", ylab="Density", col=c("red","black","blue"))
title(main="Density-Plot of Calcium and Groups")
4 - Add the legend
Use the legend() command to add a legend. This command needs three pieces of information:
a. Where to place the legend on the graph ("topright")
b. The levels of the grouping variable (use levels(DATI$Gruppo))
c. The colors used for the groups: fill=c("red","black","blue")
The full command is:
legend("topright", levels(DATI$Gruppo), fill=c("red","black","blue"))
The final graph code will be:
sm.density.compare(DATI$CALCIO, DATI$Gruppo, xlab="Calcium Levels", ylab="Density", col=c("red","black","blue"))
title(main="Density-Plot of Calcium and Groups")
legend("topright", levels(DATI$Gruppo), fill=c("red","black","blue"))
3- There are five operational units (regions where the subjects were recruited). Some claim that, due to a technical problem, Unit 2 measures different glucose levels compared to the other units. Check if this hypothesis is valid.
This exercise involves determining whether subjects recruited by Unit 2 differ from the others in glucose levels. In other words, we compare Unit 2 with all the other subjects.
There are various strategies, but I decide to reduce it to a two-group comparison: Unit 2 versus all others.
I see that my data includes the variable DATI$uo, which indicates operational units. However, this is not binary as I want it to be (Unit 2 vs. all other units). So, I create a new variable:
DATI$uo2BIN <- as.factor(DATI$uo == "2")
levels(DATI$uo2BIN)
Now I want to rename the levels from "FALSE" and "TRUE" to 0 and 1:
levels(DATI$uo2BIN) <- c(0,1)
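To verify that the recoding worked as intended, a quick cross-tabulation of the original and the new variable can be used (a sketch):

```r
# Rows: original operational unit; columns: new binary variable.
# Every subject from unit 2 should fall in column "1", everyone else in "0".
table(DATI$uo, DATI$uo2BIN)
```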
I have now created a binary factor variable that distinguishes Unit 2 from the others.
I can now visualize the glucose levels in the two groups I just created. A boxplot or a density plot can show if the distributions are similar or not.
boxplot(DATI$GLICEMIA~DATI$uo2BIN)
Starting from the density-plot code of exercise 2, replace:
The variable to be visualized
The grouping variable
The legend with new terms
Modify colors (there are now only two groups)
sm.density.compare(DATI$GLICEMIA, DATI$uo2BIN, xlab="Glucose Levels", ylab="Density", col=c("red","black"))
title(main="Density-Plot of Glucose and Groups")
legend("topright", levels(DATI$uo2BIN), fill=c("red","black"))
The graphs seem to indicate a small difference in glucose values between Unit 2 and the others. The two distributions are slightly different. However, to be sure this difference is statistically significant, we need to apply a statistical test.
First: What test should I use—parametric or non-parametric?
I run a Shapiro test to check for normality:
shapiro.test(DATI$GLICEMIA)
Since the p-value is < 0.05, I conclude that glucose is not normally distributed. Therefore, I apply a non-parametric test for independent samples:
wilcox.test(DATI$GLICEMIA~DATI$uo2BIN, paired=FALSE)
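To quantify the difference the Wilcoxon test detected, the group medians can also be computed, for example with tapply() (a sketch):

```r
# Median glucose in each group (0 = other units, 1 = unit 2)
tapply(DATI$GLICEMIA, DATI$uo2BIN, median)
```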
The result shows that the difference is significant.
4- Study the relationship between glucose and diabetes. Use a regression model where Y=GLICEMIA, X=DIABETE. Interpret the result.
The regression model is:
model <- glm(DATI$GLICEMIA~DATI$DIABETE, family="gaussian")
summary(model)
The result shows an association between glucose levels and diabetic status. Specifically, the intercept (diabetic status = 0) corresponds to a mean glucose value of 86.35, and glucose increases by 90.68 units when moving from diabetic status 0 to 1. In other words, being diabetic is associated with a 90.68-unit increase in glucose.
5- Assuming that the glucose variable is normally distributed, calculate if the average glucose values differ between diabetics and non-diabetics. Compare the results with the previous exercise and comment.
If we assume that the glucose variable is normally distributed and we want to assess if there are significant differences between diabetics and non-diabetics, the appropriate statistical test is an independent samples t-test.
t.test(DATI$GLICEMIA~DATI$DIABETE)
The result tells us that the difference in average glucose values between the two groups is significant. The average glucose value in the healthy group is 86.35, while in the diabetic group it is 177.04. The difference between the average values of the two groups is 90.68 units. This result, when compared with the previous one, provides a better understanding and interpretation of the regression result.
6- Which variables are normally distributed?
To test if the variables in my dataset are normally distributed, I can use the Shapiro-Wilk test. I should apply it to all numerical variables that I assume have a normal distribution and want to test. This test is not applicable to categorical or qualitative variables. The formula to use is shapiro.test(VARIABLE) for all variables I want to test, but it could be a long process if there are many variables.
Alternatively, I can use the lapply function.
The syntax is lapply(DATAFRAME, function-to-apply).
The dataframe must be composed of variables where the function we want to run can be applied. In other words, the dataframe must contain only columns compatible with the function we want to use.
In this case, the shapiro.test function does not make sense for categorical variables, so we will create a dataframe that contains only the numerical variables to test for normality with shapiro.test.
I can do this by creating a new dataframe by selecting the numerical variables I am interested in. In this case, the numerical variables are between columns 4 and 30 inclusive, so the command will be:
lapply(DATI[,4:30], shapiro.test)
I will receive an output list of results that I will need to observe.
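Scanning the full list of outputs by eye is tedious; the p-values alone can be extracted with sapply() (a sketch, assuming the same columns 4 to 30):

```r
# Extract only the Shapiro-Wilk p-value from each test result
pvals <- sapply(DATI[, 4:30], function(x) shapiro.test(x)$p.value)
# Variables compatible with normality (p > 0.05)
names(pvals)[pvals > 0.05]
```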
The result suggests that the only normally distributed variable is COL_TOT, as it is the only one with a p-value > 0.05 (so the null hypothesis of normality is not rejected).
7- HOMA-IR indicates insulin resistance. Do you expect that insulin resistance values will differ among diabetics? How could you test this hypothesis?
In theory, I expect there to be differences in insulin resistance between diabetics and non-diabetics. There are essentially two ways to test this hypothesis:
1- I evaluate whether there are differences in the mean or median values of HOMA-IR between diabetics and non-diabetics using an independent samples t-test or a non-parametric test if the variable does not meet the assumptions for the t-test.
2- I evaluate whether there is a relationship between the continuous variable HOMA-IR and the binary variable DIABETE. In this case, I apply a regression model, and depending on whether the binary variable is considered dependent (i.e., Y) or independent (i.e., X), we speak of logistic regression (Y=DIABETE) or simple linear regression (X=DIABETE).
The interpretation of the result will, of course, change depending on what is being evaluated. In particular, if DIABETE=X, I am evaluating how insulin resistance changes when moving from non-diabetics to diabetics. If DIABETE=Y, I am evaluating the risk of being diabetic versus not being diabetic for each unit change in HOMA-IR.
I apply the three tests:
t.test(DATI$'HOMA-IR'~DATI$DIABETE)
Note the use of quotes in DATI$'HOMA-IR'. They are essential because the name contains a hyphen, a symbol R would otherwise interpret as a minus sign. When a name contains special symbols, quotes (or backticks) tell R to treat it as a single name.
In the regression model, the command is:
model<-glm(DATI$'HOMA-IR'~DATI$DIABETE, family="gaussian")
summary(model)
The result shows that the relationship between HOMA-IR and DIABETE is significant, and there is an increase of 3.82 units of HOMA-IR when moving from non-diabetic to diabetic.
Note that the regression coefficient (3.82) is exactly the difference between the group means estimated by the t-test.
model<-glm(DATI$DIABETE~DATI$'HOMA-IR', family="binomial")
summary(model)
This result gives us information on the risk (on the log-odds scale) of being diabetic for each unit increase in HOMA-IR.
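The logistic coefficient is on the log-odds scale; exponentiating it gives an odds ratio, which is easier to interpret (a sketch, applied to the binomial model just fitted):

```r
# Odds ratio for diabetes per unit increase in HOMA-IR
exp(coef(model))
```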
8- A rapid test has been proposed to identify diabetic subjects. This test is cheap and fast and is called test_2, while test_1 consists of a laboratory measurement of glucose levels. Test_1 represents the gold standard. Calculate the sensitivity, specificity of test_2, and the predictive value of the test.
The exercise involves comparing a field test with a validated laboratory test. I determine from the comparison of the two tests the false positives and false negatives, and then calculate sensitivity and specificity. To do this, I need to use the epiR package.
install.packages("epiR")
library(epiR)
The function is applied to a 2x2 table cross-tabulating the two tests, from which true/false positives and true/false negatives can be read off.
This table can be obtained using the table() function.
epi.tests(table(DATI$TEST_2_DIABETE,DATI$TEST_1_DIABETE))
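One caveat, per the epiR documentation: epi.tests() assumes the 2x2 table has test-positives in the first row and disease-positives in the first column. If the test variables are coded 0/1, table() puts 0 first, so the table may need reordering before the call (a sketch, assuming positives are coded 1):

```r
# Build the 2x2 table (rows: test_2, columns: gold-standard test_1)
tab <- table(DATI$TEST_2_DIABETE, DATI$TEST_1_DIABETE)
# Put the "1" (positive) row and column first, as epi.tests() expects
tab <- tab[2:1, 2:1]
epi.tests(tab)
```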
9- Save the dataset DATI in a tab-delimited file.
I use the write.table() function.
write.table(DATI,"filename.txt",sep="\t", col.names=TRUE,row.names=FALSE,quote=FALSE)
Let's explain the elements of the command:
write.table() is the function; the first argument is the dataframe I want to write to a file.
In quotes I put the name to save the file with; a destination path can be added, just as in the command used to load files.
sep="\t" uses the tab symbol as the separator.
col.names=TRUE includes the column names, while row.names=FALSE omits the row names. In this case I have not assigned any names to the rows, so R would simply number them, adding an extra column to the file that I don't need.
Finally, quote=FALSE tells R not to enclose each individual cell value in quotes.
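A quick way to check that the file was written correctly is to read it back and compare dimensions (a sketch; adjust the file name/path as needed):

```r
# Re-import the file just written and confirm nothing was lost
CHECK <- read.table("filename.txt", header = TRUE, sep = "\t")
dim(CHECK)  # should match dim(DATI)
```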
10- Merge the two datasets: 1) dataset CENTENARI and BIOCHIMICA, and 2) dataset DATI, using pid as the row identifier.
I use the merge() function.
DATI_MERGE<-merge(DATI,CENTENARI_BIOCHIMICA,by.x="pid",by.y="pid",all=T)
I create a new object DATI_MERGE, which is the merge of the two files.
The two files have been integrated using pid as the identifier.
11- Create a graph to describe whether the median values of smoking (FUMO) are different from the median values of heart attack (INFARTO).
This is a trick question: FUMO and INFARTO are categorical (factor) variables, so a median is not defined for them and the comparison does not make sense.
12- Would you be able to estimate a probability measure of being diabetic for each unit increase in HOMA-IR values?
See exercise 7: the logistic regression model with family="binomial" estimates exactly this.