Download the CENTENARI e BIOCHIMICA database: urly.it/36kc
Perform the following exercises:
1- Import the "CENTENARI_BIOCHIMICA" database
Use the read.table function after saving the data as a tab-delimited text file (.txt).
Specify that the file has a header, and that the columns are separated by tabs with sep="\t".
Note: if you import the data through a graphical interface, pay attention to MISSING DATA. How they are represented MUST BE SPECIFIED on the import options screen, in the NA dropdown menu. In this example missing values are coded as NA, but in other files they may be empty cells. The command below treats both as missing.
DATA<- read.table("DIRECTORY/DATA/CENTENARI_BIOCHIMICA.txt",header=TRUE,sep="\t",na.strings=c("NA",""))
Note: this example creates an object DATA containing the data. If you import the CENTENARI_BIOCHIMICA file through R's graphical interface, the object will be called CENTENARI_BIOCHIMICA instead of DATA. Either adapt the following commands accordingly, or create a copy named DATA that contains CENTENARI_BIOCHIMICA (for example with DATA<-CENTENARI_BIOCHIMICA).
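As a rough illustration of the import step, the sketch below writes a tiny tab-delimited file and reads it back, so it can be run without the real dataset. The object name TOY, the column values, and the temporary file are invented for the example; only read.table and its header, sep, and na.strings arguments are taken from the exercise.

```r
# Write a tiny tab-delimited file to a temporary location (toy data, not the real dataset)
tmp <- tempfile(fileext = ".txt")
writeLines(c("pid\tUREA\tFUMO",
             "1\t35\t0",
             "2\tNA\t1",
             "3\t\t0"),
           tmp)

# Treat both the literal string NA and empty cells as missing values
TOY <- read.table(tmp, header = TRUE, sep = "\t", na.strings = c("NA", ""))

str(TOY)             # column types as R interpreted them
colSums(is.na(TOY))  # number of missing values per column
```

Checking colSums(is.na(...)) right after the import is a quick way to confirm that missing values were recognized rather than read as text.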
2- Remove samples with missing data, save a new dataset, and reclassify the categorical (factor) variables that R mistakenly read as numeric
First, check the type of variables with:
str(DATA)
Most of the variables are numeric or integer; the only misclassified ones are Gruppo, uo, pid, FUMO, INFARTO, INSUFF_RENE, which are categorical.
Convert them to factors. This is IMPORTANT because R will then handle them appropriately during the analysis.
DATA$Gruppo<-as.factor(DATA$Gruppo)
DATA$uo<-as.factor(DATA$uo)
DATA$pid<-as.factor(DATA$pid)
DATA$FUMO<-as.factor(DATA$FUMO)
DATA$INFARTO<-as.factor(DATA$INFARTO)
DATA$INSUFF_RENE<-as.factor(DATA$INSUFF_RENE)
#Check if everything is correct by typing again:
str(DATA)
#remove missing data
DATI<-na.omit(DATA)
dim(DATI)
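The same two steps (factor conversion, then removal of incomplete rows) can be sketched on a toy data frame. The values below are invented, and converting several columns at once with lapply is just one possible shortcut, not the method used in the exercise.

```r
# Toy data frame standing in for DATA (values are invented)
TOY <- data.frame(Gruppo = c(1, 2, 1, 2),
                  FUMO   = c(0, 1, NA, 1),
                  UREA   = c(35, 40, 38, NA))

# Convert several columns to factors in one step instead of one line per column
factor_cols <- c("Gruppo", "FUMO")
TOY[factor_cols] <- lapply(TOY[factor_cols], as.factor)

# Count missing values per column before dropping incomplete rows
colSums(is.na(TOY))

# Keep only rows with no missing value in any column
TOY_complete <- na.omit(TOY)
dim(TOY_complete)   # rows and columns that remain
```

Comparing dim() before and after na.omit shows how many samples were lost because of missing data.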
3- Visualize with a graph the relationship between UREA and CREATININE. Do you think the relationship is significant?
plot(DATI$UREA,DATI$CREATININA)
To check if the two variables have a significant relationship, use the correlation test:
cor.test(DATI$UREA,DATI$CREATININA)
The correlation is significant (p-value < 0.05), and the correlation coefficient is positive, equal to 0.6.
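The values quoted from the cor.test output can also be extracted directly from the result object. The sketch below uses simulated data in place of UREA and CREATININA, so its numbers will not match those of the real dataset.

```r
# Simulated data standing in for UREA and CREATININA (invented values)
set.seed(1)
x <- rnorm(50, mean = 40, sd = 8)
y <- 0.02 * x + rnorm(50, mean = 0.2, sd = 0.15)

res <- cor.test(x, y)   # Pearson correlation by default

res$estimate   # correlation coefficient r
res$p.value    # p-value for the null hypothesis r = 0
res$conf.int   # 95% confidence interval for r
```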
4- Try to highlight any correlations that exist between all the numerical variables in the dataset with an exploratory graph
First, identify all continuous numeric variables and create a new data frame containing them. Watch out for binary or factor variables.
Using the command:
head(DATI)
You can display the first rows of the dataset and check which continuous numeric variables can be selected for testing correlations. The variables in positions 1 to 3 are factors, so they are not suitable for correlation testing, and neither are the variables FUMO, INFARTO, and INSUFF_RENE in positions 31, 32, and 33.
In other words, the continuous numerical variables we are interested in range from positions 4 to 30 inclusive. Select and save them into a new dataframe called DATI2. Use square brackets to select:
DATI2<-DATI[,4:30]
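As an alternative to counting column positions by hand, the numeric columns can be picked by type. This is only a sketch on an invented data frame, and it assumes the categorical variables have already been converted to factors; a binary variable still stored as 0/1 numbers would be kept by mistake.

```r
# Toy data frame with mixed column types (invented values)
TOY <- data.frame(Gruppo = factor(c(1, 2, 1)),
                  UREA   = c(35, 40, 38),
                  BMI    = c(22.1, 27.5, 24.3),
                  FUMO   = factor(c(0, 1, 0)))

# TRUE for columns stored as numeric or integer, FALSE for factors
is_num <- sapply(TOY, is.numeric)

# Keep only the numeric columns
TOY2 <- TOY[, is_num, drop = FALSE]
names(TOY2)
```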
Use the corrgram package to display the correlations among all the variables in the new dataset DATI2. Install it first, then load it with library(). If the package is already installed, simply load it.
install.packages("corrgram")
library(corrgram)
corrgram(DATI2)
You can also use the corrplot package, which produces aesthetically different graphs. Unlike corrgram, its corrplot function takes as input the correlation matrix of all the variables rather than the raw data. The correlation matrix is obtained with the command:
cor(DATI2)
The final command, once the corrplot package has been installed and loaded, will be:
library(corrplot)
corrplot(cor(DATI2))
To improve the aesthetics or shape of the graph, you can study the options that suit you best by following the instructions at the link:
https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
5- Is there a relationship between GLICEMIA and BMI? Visualize it, calculate if the relationship is significant, and calculate how much glycemia increases for every unit increase in BMI
The exercise asks not only whether there is a significant relationship between the two variables but also how one variable changes as the other changes. The most appropriate method is regression. Both variables are numeric and continuous, so linear regression is used. The question asks how much glycemia (Y) changes as BMI (X) changes.
plot(DATI$BMI,DATI$GLICEMIA)
fit<-glm(DATI$GLICEMIA~DATI$BMI, family="gaussian")
summary(fit)
The relationship is significant (p-value < 0.05): glycemia increases by 1.8 units for each unit increase in BMI. The intercept is the expected glycemia when BMI equals zero (obviously, this value has no real meaning because no one has a BMI of zero).
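A Gaussian glm is the same model as ordinary linear regression fitted with lm. The sketch below checks this on simulated data; the data frame TOY and its values are invented, and the data= argument is used here for convenience rather than the DATI$ notation of the exercise.

```r
# Simulated data standing in for BMI and GLICEMIA (invented values)
set.seed(2)
TOY <- data.frame(BMI = rnorm(60, mean = 26, sd = 4))
TOY$GLICEMIA <- 50 + 1.8 * TOY$BMI + rnorm(60, sd = 10)

# Same Gaussian model fitted two ways
fit_glm <- glm(GLICEMIA ~ BMI, data = TOY, family = "gaussian")
fit_lm  <- lm(GLICEMIA ~ BMI, data = TOY)

coef(fit_lm)                            # intercept and slope
summary(fit_lm)$coefficients            # estimates, standard errors, t values, p-values
all.equal(coef(fit_glm), coef(fit_lm))  # the two fits give the same coefficients
```

The slope in the BMI row is the quantity the exercise asks for: the estimated change in GLICEMIA for each unit increase in BMI.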
6- Is there a relationship between infarction and smoking (are there more infarct patients among smokers or non-smokers)? (Pay attention to variables that are frequencies)
Apply the chi-squared test because you want to test whether there are differences in the frequencies with which an event occurs between two groups. You need to construct a 2x2 contingency table using the table() function. Then, use the chisq.test() function to test whether the difference is significant.
chisq.test(table(DATI$INFARTO,DATI$FUMO))
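Before testing, it helps to look at the contingency table itself. The sketch below builds one from invented data and shows the observed counts, the proportions within each smoking group, and the test result; the counts are made up and do not reflect the real dataset.

```r
# Toy data standing in for INFARTO and FUMO (invented values)
TOY <- data.frame(
  INFARTO = factor(rep(c(0, 1, 0, 1), times = c(30, 10, 20, 20))),
  FUMO    = factor(rep(c(0, 0, 1, 1), times = c(30, 10, 20, 20)))
)

tab <- table(INFARTO = TOY$INFARTO, FUMO = TOY$FUMO)
tab                          # observed counts
prop.table(tab, margin = 2)  # proportion of infarctions within each smoking group

res <- chisq.test(tab)       # applies Yates' continuity correction for 2x2 tables by default
res$p.value
```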
7- Is there a relationship between infarction and kidney failure? (Pay attention: If there is, I expect the frequencies of infarct patients to differ between those with and without kidney failure)
As in the previous exercise, you want to test whether the frequencies with which an event occurs differ between two groups, so you need to construct a 2x2 contingency table using the table() function. The solution below uses Fisher's exact test, with the fisher.test() function, rather than chisq.test(); it tests the same hypothesis and is the usual choice when some cells of the table have small expected counts.
fisher.test(table(DATI$INFARTO,DATI$INSUFF_RENE))
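One way to choose between the two tests is to look at the expected counts under independence. The sketch below uses an invented 2x2 table and the common rule of thumb that expected counts below 5 call for Fisher's exact test; the threshold is a convention assumed here, not something stated in the exercise.

```r
# Toy 2x2 table with a rare condition (invented counts)
tab <- matrix(c(40, 8, 3, 4), nrow = 2,
              dimnames = list(INFARTO = c("0", "1"),
                              INSUFF_RENE = c("0", "1")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
expected

# If any expected count is below 5, fall back to Fisher's exact test
if (any(expected < 5)) {
  res <- fisher.test(tab)
} else {
  res <- chisq.test(tab)
}
res$method
res$p.value
```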
8- Is there a relationship between AGE and GLICEMIA? If so, demonstrate it and try to predict how much glycemia increases with each passing year
As in exercise 5, the question is not only whether there is a significant relationship between the two variables but also how one variable changes as the other changes, so regression is the most appropriate method. Both variables are numeric and continuous, so linear regression is used. The question asks how much glycemia (Y) changes as age (X) changes.
fit<-glm(DATI$GLICEMIA~DATI$AGE, family="gaussian")
summary(fit)
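The prediction part of the exercise can be sketched with predict(). The data below are simulated and the data frame TOY is invented. The model is written with data = TOY because predict() with newdata needs plain variable names in the formula, which the DATI$ notation does not provide.

```r
# Simulated data standing in for AGE and GLICEMIA (invented values)
set.seed(3)
TOY <- data.frame(AGE = runif(80, min = 60, max = 105))
TOY$GLICEMIA <- 70 + 0.3 * TOY$AGE + rnorm(80, sd = 12)

# Using data = TOY keeps the plain variable names, which predict() needs
fit <- glm(GLICEMIA ~ AGE, data = TOY, family = "gaussian")

slope <- coef(fit)["AGE"]   # estimated change in GLICEMIA per additional year
slope

# Predicted values at two ages one year apart
pred <- predict(fit, newdata = data.frame(AGE = c(80, 81)))
diff(pred)                  # equals the slope
```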
9- Is there a relationship between INFARCTION and GLICEMIA? If so, demonstrate it with a regression model and interpret the result by describing what the values under "Estimate" in the output indicate
The most appropriate method of analysis to answer this question is regression.
Put this way, the question is generic: it mentions a relationship between two variables but does not specify whether you want to know how glycemia changes depending on having had an infarction, or how the risk of having an infarction changes with glycemia. In the first case, you would opt for linear regression with glycemia (continuous) as Y and infarction (binary) as X.
fit<-glm(DATI$GLICEMIA~DATI$INFARTO, family="gaussian")
summary(fit)
In this case the p-value is 0.00241, so the relationship is significant. The beta value, i.e., the estimate, i.e., the effect that infarction has on glycemia, tells us that glycemia is 12.794 points higher in the infarct group than in the non-infarct group.
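With a binary factor as X, the estimate is simply the difference between the two group means. The sketch below verifies this on simulated data; the values are invented and will not reproduce the 12.794 obtained on the real dataset.

```r
# Simulated data standing in for INFARTO and GLICEMIA (invented values)
set.seed(4)
TOY <- data.frame(INFARTO = factor(rep(c(0, 1), each = 40)))
TOY$GLICEMIA <- 95 + 12 * (TOY$INFARTO == "1") + rnorm(80, sd = 15)

fit <- glm(GLICEMIA ~ INFARTO, data = TOY, family = "gaussian")
coef(fit)   # "(Intercept)" is the mean of group 0, "INFARTO1" the difference

# Group means computed directly
means <- tapply(TOY$GLICEMIA, TOY$INFARTO, mean)
means["1"] - means["0"]   # matches the INFARTO1 estimate
```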
Alternatively, with infarction as Y, the model becomes a logistic regression:
fit<-glm(DATI$INFARTO~DATI$GLICEMIA, family="binomial")
summary(fit)
In this case the estimate is the change in the log-odds of having had an infarction for each unit increase in glycemia; its exponential, exp(Estimate), is the corresponding odds ratio.
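The estimate of a binomial model is on the log-odds scale, so it is usually exponentiated to obtain an odds ratio. The sketch below shows this on simulated data; the data frame TOY and the coefficients used to generate it are invented.

```r
# Simulated data standing in for GLICEMIA and INFARTO (invented values)
set.seed(5)
TOY <- data.frame(GLICEMIA = rnorm(200, mean = 100, sd = 20))
p <- plogis(-4 + 0.03 * TOY$GLICEMIA)   # true probability of infarction
TOY$INFARTO <- factor(rbinom(200, size = 1, prob = p))

fit <- glm(INFARTO ~ GLICEMIA, data = TOY, family = "binomial")

beta <- coef(fit)["GLICEMIA"]   # change in log-odds per unit of GLICEMIA
exp(beta)                       # odds ratio per unit of GLICEMIA
exp(10 * beta)                  # odds ratio for a 10-unit increase
```

An odds ratio above 1 means the odds of infarction rise as GLICEMIA increases; below 1, they fall.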
10- Is there a relationship between GLICEMIA and INSULIN? Analyze it with a linear regression model and comment on the results
fit<-glm(DATI$Insulina~DATI$GLICEMIA, family="gaussian")
summary(fit)
There is a relationship between insulin and glycemia; with this model, we test how insulin varies as glycemia varies.
The relationship is significant (p-value = 6.72e-05): insulin increases by 0.05 units for every unit increase in glycemia.