Download the "Centenari and Biochimica" database from:
urly.it/36kc
Do the exercise, copying the commands into a script.
Centenarians and Biochemistry Exercise
Perform the following exercises:
Import the "Centenari and Biochimica" database
Use the read.table function after saving the data as a tab-delimited text file (.txt). Indicate that the file has a header (header=TRUE) and that the column separator is a tab (sep="\t").
Note: When importing data with a graphical interface, be careful about MISSING DATA. The import options include a dropdown menu for NA, where you MUST specify how missing data are coded in the file (e.g., NA or empty cells).
DATA <- read.table("DIRECTORY/DATI/CENTENARI_BIOCHIMICA.txt", header=TRUE, sep="\t", na.strings="")
Note: In this example, a DATA object containing the data is created. If you import the CENTENARI_BIOCHIMICA file using R's graphical interface, an object named CENTENARI_BIOCHIMICA will be created instead of DATA. Adjust the following commands accordingly, or create a copy named DATA that contains CENTENARI_BIOCHIMICA.
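The effect of header, sep and na.strings can be checked on a small file before importing the real data. The sketch below is only an illustration: it writes a hypothetical three-row tab-delimited file to a temporary location (the values are invented and are not taken from the database) and reads it back.

```r
# Hypothetical toy file: tab-separated, with a header and one empty cell
tmp <- tempfile(fileext = ".txt")
writeLines(c("Gruppo\tAGE\tUREA",
             "CENT\t101\t45",
             "CTRL\t\t40",
             "FIGLI\t68\t38"), tmp)

# header = TRUE: the first line holds the column names
# sep = "\t": columns are separated by tabs
# na.strings = "": empty cells are read as NA
TOY <- read.table(tmp, header = TRUE, sep = "\t", na.strings = "")
str(TOY)
sum(is.na(TOY))   # number of missing cells
```

If the missing values in a file are coded differently, na.strings should list the codes actually used in that file.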
Assign the correct variable types ("factor" and "numeric")
First, check the variable types with:
str(DATA)
Most variables are numeric or integer. The ones that need to be changed to factors are Gruppo, uo, pid, FUMO, INFARTO, INSUFF_RENE. This matters because R treats factors as categorical variables in summaries, tables, plots and models, rather than as numbers or free text.
DATA$Gruppo <- as.factor(DATA$Gruppo)
DATA$uo <- as.factor(DATA$uo)
DATA$pid <- as.factor(DATA$pid)
DATA$FUMO <- as.factor(DATA$FUMO)
DATA$INFARTO <- as.factor(DATA$INFARTO)
DATA$INSUFF_RENE <- as.factor(DATA$INSUFF_RENE)
Check again to ensure everything is correct:
str(DATA)
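When several columns need the same conversion, the assignments can also be written more compactly. A minimal sketch on a hypothetical toy data frame (the values are invented; only some column names mirror the exercise), using lapply() over a vector of column names:

```r
# Hypothetical toy data: categorical codes stored as text or numbers
TOY <- data.frame(Gruppo = c("CENT", "CTRL", "FIGLI", "CTRL"),
                  FUMO = c(0, 1, 0, 1),
                  AGE = c(101, 70, 68, 72))

# names of the columns to convert
DA_CONVERTIRE <- c("Gruppo", "FUMO")

# apply as.factor() to each listed column and store the result back
TOY[DA_CONVERTIRE] <- lapply(TOY[DA_CONVERTIRE], as.factor)
str(TOY)
```

The result is the same as writing one as.factor() line per column; the loop form is simply shorter when the list of columns is long.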
How many samples and variables are in the dataset?
dim(DATA)
There are 461 samples and 33 variables.
Are there any missing data?
any(is.na(DATA))
Remove samples with missing data and create a new dataset called “DATI”
DATI <- na.omit(DATA)
dim(DATI)
There are 382 subjects with complete data, meaning 79 subjects were removed (461-382).
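na.omit() drops every row that contains at least one NA. To see which variables hold the missing values and how many rows are affected, colSums(is.na()) and complete.cases() can be used. A small sketch on invented data:

```r
# Hypothetical toy data with two missing cells in two different rows
TOY <- data.frame(AGE = c(101, NA, 68, 72),
                  UREA = c(45, 40, NA, 38),
                  GLICEMIA = c(90, 85, 100, 95))

colSums(is.na(TOY))              # missing values per variable
sum(!complete.cases(TOY))        # rows with at least one NA
TOY_COMPLETO <- na.omit(TOY)     # keep only the complete rows
nrow(TOY) - nrow(TOY_COMPLETO)   # number of rows removed
```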
How many subjects are in each group?
table(DATI$Gruppo)
# CENT: 79
# CTRL: 85
# FIGLI: 218
How many subjects are in each operational unit (uo)?
table(DATI$uo)
How many centenarians were recruited from operational unit 3?
table(DATI$Gruppo, DATI$uo)
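The answer can be read from the two-way table, but a single cell can also be extracted by name. A sketch on invented data, assuming the operational unit is coded with numbers as in the exercise:

```r
# Hypothetical toy data: group and operational unit of five subjects
TOY <- data.frame(Gruppo = c("CENT", "CENT", "CTRL", "FIGLI", "CENT"),
                  uo = c(3, 1, 3, 3, 3))

TAB <- table(TOY$Gruppo, TOY$uo)
TAB

# rows and columns of a table are indexed by name, given as character strings
TAB["CENT", "3"]

# the same count obtained directly from a logical condition
sum(TOY$Gruppo == "CENT" & TOY$uo == 3)
```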
What is the mean age of all subjects? And in the control group? And in the offspring (FIGLI) group?
mean(DATI$AGE)
# the mean age of each group is obtained with tapply(): its arguments are the variable of interest, the categorical variable that defines the groups, and the function to apply (here, mean)
tapply(DATI$AGE, DATI$Gruppo, mean)
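tapply() returns a named vector with one value per group, so a single group can be picked out by name. A minimal sketch on invented ages:

```r
# Hypothetical toy data: two subjects per group
TOY <- data.frame(AGE = c(100, 102, 70, 74, 66, 68),
                  Gruppo = factor(c("CENT", "CENT", "CTRL", "CTRL", "FIGLI", "FIGLI")))

# mean of AGE computed separately within each level of Gruppo
MEDIE <- tapply(TOY$AGE, TOY$Gruppo, mean)
MEDIE

# select the mean of one group by its name
MEDIE["CTRL"]
```

The third argument can be any function that summarises a vector, for example median or sd.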
Highlight the relationship between urea and creatinine with the most suitable graph
plot(DATI$UREA, DATI$CREATININA)
Show the distribution of the variable triglycerides with a graph
You can use a density plot, histogram, or boxplot.
plot(density(DATI$TRIGLICERIDI))
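The other two options mentioned above can be tried in the same way. The sketch below uses simulated values in place of DATI$TRIGLICERIDI (the numbers are random and purely illustrative), so it can be run without the database:

```r
# Simulated, right-skewed values standing in for a triglyceride column
set.seed(1)
TRIG <- rlnorm(200, meanlog = log(120), sdlog = 0.4)

# histogram: number of values falling in each interval
H <- hist(TRIG, main = "Histogram", xlab = "TRIG (simulated)")

# boxplot: median, quartiles and possible outliers
boxplot(TRIG, main = "Boxplot", ylab = "TRIG (simulated)")
```

The histogram and the density plot show the shape of the distribution, while the boxplot summarises it through the median and quartiles.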
Create a graph comparing the median values of COL_TOT between groups (CENT, CTRL, FIGLI)
boxplot(DATI$COL_TOT ~ DATI$Gruppo)
Create a graph showing the fractions (percentages) of subjects belonging to each group
pie(table(DATI$Gruppo))
# If you want to indicate percentages:
percentuali <- round(table(DATI$Gruppo) / sum(table(DATI$Gruppo)) * 100)
# to add labels, I create a vector with one label per group: the group names and the percentages, joined with paste0()
# (using DATI$Gruppo here would give one label per subject instead of one per group)
labels <- paste0(names(percentuali), " ", percentuali, "%")
# at this point I redraw the pie chart, adding the labels
pie(table(DATI$Gruppo), labels=labels)
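The percentages can also be obtained with prop.table(), which divides each count by the total. A sketch on an invented group vector; the label format is just one possibility:

```r
# Hypothetical toy data: group membership of eight subjects
GRUPPO <- factor(c("CENT", "CTRL", "CTRL", "FIGLI",
                   "FIGLI", "FIGLI", "FIGLI", "CENT"))

CONTEGGI <- table(GRUPPO)                    # counts per group
PERC <- round(prop.table(CONTEGGI) * 100)    # percentages per group

# one label per group (not per subject): group name plus percentage
ETICHETTE <- paste0(names(PERC), " (", PERC, "%)")
ETICHETTE

pie(CONTEGGI, labels = ETICHETTE)
```

The labels are built from names(PERC), so their number always matches the number of slices.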
Indicate the minimum and maximum values of the variable glucose
summary(DATI$GLICEMIA)
Indicate for which biochemical parameters the median values are lower in the centenarian group compared to the control group
CENTENARI <- subset(DATI, DATI$Gruppo == "CENT")
CONTROLLI <- subset(DATI, DATI$Gruppo == "CTRL")
summary(CONTROLLI)
summary(CENTENARI)
It would then be sufficient to compare the two data tables to understand where the median values of one group are higher than the other.
However, let's see if we can automate this process.
The result of summary() on a data frame is a table in which, for each numeric variable, the minimum, first quartile, median, mean, third quartile and maximum are reported.
For numeric variables, the median is always in the third row.
I can save the results of the summaries in R objects and then extract the third row from each of them. SUMMCENT[3, ] takes the third row and all columns of the centenarians' summary. This is achieved using square brackets: [3, ] selects the third row, and the fact that there is nothing after the comma indicates "take all columns."
Note, however, that the cells of a summary table are formatted text (of the form "Median :value"), so comparing SUMMCENT[3, ] with SUMMCONTROL[3, ] using < would compare character strings, not numbers. The third rows are useful for reading the medians, but for the comparison itself I compute the medians as numeric vectors with sapply() and median(), restricted to the numeric columns. This way, I obtain two vectors that contain all the medians of the centenarians and all the medians of the controls. At this point, I just need to check where one vector is smaller than the other.
SUMMCENT <- summary(CENTENARI)
SUMMCONTROL <- summary(CONTROLLI)
# the summary cells are text, so I recompute the medians as numbers, for the numeric columns only
NUMERICHE <- sapply(DATI, is.numeric)
MEDCENT <- sapply(CENTENARI[, NUMERICHE], median)
MEDCONTROL <- sapply(CONTROLLI[, NUMERICHE], median)
# I ask where the median is lower in the centenarians than in the controls
MEDCENT < MEDCONTROL
The result is a logical vector: TRUE marks the parameters whose median is lower in the centenarians than in the controls.
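To list only the parameters that satisfy the condition, the logical comparison can be passed to which(). The sketch below uses two invented data frames in place of CENTENARI and CONTROLLI and computes the medians directly with sapply(), so that numbers rather than formatted text are compared:

```r
# Hypothetical toy data standing in for the two subgroups
CENT_TOY <- data.frame(UREA = c(50, 60, 55), COL_TOT = c(170, 180, 175))
CTRL_TOY <- data.frame(UREA = c(40, 42, 38), COL_TOT = c(200, 210, 220))

# median of every column, returned as a named numeric vector
MED_CENT <- sapply(CENT_TOY, median)
MED_CTRL <- sapply(CTRL_TOY, median)

# names of the variables whose median is lower in the first group
INFERIORI <- names(which(MED_CENT < MED_CTRL))
INFERIORI
```

With the real data, the same lines would be applied to the numeric columns of CENTENARI and CONTROLLI.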