EXERCISE ANALYSIS 1

Download the "Centenari and Biochimica" database from:

urly.it/36kc

Do the exercise by copying the commands into a script.

Centenarians and Biochemistry Exercise

Perform the following exercises:

Import the "Centenari and Biochimica" database

Use the read.table function after saving the data as a tab-delimited text file (.txt). Specify that the file has a header (header=TRUE) and that the column separator is a tab (sep="\t").

Note: When importing data through a graphical interface, pay attention to MISSING DATA. You MUST specify how missing values are coded (e.g., NA or empty cells) in the NA dropdown of the import options.

DATA <- read.table("DIRECTORY/DATI/CENTENARI_BIOCHIMICA.txt", header=TRUE, sep="\t", na.strings="")

Note: In this example, a DATA object containing the data is created. If you import the CENTENARI_BIOCHIMICA file using R's graphical interface, an object named CENTENARI_BIOCHIMICA will be created instead of DATA. Either adjust the following commands accordingly, or create a copy named DATA that contains CENTENARI_BIOCHIMICA.
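To see how header, sep, and na.strings work together, here is a minimal sketch that reads a small tab-delimited table from a string instead of a file, using the text argument of read.table. The values are invented for illustration and are not taken from the real database.

```r
# Toy tab-delimited content (invented values); the AGE field of the second row is empty
txt <- "Gruppo\tAGE\tUREA\nCENT\t101\t45\nCTRL\t\t40\nFIGLI\t68\t38"

# text= makes read.table parse a string instead of opening a file
toy <- read.table(text = txt, header = TRUE, sep = "\t", na.strings = "")

str(toy)          # shows the type R assigned to each column
is.na(toy$AGE)    # the empty cell was imported as a missing value
```

With the real file, the same three arguments are used and only the file path changes.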

Assign the correct variable types ("factor" and "numeric")

First, check the variable types with:

str(DATA)

Most variables are numeric or integer. The ones that need to be changed to factors are Gruppo, uo, pid, FUMO, INFARTO, INSUFF_RENE. This matters because R then treats them as categorical variables in tables, plots, and analyses, rather than as numbers.

DATA$Gruppo <- as.factor(DATA$Gruppo) 
DATA$uo <- as.factor(DATA$uo) 
DATA$pid <- as.factor(DATA$pid) 
DATA$FUMO <- as.factor(DATA$FUMO) 
DATA$INFARTO <- as.factor(DATA$INFARTO) 
DATA$INSUFF_RENE <- as.factor(DATA$INSUFF_RENE)

Check again to ensure everything is correct:

str(DATA)
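When several columns need the same conversion, the six assignments above can also be written in one step. This is a sketch on an invented toy data frame; the column names mimic the real ones, but the values are made up.

```r
# Toy data frame (invented values) with the same kind of columns as DATA
toy <- data.frame(Gruppo = c("CENT", "CTRL", "FIGLI", "CTRL"),
                  FUMO   = c(0, 1, 0, 1),
                  AGE    = c(101, 70, 68, 72))

# Names of the columns to treat as categorical
factor_cols <- c("Gruppo", "FUMO")

# lapply() applies as.factor to each selected column and the result is assigned back
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
levels(toy$Gruppo)
```

On the real data, factor_cols would list Gruppo, uo, pid, FUMO, INFARTO and INSUFF_RENE.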

How many samples and variables are in the dataset?

dim(DATA)

There are 461 samples and 33 variables.

Are there any missing data?

any(is.na(DATA))

Remove samples with missing data and create a new dataset called “DATI”

DATI <- na.omit(DATA)
dim(DATI)

There are 382 subjects with complete data, meaning 79 subjects were removed (461-382).
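Before removing rows, it can help to see where the missing values are. The sketch below uses an invented toy data frame to count the NAs per column and the incomplete rows, and then checks how many rows na.omit() drops.

```r
# Toy data frame (invented values) with two missing cells in different rows
toy <- data.frame(AGE     = c(101, 70, NA, 68),
                  UREA    = c(45, NA, 38, 40),
                  COL_TOT = c(180, 210, 195, 200))

colSums(is.na(toy))          # number of missing values in each column
sum(!complete.cases(toy))    # number of rows with at least one NA

clean <- na.omit(toy)        # keeps only the complete rows
nrow(toy) - nrow(clean)      # number of rows removed
```

The last line is the same subtraction used above for the real data (461-382).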

How many subjects are in each group?

table(DATI$Gruppo)
# CENT: 79
# CTRL: 85
# FIGLI: 218

How many subjects are in each operational unit (uo)?

table(DATI$uo)

How many centenarians were recruited from operational unit 3?

table(DATI$Gruppo, DATI$uo)
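The answer can be read from the two-way table, but a single cell can also be extracted by name. A sketch on invented toy data; note that factor levels are character strings, so unit 3 is written "3".

```r
# Toy data (invented values): group and operational unit of five subjects
toy <- data.frame(Gruppo = factor(c("CENT", "CENT", "CTRL", "FIGLI", "CENT")),
                  uo     = factor(c(3, 1, 3, 2, 3)))

tab <- table(toy$Gruppo, toy$uo)
tab                  # rows are groups, columns are operational units
tab["CENT", "3"]     # centenarians recruited from unit 3
```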

What is the mean age of all subjects? And in the control group? And in the FIGLI (offspring) group?

mean(DATI$AGE)
# The mean age of each group is obtained with tapply(), which takes the variable of interest,
# the categorical variable that defines the groups, and the function to apply (here, the mean)
tapply(DATI$AGE, DATI$Gruppo, mean)
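The result of tapply() is a named vector, so the mean of a single group can be picked out by name. The sketch below uses invented ages; aggregate() is shown only as an alternative that returns the same means as a data frame.

```r
# Toy data (invented values)
toy <- data.frame(Gruppo = factor(c("CENT", "CENT", "CTRL", "CTRL", "FIGLI")),
                  AGE    = c(100, 102, 70, 74, 68))

by_group <- tapply(toy$AGE, toy$Gruppo, mean)
by_group             # one mean per group
by_group["CTRL"]     # mean age of the control group only

# Same means, returned as a data frame
aggregate(AGE ~ Gruppo, data = toy, FUN = mean)
```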

Highlight the relationship between urea and creatinine with the most suitable graph

plot(DATI$UREA, DATI$CREATININA)
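A scatter plot is easier to read with axis labels, and a fitted line can make the trend visible. This is only a sketch on simulated values (the numbers and their relationship are invented); the regression line and the correlation are optional additions, not part of the exercise.

```r
# Simulated values (invented): creatinine loosely increasing with urea
set.seed(1)
urea  <- runif(50, min = 20, max = 80)
creat <- 0.5 + 0.01 * urea + rnorm(50, sd = 0.1)

plot(urea, creat, xlab = "Urea", ylab = "Creatinine",
     main = "Urea vs creatinine (simulated data)")
abline(lm(creat ~ urea), col = "red")   # least-squares line through the points

cor(urea, creat)                         # Pearson correlation between the two variables
```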

Show the distribution of the variable triglycerides with a graph

You can use a density plot, histogram, or boxplot.

plot(density(DATI$TRIGLICERIDI))
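The three options can be drawn side by side to compare what each one shows. A sketch on simulated, right-skewed values (invented, not the real triglycerides):

```r
# Simulated right-skewed values (invented)
set.seed(1)
trig <- rlnorm(100, meanlog = log(120), sdlog = 0.4)

op <- par(mfrow = c(1, 3))               # three panels in one row
plot(density(trig), main = "Density")
hist(trig, main = "Histogram", xlab = "Triglycerides")
boxplot(trig, main = "Boxplot")
par(op)                                  # restore the previous layout
```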

Create a graph comparing the median values of COL_TOT between groups (CENT, CTRL, FIGLI)

boxplot(DATI$COL_TOT ~ DATI$Gruppo)
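The thick line inside each box is the median. To read the values exactly, the object returned by boxplot() can be inspected: the third row of its stats component holds the medians. A sketch on invented values, cross-checked with tapply():

```r
# Toy data (invented values): four subjects per group
toy <- data.frame(Gruppo  = factor(rep(c("CENT", "CTRL", "FIGLI"), each = 4)),
                  COL_TOT = c(170, 180, 190, 200,
                              200, 210, 220, 230,
                              190, 200, 205, 215))

bp <- boxplot(COL_TOT ~ Gruppo, data = toy, ylab = "COL_TOT")
bp$stats[3, ]                               # medians drawn in the plot, one per group

tapply(toy$COL_TOT, toy$Gruppo, median)     # same medians computed directly
```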

Create a graph showing the fractions (percentages) of subjects belonging to each group

pie(table(DATI$Gruppo))
# To show percentages, first compute them:
percentuali <- round(table(DATI$Gruppo) / sum(table(DATI$Gruppo)) * 100)
# Build a vector of labels by joining each group name with its percentage using paste()
labels <- paste(names(percentuali), percentuali)
# Redraw the pie chart, this time adding the labels
pie(table(DATI$Gruppo), labels=labels)
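The percentages can also be obtained with prop.table(), which divides each count by the total. A sketch on an invented group vector; the label format and the bar plot are just one possible choice.

```r
# Toy group vector (invented): 2 CENT, 3 CTRL, 5 FIGLI
gruppo <- factor(c(rep("CENT", 2), rep("CTRL", 3), rep("FIGLI", 5)))

counts <- table(gruppo)
perc   <- round(prop.table(counts) * 100)    # percentage of subjects in each group

# One label per group: name followed by its percentage
pie_labels <- paste0(names(perc), " (", perc, "%)")
pie(counts, labels = pie_labels)

# A bar plot of the same percentages is an alternative to the pie chart
barplot(perc, ylab = "% of subjects")
```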

Indicate the minimum and maximum values of the variable glucose

summary(DATI$GLICEMIA)
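The minimum and maximum are the first and last values in the summary output. If only those two are needed, min(), max() or range() return them directly. A sketch on invented glucose values:

```r
# Toy glucose values (invented)
glicemia <- c(85, 92, 110, 78, 130)

min(glicemia)
max(glicemia)
range(glicemia)                           # minimum and maximum together

summary(glicemia)[c("Min.", "Max.")]      # the same two values taken from summary()
```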

Indicate for which biochemical parameters the median values are lower in the centenarian group compared to the control group

CENTENARI <- subset(DATI, DATI$Gruppo == "CENT") 
CONTROLLI <- subset(DATI, DATI$Gruppo == "CTRL") 
summary(CONTROLLI) 
summary(CENTENARI)

It would then be sufficient to compare the two data tables to understand where the median values of one group are higher than the other.

However, let's see if we can automate this process.

The result of summary() is a table in which, for each numeric variable, the minimum, the quartiles, the median, and the mean are reported.

For numeric variables, the median is always in the third row.

I can save the results of the summaries in R objects and then extract the third row from each table. SUMMCENT[3,] means that, from the summary of the centenarians, I take only the third row and all columns. This is done with square brackets: [3,] selects the third row of the whole table (the one containing the median), and leaving nothing after the comma means "take all columns."

The entries of a summary table are text strings such as "Median :5.2", so they cannot be compared directly: a comparison between strings is alphabetical, not numerical. I therefore keep only the number after the colon and convert it with as.numeric(). This way, I obtain two numeric vectors that contain the medians of the centenarians and of the controls (rounded as in the summary output). At this point, I just need to check where one vector is smaller than the other. For factor columns the third row is not a median, so the result should be read only for the numeric variables.

SUMMCENT <- summary(CENTENARI)
SUMMCONTROL <- summary(CONTROLLI)
# The third row holds the medians as text (e.g. "Median :5.2"): keep the number after the colon
MEDCENT <- as.numeric(sub(".*:", "", SUMMCENT[3, ]))
MEDCONTROL <- as.numeric(sub(".*:", "", SUMMCONTROL[3, ]))
names(MEDCENT) <- names(DATI)
# TRUE where the median of the centenarians is lower than the median of the controls
MEDCENT < MEDCONTROL

This provides a comparison indicating where the median values of the centenarians are lower than those of the control group.
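As a cross-check on the summary-based comparison, the medians can also be computed directly from the numeric columns, which avoids working with the text-formatted output of summary(). This is a sketch on an invented toy data frame; the names num_cols, med_cent, med_ctrl and lower are hypothetical helpers, not part of the exercise.

```r
# Toy data (invented values): three centenarians and three controls
toy <- data.frame(Gruppo  = factor(c("CENT", "CENT", "CENT", "CTRL", "CTRL", "CTRL")),
                  UREA    = c(50, 55, 60, 35, 40, 45),
                  COL_TOT = c(170, 180, 190, 200, 210, 220))

# Keep only the numeric columns, since a median makes no sense for factors
num_cols <- names(toy)[sapply(toy, is.numeric)]

# Median of every numeric column, separately for each group
med_cent <- sapply(toy[toy$Gruppo == "CENT", num_cols], median)
med_ctrl <- sapply(toy[toy$Gruppo == "CTRL", num_cols], median)

lower <- med_cent < med_ctrl     # TRUE where the centenarians have the lower median
names(lower)[lower]              # names of the parameters that satisfy the condition
```

On the real data, DATI would take the place of toy and the last line would list the biochemical parameters requested by the exercise.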
