EXERCISE DATASET_THYROID

Importing Data from the Command Line with read.table(), or with the xlsx Package for Excel Files:

There are different procedures to load data files into R. The approach depends significantly on the type of files we are dealing with. The most common and simple file types are ".txt" and ".csv" files.

In ".txt" and ".csv" files, columns are separated by a delimiter such as a tab or a comma. We can always convert our original data file to one of these formats by choosing "Save As" and selecting the "Text (Tab delimited) (.txt)" option from the file type dropdown menu. Alternatively, we can save as "CSV (Comma delimited) (.csv)". Opening the saved file with Notepad or any other text editor will show the data separated by tabs or commas.

If we have chosen to convert the original file to a .txt or .csv file, we can load it using the read.table() function.

The syntax for importing the data will be:

DATASET_THYROID <- read.table("/path/to/file/filename.txt", header=TRUE, sep="\t")
DATASET_THYROID <- read.table("/path/to/file/filename.csv", header=TRUE, sep=",")
#To obtain the path, right-click on the data file, select Properties, and copy the location shown there.
#Ensure that the slashes are "/" instead of "\".
#header=TRUE indicates that the first row of the file contains the variable names.
#sep="\t" indicates that the columns are separated by tabs.
#For CSV files, the delimiter is sep=",".
#If the PC has Italian language settings, the CSV file may use semicolons instead of commas.
#In this case, simply change the argument to sep=";".
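
As a minimal sketch of the delimiter options above, the following writes a small semicolon-delimited file to a temporary location and reads it back. The file contents and column names are invented for illustration; they are not taken from DATASET_THYROID.

```r
# Hypothetical toy data written to a temporary file (not the real dataset)
tmp <- tempfile(fileext = ".csv")
writeLines(c("pid;FT3;FT4", "A01;3.1;10.2", "A02;2.8;11.0"), tmp)

# Read it back, telling read.table() that columns are separated by ";"
toy <- read.table(tmp, header = TRUE, sep = ";")

# Check that three columns were recognised rather than one
dim(toy)
colnames(toy)
```

If dim() reports a single column, the separator is probably wrong. Base R also provides read.csv2(), which uses sep=";" and dec="," by default.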

Alternatively, it is possible to install a package called xlsx that loads Excel files directly. After installing it, load it into R to make its functions available, including read.xlsx():

install.packages("xlsx") 
library(xlsx) 
DATASET_THYROID <- read.xlsx("C:/Users/GENTILINI/Desktop/DATASET_THYROID.xls", sheetName = "DATASET_THYROID")

Viewing the Header and First Six Rows of the Dataset

Once the file is imported and loaded into an object "DATASET_THYROID," it is important to view the first few rows to ensure that the data has been imported correctly and that the headers with the variable names are correct.

The command is very simple:

head(DATASET_THYROID)

Viewing Column Names and Making Each Variable an Object

We can retrieve the variable names and check the position and number of columns in the dataset. With the attach() function, we make each variable in our dataset an R object, specifically a vector with the name indicated in the header. In other words, each variable becomes a vector and can be used independently.

colnames(DATASET_THYROID) 
attach(DATASET_THYROID)
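
One point worth keeping in mind is that the vectors made available by attach() are copies taken at the moment of attaching, so later changes to the data frame are not reflected in them. A minimal sketch of two alternatives that always read from the data frame itself, using an invented toy data frame:

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(FT3 = c(3.1, 2.8, 3.5), FT4 = c(10.2, 11.0, 9.8))

# Explicit column access with $
mean_dollar <- mean(toy$FT3)

# with() evaluates an expression using the columns of the data frame
mean_with <- with(toy, mean(FT3))

# Both approaches give the same value
all.equal(mean_dollar, mean_with)
```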

Number of Subjects and Variables in the Dataset

Knowing how many subjects we have in our database is very important. This information can be obtained by calling the dimensions of the dataframe. The result of dim(dataframe) indicates the number of rows (subjects) and columns (variables) that make up the dataset.

dim(DATASET_THYROID)

Creating a New Dataset Without Missing Data

Missing data can be a problem, and one approach is to remove all samples with missing data. The result will be a dataframe without missing data. This new dataset, different from the initial one because it potentially has removed samples, must be saved in a new object called DATI.

DATASET_THYROID = initial dataset
DATI = dataset without missing data

DATI<-na.omit(DATASET_THYROID)

Number of Subjects Remaining in the Dataset

dim(DATI)
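
To see what na.omit() does row by row, here is a small sketch on an invented data frame. The column names mirror the exercise, but the values are made up.

```r
# Hypothetical toy data with two incomplete rows
toy <- data.frame(pid = c("A01", "A02", "A03", "A04"),
                  FT3 = c(3.1, NA, 3.5, 2.9),
                  TSH = c(1.2, 2.0, NA, 1.8))

# Count missing values per column before removing anything
colSums(is.na(toy))

# Drop every row that has at least one NA
clean <- na.omit(toy)

# Compare the number of subjects before and after
n_before  <- nrow(toy)
n_after   <- nrow(clean)
n_removed <- n_before - n_after
```

Checking n_removed before continuing shows how many subjects the complete-case approach costs.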

Renaming Columns FT3.x, FT4.x, TSH.x to FT3, FT4, TSH

Sometimes it is necessary to rename the columns of a dataset, for example because the existing names are abbreviated in a way we do not want. Assigning a character vector to colnames(DATI) replaces the column names of the "DATI" object. The vector is built with c("new_name", "new_name2") and must contain one name for every column, in column order.

colnames(DATI) <- c("pid", "Group", "uo", "sex", "age", "FT3", "FT4", "TSH", "Overt.Hypot", "SUBCL.Hypo", "Overt.HyperT", "Subclinical.Hyper", "Central.HypoT", "NTI", "Hyperthyroidism", "Hypothyroidism", "Thyroid_Therapy")
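
When only a few names need changing, retyping the whole vector is error-prone. A possible alternative, sketched here on an invented data frame, is to strip the ".x" suffix with sub() and leave the other names untouched.

```r
# Hypothetical toy data frame with ".x" suffixes on the hormone columns
toy <- data.frame(pid = 1:2,
                  FT3.x = c(3.1, 2.8),
                  FT4.x = c(10.2, 11.0),
                  TSH.x = c(1.2, 2.0))

# Remove a trailing ".x" from any column name that has one
colnames(toy) <- sub("\\.x$", "", colnames(toy))

colnames(toy)
```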

Type of Object and Attributes of Variables in DATI

class(DATI) 
str(DATI)

Modifying Attributes of Variables in DATI, Distinguishing Factorial and Numerical Variables

DATI$pid <- as.factor(DATI$pid)
DATI$sex <- as.factor(DATI$sex)
DATI$uo <- as.factor(DATI$uo)
DATI$Overt.Hypot <- as.factor(DATI$Overt.Hypot)
DATI$SUBCL.Hypo <- as.factor(DATI$SUBCL.Hypo)
DATI$Overt.HyperT <- as.factor(DATI$Overt.HyperT)
DATI$Subclinical.Hyper <- as.factor(DATI$Subclinical.Hyper)
DATI$Central.HypoT <- as.factor(DATI$Central.HypoT)
DATI$NTI <- as.factor(DATI$NTI)
DATI$Hypothyroidism <- as.factor(DATI$Hypothyroidism)
DATI$Thyroid_Therapy <- as.factor(DATI$Thyroid_Therapy)
DATI$Hyperthyroidism <- as.factor(DATI$Hyperthyroidism)
str(DATI)
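
Converting many columns one line at a time makes copy-and-paste slips easy. A compact alternative, shown as a sketch on invented data, is to list the column names once and convert them together with lapply().

```r
# Hypothetical toy data frame: two coded variables and one measurement
toy <- data.frame(sex = c(1, 2, 1),
                  NTI = c(0, 1, 0),
                  FT3 = c(3.1, 2.8, 3.5))

# Names of the columns that should be treated as categorical
factor_cols <- c("sex", "NTI")

# Convert all of them to factors in a single step
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```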

Recoding the Variable Group from (CENT, FIGLI, CTRL) to (CENTENARI, FIGLI_CENTENARI, CONTROLLI)

Option 1

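# levels() assigns new names by position. Factor levels are sorted alphabetically by default,
# so the order is assumed to be CENT, CTRL, FIGLI; check with levels(DATI$Group) first.
DATI$Group <- as.factor(DATI$Group)
levels(DATI$Group) <- c("CENTENARI", "CONTROLLI", "FIGLI_CENTENARI")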

Option 2 Using dplyr Package

install.packages("dplyr")
library(dplyr)
# Use this instead of Option 1, on the original codes (CENT, FIGLI, CTRL)
DATI$Group <- recode(DATI$Group, CENT = "CENTENARI", FIGLI = "FIGLI_CENTENARI", CTRL = "CONTROLLI")
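
Because levels() assigns new names by position, it helps to check the level order before recoding. The sketch below uses an invented vector of group codes and a hypothetical lookup vector called map to recode by name in base R, so the result does not depend on level order.

```r
# Hypothetical group codes (invented for illustration)
grp <- factor(c("CENT", "FIGLI", "CTRL", "CENT"))

# Levels are stored alphabetically, not in order of appearance
levels(grp)

# Named lookup vector: old code = new label
map <- c(CENT = "CENTENARI", FIGLI = "FIGLI_CENTENARI", CTRL = "CONTROLLI")

# Recode by name, then fix the level order explicitly
grp_new <- factor(unname(map[as.character(grp)]), levels = unname(map))

# Cross-tabulate old against new to verify the mapping
table(grp, grp_new)
```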

Recoding the Variable uo from (1, 2, 3, 4, 5) to (Unita1, Unita2, Unita3, Unita4, Unita5)

Option 1

levels(DATI$uo) <- c("Unita1", "Unita2", "Unita3", "Unita4", "Unita5")

Option 2 Using dplyr Package

# Numeric codes must be quoted with backticks when used as argument names
DATI$uo <- recode(DATI$uo, `1` = "Unita1", `2` = "Unita2", `3` = "Unita3", `4` = "Unita4", `5` = "Unita5")

Number of Overt.HyperT Cases

table(DATI$Overt.HyperT)

Displaying the Identifiers of Overt.HyperT Subjects

OVERT.HYPER <- subset(DATI, Overt.HyperT == 1)
OVERT.HYPER$pid
# Alternatively, select the first column (pid) of the subset directly:
subset(DATI, Overt.HyperT == 1)[, 1]
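
The same selection can be written with logical indexing instead of subset(). A minimal sketch on invented data, assuming the condition is coded 0/1 as in the exercise:

```r
# Hypothetical toy data: three subjects, two of them with the condition
toy <- data.frame(pid = c("A01", "A02", "A03"),
                  Overt.HyperT = factor(c(0, 1, 1)),
                  stringsAsFactors = FALSE)

# Identifiers via subset()
ids_subset <- subset(toy, Overt.HyperT == 1)$pid

# Identifiers via logical indexing on the pid column
ids_index <- toy$pid[toy$Overt.HyperT == 1]

# Both give the same identifiers
identical(ids_subset, ids_index)
```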

Creating a Dataset Named "FT3over" with Subjects Having FT3 > 3.46, and Number of Subjects

FT3over <- subset(DATI, DATI$FT3 > 3.46) 
dim(FT3over)

Displaying Maximum and Minimum of FT3

range(DATI$FT3)

Number of Males and Females (1=Male, 2=Female)

table(DATI$sex)

Number of Males and Females Among Centenarians (1=Male, 2=Female)

table(DATI$sex, DATI$Group)

Number of Males and Females Among Hypothyroid Centenarians (1=Male, 2=Female)

table(DATI$sex, DATI$Group, DATI$Hypothyroidism)

Calculating Mean and Standard Deviation of Hormones FT3, FT4, and TSH

attach(DATI) 
mean(FT3) 
mean(FT4) 
mean(TSH) 
sd(FT3) 
sd(FT4) 
sd(TSH)

Calculating Median of FT3, FT4, TSH

median(DATI$FT3) 
median(DATI$FT4) 
median(DATI$TSH)

Calculating Maximum, Minimum, Mean, Median, and Quartiles for All Numerical Variables and Number of Subjects for Each Categorical Variable in DATI

summary(DATI)

Calculating the Risk of Being Hypothyroid if Male (Odds Ratio)

X <- table(DATI$sex, DATI$Hypothyroidism) 
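# table() is stored column by column. Assuming Hypothyroidism is coded 0 = no, 1 = yes:
# X[1] = males/no, X[2] = females/no, X[3] = males/yes, X[4] = females/yes
(X[2] * X[3]) / (X[1] * X[4])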

Calculating the Risk of Being Hypothyroid if Centenarian (Odds Ratio)

Group2 <- DATI$Group 
levels(Group2) <- c("CENTENARI", "NON_CENTENARI", "NON_CENTENARI") 
X <- table(Group2, DATI$Hypothyroidism) 
(X[2] * X[3]) / (X[1] * X[4])
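
Positional indexing of a 2x2 table is compact but easy to get wrong. The sketch below uses invented counts (not results from DATI) and assumes the coding 1 = male, 2 = female and 0 = not hypothyroid, 1 = hypothyroid. It computes the odds ratio both by cell name and by position to show that the two agree.

```r
# Hypothetical counts: 50 males (10 hypothyroid), 50 females (5 hypothyroid)
sex  <- factor(c(rep(1, 50), rep(2, 50)))
hypo <- factor(c(rep(0, 40), rep(1, 10), rep(0, 45), rep(1, 5)))
X <- table(sex, hypo)

# Odds of hypothyroidism within each sex, addressed by row and column name
odds_male   <- X["1", "1"] / X["1", "0"]   # 10 / 40
odds_female <- X["2", "1"] / X["2", "0"]   # 5 / 45
or_named <- odds_male / odds_female

# Same quantity using positional (column-major) indexing
or_pos <- (X[2] * X[3]) / (X[1] * X[4])

c(or_named, or_pos)
```

With these invented counts both expressions evaluate to 2.25, meaning the odds of hypothyroidism are 2.25 times higher in males than in females in the toy data.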

Plot Describing the Number of Subjects Enrolled for Each Operational Unit

pie(table(DATI$uo)) 
# Adding more details to the chart 
lbls <- levels(DATI$uo) 
pct <- round(table(DATI$uo) / sum(table(DATI$uo)) * 100) 
lbls <- paste(lbls, pct) # Add percents to labels 
lbls <- paste(lbls, "%", sep = "") # Add % to labels 
pie(table(DATI$uo), labels = lbls, col = rainbow(length(lbls)), main = "Pie Chart of Operational Units")
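
The percentage labels can also be built with prop.table(), which converts counts to proportions. A small sketch with invented unit codes:

```r
# Hypothetical unit codes for ten subjects (invented for illustration)
uo <- factor(c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5))

# Counts per unit, then rounded percentages
counts <- table(uo)
pct <- round(100 * prop.table(counts))

# One label per unit, e.g. "Unita1 20%"
lbls <- paste0("Unita", names(counts), " ", pct, "%")
lbls
```

The resulting lbls vector can be passed to the labels argument of pie() together with the same counts.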

Plot Describing the Relationship Between FT3 and FT4

plot(DATI$FT3, DATI$FT4)

Plot Describing the Relationship Between Age and FT3 Using ggplot2

# Setting parameters 
DF <- DATI 
X <- DATI$age 
Y <- DATI$FT3 
group <- as.factor(1) 
library(ggplot2) 
ggplot(DF, aes(x = X, y = Y, color = group, shape = group)) + 
   geom_point() +    
   geom_smooth(method = lm, se = TRUE, fullrange = TRUE)

Plot Describing the Distribution of FT3, FT4, TSH, and Age Values

plot(density(DATI$FT3)) 
plot(density(DATI$FT4)) 
plot(density(DATI$TSH)) 
plot(density(DATI$age))

Boxplot of FT3 Levels by Group

boxplot(DATI$FT3 ~ DATI$Group, ylim = c(0, 6))

Plot of Mean FT4 Values by Group

A <- subset(DATI, DATI$Group == "CENTENARI") 
B <- subset(DATI, DATI$Group == "FIGLI_CENTENARI") 
C <- subset(DATI, DATI$Group == "CONTROLLI") 

x <- mean(A$FT4) 
y <- mean(B$FT4) 
z <- mean(C$FT4) 
J <- c(x, y, z) 
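# names.arg labels the bars; xpd = FALSE clips them to the plot region because ylim does not start at 0
barplot(J, names.arg = c("CENTENARI", "FIGLI_CENTENARI", "CONTROLLI"), ylim = c(8, 12), xpd = FALSE)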
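
The subset-and-mean steps can also be written as a single call to tapply(), which applies a function to one variable within each level of a factor. A minimal sketch on invented values:

```r
# Hypothetical toy data: two subjects per group (values invented)
toy <- data.frame(
  Group = factor(c("CENTENARI", "CENTENARI", "CONTROLLI",
                   "CONTROLLI", "FIGLI_CENTENARI", "FIGLI_CENTENARI")),
  FT4 = c(9, 11, 10, 12, 11, 13)
)

# Mean FT4 within each group, returned as a named vector
means <- tapply(toy$FT4, toy$Group, mean)
means
```

Because the result carries the group names, barplot(means) labels the bars without a separate names.arg.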

Correlogram of Thyroid Hormone Values and Age

# Creating a new dataframe with only the variables to be included in the correlogram
# (naming the columns gives readable labels in the plot)
CORDF <- data.frame(FT3 = DATI$FT3, FT4 = DATI$FT4, TSH = DATI$TSH, age = DATI$age)
# Loading the corrgram library
library(corrgram)
corrgram(CORDF)

# Using the corrplot package as an alternative. This package plots the correlation matrix obtained from cor(DATAFRAME) 
# Saving the correlation matrix in an object M and then launching corrplot which graphs the correlation indices 
M <- cor(CORDF) 
library(corrplot) 
corrplot(M)

Comparing FT3 Distributions Across Groups

library(sm) 

sm.density.compare(DATI$FT3, DATI$Group, xlab = "FT3 Levels") 
title(main = "Distribution of FT3 Levels")

# Adjusting the X-axis to exclude the sample with very high levels for a better plot
sm.density.compare(DATI$FT3, DATI$Group, xlim = c(0, 20), xlab = "FT3 Levels")
title(main = "Distribution of FT3 Levels") 

# Adding a legend and positioning it on the plot 
LEG <- DATI$Group 
colfill <- 2:(1 + length(levels(LEG))) # one fill colour per group level
legend(locator(1), levels(LEG), fill = colfill)

Plot Describing the Ratio of Hypothyroid Males to Females

barplot(table(DATI$sex, DATI$Hypothyroidism))

Plot Describing the Ratio of Overt Hypothyroid Cases Across Groups

barplot(table(DATI$Overt.Hypot, DATI$Group))

Is There a Difference in FT4 Levels Measured Across Units?

BOX <- boxplot(DATI$FT4 ~ DATI$uo) 
BOX$stats 
# How would you proceed with this result?

Comparing FT3, FT4, and TSH Levels Between Controls and Children of Centenarians

B <- subset(DATI, DATI$Group == "FIGLI_CENTENARI") 
C <- subset(DATI, DATI$Group == "CONTROLLI") 

DATINC <- rbind(B, C) 
boxplot(DATINC$FT3 ~ DATINC$Group) 
boxplot(DATINC$FT4 ~ DATINC$Group) 
boxplot(DATINC$TSH ~ DATINC$Group)

Merging DATASET_THYROID and DATI by Sample (Purely for Exercise)

DATIMERGE <- merge(DATASET_THYROID, DATI, by.x = "pid", by.y = "pid")
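
When the two data frames share column names other than the key, merge() keeps both versions and appends ".x" and ".y" to tell them apart. This may be where names such as FT3.x come from, although the exercise does not say so. A sketch on two invented data frames:

```r
# Hypothetical toy data frames sharing the key "pid" and the column "FT3"
a <- data.frame(pid = c("A01", "A02", "A03"), FT3 = c(3.1, 2.8, 3.5))
b <- data.frame(pid = c("A01", "A03"), FT3 = c(3.1, 3.5))

# Merge on pid; only subjects present in both data frames are kept by default
m <- merge(a, b, by = "pid")

colnames(m)
nrow(m)
```

Here the merged columns are named pid, FT3.x and FT3.y, and subject A02 is dropped because it is missing from the second data frame.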
