Importing Data from the Command Line: the Classic read.table() and a Package for Excel Files
There are different procedures to load data files into R. The approach depends significantly on the type of files we are dealing with. The most common and simple file types are ".txt" and ".csv" files.
In ".txt" and ".csv" files, columns are separated by delimiters such as tabs or commas. A spreadsheet can always be converted to one of these formats by choosing "Save As" and selecting the "Text (Tab delimited) (.txt)" option from the file type dropdown menu, or alternatively "CSV (Comma delimited) (.csv)". Opening the saved file with Notepad or any text editor will show the data separated by tabs or commas.
If we have chosen to convert the original file to a .txt or .csv file, we can load it using the read.table() function.
The syntax for importing the data will be:
DATASET_THYROID <- read.table("/path/to/file/filename.txt", header=TRUE, sep="\t")
DATASET_THYROID <- read.table("/path/to/file/filename.csv", header=TRUE, sep=",")
#To obtain the path, right-click on the data file, select properties, read the path, and copy it.
#Ensure that the slashes are "/" instead of "\".
#header=TRUE indicates that the file has a header (with the variable names).
#sep="\t" indicates that the columns are delimited by tabs.
#For CSV files, the delimiter is sep=",".
#If the PC has Italian language settings, the CSV file may use semicolons instead of commas (and commas as the decimal mark).
#In this case, use sep=";" and, if needed, add dec=",".
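As a minimal, self-contained sketch of the import step, the following writes a tiny made-up tab-delimited file to a temporary location and reads it back with read.table(). The file contents and the object name toy are illustrative assumptions, not the thyroid data.

```r
# Hypothetical toy data written to a temporary tab-delimited file
tmp <- tempfile(fileext = ".txt")
writeLines(c("pid\tsex\tFT3",
             "1\t1\t3.1",
             "2\t2\t2.8",
             "3\t2\tNA"), tmp)

# header = TRUE: the first line holds the variable names
# sep = "\t": columns are separated by tabs
toy <- read.table(tmp, header = TRUE, sep = "\t")

head(toy)   # first rows, to check the import
dim(toy)    # number of rows and columns
str(toy)    # column types as read by R
```

The same check (head, dim, str) can be run on any file right after importing it, before doing anything else with the data.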
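Alternatively, it is possible to install a package called xlsx that loads Excel files (.xls or .xlsx) directly. The package relies on Java, which must be available on the computer. It needs to be installed only once and is then loaded in each R session to make its functions available, including read.xlsx():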
install.packages("xlsx")
library(xlsx)
DATASET_THYROID <- read.xlsx("C:/Users/GENTILINI/Desktop/DATASET_THYROID.xls", sheetName = "DATASET_THYROID")
Viewing the Header and First Six Rows of the Dataset
Once the file is imported and loaded into an object "DATASET_THYROID," it is important to view the first few rows to ensure that the data has been imported correctly and that the headers with the variable names are correct.
The command is very simple:
head(DATASET_THYROID)
Viewing Column Names and Making Each Variable an Object
We can retrieve the variable names and check the position and number of columns in the dataset. With the attach() function, each variable in the dataset becomes accessible by the name given in the header, as a vector that can be used on its own. Note that attach() exposes the data as it is at that moment: later changes to the dataframe are not reflected in the attached vectors.
colnames(DATASET_THYROID)
attach(DATASET_THYROID)
Number of Subjects and Variables in the Dataset
Knowing how many subjects we have in our database is very important. This information can be obtained by calling the dimensions of the dataframe. The result of dim(dataframe) indicates the number of rows (subjects) and columns (variables) that make up the dataset.
dim(DATASET_THYROID)
Creating a New Dataset Without Missing Data
Missing data can be a problem, and one approach is to remove all samples with missing data. The result will be a dataframe without missing data. This new dataset, different from the initial one because it potentially has removed samples, must be saved in a new object called DATI.
# DATASET_THYROID = initial dataset
# DATI = dataset without missing data
DATI <- na.omit(DATASET_THYROID)
Number of Subjects Remaining in the Dataset
dim(DATI)
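To see what na.omit() actually removes, here is a small sketch on a made-up dataframe (the values and the names toy and toy_complete are assumptions for illustration only). It also counts the missing values per column first, which helps to judge how many subjects will be lost.

```r
# Hypothetical toy data with missing values in two different columns
toy <- data.frame(pid = 1:5,
                  FT3 = c(3.1, NA, 2.9, 3.4, 3.0),
                  TSH = c(1.2, 2.0, NA, 0.8, 1.5))

# How many values are missing in each column?
colSums(is.na(toy))

# na.omit() drops every row that has at least one NA
toy_complete <- na.omit(toy)

dim(toy)            # 5 rows before
dim(toy_complete)   # 3 rows after

# complete.cases() flags the same rows without removing them
complete.cases(toy)
```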
Renaming Columns FT3.x, FT4.x, TSH.x to FT3, FT4, TSH
Sometimes it is necessary to rename the columns of a dataset, either because they are abbreviated in a way we do not want or for other reasons. Assigning a character vector to colnames(DATI) replaces the column names of the "DATI" object. The vector is built with c("new_name1", "new_name2", ...) and must contain one name for every column, in the order in which the columns appear.
colnames(DATI) <- c("pid", "Group", "uo", "sex", "age", "FT3", "FT4", "TSH", "Overt.Hypot", "SUBCL.Hypo", "Overt.HyperT", "Subclinical.Hyper", "Central.HypoT", "NTI", "Hyperthyroidism", "Hypothyroidism", "Thyroid_Therapy")
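Retyping every column name works, but it is easy to misplace one. As a possible alternative, the sketch below renames only selected columns by matching their current names. The toy dataframe and the vectors old_names and new_names are hypothetical.

```r
# Hypothetical toy data with merge-style suffixes in the column names
toy <- data.frame(pid = 1:3,
                  FT3.x = c(3.1, 2.8, 3.4),
                  FT4.x = c(11.2, 10.5, 12.0))

# Rename only the columns that need it, matching them by their current name
old_names <- c("FT3.x", "FT4.x")
new_names <- c("FT3", "FT4")
colnames(toy)[match(old_names, colnames(toy))] <- new_names

colnames(toy)
```

Columns that are not listed in old_names keep their names and their positions.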
Type of Object and Attributes of Variables in DATI
class(DATI)
str(DATI)
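Modifying Attributes of Variables in DATI, Distinguishing Factor and Numerical Variables
# Group must also be a factor for the recoding with levels() used later
DATI$Group <- as.factor(DATI$Group)
DATI$pid <- as.factor(DATI$pid)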
DATI$sex <- as.factor(DATI$sex)
DATI$uo <- as.factor(DATI$uo)
DATI$Overt.Hypot <- as.factor(DATI$Overt.Hypot)
DATI$SUBCL.Hypo <- as.factor(DATI$SUBCL.Hypo)
DATI$Overt.HyperT <- as.factor(DATI$Overt.HyperT)
DATI$Subclinical.Hyper <- as.factor(DATI$Subclinical.Hyper)
DATI$Central.HypoT <- as.factor(DATI$Central.HypoT)
DATI$NTI <- as.factor(DATI$NTI)
DATI$Hypothyroidism <- as.factor(DATI$Hypothyroidism)
DATI$Thyroid_Therapy <- as.factor(DATI$Thyroid_Therapy)
DATI$Hyperthyroidism <- as.factor(DATI$Hyperthyroidism)
str(DATI)
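Writing one line per variable makes copy-and-paste slips likely. A more compact sketch, assuming a made-up dataframe toy and a hypothetical vector factor_cols with the names of the categorical columns, converts them all in one step:

```r
# Hypothetical toy data: 0/1 indicators stored as numbers
toy <- data.frame(pid = 1:4,
                  sex = c(1, 2, 2, 1),
                  NTI = c(0, 1, 0, 0),
                  FT3 = c(3.1, 2.8, 3.4, 3.0))

# Names of the columns that should be treated as categorical
factor_cols <- c("pid", "sex", "NTI")

# Convert all of them in one step; each column is read from itself
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Numerical variables such as FT3 are left out of factor_cols and therefore stay numeric.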
Recoding the Variable Group from (CENT, FIGLI, CTRL) to (CENTENARI, FIGLI_CENTENARI, CONTROLLI)
Option 1
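# Levels are stored in alphabetical order (CENT, CTRL, FIGLI), so the new names must follow that order
levels(DATI$Group) <- c("CENTENARI", "CONTROLLI", "FIGLI_CENTENARI")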
Option 2 Using dplyr Package
install.packages("dplyr")
library(dplyr)
# recode() returns a new vector, so the result is assigned back to the column
DATI$Group <- recode(DATI$Group, CENT = "CENTENARI", FIGLI = "FIGLI_CENTENARI", CTRL = "CONTROLLI")
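When recoding with levels(), the order of the levels matters. The sketch below, on a made-up vector grp, first shows the order in which R stores the levels and then uses a named list, a base R form that ties each new label to an explicit old code. The names grp and old_levels are assumptions for illustration.

```r
# Hypothetical toy vector with the three original group codes
grp <- factor(c("CENT", "FIGLI", "CTRL", "CENT"))

# Factor levels are stored in sorted order, not in order of appearance
old_levels <- levels(grp)
old_levels

# Assigning a named list maps each new label to an explicit old code,
# so the result does not depend on the order of the levels
levels(grp) <- list(CENTENARI = "CENT",
                    FIGLI_CENTENARI = "FIGLI",
                    CONTROLLI = "CTRL")

as.character(grp)
levels(grp)
```

With the list form, the new levels follow the order in which they are written in the list.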
Recoding the Variable uo from (1, 2, 3, 4, 5) to (Unita1, Unita2, Unita3, Unita4, Unita5)
Option 1
levels(DATI$uo) <- c("Unita1", "Unita2", "Unita3", "Unita4", "Unita5")
Option 2 Using dplyr Package
# Numeric codes must be wrapped in backticks when used as argument names
DATI$uo <- recode(DATI$uo, `1` = "Unita1", `2` = "Unita2", `3` = "Unita3", `4` = "Unita4", `5` = "Unita5")
Number of Overt.HyperT Cases
table(DATI$Overt.HyperT)
Displaying the Identifiers of Overt.HyperT Subjects
OVERT.HYPER <- subset(DATI, DATI$Overt.HyperT == 1)
OVERT.HYPER$pid
# Alternatively:
# pid is the first column, so we select column 1 of the subset
subset(DATI, DATI$Overt.HyperT == 1)[, 1]
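The two ways of extracting identifiers can be tried on a small made-up example. The dataframe toy and the object cases below are hypothetical. The point is the difference between selecting rows (before the comma) and columns (after the comma).

```r
# Hypothetical toy data: pid plus a 0/1 indicator
toy <- data.frame(pid = c("A01", "A02", "A03", "A04"),
                  Overt.HyperT = factor(c(0, 1, 0, 1)),
                  FT3 = c(3.1, 5.9, 2.8, 6.3))

# Inside subset() the column can be named directly
cases <- subset(toy, Overt.HyperT == 1)
cases$pid

# The same identifiers with bracket indexing: rows by condition, column by name
toy[toy$Overt.HyperT == 1, "pid"]

# Number of cases
nrow(cases)
```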
Creating a Dataset Named "FT3over" with Subjects Having FT3 > 3.46, and Number of Subjects
FT3over <- subset(DATI, DATI$FT3 > 3.46)
dim(FT3over)
Displaying Maximum and Minimum of FT3
range(DATI$FT3)
Number of Males and Females (1=Male, 2=Female)
table(DATI$sex)
Number of Males and Females Among Centenarians (1=Male, 2=Female)
table(DATI$sex, DATI$Group)
Number of Males and Females Among Hypothyroid Centenarians (1=Male, 2=Female)
table(DATI$sex, DATI$Group, DATI$Hypothyroidism)
Calculating Mean and Standard Deviation of Hormones FT3, FT4, and TSH
attach(DATI)
mean(FT3)
mean(FT4)
mean(TSH)
sd(FT3)
sd(FT4)
sd(TSH)
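The same statistics can be obtained without attach(), which avoids any confusion between the attached vectors and the dataframe. The following is only a sketch on made-up values. The names toy, hormones, means and sds are illustrative.

```r
# Hypothetical toy hormone values
toy <- data.frame(FT3 = c(3.0, 3.2, 2.8, 3.4),
                  FT4 = c(10.0, 12.0, 11.0, 11.0),
                  TSH = c(1.0, 2.0, 3.0, 2.0))

hormones <- c("FT3", "FT4", "TSH")

# Apply mean() and sd() to each selected column; no attach() needed
means <- sapply(toy[hormones], mean)
sds   <- sapply(toy[hormones], sd)

# Collect the results in one small table
round(cbind(mean = means, sd = sds), 2)
```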
Calculating Median of FT3, FT4, TSH
median(DATI$FT3)
median(DATI$FT4)
median(DATI$TSH)
Calculating Maximum, Minimum, Mean, Median, and Quartiles for All Numerical Variables and Number of Subjects for Each Categorical Variable in DATI
summary(DATI)
Calculating the Risk of Being Hypothyroid if Male (Odds Ratio)
X <- table(DATI$sex, DATI$Hypothyroidism)
(X[2] * X[3]) / (X[1] * X[4])
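Single-index notation is compact but hard to read, because the cells of a table are numbered column by column. The sketch below uses made-up counts (not the study data) and assumes the coding 1 = male, 2 = female and 0 = not hypothyroid, 1 = hypothyroid. It computes the odds ratio from named cells and then checks it against the single-index form.

```r
# Hypothetical counts: sex (1 = male, 2 = female) by hypothyroidism (0 = no, 1 = yes)
X <- matrix(c(40, 50,    # not hypothyroid: males, females
              10,  5),   # hypothyroid:     males, females
            nrow = 2,
            dimnames = list(sex = c("1", "2"), Hypothyroidism = c("0", "1")))
X <- as.table(X)

# Odds of hypothyroidism within each sex
odds_male   <- X["1", "1"] / X["1", "0"]   # 10 / 40
odds_female <- X["2", "1"] / X["2", "0"]   #  5 / 50

# Odds ratio, males versus females
odds_male / odds_female

# Same value with single-index notation (cells are numbered column by column)
(X[2] * X[3]) / (X[1] * X[4])
```

With these toy counts both expressions give 2.5, meaning the odds of hypothyroidism are 2.5 times higher in males than in females.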
Calculating the Risk of Being Hypothyroid if Centenarian (Odds Ratio)
Group2 <- DATI$Group
levels(Group2) <- c("CENTENARI", "NON_CENTENARI", "NON_CENTENARI")
X <- table(Group2, DATI$Hypothyroidism)
(X[2] * X[3]) / (X[1] * X[4])
Plot Describing the Number of Subjects Enrolled for Each Operational Unit
pie(table(DATI$uo))
# Adding more details to the chart
lbls <- levels(DATI$uo)
pct <- round(table(DATI$uo) / sum(table(DATI$uo)) * 100)
lbls <- paste(lbls, pct) # Add percents to labels
lbls <- paste(lbls, "%", sep = "") # Add % to labels
pie(table(DATI$uo), labels = lbls, col = rainbow(length(lbls)), main = "Pie Chart of Operational Units")
Plot Describing the Relationship Between FT3 and FT4
plot(DATI$FT3, DATI$FT4)
Plot Describing the Relationship Between Age and FT3 Using ggplot2
# Setting parameters
DF <- DATI
X <- DATI$age
Y <- DATI$FT3
group <- as.factor(1)
library(ggplot2)
ggplot(DF, aes(x = X, y = Y, color = group, shape = group)) +
  geom_point() +
  geom_smooth(method = lm, se = TRUE, fullrange = TRUE)
Plot Describing the Distribution of FT3, FT4, TSH, and Age Values
plot(density(DATI$FT3))
plot(density(DATI$FT4))
plot(density(DATI$TSH))
plot(density(DATI$age))
Boxplot of FT3 Levels by Group
boxplot(DATI$FT3 ~ DATI$Group, ylim = c(0, 6))
Plot of Mean FT4 Values by Group
A <- subset(DATI, DATI$Group == "CENTENARI")
B <- subset(DATI, DATI$Group == "FIGLI_CENTENARI")
C <- subset(DATI, DATI$Group == "CONTROLLI")
x <- mean(A$FT4)
y <- mean(B$FT4)
z <- mean(C$FT4)
J <- c(x, y, z)
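# xpd = FALSE clips the bars to the plot region, which is needed because the y-axis does not start at 0
barplot(J, names.arg = c("CENTENARI", "FIGLI_CENTENARI", "CONTROLLI"), ylim = c(8, 12), xpd = FALSE)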
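The group means can also be computed in a single call instead of building one subset per group. This is a sketch on made-up values. The names toy and group_means are assumptions, and the barplot() call is left as a comment so that the example only prints numbers.

```r
# Hypothetical toy data: FT4 values for the three groups
toy <- data.frame(
  Group = factor(c("CENTENARI", "CENTENARI", "FIGLI_CENTENARI",
                   "FIGLI_CENTENARI", "CONTROLLI", "CONTROLLI")),
  FT4 = c(10, 12, 9, 11, 10.5, 11.5))

# Mean of FT4 within each level of Group, in a single call
group_means <- tapply(toy$FT4, toy$Group, mean)
group_means

# The result is a named vector, so barplot() can label the bars itself:
# barplot(group_means, ylab = "Mean FT4")
```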
Correlogram of Thyroid Hormone Values and Age
# Creating a new dataframe with only the variables to be included in the correlogram
CORDF <- data.frame(DATI$FT3, DATI$FT4, DATI$TSH, DATI$age)
# Loading corrgram library
library(corrgram)
corrgram(CORDF)
# Using the corrplot package as an alternative. This package plots the correlation matrix obtained from cor(DATAFRAME)
# Saving the correlation matrix in an object M and then launching corrplot which graphs the correlation indices
M <- cor(CORDF)
library(corrplot)
corrplot(M)
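Both plots are drawn from a correlation matrix, so it can help to look at the matrix itself first. A minimal sketch with made-up values follows. In this toy example FT4 is constructed as an exact linear function of FT3, so their correlation is 1 by design.

```r
# Hypothetical toy values for three variables
toy <- data.frame(FT3 = c(3.0, 3.2, 2.8, 3.4, 3.1),
                  FT4 = c(10.0, 10.4, 9.6, 10.8, 10.2),
                  age = c(70, 65, 80, 60, 68))

# Pearson correlation matrix: symmetric, with 1 on the diagonal
M <- cor(toy)
round(M, 2)

# A single coefficient can be read by row and column name
M["FT3", "age"]
```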
Comparing FT3 Distributions Across Groups
library(sm)
sm.density.compare(DATI$FT3, DATI$Group, xlab = "FT3 Levels")
title(main = "Distribution of FT3 Levels")
# Adjusting the X-axis to exclude the sample with very high levels for a better plot
sm.density.compare(DATI$FT3, DATI$Group, xlim = c(0, 20), xlab = "FT3 Levels")
title(main = "Distribution of FT3 Levels")
# Adding a legend and positioning it on the plot
LEG <- DATI$Group
colfill <- 2:(1 + length(levels(LEG))) # one color per group
# locator(1) waits for a mouse click on the plot to position the legend
legend(locator(1), levels(LEG), fill = colfill)
Plot Describing the Ratio of Hypothyroid Males to Females
barplot(table(DATI$sex, DATI$Hypothyroidism))
Plot Describing the Ratio of Overt Hypothyroid Cases Across Groups
barplot(table(DATI$Overt.Hypot, DATI$Group))
Is There a Difference in FT4 Levels Measured Across Units?
BOX <- boxplot(DATI$FT4 ~ DATI$uo)
BOX$stats
# How would you proceed with this result?
Comparing FT3, FT4, and TSH Levels Between Controls and Children of Centenarians
B <- subset(DATI, DATI$Group == "FIGLI_CENTENARI")
C <- subset(DATI, DATI$Group == "CONTROLLI")
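# droplevels() removes the now empty CENTENARI level, so the boxplots show only two groups
DATINC <- droplevels(rbind(B, C))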
boxplot(DATINC$FT3 ~ DATINC$Group)
boxplot(DATINC$FT4 ~ DATINC$Group)
boxplot(DATINC$TSH ~ DATINC$Group)
Merging DATASET_THYROID and DATI by Sample (Purely for Exercise)
DATIMERGE <- merge(DATASET_THYROID, DATI, by.x = "pid", by.y = "pid")
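To see what merge() does with columns that exist in both tables, here is a small sketch on two made-up dataframes (left, right and merged are hypothetical names). It also shows where column names such as FT3.x come from.

```r
# Two hypothetical tables that share the key "pid" and a column named "FT3"
left  <- data.frame(pid = c(1, 2, 3), FT3 = c(3.1, 2.8, 3.4))
right <- data.frame(pid = c(2, 3, 4), FT3 = c(2.8, 3.4, 3.0), age = c(70, 101, 65))

# Inner join on pid: only subjects present in both tables are kept.
# Non-key columns with the same name receive the suffixes .x and .y
merged <- merge(left, right, by = "pid")
colnames(merged)
merged

# all = TRUE keeps the subjects of both tables and fills the gaps with NA
merge(left, right, by = "pid", all = TRUE)
```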