Import the "DATASET_BABY" database
Use the read.table function after saving the data as a tab-delimited text file (.txt). Specify that the file has a header (header=TRUE) and that the columns are separated by tabs (sep="\t").
Note: If you import the data through the graphical interface, pay attention to MISSING DATA. In the import options menu, use the "NA" field to specify how missing values are represented in the file. In this example they are coded as "NA", but in other files they might be blank cells. With read.table, the equivalent setting is the na.strings argument; listing both "NA" and "" covers either coding.
DATA <- read.table("DIRECTORY/DATA/DATASET_BABY.txt", header=TRUE, sep="\t", na.strings=c("NA", ""))
Note: In this example, an object DATA is created containing the data. If you import the DATASET_BABY file using the R graphical interface, an object named DATASET_BABY will be created instead of DATA. You’ll need to modify the subsequent commands or create a copy called DATA that contains DATASET_BABY.
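The effect of the missing-value setting can be checked on a small file. The sketch below writes a hypothetical three-row table to a temporary file and reads it back, treating both the string "NA" and empty cells as missing. The column names and values are invented for illustration and are not taken from DATASET_BABY.

```r
# Toy tab-delimited file; the contents are made up for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sample_ID\tSex\tBirth_Weight",
             "S1\tM\t3200",
             "S2\t\tNA",
             "S3\tF\t"), tmp)

# na.strings lists every spelling of "missing" that occurs in the file.
toy <- read.table(tmp, header = TRUE, sep = "\t", na.strings = c("NA", ""))

str(toy)              # Birth_Weight is read as a numeric type
colSums(is.na(toy))   # number of missing values in each column
unlink(tmp)           # delete the temporary file
```

If a file uses yet another code for missing values, it can be added to the na.strings vector in the same way.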
Assign the correct type to the variables: "factor" or "numeric". First, check the type of each variable with str, then convert the categorical variables to factors with as.factor.
str(DATA)
DATA$Case_Ctrl <- as.factor(DATA$Case_Ctrl)
DATA$Birth_Eut_Ces_Vent <- as.factor(DATA$Birth_Eut_Ces_Vent)
DATA$Sex <- as.factor(DATA$Sex)
DATA$Disease_during_Pregnancy <- as.factor(DATA$Disease_during_Pregnancy)
DATA$Diabetes <- as.factor(DATA$Diabetes)
DATA$Hypothyroidism <- as.factor(DATA$Hypothyroidism)
DATA$Smoke <- as.factor(DATA$Smoke)
DATA$Folic_Acid <- as.factor(DATA$Folic_Acid)
DATA$Folate_Assumption_Pre_Post_Pregnancy <- as.factor(DATA$Folate_Assumption_Pre_Post_Pregnancy)
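The conversions above repeat the same call for every column. As a more compact sketch, the categorical columns can be listed once and converted in a single step. The toy data frame here is a hypothetical stand-in containing only three of the columns; with the real data, the vector would list all the categorical variables.

```r
# Hypothetical stand-in for DATA with a few of its columns.
toy <- data.frame(Case_Ctrl = c(0, 1, 1, 0),
                  Sex = c("M", "F", "F", "M"),
                  Birth_Weight = c(3200, 2950, 3400, 3100))

# Names of the columns that should be treated as categorical.
cat_vars <- c("Case_Ctrl", "Sex")

# Convert each listed column to a factor; the other columns are left untouched.
toy[cat_vars] <- lapply(toy[cat_vars], as.factor)

str(toy)
```

Keeping the names in a vector such as cat_vars also makes it harder to convert the same column twice or to forget one.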
How many samples and how many variables are in the dataset?
dim(DATA)
Provide a summary of the data.
summary(DATA)
ANALYZE MISSING DATA
library(mice)
md.pattern(DATA)
There are only 45 subjects with complete data; the others have varying degrees of missing data.
To quickly see which variables are most affected by missing data, use the VIM library.
library(VIM)
aggr_plot <- aggr(DATA, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(DATA), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
The NK variable has 34 missing values, which is a lot, so I decide to remove it.
I could have reached the same conclusion from the output of the summary function, which reports the number of NA's for each variable.
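The count of missing values per variable can also be obtained directly in base R, without extra packages. The following is a minimal sketch on a hypothetical toy data frame; the values are invented, and only the column names echo those of the dataset.

```r
# Toy data frame with some missing values (values invented for illustration).
toy <- data.frame(CD8T = c(0.10, NA, 0.12, 0.08),
                  NK   = c(NA, NA, 0.05, NA),
                  Mono = c(0.07, 0.06, 0.09, 0.08))

# Number of missing values in each column, most affected first.
na_count <- colSums(is.na(toy))
sort(na_count, decreasing = TRUE)

# Percentage of missing values in each column.
round(100 * colMeans(is.na(toy)), 1)

# Number of rows (subjects) with no missing value at all.
n_complete <- sum(complete.cases(toy))
n_complete
```

Applied to DATA, the same three commands give the per-variable counts and the number of complete subjects.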
How do I remove the NK column?
There are several ways to do this.
One is to drop the column from the initial dataset with negative indexing.
DATAnew <- DATA[,-which(names(DATA) == "NK")]
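A few equivalent ways of dropping a column by name can be compared on a toy data frame (hypothetical values). The sketch also shows a pitfall of the negative which() form: if the name does not match any column, the result has no columns at all, because the index is empty.

```r
# Toy data frame (values invented for illustration).
toy <- data.frame(CD8T = c(0.10, 0.11),
                  NK   = c(NA, 0.05),
                  Mono = c(0.07, 0.06))

# Three ways of removing the NK column.
by_logical <- toy[, names(toy) != "NK"]     # keep every column not named NK
by_subset  <- subset(toy, select = -NK)     # subset() with a negated name
by_null    <- toy
by_null$NK <- NULL                          # delete the column in a copy

# Pitfall: a misspelled name with -which() silently drops every column.
wrong <- toy[, -which(names(toy) == "XYZ")]
ncol(wrong)
```

The logical form, names(toy) != "NK", leaves the data frame unchanged when the name is not found, which makes it the safer choice in scripts.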
However, I proceed with an alternative method that allows me not only to remove the NK column, which has too much missing data, but also to group all categorical and continuous variables. This is not essential but helps save time in later analysis stages, as continuous and categorical variables can be analyzed in blocks. Let’s proceed...
The attach command places the columns of DATA on R's search path, so each variable can be referred to directly by name (for example, Sex instead of DATA$Sex). This is convenient when building a new dataframe.
attach(DATA)
colnames(DATA)
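I construct a new dataframe using the data.frame function, inserting categorical variables first, then continuous ones, and leaving out the NK variable. Use the column names exactly as printed by colnames(DATA); a name that is not a valid R name, such as Mother's_BMI, must be wrapped in backticks.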
DATAnew <- data.frame(Sample_ID, Sample_Group, Case_Ctrl, Birth_Eut_Ces_Vent, Sex, Disease_during_Pregnancy, Diabetes, Hypothyroidism, Smoke, Folic_Acid, Folate_Assumption_Pre_Post_Pregnancy, Delta_Bodyweight_during_Pregnancy, Birth_Weight, Age_at_Delivery, `Mother's_BMI`, Bodyweight_before_Pregnancy, Pregnancy_Time_in_days, CD8T, CD4T, Bcell, Mono, Gran, PlasmaBlast)
detach(DATA)  # remove DATA from the search path once DATAnew has been built
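The same reordering can be done without attach by selecting columns by name. This is only an alternative sketch, not the method used in this exercise; the toy data frame and the vectors cat_vars and num_vars are hypothetical.

```r
# Toy data frame with columns in an arbitrary order (values invented).
toy <- data.frame(Birth_Weight = c(3200, 2950),
                  Sex = factor(c("M", "F")),
                  NK = c(NA, 0.05),
                  Sample_ID = c("S1", "S2"),
                  CD8T = c(0.10, 0.11))

# Categorical variables first, then continuous ones; NK is simply not listed.
cat_vars <- c("Sample_ID", "Sex")
num_vars <- c("Birth_Weight", "CD8T")

toy_new <- toy[, c(cat_vars, num_vars)]
names(toy_new)
```

Because the columns are selected by name, length(cat_vars) also gives the position where the numeric block starts, which is useful in the imputation step.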
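I impute only the numeric data, which occupy columns 12 to 23 of DATAnew (the first 11 columns are the identifiers and the categorical variables).
I impute them using the mice function, with predictive mean matching ("pmm") and five imputed datasets (m=5).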
tempData <- mice(DATAnew[,12:23],m=5,method='pmm',seed=50)
summary(tempData)
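Rather than counting column positions by hand, the positions of the numeric columns can be found programmatically. The sketch below uses a hypothetical toy data frame; it assumes, as in this exercise, that the categorical variables have already been converted to factors, since a categorical variable still stored as numbers would be counted as numeric.

```r
# Toy data frame: identifiers and factors first, numeric columns last.
toy_new <- data.frame(Sample_ID = c("S1", "S2", "S3"),
                      Sex = factor(c("M", "F", "F")),
                      Birth_Weight = c(3200, NA, 3400),
                      CD8T = c(0.10, 0.11, NA))

# Positions of the columns that are numeric (factors are not numeric).
num_pos <- which(sapply(toy_new, is.numeric))
num_pos

# The numeric block that would be passed to the imputation step.
num_block <- toy_new[, num_pos]
names(num_block)
```

Checking num_pos against the range typed by hand is a quick way to catch an off-by-one error in the column indices.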
Extract the first of the five imputed datasets.
completedData <- mice::complete(tempData, 1)
This command differs slightly from the one seen in class because it uses mice::complete instead of just complete. Another function called complete exists in the RCurl package, and if that package is loaded the wrong function could be called. Writing the package name explicitly prevents this error.
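How complete selects among the imputed datasets can be tried on nhanes, a small example dataset that ships with the mice package. This is an illustrative sketch that assumes mice is installed; nhanes is not part of this exercise, and printFlag=FALSE is used only to suppress the iteration log.

```r
library(mice)  # assumes the mice package is installed

# nhanes: example dataset shipped with mice (not part of this exercise).
imp <- mice(nhanes, m = 5, method = "pmm", seed = 50, printFlag = FALSE)

first   <- mice::complete(imp, 1)        # first imputed dataset (the default)
third   <- mice::complete(imp, 3)        # any of the m datasets can be requested
stacked <- mice::complete(imp, "long")   # all m datasets stacked, with .imp and .id columns

anyNA(first)                             # checks whether any missing value is left
nrow(stacked) == 5 * nrow(nhanes)        # one copy of the data per imputation
```

Using a single imputed dataset, as done here, is the simplest option; the other imputed datasets remain available in the same object.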
In the final stage, we reconstruct our clean dataset by combining the categorical variables with the IMPUTED numeric ones.
I take the first 11 columns of the DATAnew dataset and join them with the columns of the IMPUTED dataset using the data.frame function.
DATI <- data.frame(DATAnew[,1:11], completedData)
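After recombining, it is worth checking that the two blocks line up and seeing where missing values remain. Because only the numeric columns were imputed, any missing value in a categorical column is still present. The sketch below uses hypothetical toy blocks; num_part stands in for the imputed output and is typed by hand rather than produced by mice.

```r
# Toy categorical block with one missing value (values invented).
cat_part <- data.frame(Sample_ID = c("S1", "S2", "S3"),
                       Smoke = factor(c("yes", NA, "no")))

# Hand-typed stand-in for the imputed numeric block (no missing values).
num_part <- data.frame(Birth_Weight = c(3200, 3010, 3400))

# Join the two blocks column-wise, as done for DATI.
combined <- data.frame(cat_part, num_part)

# Both blocks must have the same number of rows, in the same order.
stopifnot(nrow(combined) == nrow(cat_part))

# Remaining missing values per column: only the categorical block can have any.
colSums(is.na(combined))
```

Running colSums(is.na(DATI)) on the real result shows in the same way whether any categorical variable still needs attention.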