1- Import the 'DATI_OBESI' database - Remove samples with missing data, save a new dataset named 'DATI,' and reclassify factorial variables that were mistakenly interpreted as numeric by R. Also, rearrange the variables so that categorical variables are grouped in the first columns of the dataframe, while continuous variables are placed subsequently.
I import the data using the readxl function to directly read Excel files. Then, I visualize the files using the head command to ensure they are well-structured. At this point, I use the str() command to check that the variables have been loaded correctly and that numeric and categorical variables correspond appropriately. I recode the inaccurate variables.
library(readxl)
DATI <- read_excel("C:/Users/GENTILINI/Desktop/DATI_OBESI.xls", na = "NA")
head(DATI_OBESI)
str(DATI_OBESI)
I recode the variables
DATI$IID<-as.factor(DATI$IID)
DATI$OBESITA_GRAVE<-as.factor(DATI$OBESITA_GRAVE)
DATI$sesso<-as.factor(DATI$sesso)
DATI$diabete<-as.factor(DATI$diabete)
DATI$tipo_diabete<-as.factor(DATI$tipo_diabete)
DATI$ipertensione<-as.factor(DATI$ipertensione)
DATI$OB_EDONICI<-as.factor(DATI$OB_EDONICI)
DATI$INFARTO<-as.factor(DATI$INFARTO)
To display the column names and their respective positions, I execute the command colnames(DATI).
At this point, I need to create a new dataframe that groups categorical and numeric variables.
I use the data.frame() function to create the new dataframe. Instead of rewriting all the new columns one by one, I take them in blocks because many of them are already grouped. For example, categorical variables are in the first three columns and the last five.
I use square brackets to indicate which columns to select from the DATI dataframe and then combine them in a new order using the data.frame function.
DATI<-data.frame(DATI[,1:3],DATI[,42:46], DATI[,4:41])
I have overwritten DATI for convenience and merged my variables.
2- Describe the dataset: How many subjects and variables are there? How many subjects have complete data? Evaluate the maximum, minimum, mean, and median values for all subjects and identify if any subject has values that can be considered outliers.
I calculate the dimensions of the dataframe and then, after removing missing data, I see how many samples have complete data in the new working dataset DATI_DEF.
I use summary(DATI_DEF) to assess all the data, and from here I notice that in the age and weight variables, there are two out-of-scale values, probably typos. I can correct them by opening the dataset with fix(DATI_DEF).
dim(DATI)
DATI_DEF<-na.omit(DATI)
dim(DATI_DEF)
summary(DATI_DEF)
fix(DATI_DEF)
3- Evaluate which variables are normally distributed.
To test if the variables in my dataset are normally distributed, I can use the Shapiro-Wilk test. I should apply it to all the numerical variables that I assume to have a normal distribution and want to test, excluding categorical or qualitative variables. I can apply the formula: shapiro.test(VARIABLE) to all the variables I want to test, but it could be time-consuming if there are many variables.
Alternatively, I can use the lapply function.
The formula is lapply(DATAFRAME, test-to-apply).
The dataframe should consist of variables for which the desired function can be used. In other words, the dataframe must contain columns compatible with the function we want to use.
In this case, the shapiro.test function doesn't make sense for categorical variables. Therefore, we need to create a dataframe containing only the numerical variables on which to test normality using the shapiro.test.
I can do this by creating a new dataframe and selecting the numerical variables of interest. In this case, the numerical variables are located between column 4 and column 30 inclusive, so the command will be:
lapply(DATI_DEF[,9:46], shapiro.test)
I will receive a list of results as output, which I will need to observe.
The result suggests that the only normally distributed variable is COL_TOT, with a p-value < 0.05.
4- Display a correlation matrix plot (correlogram) among all the continuous variables in the dataset and indicate for which variables the correlation is > |0.6|.
I load the 'corrgram' package to test the correlation among all the variables in the DATI_DEF dataset. Then, I install and load it using 'library()'. If the package is already installed, I just need to load it. Of course, the correlogram will be generated only for the continuous numerical variables, so on DATI_DEF[,9:46).
install.packages("corrgram")
library(corrgram)
corrgram(DATI_DEF[,9:46])
The graph provides limited information; in order to visualize the degree of correlation, we need a correlation matrix that highlights all possible intersections.
I obtain the result using:
CORRELAZIONE<-cor(DATI_DEF[,9:46])
As you have seen, I saved the result, which is a matrix, in an object. I can visualize it, but I can also utilize it. The question, in fact, asks which variables have a correlation > |0.6|.
To answer this question, I will simply ask:
CORRELAZIONE > abs(0.6)
# abs indicates absolute value
5- Is there a connection between severe obesity and hedonic obesity (OB_EDONICI)? Is this connection significant?
The variables are binary categorical, so when compared, they will generate a 2x2 contingency table where samples are distributed among the 4 cells. If there is a relationship between the variables, the subjects won't be distributed randomly, but will tend to predominantly fill one of the cells.
The test to verify this hypothesis is the chi-squared test, and I apply it in the following way:
chisq.test(table(DATI_DEF$OB_EDONICI,DATI_DEF$OBESITA_GRAVE))
The test result is significant, indicating a connection between severe obesity and hedonic obesity.
6- Evaluate using the most appropriate graph whether there are differences in BMI values between hedonic and non-hedonic obese individuals. Assess the BMI change transitioning from hedonic obese to non-hedonic obese.
To compare BMI values between hedonic and non-hedonic obese individuals, I can use a boxplot, which allows me to check for differences in median BMI values, for example.
boxplot(DATI_DEF$bmi~DATI_DEF$OB_EDONICI)
To assess the change from hedonic obese to non-hedonic obese, I apply a regression model where the independent variable X is binary (hedonic obese) and the dependent variable Y is BMI.
summary(glm(DATI_DEF$bmi~DATI_DEF$OB_EDONICI, family="gaussian"))
The result is significant and indicates a reduction of 3.2 BMI points when transitioning from non-hedonic obese to hedonic obese.
7- Evaluate how the caloric intake (reeday) changes based on being severely obese.
To address this question, I apply a model similar to the previous one.
summary(glm(DATI_DEF$reeday~DATI_DEF$OBESITA_GRAVE, family="gaussian"))
There is an increase in caloric intake of 358 calories per day when transitioning from obese to severely obese.
Note: Please temporarily disregard the warning that is produced at the end of the analysis.
8- Evaluate by introducing them into the previous model whether variables such as age, hedonic obesity, and gender are associated with caloric intake and indicate the 'pure' effects of these variables on caloric intake.
summary(glm(DATI_DEF$reeday~DATI_DEF$AGE+DATI_DEF$OB_EDONICI+DATI_DEF$SEX, family="gaussian"))
9- Is there a difference in the number of severely obese individuals between males and females? Apply the appropriate test.
chisq.test(table(DATI_DEF$OBESITA_GRAVE,DATI_DEF$SEX))
10- Is there an association between BMI and gender? Apply the appropriate test considering the result from exercise 9.
summary(glm(DATI_DEF$BMI~DATI_DEF$sesso, family="gaussian"))
11- Is there a relationship between heart attack and diabetes?
chisq.test(table(DATI_DEF$INFARTO,DATI_DEF$DIABETE))