1- Import the DATASET_ESERCIZIO1 data. There are different procedures to load data files in R. It greatly depends on the type of files we are dealing with. The most common and simple file types are ".txt" and ".csv" files.
In ".txt" and ".csv" files, the columns are separated by delimiters such as tabs or commas. We can always convert a file to one of these types by saving the data file we have and selecting from the drop-down menu "Text with tab-separated values .txt" (save as ---> tab-delimited text). Alternatively, we can save it as "CSV (Comma Delimited)". Opening the file with Notepad or any text editor, the data will appear divided by tabs or commas.
If we have chosen to convert the original file to a .txt or .csv file, we can load it using the read.table() function.
The syntax to import the data will be:
DATASET_ESERCIZIO1 <- read.table("/path/to/file/DATASET_ESERCIZIO1.txt", header=TRUE, sep="\t")
DATASET_ESERCIZIO1 <- read.table("/path/to/file/DATASET_ESERCIZIO1.csv", header=TRUE, sep=",")
How to get the path? Right-click on the data file, select "Properties," and copy the file path.
Ensure the slashes in the path are forward slashes "/" rather than the backslashes "\" that Windows uses.
header=TRUE indicates that the file has a header (with the variable names).
sep="\t" indicates that the delimiter separating column values in the file is a tab.
For CSV files, the delimiter is sep=",".
If the PC is set to the Italian language, it may happen that the CSV file does not have commas but semicolons instead. In this case, simply change to sep=";".
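A minimal sketch of the semicolon case. The file below is a small temporary file created only for illustration (its contents and column names are made up, it is not the exercise dataset). On systems that write semicolons, decimals are usually written with a comma as well, so dec="," is set too; read.csv2() is a base R shortcut whose defaults are sep=";" and dec=",".

```r
# Hypothetical toy file written to a temporary location (not the exercise data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID;AGE;BMI", "1;54;27,5", "2;61;31,2"), tmp)

# Option 1: read.table() with an explicit separator and decimal mark
d1 <- read.table(tmp, header = TRUE, sep = ";", dec = ",")

# Option 2: read.csv2(), whose defaults are sep = ";" and dec = ","
d2 <- read.csv2(tmp)

str(d1)
```

If BMI were read as text instead of numbers, the decimal mark would be the first thing to check.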
Alternatively, you can install a package that allows you to directly load xls files. The package is called xlsx. After installing it, load it into R to activate the functions it contains, including the read.xlsx() function.
Always remember to load any R package after installation with the command library(PACKAGE-NAME):
install.packages("xlsx")
library(xlsx)
DATASET_ESERCIZIO1 <- read.xlsx("C:/Users/GENTILINI/Desktop/DATASET_ESERCIZIO1.xls", sheetName = "DATASET_TIROIDE")
2- Display the first six rows of the file, check that the formatting is correct, count the number of samples, and the number of variables in the file:
head(DATASET_ESERCIZIO1)
dim(DATASET_ESERCIZIO1)
The command will show that the file contains data for 100 subjects for 13 measured variables.
3- Display the structure of the file and the attributes of the contained variables. What conclusions can you draw?
str(DATASET_ESERCIZIO1)
The result indicates that DATASET_ESERCIZIO1 is a data frame and highlights the attributes of the individual variables. All variables are treated as numerical, but in reality not all of them are. Sex (SEX), heart attack (INFARTO), and smoking (FUMO) are categorical variables stored as numeric codes, so it is appropriate to classify them correctly.
4- Correctly classify numerical and categorical variables:
DATASET_ESERCIZIO1$SEX <- as.factor(DATASET_ESERCIZIO1$SEX)
DATASET_ESERCIZIO1$INFARTO <- as.factor(DATASET_ESERCIZIO1$INFARTO)
DATASET_ESERCIZIO1$FUMO <- as.factor(DATASET_ESERCIZIO1$FUMO)
Sex, heart attack, and smoking are categorical variables and not numerical. The values 1, 2, or 0, 1 indicate male/female, no heart attack/yes heart attack, and non-smoker/smoker, respectively. Repeating the command str(DATASET_ESERCIZIO1) will show how these variables have been reclassified.
Recoding categorical variables: Change SEX from (1,2) to (M,F), INFARTO from (0,1) to (NO-STROKE, STROKE), and FUMO from (0,1) to (NO-SMOKERS, SMOKERS):
levels(DATASET_ESERCIZIO1$SEX)
levels(DATASET_ESERCIZIO1$SEX) <- c("M", "F")
levels(DATASET_ESERCIZIO1$INFARTO)
levels(DATASET_ESERCIZIO1$INFARTO) <- c("NO-STROKE", "STROKE")
levels(DATASET_ESERCIZIO1$FUMO)
levels(DATASET_ESERCIZIO1$FUMO) <- c("NO-SMOKERS", "SMOKERS")
The levels() command shows the levels of a categorical variable and lets you rename them as deemed appropriate. It returns a character vector containing the level names; assigning a new vector, for example levels(x) <- c("new-name-1", "new-name-2"), renames the levels in the order in which they are listed, so always check the current order first.
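The recoding can be tried on a small vector before touching the real data. The sketch below uses a made-up vector coded like SEX (1 = male, 2 = female) and compares the two-step approach used above with factor(), which states the mapping from codes to labels in a single call. The object names are hypothetical.

```r
# Hypothetical toy vector coded like SEX in the exercise (1 = male, 2 = female)
sex_code <- c(1, 2, 2, 1, 2)

# Approach used above: convert, then rename the levels in their sorted order
sex_a <- as.factor(sex_code)
levels(sex_a)                 # "1" "2": check the order before renaming
levels(sex_a) <- c("M", "F")

# Alternative: state the code-to-label mapping explicitly in one call
sex_b <- factor(sex_code, levels = c(1, 2), labels = c("M", "F"))

table(sex_a)
```

Both approaches give the same result here; the explicit version does not depend on the order in which R happens to sort the codes.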
5- Are there missing data? If so, exclude them and check the number of samples removed due to missing data:
DATI <- na.omit(DATASET_ESERCIZIO1)
dim(DATI)
The na.omit() command eliminates all individuals with at least one missing value; this solution can be used to work only with samples with complete observations. The following dim() command shows that there were no missing data in the dataset and that no samples were discarded.
Now, the dataset on which I am working, the one cleaned of missing data, is saved in the "DATI" object, so I will refer to DATI in every subsequent phase of the analysis.
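To count how many samples na.omit() removes, the number of rows before and after can be compared. A minimal sketch on a made-up data frame with two incomplete rows (the exercise dataset itself has none):

```r
# Hypothetical toy data frame with two incomplete rows (not the exercise data)
toy <- data.frame(AGE = c(50, NA, 63, 47),
                  BMI = c(24.1, 30.2, NA, 27.8))

colSums(is.na(toy))          # missing values per variable
sum(!complete.cases(toy))    # rows with at least one missing value

clean   <- na.omit(toy)
removed <- nrow(toy) - nrow(clean)   # number of rows dropped by na.omit()
removed
```

A value of 0 for removed corresponds to the situation described above, where no samples are discarded.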
6- Describe the AGE variable: Calculate the mean, standard deviation, median, 1st quartile, 3rd quartile, and maximum and minimum values. Create a graph that shows the age distribution within the study sample:
mean(DATI$AGE)
sd(DATI$AGE)
median(DATI$AGE)
quantile(DATI$AGE)
# or alternatively
BX <- boxplot(DATI$AGE)
BX$stats
range(DATI$AGE)
Quartiles can be calculated with the quantile() function or the boxplot() function; both also return the median. Here boxplot() is applied to the variable DATI$AGE and the result is saved in an object called BX; BX$stats contains the lower whisker, the lower hinge, the median, the upper hinge, and the upper whisker. The range() function returns the minimum and maximum values. The boxplot() function also automatically draws a boxplot that describes the distribution of the AGE values, albeit in a summary manner.
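The descriptive functions can be compared side by side on a short vector. The ages below are made up for illustration. Note that the hinges used by boxplot() are computed with Tukey's rule and can differ slightly from the quartiles returned by quantile() with its default settings.

```r
# Hypothetical ages, for illustration only
age <- c(34, 41, 45, 52, 58, 60, 67, 73)

summary(age)                   # min, 1st quartile, median, mean, 3rd quartile, max
quantile(age, c(0.25, 0.75))   # quartiles with quantile()'s default method
fivenum(age)                   # min, lower hinge, median, upper hinge, max
boxplot.stats(age)$stats       # the values boxplot() stores in $stats, without drawing
```

With these values the first quartile from quantile() is 44, while the lower hinge is 43; on larger samples the two are usually very close.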
To visualize the distribution of AGE values more effectively, a histogram or a density plot can be used. A histogram divides the range of DATI$AGE into intervals and draws one bar per interval, whose height is the number of subjects falling in it. The graph can be modified by choosing the age intervals within which the samples are counted. The alternative is the density plot, which draws a smooth curve that approximates the distribution of the samples by age.
hist(DATI$AGE)
hist(DATI$AGE, breaks=5, labels=TRUE)
# density plot
plot(density(DATI$AGE))
# combining both for a better graph
hist(DATI$AGE, freq=FALSE, breaks=5, labels=TRUE)
lines(density(DATI$AGE), col="red")
The first command, hist(DATI$AGE), is the basic form. In the second command, we have added two options: breaks, which sets the number of intervals of the AGE variable within which the subjects are counted (R treats a single number as a suggestion and rounds the interval limits to convenient values), and labels, which writes on top of each column the number of subjects contained in that interval.
The density() function calculates the density distribution and can be visualized by including the result within the plot() command. In other words, plot(density(DATI$AGE)).
A more complex graph can be created by combining both the histogram and the density plot; in this case, add the density plot using the lines() function. This function allows adding another graph to the previous one.
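A self-contained sketch of the combined graph, using simulated ages rather than the exercise data. The option freq = FALSE matters here: it rescales the bars so that their total area is 1, which puts the histogram on the same scale as the density curve. The interval limits are chosen as multiples of 10 purely for illustration.

```r
set.seed(1)                                    # reproducible simulated ages
age <- round(rnorm(100, mean = 55, sd = 10))

# Interval limits every 10 years, wide enough to cover all simulated values
brk <- seq(floor(min(age) / 10) * 10, ceiling(max(age) / 10) * 10, by = 10)

h <- hist(age, breaks = brk, freq = FALSE,
          main = "Age distribution", xlab = "Age (years)")
lines(density(age), col = "red")               # overlay the density curve

h$counts                                       # subjects per 10-year interval
```

The object returned by hist() keeps the counts even when the bars are drawn on the density scale.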
7- What is the average age in the male and female groups?
There are several ways to obtain the average age in the two groups. You can divide the data into males and females and create two separate datasets in which you then calculate the mean of the AGE variable as seen previously.
Use the subset function to split the datasets and save the dataset only with male values in a new object DATASET_MALES; do the same with DATASET_FEMALES:
DATASET_MALES <- subset(DATI, DATI$SEX=="M")
DATASET_FEMALES <- subset(DATI, DATI$SEX=="F")
# calculate the average age values simply with:
mean(DATASET_MALES$AGE)
mean(DATASET_FEMALES$AGE)
You can do the same thing much more quickly with the tapply() function. It applies any function, in this case the mean, to each group of subjects identified by a categorical variable.
The syntax is:
tapply(DATI$AGE, DATI$SEX, mean)
Between the parentheses, the first position is the variable on which we want to apply the function; the second position is the variable that determines the criteria for selecting the samples; and the third position is the type of function we want to use.
This can be applied to many other functions like median, sd, sum, and graphical functions like plot, boxplot, density, etc.
Example: tapply(DATI$AGE, DATI$SEX, density) computes a density estimate of the AGE variable separately for males and females and returns one result per group.
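A small sketch of tapply() on a made-up data frame, so that the result can be checked by hand. The aggregate() call at the end is an alternative from base R that returns the same group means as a data frame.

```r
# Hypothetical toy data (not the exercise dataset)
toy <- data.frame(AGE = c(40, 52, 60, 30, 71),
                  SEX = factor(c("M", "F", "M", "F", "F")))

means <- tapply(toy$AGE, toy$SEX, mean)   # mean age within each level of SEX
means

tapply(toy$AGE, toy$SEX, sd)              # same idea with another function

aggregate(AGE ~ SEX, data = toy, FUN = mean)   # alternative, result as a data frame
```

The result of tapply() is named by group, so a single value can be extracted with, for example, means["M"].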
Create a new variable OBESITY in which you assign the value "OBESE" to subjects with BMI > 35 and "NOT_OBESE" to those with BMI ≤ 35, and then add it to DATI.
OBESITY <- as.factor(DATI$BMI > 35)
levels(OBESITY) <- c("NOT_OBESE", "OBESE")
DATI <- data.frame(DATI, OBESITY)
With the command DATI$BMI > 35, I get a logical vector in which each BMI value that meets the condition is marked as TRUE and every other value as FALSE. I save the result in an object called OBESITY and, at the same time, use the as.factor() command to treat it as a factor with levels FALSE and TRUE, in that order.
After obtaining a TRUE-FALSE factor, I recode it using the levels() command as seen previously.
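The two-step recoding relies on both TRUE and FALSE actually occurring in the data. A slightly more defensive sketch states the mapping explicitly with factor(); the BMI values below are made up for illustration.

```r
bmi <- c(22.4, 36.1, 35.0, 41.3, 29.9)   # hypothetical BMI values

# FALSE -> "NOT_OBESE", TRUE -> "OBESE", regardless of which values occur
obesity <- factor(bmi > 35,
                  levels = c(FALSE, TRUE),
                  labels = c("NOT_OBESE", "OBESE"))

obesity
table(obesity)
```

A subject with BMI exactly 35 is classified as NOT_OBESE, in line with the definition given in the exercise.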
Do males smoke more cigarettes than females?
tapply(DATI$N_CIGARETTES, DATI$SEX, mean)
To answer this question, we can use the tapply function as done previously and compare the average values of the number of cigarettes.
How many smokers are there among males and females?
table(DATI$SEX, DATI$FUMO)
I use the table() command to display a table of my data.
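A sketch of a two-way table on made-up data, with totals and row proportions added. The column names mirror those of the exercise, but the values are invented.

```r
# Hypothetical toy data (not the exercise dataset)
toy <- data.frame(
  SEX  = factor(c("M", "M", "F", "F", "F", "M")),
  FUMO = factor(c("SMOKERS", "NO-SMOKERS", "SMOKERS",
                  "NO-SMOKERS", "NO-SMOKERS", "SMOKERS"))
)

tab <- table(SEX = toy$SEX, FUMO = toy$FUMO)   # named arguments label the two dimensions
tab

addmargins(tab)                # adds row and column totals
prop.table(tab, margin = 1)    # proportions within each sex (rows sum to 1)
```

Row proportions are often easier to compare than raw counts when the two groups have different sizes.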
Which type of chart would you use to visualize the result of the previous exercise? Two functions accept the result of table() directly. Applied to a two-way table, plot() draws a mosaic plot, in which the area of each tile is proportional to the count; barplot() draws a bar chart with one bar per column of the table, stacked by row.
plot(table(DATI$SEX, DATI$FUMO))
# Alternatively:
barplot(table(DATI$SEX, DATI$FUMO), ylim = c(0, 100))
In the barplot() command, I set the Y-axis to run from 0 to 100 to leave room for the legend. If I want to add a legend, I use the legend() command:
legend(2, 100, legend = c("Females", "Males"), fill = c("grey", "dimgrey"))
In the previous command, 2, 100 refers to the coordinates on the chart where to place the legend (these can be adjusted by trying different positions and choosing the best one).
If I want to choose colors, I can use the R color tables available here:
http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
To modify the colors of the previous chart, I just need to add the option col = c("color1", "color2"). In this case, I use a vector that carries the names of the colors to be used, for example:
# colors are assigned to the rows of the table in order (M, then F)
barplot(table(DATI$SEX, DATI$FUMO), col = c("red", "blue"), ylim = c(0, 100))
legend(2, 100, legend = c("Males", "Females"), fill = c("red", "blue"))
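To avoid a mismatch between colors and legend labels, barplot() can build the legend itself from the row names of the table through its legend.text argument. The sketch below uses a made-up table of counts and draws the bars side by side (beside = TRUE) instead of stacked.

```r
# Hypothetical counts: rows are sexes, columns are smoking status
tab <- matrix(c(30, 20, 25, 25), nrow = 2,
              dimnames = list(SEX  = c("M", "F"),
                              FUMO = c("NO-SMOKERS", "SMOKERS")))

bp <- barplot(tab, beside = TRUE, col = c("red", "blue"),
              ylim = c(0, 50),
              legend.text = rownames(tab),          # labels taken from the table rows
              args.legend = list(x = "topright"))   # keyword position instead of coordinates
```

Because the labels come from the table, the first color always corresponds to the first row.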
Create a table showing the number of heart attack cases among male smokers. To tabulate the data, I can use several methods and functions; the simplest is table().
Another option is ftable(), which is meant for cases with more than two variables to tabulate. Here there are three: table() prints a separate two-way table for each level of the third variable, while ftable() arranges all the counts in a single flat table that is easier to read.
table(DATI$INFARTO, DATI$SEX, DATI$FUMO)
ftable(DATI$INFARTO, DATI$SEX, DATI$FUMO)
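A sketch on made-up data showing how a single cell, here the heart attack cases among male smokers, can be read from the three-way table or counted directly with a logical condition. The level names follow the recoding used earlier; the values are invented.

```r
# Hypothetical toy data (not the exercise dataset)
toy <- data.frame(
  SEX     = c("M", "M", "M", "F", "F", "M"),
  FUMO    = c("SMOKERS", "SMOKERS", "NO-SMOKERS", "SMOKERS", "NO-SMOKERS", "SMOKERS"),
  INFARTO = c("STROKE", "NO-STROKE", "STROKE", "STROKE", "NO-STROKE", "STROKE")
)

tab3 <- table(INFARTO = toy$INFARTO, SEX = toy$SEX, FUMO = toy$FUMO)
ftable(tab3)                                  # flat, readable layout

cell <- tab3["STROKE", "M", "SMOKERS"]        # one cell, indexed by level names

# The same count obtained directly from the data
n <- sum(toy$SEX == "M" & toy$FUMO == "SMOKERS" & toy$INFARTO == "STROKE")
n
```

Indexing by level names avoids having to remember the position of each level in the table.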
What type of graph would you use to explore a relationship between two numerical variables like BMI and WEIGHT? What kind of relationship exists between the two variables? To evaluate a potential relationship between the two variables, you can use a scatterplot, which is easily obtained with the plot() command.
plot(DATI$BMI, DATI$WEIGHT)
In this graph, each subject is drawn as a point whose X coordinate is the BMI and whose Y coordinate is the weight, which makes a potential relationship between the two visible.
You can see that there is indeed a relationship between the two variables: as the BMI variable increases, the weight variable also increases. This generates a graph where the points scatter around a line.
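The visual impression can be complemented with a number. The sketch below uses simulated heights and weights (not the exercise data), computes BMI as weight divided by height squared, adds a fitted straight line to the scatterplot, and reports the Pearson correlation, which measures the strength of the linear association.

```r
set.seed(42)                                   # reproducible simulated data
height <- rnorm(50, mean = 1.70, sd = 0.08)    # metres
weight <- rnorm(50, mean = 75,   sd = 12)      # kilograms
bmi    <- weight / height^2

plot(bmi, weight, xlab = "BMI", ylab = "Weight (kg)")
abline(lm(weight ~ bmi), col = "red")          # least-squares line through the points

r <- cor(bmi, weight)                          # Pearson correlation coefficient
r
```

A coefficient close to 1 indicates that the points lie close to an increasing straight line.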
Which type of graph would you use to compare the median values of BLOOD_PRESSURE in groups of smokers and non-smokers? What do you deduce from the graph? What are the median values in the two groups?
The most appropriate type of graph seems to be the boxplot. The intent is to verify whether the blood pressure is different in the two groups.
BOX <- boxplot(DATI$BLOOD_PRESSURE ~ DATI$FUMO)
BOX$stats
The boxplot function not only draws the graph but also calculates the quartiles and outliers. Thus, it is possible to read the median values from the third row of BOX$stats, which has one column per group (BOX, of course, is the name of the object where I saved the result).
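A sketch on made-up blood pressure values showing where the medians sit in the object returned by boxplot(), and how to obtain the same numbers with tapply(). The column names BP and FUMO and the level names are hypothetical.

```r
# Hypothetical toy data (not the exercise dataset)
toy <- data.frame(
  BP   = c(120, 130, 125, 140, 150, 145, 135),
  FUMO = factor(c("NO", "NO", "NO", "YES", "YES", "YES", "YES"))
)

box <- boxplot(BP ~ FUMO, data = toy)

box$names          # group order of the columns of box$stats
box$stats[3, ]     # third row: the median of each group

med <- tapply(toy$BP, toy$FUMO, median)   # same medians, labelled by group
med
```

Using the formula together with data = toy keeps the axis labels short, since the column names are used instead of the full toy$BP expression.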
What type of chart would you use to visualize the percentage of males and females?
The most appropriate type of chart could be a pie chart, where it is possible to get an idea of the percentage of male and female subjects. This type of chart is useful when you want to describe how a certain qualitative variable is distributed and provide an image that allows you to quickly visualize the percentage of each category of the variable of interest.
The pie() function allows you to create a pie chart but must be used on the result of the table() function. In other words, if we want to create a chart that shows the frequencies with which some categorical variables are characterized, we first need to calculate the frequencies with the table() function and then use the pie() function to visualize them.
pie(table(DATI$SEX))
Are cholesterol levels similarly distributed between males and females? Visualize the distribution in both sexes to get an idea.
The function used to estimate the distribution of a variable is density(). It computes a kernel density estimate, that is, a smoothed version of the histogram. Printing the result shows a summary (minimum, quartiles, median, mean, and maximum) of the points at which the density was evaluated and of the estimated density values, but the result is usually plotted by passing it to the plot() function. In other words, to visualize the distribution of any numerical variable, I can use the density() function and include the result in the plot() function as follows:
plot(density(DATI$CHOLESTEROL))
However, the exercise did not ask to visualize the distribution of a variable but to visualize its distribution in both males and females by comparing them.
To do this, there is a specifically developed function called sm.density.compare(), which is contained in the sm package. Therefore, it is necessary to first install it and then load it if we want to use it.
Install and load it with the two commands:
install.packages("sm")
library(sm)
Launch the function:
# first argument: the numerical variable; second argument: the grouping factor
sm.density.compare(DATI$CHOLESTEROL, DATI$SEX)
The result shows that cholesterol levels are similarly distributed between the two sexes.
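The same comparison can be sketched in base R without installing any package, by estimating one density per group and drawing both on the same axes. The cholesterol values below are simulated, and the colors and legend position are arbitrary choices.

```r
set.seed(7)                                        # reproducible simulated data
chol <- c(rnorm(40, mean = 200, sd = 25),
          rnorm(40, mean = 205, sd = 25))
sex  <- factor(rep(c("M", "F"), each = 40))

d_m <- density(chol[sex == "M"])                   # one density estimate per group
d_f <- density(chol[sex == "F"])

# Axis limits wide enough to show both curves completely
plot(d_m, main = "Cholesterol by sex",
     xlim = range(d_m$x, d_f$x), ylim = range(0, d_m$y, d_f$y))
lines(d_f, col = "red")
legend("topright", legend = c("M", "F"), col = c("black", "red"), lty = 1)
```

Largely overlapping curves suggest similar distributions in the two groups.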
What type of chart would you use to visualize a potential relationship between BMI and WEIGHT?
If we want to visualize the relationship between two variables, the most useful graph is the scatterplot.
The scatterplot allows us to visualize the relationship that exists between two variables under examination, in this case, BMI and WEIGHT. How do we obtain it? We obtain it using the plot() function and separating the two variables with a comma:
plot(DATI$BMI, DATI$WEIGHT)
The graph shows that the subjects (each point on the graph is a subject identified according to their BMI and WEIGHT values) do not distribute randomly but along an imaginary line. Specifically, as weight increases, BMI increases. This graph helps us understand that there is a linear relationship between BMI and WEIGHT.