Exercise 1
In a study, the effectiveness of the influenza vaccination is recorded. A sample of 1000 people is vaccinated on December 1st, while a second group of 2350 people is treated with a placebo. There are 20 cases of influenza in the vaccinated group, while the cases of influenza in the non-vaccinated group are 80. Can it be claimed that the vaccine has a significant effectiveness in prevention? Display the data in an appropriate graph.
The data we are dealing with are categorical (vaccine/placebo) and (influenza_yes/influenza_no).
I construct the contingency table with exposure in columns (E+ vaccinated, E- placebo) and disease status in rows (M+ influenza, M- no influenza).
The column totals are 1000 (E+) and 2350 (E-).
A<- 20
B<- 80
C<-(1000-20)
D<-(2350-80)
TAB<-matrix(c(A,C,B,D),2)
chisq.test(TAB,correct=FALSE)
The test is significant (p < 0.05): there is an association between vaccination and influenza.
In particular, to quantify the risk associated with vaccination, I calculate the odds ratio (cross-product ratio). A value below 1 indicates a protective effect.
OR<- (A*D)/(B*C) # 0.5790816
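The exercise also asks for a graph, which the solution above does not show. A minimal sketch, assuming that a bar plot of the percentage of influenza cases per group is an acceptable choice (the dimnames and the object name risk are illustrative and not part of the original solution):

```r
# Counts from the exercise: rows = influenza yes/no, columns = vaccine/placebo
TAB <- matrix(c(20, 980, 80, 2270), nrow = 2,
              dimnames = list(Influenza = c("yes", "no"),
                              Group = c("vaccine", "placebo")))

# Proportion of influenza cases within each group (column-wise proportions)
risk <- prop.table(TAB, margin = 2)["yes", ]
print(round(risk, 4))

# Odds ratio as the cross-product ratio of the table
OR <- (TAB["yes", "vaccine"] * TAB["no", "placebo"]) /
      (TAB["yes", "placebo"] * TAB["no", "vaccine"])
print(OR)

# Bar plot of the percentage of influenza cases in each group
barplot(100 * risk, ylab = "Influenza cases (%)",
        main = "Influenza cases by group")
```

A mosaic plot of the whole table (mosaicplot(TAB)) would be another reasonable option for two categorical variables.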
Exercise 2
The dietitian Mario Smilzo claims that his diet leads to very rapid weight loss. To demonstrate it, he takes a group of 10 people and weighs them before and after the diet. Their weights before and after the diet were:
80 90 96 85 102 105 99 89 98 78 pre
75 79 98 82 95 102 95 80 97 76 post
This is a comparison between a continuous variable, WEIGHT, and a categorical variable, PRE/POST.
It is a paired study.
The most suitable test appears to be a paired t-test, but before deciding, we need to check if the assumptions are met.
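1- Are the observations independent? The 10 subjects are independent of each other, while the two measurements on each subject are paired; this is why a paired test is required.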
2- Is the distribution normal?
To check this, I create a vector containing the data and another one indicating the categories.
PESO1<-c(80,90,96,85,102,105,99,89,98,78,75,79,98,82,95,102,95,80,97,76)
prepost<-c(rep("pre",10),rep("post",10))
I use the Shapiro-Wilk test to check if the distribution is normal. Strictly speaking, the paired t-test assumes normality of the pre-post differences rather than of the pooled weights.
shapiro.test(PESO1)
The test is not significant, so normality is not rejected and the distribution can be treated as normal.
3- Are the variances homogeneous (homoscedasticity)?
I apply the Bartlett test for the homogeneity of variances.
bartlett.test(PESO1,prepost)
In this case as well, the test is not significant, so the variances can be considered homogeneous.
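At this point, I apply a one-tailed paired t-test. It is one-tailed because I expect a weight change only in the direction of loss and exclude the possibility of an increase after the diet. The post values are passed first, so alternative = "less" tests whether post minus pre is negative. The two vectors are passed explicitly because recent versions of R no longer accept paired=TRUE together with a formula.
t.test(PESO1[prepost=="post"],PESO1[prepost=="pre"],paired=TRUE,alternative="less")
The test is significant and reveals a mean weight loss of 4.3 kg after the diet.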
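Since a paired t-test works on the within-subject differences, the normality check can also be run on those differences directly. A minimal sketch using the same weights (the names pre, post and delta are illustrative and not part of the original solution):

```r
pre  <- c(80, 90, 96, 85, 102, 105, 99, 89, 98, 78)
post <- c(75, 79, 98, 82, 95, 102, 95, 80, 97, 76)

# Within-subject change: negative values mean weight loss
delta <- post - pre

# Normality of the differences, which is what the paired t-test relies on
print(shapiro.test(delta))

# One-tailed paired t-test: is the mean of (post - pre) below zero?
res <- t.test(post, pre, paired = TRUE, alternative = "less")
print(res)

# Mean weight loss in kg
mean_loss <- mean(pre - post)
print(mean_loss)
```

The same result is obtained with a one-sample test on the differences, t.test(delta, alternative = "less"), which makes the role of the differences explicit.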
Exercise 3
Mario Smilzo's colleague, Dr. Tina Insala, claims that her diet is more effective, and she also puts 10 people on her diet. The weights before and after the treatment are described in the table:
90 85 96 85 102 105 99 89 98 78 pre
75 79 80 80 90 95 98 85 92 72 post
Check if the diet was effective and if there is a significant difference compared to the weight loss achieved by Mario Smilzo.
As previously, check for normality and homoscedasticity.
PESO2<-c(90, 85, 96, 85, 102, 105, 99, 89, 98, 78, 75, 79, 80, 80, 90, 95, 98, 85, 92, 72)
prepost<-c(rep("pre",10),rep("post",10))
Both tests (Shapiro and Bartlett) are non-significant, so I can apply a paired t-test.
t.test(PESO2[prepost=="post"],PESO2[prepost=="pre"],paired=TRUE,alternative="less")
In this case as well, the diet is effective and results in a mean reduction of 8.1 kg.
To see if the weight loss is different between the two diets, I can approach it in various ways, one of which is to check if there are differences between the weight change (delta) values for the two diets. I will generate the two delta vectors for both diets (Pre-Post).
DELTA1<- PESO1[1:10]-PESO1[11:20]
DELTA2<- PESO2[1:10]-PESO2[11:20]
I check the normality and homoscedasticity of the Delta variable and then apply an unpaired t-test.
t.test(DELTA1,DELTA2)
Although the mean weight loss differs substantially between the two diets (4.3 kg versus 8.1 kg), the test is not significant, so we cannot reject the null hypothesis that the two diets produce the same weight loss.
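The checks on the delta values mentioned above are not shown in the solution. A possible sketch, assuming a Shapiro-Wilk test on each delta vector and Bartlett's test on the pair of vectors; note that t.test() applies the Welch correction by default, so it does not rely on equal variances:

```r
PESO1 <- c(80, 90, 96, 85, 102, 105, 99, 89, 98, 78,
           75, 79, 98, 82, 95, 102, 95, 80, 97, 76)
PESO2 <- c(90, 85, 96, 85, 102, 105, 99, 89, 98, 78,
           75, 79, 80, 80, 90, 95, 98, 85, 92, 72)

# Weight loss per subject (pre minus post) for each diet
DELTA1 <- PESO1[1:10] - PESO1[11:20]
DELTA2 <- PESO2[1:10] - PESO2[11:20]

# Normality of each delta vector
print(shapiro.test(DELTA1))
print(shapiro.test(DELTA2))

# Homogeneity of variances between the two delta vectors
print(bartlett.test(list(DELTA1, DELTA2)))

# Unpaired comparison of the two diets (Welch t-test by default)
res <- t.test(DELTA1, DELTA2)
print(res)
```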
Exercise 4
Some researchers suspect that a gene polymorphism of the BRCA2 gene may confer a risk of breast cancer. To test this hypothesis, a case-control study is conducted. 200 patients and 200 controls are genotyped, and the frequencies of the identified genotypes are as shown in the table.
Calculate if the frequency of the T allele is significantly higher and estimate the risk using the most appropriate estimation.
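I calculate the allele counts to reduce the data to a 2x2 table. Each subject carries two alleles, so a homozygote contributes two copies of its allele and a heterozygote one copy of each. The genotype counts used below are 150 AA, 32 AT, 18 TT in cases and 120 AA, 52 AT, 28 TT in controls.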
Allele_A_Casi<-150*2+32
Allele_T_Casi<-18*2+32
Allele_A_CTR<-120*2+52
Allele_T_CTR<-28*2+52
I construct the contingency table.
TAB<-matrix(c(Allele_A_Casi,Allele_T_Casi,Allele_A_CTR,Allele_T_CTR),2)
I apply Fisher's exact test, which also reports the odds ratio directly.
fisher.test(TAB)
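With the table built this way, the odds ratio refers to allele A: allele A is significantly associated with the disease, with OR = 1.8. The T allele is therefore not more frequent among cases; it is less frequent (17% of alleles in cases versus 27% in controls).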
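The direction of the odds ratio depends on how the table is laid out, so it can help to print the allele frequencies next to the test. A minimal sketch, assuming the genotype counts implied by the code above (the helper conta_alleli and the object names are illustrative):

```r
# Genotype counts implied by the solution code (AA, AT, TT)
casi <- c(AA = 150, AT = 32, TT = 18)
ctr  <- c(AA = 120, AT = 52, TT = 28)

# Allele counts: two copies per homozygote, one of each per heterozygote
conta_alleli <- function(g) {
  c(A = 2 * g[["AA"]] + g[["AT"]],
    T = 2 * g[["TT"]] + g[["AT"]])
}
TAB <- cbind(Casi = conta_alleli(casi), CTR = conta_alleli(ctr))
print(TAB)

# Frequency of the T allele in each group
freq_T <- TAB["T", ] / colSums(TAB)
print(freq_T)

res <- fisher.test(TAB)
print(res$estimate)       # odds ratio for allele A (cases versus controls)
print(1 / res$estimate)   # the same association expressed for allele T
```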
Exercise 5
A study aims to assess whether an anti-tumor drug can reduce tumor size. The drug is administered to three groups of guinea pigs: an untreated group (NO_F), a group treated with the experimental drug (F_SPE), and a group treated with the traditional drug (F_TRA). The results related to the dry weight of the biopsy in mg are as reported in the table.
NO_F<-c(200 ,189 ,201, 170, 145, 124, 150, 156, 158, 201 )
F_SPE<-c( 180, 150, 150, 120, 60, 120, 100, 91, 98, 180)
F_TRA <-c( 190, 120, 150, 110, 140, 110, 100, 90, 98, 150 )
To check if there is a difference between the experimental drug and the traditional one, I need to make a comparison between multiple groups.
ANOVA is the most suitable test, provided that the assumptions discussed earlier are met.
I create the treatment vector and its respective group vector.
TRATTAMENTI<-c(NO_F,F_SPE,F_TRA)
Gruppi<-c(rep("NO_F",length(NO_F)),rep("F_SPE",length(F_SPE)),rep("F_TRA",length(F_TRA)))
shapiro.test(TRATTAMENTI)
bartlett.test(TRATTAMENTI,Gruppi)
Neither test is significant, so I apply ANOVA.
model<-(aov(TRATTAMENTI~Gruppi))
summary(model)
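The ANOVA is significant, so at least one group mean differs from the others. The ANOVA table alone does not say which groups differ; pairwise post-hoc comparisons (for example with TukeyHSD(model)) show a significant difference only between the untreated group and each treated group, while the difference between the two treatments is not statistically significant.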
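The pairwise conclusion requires a post-hoc test after the ANOVA. A minimal sketch, assuming Tukey's honest significant difference as the post-hoc method (the solution does not state which one was used; the data frame dati and its column names are illustrative):

```r
NO_F  <- c(200, 189, 201, 170, 145, 124, 150, 156, 158, 201)
F_SPE <- c(180, 150, 150, 120, 60, 120, 100, 91, 98, 180)
F_TRA <- c(190, 120, 150, 110, 140, 110, 100, 90, 98, 150)

# Long format: one row per animal, with its group label
dati <- data.frame(
  peso   = c(NO_F, F_SPE, F_TRA),
  gruppo = factor(rep(c("NO_F", "F_SPE", "F_TRA"), each = 10))
)

# Overall test: do the three group means differ?
model <- aov(peso ~ gruppo, data = dati)
print(summary(model))

# Post-hoc pairwise comparisons with adjusted p-values (column "p adj")
tuk <- TukeyHSD(model)
print(tuk)
```

Each row of the Tukey output is one pair of groups, so the comparison between the two drugs can be read from the F_TRA-F_SPE row.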
Exercise 6
It is suspected that the lack of response to a drug is attributable to reduced expression of its cellular receptor. The drug is very expensive, and its administration would be futile in non-responders. A proteomic experiment is conducted to quantify protein levels in a group of responders and non-responders. The measured levels are in densitometric units.
In this case as well, we are dealing with a comparison between groups, and the t-test appears to be the most appropriate. So, I check if the conditions for its application exist.
Resp <-c(125, 127, 126, 123, 125, 136, 124, 122, 135, 140 )
Non_Resp <-c(100, 101, 105, 103, 99, 98, 105, 110, 115, 132)
gruppi<-c(rep("Resp",10),rep("Non_Resp",10) )
risposte<-c(Resp,Non_Resp)
Analysis of normality and homoscedasticity confirms that it is possible to apply a parametric test, so an independent t-test is chosen.
t.test(Resp,Non_Resp)
The test is significant and reveals mean levels of 128.3 densitometric units in the responders and 106.8 in the non-responders, which is consistent with reduced receptor expression in non-responders.
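The normality and homoscedasticity checks are mentioned but not shown. A possible sketch, assuming they follow the same pattern as in the earlier exercises (Shapiro-Wilk on the data vector, Bartlett on data and groups); the p-values should be inspected before relying on the parametric test:

```r
Resp     <- c(125, 127, 126, 123, 125, 136, 124, 122, 135, 140)
Non_Resp <- c(100, 101, 105, 103, 99, 98, 105, 110, 115, 132)
risposte <- c(Resp, Non_Resp)
gruppi   <- c(rep("Resp", 10), rep("Non_Resp", 10))

# Checks run as in the previous exercises
print(shapiro.test(risposte))
print(bartlett.test(risposte, gruppi))

# Independent two-sample t-test (Welch correction by default)
res <- t.test(Resp, Non_Resp)
print(res)

# Group means reported by the test
medie <- unname(res$estimate)
print(medie)
```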
Exercise 7
In a hospital, it is hypothesized that living conditions and stress levels can influence the response to a particular therapy. If that were the case, many of the therapy's outcomes could be improved by addressing living conditions. To this end, a correlation is sought between the quality of life (QOL), assessed using questionnaire scores (ranging from 1 to 100), and the therapy response (R2T), assessed using a clinical improvement score (ranging from 1 to 10). The results obtained from a sample of 10 subjects are as follows:
Check if there is a correlation.
Both variables are scores on an ordinal scale, so a non-parametric (rank-based) correlation test is the appropriate choice.
QOL <-c(10, 90, 20, 99, 35, 55 ,46, 80, 75, 66 )
R2T <-c(1, 8, 2, 9, 3, 5, 4, 7, 7, 6 )
cor.test(QOL,R2T,method="kendall")
The test is significant and Kendall's tau is close to 1: the two variables are strongly and positively correlated.
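R2T contains a tie (two scores of 7), so cor.test() cannot compute an exact p-value and issues a warning. A minimal sketch that requests the normal approximation explicitly and extracts the coefficient (the object names res and tau are illustrative):

```r
QOL <- c(10, 90, 20, 99, 35, 55, 46, 80, 75, 66)
R2T <- c(1, 8, 2, 9, 3, 5, 4, 7, 7, 6)

# exact = FALSE asks for the normal approximation, which is what R falls
# back to anyway when ties are present
res <- cor.test(QOL, R2T, method = "kendall", exact = FALSE)

# Kendall's tau: values near +1 mean the two rankings almost coincide
tau <- unname(res$estimate)
print(tau)
print(res$p.value)
```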
Exercise 8
Use the dataset CENTENARI_BIOCHEMISTRY and check if total cholesterol, smoking, and blood glucose represent risk factors for a heart attack. Calculate their effect on the risk of a heart attack.
I use the read.table function after saving the data in tab-delimited text format (.txt). I specify that the file has a header, and the symbol that separates the columns is a tabulation with sep="\t".
Note: If you import the data using a graphical interface, be careful with MISSING DATA. They should be indicated in the import options screen. Choose how missing data are represented; in the example, they are indicated as NA, but in other cases, they could be represented as empty cells.
DATA<- read.table("DIRECTORY/DATI/CENTENARI_BIOCHIMICA.txt",header=TRUE,sep="\t",na.strings=c("NA","")) # both NA and empty cells are read as missing
Note: In this example, an object named DATA has been created to contain the data. If you import the CENTENARI_BIOCHEMISTRY file using the graphical interface in R, an object named CENTENARI_BIOCHEMISTRY will be created instead of DATA. In that case, either modify the following commands accordingly or create a copy of CENTENARI_BIOCHEMISTRY named DATA.
First, check the type of variables with:
str(DATA)
Most of the variables are correctly imported as numeric or integer; the ones imported with the wrong type are Gruppo, uo, pid, FUMO, INFARTO, INSUFF_RENE, DIABETE, TEST_1_DIABETE, and TEST_2_DIABETE, which are categorical.
I convert them to factors. This is an important step because R will then treat them as categorical variables during the analysis.
DATA$Gruppo<-as.factor(DATA$Gruppo)
DATA$uo<-as.factor(DATA$uo)
DATA$pid<-as.factor(DATA$pid)
DATA$FUMO<-as.factor(DATA$FUMO)
DATA$INFARTO<-as.factor(DATA$INFARTO)
DATA$INSUFF_RENE<-as.factor(DATA$INSUFF_RENE)
DATA$DIABETE<-as.factor(DATA$DIABETE)
DATA$TEST_1_DIABETE<-as.factor(DATA$TEST_1_DIABETE)
DATA$TEST_2_DIABETE<-as.factor(DATA$TEST_2_DIABETE)
# check that everything is correct by running str() again
str(DATA)
# keep only the rows without missing values and check how many remain
DATI<-na.omit(DATA)
dim(DATI)
At this point, I will answer the exercise question using a multiple logistic regression model since the dependent variable Y is binary.
summary(glm(INFARTO~FUMO+COL_TOT+GLICEMIA,data=DATI,family=binomial))
From the multiple logistic regression model, it emerges that blood glucose and smoking are risk factors for a heart attack, while total cholesterol is not significantly associated.
The exponential of each beta coefficient is the corresponding odds ratio (OR), which quantifies the effect of that variable on the risk of a heart attack.
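The conversion from coefficients to odds ratios can be sketched as follows. The real dataset is not reproduced here, so the example runs on simulated data: the values and effect sizes are invented for illustration only, and only the column names follow the exercise. The confidence intervals are Wald intervals from confint.default(), chosen here as an assumption for simplicity.

```r
set.seed(1)
n <- 300

# Simulated stand-in for DATI (hypothetical values, not the real dataset)
DATI <- data.frame(
  FUMO     = factor(rbinom(n, 1, 0.3)),
  COL_TOT  = rnorm(n, mean = 200, sd = 30),
  GLICEMIA = rnorm(n, mean = 95, sd = 15)
)
lin <- -6 + 0.8 * (DATI$FUMO == "1") + 0.04 * DATI$GLICEMIA
DATI$INFARTO <- factor(rbinom(n, 1, plogis(lin)))

fit <- glm(INFARTO ~ FUMO + COL_TOT + GLICEMIA, data = DATI,
           family = binomial)

# Odds ratios: exponential of the coefficients
OR <- exp(coef(fit))

# Wald 95% confidence intervals, also moved to the odds-ratio scale
CI <- exp(confint.default(fit))

print(round(cbind(OR, CI), 3))
```

For a continuous predictor such as GLICEMIA, the odds ratio refers to a one-unit increase; for the factor FUMO, it compares the listed level with the reference level.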
Exercise 9
Use the dataset CENTENARI_BIOCHEMISTRY and check if total cholesterol, smoking, and blood glucose are associated with BMI (Body Mass Index). In particular, calculate their effect.
In this case, unlike the previous one, the dependent variable is continuous, so I use a multiple linear regression model.
summary(glm(BMI~FUMO+COL_TOT+GLICEMIA,data=DATI,family=gaussian))
Blood glucose emerges as the only variable significantly associated with BMI after adjusting for the other two variables. Its beta coefficient is the estimated change in BMI for a one-unit increase in blood glucose.
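A Gaussian glm() with the default identity link fits the same model as lm(), which is the more common way to write a linear regression in R. A minimal sketch on simulated data (the values are invented for illustration and are not the real dataset; only the column names follow the exercise):

```r
set.seed(2)
n <- 200

# Simulated stand-in for DATI (hypothetical values, not the real dataset)
DATI <- data.frame(
  FUMO     = factor(rbinom(n, 1, 0.3)),
  COL_TOT  = rnorm(n, mean = 200, sd = 30),
  GLICEMIA = rnorm(n, mean = 95, sd = 15)
)
DATI$BMI <- 18 + 0.08 * DATI$GLICEMIA + rnorm(n, sd = 3)

fit_lm  <- lm(BMI ~ FUMO + COL_TOT + GLICEMIA, data = DATI)
fit_glm <- glm(BMI ~ FUMO + COL_TOT + GLICEMIA, data = DATI,
               family = gaussian)

# Coefficient table: estimate, standard error, t value, p-value
print(summary(fit_lm)$coefficients)

# The two fits give the same coefficients
same_coef <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
print(same_coef)
```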