CHI-SQUARE TEST WITH R

EXAMPLE 1

We want to check the effect of two toxic substances on two groups of animals:

Agent A, administered to 70 animals, caused the death of 22 individuals (48 survived).
Agent B, administered to 50 animals, caused the death of 24 individuals (26 survived).

Do the two substances have different effects on mortality (H1), or should the observed differences be considered random (H0)?

ANSWER USING R: the observed frequencies are arranged in a two-way table

We begin by constructing the contingency table in R. To do this, we create a vector containing the data and then create a matrix from the vector as follows:

# matrix() fills column by column: deaths (A, B), then survivors (A, B)
VEC<-c(22,24,48,26)
TAB<-matrix(VEC,ncol=2)
colnames(TAB)<-c("Deaths","Survived")
row.names(TAB)<-c("AgentA","AgentB")
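Because matrix() fills its cells column by column, the order of the values in the vector matters. As a minimal sketch (the object names tab_by_col, tab_by_row and labels are illustrative and not part of the original example), the same table can also be entered row by row with byrow=TRUE, which may be easier to check against the problem statement:

```r
# Column-wise entry (as in the text): deaths for A and B, then survivors for A and B
tab_by_col <- matrix(c(22, 24, 48, 26), ncol = 2)

# Row-wise entry: Agent A (deaths, survived), then Agent B (deaths, survived)
tab_by_row <- matrix(c(22, 48, 24, 26), ncol = 2, byrow = TRUE)

# Label both tables the same way
labels <- list(c("AgentA", "AgentB"), c("Deaths", "Survived"))
dimnames(tab_by_col) <- labels
dimnames(tab_by_row) <- labels

identical(tab_by_col, tab_by_row)  # do both entry orders give the same table?
rowSums(tab_by_col)                # group sizes, to compare with 70 and 50 animals
```

Checking the row totals against the known group sizes is a quick way to catch a vector entered in the wrong order.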

Once we have the object TAB, we apply the chisq.test() function to calculate the Chi-Square test:

RESULT<-chisq.test(TAB)

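The result gives an X-squared value of 2.7235 with 1 degree of freedom and a p-value of approximately 0.099, so the difference is not significant at the 5% level. For 2 x 2 tables, chisq.test() applies Yates' continuity correction by default.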

If we don't want Yates' correction, we can write:

RESULT<-chisq.test(TAB,correct=FALSE)
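To make the role of Yates' correction concrete, the two statistics can be recomputed by hand from the observed and expected counts. This is a minimal sketch on the Example 1 data; the names tab, expected, x2_plain and x2_yates are illustrative, and the simple "subtract 0.5" form of the correction assumes every absolute difference between observed and expected counts exceeds 0.5, which holds for this table.

```r
# Observed counts from Example 1 (rows: agents, columns: outcome)
tab <- matrix(c(22, 24, 48, 26), ncol = 2,
              dimnames = list(c("AgentA", "AgentB"), c("Deaths", "Survived")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic without and with Yates' continuity correction
x2_plain <- sum((tab - expected)^2 / expected)
x2_yates <- sum((abs(tab - expected) - 0.5)^2 / expected)

# Compare the hand computation with chisq.test()
all.equal(x2_plain, unname(chisq.test(tab, correct = FALSE)$statistic))
all.equal(x2_yates, unname(chisq.test(tab, correct = TRUE)$statistic))

# p-values from the chi-square distribution with 1 degree of freedom
pchisq(c(plain = x2_plain, yates = x2_yates), df = 1, lower.tail = FALSE)
```

The correction shrinks each deviation before squaring, so the corrected statistic is always the smaller (more conservative) of the two.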

The object RESULT is a list that contains further elements of the test. We can list them using:

summary(RESULT)

We can extract other information besides the p-value if needed:

RESULT$statistic # the Chi-Square value
RESULT$parameter # the degrees of freedom
RESULT$p.value # the p-value
RESULT$observed # the observed counts
RESULT$expected # the counts expected under independence
RESULT$residuals # the Pearson residuals
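As a small illustration of how these components relate to one another, the Pearson residuals can be recomputed from the observed and expected counts. The sketch below reuses the Example 1 data; the names tab, res and pearson are illustrative.

```r
tab <- matrix(c(22, 24, 48, 26), ncol = 2,
              dimnames = list(c("AgentA", "AgentB"), c("Deaths", "Survived")))
res <- chisq.test(tab, correct = FALSE)

names(res)  # component names of the "htest" object

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (res$observed - res$expected) / sqrt(res$expected)
all.equal(pearson, res$residuals)

# Without Yates' correction, the squared residuals sum to the test statistic
sum(pearson^2)
res$statistic
```

Cells with large absolute residuals are the ones that contribute most to the departure from independence.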

EXAMPLE 2

In preliminary market research before the launch of a new daily newspaper, 2000 people over 18 were asked whether or not they buy a newspaper every day. To understand the characteristics of the respondents, their educational qualification was also recorded. The following distribution was obtained (counts as entered in the R code for this example):

Qualification    Yes     No   Total
None              10    190     200
Elementary        90    310     400
Middle           150    650     800
HighSchool       230    220     450
Degree           120     30     150
Total            600   1400    2000

Can we hypothesize an association between educational qualification and the choice to buy a newspaper every day? Check using an appropriate test at the 1% significance level.

Again, we construct the table:

# First the five "Yes" counts, then the five "No" counts (column-wise fill)
VEC<-c(10,90,150,230,120,190,310,650,220,30)
TAB<-matrix(VEC,ncol=2)
colnames(TAB)<-c("Yes","No")
row.names(TAB)<-c("None","Elementary","Middle","HighSchool","Degree")

Apply the Chi-Square test:

RESULT<-chisq.test(TAB)

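The result is highly significant: X-squared is about 392.5 with 4 degrees of freedom, and the p-value is far below the 1% threshold. We therefore reject the hypothesis that educational qualification and daily newspaper purchase are independent.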
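The decision at the 1% level can be made explicit either by comparing the p-value with alpha or by comparing the statistic with the critical value of the chi-square distribution. The following sketch does both on the Example 2 data; the names tab, res, alpha, df and critical are illustrative.

```r
tab <- matrix(c(10, 90, 150, 230, 120, 190, 310, 650, 220, 30), ncol = 2,
              dimnames = list(c("None", "Elementary", "Middle", "HighSchool", "Degree"),
                              c("Yes", "No")))
res <- chisq.test(tab)

alpha <- 0.01
df <- (nrow(tab) - 1) * (ncol(tab) - 1)   # (rows - 1) * (columns - 1)
critical <- qchisq(1 - alpha, df = df)    # critical value at the 1% level

res$statistic > critical   # TRUE means: reject independence
res$p.value < alpha        # equivalent decision based on the p-value

# Share of daily buyers within each qualification, to read the direction of the effect
round(prop.table(tab, margin = 1)[, "Yes"], 2)
```

The row proportions are a useful complement to the test, since the chi-square statistic says that an association exists but not what it looks like.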

EXAMPLE 3

Does Drug A have an effect on the number of recovered patients?

Because the sample is small (15 patients in total), we solve this by applying Fisher's exact test.

We follow the same procedure to create the contingency table:

VEC<-c(3,4,6,2) 
TAB<-matrix(VEC,ncol=2) 
colnames(TAB)<-c("DrugA","DrugB") 
row.names(TAB)<-c("Recovered","NotRecovered")

We apply the Fisher test with the function fisher.test():

RESULT<-fisher.test(TAB)

We then evaluate the result, which is not significant (p = 0.3147).

The test also indicates the odds ratio, which is 0.27, with a confidence interval between 0.015 and 3.31.

In this case, we can observe that the confidence interval includes the value predicted by the null hypothesis (OR = 1).
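The quantities discussed above can be read directly from the object returned by fisher.test(). A minimal sketch on the Example 3 data follows; the names tab, res and ci are illustrative.

```r
tab <- matrix(c(3, 4, 6, 2), ncol = 2,
              dimnames = list(c("Recovered", "NotRecovered"), c("DrugA", "DrugB")))
res <- fisher.test(tab)

res$p.value    # two-sided exact p-value
res$estimate   # conditional maximum likelihood estimate of the odds ratio
res$conf.int   # 95% confidence interval for the odds ratio

# Does the interval contain the null value OR = 1?
ci <- res$conf.int
ci[1] < 1 && 1 < ci[2]

# Simple cross-product odds ratio, which can differ slightly from res$estimate
(tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
```

Note that fisher.test() reports a conditional estimate of the odds ratio, so it need not coincide exactly with the cross-product ratio computed from the four cells.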

EXAMPLE 4

Load the dataset CENTENARI_BIOCHIMICA and evaluate whether the variable INFARTO (Heart Attack) depends on the variable FUMO (Smoking).

We load the dataset using the read.table function after saving the data as a tab-delimited text file (.txt). We specify that the file has a header and that the columns are separated by tabs (sep="\t").

Note: If you import the data using the graphical interface, pay attention to MISSING DATA: you must declare how they are coded in the import options. In this example they are coded as NA, but in other files they may be empty cells. The na.strings argument below treats both codings as missing.

DATA<-read.table("DIRECTORY/DATI/CENTENARI_BIOCHIMICA.txt",header=TRUE,sep="\t",na.strings=c("NA",""))
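Since the real file is not reproduced here, the effect of the missing-data setting can be tried on a tiny stand-in. The sketch below writes a hypothetical tab-delimited file to a temporary location (the file contents and the names tmp and demo are invented for illustration) and reads it back while treating both NA and empty cells as missing.

```r
# Hypothetical file: a header, one cell coded "NA" and one empty cell
tmp <- tempfile(fileext = ".txt")
writeLines(c("INFARTO\tFUMO",
             "0\t1",
             "1\tNA",
             "\t0",
             "1\t1"), tmp)

# Treat both "NA" and empty cells as missing values
demo <- read.table(tmp, header = TRUE, sep = "\t", na.strings = c("NA", ""))

str(demo)              # column types after import
colSums(is.na(demo))   # number of missing values per column

unlink(tmp)            # remove the temporary file
```

Counting the missing values per column right after the import is a cheap check that the coding was declared correctly.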

Assign the correct data types to the variables ("factor" and "numeric"). In this case, we limit ourselves to the two variables of interest: INFARTO and FUMO.

DATA$INFARTO<-as.factor(DATA$INFARTO)
DATA$FUMO<-as.factor(DATA$FUMO)

The data for the two variables are coded as 0 and 1, which could make the table hard to interpret. For this reason, we recode the two variables:

levels(DATA$INFARTO)<-c("HEALTHY","HEART_ATTACK") 
levels(DATA$FUMO)<-c("NON_SMOKER","SMOKER")

We build the contingency table:

TAB<-table(DATA$FUMO,DATA$INFARTO)
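Assigning levels() relies on the existing levels being in the order 0, 1. A slightly more defensive variant, sketched here on hypothetical 0/1 vectors (the vectors infarto and fumo and the derived names are invented for illustration and are not the real data), states the mapping from code to label explicitly with factor():

```r
# Hypothetical 0/1 codes, for illustration only (NOT the real dataset)
infarto <- c(0, 1, 0, 0, 1, 1, 0, NA)
fumo    <- c(0, 1, 1, 0, 1, 1, 0, 1)

# Explicit mapping: code 0 -> first label, code 1 -> second label
infarto_f <- factor(infarto, levels = c(0, 1), labels = c("HEALTHY", "HEART_ATTACK"))
fumo_f    <- factor(fumo,    levels = c(0, 1), labels = c("NON_SMOKER", "SMOKER"))

# Named arguments give the table dimensions readable headings;
# by default, rows with a missing value are left out of the counts
tab <- table(FUMO = fumo_f, INFARTO = infarto_f)
tab
sum(tab)   # number of complete pairs actually tabulated
```

Comparing sum(tab) with the number of rows in the data shows how many observations were dropped because of missing values.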

We apply Fisher's exact test because we want to estimate the odds ratio:

RESULT<-fisher.test(TAB)

INTERPRETATION OF THE RESULT

The test result is significant. The variables FUMO and INFARTO are not independent.

In particular, the level HEART_ATTACK was found to be associated with the level SMOKER. The estimated ODDS RATIO is 3.11, with a confidence interval ranging from 1.8 to 5.4; since the interval does not include 1, the association is significant. This means that the odds of a heart attack for smokers (exposed) are about 3.11 times the odds for non-smokers (non-exposed).
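To see what "odds" and "odds ratio" mean in terms of the cells of the table, the ratio can be computed step by step. The counts below are hypothetical and chosen only for illustration (they are NOT the CENTENARI_BIOCHIMICA data, so the numbers will not match those reported above); the object names are likewise illustrative.

```r
# Hypothetical counts, for illustration only (NOT the real dataset)
tab <- matrix(c(80, 50, 20, 40), ncol = 2,
              dimnames = list(FUMO = c("NON_SMOKER", "SMOKER"),
                              INFARTO = c("HEALTHY", "HEART_ATTACK")))

# Odds of heart attack within each smoking group
odds_smoker     <- tab["SMOKER", "HEART_ATTACK"] / tab["SMOKER", "HEALTHY"]
odds_non_smoker <- tab["NON_SMOKER", "HEART_ATTACK"] / tab["NON_SMOKER", "HEALTHY"]

# Sample odds ratio: odds for the exposed divided by odds for the non-exposed
or_sample <- odds_smoker / odds_non_smoker

res <- fisher.test(tab)
or_sample      # cross-product odds ratio
res$estimate   # conditional estimate reported by fisher.test()
res$conf.int   # an interval entirely above 1 indicates a significant positive association
```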

EXAMPLE 5

Again, using the CENTENARI_BIOCHIMICA dataset, test the dependency between FUMO (Smoking) and INFARTO (Heart Attack), but stratify by GROUP (the column Gruppo in the dataset). There is a suspicion that GROUP, which also describes age, is a potential confounder, as it is theoretically associated with both heart attack and smoking.

Optionally, install the DescTools package, which offers related tools such as BreslowDayTest() for checking the homogeneity of the odds ratios across strata:

install.packages("DescTools")
library(DescTools)

The function of interest is mantelhaen.test(), which allows us to apply the Cochran–Mantel–Haenszel test. It belongs to the stats package distributed with R, so it is available even without DescTools.

This test requires a three-way contingency table (outcome by exposure by stratum), which is very simple to obtain with the table() function:

TAB<-table(DATA$INFARTO,DATA$FUMO,DATA$Gruppo)

We apply the test:

mantelhaen.test(TAB)

The result gives us the association between HEART_ATTACK and SMOKER, stratified by the levels of the variable GROUP.
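Because the dataset itself is not reproduced here, the stratified analysis can be tried on a small hypothetical 2 x 2 x 2 table. In the sketch below the counts, the stratum labels G1 and G2, and the object names are all invented for illustration. It also recomputes the Mantel-Haenszel common odds ratio by hand to show what the reported estimate summarizes.

```r
# Hypothetical 2 x 2 x 2 counts, for illustration only (NOT the real dataset)
tab3 <- array(c(30, 10, 20, 25,    # stratum G1
                15, 12,  8, 20),   # stratum G2
              dim = c(2, 2, 2),
              dimnames = list(INFARTO = c("HEALTHY", "HEART_ATTACK"),
                              FUMO    = c("NON_SMOKER", "SMOKER"),
                              Gruppo  = c("G1", "G2")))

res <- mantelhaen.test(tab3)

# Mantel-Haenszel common odds ratio recomputed by hand:
# sum over strata of (a * d / n) divided by sum over strata of (b * c / n)
n_k   <- apply(tab3, 3, sum)
mh_or <- sum(tab3[1, 1, ] * tab3[2, 2, ] / n_k) /
         sum(tab3[1, 2, ] * tab3[2, 1, ] / n_k)

res$estimate   # common odds ratio reported by the test
mh_or          # same quantity from the formula
res$p.value    # test of conditional independence given the stratum
```

The common odds ratio is only a sensible summary when the stratum-specific odds ratios are reasonably similar, which is what a homogeneity check is meant to assess.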
