
foRum

Improve your Knowledge in R

New Posts

MARIA PIA CALABRIA
Oct 26, 2023
Mixed exercises on the alcoholism dataset
Exercises with R
1. Load the alcoholism dataset.

    alcolismo <- read.csv2("/Users/rebeccacavagnola/Desktop/tutor/alcolismo.csv", sep = ";")

2. Check the dataset and assign the correct class to the variables if necessary.

    str(alcolismo)
    # 'drinks' was wrongly read as character and 'sesso' as numeric:
    # convert 'drinks' to numeric and 'sesso' to a factor.
    alcolismo$drinks <- as.numeric(alcolismo$drinks)
    alcolismo$sesso <- as.factor(alcolismo$sesso)

3. Check for missing data and impute if necessary.

    summary(alcolismo)
    table(is.na(alcolismo))
    # There are no missing data

4. How many men (sesso = 2) drink more than 2 pints per day?

    uomini_pinte <- subset(alcolismo, sesso == "2" & drinks > 2)
    dim(uomini_pinte)
    # The first value returned by dim() is the number of rows, i.e. the number of men

5. Create a new variable called RangeDrinks from the drinks variable, grouping its values into the ranges 0-4, 5-8 and 9-20 and naming them "low", "medium" and "alcoholism" respectively.

    alcolismo$RangeDrinks <- cut(alcolismo$drinks, breaks = c(0, 4, 8, 20), labels = c("0-4", "5-8", "9-20"), include.lowest = TRUE)
    levels(alcolismo$RangeDrinks) <- c("low", "medium", "alcoholism")

6. Calculate the mean of aspartate and represent the median values of alanine aminotransferase for the newly created categories.

    # Mean
    mean(alcolismo$aspartato.amminotrasferasi)
    # 24.64

    # To represent the median values, use a boxplot
    boxplot(alcolismo$aspartato.amminotrasferasi ~ alcolismo$RangeDrinks, main = "Aspartate Levels in the Three RangeDrinks Categories", xlab = "Groups", ylab = "Aspartate", ylim = c(0, 50))

    # A prettier boxplot
    library(ggplot2)
    library(ggpubr)
    ggboxplot(alcolismo, x = "RangeDrinks", y = "aspartato.amminotrasferasi", color = "RangeDrinks", palette = "jco", add = "jitter", ylim = c(0, 50)) + ggtitle("Aspartate levels in the three RangeDrinks classes")

7. Verify with the most appropriate test (justifying the choice) whether there are significant differences in aspartate aminotransferase levels between the three RangeDrinks groups.

ANOVA is the most suitable test, provided that the assumptions of normality and homoscedasticity are met. Let's check for normality using the Shapiro test:

    shapiro.test(alcolismo$aspartato.amminotrasferasi)

Since the variable is not normally distributed (p-value < 0.05: the null hypothesis of normality is rejected), I apply a non-parametric test, specifically the Kruskal-Wallis test (the non-parametric counterpart of ANOVA). I could also check for homoscedasticity, but it is not necessary because I already know that the distribution is not normal:

    bartlett.test(aspartato.amminotrasferasi ~ RangeDrinks, data = alcolismo)

    # I apply the non-parametric Kruskal-Wallis test:
    kruskal.test(aspartato.amminotrasferasi ~ RangeDrinks, data = alcolismo)

The p-value is < 0.05, so there are significant differences between the three RangeDrinks groups: low, medium, and alcoholism. Let's determine which groups differ by applying the post-hoc Dunn's test.

    library(FSA)
    dunnTest(aspartato.amminotrasferasi ~ RangeDrinks, data = alcolismo, method = "bonferroni")

8. Represent the corpuscular volume in a graph for the three groups.

    boxplot(alcolismo$volume.corpuscolare ~ alcolismo$RangeDrinks, main = "Corpuscular volume", xlab = "Groups", ylab = "Corpuscular volume", ylim = c(80, 110))

    # A prettier boxplot
    library(ggplot2)
    library(ggpubr)
    ggboxplot(alcolismo, x = "RangeDrinks", y = "volume.corpuscolare", color = "RangeDrinks", palette = "jco", add = "jitter", ylim = c(80, 110)) + ggtitle("Corpuscular volume in the three groups")

9. Evaluate the relationship between corpuscular volume and the number of pints consumed per day.

    summary(modello <- glm(volume.corpuscolare ~ drinks, data = alcolismo, family = "gaussian"))

p-value = 2.92e-09: SIGNIFICANT. There is a significant relationship between corpuscular volume and the number of pints consumed per day.
B0: 88.72 (the value of Y when X = 0)
B1: 0.42 (the change in Y for a one-unit increase in X, i.e. for one additional pint per day)

10. Evaluate the change in corpuscular volume among the three RangeDrinks groups using a linear regression model.

    summary(modello <- glm(volume.corpuscolare ~ RangeDrinks, data = alcolismo, family = "gaussian"))

• 89.15: the corpuscular volume for RangeDrinks low, the reference group (RangeDrinks medium = 0 and RangeDrinks alcoholism = 0).
• 3.3024: the difference in corpuscular volume between RangeDrinks low and RangeDrinks medium (p-value < 0.05, so the difference is significantly different from 0).
• 3.3738: the difference in corpuscular volume between RangeDrinks low and RangeDrinks alcoholism (p-value < 0.05, so the difference is significantly different from 0).

11. Evaluate the association between corpuscular volume and aspartate aminotransferase and represent the association with the appropriate graph.

    # First, I check for normality with the Shapiro test:
    shapiro.test(alcolismo$aspartato.amminotrasferasi)
    shapiro.test(alcolismo$volume.corpuscolare)
    # p-value < 0.05: the null hypothesis of normality is rejected, so I use the Spearman method
    cor.test(alcolismo$aspartato.amminotrasferasi, alcolismo$volume.corpuscolare, method = "spearman")
    # p-value = 0.049 < 0.05 (barely): there is a statistically significant relationship between aspartate and corpuscular volume
    plot(alcolismo$aspartato.amminotrasferasi, alcolismo$volume.corpuscolare)
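To see how cut() in step 5 treats the boundary values, here is a minimal sketch on a handful of made-up values, not on the alcolismo dataset. The names drinks, range_drinks and binned are hypothetical, and the labels are passed directly to cut() instead of renaming the levels afterwards.

```r
# Hypothetical daily-pint values, not taken from the alcolismo dataset
drinks <- c(0, 0.5, 4, 4.5, 6, 8, 9, 12, 20)

range_drinks <- cut(
  drinks,
  breaks = c(0, 4, 8, 20),
  labels = c("low", "medium", "alcoholism"),
  include.lowest = TRUE  # without this, a value of exactly 0 would become NA
)

# With the default right = TRUE the intervals are [0,4], (4,8] and (8,20]
binned <- data.frame(drinks, range_drinks)
print(binned)
print(table(range_drinks))
```

With these breaks a value of exactly 4 falls in "low" and a value of exactly 8 in "medium", so a non-integer value such as 4.5 is classed as "medium" even though the exercise writes the ranges as 0-4 and 5-8.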
MARIA PIA CALABRIA
Oct 26, 2023
EXAM TEST 3 PROGRESS
Exercises with R
1- Import the 'CENTENARI e BIOCHIMICA' database. Remove samples with missing data, save a new dataset called 'DATI', and reclassify the factor variables that were erroneously interpreted as numerical.

I import the dataset and save a new dataset 'DATI' without missing data.

    DATI <- na.omit(CENTENARI_BIOCHIMICA)

I reclassify the categorical variables that R erroneously interpreted as numerical.

    DATI$uo <- as.factor(DATI$uo)
    DATI$FUMO <- as.factor(DATI$FUMO)
    DATI$INFARTO <- as.factor(DATI$INFARTO)
    DATI$INSUFF_RENE <- as.factor(DATI$INSUFF_RENE)
    DATI$DIABETE <- as.factor(DATI$DIABETE)
    DATI$TEST_1_DIABETE <- as.factor(DATI$TEST_1_DIABETE)
    DATI$TEST_2_DIABETE <- as.factor(DATI$TEST_2_DIABETE)

2- How many centenarian subjects are there with a BMI<25?

    CENTENARI <- subset(DATI, DATI$Gruppo == "CENT" & DATI$BMI < 25)
    dim(CENTENARI)

dim(CENTENARI) returns 49 and 36: 49 is the number of rows and 36 is the number of columns. The number of rows indicates how many centenarian subjects in the dataset have a BMI<25. You can also obtain the same information with the following command:

    table(DATI$Gruppo, DATI$BMI < 25)

There are 49 centenarians with a BMI<25.

3- Display with a graph whether the distribution of the GLICEMIA variable is different between diabetics and non-diabetics.

    boxplot(DATI$GLICEMIA ~ DATI$DIABETE)

4- Display with a graph the percentages of subjects belonging to the various operational units.

    percentuali <- round(table(DATI$uo) / sum(table(DATI$uo)) * 100)
    labels <- paste(levels(DATI$uo), percentuali, "%", sep = "_")
    pie(table(DATI$uo), labels)

5- Display a graph depicting the relationship between GLICEMIA and HOMA-IR.

    scatter.smooth(DATI$GLICEMIA, DATI$`HOMA-IR`)

6- Is the number of subjects who have had a heart attack significantly different between smokers and non-smokers? Choose the most appropriate test and indicate the p-value.

    chisq.test(table(DATI$INFARTO, DATI$FUMO))

        Pearson's Chi-squared test with Yates' continuity correction

    data:  table(DATI$INFARTO, DATI$FUMO)
    X-squared = 16.056, df = 1, p-value = 6.15e-05

The number of subjects who have had a heart attack is significantly different between smokers and non-smokers, as the p-value is <0.05.

7- Is there a difference in the values of Total Cholesterol (COL_TOT) between subjects with and without renal insufficiency (INSUFF_RENE)? Choose the test you deem appropriate and interpret the result.

I check for normality and homoscedasticity with the following tests:

    shapiro.test(DATI$COL_TOT)

        Shapiro-Wilk normality test

    data:  DATI$COL_TOT
    W = 0.99263, p-value = 0.05678

    bartlett.test(DATI$COL_TOT ~ DATI$INSUFF_RENE)

        Bartlett test of homogeneity of variances

    data:  DATI$COL_TOT by DATI$INSUFF_RENE
    Bartlett's K-squared = 1.2554, df = 1, p-value = 0.2625

Both tests have a p-value >0.05, so I can apply a parametric test. In this specific case, I have a binary categorical variable and a continuous one, so I can apply the t-test.

    t.test(DATI$COL_TOT ~ DATI$INSUFF_RENE, paired = FALSE)

        Welch Two Sample t-test

    data:  DATI$COL_TOT by DATI$INSUFF_RENE
    t = 1.256, df = 8.2289, p-value = 0.2436
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -17.22163  58.86178
    sample estimates:
    mean in group 0 mean in group 1
           200.5979        179.7778

There is no significant difference in the values of Total Cholesterol (COL_TOT) between subjects with and without renal insufficiency (p-value = 0.2436, which is >0.05).

8- Evaluate the relationship between GLICEMIA and BMI in a regression model. Assess if the relationship is significant and interpret the result by explaining what the beta value and intercept indicate.

    summary(glm(DATI$BMI ~ DATI$GLICEMIA, family = gaussian))

    Call:
    glm(formula = DATI$BMI ~ DATI$GLICEMIA, family = gaussian)

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -13.0860   -2.7545   -0.2448    2.3224   22.7330

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)   22.047938   0.758165  29.081  < 2e-16 ***
    DATI$GLICEMIA  0.046394   0.007826   5.928 6.89e-09 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for gaussian family taken to be 20.28338)

        Null deviance: 8420.4  on 381  degrees of freedom
    Residual deviance: 7707.7  on 380  degrees of freedom
    AIC: 2237.8

    Number of Fisher Scoring iterations: 2

The relationship between GLICEMIA and BMI is significant, as the p-value is <0.05. The intercept indicates the value of Y when X is zero: in this specific case, BMI has a value of 22.05 when blood glucose is 0. The beta value indicates how much Y changes for a one-unit increase in X: in this specific case, BMI increases by about 0.05 units (0.046) for a one-unit increase in blood glucose.

9- Evaluate the relationship between INFARTO = Y and GLICEMIA = X in a regression model. Assess if the relationship is significant and interpret the result by explaining what the beta value and intercept indicate.

    summary(glm(DATI$INFARTO ~ DATI$GLICEMIA, family = binomial))

    Call:
    glm(formula = DATI$INFARTO ~ DATI$GLICEMIA, family = binomial)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.6856  -0.5646  -0.5369  -0.5026   2.0879

    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -2.776554   0.419783  -6.614 3.73e-11 ***
    DATI$GLICEMIA  0.010860   0.004036   2.691  0.00712 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 321.91  on 381  degrees of freedom
    Residual deviance: 314.75  on 380  degrees of freedom
    AIC: 318.75

    Number of Fisher Scoring iterations: 4

The relationship between heart attack and blood glucose is significant (p-value = 0.007). In a logistic model the coefficients are on the log-odds scale: the intercept, -2.7765, is the log-odds of a heart attack when blood glucose is zero. The beta, 0.01, is the increase in the log-odds of a heart attack for each one-unit increase in blood glucose. Taking the exponential of the beta with the 'exp()' command gives the Odds Ratio: exp(0.01086) is about 1.011, so the odds of a heart attack increase by about 1% for each one-unit increase in blood glucose.

10- Singh et al. studied immune abnormalities in autistic children. Serum antigen concentrations (in units/ml) are reported for three groups of children under the age of 10: autistic, normal, and delayed. Are there differences between these three groups of children in terms of antigen concentration?

    autistici <- c(755,383,380,215,400,343,415,360,345,450,410,435,460,360,225,900,365,440,820,400,170,300,325,345,230,370,285,315,195,270,305,375,220)
    normali <- c(165,390,290,435,235,320,330,205,375,345,305,220,270,355,360,335,305,360,335,305,325,245,285,370,345)
    ritardati <- c(380,510,315,565,715,380,390,245,155,335,295,200,105,105,245)

I create a vector with all the values of serum concentrations.

    concentrazionesierica <- c(autistici, normali, ritardati)

I create a vector that contains the categories corresponding to the values of serum concentrations.

    GRUPPI <- c(rep("AUT", length(autistici)), rep("NORM", length(normali)), rep("RIT", length(ritardati)))

Since R treats the variable 'GRUPPI' as a character, I reclassify it as a factor.

    GRUPPI <- as.factor(GRUPPI)

I check for normality.

    shapiro.test(concentrazionesierica)

        Shapiro-Wilk normality test

    data:  concentrazionesierica
    W = 0.84468, p-value = 2.934e-07

Since the distribution is not normal, I apply a non-parametric test, specifically the Kruskal-Wallis test (the non-parametric counterpart of ANOVA). The continuous variable goes on the left of the formula as Y and the categorical variable on the right as X.

    kruskal.test(concentrazionesierica ~ GRUPPI)

        Kruskal-Wallis rank sum test

    data:  concentrazionesierica by GRUPPI
    Kruskal-Wallis chi-squared = 3.808, df = 2, p-value = 0.149

The p-value is >0.05, so there are no significant differences between the three groups of autistic, normal, and delayed children.
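For the logistic model in exercise 9, the coefficients are easier to read once they are moved off the log-odds scale. A minimal sketch, using only the two estimates printed in the output above; the glucose values in glicemia are made up for illustration and are not taken from DATI.

```r
# Coefficients copied from the binomial glm output in exercise 9
b0 <- -2.776554  # intercept, on the log-odds scale
b1 <- 0.010860   # slope for GLICEMIA, on the log-odds scale

# Odds ratio for a one-unit increase in blood glucose
odds_ratio <- exp(b1)
print(round(odds_ratio, 3))

# Predicted probability of a heart attack at some hypothetical glucose values
glicemia <- c(80, 100, 101, 180)
prob <- plogis(b0 + b1 * glicemia)  # plogis() is the inverse logit
print(round(prob, 3))

# The odds at 101 divided by the odds at 100 reproduce the odds ratio
odds <- prob / (1 - prob)
print(odds[3] / odds[2])
```

The last line shows why exp(beta) is read as an odds ratio: moving from 100 to 101 multiplies the odds by the same factor regardless of the starting value, while the change in probability depends on where you start.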
MARIA PIA CALABRIA
Oct 16, 2023
Solution to mixed exercises using the 'hepatitis' dataset
Exercises with R
1. Load the dataset hepatitis.csv.

    epatite <- read.csv2("/Users/Davide/Desktop/tutor/Hepatits/hepatitis.csv", sep = ";", na.strings = "?")
    dim(epatite)

2. Inspect the dataset and assign the correct class to the variables if necessary.

    str(epatite)
    epatite[, c(1, 3:12)] <- lapply(epatite[, c(1, 3:12)], as.factor)
    str(epatite)
    # The categorical variables that were wrongly read as int or character are now factors;
    # any variable wrongly read as something other than numeric should likewise be converted to numeric.

3. Perform missing data analysis: assess where it is necessary to remove missing data and where data imputation is required.

    # Let's check for missing data
    table(is.na(epatite))
    # There are 122 missing data points. Let's look at their distribution to decide how to handle them.
    library(VIM)
    aggr_plot <- aggr(epatite, col = c('navyblue', 'red'), numbers = TRUE, sortVars = TRUE, labels = names(epatite), cex.axis = 0.7, gap = 3, ylab = c("Histogram of missing data", "Pattern"))
    # The variable most affected by missing data is 'urea'.
    # We can remove this variable and impute the other missing data with the mean:
    epatite1 <- epatite[, -17]
    table(is.na(epatite1))
    # Now we have 55 missing data points in these variables: bilirubin, alk.phosphate, aspartate.transaminase, and albumin.
    # Impute them with the mean:
    epatite1$bilirubin[is.na(epatite1$bilirubin)] <- mean(epatite1$bilirubin, na.rm = TRUE)
    epatite1$alk.phosphate[is.na(epatite1$alk.phosphate)] <- mean(epatite1$alk.phosphate, na.rm = TRUE)
    epatite1$aspartate.transaminase[is.na(epatite1$aspartate.transaminase)] <- mean(epatite1$aspartate.transaminase, na.rm = TRUE)
    epatite1$albumin[is.na(epatite1$albumin)] <- mean(epatite1$albumin, na.rm = TRUE)
    table(is.na(epatite1))
    # No missing data remains

4. How many women under the age of 38 use steroids? (sex=2, steroids=2)

    donne_steroidi <- subset(epatite1, sex == "2" & age < 38 & steroids == "2")
    dim(donne_steroidi)
    # 3 women

5. Reclassify the 'CLASS' variable using the levels() function (2: dead, 1: alive).

    levels(epatite1$CLASS)
    # Assign "alive" to 1 and "dead" to 2
    levels(epatite1$CLASS) <- c("alive", "dead")

6. Calculate the mean, mode, and median for the numerical variables in the dataset, and calculate the number of subjects in the following categorical variables: sex, steroids, fatigue, malaise.

    # Numerical variables:
    summary(epatite1[, c(2, 13:16)])
    # Categorical variables:
    table(epatite1$sex)
    table(epatite1$steroids)
    table(epatite1$fatigue)
    table(epatite1$malaise)
    # Alternatively:
    library(table1)
    table1(~ bilirubin + age + alk.phosphate + aspartate.transaminase + albumin, data = epatite1)
    table1(~ sex + steroids + fatigue + malaise, data = epatite1)

7. Evaluate if the number of deceased subjects is significantly different between those who have taken STEROIDS and those who haven't.

    chisq.test(table(epatite1$CLASS, epatite1$steroids))
    # The p-value is greater than 0.05, so there is no significant difference between those who have taken steroids and those who haven't.

8. Assess if albumin levels are statistically different between steroid users and non-users using the appropriate test, justifying the test choice, and represent it graphically.

    # First, I assess if the 'albumin' variable is normally distributed:
    shapiro.test(epatite1$albumin)
    # The p-value is less than 0.05, so we reject the null hypothesis of normality and apply a non-parametric test:
    wilcox.test(albumin ~ steroids, data = epatite1, paired = FALSE)
    # p-value = 0.0011: we reject the null hypothesis and conclude that albumin levels are statistically different between those who use steroids and those who don't.
    # Let's try to visualize this:
    levels(epatite1$steroids) <- c("no", "yes")
    boxplot(albumin ~ steroids, data = epatite1)

9. Choose the most appropriate graph to visualize the relationship between albumin and alk.phosphate.

10. Select the most suitable graph to display the frequencies of antiviral usage.

    barplot(prop.table(table(epatite2$antivirali)), main = "Antiviral usage", xlab = "Groups", ylab = "Frequencies")
    # or
    library(ggplot2)
    ggplot(epatite2, aes(factor(antivirali), fill = factor(antivirali))) + geom_bar(aes(y = after_stat(count / sum(count)))) + ggtitle("Frequencies of antiviral usage")

11. Create a graph to depict the median levels of aspartate in the two classes, dead and alive, and in the use of steroids.

    # Boxplot of 'aspartate.transaminase' levels by 'CLASS'
    boxplot(epatite1$aspartate.transaminase ~ epatite1$CLASS, main = "Aspartate Levels in Two Classes", xlab = "Groups", ylab = "Aspartate", ylim = c(0, 280))
    # A cleaner version
    library(ggplot2)
    library(ggpubr)
    ggboxplot(epatite1, x = "CLASS", y = "aspartate.transaminase", color = "CLASS", palette = "jco", add = "jitter", ylim = c(0, 280))

12. Is there a significant relationship between bilirubin and phosphatase?

    # Model:
    model <- glm(bilirubin ~ alk.phosphate, data = epatite1, family = "gaussian")
    summary(model)
    # p-value = 0.0496: SIGNIFICANT. There is a significant relationship between bilirubin and phosphatase.
    # B0: 1.001432 (the value of Y when X=0)
    # B1: 0.004045 (the change in Y for a one-unit increase in X)

    # An alternative approach:
    # 1) Check the normality of both variables:
    shapiro.test(epatite1$bilirubin)
    shapiro.test(epatite1$alk.phosphate)
    # 2) Since they are not normally distributed, use the Spearman method:
    cor.test(epatite1$bilirubin, epatite1$alk.phosphate, method = "spearman")
    # p-value less than 0.05: there is a relationship between bilirubin and phosphatase.

13. Evaluate if there is a significant relationship between CLASS and albumin, considering the steroid variable as a confounder.

    # Summary of the model:
    summary(modello <- glm(CLASS ~ albumin + steroids, data = epatite1, family = "binomial"))
    # There is a statistically significant relationship between 'albumin' and 'CLASS' after accounting for the 'steroids' variable.
    # B0: -6.7060 (the log-odds of Y when albumin = 0 and steroids is at its reference level)
    # B1: exp(2.16) = 8.67: an increase of one unit in 'albumin' multiplies the odds of death by 8.67, adjusting for steroid use.

14. Display a correlogram showing the correlation between all continuous variables and indicate which variables have a correlation greater than 0.6.

    library(corrplot)
    C <- cor(epatite2[, c(13:17)])
    corrplot(C, method = "number")
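The four near-identical imputation lines in step 3 could also be written as a loop. A minimal sketch on a toy data frame with made-up values; the name toy and its contents are hypothetical, and only the column names echo the hepatitis data. It assumes that every numeric column with missing values should be mean-imputed, which is the choice made in step 3 but not necessarily the right one for other datasets.

```r
# Toy data frame with made-up values; column names only mimic the hepatitis data
toy <- data.frame(
  bilirubin = c(1.0, NA, 3.0),
  albumin   = c(4.0, 3.0, NA),
  sex       = factor(c("1", "2", "1"))
)

# Impute only the numeric columns that contain at least one NA
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
for (col in numeric_cols) {
  missing <- is.na(toy[[col]])
  if (any(missing)) {
    toy[[col]][missing] <- mean(toy[[col]], na.rm = TRUE)
  }
}

print(toy)
print(table(is.na(toy)))
```

Factor columns such as sex are left untouched, since a mean is not defined for them.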