Can read the data directly from the website if your computer is connected online
LHON=read.table("http://faculty.washington.edu/tathornt/sisg/LHON.txt",header=TRUE,stringsAsFactors = TRUE)
If file is saved on your computer, could also read the data in from the directory that contains the file.
LHON2=read.table("../DATA_and_EXERCISES/LHON.txt",header=TRUE,stringsAsFactors = TRUE)
View the first few lines of the LHON data
head(LHON)
# IID GENO PHENO
# 1 ID1 TT CONTROL
# 2 ID2 CT CONTROL
# 3 ID3 TT CASE
# 4 ID4 CT CONTROL
# 5 ID5 TT CONTROL
# 6 ID6 TT CONTROL
Get information about the types of variables in the LHON data frame
str(LHON)
# 'data.frame': 328 obs. of 3 variables:
# $ IID : Factor w/ 328 levels "ID1","ID10","ID100",..: 1 112 223 263 274 285 296 307 318 2 ...
# $ GENO : Factor w/ 3 levels "CC","CT","TT": 3 2 3 2 3 3 1 3 3 3 ...
# $ PHENO: Factor w/ 2 levels "CASE","CONTROL": 2 2 1 2 2 2 2 2 2 2 ...
#2. Logistic regression
First create a 0 and 1 phenotype variable indicating Case/Control Status to perform the logistic regression analysis
LHON$newpheno=with(LHON,ifelse(PHENO=="CASE",1,0))
What would be the reference genotype for a logistic regression analysis? Use the levels command in R. The first factor will be the reference genotype.
levels(LHON$GENO)
# [1] "CC" "CT" "TT"
logistmod1=glm(newpheno~GENO,family=binomial(link="logit"),data=LHON)
View the summary results of the logistic regression model, including parameter estimates and standard errors
summary(logistmod1)
#
# Call:
# glm(formula = newpheno ~ GENO, family = binomial(link = "logit"),
# data = LHON)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -0.9695 -0.8701 -0.8701 1.5197 2.1093
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.5108 0.5164 -0.989 0.3226
# GENOCT -1.5994 0.6378 -2.508 0.0122 *
# GENOTT -0.2654 0.5349 -0.496 0.6197
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 383.49 on 327 degrees of freedom
# Residual deviance: 368.48 on 325 degrees of freedom
# AIC: 374.48
#
# Number of Fisher Scoring iterations: 4
##2b. Obtain odds ratios and confidence intervals for the CT and TT genotypes.
Can obtain the odds ratio estimates by exponentiating the coefficient estimates from the logistic regression model. What is the odds ratio for the CT genotype?
exp(-1.5994)
# [1] 0.2020177
Can obtain a confidence interval for the odds ratio parameter for the CT genotype use the standard error of the coefficient estimates from the logistic regression model
myse=1.96*(.6378)
CI=c(-1.5994-myse,-1.5994+myse)
exp(CI)
# [1] 0.05787394 0.70517308
Similarly can obtain odds ratio estimates and 95% confidence intervals for genotype TT.
exp(-.2654)
# [1] 0.7668991
myse=1.96*(.5349)
CI=c(-.2654-myse,-.2654+myse)
exp(CI)
# [1] 0.2687956 2.1880353
Alternatively, can obtain the odds ratio estimates and confidence intervale for all paramaters in the logistic regression model by using the coef() and confint.default() function
exp(coef(logistmod1))
# (Intercept) GENOCT GENOTT
# 0.6000000 0.2020202 0.7668712
exp(confint.default(logistmod1))
# 2.5 % 97.5 %
# (Intercept) 0.21806837 1.650858
# GENOCT 0.05787424 0.705187
# GENOTT 0.26878265 2.187981
Way too many significant digits to report. Use the round() function
round(exp(coef(logistmod1)),2)
# (Intercept) GENOCT GENOTT
# 0.60 0.20 0.77
round(exp(confint.default(logistmod1)),2)
# 2.5 % 97.5 %
# (Intercept) 0.22 1.65
# GENOCT 0.06 0.71
# GENOTT 0.27 2.19
#3. Logistic regression with TT as the reference genotype
Use the relevel function to create a new genotype vector with reference genotype TT
LHON$NEWGENO=with(LHON,relevel(GENO, ref = "TT"))
levels(LHON$NEWGENO)
# [1] "TT" "CC" "CT"
Perform the logistic regression analysis TT as the reference genotype.
logistmod2=glm(newpheno~NEWGENO,family=binomial(link="logit"),data=LHON)
View the summary results of the logistic regression model, including parameter estimates and standard errors
summary(logistmod2)
#
# Call:
# glm(formula = newpheno ~ NEWGENO, family = binomial(link = "logit"),
# data = LHON)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -0.9695 -0.8701 -0.8701 1.5197 2.1093
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -0.7763 0.1395 -5.563 2.64e-08 ***
# NEWGENOCC 0.2654 0.5349 0.496 0.619739
# NEWGENOCT -1.3340 0.3995 -3.339 0.000841 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 383.49 on 327 degrees of freedom
# Residual deviance: 368.48 on 325 degrees of freedom
# AIC: 374.48
#
# Number of Fisher Scoring iterations: 4
Obtain the odds ratio estimates and confidence intervale for all paramaters in the logistic regression model
exp(coef(logistmod2))
# (Intercept) NEWGENOCC NEWGENOCT
# 0.4601227 1.3040000 0.2634343
exp(confint.default(logistmod2))
# 2.5 % 97.5 %
# (Intercept) 0.3500310 0.6048404
# NEWGENOCC 0.4570423 3.7204782
# NEWGENOCT 0.1203945 0.5764187
Why are the odds ratios different for CT now?
The reference genotype group has changed from CC to TT, and the TT genotype group is much larger, which increases the precision of the estimate of the effect. Can see this more clearly with a stacked barplot
plot(factor(PHENO)~factor(GENO), data=LHON)
Can also conduct a logistic regression based on an additive logistic regression model. First create a genotype variable with an additive coding based on the counts of the number of T alleles
LHON$genoadd <- with(LHON, 0 + 1*(GENO=="CT") + 2*(GENO=="TT"))
Now perform the logistic regression analysis with the additive genotype coding
logistmod3 <- glm(newpheno~genoadd,family=binomial(link="logit"),data=LHON)
summary(logistmod3)
#
# Call:
# glm(formula = newpheno ~ genoadd, family = binomial(link = "logit"),
# data = LHON)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -0.8436 -0.8436 -0.6854 1.5531 1.9797
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.8077 0.4554 -3.970 7.2e-05 ***
# genoadd 0.4787 0.2505 1.911 0.0559 .
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 383.49 on 327 degrees of freedom
# Residual deviance: 379.47 on 326 degrees of freedom
# AIC: 383.47
#
# Number of Fisher Scoring iterations: 4
Obtain the odds ratio estimates and confidence intervale for all paramaters in the logistic regression model
round(exp(coef(logistmod3)),3)
# (Intercept) genoadd
# 0.164 1.614
round(exp(confint.default(logistmod3)),3)
# 2.5 % 97.5 %
# (Intercept) 0.067 0.400
# genoadd 0.988 2.637