.BG Fits a logistic regression model to data arising from two phase designs .FN tps .DN Returns estimates and standard errors from logistic regression fit to data arising from two phase designs. Three semiparametric methods are implemented to obtain estimates of the regression coefficients and their standard errors. Use of this function requires existence of a finite number of strata (J) so that the phase one data consist of a joint classification into 2J cells according to binary outcome and stratum. This function can also handle certain missing value and measurement error problems with validation data. .IP The phase I sample can involve either cohort or case-control sampling. This software yields correct estimates (and standard errors) of all the regression coefficients (including the intercept) under cohort sampling at phase I. When phase I involves case-control sampling one cannot estimate the intercept, except, when the marginal odds of observing a case in the population is specified. Then the software yields a correct estimate and standard error for the intercept also. .CS tps(formula=formula(data), data=sys.parent(), nn0, nn1, group, contrasts=NULL, method="PL", cohort=T, alpha=1.0) .RA .AG formula a formula expression as for other binomial response regression models, of the form response ~ predictors, where both the response and predictors corresponds to observations at phase II sample. The response can be either a vector of 0's and 1's or else a matrix with two columns representing number of cases (response=1) and controls (response=0) corresponding to the rows of the design matrix. .AG nn0 control stratum frequency at phase I (vector of length J). .AG nn1 case stratum frequency at phase II (vector of length J). .AG group stratum identification for phase II data. Values should be in {1,...,J} , where J is the number of strata (vector of same length as the response and predictors). .OA .AG data an optional data frame for phase two sample in which to interpret the variables occurring in the formula. .AG contrasts a list of contrasts to be used for some or all of the factors appearing as variables in the model formula. See the documentation of glm for more details. .AG method three different procedures are available. The default method is "PL" which implements pseudo-likelihood as developed by Breslow and Cain (1988). Other possible choices are "WL" and "ML" which implements, respectively, weighted likelihood ( Flanders and Greenland , 1991; Zhao and Lipsitz, 1992) and maximum likelihood (Breslow and Holubkov, 1997; Scott and Wild , 1997). .AG cohort logical flag. FALSE indicates that separate samples of "cases" and "controls" were drawn at phase I. .AG alpha marginal odds of observing a case in the population. This is only used when cohort=F is specified and must be correctly specified in order to obtain a correct estimate of the intercept. .RT list of estimated regression coefficients and one or two estimates of their asymptotic variance-covariance matrix. .RC coef regression coefficients. .RC covm model based variance-covariance matrix. This is available for methods = "PL" ,"ML". .RC cove empirical variance-covariance matrix. This is available for all the three methods. .DT the WL method fits a logistic regression model to the phase II data with a set of weights. Each unit is weighted by the ratio of frequencies (phase I/phase II) for the corresponding outcome X stratum cell. This estimator has its origins in sampling theory and is well known as Horvitz-Thompson method. The PL method maximizes the product of conditional probabilities of "being a case" given the covariates and the fact of inclusion in the phase II sample. This is called the "complete data likelihood" by some researchers. The estimate is obtained by fitting a logistic regression model to the phase II data with a set of offsets. The ML procedure maximizes the full likelihood of the data (phase I and II) jointly with respect to the regression parameters and the marginal distribution of the covariates. The resulting concentrated score equations (Breslow and Holubkov (1997) , eq. 18 ) were solved using a modified Newton-Raphson algorithm. Schill's (1993) partial likelihood estimates are used as the starting values. .IP The output may be examined by the summary functions, which returns the estimated coefficients and their standard errors in a tabular form .SH REFERENCES Breslow, N.E. and Cain, K.C. ( 1988 ). "Logistic regression for two-stage case control data." Biometrika 75 , 11-20. Breslow, N.E. and Chatterjee (1999). "Design and Analysis of Two Phase Studies with Binary Outcome Applied to Wilms Tumour Prognosis." Applied Statistics 48 (part 3) -- in press Breslow, N.E. and Holubkov, R. (1997). "Maximum likelihood estimation for logistic regression parameters under two-phase, outcome dependent sampling." J. Roy. Statist. Soc. B. 59,447-461. Flanders W.D. and Greenland S. (1991). "Analytic methods for two-stage case-control studies and other stratified designs." Statistics in Medicine 10:739-747,1991. Schill, W., Jockel K.-H., Drescher, K., and Timm, J.(1993). "Logistic analysis in case-control studies under validation sampling." Biometrika 80, 339-352. Scott, A.J. and Wild, C.J. (1997). "Fitting regression models to case control data by maximum likelihood." Biometrika 78, 705-717. Zhao L.P., Lipsitz S. (1992)."Design and analysis of two-stage studies."Statistics in Medicine 11:769-782. .EX # Read in the complete Wilms Tumor Data (Breslow and Chatterjee , 1998) nwts.dat <- read.table("NWTS.dat",header=T) attach(nwts.dat) # Summarize by stg (stage) , instit (institutional histology) and # relapse status to generate the phase I data nn0 <- as.vector(crosstabs(control~ stg + instit)) nn1 <- as.vector(crosstabs(case~ stg + instit)) # Read in the phase II data (balanced sample) nwtsb.dat <- read.table("NWTSb.dat",header=T) # Create the stratum indicator variable. There are eight strata defined # by the four levels of stage and two levels of institutional histology strt <- c( rep(1:4,2) , rep(5:8,2)) options(contrasts=c("contr.treatment","contr.poly"),error=dump.frames) # Fit the data by method = "PL" fit.pl <- tps(cbind(case,control)~factor(stg)*factor(hist), data = nwtsb.dat, nn0=nn0 , nn1 = nn1 , group = strt , cohort = T) sfit.pl <- summary(fit.pl) # The output looks like this round(sfit.pl$coefficients,3) Value Model SE Emp SE (Intercept) -2.716 0.109 0.109 factor(stg)2 0.785 0.148 0.148 factor(stg)3 0.804 0.154 0.154 factor(stg)4 1.039 0.182 0.180 factor(hist) 1.380 0.282 0.281 factor(stg)2factor(hist) 0.088 0.370 0.371 factor(stg)3factor(hist) 0.447 0.353 0.355 factor(stg)4factor(hist) 1.375 0.444 0.450 .KW ~keyword .WR