# README File for cmprskPHContinMark     package:cmprskPHContinMark      R Documentation

# This code was written by Yanqing Sun, Peter Gilbert, and Ted Holzman 
# (completed in April, 2014).

# This program implements Sun and Gilbert (2012, Scand J Statistics) and
# Gilbert and Sun (2014, Journal of the Royal Statistical Society Series
# C), to analyze an input data-set.

# Description:
#      These papers provide estimation and testing methods for the
#      mark-specific proportional hazards model accommodating that some
#      failures have a missing mark, and allowing separate baseline
#      mark-specific hazard functions for different baseline subgroups.
#      Missing marks are handled via inverse-probability complete-case 
#      weighting (IPW) or augmented IPW.

# NOTE: This code as supplied here assumes only one baseline stratum.
# The code would need slight modification by the user to accommodate multiple
# baseline strata.

# The R function for performing the analysis has been tested on a Macintosh,
# a Linux workstation, and a PC.

# Step 1: Place the file cmprskPHContinMark_0.99.tar.gz in the directory where you wish to run this R package.

# Step 2: To use the code from Linux, within that directory execute the commands:

mkdir Rlibs

R CMD INSTALL cmprskPHContinMark_0.99.tar.gz --library=Rlibs

# Step 3: Enter R, and type the command
> library("cmprskPHContinMark",lib.loc="./Rlibs")

# The package depends on another package called "stringr".  If the
# cmprskPHContinMark install fails because of lack of the stringr package, 
# then install stringr from CRAN like this:

> install.packages("stringr")
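
# A minimal sketch of checking for the stringr dependency before loading the
# package, so that install.packages() is only needed when something is missing.
# The helper name missing_packages is hypothetical and is not part of
# cmprskPHContinMark.

```r
# Hypothetical helper (not part of cmprskPHContinMark): returns the names of
# the requested packages that cannot be loaded in the current R session.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

todo <- missing_packages("stringr")
if (length(todo) > 0) {
  message("Install first with install.packages(): ", paste(todo, collapse = ", "))
}
```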

# Step 4: Once cmprskPHContinMark has been installed successfully, it can be used like this:

R
> library("cmprskPHContinMark",lib.loc="./Rlibs")
> cmprskPHContinMark(inputf,outpre,nboot,missmark,nauxiliary)

# inputf is the name of the datafile.  If you don't supply it, it defaults 
# to a practice data-set based on the RV144 data used in Gilbert and Sun (2014), 
# practicedataRV144.dat. A practice data-set is used both because it is much smaller
# (greatly shortening run time) and it is not possible to share the real data at this
# time in this forum. Running getPracticeData() in R generates the practice data-set
# practicedataRV144.dat.

# outpre is the prefix name of the output files.  There are four output 
# files named <outpre>_VE, <outpre>_Power, <outpre>_Plot1 and 
# <outpre>_Plot2.  The default value is the same as the value of inputf.
# The files contain
#      _VE:    the estimated mark-specific VEs with standard errors
#      _Power: the results of the hypothesis testing procedures
#      _Plot1: the data needed for graphical diagnostics of H_10
#      _Plot2: the data needed for graphical diagnostics of H_20
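
# As a sketch of the naming rule above, the four output file names can be
# derived from the prefix and checked after a run. The helper name output_files
# is hypothetical; only the four suffixes come from this README.

```r
# Hypothetical helper: expected output file names for a given prefix.
output_files <- function(outpre) {
  suffixes <- c("_VE", "_Power", "_Plot1", "_Plot2")
  setNames(paste0(outpre, suffixes), suffixes)
}

files <- output_files("practicedataRV144.dat")
print(files)

# After a run, report which of the four files exist in the working directory.
print(file.exists(files))
```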

# nboot is an integer, the number of bootstrap iterations.  The default 
# (and the maximum) is 500.

# missmark and nauxiliary are 0/1 flags: 1 turns the corresponding option on and
# 0 turns it off.  Importantly, the program works for data-sets with no missing
# marks (by specifying missmark=0) and for data-sets with no auxiliary covariates
# (by specifying nauxiliary=0).

# Detailed help on the meaning of the input variables and how to assemble the input data file
# inputf is provided by typing the following within R:

?cmprskPHContinMark

# The cmprskPHContinMark function itself doesn't return anything useful; if you
# examine the return value it will contain a list of the values of the
# parameters you passed in.  Rather, all the results of the calculations are
# contained in the four output files, and in some of the material
# written to the screen.

# Any of the parameters can be defaulted, for example to run the program on
# the data-set practicedataRV144.dat type:

cmprskPHContinMark()

# The following will also work, defaulting a subset of the parameters:

cmprskPHContinMark("practicedataRV144.dat",,100,0,)
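
# The same partial defaulting may be easier to read with named arguments.
# This sketch ASSUMES the formal argument names match those in the usage line
# above (inputf, outpre, nboot, missmark, nauxiliary); check ?cmprskPHContinMark
# before relying on it. The call is skipped if the package is not loaded.

```r
# Only the non-default values are listed; outpre and nauxiliary are left to
# their defaults. Argument names are assumed from the usage line in this README.
run_args <- list(inputf = "practicedataRV144.dat", nboot = 100, missmark = 0)

if (exists("cmprskPHContinMark", mode = "function")) {
  do.call(cmprskPHContinMark, run_args)
} else {
  message("cmprskPHContinMark is not loaded; nothing was run.")
}
```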

# In practice, the most important step for the user is assembling the input
# data-set inputf; given its importance, the required layout is described here:

# The input data file must contain eight space-separated columns of data:
#        1. subject identifier 1 (could be arbitrary integers)
#        2. subject identifier 2 (could be arbitrary integers)
#        3. binary auxiliary (1 or 0) covariate used for predicting the mark V
#        4. treatment assignment (1 or 0 for vaccine or placebo)
#        5. infection status (1 or 0 for infected or right-censored, respectively)
#        6. time (minimum of failure time or right-censoring time)
#        7. R [indicator of observing the mark (1=infected and observed,
#           0=infected and unobserved, 2=uninfected)]
#        8. mark V (8888 = infected and missing; 99 = uninfected and thus
#           obviously missing)

# If nauxiliary=0, then the variable in column 3 will not be used, but some (arbitrary)
# binary variable still must be included.
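
# To make the layout concrete, here is a minimal sketch that writes a four-row
# file in the eight-column format. All values are invented toy numbers, not
# RV144 data, and the column names are illustrative labels only. Writing the
# file without a header row or quotes is an assumption; see ?cmprskPHContinMark
# for the authoritative description.

```r
# Toy values only, invented for illustration.
toy <- data.frame(
  id1      = 1:4,                    # column 1: subject identifier 1
  id2      = 101:104,                # column 2: subject identifier 2
  aux      = c(1, 0, 1, 0),          # column 3: binary auxiliary covariate
  trt      = c(1, 0, 1, 0),          # column 4: 1 = vaccine, 0 = placebo
  infected = c(1, 1, 0, 0),          # column 5: 1 = infected, 0 = right-censored
  time     = c(0.8, 1.5, 3.0, 3.0),  # column 6: failure or censoring time
  R        = c(1, 0, 2, 2),          # column 7: 1 = mark observed, 0 = mark missing, 2 = uninfected
  V        = c(0.35, 8888, 99, 99)   # column 8: mark; 8888 = missing, 99 = uninfected
)

# Write eight space-separated columns to a temporary file (assumed layout:
# no header row, no quotes).
toy_file <- file.path(tempdir(), "toydata.dat")
write.table(toy, toy_file, sep = " ", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
print(readLines(toy_file))
```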

# If the input data are placed in a file named "analysisdata.dat",
# the program would be run in R with:

cmprskPHContinMark("analysisdata.dat",,,,)
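
# Before a long run it may be worth checking that the file follows the coding
# rules listed above. The following is a rough sketch of such a check, not part
# of the package: check_input and the column labels are hypothetical, and the
# rules encoded are only those stated in this README.

```r
# Hypothetical helper: returns a character vector of problems (empty if none).
check_input <- function(path) {
  d <- read.table(path, header = FALSE)
  if (ncol(d) != 8) {
    return(sprintf("expected 8 columns, found %d", ncol(d)))
  }
  # Illustrative labels only; the file itself is assumed to have no header.
  names(d) <- c("id1", "id2", "aux", "trt", "infected", "time", "R", "V")
  fails <- function(ok) !all(ok %in% TRUE)  # an NA counts as a failure
  uninf <- d$infected %in% 0
  problems <- c(
    if (fails(d$aux %in% c(0, 1))) "column 3 (auxiliary) is not 0/1",
    if (fails(d$trt %in% c(0, 1))) "column 4 (treatment) is not 0/1",
    if (fails(d$infected %in% c(0, 1))) "column 5 (infection status) is not 0/1",
    if (fails(d$R %in% c(0, 1, 2))) "column 7 (R) is not 0, 1, or 2",
    if (fails((d$R == 2) == uninf)) "R == 2 does not match the uninfected rows",
    if (fails(d$V[uninf] == 99)) "uninfected rows lack mark code 99",
    if (fails(d$V[d$R %in% 0] == 8888)) "rows with R == 0 lack mark code 8888"
  )
  as.character(problems)
}

if (file.exists("analysisdata.dat")) {
  print(check_input("analysisdata.dat"))
}
```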

# After the program is run, use the cmprskPHContinMark_makeplots() function
# to plot the results.

# The cmprskPHContinMark_makeplots() function takes eight arguments; all can be defaulted:

> cmprskPHContinMark_makeplots(
datafile_92="practicedataRV144_92.dat",
datafile_cm="practicedataRV144_cm.dat",
vefile_92="practicedataRV144_92.dat_VE",
vefile_cm="practicedataRV144_cm.dat_VE",
plot1file_92="practicedataRV144_92.dat_Plot1",
plot1file_cm="practicedataRV144_cm.dat_Plot1",
plot2file_92="practicedataRV144_92.dat_Plot2",
plot2file_cm="practicedataRV144_cm.dat_Plot2")
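
# Because the eight file names follow a fixed pattern, they can also be built
# from the two input file names rather than typed by hand. This is a sketch
# only: plot_args is a hypothetical helper, while the argument names and the
# _VE, _Plot1, _Plot2 suffixes are the ones shown in this README.

```r
# Hypothetical helper: builds the eight makeplots arguments from two inputs.
plot_args <- function(file_92, file_cm) {
  inputs <- c("92" = file_92, cm = file_cm)
  out <- list()
  for (tag in names(inputs)) {
    out[[paste0("datafile_", tag)]]  <- inputs[[tag]]
    out[[paste0("vefile_", tag)]]    <- paste0(inputs[[tag]], "_VE")
    out[[paste0("plot1file_", tag)]] <- paste0(inputs[[tag]], "_Plot1")
    out[[paste0("plot2file_", tag)]] <- paste0(inputs[[tag]], "_Plot2")
  }
  out
}

mp_args <- plot_args("practicedataRV144_92.dat", "practicedataRV144_cm.dat")
str(mp_args)

# With the package loaded, the call could then be made as:
# do.call(cmprskPHContinMark_makeplots, mp_args)
```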

# The defaults are included in the package, accessed with the command getPracticeData()

# The output files are written to the directory from which the 
# cmprskPHContinMark_makeplots routine was run.

###############################################################
# ILLUSTRATION

# To illustrate the implementation above, suppose two different analyses are done for 
# two different distances.
# Practice data-sets practicedataRV144_92.dat and practicedataRV144_cm.dat are provided 
# as defaults to allow this analysis.
# The following is the entire sequence of R commands to create the output. 

# As a pre-requisite, obtain the practice data-sets; this function creates the
# data files practicedataRV144.dat, practicedataRV144_92.dat, and practicedataRV144_cm.dat
# and places them in the directory where the coding files live:

library("cmprskPHContinMark",lib.loc="./Rlibs")

getPracticeData()

# First set of distances:
cmprskPHContinMark("practicedataRV144_92.dat",,500,1,1)
# The point estimates and standard error estimates are written to the file:
# practicedataRV144_92.dat_VE
# The results of the testing procedures are written to the file:
# practicedataRV144_92.dat_Power

# Second set of distances:
cmprskPHContinMark("practicedataRV144_cm.dat",,500,1,1)
# The point estimates and standard error estimates are written to the file:
# practicedataRV144_cm.dat_VE
# The results of the testing procedures are written to the file:
# practicedataRV144_cm.dat_Power

# Plot the output
cmprskPHContinMark_makeplots(datafile_92 ="practicedataRV144_92.dat",
datafile_cm ="practicedataRV144_cm.dat",
vefile_92 ="practicedataRV144_92.dat_VE",
vefile_cm ="practicedataRV144_cm.dat_VE",
plot1file_92="practicedataRV144_92.dat_Plot1",
plot1file_cm="practicedataRV144_cm.dat_Plot1",
plot2file_92="practicedataRV144_92.dat_Plot2",
plot2file_cm="practicedataRV144_cm.dat_Plot2")

# The plots of interest are written to the files:
# Figure1markmissingtestingmindistkwong.92cm_boxplots.eps
# Figure1markmissingtestingmindistkwong.92cm_scatterplots.eps
# RV144_TestProcs_A1h3P_kwong_92cm.eps
# RV144_VECI_A1h3P_mindist_kwong.92cm.eps