Tools for modelling categorical variables in Lisp-Stat ----------------------------------------------------- This documentation is somewhat disorganised because it reflects three different versions of the tools. They have all been kept because they may all be of some use. The first and most elegant is an object-oriented set, including methods for regression-model-proto and its descendants. The second is the functions ued to implement the object methods. These would be useful for models where a method hasn't yet been written. Finally there are is a set of primitive functions for constructing factor and interaction design matrices. -------------- Object-oriented methods ------------------ model-formula-proto is a prototype for a model formula class. To create a new model formula use the (as-formula) function (as-formula '((factor a) (factor b) (term x) (interaction (list a b)) (interaction (list a x) :is-factor (list t nil)))) The argument *must* be quoted. I'm sure there is some way of getting round this -- the equivalent of deparse(substitute()) in S but I'm not a good enough LISP hacker to know what it is. Please tell me if you know -- it probably involves using a macro rather than a function. Note that the :is-factor keyword in (interaction) is optional but that the default is to assume everything is a factor. If one of the variables is continuous and you forget :is-factor the results will be obviously wrong (even if you don't run out of stack space). This is the best I can do with a function that accepts lists of numbers as input. The Right Thing to do is probably what S does: force the user to predeclare factor variables as objects and treat everything else as a metric variable. This is not too difficult and I may well do a revised version that makes everything completely object oriented. See the Individual functions section for more documentation - as-formula is just an object version of (formula) The formula object responds to messaages :formula :design-matrix :name-list :block-names :block-indices The first sets or returns the formula (the argument to (as-formula)) and recomputes if necessary. The other four return the appropriate things: a design matrix, a list of column labels for it, a list of names for each block of columns (ie each factor, interaction or metric variable), and a list of lists of column numbers for each block. There is a function (regression-formula) which is exactly the same as (regression-model) except that it takes a model-formula-proto object instead of the x argument. It then stores the formula in a new slot in the regression-model object and optionally invokes a new method :display-with-formula which gives Wald tests for the various factors, interactions and other terms in the model. The method :display-with-formula will work on glim-proto's as well, but a new set of constructor functions similar to (regression-formula) would be needed. My first priority is to get this working on GEEs rather than GLMs. Example: > (def a (repeat (iseq 1 2) 30)) A > (def b (repeat (iseq 1 3) 20)) B >(def x (normal-rand 60)) X > (def y (+ x a (normal-rand 60))) Y > (def formula1 (as-formula '((term x) (factor a) (factor b) (interaction (list a b))))) FORMULA1 > (send formula1 :name-list) (send formula1 :name-list) ("((A 2) (B 2))" "((A 2) (B 3))" "(B 2)" "(B 3)" "(A 2)" "X") > (send formula1 :block-names) ("(INTERACTION A B)" "B" "A" "X") > (send formula1 :block-indices) ((1 0) (3 2) (4) (5)) > (regression-model (send formula1 :design-matrix) y :response-name "Why?" :predictor-names (send formula1 :name-list)) Least Squares Estimates: Constant 0.867491 (0.308863) ((A 2) (B 2)) -0.481680 (0.621703) ((A 2) (B 3)) -0.666604 (0.624665) (B 2) 0.782557 (0.441825) (B 3) 0.538866 (0.444595) (A 2) 1.16702 (0.437606) X 1.30912 (0.155051) R Squared: 0.634662 Sigma hat: 0.976272 Number of cases: 60 Degrees of freedom: 53 # > (def modela (regression-formula formula1 y :response-name "Why?")) Block Wald Chisq p-value X 71.287 0.0000 A 7.1119 0.0077 B 3.3431 0.1880 (INTERACTION A B) 1.2326 0.5399 MODELA > (send modela :display) Least Squares Estimates: Constant 0.867491 (0.308863) ((A 2) (B 2)) -0.481680 (0.621703) ((A 2) (B 3)) -0.666604 (0.624665) (B 2) 0.782557 (0.441825) (B 3) 0.538866 (0.444595) (A 2) 1.16702 (0.437606) X 1.30912 (0.155051) R Squared: 0.634662 Sigma hat: 0.976272 Number of cases: 60 Degrees of freedom: 53 NIL > (send modela :display-with-formula) Block Wald Chisq p-value X 71.287 0.0000 A 7.1119 0.0077 B 3.3431 0.1880 (INTERACTION A B) 1.2326 0.5399 NIL > (send modela :display-with-formula nil) Block Wald Chisq p-value X 71.287 0.0000 Variable Estimate Std.Err. p-value X 1.3091 (0.155051) 0.0000 A 7.1119 0.0077 Variable Estimate Std.Err. p-value (A 2) 1.1670 (0.437606) 0.0077 B 3.3431 0.1880 Variable Estimate Std.Err. p-value (B 3) 0.53887 (0.444595) 0.2255 (B 2) 0.78256 (0.441825) 0.0765 (INTERACTION A B) 1.2326 0.5399 Variable Estimate Std.Err. p-value ((A 2) (B 3)) -0.66660 (0.624665) 0.2859 ((A 2) (B 2)) -0.48168 (0.621703) 0.4385 NIL --------------- Individual functions --------------------- The (formula) function takes a quoted list describing a model as its argument and returns (design-matrix names block-names block-indices) where design-matrix is a design-matrix, names is a list of names for its columns, block-names is a list of names for multicolumn model terms such as categorical predictors and block-indices is a list of lists of integers describing which columns of the design matrix comprise which blocks. The canonical form for input is like this (formula '((factor a) (factor b) (term x) (interaction (list a b)) (interaction (list a x) :is-factor (list t nil)))) The argument *must* be quoted. I'm sure there is some way of getting round this -- the equivalent of deparse(substitute()) in S but I'm not a good enough LISP hacker to know what it is. Please tell me if you know -- it probably involves using a macro rather than a function. If you try something clever to save time like storing (interaction (list a x) :is-factor (list t nil)) in a variable you may get silly results. Don't do that, then. You can change the default variable names (though not the default block names) using the relevant keyword arguments in factor and interaction. See the documentation for the functions (below) and the example to find out how. You need to specify the new name everywhere in order to make the labels come out consistent. It's probably easier to use the right variable name to begin with. Note that the :is-factor keyword in (interaction) is optional but that the default is to assume everything is a factor. If one of the variables is continuous and you forget :is-factor the results will be obviously wrong (even if you don't run out of stack space). This is the best I can do with a function that accepts lists of numbers as input. The Right Thing to do is probably what S does: force the user to predeclare factor variables as objects and treat everything else as a metric variable. This is not too difficult and I may well do a revised version that makes everything properly object oriented. (formula-display) takes a vector of coefficient estimates, a covariance matrix and the output of (formula) and displays the Wald chisquare test for each block (factor, term or interaction), with or without the separate estimates and p-values for each variable. Details (formula-display beta cov-mat model-formula &key (block-only t) (intercept t)) beta is the coefficient estimates, cov-mat is the covariance matrix, model-formula is the output of the original formula command. Set block-only to nil to get tests for each separate variable. If the model does not have an intercept set intercept to nil. Example: > (def a (repeat (iseq 1 2) 30)) A > (def b (repeat (iseq 1 3) 20)) B >(def x (normal-rand 60)) X > (def y (+ x a (normal-rand 60))) Y > (def result (formula '((term x) (factor a) (factor b) (interaction (list a b))))) RESULT > (def model1 (regression-model (first result) y :predictor-names (second result))) Least Squares Estimates: Constant 0.867491 (0.308863) ((A 2) (B 2)) -0.481680 (0.621703) ((A 2) (B 3)) -0.666604 (0.624665) (B 2) 0.782557 (0.441825) (B 3) 0.538866 (0.444595) (A 2) 1.16702 (0.437606) X 1.30912 (0.155051) R Squared: 0.634662 Sigma hat: 0.976272 Number of cases: 60 Degrees of freedom: 53 MODEL1 > (def model1.cov (* (^ (send model1 :sigma-hat) 2) (send model1 :xtxinv))) MODEL1.COV > (formula-display (send model1 :coef-estimates) model1.cov result :block-only nil) Block Wald Chisq p-value X 71.287 0.0000 Variable Estimate Std.Err. p-value X 1.3091 (0.155051) 0.0000 A 7.1119 0.0077 Variable Estimate Std.Err. p-value (A 2) 1.1670 (0.437606) 0.0077 B 3.3431 0.1880 Variable Estimate Std.Err. p-value (B 3) 0.53887 (0.444595) 0.2255 (B 2) 0.78256 (0.441825) 0.0765 (INTERACTION A B) 1.2326 0.5399 Variable Estimate Std.Err. p-value ((A 2) (B 3)) -0.66660 (0.624665) 0.2859 ((A 2) (B 2)) -0.48168 (0.621703) 0.4385 > (formula-display (send model1 :coef-estimates) model1.cov result) Block Wald Chisq p-value X 71.287 0.0000 A 7.1119 0.0077 B 3.3431 0.1880 (INTERACTION A B) 1.2326 0.5399 NIL Now we recode the predictors and relabel. > (def a (repeat (list "Male" "Female") 30)) A > (def b (repeat (list "Never" "Past" "Current") 20)) B > (def result1 (formula '((term x :namebase "Age") (factor a :namebase "Sex") (factor b :namebase "Smoking") (interaction (list a b) :namebases (list "Sex" "Smoking"))))) RESULT1 > (formula-display (send model1 :coef-estimates) model1.cov result1) Block Wald Chisq p-value Age 71.287 0.0000 Sex 7.1119 0.0077 Smoking 3.3431 0.1880 (INTERACTION Sex Smoking) 1.2326 0.5399 NIL > (formula-display (send model1 :coef-estimates) model1.cov result1 :block-only nil) Block Wald Chisq p-value Age 71.287 0.0000 Variable Estimate Std.Err. p-value Age 1.3091 (0.155051) 0.0000 Sex 7.1119 0.0077 Variable Estimate Std.Err. p-value (Sex Male) 1.1670 (0.437606) 0.0077 Smoking 3.3431 0.1880 Variable Estimate Std.Err. p-value (Smoking Past) 0.53887 (0.444595) 0.2255 (Smoking Never) 0.78256 (0.441825) 0.0765 (INTERACTION Sex Smoking) 1.2326 0.5399 Variable Estimate Std.Err. p-value ((Sex Male) (Smoking Past)) -0.66660 (0.624665) 0.2859 ((Sex Male) (Smoking Never)) -0.48168 (0.621703) 0.4385 NIL ----------------------- primitives used by the above methods ----------------- These functions provide a basic level of support for categorical predictor variables to be used in the Lisp-Stat regression functions. You can use just these to get more control over the process. I should have used the functions provided with glim-proto but I didn't realise they existed. There are three main functions, (factor), (interaction), and (block-test). The first constructs a set of treatment contrasts -- indicator variables for all but the first level of the factor -- and optionally a set of appropriate names. The second takes a list of variables and constructs a matrix of variables for the interaction terms. It allows both continuous and factor variables. Only the highest order interaction terms are included, so to get a matrix for the model A + B + A:B you would need to call (factor) on each of A and B and then (interaction) on A and B together. This behaviour is useful for two reasons: it makes it much easier to implement a model formula system and it allows non-hierarchical models on the occasions when they are appropriate. The third function calculates and displays the Wald test for a block of variables all having zero coefficient and the individual coefficient estimates, standard errors and Wald p-values. There are also two functions (design) and (names) that return the design matrix or list of names from a list of terms of the sort returned by (factor) and (interaction) Details (factor x &key namebase) x is a sequence and namebase is either a string or a symbol. Returns a list containing the design matrix of treatment contrasts and a list of variable names. If namebase is nil returns only the design matrix (*not* a list containing only the design matrix) Examples: >(factor (list 2 3 4 2 3 4) :namebase "A") (#2A((0 0) (1 0) (0 1) (0 0) (1 0) (0 1)) ("(A 3)" "(A 4)")) >(factor (list 2 3 4 2 3 4) :namebase 'A) (#2A((0 0) (1 0) (0 1) (0 0) (1 0) (0 1)) ("(A 3)" "(A 4)")) > (factor (list 2 3 4 2 3 4) ) #2A((0 0) (1 0) (0 1) (0 0) (1 0) (0 1)) (interaction (xs &key (is-factor (repeat t (length xs))) (namebases (repeat nil (length xs))) ) xs is a list of sequences, is-factor is a list of t and nil values indicating whether the respective variable should be treated as a factor, namebases is a list of strings or symbols. Returns a design matrix (if namebases or (first namebases) is nil) or a list containing a design matrix and a list of names. Examples > (def a (list 1 2 1 2 1 2)) A > (def b (list 5 6 7 5 6 7)) B > (def c (list 0.1 0.2 0.3 0.4 0.5 0.6)) C > (interaction (list a b) :is-factor (list t t) :namebases (list "A" 'B)) (#2A((0 0) (1 0) (0 0) (0 0) (0 0) (0 1)) ("((A 2) (B 6))" "((A 2) (B 7))")) > (interaction (list a c) :is-factor (list t nil) :namebases (list "A" "C")) (#2A((0.0) (0.2) (0.0) (0.4) (0.0) (0.6)) ("((A 2) C)")) > (interaction (list a b c) :is-factor (list t t nil) :namebases (list "A" "B" "C")) (#2A((0.0 0.0) (0.2 0.0) (0.0 0.0) (0.0 0.0) (0.0 0.0) (0.0 0.6)) ("(((A 2) (B 6)) C)" "(((A 2) (B 7)) C)")) > (interaction (list a b c) :is-factor (list t t nil) ) #2A((0.0 0.0) (0.2 0.0) (0.0 0.0) (0.0 0.0) (0.0 0.0) (0.0 0.6)) (block-test (index beta covmat &key blockname names (block-only nil)) index is a sequence of integers indicating which components of the coefficient vector are in the block, beta is the coefficient vector, covmat is the covariance matrix, blockname is a string giving the name of the block as a whole. Prints out a table with the Wald test of the hypothesis that all the coefficients in the block are zero and if block-only is nil separate tests for each coefficient. Examples: > (def a (binomial-rand 100 4 0.3)) A > (def b (binomial-rand 100 4 0.3)) B > (def x (normal-rand 100 )) X > (def y (+ a x (normal-rand 100))) Y > (def factora (factor b :namebase "B")) FACTORA > (def factora (factor a :namebase "A")) FACTORA > (def factorb (factor b :namebase "B")) FACTORB > (def varx (list x '("X"))) VARX > (def ab (interaction (list a b) :namebases (list "A" "B"))) AB > (def ax (interaction (list a x) :namebases (list "A" "X") :is-factor (list t nil))) AX > (def bx (interaction (list b x) :namebases (list "B" "X") :is-factor (list t nil))) BX > (def abx (interaction (list a b x) :namebases (list "A" "B" "X") :is-factor (list t t nil))) ABX > (def model1 (regression-model (design (list factora varx)) y :predictor-names (names (list factora varx)))) Least Squares Estimates: Constant 0.299999 (0.200000) (A 1) 0.677777 (0.255555) (A 2) 1.74724 (0.277777) (A 3) 2.61529 (0.388888) (A 4) 4.33253 (0.699999) X 1.00357 (9.175541E-2) R Squared: 0.700000 Sigma hat: 0.933333 Number of cases: 100 Degrees of freedom: 94 MODEL1 > (def model1-cov (* (send model1 :sigma-hat) (send model1 :xtxinv))) MODEL1-COV >(block-test (list 0 1 2 3) (send model1 :coef-estimates) model1-cov :blockname "A" :names (names (list factora)) ) A 227.13 0.0000 (A 1) 0.29999 (0.211111) 0.1666 (A 2) 0.67777 (0.266666) 0.0096 (A 3) 1.7472 (0.288888) 0.0000 (A 4) 2.6153 (0.400000) 0.0000 > (block-test (list 0 1 2 3) (send model1 :coef-estimates) model1-cov :blockname "A" :block-only t) A 227.13 0.0000 (def model2 (regression-model (design (list factora factorb ax bx varx)) y :predictor-names (names (list factora factorb ax bx varx)))) Least Squares Estimates: Constant -6.656751E-2 (0.266666) (A 1) 0.722222 (0.255555) (A 2) 1.75989 (0.288888) (A 3) 2.69490 (0.399999) (A 4) 4.66799 (0.699999) (B 1) 0.533333 (0.255555) (B 2) 0.155555 (0.277777) (B 3) 0.422222 (0.355555) ((A 1) X) 0.233333 (0.277777) ((A 2) X) 0.455555 (0.299999) ((A 3) X) 0.199999 (0.388888) ((A 4) X) 1.24579 (0.999999) ((B 1) X) -0.288888 (0.277777) ((B 2) X) 0.133333 (0.299999) ((B 3) X) -0.611111 (0.399999) X 0.866666 (0.300000) R Squared: 0.744444 Sigma hat: 0.900000 Number of cases: 100 Degrees of freedom: 84 MODEL2 > (def model2-cov (* (send model2 :sigma-hat ) (send model2 :xtxinv))) MODEL2-COV > (block-test (list 0 1 2 3) (send model2 :coef-estimates) model2-cov :blockname "A" :block-only t) A 81.223 0.0000 > (block-test (list 4 5 6) (send model2 :coef-estimates) model2-cov :blockname "B" :block-only t) B 45.812 0.0000 > (block-test (iseq 7 10) (send model2 :coef-estimates) model2-cov :blockname "AX" :block-only t) AX 3.4602 0.4888 > (block-test (iseq 11 13) (send model2 :coef-estimates) model2-cov :blockname "BX" :block-only t) BX 4.8578 0.1888