data=
argument to svydesign
can specify a table or view in a relational database. At the moment (version 3.9-1) the database must have an R interface based on DBI or ODBC. Only read access to the database is needed; no attempts are made to modify the underlying dataThe DBI interface has only been tested using SQLite, but should work with the ROracle, RMySQL, and RJDBC packages, and probably with RdbiPgSQL from the Bioconductor project. The ODBC interface has been tested using SQLite under OS X and using MS Access and SQLite under Windows.
As an example, we analyse some data from NHANES III. The multiply-imputed data sets have been loaded into a SQLite database as core
, imp1
, ..., imp5
, and merged to create database views for analysis, with
create view set1 as select * from core inner join imp1 using(SEQN)We will use the first imputed data set,
set1
. The call to syvdesign
specifies data="set1"
as the data table and gives dbtype="SQLite"
, to use the SQLite driver, and dbname="~/nhanes/imp.db"
as the database name. SQLite does not need authentication, but any additional arguments that dbConnect
will need to specify network address, authentication,etc, can be passed to svydesign
.
> library(survey) > library(RSQLite) Loading required package: DBI > > dhanes<-svydesign(id=~SDPPSU6, strat=~SDPSTRA6, weight=~WTPFQX6, nest=TRUE, data="set1", dbname='~/nhanes/imp.db', dbtype="SQLite") > > dhanes DB-backed Stratified 1 - level Cluster Sampling design (with replacement) With (98) clusters. svydesign(ids = ids, probs = probs, strata = strata, data = design.data, fpc = fpc, variables = variables, nest = nest, check.strata = check.strata, weights = weights)Despite containing 33994 records on 268 variables, the object is relatively small,about 4Mb
> dim(dhanes) [1] 33994 268 > object.size(dhanes) [1] 4363168Most of the 4Mb is the design meta-data: weights, and PSU and strata identifiers. The entire data set would be 75Mb and reading it in would temporarily require more than 200Mb of memory. The savings are largest when the number of variables is large: the size of the object is proportional to the number of records but independent of the number of variables.
If we had the data in a database with an ODBC interface, as would usually be the case on Windows, the svydesign
call would be very similar. We use dbtype="ODBC"
to specify the ODBC interface and for the dbname
argument we use the declared DSN (data source name) for the database (declared in the ODBC applet in the Control Panel).
odhanes<-svydesign(id=~SDPPSU6, strat=~SDPSTRA6, weight=~WTPFQX6, nest=TRUE, data="set1", dbname='nhanes3', dbtype="ODBC")We can calculate some simple means and proportions for height, weight, and self-rated health
> svymean(~factor(HAB1MI), dhanes) mean SE factor(HAB1MI)-9 0.252690 0.0045 factor(HAB1MI)1 0.155718 0.0053 factor(HAB1MI)2 0.229727 0.0055 factor(HAB1MI)3 0.244180 0.0053 factor(HAB1MI)4 0.094067 0.0042 factor(HAB1MI)5 0.023618 0.0012 > svyby(~BMPWTMI,~factor(HAB1MI), design=dhanes, svymean) factor(HAB1MI) statistics.BMPWTMI se.BMPWTMI -9 -9 33.66272 0.4445430 1 1 72.12343 0.4095566 2 2 74.28381 0.3574284 3 3 76.24362 0.5225713 4 4 76.67077 0.6193902 5 5 74.04166 0.9048933 > svycoplot(BMPWTMI~HSAGEIR|HAB1MI, dhanes)
The people in poor health tend to be older, but not to have much difference in weight. The unavailable self-rated health is for children.
Hematocrit (red blood cell concentration) is slightly lower for people poor health (adjusted for age), and the difference between levels 1 and 5 is statistically significant.
> m<-svyglm(HTPMI~HSAGEIR+factor(HAB1MI),design=dhanes) > summary(m) Call: svyglm(HTPMI ~ HSAGEIR + factor(HAB1MI), design = dhanes) Survey design: svydesign(ids = ids, probs = probs, strata = strata, data = design.data, fpc = fpc, variables = variables, nest = nest, check.strata = check.strata, weights = weights) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 36.1120776 0.1441689 250.484 <2e-16 *** HSAGEIR -0.0001076 0.0020744 -0.052 0.959 factor(HAB1MI)1 5.8721946 0.1654730 35.487 <2e-16 *** factor(HAB1MI)2 5.8431314 0.1417736 41.215 <2e-16 *** factor(HAB1MI)3 5.8879374 0.1760945 33.436 <2e-16 *** factor(HAB1MI)4 5.4271810 0.1805892 30.053 <2e-16 *** factor(HAB1MI)5 5.0784499 0.2903547 17.491 <2e-16 *** --- > svycontrast(m, c(`factor(HAB1MI)1`=1, `factor(HAB1MI)5`=-1)) contrast SE contrast 0.79374 0.2416