To allocate a sample among different stages of sampling, the
contributions of the different stages to the variance of an estimator
must be considered. These components of variance generally depend on the
analysis variable and also on the form of the estimator. This vignette
covers some basic variance results for linear estimators in two-stage
and three-stage sampling and how the components can be estimated with
functions in PracTools
. Technical background is in Valliant, Dever, and Kreuter (2018b), ch.9.
First, the package must be loaded with
Alternatively, require(PracTools)
can be used.
Consider a two-stage sample design in which the first-stage units are selected using πps sampling, i.e., with varying probabilities and without replacement. We will also refer to this as ppswor sampling. Elements are selected at the second stage via simple random sampling without replacement (srswor). Quite a bit of notation is needed, even in this fairly simple case:
U = universe of PSUs
M = number of PSUs in universe
Ui = universe of elements in PSU i
Ni = number of elements in the population for PSU i
N=∑i∈UNi is the total number of elements in the population
πi = selection probability of PSU i
πij = joint selection probability of PSUs i and j
m = number of sample PSUs
ni = number of sample elements in PSU i
s = set of sample PSUs
si = set of sample elements in PSU i
yk = analysis variable for element k in PSU i (subscript i is implied)
ˉyU = mean per element in the population
ˉyUi = mean per element in the population in PSU i
The π-estimator of the population total, tU=∑i∈U∑k∈Uiyk, of an analysis variable y is
ˆtπ=∑i∈sˆtiπi
where ˆti=(Ni/ni)∑k∈siyk, which is the estimate of the total for PSU i with a simple random sample. The design variance of the estimated total can be written as the sum of two components: V(ˆtπ)=∑i∈U∑j∈U(πij−πiπj)tiπitjπj+∑i∈UN2iπini(1−niNi)S2U2i where
S2U2i=∑k∈Ui(yk−ˉyUi)2/(Ni−1)
is the unit variance of y among the elements in PSU i.
Formula (1) is difficult or impossible to use for sample size computations because the number of PSUs in the sample is not exposed. Another is to analyze srswor sampling of PSUs and SSUs as in Example 1 below. Determining sample sizes this way does not mean that you are necessarily locked into selecting PSUs and elements within PSUs via srswor or srswr. Basing sample sizes on a design that is less complicated than the one that will actually be used is a common approach, although it can be deceptive for some analysis variables.
Suppose the first stage is an srswor of m out of M PSUs and the second stage is a sample of ni elements selected by srswor from the population of Ni. The π-estimator is
ˆtπ=Mm∑i∈sNini∑k∈siyk
Its variance is equal to V(ˆtπ)=M2mM−mMS2U1+Mm∑i∈UN2iniNi−niNiS2U2i where S2U1=∑i∈U(ti−ˉtU)2M−1 with ti being the population total of y in PSU i and ˉtU=∑i∈Uti/M is the mean total per PSU.
If ˉn elements are selected in each PSU and the sampling fractions of PSUs and elements within PSUs are all small, then the relvariance can be written as V(ˆtπ)t2U=B2m+W2mˉn
where B2=S2U1/ˉt2U=M2S2U1/t2U is the unit
relvariance among PSU totals and W2=M∑i∈UN2iS2U2i/t2U. The term
B2 is called the “between
(PSU) component” while W2
is the “within component”. Expression (3) is the form used in the R function,
BW2stageSRS
. Textbooks often list a specialized form of
(3) that
requires that all PSUs have the same size, Ni≡ˉN, and that ˉn elements are selected in
each. In that case, the second-stage sampling fraction is ˉn/ˉN. This implies that
the sample is self-weighting: πiπk|i=mˉn/MˉN. The
relvariance based on (2) then simplifies to the less general form
V(ˆtπ)t2U=1mM−mMB2+1mˉnˉN−ˉnˉNW2
where W2=1Mˉy2U∑i∈US2U2i.
Assuming that ˉn elements are selected in each sample PSU, and m/M and ˉn/Ni are both small, the more general form of the relvariance in (3) can also be written in terms of a measure of homogeneity δ as follows:
V(ˆtπ)t2U≐˜Vmˉnk[1+δ(ˉn−1)] where ˜V=S2U/ˉy2U, k=(B2+W2)/˜V, and
δ=B2B2+W2. With some effort, it can be shown that when Ni=ˉN and both M and ˉN are large,
S2Uˉy2U=1ˉy2U∑i∈U∑k∈Ui(yk−ˉyU)2(N−1)≐B2+W2 i.e., the population relvariance can be written as the sum of between and within relvariances. If k=1, (4) equals the expression found in many textbooks. However, when the population count of elements per cluster varies, k may be far from 1, as will be illustrated in an example below. In those cases, (4) with an estimate of the actual k should be used for determining sample sizes and computing advance estimates of coefficients of variation.
Expressions (3) and (4) are useful for sample size calculation since the number of sample PSUs and sample units per PSU are explicit in the formula. Equation (4) also connects the variance of the estimated total to the variance that would be obtained from a simple random sample since ˜V/mˉn is the relvariance of the estimated total in an srswor of size mˉn when the sampling fraction is small. The product k[1+δ(ˉn−1)] is a type of design effect. When k=1, the term 1+δ(ˉn−1) is the approximate design effect found in many textbooks.
The next example uses the MDarea.popA
from
PracTools
. This dataset is based on the U.S. Census counts
from the year 2000 for Anne Arundel County in the US state of Maryland.
The geographic divisions used in this dataset are called tracts and
block groups. Tracts are constructed by the US Census Bureau to have a
desired population size of 4,000 people. Block groups (BGs) are smaller
with a target size of 1,500 people. Counts of persons in the dataset are
the same for most tracts and block groups as in the 2000 Census.
BW2stageSRS
will calculate the unit relvariance of a population, B2+W2 for comparison, the
ratio k=(B2+W2)/(S2U/ˉy2U), and the full version of
δ in (5). The function
assumes that the entire sampling frame is an input. The full R code for
this example is in the file Example 9.2.R
, available at
Valliant, Dever, and Kreuter (2018a). We
first compute the results using the PSU
and
SSU
variables as clusters. These fields are created so that
all PSU
s have the same size; likewise, all
SSU
s have the same size. For the variable y1
in the Maryland population, the code is require(PracTools)
data(MDarea.popA)
BW2stageSRS(MDarea.popA$y1, psuID=MDarea.popA$PSU)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.007979563 1.455575239 1.463117997 1.463554802 1.000298544 0.005452179
BW2stageSRS(MDarea.popA$y1, psuID=MDarea.popA$SSU)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.03661379 1.42824599 1.46311800 1.46485978 1.00119046 0.02499474
The values of δ are
0.005 for PSU
and 0.025 for SSU
. Next, to
illustrate the dramatic effect that varying sizes of clusters can have,
we compute the same statistics as above using tracts and block groups
(BGs) within tracts as clusters. These vary substantially in the number
of persons in each cluster. A new variable called trtBG
is
computed since the values of the variable, BLKGROUP
, are
nested within each tract:
trtBG <- 10*MDarea.popA$TRACT + MDarea.popA$BLKGROUP
BW2stageSRS(MDarea.popA$y1, psuID=MDarea.popA$TRACT)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.2603356 1.8405833 1.4631180 2.1009189 1.4359190 0.1239151
BW2stageSRS(MDarea.popA$y1, psuID=trtBG)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.3495410 1.9488250 1.4631180 2.2983660 1.5708685 0.1520824
The value of δ is
0.124 TRACT
s are clusters and 0.152 when trtBG
defines clusters. The measures of homogeneity increase substantially
when tracts or BGs are clusters compared to the PSU
and
SSU
results. This is entirely due to the increase in B2 when units with highly
variable sizes are used and an srs is selected. For example,
B2=0.0079 for
y1
when PSU
is a cluster but is 0.2605 when
TRACT
is a cluster.
Variances of estimators in two-stage designs more complicated than simple random sampling at each stage can be written as a sum of components. However, these have limited usefulness in determining sample sizes for the same reason that (1) is not. A more convenient formulation is the case where PSUs are selected with varying probabilities but with replacement, and the sample within each PSU is selected by srswor. With-replacement designs may not often be used in practice but have simple variance formulae. The pwr-estimator of a total (Särndal, Swensson, and Wretman 1992) is ˆtpwr=1m∑i∈sˆtipi where ˆti=Nini∑k∈siyk is the estimated total for PSU i from a simple random sample and pi is the one-draw selection probability of PSU i. The variance of ˆtpwr is
V(ˆtpwr)=1m∑i∈Upi(tipi−tU)2+∑i∈UN2impini(1−niNi)S2U2i.
Making the assumption that ˉn elements are selected in each PSU, the variance reduces to V(ˆtpwr)=S2U1(pwr)m+1mˉn∑i∈U(1−ˉnNi)N2iS2U2ipi where, in this case, S2U1(pwr)=∑i∈Upi(tipi−tU)2. Dividing this by t2U and assuming that the within-PSU sampling fraction, ˉn/Ni, is negligible, we obtain the relvariance of ˆtpwr as, approximately, V(ˆtpwr)t2U≐B2m+W2mˉn=˜Vmˉnk[1+δ(ˉn−1)]
with ˜V=S2U/ˉy2U, k=(B2+W2)/˜V, B2=S2U1(pwr)t2U,
W2=1t2U∑i∈UN2iS2U2ipi,
δ=B2/(B2+W2)
Expression (7) has the same form as (4) but with different definitions of B2 and W2. Expression (7) also has the interpretation of an srs variance of an unclustered variance, ˜V/mˉn, times a design effect, k[1+δ(ˉn−1)], in the same way that (4) did.
BW2stagePPS
computes
the population values of B2, W2, and
δ shown in (8), (9), and (10) which are
appropriate for ppswr sampling of clusters. The code for
y1
using PSU
or SSU
as clusters
is shown below. The variables, pp.PSU
and
pp.SSU
, hold the one-draw probabilities pi that appear in (6): pp.PSU <- table(MDarea.popA$PSU) / nrow(MDarea.popA)
pp.SSU <- table(MDarea.popA$SSU) / nrow(MDarea.popA)
BW2stagePPS(MDarea.popA$y1, pp=pp.PSU, psuID=MDarea.popA$PSU)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.007878320 1.455574516 1.463117997 1.463452835 1.000228853 0.005383378
BW2stagePPS(MDarea.popA$y1, pp=pp.SSU, psuID=MDarea.popA$SSU)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.03652265 1.42825477 1.46311800 1.46477742 1.00113417 0.02493392
The code for PSUs that are tracts and block groups is
pp.trt <- table(MDarea.popA$TRACT) / nrow(MDarea.popA)
pp.BG <- table(trtBG) / nrow(MDarea.popA)
BW2stagePPS(MDarea.popA$y1, pp=pp.trt, psuID=MDarea.popA$TRACT)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.009458290 1.454058071 1.463117997 1.463516361 1.000272271 0.006462716
BW2stagePPS(MDarea.popA$y1, pp=pp.BG, psuID=trtBG)
#> B2 W2 unit relvar B2+W2 k delta
#> 0.01643324 1.44797251 1.46311800 1.46440575 1.00088014 0.01122178
The between term when clusters are defined by PSU
is
about the same as when clusters are selected by srs because
PSU
’s all have the same size. With PSUs being either tracts
or block groups in the ppswr/srswor design, the between term is
much smaller than the within, compared to the results in the
srs/srs example. For example, with y1
and
srs sampling of tracts, B2=0.2604 but for pps sampling of tracts B2=0.0091.
When clusters are selected by srs, S2U1 is the variance of the cluster totals around the average cluster total. In contrast, with pps sampling of clusters, S2U1(pwr) is the variance of the estimated population totals, ti/pi around the population total, tU. When clusters are selected with probability proportional to Ni, then ti/pi=NiˉyUi. If these one-cluster estimates of the population total are fairly accurate, as they are here, the B2 term can be quite small. This leads to much smaller values of δ in pps sampling of clusters. This implies that the negative effect of clustering on the variance is lessened for a design that selects clusters with pp(Ni). This kind of comparison explains most practitioners’ preference for pps sampling of clusters, especially when the clusters vary in population size.
In the case of with-replacement sampling of PSUs with varying probabilities and srswor at the second and third stages, the relvariance can be written (with a few assumptions) in a form useful for sample size calculations. Treating the case where SSUs are selected via srs (either with or without replacement) is not too unrealistic since SSUs (like block groups) are often created to have about the same population sizes.
The variance formulae for a three-stage design with ppswor selection of first-stage units is complex enough that it is not useful for sample size planning. See Valliant, Dever, and Kreuter (2018b), sec. 9.2.4 for details. To obtain a simpler formula, suppose that ˉn SSUs are sampled in each sample PSU, the sampling fractions of SSUs in each PSU, ˉn/Ni, are small, and ˉˉq elements are selected in each sample SSU. The relvariance of the pwr-estimator is then
V(ˆtpwr)t2U=B2m+W22mˉn+W23mˉnˉˉq,
where B2=S2U1(pwr)/t2U is given by (8),
W22=1t2U∑i∈UN2iS2U2i/pi;
W23=1t2U∑i∈UNipi∑j∈UiQ2ijS2U3ij.
The relvariance can also be written in terms of two measures of homogeneity:
V(ˆtpwr)t2U=˜Vmˉnˉˉq{k1δ1ˉnˉˉq+k2[1+δ2(ˉˉq−1)]} where
k1=(B2+W2)/˜V with ˜V=1Q−1∑i∈U∑j∈Ui∑k∈Uij(yk−ˉyU)2/ˉy2U is the unit relvariance of y in the population.
k2=(W22+W23)/˜V
δ1=B2/(B2+W2)
W2=1t2U∑i∈UQ2iS2U3i/pi with S2U3i=1Qi−1∑j∈Ui∑k∈Uij(yk−ˉyUi)2 and ˉyUi=∑j∈Ui∑k∈Uijyk/Qi, i.e., S2U3i is the element-level variance among all elements in PSU i
δ2=W22/(W22+W23)
Note that the term W2 in δ1 does not enter the variance in (11) but is defined by analogy to the term in two-stage sampling. If elements were selected directly from the sample PSUs (instead of first sampling SSUs), then W2 above would be the appropriate within-PSU component.
The term δ1 is a measure of the homogeneity among the PSU totals. If the estimate of the population total from each PSU total, ti/pi, was exactly equal to the population total, tU, then B2=0 and δ1=0. That is, if the variation within PSUs is much larger than the variation among PSU totals, then δ1 will be small; this is the typical situation in household surveys if PSUs all have about the same number of elements. As we saw in the earlier example, the condition of equal-sized PSUs can be critically important to insure that B2 is small.
If the SSUs all have about the same totals, tij, then W22 will be small and δ2≐0. Although attempts may be made to create SSUs that have about the same number of elements Qij, the totals tij of other variables tend to vary, leading to values of δ2 that are larger than those of δ1.
The R function, BW3stagePPS
, will calculate B2, W2, W22, W23, δ1, and δ2 defined above for
ppswr/srs/srs and srswr/srs/srs sampling. The function
is appropriate if an entire frame is available and takes the following
parameters:
Parameter | Description |
---|---|
X |
data vector; length is the number of elements in the population. |
pp |
vector of one-draw probabilities for the PSUs; length is number of PSUs in population. |
psuID |
vector of PSU identification numbers. This vector must be as long as X. Each element in a given PSU should have the same value in psuID. PSUs must be in the same order as in X. |
ssuID |
vector of SSU identification numbers. This vector must
be as long as X. Each element in a given SSU should have the same value
in ssuID. PSUs and SSUs must be in the same order as in X. ssuID should
have the form psuID||(ssuID within PSU) . |
BW3stagePPS
for the variable
y1
in an srswr/srs/srs design is: M <- length(unique(MDarea.popA$TRACT))
trtBG <- 10*MDarea.popA$TRACT + MDarea.popA$BLKGROUP
pp.trt <- rep(1/M,M)
BW3stagePPS(X=MDarea.popA$y1, pp=pp.trt,
psuID=MDarea.popA$TRACT, ssuID=trtBG)
#> B W W2 W3 unit relvar k1
#> 0.2575952 1.8405833 0.2714429 2.1074463 1.4631180 1.4340460
#> k2 delta1 delta2
#> 1.6259039 0.1227709 0.1141049
We repeat the calculation but assuming ppswr sampling of
PSUs. The calculation for y1
using tracts and block groups
as the first- and second-stage sampling units is done via this call:
trtBG <- 10*MDarea.popA$TRACT + MDarea.popA$BLKGROUP
pp.trt <- table(MDarea.popA$TRACT) / nrow(MDarea.popA)
BW3stagePPS(X=MDarea.popA$y1, pp=pp.trt,
psuID=MDarea.popA$TRACT, ssuID=trtBG)
#> B W W2 W3 unit relvar k1
#> 0.009458290 1.454058071 0.272308839 1.685246592 1.463117997 1.000272271
#> k2 delta1 delta2
#> 1.337934080 0.006462716 0.139106579
Notice that δ1=0.123 with srs sampling of tracts but is 0.006 when tracts are sampled proportional to their population sizes.
An important practical, sample design problem that we do not cover in
this vignette is how to estimate variance components and measures of
homogeneity from a complex, multistage sample. This topic is covered in
detail in section 9.4 of Valliant, Dever, and
Kreuter (2018b). The PracTools
package includes a
variety of other functions relevant to two- and three-stage sampling
that are also not discussed in this vignette:
Function | Description |
---|---|
BW2stagePPSe | Estimate components of relvariance for a sample design where primary sampling units (PSUs) are selected with pps and elements are selected via srs. The input is a sample selected in this way. |
BW3stagePPSe | Estimate components of relvariance for a sample design where primary sampling units (PSUs) are selected with probability proportional to size with replacement (ppswr) and secondary sampling units (SSUs) and elements within SSUs are selected via simple random sampling (srs). The input is a sample selected in this way. |
clusOpt2 | Compute the sample sizes that minimize the variance of the pwr-estimator of a total in a two-stage sample. |
clusOpt2fixedPSU | Compute the optimum number of sample elements per primary sampling unit (PSU) for a fixed set of PSUs. |
clusOpt3 | Compute the sample sizes that minimize the variance of the pwr-estimator of a total in a three-stage sample. |
clusOpt3fixedPSU | Compute the sample sizes that minimize the variance of the pwr-estimator of a total in a three-stage sample when the PSU sample is fixed. |
CVcalc2 | Compute the coefficient of variation of an estimated total in a two-stage design. Primary sampling units (PSUs) can be selected either with probability proportional to size (pps) or with equal probability. Elements are selected via simple random sampling (srs). |
CVcalc3 | Compute the coefficient of variation of an estimated total in a three-stage design. Primary sampling units (PSUs) can be selected either with probability proportional to size (pps) or with equal probability. Secondary units and elements within SSUs are selected via simple random sampling (srs). |
deff | Compute the Kish, Henry, Spencer, or Chen-Rust design effects. |