brglm2 provides tools for the estimation and inference from generalized linear models using various methods for bias reduction or maximum penalized likelihood with powers of the Jeffreys prior as penalty. Reduction of estimation bias is achieved either through the mean-bias reducing adjusted score equations in Firth (1993) and I. Kosmidis and Firth (2009), or through the direct subtraction of an estimate of the bias of the maximum likelihood estimator from the maximum likelihood estimates as prescribed in Cordeiro and McCullagh (1991), or through the median-bias reducing adjusted score equations in Kenne Pagui, Salvan, and Sartori (2017).
In the special case of generalized linear models for binomial, Poisson and multinomial responses (both nominal and ordinal), mean and median bias reduction and maximum penalized likelihood return estimates with improved frequentist properties, that are also always finite, even in cases where the maximum likelihood estimates are infinite, like in complete and quasi-complete separation as defined in Albert and Anderson (1984).
The workhorse function is brglmFit()
,
which can be passed directly to the method
argument of the
glm
function. brglmFit
implements a quasi-Fisher
scoring procedure, whose special cases result in various explicit
and implicit bias reduction methods for generalized linear models (the classification of bias reduction methods into
explicit and implicit is given in I. Kosmidis 2014).
This vignette
The bias-reducing quasi Fisher scoring iteration is also described in detail in the bias vignette of the enrichwith R package. I. Kosmidis and Firth (2010) describe a parallel quasi Newton-Raphson procedure.
Most of the material in this vignette comes from a presentation by Ioannis Kosmidis at the useR! 2016 international conference at University of Stanford on 16 June 2016. The presentation was titled “Reduced-bias inference in generalized linear models” and can be watched online at this link.
Suppose that y1,…,yn are observations on independent random variables Y1,…,Yn, each with probability density/mass function of the form fYi(y)=exp{yθi−b(θi)−c1(y)ϕ/mi−12a(−miϕ)+c2(y)} for some sufficiently smooth functions b(.), c1(.), a(.) and c2(.), and fixed observation weights m1,…,mn. The expected value and the variance of Yi are then E(Yi)=μi=b′(θi)Var(Yi)=ϕmib″ Hence, in this parameterization, \phi is a dispersion parameter.
A generalized linear model links the mean \mu_i to a linear predictor \eta_i as g(\mu_i) = \eta_i = \sum_{t=1}^p \beta_t x_{it} where g(.) is a monotone, sufficiently smooth link function, taking values on \Re, x_{it} is the (i,t)th component of a model matrix X, and \beta = (\beta_1, \ldots, \beta_p)^\top.
Suppressing the dependence of the various quantities on the model parameters and the data, the derivatives of the log-likelihood about \beta and \phi (score functions) are \begin{align*} s_\beta & = \frac{1}{\phi}X^TWD^{-1}(y - \mu) \\ s_\phi & = \frac{1}{2\phi^2}\sum_{i = 1}^n (q_i - \rho_i) \end{align*} with y = (y_1, \ldots, y_n)^\top, \mu = (\mu_1, \ldots, \mu_n)^\top, W = {\rm diag}\left\{w_1, \ldots, w_n\right\} and D = {\rm diag}\left\{d_1, \ldots, d_n\right\}, where w_i = m_i d_i^2/v_i is the ith working weight, with d_i = d\mu_i/d\eta_i and v_i = V(\mu_i). Furthermore, q_i = -2 m_i \{y_i\theta_i - b(\theta_i) - c_1(y_i)\} and \rho_i = m_i a'_i with a'_i = a'(-m_i/\phi). The expected information matrix about \beta and \phi is i = \left[ \begin{array}{cc} i_{\beta\beta} & 0_p \\ 0_p^\top & i_{\phi\phi} \end{array} \right] = \left[ \begin{array}{cc} \frac{1}{\phi} X^\top W X & 0_p \\ 0_p^\top & \frac{1}{2\phi^4}\sum_{i = 1}^n m_i^2 a''_i \end{array} \right]\,, where 0_p is a p-vector of zeros, and a''_i = a''(-m_i/\phi).
The maximum likelihood estimators \hat\beta and \hat\phi of \beta and \phi, respectively, can be found by the solution of the score equations s_\beta = 0_p and s_\phi = 0.
Let A_\beta = -i_{\beta\beta} b_\beta and A_\phi = -i_{\phi\phi} b_\phi, where b_\beta and b_\phi are the first terms in the expansion of the mean bias of the maximum likelihood estimator of the regression parameters \beta and dispersion \phi, respectively. The results in Firth (1993) can be used to show that the solution of the adjusted score equations \begin{align*} s_\beta + A_\beta & = 0_p \\ s_\phi + A_\phi & = 0 \end{align*} results in estimators \tilde\beta and \tilde\phi with bias of smaller asymptotic order than the maximum likelihood estimator.
The results in either I. Kosmidis and Firth
(2009) or Cordeiro and McCullagh
(1991) can then be used to re-express the adjustments in forms
that are convenient for implementation. In particular, and after some
algebra the bias-reducing adjustments for generalized linear models are
\begin{align*}
A_\beta & = X^\top W \xi \,, \\
A_\phi & = \frac{(p - 2)}{2\phi} + \frac{\sum_{i = 1}^n m_i^3
a'''_i}{2\phi^2\sum_{i = 1}^n m_i^2
a''_i}
%A_\phi & = \frac{(p - 2)\phi\sum_{i = 1}^n m_i^2
% a''(-m_i/\phi) + \sum_{i = 1}^n m_i^3
% a'''(-m_i/\phi))}{2\phi^2\sum_{i = 1}^n
m_i^2
% a''(-m_i/\phi)}
\end{align*} where \xi = (\xi_1,
\ldots, \xi_n)^T with \xi_i =
h_id_i'/(2d_iw_i), d_i' =
d^2\mu_i/d\eta_i^2, a''_i =
a''(-m_i/\phi), a'''_i =
a'''(-m_i/\phi), and h_i is the “hat” value for the ith observation (see,
e.g. ?hatvalues
).
The results in Kenne Pagui, Salvan, and Sartori (2017) can be used to show that if \begin{align*} A_\beta & = X^\top W (\xi + X u) \\ A_\phi & = \frac{p}{2\phi}+\frac{ \sum_{i = 1}^n m_i^3 a'''_i}{6\phi^2\sum_{i = 1}^n m_i^2 a''_i} \, , \end{align*} then the solution of the adjusted score equations s_\beta + A_\beta = 0_p and s_\phi + A_\phi = 0 results in estimators \tilde\beta and \tilde\phi with median bias of smaller asymptotic order than the maximum likelihood estimator. In the above expression, u = (u_1, \ldots, u_p)^\top with \begin{align*} u_j = [(X^\top W X)^{-1}]_{j}^\top X^\top \left[ \begin{array}{c} \tilde{h}_{j,1} \left\{d_1 v'_1 / (6 v_1) - d'_1/(2 d_1)\right\} \\ \vdots \\ \tilde{h}_{j,n} \left\{d_n v'_n / (6 v_n) - d'_n/(2 d_n)\right\} \end{array} \right] \end{align*} where [A]_j denotes the jth row of matrix A as a column vector, v'_i = V'(\mu_i), and \tilde{h}_{j,i} is the ith diagonal element of X K_j X^T W, with K_j = [(X^\top W X)^{-1}]_{j} [(X^\top W X)^{-1}]_{j}^\top / [(X^\top W X)^{-1}]_{jj}.
The results in Ioannis Kosmidis, Kenne Pagui, and Sartori (2020) can be used to show that if \begin{align*} A_\beta & = X^\top W \xi \,, \\ A_\phi & = \frac{p}{2\phi}+\frac{ \sum_{i = 1}^n m_i^3 a'''_i}{6\phi^2\sum_{i = 1}^n m_i^2 a''_i} \, , \end{align*} then the solution of the adjusted score equations s_\beta + A_\beta = 0_p and s_\phi + A_\phi = 0 results in estimators \tilde\beta with mean bias of small asymptotic order than the maximum likelihood estimator and \tilde\phi with median bias of smaller asymptotic order than the maximum likelihood estimator.
The likelihood penalized by a power of the Jeffreys prior |i_{\beta\beta}|^a |i_{\phi\phi}|^a \quad a > 0 can be maximized by solving the adjusted score equations s_\beta + A_\beta = 0_p and s_\phi + A_\phi = 0 with \begin{align*} A_\beta & = X^\top W \rho \,, \\ A_\phi & = -\frac{p + 4}{2\phi}+\frac{ \sum_{i = 1}^n m_i^3 a'''_i}{6\phi^2\sum_{i = 1}^n m_i^2 a''_i} \, , \end{align*} where \rho = (\rho_1, \ldots, \rho_n)^T with \rho_i = h_i \{2 d_i'/(d_i w_i) - v_i' d_i/(v_i w_i)\}.
brglmFit
brglmFit()
implements a quasi Fisher scoring procedure
for solving the adjusted score equations s_\beta + A_\beta = 0_p and s_\phi + A_\phi = 0. The iteration
consists of an outer loop and an inner loop that implements
step-halving. The algorithm is as follows:
Initialize outer loop
k \leftarrow 0
\upsilon_\beta^{(0)} \leftarrow \left\{i_{\beta\beta}\left(\beta^{(0)}, \phi^{(0)}\right)\right\}^{-1} \left\{s_\beta\left(\beta^{(0)}, \phi^{(0)}\right) + A_\beta\left(\beta^{(0)}, \phi^{(0)}\right)\right\}
\upsilon_\phi^{(0)} \leftarrow \left\{i_{\phi\phi}\left(\beta^{(0)}, \phi^{(0)}\right)\right\}^{-1} \left\{s_\phi\left(\beta^{(0)}, \phi^{(0)}\right) + A_\phi\left(\beta^{(0)}, \phi^{(0)}\right)\right\}
Initialize inner loop
m \leftarrow 0
b^{(m)} \leftarrow \beta^{(k)}
f^{(m)} \leftarrow \phi^{(k)}
v_\beta^{(m)} \leftarrow \upsilon_\beta^{(k)}
v_\phi^{(m)} \leftarrow \upsilon_\phi^{(k)}
d \leftarrow \left|\left|(v_\beta^{(m)}, v_\phi^{(m)})\right|\right|_\infty
Update parameters
b^{(m + 1)} \leftarrow b^{(m)} + 2^{-m} v_\beta^{(m)}
f^{(m + 1)} \leftarrow f^{(m)} + 2^{-m} v_\phi^{(m)}
Update direction
v_\beta^{(m + 1)} \leftarrow \left\{i_{\beta\beta}\left(b^{(m + 1)}, f^{(m + 1)}\right)\right\}^{-1} \left\{s_\beta\left(b^{(m + 1)}, f^{(m + 1)}\right) + A_\beta\left(b^{(m + 1)}, f^{(m + 1)}\right)\right\}
v_\phi^{(m + 1)} \leftarrow \left\{i_{\phi\phi}\left(b^{(m + 1)}, f^{(m + 1)}\right)\right\}^{-1} \left\{s_\phi\left(b^{(m + 1)}, f^{(m + 1)}\right) + A_\phi\left(b^{(m + 1)}, f^{(m + 1)}\right)\right\}
Continue or break halving within inner loop
if m + 1 < M and \left|\left|(v_\beta^{(m + 1)}, v_\phi^{(m + 1)})\right|\right|_\infty > d
14.1. m \leftarrow m + 1
14.2. GO TO 10
else
15.1. \beta^{(k + 1)} \leftarrow b^{(m + 1)}
15.2. \phi^{(k + 1)} \leftarrow f^{(m + 1)}
15.3. \upsilon_\beta^{(k + 1)} \leftarrow v_b^{(m + 1)}
15.4. \upsilon_\phi^{(k + 1)} \leftarrow v_f^{(m + 1)}
Continue or break outer loop
if k + 1 < K and \left|\left|(\upsilon_\beta^{(k + 1)}, \upsilon_\phi^{(k + 1)})\right|\right|_\infty > \epsilon
16.1 k \leftarrow k + 1
16.2. GO TO 4
else
17.1. \tilde\beta \leftarrow \beta^{(k + 1)}
17.2. \tilde\phi \leftarrow \phi^{(k + 1)}
17.3. STOP
For K = M = 1, \beta^{(0)} = \hat\beta and \phi^{(0)} = \hat\phi, the above
iteration computes the bias-corrected estimates proposed in Cordeiro and McCullagh (1991). This is achieved
when the brglmFit()
function is called with
type = "correction"
(see ?brglmFit
).
The mean-bias reducing adjusted score functions are solved when
the brglmFit()
function is called with
type = "AS_mean"
, and the median-bias reducing adjusted
score functions with type = "AS_median"
(see
?brglmFit
). Estimation using mixed adjustments is through
type = "AS_mixed"
. type = "MPL_Jeffreys"
does
maximum penalized likelihood with a power of the Jeffreys prior as
penalty.
The steps where \phi and the \phi direction are updated are ignored for generalized linear models with known dispersion parameter, like in models with binomial and Poisson responses. Also, in that case, v_\phi^{(.)} and \upsilon_\phi^{(.)} in steps 9, 14 and 16 are set to zero.
The implementation of the adjusted score functions requires ready
implementations of d^2\mu_i/d\eta_i^2, a'(.), a''(.) and a'''(.). The enrichwith
R package is used internally to enrich the base family
and
link-glm
objects with implementations of those functions
(see ?enrich.family
and
?enrich.link-glm
).
The above iteration can be used to implement a variety of additive adjustments to the score function, by supplying the algorithm with appropriate adjustment functions A_\beta and A_\phi
The first version of the vignette has been written by Ioannis
Kosmidis. Eugene Clovis Kenne Pagui and Nicola Sartori contributed the
first version of the section “Median bias-reducing adjusted score
functions”, and Ioannis Kosmidis brought the expressions for the median
bias-reducing adjustments in the reduced form that is shown above and is
implemented in brglmFit()
.
Ioannis Kosmidis, Kenne Pagui, and Sartori (2020) provides more details about mean and median bias reduction in generalized linear models.
If you found this vignette or brglm2, in general,
useful, please consider citing brglm2 and the
associated paper. You can find information on how to do this by typing
citation("brglm2")
.