Exponential conditional mean models with endogeneity

The linear conditional mean model

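The linear model $y_n = \beta^\top x_n + \epsilon_n$, with $E(\epsilon \mid x) = 0$, implies a linear conditional mean function: $E(y \mid x) = \beta^\top x$. It implies the following moment conditions:

$$E\left((y - \beta^\top x)\,x\right) = 0$$

which can be estimated on a given sample by:

$$\frac{1}{N}\sum_{n=1}^N (y_n - \beta^\top x_n)\,x_n = X^\top (y - X\beta)/N$$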

Solving this vector of K empirical moments for β leads to the OLS estimator.
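As a quick numerical check of this statement, the following sketch solves the empirical moments directly and compares the result with lm(). The data are simulated and all object names (X, b_mom, m_bar) are illustrative assumptions, not part of the package.

```r
set.seed(1)
N <- 200
X <- cbind(1, rnorm(N), rnorm(N))          # intercept plus two covariates (simulated)
beta <- c(1, 0.5, -0.3)
y <- drop(X %*% beta) + rnorm(N)

# Solving X'(y - X b) = 0 amounts to solving (X'X) b = X'y
b_mom <- drop(solve(crossprod(X), crossprod(X, y)))

# Same coefficients from lm() (no extra intercept: X already contains one)
b_lm <- unname(coef(lm(y ~ X - 1)))

# The empirical moments evaluated at the solution are numerically zero
m_bar <- drop(crossprod(X, y - X %*% b_mom)) / N
```

Comparing b_mom with b_lm with all.equal() should show that the two sets of coefficients coincide up to numerical precision.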

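If some of the covariates are endogenous, a consistent estimator can be obtained if there is a set of $L \geq K$ exogenous series $Z$, some of which may be elements of $X$. Then, the $L$ moment conditions can be estimated by:

$$\bar{m} = \frac{1}{N}\sum_{n=1}^N (y_n - \beta^\top x_n)\,z_n = Z^\top \epsilon / N$$

The variance of $\sqrt{N}\bar{m}$ is $\Omega = N\,E(\bar{m}\bar{m}^\top) = \frac{1}{N}E(Z^\top \epsilon\epsilon^\top Z)$, which reduces to $\sigma_\epsilon^2 Z^\top Z / N$ if the errors are iid and, more generally, can be consistently estimated by $\frac{1}{N}\sum_{n=1}^N \hat\epsilon_n^2 z_n z_n^\top$, where $\hat\epsilon$ denotes the residuals of a consistent estimation.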

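The IV estimator minimises the quadratic form of the moments weighted by the inverse of their variance, assuming that the errors are iid. Up to a multiplicative constant, this objective is:

$$\epsilon^\top Z (\sigma^2 Z^\top Z)^{-1} Z^\top \epsilon = \epsilon^\top P_Z \epsilon / \sigma^2$$

where $P_Z = Z(Z^\top Z)^{-1}Z^\top$ is the projection matrix on the subspace generated by the columns of $Z$. Its minimiser, $\hat\beta = (X^\top P_Z X)^{-1} X^\top P_Z y$, is the IV estimator. As $P_Z X$ is the projection of the columns of $X$ on this subspace, the estimator can be computed by first regressing every covariate on the set of instruments and then regressing the response on the fitted values (2SLS).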
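The equivalence between the projection formula and the two-stage procedure can be illustrated on simulated data. This is only a sketch: the data-generating process, the instruments z1 and z2 and the object names are assumptions made for the example.

```r
set.seed(2)
N <- 500
z1 <- rnorm(N); z2 <- rnorm(N)
u <- rnorm(N)                               # unobserved confounder
x <- 0.8 * z1 + 0.5 * z2 + u + rnorm(N)     # endogenous covariate: depends on u
y <- 1 + 0.5 * x + u + rnorm(N)             # the error (u + noise) is correlated with x

X <- cbind(1, x)
Z <- cbind(1, z1, z2)                       # L = 3 instruments, K = 2 covariates

# Closed form: (X' P_Z X)^{-1} X' P_Z y, with P_Z = Z (Z'Z)^{-1} Z'
PZ <- Z %*% solve(crossprod(Z), t(Z))
b_iv <- unname(drop(solve(t(X) %*% PZ %*% X, t(X) %*% PZ %*% y)))

# Two stages: fitted values of x on the instruments, then y on these fitted values
x_hat <- fitted(lm(x ~ z1 + z2))
b_2sls <- unname(coef(lm(y ~ x_hat)))

# OLS for comparison (inconsistent in this design)
b_ols <- unname(coef(lm(y ~ x)))
```

Both routes give the same point estimates. Note that the standard errors reported by the second-stage lm() are not the correct IV standard errors, which is one reason to prefer a dedicated routine in practice.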

The IV estimator is consistent but inefficient if the errors are not iid. In this case, a more efficient estimator can be obtained by minimizing $\bar{m}^\top \hat\Omega^{-1} \bar{m}$ with:

$$\hat\Omega = \frac{1}{N}\sum_{n=1}^N \hat\epsilon_n^2 z_n z_n^\top$$

where ˆϵ can be the residuals of the IV estimator. This is the GMM estimator.
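For the linear model, both steps have a closed form, which makes the two-step logic easy to sketch. The example below uses simulated heteroskedastic data, and gmm_lin is a hypothetical helper written for this illustration, not a function of the package.

```r
set.seed(3)
N <- 500
z1 <- rnorm(N); z2 <- rnorm(N); u <- rnorm(N)
x <- 0.8 * z1 + 0.5 * z2 + u + rnorm(N)
y <- 1 + 0.5 * x + (u + rnorm(N)) * exp(0.5 * z1)   # heteroskedastic errors

X <- cbind(1, x)
Z <- cbind(1, z1, z2)

# Minimiser of m' W m for a linear model:
# b = (X'Z W Z'X)^{-1} X'Z W Z'y
gmm_lin <- function(W) {
  XZ <- crossprod(X, Z)
  drop(solve(XZ %*% W %*% t(XZ), XZ %*% W %*% crossprod(Z, y)))
}

# Step 1: IV estimator, weighting matrix proportional to (Z'Z)^{-1}
b_iv <- gmm_lin(solve(crossprod(Z)))

# Step 2: Omega_hat = (1/N) sum_n e_n^2 z_n z_n' built from the IV residuals
e_iv <- drop(y - X %*% b_iv)
Omega_hat <- crossprod(Z * e_iv) / N
b_gmm <- gmm_lin(solve(Omega_hat))
```

The expression crossprod(Z * e_iv) relies on R recycling e_iv down the columns of Z, so that each row $z_n$ is scaled by its own residual.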

The exponential linear conditional mean model

The linear model is often inappropriate if the conditional distribution of $y$ is asymmetric. In this case, a common solution is to use $\ln y$ instead of $y$ as the response:

$$\ln y_n = \beta^\top x_n + \epsilon_n$$

This is of course possible only if $y_n > 0 \;\forall n$. An alternative is to use an exponential linear conditional mean model, with additive errors:

$$y_n = e^{\beta^\top x_n} + \epsilon_n$$

or with multiplicative errors:

$$y_n = e^{\beta^\top x_n}\nu_n$$

If all the covariates are exogenous, $E\left((y - e^{\beta^\top x})\,x\right) = 0$, which corresponds to the following empirical moments:

$$X^\top (y - e^{X\beta})/N = 0$$

This defines a non-linear system of $K$ equations with $K$ unknown parameters ($\beta$), which is in particular the one solved when fitting a Poisson model with a log link for count data. It can also be used with any non-negative response.
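The link with the Poisson model can be checked numerically: at the Poisson maximum likelihood estimate, the empirical moments are zero. The sketch below uses simulated counts; the tighter convergence tolerance passed through glm.control() is only there to make the check sharp.

```r
set.seed(4)
N <- 300
x1 <- rnorm(N)
X <- cbind(1, x1)
y <- rpois(N, exp(0.5 + 0.3 * x1))          # simulated count response

# Poisson model with the (default) log link
fit <- glm(y ~ x1, family = poisson,
           control = glm.control(epsilon = 1e-12))
b <- coef(fit)

# Empirical moments X'(y - exp(X b)) / N evaluated at the estimate
m_bar <- drop(crossprod(X, y - exp(X %*% b))) / N
```

Printing m_bar should return two values that are zero up to the convergence tolerance of the fitting algorithm.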

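If some of the covariates are endogenous, an IV estimator can be defined as previously. For additive errors, the empirical moments are:

$$\frac{1}{N}\sum_{n=1}^N (y_n - e^{\beta^\top x_n})\,z_n = Z^\top (y - e^{X\beta})/N$$

and the IV estimator minimises:

$$\epsilon^\top Z (\sigma^2 Z^\top Z)^{-1} Z^\top \epsilon = \epsilon^\top P_Z \epsilon / \sigma^2$$

where now $\epsilon = y - e^{X\beta}$. Denoting $\hat\epsilon$ the residuals of this regression, the same $\hat\Omega$ matrix can be constructed and used in a second step to get the more efficient GMM estimator.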

With additive errors, the only difference with the linear case is that the minimization results in a set of nonlinear equations, so numerical methods must be used.
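A minimal sketch of this numerical minimization, using base R's optim() on simulated data, is given below. The objective function obj_iv, the starting values and the data-generating process are assumptions made for the illustration; the package's own routine may proceed differently.

```r
set.seed(5)
N <- 1000
z1 <- rnorm(N); z2 <- rnorm(N); u <- rnorm(N)
x <- 0.5 * z1 + 0.5 * z2 + 0.5 * u + rnorm(N, sd = 0.5)
y <- exp(0.2 + 0.3 * x) + u + rnorm(N)      # additive error, correlated with x through u

X <- cbind(1, x)
Z <- cbind(1, z1, z2)
ZZinv <- solve(crossprod(Z))

# IV objective eps' P_Z eps, with eps = y - exp(X b)
obj_iv <- function(b) {
  Ze <- crossprod(Z, y - exp(drop(X %*% b)))
  drop(t(Ze) %*% ZZinv %*% Ze)
}

# No closed form: minimise numerically from a crude starting value
start <- c(log(mean(y)), 0)
fit_iv <- optim(start, obj_iv, method = "BFGS")
fit_iv$par
```

The starting value sets the slope to zero and the intercept to the log of the sample mean, which is a common but arbitrary choice.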

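With multiplicative errors, we have $\nu_n = y_n / e^{\beta^\top x_n}$, with $E(\nu_n \mid x_n) = 1$ if all the covariates are exogenous. Defining $\tau_n = \nu_n - 1$, the moment conditions are then:

$$E\left((y_n / e^{\beta^\top x_n} - 1)\,x_n\right) = 0$$

If some covariates are endogenous, this should be replaced by:

$$E\left((y_n / e^{\beta^\top x_n} - 1)\,z_n\right) = 0$$

which leads to the following empirical moments:

$$\frac{1}{N}\sum_{n=1}^N (y_n / e^{\beta^\top x_n} - 1)\,z_n = Z^\top (y / e^{X\beta} - 1)/N = Z^\top \tau / N$$

Minimizing the quadratic form of these empirical moments with $(Z^\top Z)^{-1}$ or $\left(\sum_{n=1}^N \hat\tau_n^2 z_n z_n^\top\right)^{-1}$ leads respectively to the IV and the GMM estimators.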
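The two estimators for the multiplicative case can be sketched with optim() as follows. The simulated data, the helper functions tau and obj and the scaling of the objective by $N$ are assumptions chosen for the example, not the package's implementation.

```r
set.seed(6)
N <- 1000
z1 <- rnorm(N); z2 <- rnorm(N); u <- rnorm(N)
x <- 0.5 * z1 + 0.5 * z2 + 0.5 * u + rnorm(N, sd = 0.5)
nu <- exp(0.5 * u + rnorm(N, sd = 0.3))     # positive error, correlated with x through u
nu <- nu / mean(nu)                         # normalised to have sample mean 1
y <- exp(0.2 + 0.3 * x) * nu

X <- cbind(1, x)
Z <- cbind(1, z1, z2)

# tau_n = y_n / exp(b'x_n) - 1 and N times the quadratic form m' W m
tau <- function(b) y / exp(drop(X %*% b)) - 1
obj <- function(b, W) {
  m <- crossprod(Z, tau(b)) / N
  N * drop(t(m) %*% W %*% m)
}

# IV: weighting matrix based on Z'Z
start <- c(log(mean(y)), 0)
iv <- optim(start, obj, W = solve(crossprod(Z) / N), method = "BFGS")

# GMM: weighting matrix based on the IV residuals tau_hat
t_hat <- tau(iv$par)
W_gmm <- solve(crossprod(Z * t_hat) / N)
gmm <- optim(iv$par, obj, W = W_gmm, method = "BFGS")
```

Starting the second step at the IV estimate is a natural choice, since the IV estimator is already consistent.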

Sargan test

When the number of external instruments is greater than the number of endogenous variables, the empirical moments cannot be simultaneously set to 0, so a quadratic form of the empirical moments is minimized instead. Under the null hypothesis that all the instruments are exogenous, the value of the objective function at convergence times the sample size asymptotically follows a chi-square distribution, with a number of degrees of freedom equal to the difference between the number of instruments and the number of covariates.
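The computation of the statistic can be sketched by hand for a linear model with valid instruments. Everything below is simulated, gmm_lin is a hypothetical helper, and the statistic is computed with the first-step weighting matrix, which is one possible convention.

```r
set.seed(7)
N <- 500
z1 <- rnorm(N); z2 <- rnorm(N); z3 <- rnorm(N); u <- rnorm(N)
x <- 0.6 * z1 + 0.4 * z2 + 0.4 * z3 + u + rnorm(N)
y <- 1 + 0.5 * x + u + rnorm(N)

X <- cbind(1, x)                            # K = 2 covariates
Z <- cbind(1, z1, z2, z3)                   # L = 4 instruments

# Closed-form linear GMM estimator for a weighting matrix W
gmm_lin <- function(W) {
  XZ <- crossprod(X, Z)
  drop(solve(XZ %*% W %*% t(XZ), XZ %*% W %*% crossprod(Z, y)))
}
b_iv <- gmm_lin(solve(crossprod(Z)))
Omega_hat <- crossprod(Z * drop(y - X %*% b_iv)) / N
b_gmm <- gmm_lin(solve(Omega_hat))

# Objective at the GMM estimate times the sample size
m_bar <- crossprod(Z, y - X %*% b_gmm) / N
J <- N * drop(t(m_bar) %*% solve(Omega_hat) %*% m_bar)

# Degrees of freedom: number of instruments minus number of covariates
df <- ncol(Z) - ncol(X)
p_value <- pchisq(J, df = df, lower.tail = FALSE)
```

Since the instruments are exogenous by construction here, a large p-value is the expected outcome, although any single simulated sample can reject by chance.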

Cigarette smoking behaviour

Mullahy (1997) estimates a demand function for cigarettes which depends on the stock of smoking habits. This variable is quite similar to a lagged dependent variable and is likely to be endogenous, as the unobservable determinants of current smoking behaviour should be correlated with the unobservable determinants of past smoking behaviour. The data set contains observations of 6160 males in 1979 and 1980 from the smoking supplement to the 1979 National Health Interview Survey. The response (cigarettes) is the number of cigarettes smoked daily. The covariates are the habit “stock” (habit), the current state-level average per-pack price of cigarettes (price), a dummy indicating whether the state of residence restricts smoking in restaurants (restaurant), the age (age) and the number of years of schooling (educ) and their squares, the number of family members (famsize), and a dummy (race) which indicates whether the individual is white or not. The external instruments are cubic terms in age and educ and their interaction, the one-year lagged price of a pack of cigarettes (lagprice) and the number of years the state’s restaurant smoking restrictions had been in place (reslgth).

The data set is called cigmales and is lazy loaded when the micsr package is attached.

The starting point is a basic count model, i.e. a Poisson model with a log link:

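library("micsr")
library("dplyr")
# Add the squares, cubes and interaction of age and educ used in the models
cigmales <- cigmales %>%
    mutate(age2 = age ^ 2, educ2 = educ ^ 2,
           age3 = age ^ 3, educ3 = educ ^ 3,
           educage = educ * age)
# Poisson conditional mean with a log link (the default link of quasipoisson)
pois_cig <- glm(cigarettes ~ habit + price + restaurant + income + age + age2 +
                    educ + educ2 + famsize + race, data = cigmales,
                family = quasipoisson)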

The IV and the GMM estimators are provided by the expreg function. Its main argument is a two-part formula, where the first part indicates the covariates and the second part the instruments. The instrument set can be constructed from the covariate set by indicating which series should be omitted (the endogenous variables) and which series should be added (the external instruments).

iv_cig <- expreg(cigarettes ~ habit + price + restaurant + income + age + age2 +
                       educ + educ2 + famsize + race | . - habit + age3 + educ3 +
                       educage + lagprice + reslgth, data = cigmales,
                   twosteps = FALSE)
gmm_cig <- update(iv_cig, twosteps = TRUE)

The twosteps argument is a logical whose default value is TRUE (the GMM estimator); setting it to FALSE yields the IV estimator.
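The way the instrument part of the formula is built from the covariate part is analogous to what base R's update() does with a dot. The following sketch uses a shortened covariate list for readability; how the package parses its two-part formula internally is not shown here and may differ.

```r
# Covariate part of the model (shortened for the illustration)
covariates <- ~ habit + price + restaurant

# "." stands for the covariates; habit is removed, external instruments are added
instruments <- update(covariates, ~ . - habit + lagprice + reslgth)

# Variables that end up in the instrument set
all.vars(instruments)
```

The endogenous variable habit no longer appears in the result, while the exogenous covariates serve as their own instruments.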

The results are presented in the following table:

| | ML | IV | GMM |
|---|---|---|---|
| habit | 0.0055 | 0.0031 | 0.0031 |
| | (88.8791) | (1.2536) | (1.2536) |
| price | −0.0094 | −0.0106 | −0.0106 |
| | (−3.5253) | (−2.3333) | (−2.3333) |
| restaurant | −0.0469 | −0.0431 | −0.0431 |
| | (−1.4949) | (−0.7934) | (−0.7934) |
| income | −0.0028 | −0.0076 | −0.0076 |
| | (−1.7915) | (−2.9016) | (−2.9016) |
| age | 0.0087 | 0.0993 | 0.0993 |
| | (1.5697) | (2.8577) | (2.8577) |
| age2 | −0.0003 | −0.0013 | −0.0013 |
| | (−5.0396) | (−3.6384) | (−3.6384) |
| educ | 0.0353 | 0.1299 | 0.1299 |
| | (1.8457) | (3.0713) | (3.0713) |
| educ2 | −0.0031 | −0.0088 | −0.0088 |
| | (−3.7454) | (−3.9537) | (−3.9537) |
| famsize | −0.0081 | −0.0085 | −0.0085 |
| | (−1.0498) | (−0.6754) | (−0.6754) |
| racewhite | −0.0511 | −0.0311 | −0.0311 |
| | (−1.2462) | (−0.4485) | (−0.4485) |
| Num.Obs. | 6160 | 6160 | 6160 |
| F | 958.818 | | |
| RMSE | 14.38 | | |

sargan(gmm_cig)
## 
##  Sargan Test
## 
## data:  cigmales
## chisq = 32.018, df = 4, p-value = 1.897e-06
## alternative hypothesis: the moment conditions are not valid

Birth weight

The second data set used by Mullahy (1997) consists of 1388 observations on birthweight from the Child Health Supplement to the 1988 National Health Interview Survey. The response is the birthweight in pounds (birthwt) and the covariates are the number of cigarettes smoked daily during the pregnancy (cigarettes), the birth order (parity), a dummy for white women (race) and the child’s sex (sex). Smoking behaviour during the pregnancy is suspected to be correlated with some other unobserved “bad habits” that may have a negative effect on birthweight. Therefore, performing a pseudo-Poisson regression should result in an upward bias in the estimation of the effect of smoking on birthweight. The external instruments are the number of years of education of the father (edfather) and the mother (edmother), the family income (faminc) and the per-pack state excise tax on cigarettes (cigtax).

ml_bwt <- glm(birthwt ~ cigarettes + parity + race + sex, data = birthwt,
              family = quasipoisson)
iv_bwt <- expreg(birthwt ~ cigarettes + parity + race + sex |
                      . - cigarettes + edmother + edfather + faminc + cigtax,
                  data = birthwt, method = "iv")
gmm_bwt <- update(iv_bwt, method = "gmm")

The results are presented in the following table:

sargan(gmm_bwt)
## 
##  Sargan Test
## 
## data:  birthwt
## chisq = 3.8743, df = 3, p-value = 0.2754
## alternative hypothesis: the moment conditions are not valid

Mullahy, John. 1997. “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior.” The Review of Economics and Statistics 79 (4): 586–93.