Advanced Econometrics 4
Problem set 1
Lukas Worku, Olli Putkiranta, Johannes Hirvonen
Problem 1
a) We start with the definition given:
$$ \epsilon = \begin{dcases} 1-X\beta &\text{with probability } \Pr(Y=1)=P \\ -X\beta &\text{with probability } \Pr(Y=0)=1-P \end{dcases} $$
We assume $$ E[\epsilon|X]=0 $$ by definition. We are interested in the variance of the error term conditional on $X$, $$ Var(\epsilon|X). $$
Now
$$ Var(\epsilon|X)= E\left[(\epsilon-E[\epsilon|X])^2\,\middle|\,X\right]=E\left[(\epsilon-0)^2\,\middle|\,X\right] = E[\epsilon^2|X]. $$ For a discrete random variable $Y$ the following holds: $$ E[Y^2]=\sum_{y}y^2\cdot \Pr(Y=y). $$ Using this, we can write: $$ Var(\epsilon|X)=E[\epsilon^2|X]= (1-X\beta)^2\cdot \Pr(Y=1)+(-X\beta)^2\cdot \Pr(Y=0) = (1-X\beta)^2P + (-X\beta)^2(1-P). $$
We also know that $$ P = X\beta $$. Hence
$$ Var(\epsilon|X)= (1-X\beta)^2(X\beta) + (-X\beta)^2(1-X\beta) $$ $$ = (X\beta)(1-X\beta)\left[(1-X\beta)+X\beta\right] $$ $$ = X\beta(1-X\beta). $$ $$ \square $$
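As a quick sanity check, the short Python simulation below (with made-up values for $\beta$ and $X$, not part of the problem set) confirms that the binned sample variance of $\epsilon$ tracks $X\beta(1-X\beta)$:

import numpy as np

rng = np.random.default_rng(0)

# Made-up design: draw X so the implied probabilities P = X'beta stay in (0, 1)
n = 500_000
beta = np.array([0.2, 0.5])
X = np.column_stack([np.ones(n), rng.uniform(0.1, 0.9, n)])
p = X @ beta                    # P = X'beta, here between 0.25 and 0.65

y = rng.binomial(1, p)          # Bernoulli outcomes
eps = y - p                     # epsilon = Y - X'beta

# Within narrow bins of p, the sample variance of epsilon should match
# the derived formula X'beta (1 - X'beta)
edges = np.linspace(p.min(), p.max(), 11)
bins = np.digitize(p, edges[1:-1])
for b in range(10):
    m = bins == b
    print(round(float(eps[m].var()), 4), round(float((p[m] * (1 - p[m])).mean()), 4))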
b)
In the multinomial logit model, the probability that individual $i$ makes choice $y=j$ is given by $$ \Pr(y_i=j)=P_{ij}=\frac{e^{V_{ij}}}{\sum_k e^{V_{ik}}}. $$ We allow the $K$ regressors to differ between decision makers but not between choices: $$ V_{ij}=\bm x_i'\bm\beta_j. $$ The coefficient vector $\bm\beta_j$, however, varies between choices $j$. Here $\bm x_i'\bm\beta_j$ is the scalar product of the $K\times1$ vectors $\bm x_i$ and $\bm\beta_j$: $$ \bm x_i'\bm\beta_j=x_{i1}\beta_{j1}+x_{i2}\beta_{j2}+\dots+x_{iK}\beta_{jK}, $$ where $K$ is the number of regressors. Combining the above and dropping the index for the individual to avoid cluttered notation, we can write $$ \Pr(y=j)=P_j=\frac{e^{\bm x'\bm\beta_j}}{\sum_k e^{\bm x'\bm\beta_k}}. $$ Since this probability is conditional on $\bm x$, we write $$ \Pr(y=j\mid\bm x)=P_j=\frac{e^{\bm x'\bm\beta_j}}{\sum_k e^{\bm x'\bm\beta_k}}. $$ In the derivations below, we use the shorthand notation $$ \sum_{k\in\{1,\dots,J\}}=\sum_k, $$ as we assume there are $J$ possible choices in total. We calculate the marginal effect of a change in some $x_h$ ($h\in\{1,\dots,K\}$) on the probability of choosing $j$ by taking the partial derivative of $P_j$ with respect to $x_h$ (quotient rule): $$ \frac{\partial \Pr(y=j\mid\bm x)}{\partial x_h}=\frac{\frac{\partial}{\partial x_h}\left(e^{\bm x'\bm\beta_j}\right)\cdot\sum_k e^{\bm x'\bm\beta_k}-e^{\bm x'\bm\beta_j}\cdot\frac{\partial}{\partial x_h}\left(\sum_k e^{\bm x'\bm\beta_k}\right)}{\left(\sum_k e^{\bm x'\bm\beta_k}\right)^2} $$ $$ =\frac{\frac{\partial}{\partial x_h}\left(e^{\bm x'\bm\beta_j}\right)\cdot\sum_k e^{\bm x'\bm\beta_k}}{\left(\sum_k e^{\bm x'\bm\beta_k}\right)^2}-\frac{e^{\bm x'\bm\beta_j}\cdot\frac{\partial}{\partial x_h}\left(\sum_k e^{\bm x'\bm\beta_k}\right)}{\left(\sum_k e^{\bm x'\bm\beta_k}\right)^2}. $$ Now, let's evaluate these two terms individually. The first term becomes: $$ \frac{\beta_{jh}\cdot e^{\bm x'\bm\beta_j}\cdot\sum_k e^{\bm x'\bm\beta_k}}{\left(\sum_k e^{\bm x'\bm\beta_k}\right)\cdot\left(\sum_k e^{\bm x'\bm\beta_k}\right)}=\frac{e^{\bm x'\bm\beta_j}}{\sum_k e^{\bm x'\bm\beta_k}}\,\beta_{jh}=\Pr(y=j\mid\bm x)\cdot\beta_{jh}. $$ The second term becomes: $$ \frac{e^{\bm x'\bm\beta_j}\cdot\sum_k\beta_{kh}e^{\bm x'\bm\beta_k}}{\left(\sum_k e^{\bm x'\bm\beta_k}\right)^2}=\frac{e^{\bm x'\bm\beta_j}}{\sum_k e^{\bm x'\bm\beta_k}}\cdot\frac{\sum_k\beta_{kh}e^{\bm x'\bm\beta_k}}{\sum_k e^{\bm x'\bm\beta_k}} $$ $$ =\Pr(y=j\mid\bm x)\cdot\frac{1}{\sum_k e^{\bm x'\bm\beta_k}}\left(\beta_{1h}e^{\bm x'\bm\beta_1}+\beta_{2h}e^{\bm x'\bm\beta_2}+\dots+\beta_{Jh}e^{\bm x'\bm\beta_J}\right) $$ $$ =\Pr(y=j\mid\bm x)\cdot\left(\frac{\beta_{1h}e^{\bm x'\bm\beta_1}}{\sum_k e^{\bm x'\bm\beta_k}}+\frac{\beta_{2h}e^{\bm x'\bm\beta_2}}{\sum_k e^{\bm x'\bm\beta_k}}+\dots+\frac{\beta_{Jh}e^{\bm x'\bm\beta_J}}{\sum_k e^{\bm x'\bm\beta_k}}\right) $$ $$ =\Pr(y=j\mid\bm x)\cdot\left[\Pr(y=1\mid\bm x)\,\beta_{1h}+\Pr(y=2\mid\bm x)\,\beta_{2h}+\dots+\Pr(y=J\mid\bm x)\,\beta_{Jh}\right] $$ $$ =\Pr(y=j\mid\bm x)\cdot\sum_k\Pr(y=k\mid\bm x)\,\beta_{kh}. $$ Now, putting the two terms together, we arrive at the final form: $$ \frac{\partial \Pr(y=j\mid\bm x)}{\partial x_h}=\Pr(y=j\mid\bm x)\cdot\beta_{jh}-\Pr(y=j\mid\bm x)\cdot\sum_k\Pr(y=k\mid\bm x)\,\beta_{kh}. $$
$$ \square $$
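The derived formula can also be checked numerically. The Python sketch below (with hypothetical $\bm\beta_j$ and $\bm x$) compares the analytical marginal effect to a finite-difference approximation:

import numpy as np

def mnl_probs(x, betas):
    # Multinomial logit probabilities P_j = exp(x'beta_j) / sum_k exp(x'beta_k)
    v = betas @ x                 # utilities V_j = x'beta_j
    ev = np.exp(v - v.max())      # subtract the max for numerical stability
    return ev / ev.sum()

rng = np.random.default_rng(1)
J, K, h = 3, 4, 2                 # hypothetical sizes: J choices, K regressors
betas = rng.normal(size=(J, K))   # made-up coefficient vectors beta_j
x = rng.normal(size=K)

p = mnl_probs(x, betas)

# Analytical marginal effect from the derivation above:
# dP_j/dx_h = P_j * beta_jh - P_j * sum_k P_k beta_kh
analytic = p * betas[:, h] - p * (p @ betas[:, h])

# Finite-difference approximation of the same derivative
step = 1e-6
x_up = x.copy()
x_up[h] += step
numeric = (mnl_probs(x_up, betas) - p) / step

print(np.allclose(analytic, numeric, atol=1e-5))   # True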
c) The probability for decision-maker $i$ to choose option $j$ is:
$$ \Pr(y_i=j)=P_{ij}=\Pr(U_{ij}>U_{ik}\;\forall k\neq j). $$ In our case, $j,k\in\{0,1,2\}$. Dropping the index for the individual and letting $U_j=\mu_j+\epsilon_j$, we can write $\Pr(y=0)$ as $$ P_0=\Pr(\mu_0+\epsilon_0>\mu_1+\epsilon_1 \;\&\; \mu_0+\epsilon_0>\mu_2+\epsilon_2) $$ $$ =\Pr(\epsilon_1<\mu_0-\mu_1+\epsilon_0\;\&\;\epsilon_2<\mu_0-\mu_2+\epsilon_0) $$ $$ =F_{\epsilon_1\epsilon_2}(\mu_0-\mu_1+\epsilon_0,\,\mu_0-\mu_2+\epsilon_0). $$ Above we used the following property of joint distributions: $$ \Pr(X<a\;\&\;Y<b)=F_{XY}(a,b), $$ where $F_{XY}$ is the joint cumulative distribution function of random variables $X$ and $Y$. We notice that $P_0$ depends on $\epsilon_0$, which is unobservable to the researcher. We therefore write: $$ P_0\mid\epsilon_0=F_{\epsilon_1\epsilon_2}(\mu_0-\mu_1+\epsilon_0,\,\mu_0-\mu_2+\epsilon_0). $$ To obtain the unconditional choice probability, we integrate $P_0\mid\epsilon_0$ weighted by the density of $\epsilon_0$ over the range of $\epsilon_0$:
$$ P_0=\int_{-\infty}^{\infty}F_{\epsilon_1\epsilon_2}(\mu_0-\mu_1+\epsilon_0,\,\mu_0-\mu_2+\epsilon_0)\, f_{\epsilon_0}(\epsilon_0)\,d\epsilon_0. $$ Now we plug in the given (assumed) functions $F_{\epsilon_1\epsilon_2}$ and $f_{\epsilon_0}$ to get: $$ P_0=\int_{-\infty}^{\infty} e^{-\left(e^{-\rho^{-1}(\mu_0-\mu_1+\epsilon_0)}+e^{-\rho^{-1}(\mu_0-\mu_2+\epsilon_0)}\right)^\rho}\cdot e^{-e^{-\epsilon_0}}\cdot e^{-\epsilon_0}\,d\epsilon_0. $$ Let's first manipulate the exponent of the first term into a friendlier form:
$$ -\left(e^{-\rho^{-1}(\mu_0-\mu_1+\epsilon_0)}+e^{-\rho^{-1}(\mu_0-\mu_2+\epsilon_0)}\right)^\rho $$ $$ =-\left(e^{-\rho^{-1}\mu_0+\rho^{-1}\mu_1-\rho^{-1}\epsilon_0}+e^{-\rho^{-1}\mu_0+\rho^{-1}\mu_2-\rho^{-1}\epsilon_0}\right)^\rho $$ $$ =-\left(e^{-\rho^{-1}\mu_0}\cdot e^{\rho^{-1}\mu_1}\cdot e^{-\rho^{-1}\epsilon_0}+e^{-\rho^{-1}\mu_0}\cdot e^{\rho^{-1}\mu_2}\cdot e^{-\rho^{-1}\epsilon_0}\right)^\rho $$ $$
=-\left[e^{-\rho^{-1}\mu_0}\cdot e^{-\rho^{-1}\epsilon_0}\left(e^{\rho^{-1}\mu_1}+e^{\rho^{-1}\mu_2}\right)\right]^\rho $$ $$
=-e^{-\mu_0}\cdot e^{-\epsilon_0}\left(e^{\rho^{-1}\mu_1}+e^{\rho^{-1}\mu_2}\right)^\rho $$ $$ =-e^{-\epsilon_0}\cdot e^{-\mu_0}\left(e^{\rho^{-1}\mu_1}+e^{\rho^{-1}\mu_2}\right)^\rho=-e^{-\epsilon_0}\alpha, $$ where $\alpha=e^{-\mu_0}\left(e^{\rho^{-1}\mu_1}+e^{\rho^{-1}\mu_2}\right)^\rho$. Now, let's plug this back into the integral to get: $$ P_0=\int_{-\infty}^{\infty} e^{-e^{-\epsilon_0}\alpha}\cdot e^{-e^{-\epsilon_0}}\cdot e^{-\epsilon_0}\,d\epsilon_0=\int_{-\infty}^{\infty} e^{-e^{-\epsilon_0}\alpha-e^{-\epsilon_0}}\cdot e^{-\epsilon_0}\,d\epsilon_0 $$ $$ =\int_{-\infty}^{\infty} e^{-e^{-\epsilon_0}(\alpha+1)}\cdot e^{-\epsilon_0}\,d\epsilon_0. $$ To evaluate this integral, we substitute $u=e^{-\epsilon_0}$. We then calculate $$ \frac{du}{d\epsilon_0}=-e^{-\epsilon_0}\Leftrightarrow du=-e^{-\epsilon_0}\,d\epsilon_0. $$ We also note that as $\epsilon_0$ approaches negative infinity, $u$ approaches infinity; as $\epsilon_0$ approaches infinity, $u$ approaches zero. Thus, we adjust the limits of the integral accordingly. Plugging in the substitutions, we can now evaluate $$ P_0=\int_{\infty}^{0}-e^{-u(\alpha+1)}\,du=\left[\frac{1}{\alpha+1}e^{-u(\alpha+1)}\right]_{\infty}^{0} $$ $$ =\lim_{u\to0}\frac{1}{\alpha+1}e^{-u(\alpha+1)}-\lim_{u\to\infty}\frac{1}{\alpha+1}e^{-u(\alpha+1)} =\frac{1}{\alpha+1}-0=\frac{1}{\alpha+1}. $$ Now, let's plug the expression for $\alpha$ back in to get: $$ P_0=\frac{1}{e^{-\mu_0}\left(e^{\rho^{-1}\mu_1}+e^{\rho^{-1}\mu_2}\right)^\rho+1}. $$ Multiplying both the numerator and the denominator by $e^{\mu_0}$ gives: $$ P_0=\frac{e^{\mu_0}}{e^{\mu_0}\cdot e^{-\mu_0}\left(e^{\rho^{-1}\mu_1}+e^{\rho^{-1}\mu_2}\right)^\rho+e^{\mu_0}} =\frac{e^{\mu_0}}{e^{\mu_0}+\left(e^{\rho^{-1}\mu_1}+e^{\rho^{-1}\mu_2}\right)^\rho}. $$ $$ \square $$
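The closed form can again be checked numerically. The Python sketch below evaluates the integral defining $P_0$ by quadrature for made-up values of $\mu_0,\mu_1,\mu_2$ and $\rho$, and compares it to the derived expression:

import numpy as np
from scipy.integrate import quad

# Made-up values of the deterministic utilities and the nest parameter
mu0, mu1, mu2, rho = 0.5, -0.3, 0.8, 0.6

def integrand(e0):
    # F_{eps1 eps2}(mu0 - mu1 + e0, mu0 - mu2 + e0) times the Gumbel density of eps0
    F = np.exp(-(np.exp(-(mu0 - mu1 + e0) / rho)
                 + np.exp(-(mu0 - mu2 + e0) / rho)) ** rho)
    f0 = np.exp(-e0) * np.exp(-np.exp(-e0))
    return F * f0

# The integrand is numerically negligible outside (-30, 30)
numeric, _ = quad(integrand, -30, 30)
closed_form = np.exp(mu0) / (np.exp(mu0)
                             + (np.exp(mu1 / rho) + np.exp(mu2 / rho)) ** rho)
print(numeric, closed_form)   # the two agree to quadrature precision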
Problem 2
We calculate the marginal effects by first exponentiating each parameter reported above. Then we predict the choice probabilities for school and for home at the means of education, experience, and experience squared, separately for black and non-black individuals. Each probability is obtained by dividing the exponentiated linear index of a choice by the sum over all choices (with working as the reference category, normalized to 1). This gives, for example, the probability of choosing school for both black and non-black individuals, evaluated at the means of the other variables; the marginal effect of black race is then the difference between these two probabilities.
As we can see, the marginal effects calculated by hand are very close to those reported by the estimation software (Python in this case).
Marginal effects for choosing school for black race
By hand: -0.043
By software: -0.046
Marginal effects for choosing home for black race
By hand: 0.038
By software: 0.038
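For illustration, the Python sketch below mirrors the by-hand calculation described above; the coefficient values and variable means are hypothetical placeholders, not our actual estimates:

import numpy as np

# Hypothetical MNL coefficients (order: const, educ, exper, exper^2, black);
# working is the reference category, so its index is normalized to zero.
b_school = np.array([ 0.5,  0.30, -0.10, 0.001, -0.80])
b_home   = np.array([-1.0, -0.20, -0.05, 0.001,  0.50])

x_means = np.array([1.0, 12.0, 8.0, 64.0])  # const and means of educ, exper, exper^2

def probs(black):
    # Choice probabilities at the covariate means for a given race indicator
    x = np.append(x_means, black)
    num = np.array([1.0, np.exp(b_school @ x), np.exp(b_home @ x)])  # work, school, home
    return num / num.sum()

# Marginal effect of black race = difference in predicted probabilities
print(probs(1)[1] - probs(0)[1])   # school
print(probs(1)[2] - probs(0)[2])   # home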
d)
Stata code below:
use "fishsamp.dta"
// Define the nesting structure: alternatives 1 and 2 form the Inland nest,
// alternatives 3 and 4 the Boat nest
nlogitgen type = alterntv(Inland: 1|2, Boat: 3|4)
// RUM-consistent nested logit: price and crate vary across alternatives,
// income enters the nest-level (type) equation with Inland as the base
nlogit modeused price crate || type: income, base(Inland) || alterntv:, noconst case(id)
We have the following output:
tree structure specified for the nested logit model

type        N   alterntv      N      k
----------------------------------------
Inland   1182   --- 1       591     64
                +-- 2       591     84
Boat     1182   --- 3       591    221
                +-- 4       591    222
----------------------------------------
total    2364                      591

k = number of times alternative is chosen
N = number of observations at each level
Iteration 0: log likelihood = -615.20944
Iteration 1: log likelihood = -614.85796 (backed up)
Iteration 2: log likelihood = -612.52747 (backed up)
Iteration 3: log likelihood = -602.41982 (backed up)
Iteration 4: log likelihood = -602.25174
Iteration 5: log likelihood = -599.36483
Iteration 6: log likelihood = -593.59764
Iteration 7: log likelihood = -592.19146
Iteration 8: log likelihood = -591.58143
Iteration 9: log likelihood = -590.96997
Iteration 10: log likelihood = -590.13488
Iteration 11: log likelihood = -589.60811
Iteration 12: log likelihood = -589.3945
Iteration 13: log likelihood = -589.25113
Iteration 14: log likelihood = -589.22057
Iteration 15: log likelihood = -589.20877
Iteration 16: log likelihood = -589.20443
Iteration 17: log likelihood = -589.20339
Iteration 18: log likelihood = -589.20337
RUM-consistent nested logit regression Number of obs = 2,364
Case variable: id Number of cases = 591
Alternative variable: alterntv Alts per case: min = 4
avg = 4.0
max = 4
Wald chi2(3) = 109.41
Log likelihood = -589.20337 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
modeused | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
alterntv |
price | -.0322812 .0031818 -10.15 0.000 -.0385174 -.0260451
crate | 1.172328 .4688659 2.50 0.012 .2533681 2.091289
------------------------------------------------------------------------------
type equations
------------------------------------------------------------------------------
Inland |
income | 0 (base)
-------------+----------------------------------------------------------------
Boat |
income | .0787695 .0596371 1.32 0.187 -.0381172 .1956561
------------------------------------------------------------------------------
dissimilarity parameters
------------------------------------------------------------------------------
type |
/Inland_tau | 9.64581 14.76371 -19.29054 38.58216
/Boat_tau | 10.79396 14.8031 -18.21959 39.80751
------------------------------------------------------------------------------
LR test for IIA (tau=1): chi2(2) = 48.99 Prob > chi2 = 0.0000
The test of the IIA assumption is reported in the bottom row of the output. The chi-squared test statistic is 48.99 with two degrees of freedom, so the p-value is very close to zero and IIA is rejected at all conventional levels of statistical significance.
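The reported p-value can be verified directly from the chi-squared distribution (Python):

from scipy.stats import chi2

# p-value of the LR test for IIA reported by Stata: chi2(2) = 48.99
print(chi2.sf(48.99, df=2))   # roughly 2e-11, effectively zero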
Problem 3
a)
Estimating the multinomial logit parameters first and using them as the starting values leads to the global optimum.
b)
For OLS to be valid with the three-point ranking scale, one would need to assume that the ranking variable is continuous rather than ordinal. In other words, OLS assumes equal "distances" between the rankings, when no such cardinal distances exist. An ordered logit model, for example, might be better suited for this type of ranked ordinal outcome.
c)
The authors conduct a likelihood ratio test
$$
\lambda_{\text{LR}} = -2 \ln \left[ \frac{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta)}{\sup_{\theta \in \Theta} \mathcal{L}(\theta)} \right] = -2 \left[ \ell(\theta_0) - \ell(\hat{\theta}) \right]
$$
and find the test statistic to be -2392 with 8 degrees of freedom, meaning that the null hypothesis that the mixed logit and multinomial logit models provide the same goodness of fit is rejected at all relevant levels of significance. In other words, the mixed logit model performs better in predicting the dependent variable.
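For illustration, the statistic and its p-value would be computed as below (Python; the log-likelihood values are hypothetical, not the paper's):

from scipy.stats import chi2

ll_restricted = -5_000.0   # hypothetical log likelihood of the multinomial logit (null)
ll_full = -4_800.0         # hypothetical log likelihood of the mixed logit (alternative)

lr = -2 * (ll_restricted - ll_full)   # lambda_LR = -2 [ l(theta_0) - l(theta_hat) ]
print(lr, chi2.sf(lr, df=8))          # statistic 400.0 and its p-value with 8 df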
d)
It indicates that heterogeneity in the preferences for those parameters (likely) exists. Because the multinomial logit model does not account for this heterogeneity, its estimates are biased. This problem can be addressed by dividing the error term into two parts, one satisfying the IID assumption and the other following a known distribution (e.g. normal, lognormal, triangular). With this distributional assumption on the non-IID part of the error, the parameters can be estimated with the mixed logit class of models.
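As a minimal sketch of this idea, the Python snippet below simulates a mixed logit choice probability with a single normally distributed random coefficient (all values hypothetical):

import numpy as np

rng = np.random.default_rng(2)

# Random coefficient beta_i = b + sigma * v_i, v_i ~ N(0, 1); the non-IID
# part of the error comes from the draws, the IID part from the logit kernel.
b, sigma = 0.5, 0.3
x = np.array([1.0, 2.0, 3.0])           # one attribute for three alternatives

R = 10_000
betas = b + sigma * rng.normal(size=R)  # one coefficient per simulation draw

ev = np.exp(np.outer(betas, x))         # R x 3 matrix of exponentiated utilities
probs = ev / ev.sum(axis=1, keepdims=True)
p_sim = probs.mean(axis=0)              # simulated choice probabilities
print(p_sim)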