Two steps ME estimation

Introduction

As stated in “Generalized Cross Entropy framework”, the common situation is the absence of prior information on $\mathbf{p} = (\mathbf{p_0},\mathbf{p_1},\dots,\mathbf{p_K})$. Yet, it is possible to include some pre-sample information in the form of $\mathbf{q} = (\mathbf{q_0},\mathbf{q_1},\dots,\mathbf{q_K})$.

Two steps

If we assume that there is generally no information on $\mathbf{p}$, we are in effect defining a uniform distribution for $\mathbf{p}$, and ME estimation is done in the GME framework (see “Generalized Maximum Entropy framework”). That estimation also yields $\mathbf{\hat p}$. If we then use $\mathbf{\hat p}$ as the prior distribution $\mathbf{q}$, we can perform an ME estimation in the GCE framework (see “Generalized Cross Entropy framework”). This procedure can be repeated as many times as required.
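
The link between the two frameworks can be checked numerically. The following base R sketch is only illustrative: the helper names `shannon_entropy` and `cross_entropy` and the probability vector `p` are assumptions made for this example and are not part of GCEstim. It shows that, with a uniform prior on $M = 5$ support points, the cross-entropy equals $\log M$ minus the Shannon entropy, so minimizing one is the same as maximizing the other.

```r
# Hypothetical helper: Shannon entropy of a probability vector
# (terms with p = 0 are dropped, i.e. 0 * log(0) is treated as 0)
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Hypothetical helper: Kullback-Leibler cross-entropy of p relative to a prior q
cross_entropy <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

p <- c(0.05, 0.10, 0.20, 0.30, 0.35)  # made-up probabilities on 5 support points
q_uniform <- rep(1 / 5, 5)            # uniform prior: no pre-sample information

ce <- cross_entropy(p, q_uniform)
h <- shannon_entropy(p)

# With a uniform prior, cross-entropy = log(M) - entropy
all.equal(ce, log(5) - h)

# A distribution has zero cross-entropy relative to itself
cross_entropy(p, p)
```

In the two steps procedure the prior of the second step is no longer uniform, which is why that step is a GCE rather than a GME estimation.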

Consider dataGCE (see “Generalized Maximum Entropy framework” and “Choosing the supports spaces”).

coef.dataGCE <- c(1, 0, 0, 3, 6, 9)

The two steps GCE estimation can be done by assigning to the argument twosteps.n a value different from $0$. Let us consider $10$ GCE estimations after a first GME estimation (by default support.signal.points = c(1/5, 1/5, 1/5, 1/5, 1/5)).

res.lmgce.1se.twosteps <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    twosteps.n = 10
  )
#> Warning in GCEstim::lmgce(y ~ ., data = dataGCE, twosteps.n = 10): 
#> 
#> The minimum error was found for the highest upper limit of the support. Confirm if higher values should be tested.

The trace of the prediction CV-error can be obtained with plot and which = 6

plot(res.lmgce.1se.twosteps, which = 6)[[1]]

The CV-error before reestimation is depicted by the red dot, intermediate CV-errors by orange dots, and the final (reestimated) CV-error by the dark red dot. The horizontal dotted line represents the OLS CV-error. Note that the CV-error decreases as the number of reestimations increases.

Since we are working with simulated data, the true coefficients are known and the precision error can be determined. The arguments which = 7 and coef = coef.dataGCE of plot give the trace of the precision error.

plot(res.lmgce.1se.twosteps, which = 7, coef = coef.dataGCE)[[1]]
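
As a rough illustration of what a precision error measures, the sketch below computes a root mean squared error between estimated and true coefficients. It assumes the RMSE form suggested by the “Precision RMSE” row in the conclusion. The helper `precision_rmse` is hypothetical, and the vector `est` contains made-up numbers, not output of lmgce.

```r
# Hypothetical helper: RMSE between estimated and true coefficients
precision_rmse <- function(estimate, truth) {
  stopifnot(length(estimate) == length(truth))
  sqrt(mean((estimate - truth)^2))
}

coef.dataGCE <- c(1, 0, 0, 3, 6, 9)

# Made-up estimates, for illustration only (not produced by lmgce)
est <- c(1.2, 0.1, -0.1, 2.8, 6.3, 8.7)

precision_rmse(est, coef.dataGCE)
```

Unlike the prediction error, this quantity can only be computed when the true coefficients are known, as is the case with simulated data.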

We can see that the first two reestimations give a lower precision error, but from that point forward the model tends to overfit the data. Generally it is recommended to perform only $1$ GCE reestimation. That can be done by setting twosteps.n = 1, the default of lmgce:

res.lmgce.1se.twosteps.1 <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE
  )

or use update

res.lmgce.1se.twosteps.1 <- update(res.lmgce.1se.twosteps, twosteps.n = 1)

or, since the data is already stored in the object, we can use the changestep function. This last option is the recommended one in this case.

res.lmgce.1se.twosteps.1 <- changestep(res.lmgce.1se.twosteps, 1)

plot with which = 2 gives us the “Prediction Error vs supports” plot

plot(res.lmgce.1se.twosteps.1, which = 2)[[1]]

and with which = 3 we get the “Estimates vs supports” plot.

plot(res.lmgce.1se.twosteps.1, which = 3)[[1]]

The last two plots depict the final solutions. That is to say, after choosing the support space limits based on the defined error, the number of points of the support spaces and their probabilities (support.signal.points = c(1/5, 1/5, 1/5, 1/5, 1/5)), twosteps.n = 1 extra estimation is performed. This estimation uses the GCE framework even though the previous steps were, by default, done in the GME framework. The prior distribution of probabilities is the one estimated for the chosen support spaces, and it is stored in object$p0.

res.lmgce.1se.twosteps.1$p0
#>                   p_1        p_2        p_3       p_4       p_5
#> (Intercept) 0.0164583 0.04043601 0.09934627 0.2440815 0.5996780
#> X001        0.1999969 0.19999843 0.20000000 0.2000016 0.2000031
#> X002        0.1989669 0.19948212 0.19999866 0.2005165 0.2010358
#> X003        0.1856222 0.19254787 0.19973189 0.2071840 0.2149140
#> X004        0.1711658 0.18450259 0.19887859 0.2143747 0.2310783
#> X005        0.1457798 0.16891704 0.19572648 0.2267909 0.2627857

The final estimated vector of probabilities, object$p, is

res.lmgce.1se.twosteps.1$p
#>                    p_1        p_2        p_3       p_4       p_5
#> (Intercept) 0.01096511 0.03033771 0.08393684 0.2322322 0.6425282
#> X001        0.20220556 0.20109672 0.19999395 0.1988972 0.1978065
#> X002        0.18164405 0.19039062 0.19955835 0.2091675 0.2192394
#> X003        0.17754981 0.18812593 0.19933204 0.2112057 0.2237866
#> X004        0.15738899 0.17628610 0.19745213 0.2211595 0.2477133
#> X005        0.13074317 0.15871806 0.19267870 0.2339058 0.2839543
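
One way to see how far the reestimation moved each coefficient away from its prior is to compute the cross-entropy of a row of p relative to the matching row of p0. The sketch below is a base R illustration that copies two rows from the printed output above. The renormalisation step is an assumption added to absorb the rounding of the printed values.

```r
# Rows copied from the printed p0 and p ((Intercept) and X001 only)
p0 <- rbind(
  "(Intercept)" = c(0.0164583, 0.04043601, 0.09934627, 0.2440815, 0.5996780),
  X001          = c(0.1999969, 0.19999843, 0.20000000, 0.2000016, 0.2000031)
)
p <- rbind(
  "(Intercept)" = c(0.01096511, 0.03033771, 0.08393684, 0.2322322, 0.6425282),
  X001          = c(0.20220556, 0.20109672, 0.19999395, 0.1988972, 0.1978065)
)

# Renormalise each row so it sums to 1 despite rounding in the printout
p0 <- p0 / rowSums(p0)
p  <- p / rowSums(p)

# Cross-entropy of each row of p relative to the matching row of p0
kl <- rowSums(p * log(p / p0))
kl
```

For these two rows the value is larger for the intercept than for X001, whose prior and final distributions both stay close to uniform.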

Conclusion

Comparing the different methods, we can conclude that, in general, we should use the two steps approach with only $1$ reestimation and choose the support spaces defined by standardized bounds with the 1se error structure.

	$OLS$	$GME_{(RidGME)}$	$GME_{(incRidGME_{1se})}$	$GME_{(incRidGME_{min})}$	$GME_{(std_{1se})}$	$GME_{(std_{min})}$	$GCE_{(std_{1se})}$
Prediction RMSE	0.405	0.459	0.423	0.406	0.423	0.406	0.407
Prediction CV-RMSE	0.436	0.513	0.455	0.435	0.472	0.435	0.424
Precision RMSE	5.809	2.192	2.018	3.166	1.612	4.495	2.178
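
The comparison can also be read programmatically. The sketch below re-enters the values of the table as a base R matrix (the column labels are plain-text versions of the headers, chosen here for convenience) and reports the method with the smallest error for each criterion.

```r
# Values copied from the table above
rmse <- rbind(
  "Prediction RMSE"    = c(0.405, 0.459, 0.423, 0.406, 0.423, 0.406, 0.407),
  "Prediction CV-RMSE" = c(0.436, 0.513, 0.455, 0.435, 0.472, 0.435, 0.424),
  "Precision RMSE"     = c(5.809, 2.192, 2.018, 3.166, 1.612, 4.495, 2.178)
)
colnames(rmse) <- c(
  "OLS", "GME(RidGME)", "GME(incRidGME_1se)", "GME(incRidGME_min)",
  "GME(std_1se)", "GME(std_min)", "GCE(std_1se)"
)

# Method with the smallest error for each criterion (first one in case of ties)
best <- colnames(rmse)[apply(rmse, 1, which.min)]
names(best) <- rownames(rmse)
best
```

No single method is best on all three rows: GCE(std_1se) has the lowest CV-RMSE and is within 0.002 of the lowest prediction RMSE, while GME(std_1se) has the lowest precision RMSE.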

Acknowledgements

This work was supported by Fundação para a Ciência e Tecnologia (FCT) through CIDMA and projects https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020.