Generalized Cross Entropy framework

Jorge Cabral

Introduction

Although the common situation is the absence of prior information on \(\mathbf{p} = (\mathbf{p_0},\mathbf{p_1},\dots,\mathbf{p_K})\), in some particular cases pre-sample information exists in the form of \(\mathbf{q} = (\mathbf{q_0},\mathbf{q_1},\dots,\mathbf{q_K})\). This distribution \(\mathbf{q}\) can be used as an initial hypothesis to be incorporated into the consistency relations of the maximum entropy formalism. Kullback and Leibler [1] defined cross-entropy (CE) between \(\mathbf{p}\) and \(\mathbf{q}\) as

\[\begin{align} I(\mathbf{p},\mathbf{q})=\sum_{k=0}^K \mathbf{p_k} \ln \left(\mathbf{p_k}/\mathbf{q_k}\right). \end{align}\]

\(I(\mathbf{p},\mathbf{q})\) measures the discrepancy between the \(\mathbf{p}\) and \(\mathbf{q}\) distributions. It is non-negative and equals zero when \(\mathbf{p}=\mathbf{q}\). So, according to the principle of minimum cross-entropy [2,3], one should choose the probabilities that satisfy the constraints and are as close as possible to the prior probabilities.
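As a quick numerical illustration (a minimal sketch in base R; the function cross_entropy below is written here for illustration and is not part of GCEstim), \(I(\mathbf{p},\mathbf{q})\) is zero when the two distributions coincide and positive otherwise:

cross_entropy <- function(p, q) {
  # Kullback-Leibler cross-entropy I(p, q) = sum_k p_k * log(p_k / q_k),
  # with the convention that terms with p_k = 0 contribute 0
  stopifnot(length(p) == length(q), all(q > 0))
  nz <- p > 0
  sum(p[nz] * log(p[nz] / q[nz]))
}

p <- c(0.1, 0.1, 0.6, 0.1, 0.1)
q.unif <- rep(1/5, 5)

cross_entropy(p, p)       # 0: no discrepancy
cross_entropy(p, q.unif)  # > 0: p diverges from the uniform prior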

Generalized Cross Entropy estimator

Given the above, and considering the reparameterized linear regression model \[\begin{equation} \mathbf{y}=\mathbf{XZp} + \mathbf{Vw}, \end{equation}\]
the Generalized Cross Entropy (GCE) estimator is given by

\[\begin{equation} \hat{\boldsymbol{\beta}}^{GCE}(\mathbf{Z},\mathbf{V}) = \underset{\mathbf{p},\mathbf{q},\mathbf{w},\mathbf{u}}{\operatorname{argmin}} \left\{\mathbf{p}' \ln \left(\mathbf{p/q}\right) + \mathbf{w}' \ln \left(\mathbf{w/u}\right) \right\}, \end{equation}\]
subject to the same model constraints as the GME estimator (see “Generalized Maximum Entropy framework”).

Using summation notation, the minimization problem can be rewritten as follows: \[\begin{align} &\text{minimize} & I(\mathbf{p,q,w,u}) &=\sum_{k=0}^{K}\sum_{m=1}^{M} p_{km}\ln(p_{km}/q_{km}) +\sum_{n=1}^{N}\sum_{j=1}^{J} w_{nj}\ln(w_{nj}/u_{nj}) \\ &\text{subject to} & y_n &= \sum_{k=0}^{K}\sum_{m=1}^{M} X_{nk}Z_{km}p_{km} + \sum_{j=1}^{J} V_{nj}w_{nj}, \forall n \\ & & \sum_{m=1}^{M} p_{km} &= 1, \forall k\\ & & \sum_{j=1}^{J} w_{nj} &= 1, \forall n. \end{align}\]

The Lagrangian equation \[\begin{equation} \mathcal{L}=\mathbf{p}' \ln \left(\mathbf{p/q}\right) + \mathbf{w}' \ln \left(\mathbf{w/u}\right) + \boldsymbol{\lambda}' \left( \mathbf{y} - \mathbf{XZp} - \mathbf{Vw} \right) + \boldsymbol{\theta}'\left( \mathbf{1}_{K+1}-(\mathbf{I}_{K+1} \otimes \mathbf{1}'_M)\mathbf{p} \right) + \boldsymbol{\tau}'\left( \mathbf{1}_N-(\mathbf{I}_N \otimes \mathbf{1}'_J)\mathbf{w}\right) \end{equation}\]
can be used to find the interior solution, where \(\boldsymbol{\lambda}\), \(\boldsymbol{\theta}\), and \(\boldsymbol{\tau}\) are the associated \((N\times 1)\), \(((K+1)\times 1)\), and \((N\times 1)\) vectors of Lagrange multipliers, respectively.
Taking the gradient of the Lagrangian and solving the first-order conditions yields the solutions for \(\mathbf{\hat p}\) and \(\mathbf{\hat w}\)

\[\begin{equation} \hat p_{km} = \frac{q_{km}\exp\left(-z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)}{\sum_{m=1}^M q_{km}\exp\left(-z_{km}\sum_{n=1}^N \hat\lambda_n x_{nk}\right)} \end{equation}\] and \[\begin{equation} \hat w_{nj} = \frac{u_{nj}\exp\left(-\hat\lambda_n v_{nj}\right)}{\sum_{j=1}^J u_{nj}\exp\left(-\hat\lambda_n v_{nj}\right)}. \end{equation}\]

Note that when the prior distributions are uniform, the priors cancel in the expressions above, so minimum cross-entropy reduces to maximum entropy and the GCE and GME estimators coincide.
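This can be checked numerically. The sketch below (plain R; the support points, the column of \(\mathbf{X}\) and the Lagrange multipliers are made-up values, and the function gce.weights is not part of GCEstim) recovers \(\hat p_{km}\) for one coefficient from a prior \(\mathbf{q_k}\) and shows that a uniform prior reproduces the maximum entropy weights:

gce.weights <- function(z, q, lambda, x) {
  # p_hat_m proportional to q_m * exp(-z_m * sum_n(lambda_n * x_nk))
  s <- exp(-z * sum(lambda * x))
  q * s / sum(q * s)
}

z <- c(-100, -50, 0, 50, 100)        # support points for one coefficient (assumed)
lambda <- c(0.01, -0.02, 0.005)      # illustrative Lagrange multipliers
x <- c(1.2, -0.7, 0.4)               # illustrative column of X

q.unif <- rep(1/5, 5)                # uniform prior: GCE reduces to GME
q.info <- c(0.1, 0.1, 0.6, 0.1, 0.1) # informative prior centered at zero

gce.weights(z, q.unif, lambda, x)
gce.weights(z, q.info, lambda, x)    # mass shifted towards the central point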

Examples

Consider dataGCE (see “Generalized Maximum Entropy framework”).
Again, under a “no a priori information” scenario for the parameters, one can assume that \(z_k^{upper}=100\), \(k\in\left\lbrace 0,\dots,5\right\rbrace\), is a “wide upper bound” for the signal support space. Using lmgce, a model can be fitted under either the GME or the GCE framework. If support.signal.points is an integer, a constant vector, or a constant matrix, a uniform distribution is assumed for \(\mathbf{q}\) and the GME framework is therefore being used.

library(GCEstim)
# GME fit: uniform prior over the five support points in [-100, 100]
res.lmgce.100.GME <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = 5,
    twosteps.n = 0,
    seed = 230676
  )

The estimated GME coefficients are \(\widehat{\boldsymbol{\beta}}^{GME_{(100)}}=\) (1.026, -0.155, 1.822, 3.319, 8.393, 11.467).

(coef.res.lmgce.100.GME <- coef(res.lmgce.100.GME))
#> (Intercept)        X001        X002        X003        X004        X005 
#>   1.0255630  -0.1552375   1.8221235   3.3194530   8.3932055  11.4670530

If, however, there is some information, for instance on \(\beta_1\) and \(\beta_2\), it can be reflected in support.signal.points. Let us suppose that one suspects that \(\beta_1=\beta_2=0\). Since the support spaces are centered at zero, one can assign a higher probability to the support point at (or nearest to) the center, setting, for instance, \(\mathbf{q_1}=\mathbf{q_2}=(0.1, 0.1, 0.6, 0.1, 0.1)'\). support.signal.points accepts the prior probability distributions in the form of a \((K+1)\times M\) matrix, whose first row corresponds to \(\mathbf{q_0}\), the second to \(\mathbf{q_1}\), and so on.

# prior probabilities: uniform for all coefficients except beta_1 and beta_2,
# which place probability 0.6 on the central support point (zero)
(support.signal.points.matrix <- 
  matrix(
    c(rep(1/5, 5),
      c(0.1, 0.1, 0.6, 0.1, 0.1),
      c(0.1, 0.1, 0.6, 0.1, 0.1),
      rep(1/5, 5),
      rep(1/5, 5),
      rep(1/5, 5)
      ),
    ncol = 5,
    byrow = TRUE))
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  0.2  0.2  0.2  0.2  0.2
#> [2,]  0.1  0.1  0.6  0.1  0.1
#> [3,]  0.1  0.1  0.6  0.1  0.1
#> [4,]  0.2  0.2  0.2  0.2  0.2
#> [5,]  0.2  0.2  0.2  0.2  0.2
#> [6,]  0.2  0.2  0.2  0.2  0.2
# GCE fit: same settings, but with the informative prior matrix
res.lmgce.100.GCE <-
  GCEstim::lmgce(
    y ~ .,
    data = dataGCE,
    cv = TRUE,
    cv.nfolds = 5,
    support.signal = c(-100, 100),
    support.signal.points = support.signal.points.matrix,
    twosteps.n = 0,
    seed = 230676
  )

The estimated GCE coefficients are \(\widehat{\boldsymbol{\beta}}^{GCE_{(100)}}=\) (1.026, -0.143, 1.655, 3.228, 8.189, 11.269).

(coef.res.lmgce.100.GCE <- coef(res.lmgce.100.GCE))
#> (Intercept)        X001        X002        X003        X004        X005 
#>    1.026345   -0.143421    1.654828    3.227839    8.189040   11.269391

The prediction errors are approximately equal (\(RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx\) 0.407 and \(RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx\) 0.407), as are the prediction cross-validation errors (\(CV\text{-}RMSE_{\mathbf{\hat y}}^{GME_{(100)}} \approx\) 0.428 and \(CV\text{-}RMSE_{\mathbf{\hat y}}^{GCE_{(100)}} \approx\) 0.427).
The precision error, however, is lower for the GCE approach: \(RMSE_{\boldsymbol{\hat\beta}}^{GME_{(100)}} \approx\) 1.595 and \(RMSE_{\boldsymbol{\hat\beta}}^{GCE_{(100)}} \approx\) 1.458.

(RMSE_beta.lmgce.100.GME <-
   GCEstim::accmeasure(coef.res.lmgce.100.GME, coef.dataGCE, which = "RMSE"))
#> [1] 1.594821

(RMSE_beta.lmgce.100.GCE <-
    GCEstim::accmeasure(coef.res.lmgce.100.GCE, coef.dataGCE, which = "RMSE"))
#> [1] 1.457947
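The in-sample prediction errors reported above can be computed along the same lines, by comparing the observed responses with the fitted values obtained from the estimated coefficients (a sketch; it assumes that dataGCE stores the response in column y and the predictors in the remaining columns, in the order of the coefficients, and it does not reproduce the cross-validation errors, which depend on the fold structure used internally by lmgce):

# design matrix with intercept, in the order of the estimated coefficients
X <- cbind(1, as.matrix(dataGCE[, setdiff(names(dataGCE), "y")]))

yhat.lmgce.100.GME <- as.vector(X %*% coef.res.lmgce.100.GME)
yhat.lmgce.100.GCE <- as.vector(X %*% coef.res.lmgce.100.GCE)

GCEstim::accmeasure(yhat.lmgce.100.GME, dataGCE$y, which = "RMSE")
GCEstim::accmeasure(yhat.lmgce.100.GCE, dataGCE$y, which = "RMSE")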

If there were some prior information on the distribution of \(\mathbf{w}\), a similar analysis could be done for noise.signal.points.

Conclusion

The minimum cross-entropy formalism allows pre-sample information to be incorporated as a non-uniform prior on the support points, which can improve the precision of the estimates.

References

1. Kullback S, Leibler RA. On information and sufficiency. The Annals of Mathematical Statistics. 1951;22:79-86. doi:10.1214/aoms/1177729694
2. Lindley DV, Kullback S. Information theory and statistics. Journal of the American Statistical Association. 1959;54:825. doi:10.2307/2282528
3. Good IJ. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics. 1963;34:911-934. doi:10.1214/aoms/1177704014

Acknowledgements

This work was supported by Fundação para a Ciência e Tecnologia (FCT) through CIDMA and projects https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020.