Estimation of Mutual Information

Abstract: This article explores the difficult problem of estimating the mutual information between two variables from a sample of data. I use examples to demonstrate the challenges, and introduce a new algorithm for estimating mutual information along with an explicit representation of its uncertainty.

measure of association is a function that rates the strength of statistical dependence between two variables. In casual talk, people often express this idea by asserting that two things are “correlated”; however, in statistics the term correlated has a more precise meaning, where correlation (or more precisely, Pearson correlation) is a measure of how linear a statistical dependence is. Also in common usage is rank correlation, which measures how monotonic a statistical dependency is. But many dependencies can elude these measures, motivating the use of other measures of association.

uncorrelated association

Figure 1: X & Y have a clear dependence, but correlation and rank correlation are zero.

Another possible measure is mutual information, I(x,y). In Figure 1, the variables have a zero correlation and zero rank correlation, but have 6.5 bits (4.5 nats) of mutual information (when x & y are each quantized into 32,000 distinct discrete values), demonstrating that this measure can pick up on dependencies undetected by straight correlation.

Although promising as a measure of association, it is difficult to estimate I(x,y) from empirical data. I will explore these difficulties in the remainder of this article, and introduce a new algorithm for doing so.

Mutual Information

I limit the scope of this article to discrete variables and discrete probability distributions. The issues here are also important to estimation in the case with continuous variables, since estimation techniques often partition continuous values into discrete bins (Marek & Tichavský (2009)).

Given a joint probability distribution, P(x,y), over two discrete variables, the mutual information I(x,y) between x and y is defined as

I(x,y) = \sum_{x,y} P(x,y) \ln {{P(x,y)}\over{P(x) P(y)}}

The sum excludes terms with P(x,y)=0. The base 2 logarithm is often used, in which case result is in units of bits. The natural logarithm can alternatively be used, in which case the result is in units of nats. In this article I use the natural log. The most important thing to notice about this definition is that it defines mutual information in terms of the underlying probability distribution, which is unknown to us when we are trying to estimate I(x,y) from sample data.

Mutual information is always non-negative. It is zero when x and y are statistically independent. It is not limited to the range from 0 and 1, but it is always less than the entropy of each variable. The entropy of a variable is the mutual information of that variable with itself, I(x,x).

H(x) = I(x,x)

0 \le I(x,y) \le \min\left( H(x), H(y) \right)

Simple Estimator

Let (X_i, Y_i) be a data set of N data points. To compute the empirical mutual information, count the frequencies of each combination of values, freq(x,y), and then compute:

freq(x) = \sum_y freq(x,y)

freq(y) = \sum_x freq(x,y)

I_N(x,y) = \sum_{x,y} freq(x,y) \ln {{freq(x,y)}\over{freq(x) freq(y)}}

when computing the sum, only terms where freq(x,y)>0.

Estimation Bias

The simple estimator, I_N(x,y), does not yield an unbiased estimate of I(x,y). To illustrate this, I ran 1000 experiments. In each, I used Analytica to generate a randomized underlying joint probability distribution, from which I computed the underlying “true” mutual information, I(x,y). Each of the two variables had 10 possible values. Using that distribution, I then sampled 100 data points and computed I_N(x,y) via the simple estimator. The following plot compares the true underlying mutual information with the empirical I_N(x,y) from the simple estimator.

estimation of I(x,y)

Figure 2: Empirically estimated IN(x,y) versus the true underlying I(x,y).

Each data point in the above graph corresponds to one experiment, there are 1000 points shown. Notice that when the true underlying I(x,y) is 0.1, the simple estimator consistently returns a value around 0.6. These very poor estimates illustrate the inherent bias in the simple estimator. Figure 3 shows the difference between the estimated and actual, or sampling error.

Figure 3 - Sampling error: The difference  I<sub>N</sub>(x.y) - I(x,y) versus the underlying I(x,y).

Figure 3 – Sampling error: The difference between the estimated and true I(x,y) as a function the true underlying I(x,y).

Figure 3 illustrates that the bias in the estimation decreases as the level of statistical dependence increases.

Independent variables

It is fairly easy to analyze the case of statistically independent variables. When x and y are independent, their mutual information is 0. But since the simple estimator only adds together positive quantities, it always returns a positive value, never reaching zero.

The simple estimation formula for the independent case, multiplied by 2 N, is the same as the G statistic used in a statistical G-test, and this statistic is known to have a chi-squared distribution (Harremoës & Tusnády (2012)).

The distribution of estimated I(x,y) values when x and y are independent.

Figure 4: The distribution of estimated I_N(x,y) values when x and y are independent.

In Analytica, you can plot of the distribution for I_N(x,y) using the expression

ChiSquared( (K-1)^2 ) / (2 * N)

where K are the number of distinct discrete values for x and for y, with K=10 in my example. N is the sample size and is 100. The plot shows that when x and y are independent, you can expect to get an estimated value around 0.4, and it is very unlikely you’ll see an estimate below 0.2.

The mean of this distribution, which is the mean bias of the simple estimator in the independent case, is (K-1)^2/(2 N), the standard deviation is (K-1)/(\sqrt{2} N), which is 0.064 in the current example.

I experimented with independent variables, varying same sizes (N) and number of discrete outcomes (K), and found the theoretical distribution in Figure 4 is a perfect match to empirical simulations.

Dependent variables

In Figure 3, the dispersion of I_N(x,y) is nearly twice the standard deviation predicted for the independent case. Plus, the bias in the estimate decreases as the underlying mutual information increases.

First, I obtain a mean estimate of the bias as follows. Empirically from Figure 3, we see that the mean bias decreases linearly as I(x,y) goes from the independent case to the maximum value, which I estimate as \min( H(x), H(y) ). At I(x,y), the mean bias is (K-1)^2 / (2 N), and at the maximum I(x,y) it is 0. From the data alone, we don’t have access to the true underlying I(x,y) or entropies H(x) and H(y), so we have to use I_N(x,y) and H_N = \min( I_N(x,x), I_N(y,y) ).

E[ I_N(x,y) ] \approx {( H_N - I_N(x,y) )\over{H-c}} c

where c = (K-1)^2 / (2 N). This is an empirically derived formula for the E[ IN(x,y) ], not a mathematically derived one. I discovered through experimentation that a slightly improved fit results when you multiply H_N by 0.8 before inserting it into the formula for E[ IN(x,y) ] — this twiddle reflects the reality that the empirical entropy is not a perfect estimator of the upper bound. With these adjustments, Figure 5 depicts the remaining sampling error.

Figure 5: Sampling error after mean adjustment.

Figure 5: Sampling error after mean adjustment.

The thing to notice from Figure 5 is that the adjusted sampling error (the vertical axis) is centered around zero, demonstrating that the adjustment accounts correctly for the sampling bias across all levels of statistical dependence.

The standard deviation of the adjusted sampling error in Figure 5 is 0.12, compared to 0.064 when I(x,y)=0. This difference between the dispersion in the dependent and independent cases grows when we look at larger sample sizes. When the sample size is 1000, the empirical dispersion for the dependent case is about 5 times greater than for the independent case, and at N=10,000 it is 16 times greater. So the dispersion in sampling error evidently scales differently in the dependent and independent cases. The log-log plot it Figure 6 shows how increasing the sample size, N, decreases the empirical standard deviation for both the dependent and independent cases.

Figure 6: Dispersion in sampling error as sample size changes. The variance for dependence reduces more slowly than for independent variables.

Figure 6: Dispersion in sampling error as sample size changes. The variance for dependence reduces more slowly than for independent variables.

Figure 6 shows that the standard deviation in sampling error decreases linearly when plotted on a log-log plot for both the dependent and the independent cases, but whereas the slope is -1 for the independent case, it is -0.5 for the dependent case. This leads to the following formula for predicting the standard deviation is sampling error in I_N(x,y) for dependent variables:

\sigma = \gamma (K-1) / \sqrt{2 N}

Where \gamma=0.18 accounts for the y-intercept in the log-log plot. Once again, this is an empirical observation, not a theorem.

Bias Compensation

Numerous authors have advocated methods of subtracting the expected bias from the simple estimator as a way of obtaining an unbiased estimator of I(x,y)  (Miller (1955), Chee-Orts & Optican (1993), Hertz et al. (1992), Panzeri & Treves (1996)). To a crude first order, one can simply subtract (K-1)^2/(2*N), the bias for the independent case. The various authors have approached this by finding various ways to obtain a more refined estimate of bias.

Bias subtraction has the obvious drawback that your estimation of I(x,y) after subtracting bias may be negative, which has no obvious interpretation. These papers generally focus on obtaining a single unbiased estimate of  I(x,y), whereas I have an interest in obtaining information about the degree of uncertainty in the estimate (the Chee-Orts & Opticon (1993) method has potential to be adapted to include uncertainty). Several of the proposed bias compensation offsets depend only or the granularity of the discetization and the sample size, and therefore do not capture the effect depicted in Figure 3, where it is seen that the bias decreases with increasing statistical dependence.  The algorithm that I introduce in the next section is intended to address these issues.

Estimation Algorithm

In this section, I introduce a new algorithm for estimating the posterior distribution of I(x,y) from sampled data. I derive the algorithm by inverting sampling error of IN from the preceding sections. It is equivalent to using Bayes’ rule with a uniform prior over I(x,y).

The idea is depicted graphically as follows. Suppose we receive a data set and compute I_N(x,y) = 1.1. The empirical model derived in the preceding sections gives us a statistical model of the sampling error, which characterizing the scatter of the points in Figure 7. We take a slice at the measured value for I_N(x,y) as shown by the dotted blue line.

Figure 7: Sample information tells us we're on the blue line.

Figure 7: Sample information tells us we’re on the blue line.

The posterior estimation of I(x,y) is a unimodal distribution along the blue dotted line in Figure 7. To make it easy to invert, I use the characterization of the bias to draw a line through the middle of the cloud of point, along with a band on each side.  To do this, let

y(x) = c + (1-c/{H_N}) x

c = (K-1)^2 / (2 N)

\sigma = 0.18 (K-1) / \sqrt{2 N}

Note: I have found that substituting 0.8 H_N for H_N works better, and have done so in the graphs and results here.

Figure 9 plots y(x) along with y(x) \pm \sigma, which is a graphical depiction of the distribution of IN(x,y) as a function of I(x,y).

Figure 8: Depiction of the distribution of IN using bands at ±1 standard deviation.

The bands depiction in Figure 8 is somewhat tailored to the actual data sample, unlike the scatter in Figure 7, since it used H_N. By inverting the lines in Figure 8, we obtain the posterior mean and standard deviation (ignoring the boundary at I(x,y)=0).

\mu_{post} = H_N ( I_N(x,y) - c) / (H_N - c)
\sigma_{post} = \sigma H_N / (H_N - c)

We can then represent the posterior estimate of I(x,y) using a truncated Normal distribution. For example, in Analytica syntax,

Truncate( Normal( mu_post, sigma_post ), 0 )

Note that \mu_{post} and \sigma_{post} are not the final posterior mean and standard deviation — they are the mean and standard deviation of the Normal before Truncation.

Examples

All examples in this section use discrete variables having K=10 possible values.

In the first example, I sampled N=100 points from a distribution having I(x,y)=0.736. The sample information was I_N(x,y)=1.1. The estimation algorithm produces posterior estimate shown in Figure 9.

Figure 9: Posterior of I(x,y) in Example 1.

Figure 9: Posterior of I(x,y) in Example 1.

In the second example, I sampled N=100 points from a distribution having I(x,y)=0.33. The empirical I_N(x,y)=0.7 and the posterior is shown in Figure 10.

Posterior distribution for Example 2.

Figure 10: Posterior distribution for Example 2.

The third example sampled N=100 points from an underlying distribution having I(x,y)=0.046. The empirical information was 0.55 and the posterior estimate for I(x,y) is shown in Figure 11.

Figure 11: Posterior for Example 3.

Figure 11: Posterior for Example 3.

Summary

Mutual information can capture complex associations between variables that are missed by standard correlation measures. However, estimating the underlying mutual estimation from a data sample is challenging, due to an inherent bias present in the empirical mutual information. In left uncompensated, this bias magnifies the apparent degree of statistical dependence for cases having little or no dependence.

This article introduces a new Bayesian-style algorithm, derived from an empirical study across a sampling of distributions, for estimating I(x,y) from a data sample, with an explicit representation of the uncertainty in the estimate.

You may find it disturbing that I developed the algorithm from an empirical study. It is entirely possible that my sampling of underlying distributions is not representative of a distribution you encounter in your own application, and that could throw off the quality of the results. If you are academically minded, you may find the absence of an analytical derivation behind the algorithm to be disconcerting. These are all valid criticisms.

References

  • M. N. Chee-Orts & L. M. Opticon (1993), Cluster method for the analysis of transmitted information in multivariate neuronal data. Biol. Cybernetics 69:29-35.
  • P. Harremoës & G. Tusnády (2012), Information divergence is more Χ^2-distributed than the Χ^2-statistics.
  • J. A. Hertz, T. W. Kjaer, E. N. Eskander and B. J. Richmond (1992), Measuring natural neural processing with artificial neural networks. Int. J. Neural Systems 3 supplement 91-103.
  • T. Marek & P. Tichavský (2009), On the estimation of mutual information, ROBUST 2008, JCMF 2009:263-269.
  • G. A. Miller (1955), Note on the bias on information estimates. Information Theory in Psychology: Problems and Methods II-B:95-100.
  • S. Panzeri & A. Treves (1996), Analytical estimates of limited sampling biases in different information measures. Network Computation in Neural Systems 7:87-107.
Be Sociable, Share!

    Comments

    1. Lonnie, enjoyed reading your article! Very simple and clean, much better than what I could have said.

      I am interested in entropy (and MI) estimation, and have been looking for a good example for ages. I could not find a really good one. All the ones I have seem are superficial ones. Do you have a good example with real data and real application, like genes or image detection etc.? If so, would you share?

      • The application that motivated this is in the area of Hedge Fund management, where this is being applied by a hedge fund manager who is an Analytica user. It is a real application and has been integrated into his tool suite.

        In a hedge fund, you are often looking for connections between things, but many connections are non-linear, and even non-monotonic. Options and spreads make it even more so. So to use a measure of association for these (for qualitative insight), standard measures of correlation don’t work well, which led to this work.

    2. “But since the simple estimator only adds together positive quantities, it always returns a positive value, never reaching zero.”

      This isn’t actually true – in order for freq(x,y) > freq(x)*freq(y), there must be another state, (x’,y’), in which freq(x’,y’) < freq(x')*freq(y'). Thus, you have terms where you are taking the log of numbers less than 1, and the term is negative.

      If you're sampling a 100-state distribution 100 times, you've got a terrible sample to make an estimate with. Look at how the error decreases as the number of samples increases, and you'll see that it quickly converges towards the true MI.

    3. I think you are right. It always returns a non-negative value. And it is fairly easy to see that it could reach zero. I don’t think that impacts any of the other result. Thanks for pointing it out.

      I’ve limited this study to cases where K^2N, it felt like I would be asking for trouble. Certainly at K=N=100 it would be highly degenerate.

    4. avatar Xavier Gréhant says:

      Lonnie, I found your article very convincing.
      I’m looking for a quick and accurate way to estimate mutual information with python. I would be interested in how you think pyentropy would handle it. They refer to bias to some extent on this page: http://robince.github.io/pyentropy/entropy.html. Thanks!

    5. avatar Arno Stefani says:

      Hi Lonnie,

      first it’s a nice article. I just have one question. How did generate you generate your true underlying distribution randomly? Or more specific did you make sure that you pick your distributions uniformly from the probability simplex?

      • Arno,

        Good question. In each “experiment”, I randomly generated a joint probability matrix p(x,y) — where x and y each have 10 values. Since I ran 1000 experiments, this joint distribution was different in each one. However, I didn’t use a uniform “meta” distribution for generating p(x,y). The uniform method picks a random number for each cell, and then normalize to they add to 1. The trouble is that this method tends to generate nearly uniform marginals, and a dearth of variety from experiment to experiment. To promote variety, I used exponential distributions initially to spread the points out. There is nothing magic or theoretical about what I ended up with, but I just sort of tweaked the algorithm until I ended up with what I felt was a rich variety of joint distributions. The Analytica code that generated these distributions was the following:

        var a := if Random(over:Real_case)<0.1
        Then Random(Exponential(1,over:i,j,Real_case))
        Else Random(Exponential(30,over:i,j,Real_case));
        a:= uniform(0,.5)*(i=j) + a;
        var m := Random(Uniform(.25,3,over:Real_case));
        a^m/sum(a^m,I,J);

        In this code, i, j and Real_case are indexes. Real_case is the "experiment" index (1..1000). Indexes i and j correspond to the x and y dimensions, each 1..10. The result in as array with dimensions [Real_case,i,j] — for each real_case, [i,j] is the joint probability table.

    Speak Your Mind

    *