Here I compare the loss of information due to quantization with that due to the addition of noise
Quantizing and adding noise to a latent representation are two ways to
"reduce information", or create a "bottleneck". This note shows how that
works, and how the two are comparable in terms of the mutual information
between the original variable and its reduced form.
Motivation
Deep learning models sometimes employ the information-reduction techniques
of quantization or noising as a means of regularization. One of the
effects of such an operation is to discard information. Specifically, the
Markov chain X \leftrightsquigarrow \text{reduce}(X)
\leftrightsquigarrow Y can produce no greater (and typically a lower)
value of I(X;Y) than the chain X \leftrightsquigarrow
Y. Seen against the backdrop of Tishby's Information
Bottleneck principle, we can ask: what is the mutual information
I(X; \text{reduce}(X)), where
\text{reduce}(X) is either a quantized or a noised version of
X?
Although all random variables in an implemented model are effectively
discrete, we consider both continuous and discrete cases for theoretical
completeness. To avoid confusion: the quantity considered here is the
mutual information between a variable and its reduced transformation. One
could also ask what the difference in mutual information I(X; X) -
I(X; \text{reduce}(X)) is, as a way of comparing a model that does
or does not use a form of information reduction.
Setup
Assume the following (rather contrived) variables:
\begin{aligned}
X \in \Bbb R &\sim U[0, 8) \\[1ex]
Q \in \{0, \tfrac{1}{2}, 1, \ldots, \tfrac{15}{2}\} &= \dfrac{\lfloor 2X \rfloor}{2} & \text{"quantized": } X \text{ rounded down to the nearest } \tfrac{1}{2} \\[1ex]
N \in \Bbb R &= \left(X + U[0, \tfrac{1}{2})\right) \bmod 8 & X \text{ with added noise in } [0, \tfrac{1}{2})
\end{aligned}
Q rounds X down to the nearest multiple of
\tfrac{1}{2}, while N adds noise drawn uniformly from
[0, \tfrac{1}{2}), wrapping around in case the result falls outside
X's domain. Thus X, N,
and Q are all uniformly distributed: X and
N over [0, 8), and Q over its 16 possible values.
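As a concrete reference for the calculations below, here is a minimal NumPy sketch of sampling the three variables (the variable names and the sample count are illustrative choices, not part of the definitions above):

import numpy as np

rng = np.random.default_rng(0)
num_samples = 1_000_000                                # illustrative sample count

x = rng.uniform(0.0, 8.0, num_samples)                 # X ~ U[0, 8)
q = np.floor(2 * x) / 2                                # Q: X rounded down to the nearest 1/2
n = (x + rng.uniform(0.0, 0.5, num_samples)) % 8.0     # N: X plus U[0, 1/2) noise, wrapped mod 8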
From these definitions we have the following properties. For clarity I
notate quantities on discrete variables with \Sigma and on
continuous variables with \int. As I discuss below, these
two kinds of entropy cannot be mixed in a single expression; however, each
decomposition of mutual information below combines two quantities of the
same type.
Quantization Mutual Information
Here is the calculation of mutual information due to quantization.
\begin{aligned}
H_{\int}(X) &= 3 \\
H_{\int}(X|Q) &= -1 \\
I_{\int}(X;Q) &= H_{\int}(X) - H_{\int}(X|Q) = 4
\end{aligned}
X is uniform on [0, 8), so H_{\int}(X) = \log_2(8) = 3.
Knowing Q in all cases narrows down the value of
X to a uniform interval of length \tfrac{1}{2},
which has differential entropy \log_2(\tfrac{1}{2}) = -1, so our mutual
information is 3 - (-1) = 4 bits.
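A quick empirical check of this step, under the sampling sketch above: the residual X - Q should be uniform on [0, \tfrac{1}{2}), which is exactly the length-\tfrac{1}{2} conditional distribution whose differential entropy is -1.

resid = x - q                                          # X within its quantization cell
hist, _ = np.histogram(resid, bins=10, range=(0.0, 0.5))
print(hist / hist.sum())                               # expect roughly 0.1 in every bin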
We can calculate this the other way as well:
\begin{aligned}
H_\Sigma(Q) &= 4 \\
H_\Sigma(Q|X) &= 0 \\
I_\Sigma(X;Q) &= H_\Sigma(Q) - H_\Sigma(Q|X) = 4
\end{aligned}
Q has entropy \log_2(16) = 4 bits because it is a uniform discrete
variable with 16 classes. Knowing X determines
Q exactly, making the conditional distribution one-hot (entropy 0), so
again we get a mutual information of 4. Unlike the previous calculation,
we are using discrete entropies here.
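Since Q is a deterministic function of X, I(X;Q) = H_\Sigma(Q), so a plug-in entropy estimate over the 16 classes should land near 4 bits. A sketch reusing the x and q arrays from the setup code:

classes, counts = np.unique(q, return_counts=True)     # the 16 quantization levels
p = counts / counts.sum()
h_q = -np.sum(p * np.log2(p))                          # plug-in estimate of H_Sigma(Q)
print(len(classes), h_q)                               # expect 16 classes, ~4.0 bits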
Noise Mutual Information
Adding noise has a similar effect. We again start out with a
differential entropy of 3 for X. Conditioning on
N narrows X to a uniform interval of length
\tfrac{1}{2} (namely [N - \tfrac{1}{2}, N), modulo 8), just as in the
quantized case.
\begin{aligned}
H_{\int}(X) &= 3 \\
H_{\int}(X|N) &= -1 \\
I_{\int}(X;N) &= 4
\end{aligned}
Calculating the other way, N is uniform on [0, 8) thanks to the
wrap-around, so H_{\int}(N) = 3. Conditioning on
X narrows N to a uniform interval of length
\tfrac{1}{2} (namely [X, X + \tfrac{1}{2}), modulo 8), with differential
entropy -1, so we again have a mutual information of 4.
\begin{aligned}
H_{\int}(N) &= 3 \\
H_{\int}(N|X) &= -1 \\
I_{\int}(X;N) &= 4
\end{aligned}
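To check I(X;N) = 4 numerically, one option (an assumption of this sketch, not a method from the note) is to discretize both X and N onto a fine grid and compute the plug-in mutual information of the discretized pair; with enough samples, the estimate approaches 4 bits from below as the grid gets finer. Reusing x and n from the setup code:

def binned_mi(a, b, width):
    # Plug-in mutual information (in bits) after binning both variables at the given width.
    ia = np.floor(a / width).astype(int)
    ib = np.floor(b / width).astype(int)
    joint = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(joint, (ia, ib), 1)                  # joint histogram
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)          # marginal of a
    pb = joint.sum(axis=0, keepdims=True)          # marginal of b
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz]))

for width in (1/2, 1/4, 1/8, 1/16):
    print(width, binned_mi(x, n, width))           # climbs toward 4 bits as width shrinks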
Comparison of Quantization and Noise
So quantizing to the nearest \tfrac{1}{2} value produces a
relationship with the same mutual information as adding uniform noise
over an interval of the same size. In general, the smaller the
quantization interval or noise interval, the more mutual information is
preserved. In the limit of a vanishing interval, we approach the maximal
mutual information.
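The same two-line calculation generalizes (writing \text{reduce}_\Delta for either quantization to bins of width \Delta or addition of uniform noise of width \Delta, a notation introduced here for convenience):

\begin{aligned}
I(X; \text{reduce}_\Delta(X)) &= H_{\int}(X) - H_{\int}(X|\text{reduce}_\Delta(X)) \\
&= \log_2(8) - \log_2(\Delta)
\end{aligned}

which recovers 4 bits at \Delta = \tfrac{1}{2} and grows without bound as \Delta \to 0.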
Varying the density of a quantization codebook, or the variance of the
noise added per data point, can allow a network to learn how much mutual
information to preserve on a per-sample basis.
An inconsistency between discrete and continuous ("differential") entropy formulas
As an aside, in the case of continuous random variables this limit is
infinity, which is not consistent with the identity I(X;X) =
H(X). In other words, a deterministic mapping from a continuous variable
to another continuously distributed variable carries infinite mutual
information. In practice there are no such variables - everything is
quantized.
Discrete entropy is formulated as:
H_\Sigma(X) \equiv - \sum_{i=1}^{N}{p(x_i) \log(p(x_i))}
It is bounded in [0, \log(N)].
Differential entropy is formulated as
H_{\int}(X) \equiv - \int_{\mathcal{X}}{d(x) \log(d(x)) \,dx}
where d is the density of X and \mathcal{X} its support.
It is bounded in [-\infty, \log(|\mathcal{X}|)], where
|\mathcal{X}| is the measure (length) of the support.
Both attain their maximal value for uniform distributions. Discrete
entropy attains its minimum of 0 for any one-hot distribution, while
H_{\int} takes its minimum of -\infty for
any distribution with support of measure zero.
The Shannon coding interpretation of entropy is the minimum expected
codeword length for symbols sampled i.i.d. from the distribution.
Unfortunately, H_{\int} doesn't conform to this, because it
would take infinitely many bits to represent a value in a continuous
domain. The notion of entropy as self-information, H(X) = I(X;X),
also fails for differential entropy, though it fails in a serendipitous
way: I(X;X) is calculated by subtracting entropies H_{\int}(X) - H_{\int}(X|X),
and the second term is -\infty, so we
would indeed get \infty (correctly) as our value for
self-information, even though H_{\int}(X) itself is finite.
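A way to see both failures at once is the standard relation between the two entropies: discretizing a continuous X at bin width \Delta gives H_\Sigma(X_\Delta) \approx H_{\int}(X) + \log_2(\tfrac{1}{\Delta}), which diverges as \Delta \to 0. A small check of this for our uniform X, reusing x from the setup code:

for delta in (1.0, 1/2, 1/4, 1/8, 1/16):
    counts = np.bincount(np.floor(x / delta).astype(int))
    p = counts[counts > 0] / counts.sum()
    h_sigma = -np.sum(p * np.log2(p))              # discrete entropy of X binned at width delta
    print(delta, h_sigma, 3 + np.log2(1 / delta))  # the two should agree closely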