I present a modified VAE that combines the prior and recognition model into a
single circuit, and trains it using a predictive coding loss as well as a
reconstruction loss through the decoder.
Kingma and Welling derive a training objective for a variational autoencoder.
The derivation can be understood, I think a little more easily than in their
original presentation, as follows.
They start out with a data set X which is assumed to be
distributed according to a model p_\theta(z) p_\theta(x|z),
and the goal is to estimate \theta. At this point, the only
useful property of learning this relationship is that the model can then be
sampled to generate new samples from p_\theta(x). Nothing is
claimed about any useful properties of z itself.
Although they make the point that the prior p_\theta(z)
can in principle be complex, they work through an example in which it is an
isotropic standard multivariate normal distribution. The fact that this is a
smooth distribution allows one to visualize how the complex output space in
X changes as z gradually changes. Other
than that, the only practical significance is the ability to create samples.
For the moment, let's accept that learning \theta is a
worthy goal, and follow their argument about how to do so. The first step is
to minimize the cross-entropy \mathbb{E}_{x \sim
\text{Data}}[-\log p_\theta(x)]. Estimating this through the identity
p_\theta(x) = \mathbb{E}_{z \sim p_\theta(z)}[p_\theta(x|z)]
by sampling z from the prior is intractable, because in general, the region in
z-space where p_\theta(x|z) has high density will be small, and it is
unknown where this region is. So, we can never get enough samples of
z to make the approximation accurate.
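To make the intractability concrete, here is a toy numeric sketch (my own construction, not from the paper): a narrow Gaussian decoder in ten dimensions, where the naive prior-sampling estimate of \log p_\theta(x) falls far short of the exact value even with many samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10, 0.1                         # latent dim; p(x|z) = N(x; z, sigma^2 I)
z_true = rng.normal(size=d)
x = z_true + sigma * rng.normal(size=d)    # one observed data point

def log_p_x_given_z(z):
    return -0.5 * np.sum((x - z) ** 2) / sigma ** 2 - d * np.log(sigma * np.sqrt(2 * np.pi))

# exact value for comparison: marginally x ~ N(0, (1 + sigma^2) I)
exact = -0.5 * np.sum(x ** 2) / (1 + sigma ** 2) - 0.5 * d * np.log(2 * np.pi * (1 + sigma ** 2))
print("exact log p(x):", exact)

for n in [1_000, 100_000]:
    logs = np.array([log_p_x_given_z(z) for z in rng.normal(size=(n, d))])
    m = logs.max()                          # log-mean-exp for numerical stability
    print(n, "samples:", m + np.log(np.mean(np.exp(logs - m))))  # badly underestimates
```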
But, there is a nice identity:
p_\theta(x) \equiv \dfrac{p_\theta(x, z)}{p_\theta(z|x)}
which is true for any value of z. But now the
problem is that we cannot calculate p_\theta(z|x).
That density is defined implicitly, or "induced", as I like to think of
it, as
p_\theta(z|x) \equiv \dfrac{p_\theta(z)\, p_\theta(x|z)}{\int p_\theta(z')\,
p_\theta(x|z')\, dz'}
We cannot calculate it, but it is helpful to know that it is a well-defined
entity, determined through the computable expressions
p_\theta(z) and p_\theta(x|z).
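As a tiny illustration of "induced" (my own toy, with made-up numbers): when z is discrete with a handful of states, the normalization above is directly computable; the intractability only bites when z is continuous and high-dimensional.

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])            # p_theta(z) over three latent states
p_x_given_z = np.array([0.01, 0.2, 0.05])  # p_theta(x|z) evaluated at one observed x
joint = p_z * p_x_given_z                  # p_theta(x, z)
p_z_given_x = joint / joint.sum()          # induced p_theta(z|x): [0.067, 0.8, 0.133]
```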
Instead, what Kingma and Welling do is introduce a distribution
q_\phi(z|x) and set up an objective so that it is forced to
be closer and closer to the induced distribution
p_\theta(z|x). They rewrite the above as an approximation
times an error factor:
p_\theta(x) \equiv \dfrac{p_\theta(x, z)}{q_\phi(z|x)}
\dfrac{q_\phi(z|x)}{p_\theta(z|x)}
or
\log p_\theta(x) \equiv \log \dfrac{p_\theta(x, z)}{q_\phi(z|x)} + \log
\dfrac{q_\phi(z|x)}{p_\theta(z|x)}
Note that, even though the right-hand side contains terms parameterized by
\phi, the left-hand side does not depend on
\phi, which is why \phi does not appear in
its parameterization: its effects cancel.
So now, we have two terms: the first, which is computable and serves as an
approximation, and the second, which is a correction factor, or error
term. Again, this holds for any z, no matter where
it comes from.
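Continuing the discrete toy from above, here is a numeric check (mine) that the decomposition holds exactly for every z and for an arbitrary choice of q:

```python
import numpy as np

p_z = np.array([0.5, 0.3, 0.2])
p_x_given_z = np.array([0.01, 0.2, 0.05])
p_x = np.sum(p_z * p_x_given_z)            # true marginal p(x)
p_z_given_x = p_z * p_x_given_z / p_x      # induced posterior
q = np.array([0.1, 0.6, 0.3])              # an arbitrary q_phi(z|x)

for z in range(3):
    approx = np.log(p_z[z] * p_x_given_z[z] / q[z])   # log p(x,z)/q(z|x)
    error = np.log(q[z] / p_z_given_x[z])             # log q(z|x)/p(z|x)
    assert np.isclose(approx + error, np.log(p_x))    # equals log p(x) for every z
```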
The next idea puts together four facts conveniently. First is that,
after the log transformation, we can express the product of the two factors
as a sum of logs. Second, because the full expression is constant over
z, we can take an expectation of that expression over any
distribution of z.
\log p_\theta(x) \equiv \mathbb{E}_{z \sim p_{any}(z)}[\log
\dfrac{p_\theta(x, z)}{q_\phi(z|x)} + \log
\dfrac{q_\phi(z|x)}{p_\theta(z|x)}]
Third, if we specifically choose
q_\phi(z|x) as the distribution, then the second expectation
becomes a KL-divergence term, which is known to be always nonnegative.
\begin{aligned}
\log p_\theta(x) &\equiv \mathbb{E}_{z \sim q_\phi(z|x)}[\log
\dfrac{p_\theta(x, z)}{q_\phi(z|x)} + \log
\dfrac{q_\phi(z|x)}{p_\theta(z|x)}] && \text{Expectation of constant = constant}\\
&= \mathbb{E}_{z \sim q_\phi(z|x)}[\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)}] +
\mathbb{E}_{z \sim q_\phi(z|x)}[\log \dfrac{q_\phi(z|x)}{p_\theta(z|x)}]
&& \text{Separate out non-computable terms}
\\
&= \mathbb{E}_{z \sim q_\phi(z|x)}[\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)}] +
D_{KL}[q_\phi(z|x) \| p_\theta(z|x)]
&& \text{Recognize KL-divergence}
\\
&\ge \mathbb{E}_{z \sim q_\phi(z|x)}[\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)}]
&& \text{Nonnegativity of KL-divergence}
\end{aligned}
Intuitively, what is happening is that each term in the expectation will
be pushed up. Recall that \dfrac{d}{dx} \log x = \dfrac{1}{x}.
So, the smaller q_\phi(z|x) is, the
steeper the gradient pushing it down. Meanwhile, the smaller
p_\theta(x, z) is, the steeper the gradient pushing it up.
Finally, now that we have two expectations, one of which is always
nonnegative, the other is a lower bound for the full expression.
Conveniently, we can calculate the value of the first term. Even though
we cannot calculate the value of the KL-divergence term, that doesn't prevent
us from maximizing the full expression, since we can just maximize a lower
bound for it.
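To make the bound concrete, here is a minimal sketch of the standard one-sample estimator of this lower bound, under the usual Gaussian assumptions (a diagonal-Gaussian q_\phi and a standard normal prior; enc and dec are placeholder networks of my own naming, with dec returning \log p_\theta(x|z)):

```python
import math
import torch

def elbo(x, enc, dec):
    # q_phi(z|x) = N(mu, diag(exp(log_var))); draw z by reparameterization
    mu, log_var = enc(x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    log2pi = math.log(2 * math.pi)
    log_pz = -0.5 * (z ** 2 + log2pi).sum(-1)            # log p(z), standard normal prior
    log_px_z = dec(x, z)                                 # log p_theta(x|z) from the decoder
    log_qz = -0.5 * ((z - mu) ** 2 / log_var.exp()       # log q_phi(z|x) at the same z
                     + log_var + log2pi).sum(-1)

    # E_{z ~ q}[log p(x, z) - log q(z|x)]: the computable lower bound
    return (log_pz + log_px_z - log_qz).mean()
```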
Recall that the full expression given above has the same value for any
z in the domain. But, for any particular choice of z,
the first term (the approximation) will vary, and the second term (the
correction factor) will vary in the opposite direction, so that their sum
stays constant.
First, the error term will be much more accurate for z
values at which both q_\phi(z|x) and
p_\theta(z|x) are high. This isn't a mathematically
rigorous statement, really; but, intuitively, the ratio is less sensitive to
absolute differences when both values are high. Second, if we calculate an
expectation over samples from q_\phi(z|x):
\begin{aligned}
\mathbb{E}_{z \sim q_\phi(z|x)}[
\log \dfrac{p_\theta(x, z)}{q_\phi(z|x)} + \log
\dfrac{q_\phi(z|x)}{p_\theta(z|x)}]
\end{aligned}
then the second term becomes the nonnegative KL-divergence between
q and what it is approximating. It is important to
interpret this formula correctly. Remember that the term
p_\theta(z|x) is not directly modeled but rather "induced", and
that it is not computable even from the modeled distributions it is induced
from. It's not obvious, but the two terms are connected in such a way that
maximizing the first term will drive the second term to zero. At first I
wasn't sure what to make of this, since the second term is the nonnegative
KL-divergence D_{KL}(q_\phi(z|x)||p_\theta(z|x)), and it
would appear that one of the forces might try to increase it. But the
left-hand side \log p_\theta(x) does not depend on \phi at all, so with
respect to \phi, any increase in the first term must be exactly offset by a
decrease in the KL term; the whole expression is maximized when this term
goes to zero.
In any case, we can calculate the gradients with respect to
\theta and \phi. So, training works, and
we learn all three modeled distributions p_\theta(z),
p_\theta(x|z) and q_\phi(z|x). This means
they are consistent with the underlying joint distribution p(x,
z). Finally, the model marginal
p_\theta(x) asymptotically approaches the data
distribution in the limit of a large amount of data.
Practice error
From the above, we have expressed the learning objective in terms of a
code prediction error and a reconstruction error. The code prediction error
corresponds to the human experience of constantly monitoring
moment-to-moment expectations of what comes next, either passively or when
we take action. The reconstruction error corresponds roughly to a human who
can perceive some data x in the sensory domain and then,
based on that perception z, produce another
x in the data domain. In the way that we train an
autoencoder, the model is forced (teacher forcing) to reproduce the same
x, and we measure the assigned density and optimize it.
I'd like to introduce a new objective, called practice error,
which simulates the human experience of practicing. By
"practicing", I mean that the human has an idea of a desirable behavior of some
sort, which he understands only through his representation of the sense
perceptions resulting from that behavior. The trial-and-error process of
practicing something consists of:
- Obtain (through observation or imagination) some
representation \hat{z} of a desirable observation
- Generate x^{(i)} \sim p_\theta(x|\hat{z}) through bodily
action and the resulting interaction with environment and senses
- Interpret the result through your senses: z^{(j)} \sim
q_\phi(z|x^{(i)})
- Measure the error between \hat{z} and
z^{(j)}
- Based on the error, adjust the behavior generation
p_\theta(x|z) and/or the perception
q_\phi(z|x)
- Repeat
Mapping this process onto an autoencoder might look like an inverted
reconstruction:
\mathbb{E}_{x^{(i)} \sim p_\theta(x|\hat{z})}[r_\phi(\hat{z}|x^{(i)}, g =
\text{open})]
where r_\phi denotes the gated recognition model introduced below, evaluated
with the sensory gate open.
Each x^{(i)} is a separate trial, representing a
particular bodily action, interaction with environment, and resulting sensory
data generated. Once x^{(i)} is generated, in this scheme,
instead of generating a z^{(j)} and measuring some distance,
we simply try to maximize the expected probability assigned to
\hat{z}. This is not the same as step five above, but both
procedures will produce the same result, because XXX
In the above, \hat{z} was obtained "through observation
or imagination". If the goal is to imitate a particular observation
\hat{x}, then it makes sense to perform the above experiment over
multiple zs that arise from it. This would then be:
\mathbb{E}_{z^{(j)} \sim r_\phi(z|\hat{x}, g = \text{open}) \atop
x^{(k)} \sim p_\theta(x|z^{(j)})} [ r_\phi(z^{(j)}|x^{(k)}, g
= \text{open})]
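A sketch of how this objective might be computed (all names mine: recog stands for the gated recognition model r_\phi with the gate open, dec for p_\theta(x|z), both returning torch distributions; gradient estimation through the sampling steps, e.g. reparameterization, is elided):

```python
import torch

def practice_error(x_hat, recog, dec, n_trials=4):
    total = 0.0
    for _ in range(n_trials):
        z_j = recog(x_hat).sample()               # z^(j) ~ r_phi(z | x_hat, g=open)
        x_k = dec(z_j).sample()                   # x^(k) ~ p_theta(x | z^(j)): one "trial"
        total = total + recog(x_k).log_prob(z_j)  # log r_phi(z^(j) | x^(k), g=open)
    return total / n_trials                       # maximize: trials should re-evoke z^(j)
```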
What is a plausible model, inspired by the VAE, of how the brain
processes speech?
I have been considering the VAE model in the context of unsupervised
speech representation learning, using WaveNet as the decoder, where
X is speech audio data, and z is a
time-indexed latent that is supposed to represent something like a "phoneme".
From that point of view, I want to map the various cognitive processes and
experiences onto the expressions p_\theta(z),
q_\phi(z|x), and p_\theta(x|z). I settled
on the following:
First, I note that z and x are
time-indexed, so should be written z \equiv [z_1, z_2, ...,
z_n] and x \equiv [x_1, x_2, ..., x_n]. Second,
the process of silently imagining words in your head, I interpret as sampling
from p(z), in which the brain maintains a current state that
summarizes previously generated z_{t-i}'s. Thus, it is
written autoregressively as p(z_t | z_{t-k}, ..., z_{t-1}).
The process of pronouncing these imagined phonemes and words involves
another circuit that takes this generated state and translates it into motor
output for the tongue, larynx, vocal cords, jaw, etc. This could be modeled
as p_\theta(m_t | z_t), where we have a fixed function,
determined by one's body itself, that translates the motor commands into
actual sound, x_t = f_{body}(m_t), which is not
parameterized. The final model thus would be p_\theta(x_t | z_t)
\equiv f_{body}(p_\theta(m_t|z_t)). An important point here
is that m_t is much lower-dimensional than
x_t since there are relatively few muscles involved, and
the sound waves produced have very high temporal resolution.
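A sketch of this factorization (all names invented; the "body" is a toy sinusoidal renderer standing in for real anatomy): a learned map from z_t to a few motor dimensions, followed by a fixed, unparameterized expansion into many audio samples.

```python
import math
import torch
import torch.nn as nn

dim_z, dim_m, samples_per_step = 64, 8, 160       # few muscles, high-rate audio

to_motor = nn.Linear(dim_z, dim_m)                # learned part: m_t from z_t (deterministic here)

def f_body(m):                                     # fixed by anatomy; never trained
    t = torch.linspace(0, 1, samples_per_step)
    freqs = 100.0 * (1 + torch.arange(dim_m, dtype=torch.float32))
    # each motor channel drives one sinusoidal partial; sum into a waveform
    return (m.unsqueeze(-1) * torch.sin(2 * math.pi * freqs.unsqueeze(-1) * t)).sum(-2)

x_t = f_body(to_motor(torch.randn(dim_z)))        # 160 audio samples from 8 motor numbers
```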
The peculiar thing now is imagining the process of listening to speech. I
feel it is very likely that the same circuits that are involved in
speech imagining, which we modeled as p(z_t | z_{t-k}, ...,
z_{t-1}), must also be involved in listening.
What about the process of listening to speech? The intuition is that
listening (or any experience) is a predictive activity. Based on the recent
past inputs, we are constantly generating predictions about what might come
next. And, moment-to-moment, the hypothesis is that our brains compare these
predictions with the encoding derived from the immediate sensory information.
So, if I were to model this, it would be a recognition circuit that is
augmented with autoregressive context: q_\phi(z_t | z_{t-k},
..., z_{t-1}, x_t).
But, if the imagining process, and the listening process both involve
moment-to-moment integration of recent past state, isn't it likely they would
share the same circuits? The problem is that the VAE formulation doesn't
allow that, because they are two separate models (I think). But, I believe
we can combine them using an attentional trick as follows:
Allow for a gating mechanism, mediated by inhibitory neurons, which either
lets the signal from the sensory processing circuit reach the recurrent
circuit or blocks it. The sensory processing circuit produces a representation
z^{(s)}_t = f_{sense}(x_t), which will be compared to
the autoregressively predicted z_t to produce an error
term e_t. The new setup is now:
\begin{aligned}
z &\equiv [z_1, ..., z_T] \\
p(z) &\equiv \prod_{t=k+1}^T { p_\theta(z_t | z_{t-k}, ..., z_{t-1},
e_{t-1}, g_{t-1} = \text{shut}) } \\
p(z|x) &\equiv \prod_{t=k+1}^T { p_\theta(z_t | z_{t-k}, ..., z_{t-1},
e_{t-1}, g_{t-1} = \text{open}) } \\
e_{t-1} &= f_{error}(z_{t-1}, z_{t-1}^{(s)}) && \text{error} \\
z_{t-1}^{(s)} &= f_{sense}(x_{t-1}) && \text{sense-derived encoding}
\end{aligned}
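Here is a minimal sketch of this setup (my own rendering; module names are invented, and a GRU state stands in for the explicit window z_{t-k}, ..., z_{t-1}): a single autoregressive core acts as the prior when the gate is shut and as the recognition model when it is open, differing only in whether the error signal gets through.

```python
import torch
import torch.nn as nn

class GatedLatentModel(nn.Module):
    def __init__(self, dim_z, dim_x):
        super().__init__()
        self.f_sense = nn.Linear(dim_x, dim_z)    # z_t^(s) = f_sense(x_t)
        self.core = nn.GRUCell(2 * dim_z, dim_z)  # shared autoregressive circuit
        self.out = nn.Linear(dim_z, 2 * dim_z)    # mean and log-variance of z_t

    def step(self, h, z_prev, x_prev=None, gate_open=False):
        if gate_open and x_prev is not None:
            e = z_prev - self.f_sense(x_prev)     # e_{t-1}: prediction vs. sense-derived code
        else:
            e = torch.zeros_like(z_prev)          # gate shut: error signal is ignored
        h = self.core(torch.cat([z_prev, e], -1), h)
        mu, log_var = self.out(h).chunk(2, -1)
        z_t = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return h, z_t                             # rolled forward: p(z) if shut, p(z|x) if open
```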
The error function could be a circuit that is always on, but upstream of
the gating mechanism. If the gate is open, then two things happen. First,
the autoregressive circuit incorporates the error signal and updates itself
to try to minimize error. Second, the error provides additional conditioning
input, along with z_{t-k},...,z_{t-1}, which affects the
prediction of the latent variable at the next timestep.
If the gate is closed, the error signal is ignored. This will be akin to
a person imagining words while ignoring other people talking. The sounds are
going into the person's ears, and some lower level circuits have no choice
but to process the information, but an attention mechanism prevents it from
affecting the person's train of thought. Thus, the prediction of the latent
variable at the next timestep incorporates only the previous latent variables
as context, and there is also no training signal.
Crucially, the system would have to train in such a way that, for any given
z-context z_{t-k}, ..., z_{t-1}, the distribution of
e_t arising from the data has zero mean, and thus:
p(z) \approx \frac{1}{|X|} \sum_{x \in X}{p(z|x)}
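As a toy check of what this consistency means (mine; a one-dimensional linear-Gaussian model where the exact posterior is known in closed form), averaging the posteriors over data drawn from the model reproduces the prior exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
z = rng.normal(size=100_000)                  # z ~ p(z) = N(0, 1)
x = z + sigma * rng.normal(size=z.size)       # x ~ p(x|z) = N(z, sigma^2)

# exact posterior: p(z|x) = N(x / (1 + sigma^2), sigma^2 / (1 + sigma^2))
post_mean = x / (1 + sigma ** 2)
post_std = np.sqrt(sigma ** 2 / (1 + sigma ** 2))
agg = post_mean + post_std * rng.normal(size=x.size)  # one sample from each posterior

print(agg.mean(), agg.std())                  # ~0 and ~1: the aggregate matches p(z)
```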
For this to hold, the circuit must learn over time to make prediction
errors that are zero-mean. In the standard VAE formulation, this kind of
consistency cannot even be expressed, because the prior and recognition model
are completely separate; the gated model above is my proposal for combining
them in a probabilistically consistent way.
The framing takes an insight from McAllester and van den Oord (ref,
predictive coding): the activity of listening to spoken speech and
recognizing phonemes, words, phrases, etc. involves the brain maintaining a
current state which allows it to use information from previous latent states.
As it does this, another circuit processes the current x_t
and generates an "actual" z_t against which the prediction
is compared. In this picture there are three main activities: listening to
spoken speech and recognizing phonemes, words, phrases, etc. in real time;
pronouncing words; and silently imagining them.
The first thing that occurred to me was that sampling from the
prior p(z) is how I should think about the mental process of
imagining (but not pronouncing) words.