Kingma et al. derive a training objective for a variational autoencoder. In my opinion, the derivation can be understood a little more easily than in their original presentation, as follows.

They start out with a data set $X = \{x^{(i)}\}_{i=1}^N$ of i.i.d. samples, assumed to be generated by first drawing a latent code $z$ from a prior $p_\theta(z)$ and then drawing $x$ from a conditional distribution $p_\theta(x|z)$.

Although they make the point that the prior $p_\theta(z)$ could itself be learned, it is typically fixed to something simple, such as a standard normal $\mathcal{N}(0, I)$.

For the moment, let's accept that learning $\theta$ requires evaluating the marginal likelihood

$$p_\theta(x^{(i)}) = \int p_\theta(x^{(i)}|z)\, p_\theta(z)\, dz$$

which is intractable, because in general, the region in z-space where $p_\theta(x^{(i)}|z)$ has appreciable mass occupies a vanishingly small fraction of the prior's support, so estimating the integral with samples from the prior is hopeless.

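To make this concrete, here is a toy numerical illustration (my own, not from the paper): even with a modest 20-dimensional latent and a sharp likelihood, a naive Monte Carlo estimate of $\log p(x)$ from 10,000 prior samples falls far below the true value, because essentially no prior sample lands where $p(x|z)$ has mass.

```python
import math
import random

# Toy Gaussian model: p(z) = N(0, I), p(x|z) = N(z, SIGMA2 * I).
# All numbers are illustrative choices of mine.
random.seed(0)
D = 20                               # latent dimensionality
SIGMA2 = 0.1                         # observation noise variance
x = [1.0] * D                        # an observed data point

def log_normal(v, mean, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (a - b) ** 2 / (2 * var)
               for a, b in zip(v, mean))

# Exact marginal: integrating out z gives x ~ N(0, (1 + SIGMA2) I)
exact = log_normal(x, [0.0] * D, 1.0 + SIGMA2)

# Naive Monte Carlo: p(x) ~ mean over z ~ p(z) of p(x|z)
logs = []
for _ in range(10_000):
    z = [random.gauss(0, 1) for _ in range(D)]
    logs.append(log_normal(x, z, SIGMA2))
m = max(logs)                        # log-sum-exp for numerical stability
naive = m + math.log(sum(math.exp(l - m) for l in logs) / len(logs))

# The naive estimate badly underestimates log p(x), even with 10k samples.
assert naive < exact - 10
```

No prior sample happens to fall near the posterior's mode, so the empirical average misses almost all of the probability mass; this is exactly the failure that motivates introducing a recognition distribution.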
But, there is a nice identity:

$$p_\theta(x^{(i)}) = \frac{p_\theta(x^{(i)}, z)}{p_\theta(z|x^{(i)})}$$

which is true *for any* value of $z$. The catch is that the density in the denominator, the posterior $p_\theta(z|x^{(i)})$, is just as intractable as the marginal. That density is defined implicitly, or "induced", as I like to think of it, as

$$p_\theta(z|x^{(i)}) = \frac{p_\theta(x^{(i)}|z)\, p_\theta(z)}{p_\theta(x^{(i)})}$$

Instead, what Kingma and Welling do is introduce a distribution $q_\phi(z|x)$, with its own parameters $\phi$, and insert it into the identity:

$$p_\theta(x^{(i)}) = \frac{p_\theta(x^{(i)}, z)}{q_\phi(z|x^{(i)})} \cdot \frac{q_\phi(z|x^{(i)})}{p_\theta(z|x^{(i)})}$$

or

$$p_\theta(x^{(i)}) = \frac{p_\theta(x^{(i)}|z)\, p_\theta(z)}{q_\phi(z|x^{(i)})} \cdot \frac{q_\phi(z|x^{(i)})}{p_\theta(z|x^{(i)})}$$

Note that, even though the formula contains the term parameterized by $\phi$, the two $q_\phi$ factors cancel, so the product itself does not depend on $\phi$ at all.

So now, we have two terms: one which is calculable, which is an approximation, and the second one, which is a correction factor, or error term. Again, this is true for *any* $q_\phi$ and any value of $z$.

The next idea puts together four facts conveniently. First, after the log transformation, we can express the product of the two factors as a sum of logs:

$$\log p_\theta(x^{(i)}) = \log \frac{p_\theta(x^{(i)}, z)}{q_\phi(z|x^{(i)})} + \log \frac{q_\phi(z|x^{(i)})}{p_\theta(z|x^{(i)})}$$

Second, because the full expression is constant over $z$, we can take its expectation under any distribution over $z$ without changing its value.

Third, if we specifically choose $q_\phi(z|x^{(i)})$ as that distribution, the second term becomes a KL-divergence:

$$\log p_\theta(x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log \frac{p_\theta(x^{(i)}, z)}{q_\phi(z|x^{(i)})}\right] + D_{KL}\!\left(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z|x^{(i)})\right)$$

Intuitively, what is happening is that each term in the expectation will be pushed up as we maximize the first expectation. Recall, as the fourth fact, that a KL-divergence is always non-negative, and zero exactly when the two distributions coincide.

Finally, now that we have two expectations, one of which is always non-negative, the other is a lower bound for the full expression. Conveniently, we can calculate the value of the first term. Even though we cannot calculate the value of the KL-divergence term, that doesn't prevent us from maximizing the full expression, since we can just maximize a lower bound for it.

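The identity behind this lower bound can be checked exactly on a tiny discrete model, where every quantity, including the generally intractable posterior, is computable by enumeration (the numbers are my own toy choices):

```python
import math

# Toy discrete model: z in {0,1}, x in {0,1}. Illustrative numbers.
p_z = [0.5, 0.5]                    # prior p(z)
p_x_given_z = [[0.9, 0.1],          # p(x|z=0)
               [0.2, 0.8]]          # p(x|z=1)
q_z_given_x = [0.6, 0.4]            # an arbitrary q(z|x) for the observed x

x = 1                               # observed data point
log_px = math.log(sum(p_z[z] * p_x_given_z[z][x] for z in (0, 1)))

# ELBO = E_q[ log p(x,z) / q(z|x) ]
elbo = sum(q_z_given_x[z] *
           math.log(p_z[z] * p_x_given_z[z][x] / q_z_given_x[z])
           for z in (0, 1))

# KL(q(z|x) || p(z|x)), with the true posterior obtained by Bayes' rule
p_z_given_x = [p_z[z] * p_x_given_z[z][x] / math.exp(log_px) for z in (0, 1)]
kl = sum(q_z_given_x[z] * math.log(q_z_given_x[z] / p_z_given_x[z])
         for z in (0, 1))

assert abs(log_px - (elbo + kl)) < 1e-12   # log p(x) = ELBO + KL, exactly
assert kl >= 0                              # so ELBO <= log p(x)
```

The decomposition holds for this arbitrary choice of $q$; as $q$ approaches the true posterior, the KL term shrinks to zero and the bound becomes tight.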
Recall that the full expression given above has the same value for any choice of $q_\phi$; only the split between the lower bound and the KL term changes. Two consequences follow. First, the error term will be much smaller, and the bound correspondingly more accurate, for a $q_\phi(z|x)$ that closely matches the true posterior. Second, if we factor the joint as $p_\theta(x^{(i)}, z) = p_\theta(x^{(i)}|z)\, p_\theta(z)$, the lower bound itself splits in two:

$$\mathcal{L}(\theta, \phi; x^{(i)}) = \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log p_\theta(x^{(i)}|z)\right] - D_{KL}\!\left(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\right)$$

and then the second term becomes the always-positive KL-divergence between the recognition distribution $q_\phi(z|x^{(i)})$ and the prior $p_\theta(z)$, entering with a minus sign.

Putting that aside, we can calculate the gradients with respect to $\theta$ and $\phi$; for $\phi$ this requires the reparameterization trick, which rewrites a sample from $q_\phi(z|x)$ as a deterministic, differentiable function of $\phi$ and an auxiliary noise variable.

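As a sketch of how such gradients are estimated, here is the reparameterization trick on a one-dimensional Gaussian with a known closed-form answer (a toy example of mine, not code from the paper):

```python
import random

# For q_phi(z) = N(mu, sigma^2), write z = mu + sigma * eps with
# eps ~ N(0, 1). The sample is then a differentiable function of
# (mu, sigma), so the gradient of E_q[f(z)] can be estimated by
# differentiating through the samples themselves.
random.seed(0)
mu, sigma = 0.5, 0.3
N = 200_000

# Objective f(z) = z^2, so E_q[f(z)] = mu^2 + sigma^2 and the true
# gradient is d/dmu E_q[f(z)] = 2 * mu.
grad_est = 0.0
for _ in range(N):
    eps = random.gauss(0, 1)
    z = mu + sigma * eps          # reparameterized sample
    grad_est += 2 * z             # df/dz * dz/dmu = 2z * 1
grad_est /= N

assert abs(grad_est - 2 * mu) < 0.01
```

With 200,000 samples the Monte Carlo estimate lands within about 0.01 of the analytic gradient $2\mu = 1.0$; the same pattern, with autodiff in place of the hand-written derivative, is how the ELBO is optimized in practice.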
From the above, we have expressed the learning objective in terms of a code prediction error and a reconstruction error. The code prediction error corresponds with the human experience of constantly monitoring moment-to-moment expectations for what comes next, either passively, or when we take action. The reconstruction error corresponds roughly to a human who can perceive some data $x$, form an internal representation $z$ of it, and then judge how faithfully that representation accounts for the original perception.

I'd like to introduce a new objective, called *practice error*, which simulates the human experience of *practicing*. By "practicing", I mean that the human has an idea of a desirable behavior of some sort, which he understands only through his representation of the sense perceptions resulting from that behavior. The trial-and-error process of practicing something consists of:

- Obtain (through observation or imagination) some representation $\hat{z}$ of a desirable observation
- Generate $x^{(i)} \sim p_\theta(x|\hat{z})$ through bodily action and the resulting interaction with environment and senses
- Interpret the result through your senses: $z^{(j)} \sim q_\phi(z|x^{(i)})$
- Measure the error between $\hat{z}$ and $z^{(j)}$
- Based on the error, adjust the behavior generation $p_\theta(x|z)$ and/or the perception $q_\phi(z|x)$
- Repeat

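The steps above can be sketched as a one-dimensional toy simulation (everything here, the affine "world" and the single learnable offsets, is my own illustrative construction, not from the VAE paper):

```python
import random

# A toy practice loop in one dimension. The "world" maps an intended
# action a to a sensation x; the agent's decoder (behavior generation)
# and encoder (perception) are crude stand-ins with learnable offsets.
random.seed(0)

WORLD_SHIFT = 2.0                    # unknown to the agent

def world(a):                        # acting produces a sensation
    return a + WORLD_SHIFT + random.gauss(0, 0.01)

decode_bias = 0.0                    # stand-in for p_theta(x|z)
encode_bias = 0.0                    # stand-in for q_phi(z|x)
z_hat = 1.0                          # representation of desired behavior
lr = 0.1

for _ in range(200):
    a = z_hat + decode_bias          # generate behavior from z_hat
    x = world(a)                     # interact with the environment
    z = x + encode_bias              # perceive the result
    err = z - z_hat                  # measure the error
    decode_bias -= lr * err          # adjust the behavior generation
    # (here only the decoder adapts; the perception could adapt too)

assert abs(err) < 0.1                # practice has driven the error down
```

Only the decoder adapts in this sketch; symmetrically adjusting `encode_bias` would correspond to refining perception as well as behavior, as the last step of the list allows.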
Mapping this process onto an autoencoder might look like an inverted reconstruction: instead of the usual encode-then-decode cycle $x \to z \to \hat{x}$, we run a decode-then-encode cycle $\hat{z} \to x^{(i)} \to z^{(j)}$.

Each pass through the loop is one practice trial, yielding an error signal between the target code $\hat{z}$ and the perceived code $z^{(j)}$.

In the above, $p_\theta(x|z)$ plays the role of the body acting on the world, and $q_\phi(z|x)$ the role of perception.

I have been considering the VAE model in the context of unsupervised speech representation learning, using WaveNet as the decoder, where $p_\theta(x|z)$ is an autoregressive distribution over waveform samples conditioned on a sequence of latent vectors $z$.
First, I note that the inner experience of imagining phonemes and words could be modeled as sampling from an autoregressive prior $p_\theta(z_t|z_{<t})$ over latent states.

The process of pronouncing these imagined phonemes and words involves another circuit that takes this generated state and translates it into motor output for the tongue, larynx, vocal cords, jaw, etc. This could be modeled as the decoder $p_\theta(x|z)$.

The peculiar thing now is imagining the process of listening to speech. I feel it is very likely the case that the same circuits that are involved in speech imagining, which we modeled as the autoregressive prior $p_\theta(z_t|z_{<t})$, are also engaged when we listen.

What about the process of listening to speech? The intuition is that listening (or any experience) is a predictive activity. Based on the recent past inputs, we are constantly generating predictions about what might come next. And, moment-to-moment, the hypothesis is that our brains compare these predictions with the encoding derived from the immediate sensory information. So, if I were to model this, it would be a recognition circuit that is augmented with autoregressive context, as $q_\phi(z_t|x, z_{<t})$.

But, if the imagining process and the listening process both involve moment-to-moment integration of recent past state, isn't it likely they would share the same circuits? The problem is that the VAE formulation doesn't allow that, because the prior and the recognition model are two separate models (I think). But, I believe we can combine them using an attentional trick, as follows:

Allow for a gating mechanism mediated by inhibitory neurons, which allows or prevents signal from the sensory processing circuit to reach the recurrent circuit. The sensory processing circuit produces a representation $z_t \sim q_\phi(z_t|x)$ of the current sensory input.

The error function could be a circuit that is always on, but upstream of the gating mechanism. If the gate is open, then two things happen. First, the autoregressive circuit incorporates the error signal and updates itself to try to minimize error. Second, the error provides additional conditioning input, along with the previous latent states $z_{<t}$, for predicting the next latent state.

If the gate is closed, the error signal is ignored. This will be akin to a person imagining words while ignoring other people talking. The sounds are going into the person's ears, and some lower level circuits have no choice but to process the information, but an attention mechanism prevents it from affecting the person's train of thought. Thus, the prediction of latent variable at the next timestep only incorporates the previous latent variables as context, and there is also no training signal.

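A scalar toy simulation of this gate (entirely my own construction; the toggling schedule, the linear dynamics, and the learning rule are illustrative stand-ins) might look like:

```python
import random

# A predictive circuit with parameter w models the latent dynamics; an
# error circuit compares its prediction to the sensory-derived code; a
# gate decides whether that error may update the model and the context.
random.seed(0)

TRUE_DYNAMICS = 0.8                  # unknown to the predictive circuit
w = 0.0                              # predictive circuit's parameter
lr = 0.05
z_prev = 1.0

for step in range(300):
    gate_open = (step % 2 == 0)      # attention toggling, for illustration
    if gate_open:                    # listening: sensory input gets through
        z_prev = random.uniform(0.5, 1.5)          # fresh sensory context
        z_pred = w * z_prev                        # prediction from context
        z_sense = TRUE_DYNAMICS * z_prev + random.gauss(0, 0.01)
        err = z_pred - z_sense       # error circuit, upstream of the gate
        w -= lr * err * z_prev       # error updates the predictive circuit
    else:                            # imagining: the gate is closed
        z_prev = w * z_prev          # roll forward on prediction alone;
                                     # no training signal reaches w

assert abs(w - TRUE_DYNAMICS) < 0.05 # listening phases calibrate the model
```

The closed-gate branch is exactly the imagination rollout: the same parameter `w` serves both prediction during listening and generation during imagining, which is the sharing the standard VAE formulation lacks.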
Crucially, the system would have to train in such a way that, for any given z-context $z_{<t}$, the prediction of the next latent state is consistent whether the gate is open or closed; that is, imagining and listening must roll the state forward through the same dynamics.

We then stipulate that the circuit learns over time to make prediction errors that are zero-mean. The problem is that, by the VAE formulation, the prior and recognition model are completely separate. I wondered how one could combine them in a probabilistically consistent way, and have the following proposal:

Now, taking the insight from McAllester and van den Oord (ref, predictive coding), imagine that the activity of listening to spoken speech and recognizing phonemes, words, phrases, etc., involves the brain maintaining a current state which allows it to use information from previous latent states. As it does this, another circuit processes the current sensory input into a candidate latent state, to be compared against the prediction.

The first thing that occurred to me is that the idea of sampling from the
prior