Welp, is that a fire alarm I hear?

Does this help explain the mere exposure effect or semantic satiation or is that something only the brain experiences?

Cutting the computer in half isn't actually as silly as you make it sound!

The problem with backprop is that you have to apply it to one layer of the neural network at a time: the result for layer k is required before you can backprop to layer k-1. This reduces the degree to which you can parallelise or distribute the computation.

One of the HN commenters pulled out this quote from the paper:

>>> In the brain, neurons cannot wait for a sequential forward and backward sweep. By phrasing our algorithm as a global descent, our algorithm is fully parallel across layers. There is no waiting and no phases to be coordinated. Each neuron need only respond to its local driving inputs and downwards error signals. We believe that this local and parallelizable property of our algorithm may engender the possibility of substantially more efficient implementations on neuromorphic hardware.

In other words, using the predictive coding approach, you can be updating the weights on layer k asynchronously, without waiting for the weights to be updated on subsequent layers. The idea seems to be that the whole system will eventually converge as a result of these local rules. This approach lets you scale out by distributing the computation between different physical machines, each running the local rules on a piece of the network. With backprop this doesn't work because, at any given time, most of the weights can't be updated until the calculations for other weights have completed.

Right now this hasn't made for any huge performance wins because (a) the researchers didn't put a lot of effort into leveraging this scaling ability, and (b) they have to do a few hundred iterations of updating in order for the algorithm to converge on the value you'd get out of backprop. The hope is that the opportunities for scaling outweigh the disadvantages of needing to do multiple passes to get convergence.
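To make the local-update idea concrete, here's a toy numpy sketch I put together from the paper's description (my own construction, not the authors' code; the layer sizes, learning rate, and tanh nonlinearity are all arbitrary choices of mine). Every hidden layer's update uses only its own error and the error one layer up, so all layers can step simultaneously:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):  return np.tanh(x)
def df(x): return 1.0 - np.tanh(x) ** 2

# Toy chain with three weight matrices: layer sizes 4 -> 8 -> 8 -> 2.
sizes = [4, 8, 8, 2]
W = [rng.normal(0, 0.3, (sizes[l + 1], sizes[l])) for l in range(3)]

def pc_relax(x, y, n_iters=100, lr_v=0.1):
    """Predictive coding inference: clamp input and target, then let every
    hidden layer descend the local prediction-error energy in parallel."""
    v = [x]                              # value nodes, initialised feedforward
    for l in range(3):
        v.append(W[l] @ f(v[-1]))
    v[-1] = y                            # clamp the output layer to the label

    for _ in range(n_iters):
        e = [np.zeros_like(x)] + [v[l + 1] - W[l] @ f(v[l]) for l in range(3)]
        for l in (1, 2):                 # all hidden layers update at once,
            v[l] += lr_v * (-e[l] + df(v[l]) * (W[l].T @ e[l + 1]))
    return v, e

x, y = rng.normal(size=4), np.array([1.0, 0.0])
v, e = pc_relax(x, y)

# At equilibrium, the outer product of e[l+1] with f(v[l]) approximates the
# backprop gradient for W[l], so the weight updates are local too:
grads = [np.outer(e[l + 1], f(v[l])) for l in range(3)]
```

Note the contrast with backprop: there's no backward sweep at all, just the same local rule applied everywhere and repeated until the errors settle.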

I don't see any mention of nonlocality in the article, and to the best of my knowledge backpropagation is an entirely local process, which helps parallel architectures such as GPUs perform it efficiently.

The issue with backprop seems to be the fact that it goes backwards, while neurons can only perform forward computations.

"But there’s no process in the brain with a god’s-eye view, able to see the weight of every neuron at once."

This was what bothered me when I studied ML at university and made me believe the whole thing was BS. This is the first time I've read a phrase where someone puts a finger on it.

Perhaps backpropagation doesn't happen in the brain... but loops back through the environment.

So, see a vague snakish thing in the grassy stuff and your priors direct your pulse to increase, hormones to be released, and your attention (or eye direction and focus) to swivel to the potential threat. You look properly at the snakish thing in the grassy stuff and it is resolved into a snake in the grass, or a hosepipe. No backpropagation is required... just different weights being input to the predictive processes from another beginning.

With a highly parallel system you don't have to 'refine' the first process through the brain, you just redo it (although many of the 'priors' will be available more quickly because you are now focused on the snakey thing, not what you are going to have for lunch).

I'm an ML PhD student, and I read this paper when it came out. My impression was that the paper does elide the difference between *predictive coding networks* (a type of NN inspired by *predictive coding theory* in neuroscience) and the theory itself, and that this has led to confusion on the part of people who might miss the distinction.

From the paper:

"Of particular significance is Whittington and Bogacz (2017) who show

that predictive coding networks – a type of biologically plausible network which learn through a hierarchical process of prediction error minimization – are mathematically equivalent to backprop in MLP models. In this paper we extend this work, showing that predictive coding can not only approximate backprop in MLPs, but can approximate automatic differentiation along arbitrary computation graphs. This means that in theory there exist biologically plausible algorithms for differentiating through arbitrary programs, utilizing only local connectivity."

The key phrase is "biologically plausible", which basically just means an algorithm that is vaguely similar to something in the brain at a given level of abstraction. I think practically this is just another backprop alternative (though certainly an interesting one) which, like the others, will probably turn out to be less useful than backprop.

On the topic of the link between neural networks and PCNs, however, there is a more recent paper showing a much more direct link than the one in your post:

https://twitter.com/TimKietzmann/status/1361673150828838913

This one is more interesting to me because it shows that predictive coding-like dynamics naturally emerge when you train an NN with backprop in a particular way, rather than having to do the more involved backprop approximation stuff in the original paper.

Author of the paper here. Really excited to see this get featured on SSC and LW. Happy to answer any questions people have.

Here are some comments and thoughts from the discussion in general:

1.) The 100x computational cost. In retrospect I should have made this clearer in the paper, but this is an upper bound on the actual cost. 100s of iterations are only needed if you want to approximate the backprop gradients very precisely (down to several decimal places) with a really small learning rate. If you don't need to approximate the backprop gradients exactly, but just to get close enough for learning to work well, the number of iterations and the cost come down dramatically. With good hyperparameter tuning you can get into the 10x range instead.

Also, in the brain the key issue isn't so much the "computational cost" of the iteration as the time it takes. If you have a tiger leaping at you, you don't want to be doing 100s of iterations back and forth through your brain before you can do anything. If we (speculatively) associate alpha/beta waves with iterations in predictive coding, then you can do approximately 10-30 iterations per second (i.e. 1-3 weight updates per second), which seems about the right ballpark.

The iterative nature of predictive coding also gives it the nice property that you can trade off computation time against inference accuracy -- if you need to, you can stop computation early and get a 'rough and ready' estimate, while if something is really hard you can spend a lot longer processing it.
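Schematically, that anytime property looks something like the sketch below (a toy illustration of the idea, not our actual code; the tolerance, budget, and toy update rule are arbitrary):

```python
import numpy as np

def relax(step_fn, v0, tol=1e-4, max_iters=300):
    """Anytime inference: iterate the local predictive-coding updates until
    the state stops changing (the errors then closely match the backprop
    gradients) or the iteration budget runs out."""
    v = v0
    for t in range(1, max_iters + 1):
        v_next = step_fn(v)
        if np.linalg.norm(v_next - v) < tol:
            return v_next, t   # converged: precise gradient estimate
        v = v_next
    return v, max_iters        # budget exhausted: cheap, rough estimate

# A toy contraction converges in ~25 iterations at this tolerance ...
v_exact, n = relax(lambda v: 0.7 * v, np.ones(8))
# ... but a tight budget still returns a usable rough answer much sooner.
v_rough, _ = relax(lambda v: 0.7 * v, np.ones(8), max_iters=5)
```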

2.) Regarding parallelism vs backprop. Backprop is parallel within layers (i.e. all neurons within a layer can update in parallel) but sequential between layers -- layer n must wait for layer n+1 to finish before it can update. In predictive coding, all layers and all neurons can update in parallel (they don't have to wait for the layers above to finish). "A Wild Boustrephedon" explains this well in the comments. Of course there is no free lunch, and it takes time (multiple iterations) for information about the error at one end of the network to propagate through and affect the errors at the other end.

Personally, I would love to see a parallel implementation of predictive coding. I haven't really looked into this because I have no experience with parallelism or a big cluster (running everything on my laptop's GPU) but in theory it would be doable and really exciting, and potentially really important when you are running huge models.

One key advantage of neuromorphic computers (in my mind), beyond being able to simply 'cut the computer in half', is that they make it much more effective to simulate truly heterarchical architectures. A big reason most ML architectures are just a big stack of large layers is that this is what is easily parallelizable on a GPU (a big matrix multiplication). The brain isn't like this -- even in cortex, though there are layers, each layer is composed of loads of cortical columns which appear to be semi-self-contained; there are lateral connections everywhere; and cortical pyramidal cells project to and receive inputs from loads of different areas, not just the layer 'above' and the layer 'below'. Simulating heterarchical systems like this on a GPU would be super slow and inefficient, which is why nobody does it at scale, but it could be much more feasible with neuromorphic hardware.

3.) The predictive coding networks used in this paper are a pretty direct implementation of the general idea of predictive coding as described in Andy Clark's book Surfing Uncertainty, and are essentially the same as in the original Rao and Ballard model. The key difference is that, to match the supervised learning setup, we reverse the direction of the network, so that here we are predicting labels from images rather than images from labels, as in the original Rao and Ballard work. Conversely, this means we can understand 'normal' predictive coding as doing backprop on the task of generating data from a label. I explain this a bit more in my blog post here https://berenmillidge.github.io/2020-09-12-Predictive-Coding-As-Backprop-And-Natural-Gradients/

4.) While it's often claimed that predictive coding is biologically plausible and the best explanation for cortical function, this isn't really all that clear cut. Predictive coding itself actually has a bunch of implausibilities. Firstly, it suffers from the same weight transport problem as backprop; secondly, it requires that the prediction and prediction error neurons be connected 1-1 (i.e. one prediction error neuron for every prediction neuron), which is far too precise connectivity to actually happen in the brain. I've been working on ways to adapt predictive coding around these problems, as in this paper (https://arxiv.org/pdf/2010.01047.pdf), but this work is currently very preliminary and it's unclear whether the remedies proposed there will scale to larger architectures.

There are also persistent problems with how to represent negative prediction errors: either negative errors are represented as lower-than-average firing (which requires a high baseline rate to get good dynamic range, and is energy inefficient), or separate populations encode positive and negative prediction errors (which must be precisely connected up). Interestingly, in the one place we know for sure there are prediction error neurons (the dopaminergic reward prediction errors in the basal ganglia involved in model-free reinforcement learning), the brain appears to use both strategies simultaneously in different neurons. I don't know of any research showing anything like this in cortex though.
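To make the two encoding strategies concrete, here's a toy numpy illustration (my own sketch, with arbitrary numbers):

```python
import numpy as np

e = np.array([0.7, -0.3, 0.0, -1.2])   # signed prediction errors

# Strategy 1: one population with a high baseline rate b, so negative
# errors are dips below baseline. The baseline must be high enough to
# encode the largest negative error, which costs energy.
b = 2.0
rates = np.maximum(0.0, b + e)

# Strategy 2: two rectified populations, one per sign, which must be
# wired 1-1 to the corresponding prediction neurons.
e_pos = np.maximum(0.0, e)     # 'positive error' neurons
e_neg = np.maximum(0.0, -e)    # 'negative error' neurons
e_readout = e_pos - e_neg      # downstream readout recovers the signed error
```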

Also, there's not a huge amount of evidence in general for prediction error neurons in the brain -- see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7187369/ for a good overview -- although it turns out to be surprisingly difficult to get unambiguous experimental tests of this. For instance if you have a neuron which responds to a spot of light, then is it a prediction neuron which 'predicts' the spot, or is it a prediction error neuron which predicts nothing and then signals a prediction error when there is a spot?

A final key issue is that predictive coding is only really defined for rate-coded neurons (where we assume that neurons output a scalar 'average firing rate' rather than spikes), and it's not necessarily clear how to generalize and get predictive coding working for networks of spiking neurons. This is currently a huge open problem in the field imho.

5.) Predictive coding isn't everything. There has actually been loads of cool progress over the last few years in figuring out other biologically plausible schemes for backprop. For instance, target propagation (https://arxiv.org/pdf/2007.15139.pdf), equilibrium propagation (https://www.frontiersin.org/articles/10.3389/fnins.2021.633674/full), and direct feedback alignment (https://arxiv.org/pdf/2006.12878.pdf) have recently also been shown to scale to large-scale ML architectures. For a good review of this field you can look at https://www.sciencedirect.com/science/article/pii/S1364661319300129.
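As a flavour of how different these schemes can look from vanilla backprop, here is a minimal numpy sketch of direct feedback alignment (my own toy version, not the code from the linked papers; sizes and learning rate are arbitrary). The backward weights are fixed random matrices, so there is no weight transport problem:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [10, 32, 32, 3]
W = [rng.normal(0, 0.3, (sizes[i + 1], sizes[i])) for i in range(3)]
# Fixed random feedback matrices: they project the output error straight
# to each hidden layer instead of using the transposed forward weights.
B = [rng.normal(0, 0.3, (sizes[i + 1], sizes[-1])) for i in range(2)]

def f(x):  return np.tanh(x)
def df(x): return 1.0 - np.tanh(x) ** 2

def dfa_update(x, y, lr=0.01):
    a1 = W[0] @ x;   h1 = f(a1)          # forward pass
    a2 = W[1] @ h1;  h2 = f(a2)
    e = W[2] @ h2 - y                    # output error
    d1 = (B[0] @ e) * df(a1)             # hidden 'deltas' via random feedback
    d2 = (B[1] @ e) * df(a2)
    W[0] -= lr * np.outer(d1, x)
    W[1] -= lr * np.outer(d2, h1)
    W[2] -= lr * np.outer(e, h2)

dfa_update(rng.normal(size=10), np.array([1.0, 0.0, 0.0]))
```

The surprising part is that training still works: over time the forward weights align themselves with the fixed random feedback.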

6.) The brain being able to do backprop does not mean that the brain is just doing gradient descent like we do to train ANNs. It is still very possible (in my opinion likely) that the brain could be using a more powerful algorithm for inference and learning -- just one that has backprop as a subroutine. Personally (and speculatively) I think it's likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where each neuron or small group of neurons represents a single 'particle' following its own MCMC path. This approach naturally uses the stochastic nature of neural computation to its advantage, and allows neural populations to represent the full posterior distribution rather than just a point prediction as in ANNs.
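Purely to illustrate that speculation (and nothing more), here's what 'one particle per neuron' could look like on a toy one-dimensional posterior. Everything here, from the target distribution to the step sizes, is an arbitrary choice of mine:

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_log_p(theta):
    return -theta  # toy posterior: standard Gaussian, so grad log p = -theta

def hmc_step(theta, step=0.1, n_leapfrog=10):
    """One Hamiltonian MC step, vectorised over many independent 'particles'
    (standing in, very loosely, for neurons each tracking one sample)."""
    p = rng.normal(size=theta.shape)          # resample momenta
    theta_new, p_new = theta.copy(), p.copy()
    for _ in range(n_leapfrog):               # leapfrog integration
        p_new += 0.5 * step * grad_log_p(theta_new)
        theta_new += step * p_new
        p_new += 0.5 * step * grad_log_p(theta_new)
    # Metropolis accept/reject, independently per particle.
    log_accept = 0.5 * (theta**2 - theta_new**2) + 0.5 * (p**2 - p_new**2)
    accept = np.log(rng.uniform(size=theta.shape)) < log_accept
    return np.where(accept, theta_new, theta)

particles = rng.normal(size=1000)  # one 'particle' per (group of) neurons
for _ in range(100):
    particles = hmc_step(particles)
# 'particles' now approximates the full posterior, not a point estimate.
```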

Forgive me, but isn't backpropagation something like the way science is at times practiced?

"We get that when we do this, what model will explain these results?"

Neat!

On the other hand, outside view: previous attempts to explain the brain by reference to the most complicated technology of the time (mills, hydraulic systems, etc.) have typically ended up not super helpful and almost immediately hilarious in retrospect. Are we confident that our most-complicated-technologies have progressed far enough that this time will be any different?

Reminds me of another relatively recent “grand unification” of neural networks with Ordinary Differential Equations:

https://arxiv.org/pdf/1806.07366.pdf

I think it's worth noting that "Predictive coding is the most plausible current theory of how the brain works." is not a sentence that, in my estimation (current PhD in cog sci at an R1 university), would receive widespread agreement from cog sci researchers. (I feel confident that you wouldn't get majority agreement, and I'd bet against even plurality.)

Of course, that doesn't mean it's wrong -- but I think this sentence misleadingly makes it seem like there's expert consensus here, when I think in fact Scott is relying on his own judgment.

"This paper permanently fuses artificial intelligence and neuroscience into a single mathematical field."

No it doesn't.

Thanks Beren! I'd rather imagine my neurons are performing highly parallelized advanced Hamiltonian MCMC algorithms too, but wouldn't put it past the sneaky blighters to be indulging in a bit of backpropagation on the side, though it sounds complicated, immoral and possibly illegal 😉

As someone who already has a crank's refutation I guess it's my responsibility to ask the stupid question: how do we know no current AI is conscious?

I'd like to take the chance to re-remind everyone that there's also a subreddit at https://www.reddit.com/r/PredictiveProcessing and I'd love to see more active discussion there on current papers.

And to all those preferring media to written text:

The brains@bay meetup video here: https://www.youtube.com/watch?v=uiQ7VQ_5y5c&t=14 includes a discussion of the evidence for predictive processing, mechanisms for learning, and predictive-processing-imitating AI implementations from scratch. I plan to make some notes and will add them to the subreddit in the next few days (but as always, my notes aren't ready yet when Scott posts something fitting on predictive processing).

And since I'm already posting youtube links: there's also a video discussing the paper here: https://m.youtube.com/watch?v=LB4B5FYvtdI - unless I missed something, it's not by any of the authors but by someone unrelated (and I should add I haven't watched it so far).

If you generalize "computer" a little bit, then "a computer that doesn't break when you cut it in half" just unpacks as "a partition-tolerant distributed system", i.e. a network of computers that mostly keeps working if some nodes become unable to communicate with each other due to network outages. This is a well-studied problem and, while "neuromorphic" systems may well have this property, lots of non-neuromorphic systems already do.

Unifying ML and biological learning ought to be worth a Nobel Prize or Turing Award.

One (possibly outdated) complaint I've heard from a neurologist about ANNs is that the math of natural neural networks isn't plus/minus, it's plus/divide. As we understood it, ANNs sum the previous layer's outputs with positive or negative weights, so doubling a signal adds or subtracts twice as much to the next layer. But when neurons signal each other chemically, more positive signaling increases the binding rate, while negative signaling interferes with the binding, having an outsized effect. He even speculated that a natural neural net that used plus/minus would be considered pathological.

I learned my machine learning before innovations like deep and recurrent ANNs. Back then, the plus/minus vs plus/divide contrast that he drew about ANNs seemed apt, leaving open the possibility that plus/divide would perform better. Can newer architectures model plus/divide? Have researchers investigated plus/divide and found it to not improve ANNs?
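To spell out the contrast I have in mind, here's a toy numpy sketch (my own notation and numbers, not anything from the literature):

```python
import numpy as np

def additive_unit(exc, inh, w_e=1.0, w_i=1.0):
    # Textbook ANN combination: inhibition subtracts from the drive.
    return np.maximum(0.0, w_e * exc - w_i * inh)

def shunting_unit(exc, inh, w_e=1.0, w_i=1.0):
    # 'Plus/divide': inhibition divisively scales the excitatory drive,
    # so the same inhibitory input has a multiplicative, outsized effect.
    return np.maximum(0.0, (w_e * exc) / (1.0 + w_i * inh))

exc = np.array([2.0, 2.0])
inh = np.array([0.5, 1.0])      # doubling the inhibition ...
print(additive_unit(exc, inh))  # ... subtracts a fixed extra amount: [1.5 1. ]
print(shunting_unit(exc, inh))  # ... rescales the whole drive: [1.33 1. ]
```

(I believe the divide-style interaction is studied in neuroscience under the name 'divisive normalization', but I don't know of a systematic plus/divide benchmark in modern deep nets, so the question stands.)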

"But neurons can only send information one way; Neuron A sends to Neuron B, but not vice versa."

In case anybody else was confused by this - since neurons *can* convey information in two directions - this summary from Wikipedia seems to explain the misunderstanding (emphasis mine):

"While a backpropagating action potential can presumably cause changes in the weight of the presynaptic connections, there is no simple mechanism for an error signal to propagate through *multiple* layers of neurons, as in the computer backpropagation algorithm."

https://en.wikipedia.org/wiki/Neural_backpropagation

I also have to fault the new paper for citing Francis Crick from *1989* as a source that the brain probably can't implement backpropagation. Crick may still be accurate in this case, but we've learned a lot about neurons since then.
