(comments)

Original link: https://news.ycombinator.com/item?id=43959071

This Hacker News thread discusses Sakana.ai's "Continuous Thought Machines," a neural network architecture aimed at more biologically grounded computation. Key points include concern that the work does not adequately acknowledge existing research on spiking neural networks and biologically plausible models, and that some of its terminology is misleading (calling synaptic integration "thinking"). Commenters debate how similar it really is to biological networks, with some arguing it is essentially a modified transformer with attention, and question the paper's choice to benchmark against LSTMs rather than more modern attention models. Some see potential for continuous processing in AI development, especially for memory and learning, while others remain skeptical, citing problems with Sakana.ai's previous claims and the difficulty of turning theoretical advances into practical applications. Opinions are also divided on whether this advance represents a "leap" toward AGI.


    Continuous Thought Machines (sakana.ai)
    282 points by hardmaru | 34 comments










    This paper is concerning. While divorced from the standard ML literature, there is a lot of work on biologically plausible spiking, timing-dependent artificial neural networks. The nomenclature here doesn't seem to acknowledge that body of work. Instead it appears as a step toward that bulk of research coming from the ML/LLM field, without a clear appreciation of the ground well traveled there.*

    In addition, some of the terminology is likely to cause confusion. By calling a synaptic integration step "thinking," the authors are going to confuse a lot of people. Instead of the process of forming an idea, evaluating that idea, potentially modifying it and repeating (what a layman would call thinking), they are trying to ascribe "thinking" to single-unit processes! That's a pretty radical departure from both the ML and ANN literature. Pattern recognition/signal discrimination is well known at the level of synaptic integration and firing, but "thinking"? No, that wording is not helpful.

    *I have not reviewed all the citations and am reacting to the plain language of the text as someone familiar with both lines of research.



    Sorry, I should have responded to this comment, but I wrote a separate response in the parent thread. I didn't feel the pdf/paper was really trying to mimic spiking biological networks in anything but the loosest sense (there is a sequence of activations and layers of "neurons"). I think the major contribution is just using the dot product on output-transpose-output; the rest is just diffusion/attention on inputs. It's conceptually a combination of "input attention" and "output attention" using a kind of stepped recursive model.


    I'm sort of not surprised; my impression is that, for the past decade or two, ML researchers who did acknowledge related work in neuroscience were broadly accused of hubris for daring to compare their work to biological brains.


    Agree, they are presenting this like it's a new idea, with hardly any reference to the decades of work on spiking neural nets and similar.


    >there is a lot of work on biologically plausible spiking

    I kindly ask you to share a list (or, even better, a brief review) of the books/papers you find most insightful on neuroscience-inspired algorithms, their concepts, and implementation details.



    Not the original poster, but:

    - Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems - Peter Dayan, L. F. Abbott (2001) is quite good, more mathematical than computational.

    - Neuronal Dynamics, available here: https://neuronaldynamics.epfl.ch/ is also quite good, and free to read. It has Python exercises as well. If I recall correctly, it mostly goes into simulations of single neurons, and not so much entire networks and what we can do with them, but it does a good job of bridging the chemistry/biology/math to computation.

    If we're talking about papers, one I mentioned in my other comment:

    - Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations, https://doi.org/10.1162/089976602760407955

    - Dynamics of Sparsely Connected Networks of Excitatory and Inhibitory Spiking Neurons, by Nicolas Brunel (Don't have a DOI on hand for this one)

    - Spiking Neural Networks and Their Applications: A Review, https://doi.org/10.3390/brainsci12070863 , is a very nice review of methods and does some nice explaining on concepts.

    If you're looking for keywords on the topic (a minimal sketch of the first one follows this list):

    - Leaky Integrate and Fire (LIF) neurons

    - Spiking neural networks

    - Liquid State Machines (LSM)

    - Synaptic plasticity (Models of synaptic plasticity)

    - Spike-based synaptic plasticity
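
    To make the first keyword above concrete, here is a minimal leaky integrate-and-fire simulation in plain numpy. It is only a sketch: the Euler integration, the constants, and the constant-current drive are illustrative choices, not taken from any of the texts listed above.

    ```python
    import numpy as np

    # Minimal leaky integrate-and-fire (LIF) neuron, Euler-integrated.
    # All constants are illustrative, not tuned to any particular paper.
    def simulate_lif(input_current, dt=1e-3, tau=0.02, v_rest=-0.065,
                     v_reset=-0.065, v_thresh=-0.050, r_m=1e7):
        v = v_rest
        spike_times = []
        for t, i_ext in enumerate(input_current):
            # Membrane potential leaks toward rest and integrates the input current.
            dv = (-(v - v_rest) + r_m * i_ext) / tau
            v += dv * dt
            if v >= v_thresh:              # threshold crossing: emit a spike
                spike_times.append(t * dt)
                v = v_reset                # reset after the spike
        return spike_times

    # A constant 2 nA drive for 200 ms yields a regular spike train.
    print(simulate_lif(np.full(200, 2e-9)))
    ```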



    A (non-exhaustive) list of some notable papers:

    Maass 2002, Real-time computing without stable states: https://pubmed.ncbi.nlm.nih.gov/12433288/

    Sussillo & Abbott 2009, Generating Coherent Patterns of Activity from Chaotic Neural Networks https://pmc.ncbi.nlm.nih.gov/articles/PMC2756108/

    Abbott et al 2016, Building functional networks of spiking model neurons https://pubmed.ncbi.nlm.nih.gov/26906501/

    Zenke & Ganguli 2018, SuperSpike: Supervised Learning in Multilayer Spiking Neural Networks https://ganguli-gang.stanford.edu/pdf/17.superspike.pdf

    Bellec et al 2020, A solution to the learning dilemma for recurrent networks of spiking neurons https://www.nature.com/articles/s41467-020-17236-y

    Payeur et al 2021, Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits https://www.nature.com/articles/s41593-021-00857-x

    Cimesa et al 2023, Geometry of population activity in spiking networks with low-rank structure https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

    Ororbia 2024, Contrastive signal–dependent plasticity: Self-supervised learning in spiking neural circuits https://www.science.org/doi/10.1126/sciadv.adn6076

    Kudithipudi et al 2025, Neuromorphic computing at scale (review) https://www.nature.com/articles/s41586-024-08253-8



    The authors don't label a single synaptic integration as "thinking." They use the term for the network-wide internal loop ("internal ticks") that unrolls after every external input, and explicitly say it is merely "analogous to thought."


    Was this written by Jürgen Schmidhuber?


    Great to refocus on this important topic. So cool to see this bridge being built across fields.

    In wet-ware it is hard not to think of "time" as linear Newtonian time driven by a clock. But in the context of brain-and-body, what really is critical is generating well-ordered sequences of acts and operations embedded in a thicker or thinner slice of "now" that can range from the 300 msec of the "specious present" down to 50 microseconds in cells that evaluate the sources of sound (the medial superior olivary nucleus).

    For more context on contingent temporality see interview with RW Williams in this recent publication in The European Journal of Neuroscience by John Bickle:

    https://pubmed.ncbi.nlm.nih.gov/40176364/



    In my reading of the paper, I don't feel this is really like biological/spiking networks at all. They keep a running history of inputs and use multi-headed attention to form an internal model of how the past "pre-synaptic" inputs factor into the current output (post-synaptic). This is just like a modified transformer (keep a history of inputs, use attention on them to form an output).

    Then the "synchronization" is just an inner product of all the post-activations (stored in a large, ever-growing list, with subsampling for performance reasons).

    But it's still being optimized by gradient descent, except that the time step at which the loss is applied is chosen to be the one with minimum loss, or minimum uncertainty (uncertainty being described by the entropy of the output term).

    I'm not sure where people are reading that this is in any way similar to spiking neuron models with time simulation (time is just the number of steps the data is cycled through the system, similar to a diffusion model or how an LLM processes tokens recursively).

    The "neuron synchronization" is also a bit different from what it means in biological terms. It's an inner product of the output terms (producing a square matrix), which is then projected into the output space/dimensions. I suppose this produces "synchronization" in the sense that, to produce the right answer, the different outputs being multiplied together must take the right values on the right timestep. It feels a bit like introducing sparsity (where combining many outputs into a larger matrix makes their combination more important than the individual values). The fact that they must correctly combine on each time step is what they are calling "synchronization."

    Techniques like this are the basic mechanism underlying attention (produce one or more outputs from multiple subsystems, then dot-product to combine).
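
    For illustration, here is a toy numpy sketch of the mechanism as described in this comment (not the authors' code): attention over a stored history to produce a per-tick activation, an inner product over the stored post-activations to form a square "synchronization" matrix, and a projection of that matrix to the output. All names, shapes, and constants are invented for the example.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_ticks, out_dim = 16, 8, 4
    W_q = rng.normal(size=(d, d)) / np.sqrt(d)
    W_k = rng.normal(size=(d, d)) / np.sqrt(d)
    W_v = rng.normal(size=(d, d)) / np.sqrt(d)
    W_out = rng.normal(size=(d * d, out_dim)) / d   # projects the synchronization matrix

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def toy_ctm(x):
        history, post = [x], []
        for _ in range(n_ticks):                    # internal "ticks" after one external input
            H = np.stack(history)                   # (t, d) running history
            q, K, V = history[-1] @ W_q, H @ W_k, H @ W_v
            attn = softmax(q @ K.T / np.sqrt(d))    # attend over the stored history
            z = np.tanh(attn @ V)                   # "post-synaptic" activation for this tick
            post.append(z)
            history.append(z)
        P = np.stack(post)                          # (n_ticks, d) post-activations
        sync = P.T @ P / n_ticks                    # inner product across ticks -> (d, d)
        return sync.reshape(-1) @ W_out             # project the square matrix to the output

    print(toy_ctm(rng.normal(size=d)))
    ```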



    I would say one weakness of the paper is that they primarily compare performance against an LSTM (a simpler recurrent model), rather than against comparable attention/diffusion models. I would be curious how well a model that just has N layers of attention in/out would perform on these tasks (using a recursive, time-stepped approach). My guess is that performance would be very similar, and the network architecture would also be quite similar (although a true transformer is a bit different from the input attention + U-Net they employ).


    So this weekend we have:

    - Continuous thought machines: temporally encoding neural networks (more like how biological brains work)

    - Zero data reasoning: (coding) AI that learns from doing, instead of by being trained on giant data sets

    - Intellect-2: a globally distributed RL architecture

    I am not an expert in the field but this feels like we just bunny hopped a little closer to the singularity...



    I don't get that feeling. There are so many papers and so many avenues of research. It's extremely difficult for me to predict what's going to "pop" like the diffusion paper, the transformer paper, AlphaZero, Chat GPT-3, etc. But even these research or product advances that seem like step functions are built on a lot of research and trial-and-error. Can all 3 of these you listed be combined somehow? Hopefully, but I have no idea.


    Don't give too much importance to individual papers. At best, you're mostly disregarding all the work that led to it. At worst, you're placing a lot of faith in an idea analyzed through rose-tinted glasses and presented with a lot of intentional omissions.


    I mean, the zero-data reasoning paper directly cites previous work on the same principle. That in particular seems like a significant step forward, though - one of the main critiques of current methods is "humans don't learn by ingesting terabytes of Common Crawl, they learn from experience."


    Both Intellect-2 and zero-data reasoning work on LLMs ("zero data reasoning" is quite a misleading name for the method; it's not very groundbreaking). If you want to see a major leap in LLMs, you should check out what InceptionLabs did recently to speed up inference by 16x using a diffusion model. (https://www.inceptionlabs.ai/)

    Our algorithms for time-series reinforcement learning are abysmal compared to inference models. Despite the explosion of the AI field, robotics and self-driving are stuck without much progress.

    I think this method has potential, but someone else needs to boil it down a bit and change the terminology because, despite the effort, this is not an easily digested article.

    We're also nowhere close to getting these models to behave properly. The larger the model we make, the more likely it is to find loopholes in our reward functions. This holds us back from useful AI in a lot of domains.



    But when you try to run their code or use the product, it's either missing or doesn't perform as well as marketed in the paper. Personal recommendation for building mental resistance against AI hype is to:

    - read the paper and the concrete claims, results and limitations

    - download and run the code whenever possible

    - test for out of distribution inputs and/or practical examples outside of the training set



    Also not an expert, but I think this is like saying robots will dominate the world because we invented cameras, actuators and batteries.

    In other words, baby steps, not bunny hops.



    In some sense, they did. The world may not be full of human-like robots autonomously roaming the wastelands^Wurban landscape - but it's chock-full of actuators, sensors and batteries. There are sensors and actuators in your coffee maker. There are plenty of them in your car too, whether they control the wheels, or the angle of the mirrors, or the height of the windows, or the state of your door knob. Etc. And all of those robotic parts were mostly made by... larger robots in factories.


    I'll take a moment to comment on this, after reading the responses which challenge your conclusion.

    This criticism is entirely justified for a narrow read of your point, that the specific and relatively-widely-disseminated papers/projects, themselves, represent specific progress towards e.g. take-off or AGI or SI.

    But it's also unjustified to the extent that these particular papers are proxies for broader research directions—indeed, many of the other comments provide reading lists for related and prior work.

    I.e. it's not that this or that particular paper is the hop. It's that the bunny is oriented in the right direction and many microhops are occurring. What one chooses to label a hop amid the aggregate twitches and movement is a question for pedants.

    Meanwhile the bunny might be moving.



    > Emulating these mechanisms, particularly the temporal coding inherent in spike timing and synchrony, presents a significant challenge. Consequently, modern neural networks do not rely on temporal dynamics to perform compute, but rather prioritize simplicity and computational efficiency.

    Simulating a proper time domain is a very difficult thing to do with practical hardware. It's not that we can't do it - it's that all this timing magic requires additional hyperparameter dimensions that need to be searched over. Finding a set of valid parameters when the space is this vast seems very unlikely. You want to eliminate parameters, not introduce new ones.

    Also, computational substrates that are efficient to execute can be searched over much more quickly than those that are not. Anything where we need to model a spike that is delivered at a future time immediately chops a few orders of magnitude off the top, because you have to keep things like priority-queue structures around to serialize events.
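
    As a small illustration of that bookkeeping, here is a toy event-driven simulation where delayed spikes sit in a priority queue and are popped in time order. It is purely illustrative (no membrane dynamics or thresholds, made-up delays); the point is just that every future spike becomes a serialized queue operation rather than part of one batched tensor op.

    ```python
    import heapq

    def run_events(initial_spikes, connections, delays, t_max):
        """Deliver spikes in time order via a priority queue (toy example)."""
        # initial_spikes: list of (time, neuron); connections[n]: downstream targets of n
        queue = list(initial_spikes)
        heapq.heapify(queue)
        fired = []
        while queue:
            t, n = heapq.heappop(queue)      # earliest pending spike
            if t > t_max:
                break
            fired.append((t, n))
            for m in connections.get(n, []):
                # schedule the downstream spike at its per-edge delay
                heapq.heappush(queue, (t + delays[(n, m)], m))
        return fired

    conns = {0: [1, 2], 1: [2]}
    delays = {(0, 1): 1.5, (0, 2): 3.0, (1, 2): 0.5}
    print(run_events([(0.0, 0)], conns, delays, t_max=5.0))
    ```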

    Unless hard real time interaction is an actual design goal, I don't know if chasing this rabbit is worth it on the engineering/product side.

    The elegance of STDP and how it could enable online, unsupervised learning is still highly alluring to me. I just don't see a path with silicon right now or on the horizon. Purpose-built hardware could work, but it's like taking a really big leap of faith by setting some of the hyperparameters to const in code. The chances of getting this right before running out of money seem low to me.



    Hmm, suppose for argument's sake that feeding a batch of data through some moderately large FF architecture takes on the order of 100 ms (I realise this depends on a lot of parameters, but it seems reasonable for many tasks/networks).

    Now suppose instead you have a CTM that allocates 10 ms along the standard FF axes and then multiplies that out by 10 internal "ticks"/recurrent steps.

    The exact numbers are contrived, but my point is: couldn't we conceivably search over that second architecture just as easily?

    It just boils down to whether the inductive bias of building in some explicit time axis is actually worthwhile, right ?



    The ideas behind these machines aren't entirely new. There is research from 2002 introducing Liquid State Machines (LSMs)[1]: networks that generally feed continuous inputs into spiking neural networks, which are then read out by a dense layer connected to all the neurons in the network to read what is called the liquid state.

    These LSMs have also been used for other tasks, like playing Atari games in a paper from 2019[2], which shows that while these networks can sometimes outperform humans, they don't always, and they tend to fail at the same things more conventional neural networks failed at at the time. They don't outperform those conventional networks, though.

    Honestly, I'd be excited to see more research going into continuous processing of inputs (e.g., audio) with continuous outputs, and into training full spiking neural networks on that idea. We understand some of the mechanisms of plasticity, and they have been applied in this kind of research, but I'm not aware of anyone building networks like this with just the kinds of plasticity we see in the brain, with no backpropagation or similar algorithms. I've tried this myself, but I think I either have a misunderstanding of how things work in our brains, or we just don't have the full picture yet.

    [1] doi.org/10.1162/089976602760407955 [2] doi.org/10.3389/fnins.2019.00883
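
    To make the LSM idea above concrete, here is a toy sketch: a fixed, randomly and sparsely connected spiking "reservoir" driven by a continuous signal, with only a dense linear readout trained on the liquid state. Sizes, constants, and the ridge-regression readout are arbitrary illustrative choices, not the formulation in [1].

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n, steps = 100, 400
    W_in = rng.normal(scale=0.5, size=n)                                     # input weights into the liquid
    W_rec = rng.normal(scale=0.1, size=(n, n)) * (rng.random((n, n)) < 0.1)  # sparse, fixed recurrent weights

    def liquid_states(u):
        """Run a crude spiking reservoir on signal u and return its spike patterns."""
        v = np.zeros(n)
        states = np.zeros((len(u), n))
        for t, u_t in enumerate(u):
            spikes = (v > 1.0).astype(float)                          # threshold crossings
            v = 0.9 * v * (1 - spikes) + W_in * u_t + W_rec @ spikes  # leak, reset, drive
            states[t] = spikes                                        # the readout sees the liquid state
        return states

    # Train only the dense linear readout (ridge regression) to reproduce a
    # delayed copy of the input from the liquid state.
    u = np.sin(np.linspace(0, 8 * np.pi, steps))
    target = np.roll(u, 5)
    S = liquid_states(u)
    W_out = np.linalg.solve(S.T @ S + 1e-3 * np.eye(n), S.T @ target)
    print("readout MSE:", np.mean((S @ W_out - target) ** 2))
    ```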



    Ironically this webpage continuously refreshes itself on my firefox iOS :P


    It literally never loads for me.


    To me, the key to the next generation of models needs to be that neurons that fire together wire together. I think spiking neural networks offer an exciting alternative approach.
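
    For reference, the "fire together, wire together" intuition corresponds to the plain Hebbian update sketched below (spike-timing-dependent variants add a timing window); the rate-based form and constants here are just an illustration.

    ```python
    import numpy as np

    def hebbian_update(w, pre, post, lr=0.01, decay=1e-4):
        """Strengthen w[i, j] when pre-synaptic unit j and post-synaptic unit i
        are co-active ("fire together, wire together"); the decay term keeps
        weights from growing without bound."""
        return w + lr * np.outer(post, pre) - decay * w

    w = np.zeros((3, 4))
    pre = np.array([1.0, 0.0, 1.0, 0.0])    # pre-synaptic activity
    post = np.array([1.0, 1.0, 0.0])        # post-synaptic activity
    print(hebbian_update(w, pre, post))
    ```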


    I love the ML diagrams that hybridize math and architecture. They are much less dry than purely formal math.


    Seems really interesting, and the in-browser demo and model was a really great hook to get interest in the rest of the research. I’m only partially through it but the idea itself is compelling.


    Can someone explain this paper in the context of LLM architectures? It seems this cannot be combined with LLM deep learning - or can it?


    I'm quite enthusiastic about reading this. Watching the progress by the larger LLM labs, I've noted that they're not making the material changes in model configuration that I think are necessary to move toward more refined and capable intelligence. They're adding tools and widgets to things we know don't think like a biological brain. These are really useful things from a commercial perspective, but I think LLMs won't be an enduring paradigm, at least with respect to genuine stabs at artificial intelligence. I've been surprised that there hasn't been more effort toward transformative work like that in the linked article.

    The things that hang me up about current progress in intelligence are that:

    - there don't seem to be models which possess continuous thought: models are alive during a forward pass on their way to producing a token, and brain-dead any other time

    - there don't seem to be many models that have neural memory

    - there doesn't seem to be any form of continuous learning (to be fair, the whole online-training thing is pretty uncommon, as I understand it)

    Reasoning in token space is handy for evals, but it's lossy - you throw away all the rest of the information when you sample. I think Meta had a paper on continuous thought in latent space, but I don't think that effort has continued into anything commercialised.

    Somehow, our biological brains are capable of super efficiently doing very intelligent stuff. We have a known-good example, but research toward mimicking that example is weirdly lacking?

    All the magic happens in the neural net, right? But we keep wrapping nets with tools we've designed with our own inductive biases, rather than expanding the horizon of what a net can do and empowering it to do that.

    Recently I've been looking into SNNs, which feel like a bit of a tech demo, as well as neuromorphic computing, which I think holds some promise for this sort of thing, but doesn't get much press (or, presumably, budget?)

    (Apologies for ramble, writing on my phone)



    > The Continuous Thought Machine (CTM) is a neural network architecture that enables a novel approach to thinking about data. It departs from conventional feed-forward models by explicitly incorporating the concept of Neural Dynamics as the central component to its functionality.

    I'm still going through the paper, but it is very exciting to actually see the internal visual recurrence in action when the model confronts a task (such as the 2D puzzle) - it makes it easier to interpret neural networks across several tasks involving 'time'.

    (This internal recurrence may not be new, but applying neural synchronization as described in this paper is).

    > Indeed, we observe the emergence of interpretable and intuitive problem-solving strategies, suggesting that leveraging neural timing can lead to more emergent benefits and potentially more effective AI systems

    Exactly. Would like to see more applications of this in existing or new architectures that can also give us additional transparency into the thought process on many tasks.

    Another great paper from Sakana.



    Is it the same Sakana from the cheating AI coder tribulations? There were some fundamental mistakes in that work that made me question the team.

    https://www.hackster.io/news/sakana-ai-claims-its-ai-cuda-en...

    https://techcrunch.com/2025/02/21/sakana-walks-back-claims-t...



    They admitted it, apologized, and are in the process of revising the paper. Mistakes always happen, whether small or big. What is more important is to be transparent, learn from them, and make sure the same mistake doesn't happen again.





