Rhythm and Memory for Speech

This post was co-written with my fellow authors Mark Hurlstone and Graham Hitch.

Today we published an important paper in Cognitive Psychology. The paper is significant because it explains a link between rhythm and memory in terms of a common mechanism that connects speech processing, verbal learning and language development to rhythmic oscillations in brain activity.

The ability to remember and repeat words we’ve just heard, verbal short-term memory, is vitally important to human beings. In childhood, it lies at the heart of our ability to learn new words and to form the rich vocabulary that is the foundation of language. Throughout our lives the amount we can keep in mind places limits on what we can and can’t do, so the capacity of short-term memory is a factor in intelligence and learning. Given its importance it may seem surprising that verbal short-term memory has a very limited capacity, and even more surprising that we still do not understand what limits this capacity.

We do know, though, that rhythm plays an important part in governing how much we can remember. A simple and familiar example is keeping a phone number in mind – you can remember more digits if they are separated into shorter groups than if they are presented in an evenly paced sequence. It’s been known since the sixties that some rhythmic grouping patterns are much better than others, but until now there’s been no satisfactory explanation for this.

The Importance of Serial Order

Speech is inherently serial, whether we are thinking about the ordering of sounds in a syllable or of the words in a sentence. This serial structure can be incorporated into our memories: we remember the order of the speech sounds that make up a word; we remember the order of digits in a phone number; we remember the order of the words in a poem. The main limitation of verbal short-term memory lies in storing this serial structure the first time we hear (or read) something and in retrieving it accurately after a delay. Most people can remember a few words without difficulty, but when we try to remember a sequence of, say, 10 or 12 unrelated words, most of us will struggle to get them all in the right order.

The most successful previous models explain this limitation using a mechanism called Competitive Queuing, in which items to be sequenced become associated with a dynamic timing signal. The state of the timing signal changes continuously as we encounter a sequence so that (through a simple “Hebbian” process) words perceived at different times become associated with distinct states of the timing signal.


Figure 1. Competitive Queuing. Items are associated with a gradually changing timing signal. When the timing signal is replayed, the items compete to be recalled. Competition is strongest between items associated with similar states of the timing signal (indicated by the coloured ribbon).

To recall a sequence of items, the timing signal is “replayed”, which has the effect of activating the items in the sequence they were originally experienced. However, as nearby states of the signal are similar, neighbouring items are active in parallel and compete to be output; the greatest competition is between items close together in the original list. With a little noise in the system, the wrong item will occasionally be selected for output. The “correct” item then remains active and is very likely to be output at the next position in the sequence. This is important as it helps explain the most common type of error, where items near the middle of a list swap places with one another.


Figure 2. Order errors. In competitive queuing models items associated with similar states of the timing signal are prone to exchange with one another in errors. For example, in recalling a list of letters (“HMTBQJ”), “T” and “B” are associated with similar states of the signal (represented by similar colours in the diagram). They compete strongly with one another so there is a chance that “B” will be produced first, leaving “T” to be selected next. Items at the beginning and end of the list have fewer direct competitors and so are less likely to be involved in order errors.
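The mechanism in Figures 1 and 2 can be sketched in a few lines of Python. This is only an illustrative toy, not the model from the paper: the Gaussian-shaped timing states, the `width` and `noise_sd` parameters, and the function names `timing_states` and `recall` are all assumptions made for this example.

```python
import math
import random

def timing_states(n_positions, width=1.0):
    """Toy timing signal (an assumption): one unit-length vector per list
    position, shaped as a Gaussian bump so that nearby states overlap."""
    states = []
    for i in range(n_positions):
        bump = [math.exp(-((i - j) ** 2) / (2 * width ** 2))
                for j in range(n_positions)]
        norm = math.sqrt(sum(v * v for v in bump))
        states.append([v / norm for v in bump])
    return states

def recall(items, noise_sd=0.1, width=1.0, rng=None):
    """Replay the timing signal and let the stored items compete."""
    if rng is None:
        rng = random.Random()
    states = timing_states(len(items), width)
    # "Hebbian" storage: item k is linked to the state active when it was heard.
    weights = list(states)
    remaining = set(range(len(items)))
    output = []
    for state in states:  # replay the timing signal step by step
        best, best_activation = None, -math.inf
        for k in sorted(remaining):
            match = sum(s * w for s, w in zip(state, weights[k]))
            activation = match + rng.gauss(0.0, noise_sd)  # noisy competition
            if activation > best_activation:
                best, best_activation = k, activation
        output.append(items[best])
        remaining.discard(best)  # a recalled item drops out of the competition
    return output
```

Calling `recall(list("HMTBQJ"), noise_sd=0.2)` a number of times should mostly return the list in order, with occasional exchanges between neighbours. An item that loses the competition stays in the `remaining` set, which is why it tends to be produced at the very next step.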

Competitive Queuing models provide a great explanation for many properties of verbal short-term memory (for example, primacy and recency effects that mean errors more often occur in the middle of a sequence), but previous models could not explain in any detail how the rhythm of the items might affect memory capacity. Earlier models had recognised the importance of rhythm and offered partial explanations. However, they only tackled regular rhythms, and even then had to be told in advance what the rhythm of the sequence would be. The problem is that when we encounter words or sounds in everyday life the rhythm may be irregular and unpredictable.

In the paper we present a new Competitive Queuing model – BUMP – which explains the relationship between rhythm and the capacity of verbal short-term memory. The BUMP model replaces the timing signal in earlier models with a more detailed mechanism that is sensitive to the timing of the words that we experience. The BUMP mechanism is a hypothetical population of neurons which respond in a rhythmic way to changes in the intensity of speech sounds (amplitude modulations – AM). We show how such a population, acting as a timing signal, can be used to encode sequences whose timing is irregular and unpredictable, like that of the speech we encounter in everyday life. It explains, in detail, how the timing of one sequence of words can make it easier or harder to remember compared to another containing the same number of items and lasting the same length of time. In doing so it predicts an intimate relationship between the mechanisms of rhythm perception and the capacity of verbal short-term memory.

The model in more detail

BUMP stands for “Bottom-Up, Multi-scale Population”. “Bottom-Up” refers to the way our idealised neurons, or ‘oscillators’, respond to bursts of sound; “Multi-scale Population” refers to the way that different neural oscillators are sensitive to pulses of different duration.


In neuroscience and psychology the term “bottom-up” is used to refer to sensory processes which are driven by input from the outside world as distinct from those that are subject to “top-down” influences from cognition: expectations, goals, targets and strategies. In two separate experiments, our study shows grouping effects in memory don’t seem to relate to strategies and expectations, in that we see the same patterns whether or not the grouping pattern can be predicted. For example, a list of nine digits grouped into 3s by a slight pause after every third item (e.g., 017 – 243 – 986) is much easier to remember than the same sequence grouped in a less regular way (for example 0-1724398-6). This is true whether or not we can predict the pattern in advance. There seems to be something intrinsically easier about remembering sequences broken into evenly spaced groups – even when we cannot anticipate it. This suggests that the rhythm is influencing memory through a “bottom-up”, sensory process. In the paper we describe how this could work.

In the BUMP model each neural oscillator has an intrinsic rhythm or tuning – think of a child’s swing: you give the swing a push and it oscillates backwards and forwards at a specific rate. In the BUMP model, the push comes from the bursts of sound associated with spoken words, and the oscillations are fluctuations in the activity of neurons.

BUMP oscillators (unlike swings) have a fairly brief response to each “push” – the fluctuations settle quickly after a burst of sound. This means that they are sensitive to recent changes in the speech rate. If the pulses of intensity come fairly regularly, near to their intrinsic tuning, the oscillators keep time, tracking the pace of the incoming speech over quite a wide range.


Figure 3 (from the paper). a) BUMP oscillators are hypothetical neurons whose firing rates vary according to the match between their “impulse response” and changes in the intensity of speech – “Input amplitude” (represented by the wave in b, top row). The bottom row of b) shows how the firing rate tracks rhythmic pulses in the intensity of speech even when (as on the right hand side) the speech rate varies.

Multi-scale Population

One problem in dealing with speech in the real world is that timing is irregular and unpredictable. The brain can’t be set up to deal with a particular ideal rate or rhythm, but must be able to deal with anything that is thrown at it. So in the BUMP model, we envisage a whole population of neurons with different sensitivities. Some are sensitive to short changes in the intensity of speech (say individual words); they respond to short pulses of sound. In these neurons, activity oscillates rapidly and the oscillations die away quickly after a burst of sound. Others are sensitive to much more gradual changes (say corresponding to a phrase or sentence); their activity oscillates more slowly and the oscillations die away more gradually after a burst of sound.

Because the oscillators in the BUMP model have a fairly brief response to each pulse, they tend to track the rhythms of the incoming speech. Oscillators with slightly different intrinsic tunings act together, oscillating in concert in response to the rhythms of speech associated with syllables, words, phrases and sentences. When speech is organised around a strong rhythm the oscillators tend to resonate to it, producing a richer and more coherent response than when the pulses occur more haphazardly.


Figure 4. a) Technically our neural oscillators come in pairs. This schematic illustration (from the paper) shows a pair of AM-tuned “neurons” in the BUMP model. In each case their firing rate varies at the same rate after a burst of speech sound, but their responses are offset from one another. b) shows how the offset response of the pair (lower trace) changes during presentation of a sequence of 9 items represented by triangular AM pulses (upper trace). c) At any time the combined output of the two cells can be represented as a point in 2D space defined by the relative activity of the two “neurons” (grey circle). As time passes the point rotates around the origin. The phase of the output can be represented by a hue defined by the angle made with the origin, while its amplitude can be represented by the brightness of the chosen colour. d) The coloured ribbon shows how phase and amplitude vary over time as the grouped list is experienced. In this instance the cells are tuned to respond to modulations with a period of 1.94s, which in this example happens to be close to the grouping period of the items in the list (b). The phase of the pair’s output goes through approximately one cycle per group.
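The idea of reading a phase and an amplitude from a pair of offset cells can be illustrated with a short sketch. It makes several simplifying assumptions that are not taken from the paper: each pulse is treated as an instantaneous event, the two cells respond with a damped cosine and a damped sine (a quarter-cycle offset), and the `decay` time constant and function names are invented for the example. Only the 1.94s tuning comes from the caption above.

```python
import math

def pair_response(pulse_times, t, period=1.94, decay=1.0):
    """Summed output (x, y) of a hypothetical offset pair at time t.

    Every pulse that has already happened adds a damped cosine to one cell
    and a damped sine to the other (damped-sinusoid form is an assumption).
    """
    x = y = 0.0
    for pulse in pulse_times:
        dt = t - pulse
        if dt < 0:
            continue  # bottom-up: pulses in the future have no effect yet
        envelope = math.exp(-dt / decay)       # response fades after each push
        angle = 2 * math.pi * dt / period      # intrinsic tuning of the pair
        x += envelope * math.cos(angle)
        y += envelope * math.sin(angle)
    return x, y

def phase_and_amplitude(x, y):
    """Angle of the point (x, y) around the origin, and its distance from it."""
    phase = math.atan2(y, x) % (2 * math.pi)   # would map to hue in the figure
    amplitude = math.hypot(x, y)               # would map to brightness
    return phase, amplitude
```

Sampling `phase_and_amplitude(*pair_response(onsets, t))` at a series of times `t`, for a list of onset times `onsets`, gives the kind of information that the coloured ribbon in panel d) displays.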


Figure 5, excerpted from the paper, shows how the multi-scale population (oscillators with different intrinsic tunings) responds to an evenly timed (a, top) and regularly grouped (b, bottom) sequence. The regular grouping activates more oscillators, providing a richer and more distinctive representation of serial order. However, a side effect is that items that occur at corresponding points in different groups may be more prone to swap with one another.

A BUMP timing signal

As in earlier Competitive Queuing models, the BUMP model stores the serial order of a sequence in terms of an association between each item and the state of the timing signal when it occurred. But unlike earlier models, the timing signal reflects the states of the oscillators, which vary systematically with the rhythms of a sequence of sounds or words. Because the oscillators are bottom-up, the model responds on-line and, unlike previous models, does not have to be told in advance what rhythm to expect. When we replay the timing signal (to simulate retrieval) the items are activated in the order in which they were originally encountered, but nearby items tend to compete with one another (because they are associated with similar oscillator states). The precise timing of the original sequence affects the degree of similarity between different list positions, making one pair of items more distinct (and less likely to be involved in an order error) or another pair less distinct (and more likely to exchange with one another). Our simulations show that these variations produce remarkably similar patterns of overall accuracy and of specific types of error to those we see in the experimental data.
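To see how timing could change the similarity between list positions, here is a rough sketch of a multi-scale timing signal. It is not the published model: the damped-sinusoid oscillators, the choice of `periods`, the rule that slower oscillators decay more slowly, and the onset times for the `even` and `grouped` lists are all hypothetical values chosen for illustration.

```python
import math

def population_state(pulse_times, t, periods, cycles_to_decay=1.0):
    """State of a hypothetical oscillator population at time t.

    One offset pair per tuning period; slower pairs also fade more slowly.
    """
    state = []
    for period in periods:
        x = y = 0.0
        for pulse in pulse_times:
            dt = t - pulse
            if dt < 0:
                continue  # only pulses heard so far can drive the oscillators
            envelope = math.exp(-dt / (cycles_to_decay * period))
            angle = 2 * math.pi * dt / period
            x += envelope * math.cos(angle)
            y += envelope * math.sin(angle)
        state.extend([x, y])
    return state

def cosine_similarity(a, b):
    """Similarity of two states: 1 for the same direction, lower otherwise."""
    dot = sum(u * v for u, v in zip(a, b))
    norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b))
    return dot / norm if norm else 0.0

def similarity_matrix(onsets, periods):
    """Similarity between the timing-signal states stored with each item."""
    states = [population_state(onsets, t, periods) for t in onsets]
    return [[cosine_similarity(a, b) for b in states] for a in states]

# Hypothetical onset times in seconds for nine items (both lists last 5 s).
even = [i * 0.625 for i in range(9)]
grouped = [g * 2.0 + k * 0.5 for g in range(3) for k in range(3)]
periods = [0.5, 1.0, 2.0, 4.0]  # assumed tunings, from fast to slow
```

Comparing `similarity_matrix(even, periods)` with `similarity_matrix(grouped, periods)` shows how the same nine positions can end up with different patterns of mutual similarity when only the timing changes. In a Competitive Queuing scheme, pairs of positions with higher similarity are the ones more likely to exchange in recall.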

To make this concrete, if you try to remember a phone number such as 017243986 you may struggle, but if you are presented with the same digits separated by pauses (017-243-986) you are more likely to remember it correctly. However, if you do make a mistake you are more likely to switch (say) the 2 and the 9 in the grouped phone number. Notice this is rather like the way speech sounds swap around in typical speech errors (for example, spoonerisms such as “you have tasted two worms”) – elements from corresponding positions in different syllables switch with one another, and in fact our BUMP mechanism was originally devised to help explain such patterns.

The model responds very much like humans when remembering sequences with a wide range of irregular rhythms. It explains both why regularly grouped lists are recalled much better overall than other sequences and how items that occupy the same position in different groups tend to swap with one another in errors, as well as many other subtle features of the data, such as the way timing affects the tendency for errors to occur at specific points in the list. These multiple detailed features are natural properties of encoding serial order with a timing signal based on bottom-up multi-scale oscillators, and don’t depend strongly on the precise choice of oscillator tunings. The fit to the data is compelling evidence that the BUMP mechanism or something very like it lies at the heart of verbal short-term memory.


To summarise, we put forward a model built around a sensory process that sets up neural oscillations based on the shifting rhythms of speech. This mechanism is used to encode the serial order of spoken material in short-term memory, determining its capacity. Because of the central role that verbal short-term memory plays in learning new words we’d expect a close connection between this oscillatory timing process and our ability to speak and understand speech. More generally, work on serial order suggests that timing signals are critical to our ability to sequence speech sounds and words, which is in turn vital for the emergence of language. The part played by rhythm in this process leads us to speculate that a BUMP-like mechanism might underpin the evolution of language and its development during childhood.


2 thoughts on “Rhythm and Memory for Speech”

    1. This is a really good question and one that (I admit) I had not thought about much till I wrote this blog article.

      Luckily other people have, and the answer seems to be a tentative yes: http://jn.physiology.org/content/94/3/1904.abstract?ijkey=e2f3b8260a34acc509014e92ea21b0ab61742bac&keytype2=tf_ipsecsha

      In humans, the strength of the oscillations is linked to comprehension:

      My guess would be that this type of system is useful for a variety of processes linking hearing, attention, memory and communication, but one that’s become specialized for language in humans. Unlike other primates, we make very complex sequences of sound where meaning is carried by their order. By combining simple elements (phonemes and syllables) in different sequences we are able to express an almost limitless range of ideas. My understanding is that wild primates are limited to simple calls and gestures which don’t appear to have much sequential structure.
