News Flash: Artificial Intelligence Discovers That Babies Can Learn Language!


One of Nature’s most spectacular miracles occurs when human babies, initially all drool and dirty diapers and wordless emotion, transform into word-wielding toddlers. Researchers who specialize in artificial intelligence have long sought to understand how a non-linguistic system can acquire language competence. Their approach, historically, has been to input mountains of linguistic data into these learning-capable computer systems — the assumption being, apparently, that language is so complex and nuanced that it can only be learned through a vast number of exemplars. The result of this “massed practice” approach has been encouraging in a way: Tools like ChatGPT which can process language commands, consult language-rich data sources, and render competent-looking language responses (though there are occasional hiccups; see this past post on AI).

It’s now obvious that computer programs can go far when it comes to mastering language. But making computers capable of human-like responding tells us little, if anything, about how human beings acquire human-like responding.

The cognitive revolution taught that brains are computers (insert eye roll here).

At the height of the so-called “cognitive revolution” in Psychology, circa 1950s to 1970s, it was popular to imagine the human brain as a sort of computer, and therefore to assume that computer models would help us understand the mind/brain. But past work on language learning by artificial intelligence shows the folly of that assumption. Traditionally, in order to build their language expertise, these systems have been fed more language data than some humans might digest in a lifetime. Even if we accept the shaky premise that computer code mimics the structure and function of biological architecture, it’s hard to buy that AI systems inputted with impossibly large linguistic data sets say anything useful about human language learning, which begins very early in life and after a limited amount of verbal experience.

Enter a new study, published in Science, that took a very different approach to bootstrapping language capabilities in an AI system. Here’s an accessible popular-press description of the study. As a preview, it’s worth quoting the journal editor’s brief introduction to the study:

How do young children learn to associate new words with specific objects or visually represented concepts? This hotly debated question in early language acquisition has been traditionally examined in laboratories, limiting generalizability to real-world settings. Vong et al. investigated the question in an unprecedented, longitudinal manner using head-mounted video recordings from a single child’s first-person experiences in naturalistic settings. By applying machine learning, they introduced the Child’s View for Contrastive Learning (CVCL) model, pairing video frames that co-occurred with uttered words, and embedded the images and words in shared representational spaces. CVCL represents sets of visually similar things from one concept (e.g., puzzles) through distinct subclusters (animal versus alphabet puzzles). It combines associative and representation learning that fills gaps in language acquisition research and theories.


Specifically, an AI system was fed input from just 61 hours of audio/video recordings taken from a helmet-mounted camera worn by a boy, “Sam,” at intervals between the ages of 6 and 25 months. The recordings preserved verbal behavior directed at Sam but also whatever incidental verbal behavior might take place around him. Afterward, the AI showed evidence of “understanding” a variety of words, and also of generalization to some novel exemplars that we behavior analysts might say are part of relational frames with the target words.
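For readers curious about what “contrastive learning” means mechanically, here is a minimal sketch of the general idea: embeddings of video frames and of the words heard alongside them are pulled together when they co-occurred and pushed apart when they did not. This is not the authors' code. The function names, the toy two-dimensional vectors, and the temperature value are all assumptions chosen for illustration, and a real model would learn the embeddings rather than take them as given.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    # Negative log-probability of the target index under a softmax
    # (computed with the max subtracted for numerical stability).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(frame_embs, word_embs, temperature=0.07):
    # frame_embs[i] and word_embs[i] are assumed to have co-occurred.
    # The temperature value is an illustrative assumption.
    frames = [normalize(f) for f in frame_embs]
    words = [normalize(w) for w in word_embs]
    n = len(frames)
    sims = [[dot(f, w) / temperature for w in words] for f in frames]
    # Each frame should pick out its own word among all words in the batch...
    frame_to_word = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    # ...and each word should pick out its own frame among all frames.
    word_to_frame = sum(
        cross_entropy_row([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return (frame_to_word + word_to_frame) / 2

# Toy data (hypothetical): two frames and the two words heard with them.
frames = [[1.0, 0.1], [0.1, 1.0]]
matched_words = [[0.9, 0.2], [0.2, 0.9]]
swapped_words = [matched_words[1], matched_words[0]]

aligned = contrastive_loss(frames, matched_words)
misaligned = contrastive_loss(frames, swapped_words)
print(aligned < misaligned)
```

The loss is lower when each frame sits near the word it was paired with, which is the pressure that, over many pairings, lets word-referent mappings emerge from co-occurrence alone.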

Long story short: You don’t have to train AI on the full text of all 70,000 books in Project Gutenberg before it begins to demonstrate language competence.

There’s another important implication of the study. As Scientific American commented in its coverage of the study:

Some cognitive scientists and linguists have theorized that people are born with built-in expectations and logical constraints that make this possible. Now, however, machine-learning research is showing that preprogrammed assumptions aren’t necessary to swiftly pick up word meanings from minimal data.

Traditional AI approaches represent one end of a continuum of language learning theory, in which it’s thought that only an insane amount of verbal experience can yield language repertoires. At the other end of the continuum is the notion, popularized within linguistics, that humans are essentially born into language. The classic example of the “built-in expectations and logical constraints” mentioned above is Noam Chomsky’s concept of a biologically-scripted “Language Acquisition Device” that (he says) allows humans to learn language almost effortlessly (see Chomsky’s seminal argument in his blistering review of Skinner’s Verbal Behavior). Chomsky thought this mechanism explained the explosion of language that takes place in early childhood. But clearly an AI program possesses no such mechanism, so even if a Language Acquisition Device can exist, it’s not necessary to language learning.

Overall, the new AI study illuminates two things that behavior analysts have known for a long time. First, although humans may indeed possess some innate advantages for language learning, the Chomskian notion that they come pre-equipped with a sense of grammar and syntax was always illogical (for some reasons why, see Kenneth MacCorquodale’s brilliant 1970 rebuttal to Chomsky’s take-down of Skinner, and a sort of update/extension published by Dave Palmer in 2017). Oh, and by the way, note that linguists, who once were in Chomsky’s thrall, have gradually come around to accepting that his ideas are hollow and unsupported by linguistic data.

Second, human children don’t need to digest the full text of 70,000 books to start developing language capabilities. This has been evident ever since applied behavior analysts began helping children with disabilities learn verbal behavior; ever since the earliest studies of derived stimulus relations showed clear relevance to verbal behavior; and ever since Betty Hart and Todd Risley’s groundbreaking research on early child language experience. Axiomatic to behavioral work overall: What matters in child language acquisition is the RIGHT experiences. This is a topic on which AI seems unlikely to supplant good old behavioral research any time soon.

Indeed, the moral of this story is close to what Skinner advanced in his classic paper “Phylogeny and ontogeny of behavior” (and elsewhere). In the 1960s and 1970s Skinner noticed an increasing tendency for scientists to focus on genetics and brain anatomy/physiology to try to make sense of behavior. Skinner’s retort: In such efforts, a science of behavior exerts primacy because it reveals the regularities in behavior that other disciplines need to explain. Skinner did not discount the possibility that at some future date those other approaches could make unique contributions; he merely argued that in their infancy the best that could be expected is to identify biological factors that correlate with behavioral phenomena. But for this to be possible, of course, functional behavior patterns first have to be clearly delineated.


The same kind of observation can be made about AI. Maybe someday language-learning computer programs really will help us understand how nonverbal human newborns transform into verbal creatures. But the present reality, as telegraphed in the “helmet” study, might be the converse: Behavioral studies of verbal behavior suggest some ways that computers can learn to be verbal. The study described here should be viewed as the minuscule tip of a prodigious iceberg. Consider: In the new study, there was nothing scripted in the input fed to the AI system, only what Sam happened to experience. Imagine instead a test in which the system is fed 61 hours of carefully programmed experience built around three-term contingencies and derived stimulus relations. This ought to produce much more verbal behavior, more efficiently, from the same amount of experience. And that’s just one weapon in a sizable arsenal that behavior analysts have built up for speeding verbal learning in children. A computer program that learns language in the same way (e.g., see Postscript) would be considerably “more human” than the ones developed so far.


  • In studying unscripted interactions between mothers and infants, Ernst Moerk (1990) found that a large proportion of what mothers and children did mapped onto three-term contingency patterns. Presumably that kind of thing was part of Baby Sam’s naturalistic experience. One of the take-home messages of the famous Hart and Risley study — one that has been harnessed in the Thirty Million Words intervention programs for parents and kids — is that those sorts of contingencies matter. The “quality” of language experience is important. Speaking loosely, language experience has to be interactive; you can’t just turn on Sesame Street all day and expect your child to learn.
  • To be clear, the Hart and Risley study also shows that quantity of language experience matters: In effect, language acquisition correlates with how often kids are talked with. In this respect, traditional AI studies that input mountains of linguistic data aren’t exactly wrong. Recall that continuum in which, at one end, language learning requires massive experience and, at the other, it emerges almost magically from very limited language exposure. Hart and Risley showed that language learning is cumulative. Nearly all young children do it, but how much depends on the richness of their experience. A simple way to measure that richness is in terms of how many words a child has heard others speak by age 3. Hart and Risley found that this can differ across kids by tens of millions of words, with the difference predicting later language competence and academic achievement. Just remember that this isn’t passive input; there’s an interaction between quantity and quality, with the latter pointing to BEHAVIORAL mechanisms.