Can psychologists raise intelligence? We think so: Bridging the behavioral and trait theory divide

I am very pleased to introduce Shane McLoughlin, a Lecturer in Psychology at the University of Chester in the UK. Shane is a dynamic and compelling young behavioral psychologist, who will shortly complete his PhD at the University of Chichester, UK. He is conducting some of the largest-scale Relational Frame Theory research studies to date. As you will read below, a really important development is the newly founded lines of productive communication Shane has initiated with renowned leaders within the cognitive, trait, and behavioral genetics fields. None of these approaches has been particularly open to behavior-analytic methods or interpretations in the past. These new connections will hopefully lead to a much wider dissemination of our behavior science than has heretofore been the case. I am sure you will enjoy this blog on some of the most interesting and exciting research being carried out in the behavior-analytic and contextual behavioral science fields. There is a wealth of links to fascinating papers, videos, website articles, books, etc., embedded within this blog, so please do ‘click’! – Dr Ian Tyndall, University of Chichester, UK.

Relational Frame Theory (RFT) hypothesises that the ability to relate stimuli (e.g., A is the same as B, B is the same as C, and so on) and then derive novel relations (e.g., therefore C is the same as A), in a way that allows us to adapt to our environments, adds up to what we commonly call “human intelligence”. This will be one of two blogs on the research to date on the Strengthening Mental Abilities with Relational Training (SMART) program, which aims to train the ability to derive relationships between symbolic properties of stimuli, based on contextual cues, for functional purposes. As such, I’d like to make this blog a little bit different. In deciding how to do that, I tried to think about how I’m a little bit different. For better or worse, I have a tendency to put forward unpopular alternative perspectives that others will not. I do this because I believe that learning what appears to be true now, whether good or bad, is the starting point from which we might improve things in the future. We need to know what’s wrong if we’re to have a chance of fixing it. With that in mind, here I will write a little about the barriers to implementing SMART training that I believe apply to behavior-analytic interventions more broadly, anticipating that Dr Sarah Cassidy will do a great job of focusing on the many triumphs of this research effort to date in her subsequent blog. This will be an attempt at steel-manning non-behaviorist perspectives.

It is important to remember, as I play devil’s advocate, that I am a behaviorist and believe in the good that this perspective on psychological science can bring. During the two years before my PhD, I worked full-time teaching adaptive behavior to kids with autism and other complex needs using applied behavior analysis. The staff there gave everything they had to ensure the long-term welfare of those kids, often in the face of the most adverse of circumstances. I hadn’t learned about what really goes on in psychological practice during my undergraduate degree, so I was lucky to have people who were well-established in the Irish behavior-analytic community as my mentors to help me through. I owe a huge thank you to Martina, Laura, Carol, Amanda, Jackie, Aaron, Catherine, and all my colleagues there. At the same time, I was cutting my teeth as an experimental behavior analyst under the supervision of Dr Ian Stewart, with whom I also discussed competing philosophies of science at length. From day one, Ian pushed me to learn how to code experiments and make sure to be technically accurate in a way that applied practice doesn’t teach quite so well. I managed to gain a range of perspectives on how useful the behavior-analytic perspective on human psychology can be for changing relatively complex behavior.

I remember presenting an experiment that Ian Stewart and I had worked on at the Experimental Analysis of Behavior Group conference at University College London in 2014. I was cotton-mouthed as I presented in front of some of the biggest names in the field. I’ll never forget Dr Lanny Fields’ face turning to stone (not necessarily what happened, but it’s how I remember it!) as I skimmed over important technical details. He seemed to confer with another senior and greatly respected colleague, Dr Jack Marr, before dressing me down somewhat at the end, emphasising the importance of method over all else. Dr Julian Leslie took me aside afterwards and – he was very kind about it – explained that I had perhaps been a bit evangelical about the importance of RFT. At that same conference, I also met Dr Ian Tyndall and Dr Bryan Roche. They and their colleagues were kind and enthusiastic about the work I had presented. That summer, in fact, I joined Ian Tyndall at the University of Chichester to independently test, for my PhD, the efficacy of Bryan Roche’s SMART program for improving educational outcomes, which is what this blog is about.

Ian Tyndall has always supported me and been enthusiastic about my ideas and abilities. However, the (perceived) dressing down by Dr Fields, whom I greatly respect and admire, has stuck with me ever since. In my PhD, I wanted, above all else, to use good methods when testing the utility of SMART training. My prescribed gap in the literature to fill was to see whether SMART training effects transferred to real educational outcomes. However, in my opinion, the real gap was in methodological rigour, as most SMART studies to date had very small samples (i.e., low power/generalisability) and weak (if any) control conditions. This wasn’t the fault of those who performed the early tests. SMART aims to raise cognitive ability. This has not previously been achieved in over 100 years of trying. And boy, have people tried. The next best training out there, N-Back working memory training, achieved a mean improvement in Intelligence Quotient (IQ) of just 2-3 points according to a 2015 meta-analysis. Improving cognitive ability is, in many ways, one of the holy grails of psychology. As such, it requires more than a brief intervention, hence the small size of early research studies. Before an intervention is considered effective and evidence-based enough to be adopted on a larger scale, a substantial, well-controlled body of evidence is required. The researchers leading SMART studies to date could not recruit large samples that could be divvied up into robust experimental conditions; they had to start at the foot of a rather steep mountain. Nonetheless, those smaller studies yielded large effects, with improvements in IQ in the 15-30 point range. Considering that 15 points represents one whole standard deviation on many IQ tests, this seems remarkable. In the context of a broader literature saying that this was virtually impossible, it seemed that behavior analysts might have the tools to climb a little further, even if those effects were to diminish substantially in more stringent tests.

Yes, we have climbed a bit further, and of course it’s all very exciting indeed, but that’s just a teaser for now. I’m going to make you wait a paragraph or two for that (sorry!). To put it all in context, though, I’d like to take a minute to talk about IQ, something that behavior analysts and many others tend not to like so much, and that most people who do a psychology degree end up misinformed about. So, what is IQ? If you had a list of all possible questions and picked a random sample of questions of varying difficulty from it, how you perform on that subset predicts how you will perform on any other random subset. Therefore, how you rank amongst your peers on one random sample of questions is generally how you will rank amongst them in all other intellectual endeavours. This effect replicates. Really well. It also has many real-life implications. If you measure someone’s ability at solving puzzles, it will reliably correlate with how much general knowledge they have, how well they do in school, and so on. This is one of the best-replicated findings in psychology, perhaps second only to behavioral selection by consequences, which we behavior analysts are always banging on about.
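To make that claim concrete, here is a minimal simulation sketch (not any particular IQ test, just a made-up latent-ability model) showing that scores on one random subset of items track scores on a disjoint random subset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 1000, 100

# Made-up latent-ability model: each person has one general ability score,
# and every item score reflects that ability plus item-specific noise.
ability = rng.normal(size=n_people)
scores = 0.3 * ability[:, None] + rng.normal(size=(n_people, n_items))

# Split the items into two random, non-overlapping halves and score each half.
items = rng.permutation(n_items)
half_a = scores[:, items[:50]].mean(axis=1)
half_b = scores[:, items[50:]].mean(axis=1)

# Rank order on one random half strongly predicts rank order on the other
# (roughly r = .8 with these invented loadings).
print(np.corrcoef(half_a, half_b)[0, 1])
```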

We don’t have any better predictors of life success in the social sciences than IQ, and it appears to be largely innate. It predicts many important life outcomes (from mortality rates to job performance to number of patents held), and almost no other factors explain additional variance in those outcomes over and above IQ. This includes socioeconomic status. For example, one study looked at state examination results (a proxy measure of IQ) across UK public non-selective schools, public selective schools (known as Grammar Schools), and private schools, and found that the public non-selective schools performed worst, while the selective and private schools performed about the same. It is tempting, at this point, to conclude that the better-performing schools have better resources and the kids get more encouragement and so on. However, these academic attainment differences mirrored the genetic differences between these schools. When the researchers statistically controlled for the factors that “select” people for Grammar Schooling, the genetic differences between schools went away. As such, it appeared that those who (i) were selected out of public non-selective schooling for high-achieving Grammar Schools and (ii) those who were in the high-achieving private schools were genetically different from those who weren’t. Yes, they were privileged, but it seems by nature, not society. This study is one of several hundred genome-wide association and twin studies that support the hypothesis that IQ is the main predictor of success in life, and that the environment has little to do with it.

From the point of view of the behavior analyst and her single-subject designs, the prospect of improving IQ in a manner convincing enough to be acceptable to mainstream psychology would appear bleak. Many behavior-analytic interventions like SMART, a form of precision teaching, are resource-intensive (i.e., they are logistically challenging or time-consuming to implement, even just for a study, and therefore expensive). The resulting studies are also often so small that they may be excluded from meta-analyses, which are crucial before interventions can be adopted on a large scale. However, within this bleak picture is some opportunity for us. The gatekeepers of society will demand gold-standard evidence before implementing interventions on a large scale. They want highly powered controlled trials, and meta-analyses that include only such studies. That standard is well specified, and has been for a very long time now. It is nothing less than an ethical obligation on our part to conduct such tests if we truly believe that our interventions work and that the world would be improved if they were implemented on a large scale.
Now that we have contextualised the challenge ahead, here’s how we have chipped away at the mighty mountain of data in favour of the hypothesis that we might be powerless to help those we care about to achieve more than they were going to anyway. Never fear, we have indeed climbed a little further up the mountain in terms of providing evidence for the efficacy of SMART training for improving educational outcomes.

The first study we conducted was a small-scale pilot test to see whether, in principle, we could get away with providing shorter SMART training interventions to smaller samples and still see some of the effects that previous, more intensive, small-scale research had reported. If so, this would allow us to streamline the intervention so that it would be logistically manageable for those implementing the training (e.g., schools). We also piloted a souped-up version of SMART that I programmed to train analogical relations (i.e., relating stimulus relations to one another). As it turned out, IQ did not improve, so we concluded that we did indeed need larger numbers of participants to advance the research program. Nonetheless, we did see some evidence that those who did a small amount of SMART training got through IQ tests faster, without a decrease in the number of correct answers, and we published that research in the European Journal of Behavior Analysis.

In our next study, we recruited a cohort of about 150 first-year secondary school boys (age 12-14). We stratified those participants into three groups: one that received SMART training from October of the school year, one that received SMART training from the following January, and another that began SMART training in April. This is a simple version of a stepped-wedge design. In this way, we expected that those who received more SMART training would show the greatest rises in IQ from October to June. However, it was here that we encountered problems with logistical implementation that are not necessarily obvious when conducting smaller studies or, indeed, when engaging in one-to-one practice with clients. By completing SMART training stages, participants could earn tickets to a monthly draw for a cash prize. In this way, we hoped to reinforce engagement with the training. As expected, we had a wide range of training completion (from 0 to 55 stages completed). However, the mean number of stages completed was only 13, indicating that the reinforcement contingencies were unsuccessful. In contrast, participants in most previous studies had completed all 55 stages, and some of those studies used younger children; so SMART was not simply too difficult. Previous studies with greater training completion employed smaller samples (max N = 28, including control conditions), and so researchers were able to monitor individual progress. When we plotted training completion in our study as a histogram, we found a Pareto distribution, consistent with Price’s Law (basically, that a tiny number of people do most of the work; see Figure 1 and the sketch below). Furthermore, those who started SMART in January completed fewer training stages than those who started in April, indicating that having more time to complete the training did not lead to more training being done. From October (the start) to June (final assessment), there was no overall increase in IQ, even when we analysed only data from those who completed 20 or more stages. In some sub-tests, measured IQ appeared to decrease substantially, which is unlikely to be correct. This makes me believe that there must have been a large amount of testing error outside of our control. Behavior analysts generally perform one-to-one or one-to-few assessments. Being able to supervise testing/training closely may reduce testing error, meaning that the results are more accurate. However, it appears to me that this also means that such tests/interventions are not practical to roll out on a larger scale.

Figure 1. A Pareto productivity distribution (labelled for re-use).
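To illustrate what a Price’s-Law check on completion data looks like, here is a short sketch using hypothetical completion counts drawn from a heavy-tailed distribution (the study’s real counts are not reproduced here):

```python
import numpy as np

# Hypothetical stage-completion counts for 150 pupils, purely for illustration.
rng = np.random.default_rng(1)
stages = np.sort((rng.pareto(a=1.2, size=150) * 3).astype(int).clip(0, 55))[::-1]

# Price's Law: roughly sqrt(N) of the participants account for about half the work.
top_k = int(np.sqrt(len(stages)))
share = stages[:top_k].sum() / stages.sum()
print(f"The most active {top_k} of {len(stages)} pupils completed {share:.0%} of all stages")
```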

Our third study was more successful. We recruited a cohort of first-year (age 12-14) secondary school boys and girls (this time, N = 180). We gave half of those students access to SMART, while the rest did computer coding from January until June. In that study, the supervising teacher was very obliging and kept in touch more, and we ended up with more usable data. Far from all of it, though: in many cases, we still had testing error. For example, one person appeared to gain 50 fluid IQ points from 13 stages of training; in his case, it is more plausible that we had testing error at Time 1. We had to omit many people who missed, or didn’t fully complete, either the Time 1 or Time 2 assessment. A surprisingly large number of students entered the wrong name, so we couldn’t match their Time 1 and Time 2 data. In the end, we had 70 participants, unevenly distributed across conditions (43 SMART, 27 coding). Thankfully, though, key traits were matched across conditions, including fluid IQ (our key variable of interest) and personality factors. We measured personality factors because, mechanistic or not, they are broadly the next most powerful predictor of behavior after IQ. Coming from a pragmatist philosophical background (as Skinner did), I consider something “true” if it is fit for purpose; in more behavioral terms, personalities are simply preferred patterns of behavior that differ across individuals. In this study, we found that more agreeable students completed more training (i.e., they did what was asked of them). Again, there was a relatively low amount of training completion, with a mean of 16 stages. However, those in the SMART condition had a mean fluid IQ rise of almost 6 points, while the controls had a non-significant rise of less than 2 points. We also measured processing speed to follow up on our first study, and found that it increased across both conditions. In some sense, this might “debunk” our own earlier finding: with a bigger sample, the processing-speed effect was no longer specific to SMART, arguably reducing our N = 8 study to the status of a “series of anecdotes”. Our last finding was that those who had a lower baseline fluid IQ benefited most from SMART. In a way, this wasn’t surprising, as one might only expect those with lower baseline abilities to benefit from such a modest amount of training. There was no relationship between fluid IQ change and baseline ability for the controls. We think this is perhaps an exciting new paper for our field, but we want to publish it in a mainstream journal. We have to reach those audiences and withstand their scrutiny if we wish to be part of any conversations pertaining to national curricula. We received a relatively positive first review from an esteemed mainstream developmental psychology journal, and are currently addressing the reviewers’ comments.
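For readers who want to picture the kind of between-condition comparison involved, here is a sketch of a gain-score analysis. The group sizes mirror those above, but the numbers are invented; the actual analysis is reported in the paper under review:

```python
import numpy as np
from scipy import stats

# Hypothetical gain scores (Time 2 minus Time 1 fluid IQ), made up for illustration.
rng = np.random.default_rng(2)
smart_gain = rng.normal(loc=6, scale=10, size=43)    # SMART condition, n = 43
control_gain = rng.normal(loc=2, scale=10, size=27)  # coding condition, n = 27

# Welch's t-test (does not assume equal variances across unequal groups).
t, p = stats.ttest_ind(smart_gain, control_gain, equal_var=False)

# A standardised effect size (Cohen's d on the gain scores) is often more
# informative than p alone when samples are modest.
pooled_sd = np.sqrt((smart_gain.var(ddof=1) + control_gain.var(ddof=1)) / 2)
d = (smart_gain.mean() - control_gain.mean()) / pooled_sd
print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```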

Finally, we attempted to extend this work to see whether any improvements in IQ manifested in cross-domain educational results. In that study, we employed a younger sample (age 6-10) of 55 children, with a similar design; in this case, our control condition was online chess. Again, we had a similar mean training completion of 15 stages. At the end, we found that the controls’ IQs were about the same, while those who did SMART gained almost 9 points in fluid IQ. At the end of the year, pupils completed a standardised test of educational achievement in relation to the curriculum they were being taught (i.e., systematically controlling for learning opportunities). Interestingly, in the SMART condition only, Time 2 IQ – the IQ that had been improved – was more strongly related to test performance than baseline IQ was. After verifying this using a series of partial correlations and multiple regression models, we concluded that the IQ gains showed up in the children’s examination results. More specifically, increased fluid IQ (the ability to think in abstract ways and learn new things) appeared to increase performance on reading comprehension (a measure that does not presuppose specific learning opportunities) but not on vocabulary (where an increased ability to learn new things can only pay off once the relevant learning opportunities have subsequently been presented). Mathematics performance did not increase, perhaps because performance on that measure depends more on confidence and on how fully prior learning opportunities had been availed of. The controls, who had been playing chess, improved on one mathematical outcome measure involving spatial rotation. This study is currently under review at another well-regarded mainstream developmental psychology journal.
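As a rough illustration of the partial-correlation logic described above, here is a minimal sketch with invented scores (the published models are more involved): it asks whether Time 2 IQ still relates to achievement once baseline IQ is held constant.

```python
import numpy as np

def partial_corr(x, y, control):
    """Correlation between x and y after removing the linear effect of `control`."""
    def residuals(v):
        slope, intercept = np.polyfit(control, v, 1)
        return v - (slope * control + intercept)
    return np.corrcoef(residuals(x), residuals(y))[0, 1]

# Invented scores for illustration only.
rng = np.random.default_rng(3)
baseline_iq = rng.normal(100, 15, size=30)
time2_iq = baseline_iq + rng.normal(9, 8, size=30)         # post-training IQ
achievement = 0.5 * time2_iq + rng.normal(0, 10, size=30)  # end-of-year test

print(partial_corr(time2_iq, achievement, baseline_iq))
```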

It appears that SMART training has potential for improving general intelligence and performance on real-life measures of interest. However, these studies still fall short of the large-scale, double-blind, randomised, active-controlled trials required before we can make definitive claims that we can indeed raise general intelligence. Furthermore, there are logistical challenges to conducting such studies, as behavioral interventions of this sort are too resource-intensive for large-scale implementation in addition to existing curricula. If we were to provide the required standard of evidence, I believe that those in the mainstream intelligence community would be delighted. So far, the research that was robust enough for them to pay attention to has suggested that this is not possible. However, perhaps this is to conflate “not possible” with “hasn’t been done”. Realistically, there are two main types of training that have indeed been debunked: working memory training (e.g., N-Back) and compensatory education (e.g., the Head Start program in the US, which began in the mid-1960s); and they have been debunked again and again. It would be disingenuous for intelligence researchers to overgeneralise from these few exemplars and say, “cognitive training doesn’t work”, when we haven’t tested many interventions based on coherent models of general cognition. The cognitive training null results to date are supplemented by research from behavioral genetics suggesting that perhaps 50% of the variance in cognitive ability (depending on the study) is heritable, that we know of. However, genetic influence is probabilistic rather than deterministic. Our genetic endowment is the slate upon which the environment writes: the slate is not blank, but nor is it completely filled in. RFT, the theory upon which SMART is based, is a theory of general cognition that merits full-scale testing. SMART trains operant responding: the go-to patterns of behavior through which people adapt to their environment. Behaviorism is all about the natural selection of behavior, and so it is most congruent with evolution science.

From a neuroscience perspective, perhaps the best theory of general intelligence is Jung and Haier’s parieto-frontal integration theory (P-FIT). According to the P-FIT, somewhat counterintuitively, smarter people show less brain activity as the frontal and parietal lobes interact with one another. This means that smart brains are more efficient: they cut through the noise to get to the relevant signal. This is congruent with RFT’s account of language and cognition. If, say, you run a typical match-to-sample behavioural experiment, the participants must recognise physical patterns that occur within particular contexts. For example, in the context of a particular “sample” stimulus, the participant might be presented with a [red] square and then given two response options, [green] and [red]. In the context of the cue stimulus _-_, participants might be rewarded for choosing [green], while in the context of the cue stimulus +=+ they might be rewarded for choosing [red] (see Figure 2). When we vary the colours across trials and consistently reinforce choosing the same colour, +=+ will come to mean “SAME” as participants abstract what is common across trials. Similarly, if we consistently reinforce choosing the other colour, _-_ will come to mean “DIFFERENT”. By responding to multiple exemplars of regularities in the environment in this way, we can learn these symbolic cues and apply them to novel stimuli. These novel stimuli need not even be physically related anymore, since the response is under the contextual control of the discriminative cue stimulus rather than the physical features of the stimuli being related. So, if I say that a WUG is more than a JEP, and you already know that a JEP is more than a GEK, you can derive that a WUG is more than a GEK and that a GEK is less than a WUG. As such, cognition becomes easier as we learn to respond to cues more fluently. This allows us, in turn, to perform complex cognition. For example, an RFT account of analogical responding involves relating stimulus relations (an A-B relation is the same kind of relation as a C-D relation). This allows us to generate new language and previously unthought thoughts. For example, if I know that chalk and cheese are DIFFERENT, and that a particular man is friendly, then being told “he is to his brother as chalk is to cheese” tells me that his brother is DIFFERENT to this friendly man (i.e., we might derive that the brother, whom we have never met, is unfriendly, and that we don’t want to be friends with him). This is an example of complex cognition emerging from behavioral selection by consequences (i.e., operant conditioning).

Figure 2. Example of a non-arbitrary match-to-sample procedure.
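As a toy illustration of the derived-relations idea (not the SMART program itself), the sketch below takes the two trained “more than” relations from the WUG/JEP/GEK example and derives the untrained ones by transitivity and reversal:

```python
# Toy sketch of derived relational responding: given trained "more than"
# relations, derive untrained ones by transitivity (combinatorial entailment)
# and reversal (mutual entailment).
from itertools import product

trained = {("WUG", "JEP"), ("JEP", "GEK")}  # "X is more than Y"

def derive(relations):
    """Closure under transitivity: if A > B and B > C, then A > C."""
    derived = set(relations)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(derived, repeat=2):
            if b == c and (a, d) not in derived:
                derived.add((a, d))
                changed = True
    return derived

more_than = derive(trained)
less_than = {(b, a) for (a, b) in more_than}  # mutual entailment (reversal)

print(("WUG", "GEK") in more_than)  # True: derived, never directly trained
print(("GEK", "WUG") in less_than)  # True: the reversed relation
```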

When I said to my supervisor, Dr Ian Tyndall, that I wanted to present our research at the International Society for Intelligence Research (ISIR) conference in Summer 2018, he remarked that it was brave. However, I don’t quite see it that way. I believe that behavior analysts need to attend non-behavioral conferences to communicate their findings and seek feedback. If I’m wrong, I want to know as soon as possible so I can remain wrong for as short a time as possible. On the other hand, we could present only at behavioral conferences and preach to the proverbial choir. I noticed that there is no division for behavior analysis in the British Psychological Society here in the UK. Behavior-analytic methods are not changing policy (although we have the tools and underlying pragmatic philosophy to do so). Behavior-analytic journals have low or no impact factors (i.e., a metric that roughly indicates a journal’s readership and influence). It seems to me that the university centres for training behavior analysts are becoming fewer and fewer (in the UK at least), while influential academics tell their students… well, you will see for yourself, here and here. Those academics are at Stanford (US) and Cambridge (UK) respectively, so this is no trivial state of affairs.

When I went to ISIR, I got the impression that quite a few attendees were smugly closed to the idea of our positive results. Not all, though. I noticed that, when one young researcher presented a study that seemed to attempt to falsify the P-FIT, Dr Richard Haier publicly commended her for the effort, even though there seemed to be a few flaws in the study. This was heartening, as he was one of the people who came up with that theory and is the Editor of the journal Intelligence. He made time for students, including me. He said that he doubted that we could shift fluid intelligence but would be interested to watch my presentation. He also asked me to send him the video I had of Dr Sapolsky (Stanford) telling students that behaviorism entailed a “black box” approach to psychology, as linked in the previous paragraph, as he himself had held the same views. If you read Dr Haier’s recent book, you will notice that he is careful to flag quotations attributed to behaviorists for which he could not find evidence – he is measured and careful, where other academics might have arrogantly assumed their truth without ever reading the original work. When I emailed that video to him, he responded graciously. With his permission, I have included his response here:

Hi Shane,
I’m glad we had time to talk. I enjoy ISIR because there is time to meet many colleagues and students.
I’ve just finished watching the Skinner/Sapolsky video. It’s fascinating! I have learned something important and corrected my view, which was the same as Sapolsky’s. Thank you for sending it.
Looking forward to seeing more of your work.
Best wishes,
Rich

So, not all people who don’t believe in the tabula rasa are bad guys. In fact, I don’t think it’s fair to construe them as being any less compassionate for paying attention to some of the most robust, highly powered, and well-controlled science that pertains to human behavior. By the same token, it wouldn’t be fair for cognitive psychologists to commit the black box fallacy (the old Chomskyan strawman), or to suggest that behaviorists don’t measure/train/have an account of complex cognition like rules, categorisation, or analogy. Just as cognitive psychologists have the highest-powered studies that are practical to roll out on a large scale, it is the behaviorists who have studies showing how to actually change the behavior of individuals given their learning history (and yes, their biological circumstances), which is how most psychological practice operates. In the interest of progress, perhaps both camps should work together with the common purpose of helping the people we care about. Neither camp needs to have all the answers, nor do they have to misrepresent one another, for that to happen.