What 26,000 books reveal when it comes to learning language

Summary: A machine-learning study finds that a person’s language behavior can be used to estimate the types of material they have read.

Source: University at Buffalo

What can reading 26,000 books tell researchers about how language environment affects language behavior? Brendan T. Johns, an assistant professor of communicative disorders and sciences in the University at Buffalo’s College of Arts and Sciences, has some answers that are helping to inform questions ranging from how we use and process language to better understanding the development of Alzheimer’s disease.

But let’s be clear: Johns didn’t read all of those books. He’s an expert in computational cognitive science who has published a computational modeling study suggesting that our experience and interaction with specific learning environments, like the characteristics of what we read, lead to differences in language behavior that were once attributed to differences in cognition.

“Previously in linguistics it was assumed a lot of our ability to use language was instinctual and that our environmental experience lacked the depth necessary to fully acquire the necessary skills,” says Johns. “The models that we’re developing today have us questioning those earlier conclusions. Environment does appear to be shaping behavior.”

Johns’ findings, with his co-author, Randall K. Jamieson, a professor in the University of Manitoba’s Department of Psychology, appear in the journal Behavior Research Methods.

Advances in natural language processing and computational resources allow researchers like Johns and Jamieson to examine once intractable questions.

The models, called distributional models, serve as analogies to the human language learning process. The 26,000 books underlying the analysis come from 3,000 different authors (about 2,000 from the U.S. and roughly 500 from the U.K.) and total more than 1.3 billion words.

George Bernard Shaw is often credited with saying Britain and America are two countries separated by a common language. The two varieties of English are indeed not identical, and to establish and represent potential cultural differences, the researchers located each of the 26,000 books in both time (when the author was born) and place (where the book was published).

With that information established, the researchers analyzed data from 10 different studies involving more than 1,000 participants, using multiple psycholinguistic tasks.

“The question this paper tries to answer is, ‘If we train a model with similar materials that someone in the U.K. might have read versus what someone in the U.S. might have read, will they become more like these people?'” says Johns. “We found that the environment people are embedded in seems to shape their behavior.”
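
The idea of training the same model on different reading material can be illustrated with a minimal sketch. It is not the authors' model: it assumes a simple co-occurrence count model, and the two tiny "subcorpora" (`US_TOY`, `UK_TOY`) and the function names are invented for illustration. It shows only that the same word can end up with different associations depending on what the model has "read."

```python
from collections import Counter, defaultdict
import math

def train_cooccurrence(sentences, window=2):
    """Count how often each word occurs within `window` tokens of every other word."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity(model, w1, w2):
    """Similarity of two words in a model; a word the model never saw scores 0."""
    return cosine(model.get(w1, Counter()), model.get(w2, Counter()))

# Hypothetical toy subcorpora standing in for place-specific book collections.
US_TOY = ["he ate a biscuit with gravy", "she put gravy on the biscuit"]
UK_TOY = ["he ate a biscuit with tea", "she dipped the biscuit in tea"]

us_model = train_cooccurrence(US_TOY)
uk_model = train_cooccurrence(UK_TOY)

us_biscuit_gravy = similarity(us_model, "biscuit", "gravy")
uk_biscuit_gravy = similarity(uk_model, "biscuit", "gravy")
uk_biscuit_tea = similarity(uk_model, "biscuit", "tea")

print(f"US model, biscuit~gravy: {us_biscuit_gravy:.2f}")
print(f"UK model, biscuit~gravy: {uk_biscuit_gravy:.2f}")
print(f"UK model, biscuit~tea:   {uk_biscuit_tea:.2f}")
```

In this toy setup, the model trained on the U.S. sentences links "biscuit" with "gravy," while the model trained on the U.K. sentences has no such link and associates "biscuit" with "tea" instead. The study compares models trained on real subcorpora against people's behavior on psycholinguistic tasks.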

The culture-specific books in this study explain much of the variance in the data, according to Johns.

“It’s a huge benefit to have a culture-specific corpus, and an even greater benefit to have a time-specific corpus,” says Johns. “The differences we find in language environment and behavior as a function of time and place is what we call the ‘selective reading hypothesis.'”

These machine-learning approaches demonstrate how richly informative these language environments are, and Johns has been working toward building machine-learning frameworks to optimize education. The latest paper shows how a person’s language behavior can be used to estimate the types of materials they’ve read.
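
One simple way to picture going from behavior back to reading material is sketched below. This is an assumption-laden illustration, not the paper's optimization method: it supposes a person's word-familiarity ratings are available (`TOY_RATINGS`, invented here), correlates them with log word frequencies in each candidate subcorpus (`TOY_CORPORA`, also invented), and picks the subcorpus that fits best.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def best_matching_corpus(ratings, corpora):
    """Return the corpus whose log word frequencies best track the ratings."""
    words = sorted(ratings)
    rated = [ratings[w] for w in words]
    scores = {}
    for name, sentences in corpora.items():
        counts = Counter(w for s in sentences for w in s.lower().split())
        log_freqs = [math.log(1 + counts[w]) for w in words]
        scores[name] = pearson(rated, log_freqs)
    return max(scores, key=scores.get), scores

# Hypothetical subcorpora and familiarity ratings (1 = unfamiliar, 7 = very familiar).
TOY_CORPORA = {
    "US": ["the truck stopped for gas", "the truck drove on the sidewalk",
           "gas for the truck"],
    "UK": ["the lorry stopped for petrol", "the lorry drove on the pavement",
           "petrol for the lorry"],
}
TOY_RATINGS = {"lorry": 6.5, "petrol": 6.0, "pavement": 5.5,
               "truck": 3.0, "gas": 2.5, "sidewalk": 2.0}

best, scores = best_matching_corpus(TOY_RATINGS, TOY_CORPORA)
print(best, {name: round(score, 2) for name, score in scores.items()})
```

Here the invented ratings favor British vocabulary, so the U.K. subcorpus is the better fit. The study applies this kind of comparison to real behavioral data from multiple tasks.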

“We want to take someone’s past experience with language and develop a model of what that person knows,” says Johns. “That lets us identify which information can maximize that person’s learning potential.”

But Johns also studies clinical populations, and his work with Alzheimer’s patients has him thinking about how to apply his models to potentially help people at risk of developing the disease.

He says some people show slight memory loss without other indications of cognitive decline. These patients with mild cognitive impairment have a 10-15% chance of being diagnosed with Alzheimer’s in any given year, compared to 2% of the general population over age 65.

“We’re finding that people who go on to develop Alzheimer’s across time are showing specific types of language loss and production where they seem to be losing long-distance semantic associations between words, as well as low-frequency words,” he says. “Can we develop tasks and stimuli that will allow that group to retain their language ability for longer, or develop a more personalized assessment to understand what type of information they’re losing in their cognitive system?

“This research program has the potential to inform these important questions.”

About this neuroscience research article

Source: University at Buffalo
Media Contacts: Bert Gambini – University at Buffalo
Image Source: The image is in the public domain.

Original Research: Closed access
“The influence of place and time on lexical behavior: A distributional analysis”. Brendan T. Johns, Randall K. Jamieson.
Behavior Research Methods doi:10.3758/s13428-019-01289-z.


The influence of place and time on lexical behavior: A distributional analysis

We measured and documented the influence of corpus effects on lexical behavior. Specifically, we used a corpus of over 26,000 fiction books to show that computational models of language trained on samples of language (i.e., subcorpora) representative of the language located in a particular place and time can track differences in people’s experimental language behavior. This conclusion was true across multiple tasks (lexical decision, category production, and word familiarity) and provided insight into the influence that language experience imposes on language processing and organization. We used the assembled corpus and methods to validate a new machine-learning approach for optimizing language models, entitled experiential optimization.
