text: An R-package for Analyzing Human Language

A guest blog from the creators of text: Oscar Kjell, Salvatore Giorgi, and Andrew Schwartz.

In the field of artificial intelligence (AI), transformers have revolutionized language analysis. Never before has a new technology so broadly improved benchmarks across nearly all language processing tasks, e.g., general language understanding, question answering, and Web search. The transformer method itself, which probabilistically models words in their context (i.e., “language modeling”), was introduced in 2017, and the first large-scale, pre-trained, general-purpose transformer, BERT, was released as open source by Google in 2018. Since then, BERT has been followed by a wave of new transformer models, including GPT, RoBERTa, DistilBERT, XLNet, Transformer-XL, CamemBERT, and XLM-RoBERTa. The text package makes these language models, and many more, easily accessible to R users, and it includes functions optimized for human-level analyses tailored to social scientists.

Contextualized word embeddings

One way to view what transformer language models produce is contextualized word embeddings: vectors (i.e., lists of numbers) that represent the meaning of a word given the context of the words around it. Creating these high-quality language models requires a lot of data and computer processing resources; it has been estimated that training the RoBERTa model would cost about $100K using a standard GPU cloud-computing service. Fortunately, you can use the pre-trained language models to map your own text data to numeric representations (i.e., word embeddings), and these contextualized word embeddings can then be used in downstream analysis tasks.

What is perhaps even more fascinating is that these language models produce contextualized word embeddings, meaning that the numeric representation of a word takes into account the context in which the word occurs. For example, the contextualized word embedding for the word happy differs between “I am happy” and “I am not happy”; and the embedding for get differs between “I get the book” and “I get the idea”.

To transform your text data to state-of-the-art word embeddings, provide the textEmbed() function with your text data and the name of the model you want to use.

library(text)
test_text_data <- c("hello, how are you", "I'm fine thanks")
# Transform text to contextualized word embeddings
wordembeddings <- textEmbed(test_text_data,
                            model = 'bert-base-uncased')
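Note that the first time you run textEmbed(), the package needs a Python environment to load the language models. Below is a minimal one-time setup sketch, assuming the installer helpers textrpp_install() and textrpp_initialize() described in the package documentation (check www.r-text.org for version-specific details):

# One-time installation of the Python environment used by text
# (assumption: these helper functions exist in your installed version).
install.packages("text")
library(text)
textrpp_install()
# Initialize the Python environment for the current session.
textrpp_initialize()

Once created, embeddings can also be compared directly. The following is a brief sketch using textSimilarity(), which computes the semantic (cosine) similarity between two sets of embeddings; it assumes the text-level embeddings are returned as a tibble with one row per input text, accessed here via $texts (the exact output structure may differ between package versions):

# Assumption: text-level embeddings come back as a tibble with one
# row per input text; here we compare the two example sentences.
sentence_embeddings <- wordembeddings$texts[[1]]
textSimilarity(sentence_embeddings[1, ], sentence_embeddings[2, ])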

Optimizations for social sciences

The language that individuals use contains a wealth of psychological and social information of interest to researchers. For example, research has shown that analyzing individuals’ social media text can predict depression recorded in medical records, and that asking individuals to describe their well-being with open-ended questions predicts the corresponding self-reported rating scales with a strong correlation (r > .7).

To examine the relationship between text and a numeric (or categorical) variable, provide the textTrain() function with the text’s word embeddings and the variable.

# Using example data from the text package (wordembeddings4 and
# Language_based_assessment_data_8), we can examine the relationship
# between individuals' descriptions of their satisfaction with life
# (satisfactiontexts) and the corresponding self-reported rating scale scores.
model_sat_text_swls <- textTrain(wordembeddings4$satisfactiontexts,
                                 Language_based_assessment_data_8$swlstotal)

# Look at the correlation between predicted and observed
# Satisfaction with Life Scale scores
model_sat_text_swls$results
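The trained object also contains the underlying predictions, which can be inspected or plotted. Here is a brief sketch, assuming the output list includes a $predictions tibble with columns for the predicted and observed values (as in the textTrain() documentation; column names may differ between versions):

# Plot predicted against observed scores (assumption: $predictions
# holds columns named predictions and y; check your version's docs).
library(ggplot2)
ggplot(model_sat_text_swls$predictions, aes(x = predictions, y = y)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Predicted SWLS score", y = "Observed SWLS score")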

The text package also comes with functions to visualize your text data within the word embedding space and along different dimensions, such as rating scale scores. Below is an example plotting the words individuals used to describe their harmony in life according to their scores on the Harmony in Life Scale (x-axis) and the Satisfaction with Life Scale (y-axis). Statistically significant words are plotted in color, and the size of each word reflects its frequency.

# The example data (DP_projections_HILS_SWLS_100) has been
# pre-processed with the textProjection() function.
plot_projection <- textProjectionPlot(
  word_data = DP_projections_HILS_SWLS_100,
  y_axes = TRUE,
  min_freq_words_plot = 2,
  title_top = "",
  x_axes_label = "Low vs. High Harmony rating score",
  y_axes_label = "Low vs. High Satisfaction rating score",
  p_adjust_method = "bonferroni")
plot_projection
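For reference, the pre-processing step that produces such projection data looks roughly like the sketch below. The argument names follow the textProjection() documentation and the example data objects shipped with the package, but treat the exact signature as an assumption and check www.r-text.org for your version:

# Project individual words onto two rating-scale dimensions
# (harmony in life on the x-axis, satisfaction with life on the
# y-axis). Argument names are assumptions based on the
# textProjection() documentation.
projection_data <- textProjection(
  words = Language_based_assessment_data_8$harmonywords,
  wordembeddings = wordembeddings4$harmonywords,
  single_wordembeddings = wordembeddings4$singlewords_we,
  x = Language_based_assessment_data_8$hilstotal,
  y = Language_based_assessment_data_8$swlstotal)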

The text package has several more functions for analyzing and visualizing different aspects of text; for more information, see www.r-text.org.

Summary

The text package has two main objectives. The first is to provide R users with a point solution for transforming text into state-of-the-art contextualized word embeddings that are ready for downstream tasks. The second is to serve as an end-to-end solution that offers state-of-the-art AI techniques tailored to social scientists.

About

Oscar Kjell is a psychology researcher, working as a postdoc funded by the Swedish Research Council and supervised by Andrew Schwartz (Stony Brook University) and Isabelle Augenstein (Copenhagen University). His research focuses on measuring psychological constructs with words and text responses analyzed with natural language processing and machine learning.

Salvatore Giorgi is a Computer Systems Analyst working with Dr. Brenda Curtis at the National Institute on Drug Abuse (NIDA) and a second-year PhD student at the University of Pennsylvania working under H. Andrew Schwartz and Lyle H. Ungar. His research focuses on leveraging large-scale social media data for monitoring public health and well-being, as well as on machine learning applications to substance use and recovery.

H. Andrew Schwartz is the director of the Human Language Analysis Beings (HLAB) lab, housed in the Computer Science Department at Stony Brook University (SUNY), where he is an Assistant Professor. His interdisciplinary research focuses on human-centered natural language processing for the health and social sciences. Previously, Andrew was a Postdoctoral Fellow at the University of Pennsylvania, where he co-founded the World Well-Being Project, a multi-disciplinary consortium focused on developing large-scale language analyses that reveal and predict differences in health, personality, and well-being. Andrew is an active member of the AI/natural language processing, psychology, and health informatics communities. He is also the creator and one of the maintainers of the Python package Differential Language Analysis ToolKit (DLATK). His research has attracted popular interest, with articles featured in, e.g., The New York Times, USA Today, and The Washington Post. He received his PhD in Computer Science from the University of Central Florida in 2011, with research on acquiring common sense knowledge from the Web.
