UMass Amherst Researcher Shows ‘Extreme Boosting’ AI Model Cuts Through Social Media ‘Noise’

Social media offers a treasure trove of data for researchers to understand how organizations and individuals use the technology to communicate with and grow their base of followers. However, manually analyzing the content can be time-consuming or, in some cases, simply impossible due to the volume of data. While machine-learning models can help, they present their own set of challenges.

Viviana Chiu Sik Wu, assistant professor of public policy at the University of Massachusetts Amherst, conducted a systematic review of 43 studies that analyzed social media data from philanthropic and nonprofit organizations. She then devised and tested a model that pairs machine learning with human oversight to analyze content more effectively.

Wu found that most of the studies relied heavily on manual coding to analyze relatively small datasets, missing out on the benefits of automation and scalability offered by artificial intelligence. In cases where AI was used, it was often stymied by language nuances and other variables that arise during the training process for large language models, she says.

“We have been seeing a lot of research using topic modeling, but without properly training the data, those unsupervised models can introduce biases and noise into the results,” Wu explains.

In addition, she notes many studies omitted entire categories of data, which can be organized into three groups: text (message content), engagement (likes, comments, retweets, etc.) and network data (how followers, friends, etc. are interconnected).
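As a rough illustration of those three groups, the sketch below stores each one separately and reports which groups a dataset leaves empty. The class names, field names and sample post are hypothetical, chosen only for this example; they do not come from the study.

```python
from dataclasses import dataclass, field


@dataclass
class Engagement:
    """Reaction counts for one post."""
    likes: int = 0
    comments: int = 0
    retweets: int = 0


@dataclass
class SocialDataset:
    """Hypothetical container for the three groups of social media data."""
    text: dict = field(default_factory=dict)        # post id -> message content
    engagement: dict = field(default_factory=dict)  # post id -> Engagement
    network: set = field(default_factory=set)       # (follower, followed) pairs

    def missing_groups(self):
        """Return the names of the groups this dataset leaves empty."""
        groups = {
            "text": self.text,
            "engagement": self.engagement,
            "network": self.network,
        }
        return [name for name, data in groups.items() if not data]


# Invented example: a dataset with text and engagement but no network data.
dataset = SocialDataset()
dataset.text["post-1"] = "Join our community town hall tonight"
dataset.engagement["post-1"] = Engagement(likes=12, comments=3, retweets=5)
print(dataset.missing_groups())
```

A check like this makes an omitted category visible before any analysis begins.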

Wu used a coded sample to develop what she calls an “extreme boosting” model, which couples computational power with human judgment to classify messages into predefined categories, an approach known as supervised machine learning.

While unsupervised machine learning can identify hidden patterns and relationships, for content analysis “it can be highly unreliable without a substantial set of training examples to begin with,” the study warns.

To test her model, Wu collected 66,749 tweets posted in 2017-18 by the Twitter/X accounts of 192 community foundations in the U.S. She manually coded about 15% of the messages, roughly 10,000 tweets, and used them to train and test various algorithms, identifying the best predictive model to automatically analyze the remaining 56,718 tweets.
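The workflow of hand-coding a sample, training on it and then labeling the rest can be sketched in a few lines. This is a minimal illustration, not Wu's model: it assumes “extreme boosting” belongs to the gradient-boosting family, uses a deliberately simple pure-Python booster built from one-word decision stumps, and runs on six invented tweets. All function names and data are hypothetical.

```python
import math


def tokenize(text):
    """Lowercase a message and return its set of distinct words."""
    return set(text.lower().split())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_boosted_stumps(texts, labels, rounds=20, learning_rate=1.0):
    """Gradient boosting on log loss with one-word decision stumps."""
    docs = [tokenize(t) for t in texts]
    vocab = sorted(set().union(*docs))
    scores = [0.0] * len(docs)  # running log-odds score per message
    stumps = []                 # (word, step if present, step if absent)
    for _ in range(rounds):
        # Negative gradient of log loss: label minus predicted probability.
        residuals = [y - sigmoid(s) for y, s in zip(labels, scores)]
        best = None
        for word in vocab:
            present = [r for r, d in zip(residuals, docs) if word in d]
            absent = [r for r, d in zip(residuals, docs) if word not in d]
            if not present or not absent:
                continue  # this word cannot split the sample
            mean_p = sum(present) / len(present)
            mean_a = sum(absent) / len(absent)
            # Squared error left over after fitting this stump to the residuals.
            error = (sum((r - mean_p) ** 2 for r in present)
                     + sum((r - mean_a) ** 2 for r in absent))
            if best is None or error < best[0]:
                best = (error, word, mean_p, mean_a)
        if best is None:
            break
        _, word, mean_p, mean_a = best
        stump = (word, learning_rate * mean_p, learning_rate * mean_a)
        stumps.append(stump)
        scores = [s + (stump[1] if word in d else stump[2])
                  for s, d in zip(scores, docs)]
    return stumps


def predict(stumps, text):
    """Return 1 (public engagement) or 0 (other) for one message."""
    words = tokenize(text)
    score = sum(up if word in words else down for word, up, down in stumps)
    return 1 if score > 0 else 0


# Invented, hand-coded sample: 1 = public engagement, 0 = other.
coded_sample = [
    ("join our community town hall tonight", 1),
    ("community forum on housing this week", 1),
    ("tell us what your community needs", 1),
    ("donate now to support our annual fund", 0),
    ("grant applications are due friday", 0),
    ("scholarship deadline is next week", 0),
]
model = fit_boosted_stumps([t for t, _ in coded_sample],
                           [y for _, y in coded_sample])

# Messages nobody has coded yet, labeled automatically by the trained model.
uncoded = ["community picnic this saturday", "apply for a grant today"]
predicted = [predict(model, t) for t in uncoded]
print(predicted)
```

Real tweets are far messier than this toy sample, which is why the study holds back part of the coded data to test each candidate algorithm before trusting its predictions.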

The model was tasked with identifying posts related to public engagement, which are particularly hard to distinguish from messages about fundraising, grants and similar topics because their content often overlaps.

The model identified 6,331 public engagement tweets, which were then verified. Though the “extreme boosting” model shows promise, Wu cautions that it requires further refinement to achieve the highest accuracy.

What is clear, she says, is that combining manual content analysis with automated machine learning can be a powerful tool to analyze social media datasets that are simply too large to be processed manually.

“The findings can be extended to situations in other fields well beyond nonprofits to analyze massive observational datasets on social media,” Wu says.

However, she points out that accessing this data has become more challenging for researchers in recent years as some platforms, including Twitter/X and Facebook, have placed additional limits on the data they make available to researchers and the public.

The changes have scholars looking at other platforms, such as Reddit and TikTok.

“We need to be more creative and innovative at getting the data,” she says.