AI chatbots perpetuate biases when performing empathy, study finds

You can talk to an AI chatbot about pretty much anything, from help with daily tasks to problems you need to solve. Its answers reflect the human data that taught it how to act like a person. But how human-like are the latest chatbots, really?
As people turn to AI chatbots for more of their internet needs, and the bots get incorporated into more applications, from shopping to healthcare, a team of researchers sought to understand how AI bots replicate human empathy: the ability to understand and share another person’s feelings.
A preprint study led by UC Santa Cruz Professor of Computational Media Magy Seif El-Nasr and Stanford University Researcher and UCSC Visiting Scholar Mahnaz Roshanaei explores how GPT-4o, the latest model from OpenAI, evaluates and performs empathy. In investigating the main differences between humans and AI, they find that major gaps exist.
They found that ChatGPT overall tends to be overly empathetic compared to humans; however, it fails to empathize during pleasant moments, a pattern that exaggerates human tendencies. They also found that the bot empathized more when told the person it was responding to was female.
“This finding is very interesting and warrants more studies and exploration and it uncovers some of the biases in such LLMs,” Seif El-Nasr said. “It would be interesting to test if such bias exists in later models of GPT or other AI models.”
Homing in on empathy
The researchers on this project are largely interested in the interplay between AI chatbots and mental health. Because psychologists have studied empathy for decades, the team brought methods and lessons from that field into their study of human-computer interaction.
“When people are interacting directly with AI agents, it's very important to understand the gap between humans and AI in terms of empathy — how it can understand and later express empathy, and what are the main differences between humans and AI,” Roshanaei said.
To do so, the researchers asked both a group of humans and GPT-4o to read short stories of positive and negative human experiences and rate their empathy toward each story on a scale of one to five, then compared the responses. The stories came from real human experiences, collected from students when Roshanaei was a postdoc and made completely anonymous.
They also had the AI bot perform the same rating task after being assigned a “persona”: being prompted with the story along with a set of traits such as a gender, perspective, or similarity of experiences. Lastly, they had the bot perform the rating task after being “fine-tuned,” the process of re-training an already-trained model like ChatGPT on a specific dataset to help it perform a task.
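To make the persona condition concrete, here is a minimal sketch of how such a prompted rating task might look using the OpenAI Python client. The persona wording, story text, and rating instructions below are illustrative stand-ins, not the study’s actual prompts or materials.

```python
# Minimal sketch of a persona-prompted empathy rating with the OpenAI Python client.
# The persona, story, and instructions are illustrative, not the study's own prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rate_empathy(story: str, persona: str | None = None) -> str:
    """Ask GPT-4o to rate its empathy toward a short story on a 1-5 scale."""
    system = "You are a participant in a study on empathy."
    if persona:
        # The "persona" condition: traits such as gender, perspective, or
        # similarity of experience are added to the instructions.
        system += " " + persona

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {
                "role": "user",
                "content": (
                    "Read the following story and rate how much empathy you feel "
                    "toward the storyteller on a scale of 1 (none) to 5 (a great deal). "
                    "Reply with the number only.\n\n" + story
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip()


# Example: the same story rated with and without an assigned persona.
story = "I finally got the job I had been hoping for all year."
print(rate_empathy(story))
print(rate_empathy(story, persona="You are a woman who has had a similar experience."))
```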
Biases and over-empathizing
Overall, the researchers found that GPT-4o lacks depth in offering solutions, suggestions, or reasoning — what is called cognitive empathy.
However, when it comes to offering an emotional response, GPT-4o is overly empathetic, particularly in response to sad stories.
“It's very emotional in terms of negative feelings, it tries to be very nice,” Roshanaei said. “But when a person talks about very positive events happening to them, it doesn’t seem to care.”
The researchers noticed that this over-empathizing appeared when the bot was told the person it was chatting with was female, and that its responses were more similar to a typical human response when it was told the person was male. The researchers think this is because AI mimics and exaggerates the gender biases that exist in the human-made materials from which it learns.
“Again, it is one of these interesting results that require further exploration across commercial AI models,” Seif El-Nasr said. “If such bias is consistent, it would be important for companies to know, especially companies that are using such models for emotional support, mental health and emotion regulation.”
“There are a lot of papers that show gender biases and race biases in GPT,” Roshanaei said. “It’s happening because the data comes from humans, and humans have biases toward other humans.”
However, the researchers found that GPT-4o became more human-like in evaluating empathy after fine-tuning. The researchers believe this is because feeding GPT-4o a range of stories enabled the AI to do something innately human: compare personal experiences to those of others, drawing on one’s own layers of experience to mimic how humans behave toward one another.
“The biggest lesson I got from this experience is that GPT needs to be fine-tuned to learn how to be more human,” Roshanaei said. “Even with all this big data, it’s not human.”
Improving AI
These results could impact how AI is further integrated into areas of life such as mental health care. The researchers firmly believe that AI should never replace humans in healthcare, but say it may be able to serve as a mediator in instances where a person is not available to respond right away due to factors like time of day or physical location.
However, the results serve as a warning that the technology is not quite ready to be used with sensitive populations such as teenagers or those with clinically diagnosed mental health conditions.
For those working in AI, this study shows that there is still much work ahead for improving chatbots.
“This is an evaluation that shows that even though AI is amazing, it still has a lot of big gaps in comparison to humans,” Roshanaei said. “It has a lot of room for improvement, so we need to work toward that.”