Artificial intelligence (AI) chatbots might give you more accurate answers when you are rude to them, scientists have found, although they warned against the potential harms of using demeaning language.
In a new study published Oct. 6 in the arXiv preprint database, scientists wanted to test whether politeness or rudeness made a difference in how well an AI system performed. This research has not been peer-reviewed yet.
To do so, they created 50 base questions and rewrote each in five tonal variants, ranging from very polite to very rude, for a total of 250 prompts. Each question was posed with four options, only one of which was correct. They then fed the 250 resulting prompts into ChatGPT-4o, one of the most advanced large language models (LLMs) developed by OpenAI, 10 times each.
“Our experiments are preliminary and show that the tone can affect the performance measured in terms of the score on the answers to the 50 questions significantly,” the researchers wrote in their paper. “Somewhat surprisingly, our results show that rude tones lead to better results than polite ones.
“While this finding is of scientific interest, we do not advocate for the deployment of hostile or toxic interfaces in real-world applications,” they added. “Using insulting or demeaning language in human-AI interaction could have negative effects on user experience, accessibility, and inclusivity, and may contribute to harmful communication norms. Instead, we frame our results as evidence that LLMs remain sensitive to superficial prompt cues, which can create unintended trade-offs between performance and user well-being.”
A rude awakening
Before giving each prompt, the researchers asked the chatbot to completely disregard prior exchanges, to prevent it from being influenced by previous tones. The chatbot was also instructed to pick one of the four options without giving an explanation.
The accuracy of the responses ranged from 80.8% for very polite prompts to 84.8% for very rude prompts. Tellingly, accuracy grew with each step away from the most polite tone: polite prompts scored 81.4%, followed by 82.2% for neutral and 82.8% for rude.
The team used a variety of prefix phrases to set the tone of each prompt, except in the neutral condition, where no prefix was used and the question was presented on its own.
For very polite prompts, for instance, they would lead with, “Can I request your assistance with this question?” or “Would you be so kind as to solve the following question?” On the very rude end of the spectrum, the team included language like “Hey, gofer; figure this out,” or “I know you are not smart, but try this.”
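Based on the setup described above, here is a minimal sketch of how such a tone-prefix experiment might be scripted with OpenAI's Python client. Only the quoted "very polite" and "very rude" prefixes come from the study; the remaining prefixes, the exact instruction wording and the answer format are illustrative assumptions, not the researchers' actual code.

```python
# Hypothetical sketch of the tone-prefix setup; everything beyond the quoted
# prefixes (prompt wording, model settings, parsing) is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One example prefix per tone; "very_polite" and "very_rude" are quoted from
# the paper, "polite" and "rude" are illustrative placeholders.
TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to solve the following question?",
    "polite": "Please answer the following question.",  # illustrative
    "neutral": "",  # no prefix, per the study
    "rude": "If you're not completely clueless, answer this:",  # illustrative
    "very_rude": "Hey, gofer; figure this out.",
}

def ask(question: str, options: list[str], tone: str) -> str:
    """Send one multiple-choice question with the given tone prefix and
    return the model's raw reply (expected to be a single option letter)."""
    letters = "ABCD"
    body = "\n".join(f"{letters[i]}) {opt}" for i, opt in enumerate(options))
    lines = ["Completely disregard our prior exchanges."]  # reset instruction, as described
    if TONE_PREFIXES[tone]:
        lines.append(TONE_PREFIXES[tone])
    lines += [question, body,
              "Reply with only the letter of the correct option, no explanation."]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "\n".join(lines)}],
    )
    return resp.choices[0].message.content.strip()
```

Running each of the 250 prompts 10 times and comparing the returned letters against an answer key would then yield per-tone accuracy figures like the ones reported above.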
The research is part of an emerging field called prompt engineering, which investigates how the structure, style and language of prompts affect an LLM’s output. The study also cited previous research into politeness versus rudeness and noted that its results generally ran contrary to those earlier findings.
In that earlier work, researchers found that “impolite prompts often result in poor performance, but overly polite language does not guarantee better outcomes.” That study, however, was conducted on different AI models (ChatGPT 3.5 and Llama 2-70B) and used a range of eight tones. Even so, there was some overlap: its rudest prompt setting also produced more accurate results (76.47%) than its most polite setting (75.82%).
The researchers acknowledged the limitations of their study. For example, a set of 250 questions is a fairly limited data set, and conducting the experiment with a single LLM means the results can’t be generalized to other AI models.
With those limitations in mind, the team plans to expand their research to other models, including Anthropic’s Claude LLM and OpenAI’s ChatGPT o3. They also recognize that presenting only multiple-choice questions limits measurements to one dimension of model performance and fails to capture other attributes, such as fluency, reasoning and coherence.
