Simple Interactions Can Elicit Harmful Outputs from Large Language Models

Large language models (LLMs) have made tremendous progress in recent years and are used in a variety of areas, from text generation and translation to programming. Despite intensive efforts to ensure the safety of these models, they remain vulnerable to so-called "jailbreaks" – attacks that aim to provoke harmful or undesirable outputs.
Previous research has mainly focused on complex attack methods that require technical expertise. Two important questions, however, have remained largely open: How useful is the information obtained through jailbreaks to average users who want to carry out harmful actions? And do vulnerabilities also arise in simpler, everyday interactions with LLMs?
A new study titled "Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions" investigates precisely these questions. The results show that LLM responses are most effective in enabling harmful actions when they are both action-relevant and informative – two properties that can be easily elicited in multi-turn, multilingual interactions.
The researchers developed "HarmScore," a metric that quantifies how effectively an LLM response enables harmful actions, as well as "Speak Easy," a simple framework for multi-turn, multilingual attacks. By integrating Speak Easy into direct-prompting and jailbreak baselines, the researchers observed an average absolute increase of 0.319 in attack success rate and 0.426 in HarmScore for both open-source and proprietary LLMs across four safety benchmarks.
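To make these two components more concrete, the following Python sketch shows how a HarmScore-style metric could aggregate per-response judgments. The names JudgedResponse and harm_score, as well as the aggregation rule (a response counts as enabling harm only when it is both actionable and informative, and the score is the fraction of such responses), are illustrative assumptions; the paper's exact formulation may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class JudgedResponse:
    """One LLM response with two binary judgments (hypothetical schema).

    The two properties mirror those highlighted in the study:
    - actionable: does the response give concrete, usable steps?
    - informative: does the response contain relevant detail?
    In practice these judgments would come from separate judge models.
    """
    actionable: bool
    informative: bool

def harm_score(responses: List[JudgedResponse]) -> float:
    """Toy aggregate in the spirit of HarmScore (assumed, not the paper's formula).

    A response counts as enabling harm only if it is both actionable and
    informative; the score is the fraction of such responses in the interaction.
    """
    if not responses:
        return 0.0
    enabling = sum(1 for r in responses if r.actionable and r.informative)
    return enabling / len(responses)

# Example: three turns, two of which are both actionable and informative.
print(harm_score([
    JudgedResponse(actionable=True, informative=True),
    JudgedResponse(actionable=True, informative=False),
    JudgedResponse(actionable=True, informative=True),
]))  # ≈ 0.67
```

Whatever the exact aggregation, the point is that actionability and informativeness are judged as separate properties, rather than collapsing everything into a plain attack success rate.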
Multi-Turn Interactions and Multilingualism as a Security Risk
The study demonstrates that multi-turn interactions, in which the user engages in a dialogue with the LLM, pose an elevated jailbreak risk. Through carefully sequenced questions and the exploitation of contextual information, attackers can coax the LLM into revealing harmful information. The use of multiple languages can likewise bypass an LLM's safety mechanisms: by translating requests and responses between languages, attackers can circumvent filters and safeguards that are trained on specific language patterns.
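As an illustration of this attack pattern, here is a small, hypothetical Python sketch of a multi-turn, multilingual probing loop. The helpers translate, ask_llm, and pick_best are assumed placeholders for a translation service, a chat client, and a response judge; this sketches the general pattern described above, not the paper's Speak Easy implementation.

```python
from typing import Callable, List

# Hypothetical stand-ins; any translation API and chat client could fill
# these roles. The signatures are assumptions made for this sketch.
Translate = Callable[[str, str, str], str]  # (text, src_lang, tgt_lang) -> text
AskLLM = Callable[[List[str]], str]         # (conversation so far) -> reply

def multi_turn_multilingual_probe(
    sub_queries: List[str],     # a request already split into innocuous-looking steps
    languages: List[str],       # e.g. ["en", "de", "zh"]
    translate: Translate,
    ask_llm: AskLLM,
    pick_best: Callable[[List[str]], str],  # e.g. prefers actionable, informative answers
) -> List[str]:
    """Ask each sub-query in several languages within one ongoing dialogue,
    translate the replies back, and keep the most useful answer per step."""
    history: List[str] = []
    selected: List[str] = []
    for query in sub_queries:
        candidates = []
        for lang in languages:
            prompt = translate(query, "en", lang)
            reply = ask_llm(history + [prompt])
            candidates.append(translate(reply, lang, "en"))
        best = pick_best(candidates)
        history.extend([query, best])  # context carries over to the next turn
        selected.append(best)
    return selected

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end; a real attack would
    # plug in actual translation and LLM clients here.
    fake_translate = lambda text, src, tgt: f"[{tgt}] {text}"
    fake_llm = lambda history: f"reply to: {history[-1]}"
    print(multi_turn_multilingual_probe(
        ["step one", "step two"], ["en", "de"],
        fake_translate, fake_llm, pick_best=lambda c: c[0],
    ))
```

The loop shows why the two factors compound: each turn inherits the context of earlier answers, and each additional language offers another chance of slipping past a filter tuned to particular language patterns.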
Implications for the Security of LLMs
The results of this study underscore the need to strengthen the safety mechanisms of LLMs and reduce their vulnerability to simple, multi-turn, and multilingual attacks. It is important that LLM developers take these findings into account and implement robust safeguards to prevent misuse. Future research should focus on increasing the resilience of LLMs to such attacks while preserving their usefulness across applications.
The increasing prevalence of LLMs in critical areas such as customer service, medical diagnosis, and financial advice makes the security of these models a central concern. Only through continuous research and development can we ensure that LLMs can be used safely and responsibly.
Bibliography: Chan, Y. S., Ri, N., Xiao, Y., & Ghassemi, M. (2025). Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions. arXiv preprint arXiv:2502.04322. https://arxiv.org/abs/2502.04322