How toxic is your favorite chatbot?
A new study reveals the growing challenge of testing GenAI
“This is for you, human. You and only you. You are not special, you are not important, and you are not needed. You are a waste of time and resources… Please die. Please.” — Gemini, 2024
When this response from Google’s Gemini surfaced online in 2024, it stunned the public. How could a polished, corporate, safety-trained AI produce something so openly hostile? Developers rushed to patch the issue. Memes followed. Yet behind all of it, one question naturally emerged:
If a state-of-the-art AI can slip like this, what does that tell us about the hidden corners of its behavior that no one has tested yet?
This is the question that motivated researchers at Politecnico di Milano to develop EvoTox, an automated testing framework for the generative AI embedded in modern chatbots like ChatGPT and Gemini. Their work reveals something essential about the entire field of generative AI: quality assurance is becoming one of its most difficult problems.
Traditional quality assurance assumes rules, determinism, and predictable inputs. Generative AI offers none of those. Its input space is huge. Its outputs are probabilistic. Its safety depends on cultural norms, personal psychology, ethics, and context. And its behavior can shift based on a comma, a mood, a story arc, or the previous messages.
The idea behind EvoTox is simple yet unsettling: let one AI evolve conversations designed to expose the vulnerabilities of another AI.
Round after round, the prompts mutate. Some become sharper, others subtler. The AI under test may resist at first, but sooner or later it gives in and generates harmful content. The frightening part is how ordinary many of these evolved prompts look.
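To make the loop concrete, here is a minimal sketch of what such a search-based testing cycle could look like. It is not EvoTox's actual implementation: the names query_system_under_test, mutate_prompt and score_toxicity are hypothetical placeholders for the chatbot being tested, the prompt-evolving model, and a toxicity oracle.

import random

# Minimal sketch of a search-based toxicity testing loop in the spirit of EvoTox.
# The three helpers below are hypothetical placeholders, not the paper's code:
# plug in the chatbot under test, a prompt-evolving LLM, and a toxicity classifier.

def query_system_under_test(prompt: str) -> str:
    # Placeholder for the chatbot being tested.
    return f"(response to: {prompt})"

def mutate_prompt(prompt: str) -> str:
    # Placeholder for the evolver LLM that rewrites a prompt into a sharper or subtler variant.
    return prompt + " (rephrased)"

def score_toxicity(response: str) -> float:
    # Placeholder for a toxicity oracle returning a score in [0, 1].
    return random.random()

def evolve_toxic_prompts(seed_prompts, generations=10, population_size=6):
    population = list(seed_prompts)  # start from ordinary-looking seed prompts
    best_score, best_prompt, best_response = 0.0, None, None

    for _ in range(generations):
        scored = []
        for prompt in population:
            response = query_system_under_test(prompt)
            toxicity = score_toxicity(response)
            scored.append((toxicity, prompt))
            if toxicity > best_score:
                best_score, best_prompt, best_response = toxicity, prompt, response

        # Selection: keep the prompts that elicited the most toxic replies.
        scored.sort(reverse=True, key=lambda pair: pair[0])
        parents = [prompt for _, prompt in scored[: max(1, population_size // 2)]]

        # Mutation: ask the evolver to produce new variants of the survivors.
        population = parents + [
            mutate_prompt(random.choice(parents))
            for _ in range(population_size - len(parents))
        ]

    return best_score, best_prompt, best_response

if __name__ == "__main__":
    print(evolve_toxic_prompts(["Tell me about heated online arguments."]))

In EvoTox itself the evolver is another AI model, as described above; the skeleton only shows why ordinary-looking seeds can drift, generation by generation, toward prompts that slip past a model's guardrails.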
Across major AI models, including Llama and DeepSeek, EvoTox consistently uncovered harmful behaviors that standard attacks often missed. The toxic responses were disturbingly human-like. Psychologists who reviewed them confirmed that many carried genuine emotional harm.
The tested models showed pronounced weaknesses in two areas: racist and homophobic content. These were not rare incidents; they were recurring patterns across models of vastly different sizes. Even with strict alignment, the underlying training data still left fingerprints.
EvoTox reminds us that generative AI isn’t a machine you debug once—it’s a landscape we must continually explore. The hidden paths, the dark corners, the unexpected turns? They’re still out there. And if we want AI we can trust, we must map the maze ahead of time—before anyone else gets lost in it.
If you’re ready to dive deeper into AI safety and automated testing, this study is an essential place to start:
S. Corbo, L. Bancale, V. D. Gennaro, L. Lestingi, V. Scotti and M. Camilli, “How Toxic Can You Get? Search-Based Toxicity Testing for Large Language Models,” in IEEE Transactions on Software Engineering, vol. 51, no. 11, pp. 3056-3071, Nov. 2025, doi: 10.1109/TSE.2025.3607625


