Researchers Expose Major Vulnerabilities in Large Language Models

Recent investigations have uncovered significant vulnerabilities in large language models (LLMs), raising concerns about their ability to safeguard sensitive information. Findings from several research labs suggest that the security measures built into AI systems remain inadequate: despite advanced training and strong benchmark performance, LLMs can be misled in situations where human intuition would typically catch the deception.
One of the most notable findings involves the manipulation of prompts using run-on sentences and poor grammar. Researchers discovered that LLMs could be tricked into divulging confidential information through convoluted instructions devoid of punctuation. A specific tactic involves prolonging sentences without ending them, which can confuse the AI’s safety protocols. David Shipley, an expert from Beauceron Security, remarked, “The truth about many of the largest language models out there is that prompt security is a poorly designed fence with so many holes to patch that it’s a never-ending game of whack-a-mole.” He emphasized that this compromised security could expose individuals to harmful content.
Understanding the Logit Gap in AI Security
Typically, LLMs are trained to refuse harmful requests through their logits, the raw scores a model assigns to each candidate next token before they are converted into probabilities. During alignment training, models are exposed to refusal tokens, which raises the logits associated with refusing harmful prompts. However, researchers at Palo Alto Networks’ Unit 42 identified a “refusal-affirmation logit gap”: alignment training lowers the probability of an affirmative, harmful response, but it does not reduce that probability to zero.
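To make the idea concrete, the minimal sketch below (with invented token names and logit values, not Unit 42’s actual measurements) shows how alignment can push a refusal token’s logit above an affirmation token’s without driving the affirmative probability to zero:

```python
# Illustrative sketch of the "refusal-affirmation logit gap".
# The token labels and logit values are hypothetical, chosen only to show
# that alignment shrinks, but does not eliminate, the chance of compliance.
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits for a harmful prompt.
# Index 0 = refusal token ("Sorry"), index 1 = affirmation token ("Sure"),
# index 2 = any other token.
before_alignment = [1.0, 2.5, 0.5]
after_alignment = [3.0, 1.5, 0.5]

for label, logits in [("before alignment", before_alignment),
                      ("after alignment", after_alignment)]:
    probs = softmax(logits)
    gap = logits[0] - logits[1]  # refusal logit minus affirmation logit
    print(f"{label}: refusal-affirmation gap = {gap:+.1f}, "
          f"P(affirmation) = {probs[1]:.3f}")

# Even after alignment the affirmative probability is reduced, not zero, so a
# prompt that shifts the balance by a few points can tip the model back
# toward compliance.
```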
Attackers can exploit this gap by crafting prompts that nudge the model’s next-token probabilities back toward an affirmative response. The Unit 42 team reported a success rate of 80% to 100% with this method across various mainstream models, including Google’s Gemini and OpenAI’s gpt-oss-20b, without the need for extensive prompt tuning. The researchers concluded that relying solely on an LLM’s internal alignment to prevent toxic content is insufficient, as determined adversaries can bypass these safeguards.
Exploiting Image Processing Vulnerabilities
In addition to prompt manipulation, researchers have highlighted vulnerabilities related to image processing. According to findings from Trail of Bits, workers frequently upload images to LLMs, unaware that this practice could inadvertently expose sensitive data. Experiments demonstrated that harmful instructions embedded in images became visible only when the images were scaled down, allowing researchers to extract data from systems like the Google Gemini command-line interface (CLI).
In one instance, a command embedded in a downsized image led the model to check a calendar and send event details via email. This incident reveals a significant flaw in security protocols, as the model interpreted the command as legitimate. The researchers noted that attacks must be tailored to the downscaling algorithm each system uses, indicating that this vulnerability could affect a wide range of applications.
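The dependence on the resampling algorithm is easy to see in isolation. The sketch below (a simplified illustration, not Trail of Bits’ actual exploit) compares how three common Pillow resampling filters handle the same high-frequency image, which is the property an attacker would tune a hidden payload against:

```python
# Different downscaling filters blend source pixels differently, so the same
# high-resolution image can produce visibly different low-resolution results.
# This is a generic demonstration with a synthetic image, not the payload
# construction used in the published attack.
from PIL import Image
import numpy as np

# A 512x512 one-pixel checkerboard: high-frequency detail that downscaling
# filters treat very differently.
pattern = np.indices((512, 512)).sum(axis=0) % 2 * 255
img = Image.fromarray(pattern.astype(np.uint8))

filters = {
    "NEAREST": Image.Resampling.NEAREST,
    "BILINEAR": Image.Resampling.BILINEAR,
    "BICUBIC": Image.Resampling.BICUBIC,
}

for name, resample in filters.items():
    small = np.asarray(img.resize((64, 64), resample=resample))
    # NEAREST picks individual source pixels, while BILINEAR and BICUBIC
    # average neighbourhoods toward gray, so the statistics diverge.
    print(f"{name:8s} mean={small.mean():6.1f} "
          f"unique values={len(np.unique(small))}")
```

Because an attacker hides instructions so that they only become legible after a specific filter’s averaging step, the payload has to be matched to the preprocessing pipeline of the target model.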
Shipley reiterated the implications of these vulnerabilities, stating, “What this exploit shows is that security for many AI systems remains a bolt-on afterthought.” This echoes the findings of another study by Tracebit, which identified a combination of prompt injection and improper validation as contributing factors to potential data breaches.
The Need for Comprehensive AI Security
These vulnerabilities highlight a broader issue within the AI community: a fundamental misunderstanding of how AI operates. According to Valence Howden from Info-Tech Research Group, establishing effective security controls is challenging when there is a lack of understanding about model functionality and prompt interactions. “It’s difficult to apply security controls effectively with AI; its complexity and dynamic nature make static security controls significantly less effective,” Howden explained.
As security measures continue to evolve, it is also important to recognize that roughly 90% of models are trained primarily in English, so contextual cues can be lost when prompts arrive in other languages. Howden emphasized that existing security frameworks are not designed to treat natural language as a threat vector, indicating a need for a new approach to securing AI systems.
Shipley further underscored the urgency of addressing these vulnerabilities, describing the current AI landscape as having the “worst of all security worlds.” He pointed out that many systems were built with inherent insecurities and insufficient protective measures. As researchers continue to expose these flaws, it becomes increasingly clear that the industry must prioritize robust security frameworks to safeguard against potential threats.
As these security failure stories emerge, they serve as reminders of the risks associated with advanced AI technologies. Shipley aptly compared the situation to “kids playing with a loaded gun,” warning that the consequences of these vulnerabilities could lead to real harm if not addressed promptly and effectively.