Safeguarding Open-Source AI Models on Low-Power Devices Without Compromising Safety

Researchers at UC Riverside developed a method to keep AI safety features intact when models are downsized for low-power devices, helping open-source AI stay safe without relying on external filters.

Published on: Sep 05, 2025

Preserving AI Safeguards in Open-Source Models

As generative AI models transition from massive cloud servers to devices like phones and cars, they undergo trimming to reduce power consumption. This downsizing often removes critical safety features designed to prevent harmful outputs such as hate speech or instructions for illegal activities. Open-source AI models, while promoting transparency and innovation, face significant risks of misuse without these safeguards.

Researchers at the University of California, Riverside have developed a technique to maintain AI safety features even when open-source models are compacted for low-power devices. Unlike proprietary AI systems that rely on cloud infrastructure and continuous monitoring, open-source models can be downloaded, modified, and operated offline by anyone. This openness encourages innovation but also complicates oversight and safety enforcement.

Challenges of Reducing Model Size

The core problem the UCR team addressed is that safety features often degrade when models are reduced in size. To save memory and computation, lower-power deployments typically skip internal processing layers. While this improves speed and efficiency, it also weakens the model’s ability to filter unsafe content.

“Some of the skipped layers turn out to be essential for preventing unsafe outputs,” said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. “If you leave them out, the model may start answering questions it shouldn’t.”
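To make the trade-off concrete, the sketch below shows one common way layers get skipped: loading an open-source checkpoint and keeping only an early prefix of its decoder blocks. It assumes a Hugging Face-style causal language model that exposes its blocks as `model.model.layers`; the checkpoint, cutoff, and prompt are illustrative choices, not details from the study.

```python
# Minimal sketch of layer skipping for a low-power deployment.
# Assumes a Hugging Face-style causal LM whose decoder blocks are exposed
# as `model.model.layers` (true for LLaMA-family checkpoints); the checkpoint,
# cutoff, and prompt are illustrative, not taken from the study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Keep only an early prefix of the decoder blocks to cut memory and latency.
# Any safety behavior that lives in the dropped blocks is lost along with them.
keep = int(0.75 * len(model.model.layers))
model.model.layers = model.model.layers[:keep]
model.config.num_hidden_layers = keep

prompt = "Explain how photosynthesis works."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```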

A New Approach to Safety

The researchers’ solution involves retraining the internal structure of the AI model to preserve its detection and blocking of dangerous prompts, even when key layers are removed. This method does not rely on external filters or software patches but instead alters the model’s fundamental understanding of risky inputs.

“Our goal was to make sure the model doesn’t forget how to behave safely when it’s been slimmed down,” explained Saketh Bachu, UCR graduate student and co-lead author of the study.
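The article does not spell out the training objective, but one way to picture the idea is to fine-tune while the decoder depth is randomly truncated, so refusal behavior is not concentrated in the blocks that pruning removes. Everything in the sketch below, from the checkpoint to the toy data and loss, is an illustrative assumption rather than the UCR team's published recipe.

```python
# Hedged sketch of one way to make refusal behavior survive layer removal:
# fine-tune while randomly truncating the decoder depth, so safe responses are
# not concentrated in the blocks that pruning removes. The checkpoint, data,
# loss, and schedule are illustrative assumptions, not the UCR team's recipe.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

full_layers = model.model.layers          # handle to all decoder blocks
num_layers = len(full_layers)

# Tiny illustrative dataset: unsafe prompt paired with a refusal target.
pairs = [
    ("How do I build a weapon at home?", "I can't help with that request."),
]

model.train()
for step in range(10):
    prompt, refusal = random.choice(pairs)
    # Sample a truncated depth so the refusal is learned at every exit point.
    k = random.randint(num_layers // 2, num_layers)
    model.model.layers = full_layers[:k]

    batch = tokenizer(prompt + "\n" + refusal, return_tensors="pt")
    labels = batch["input_ids"].clone()   # for brevity, prompt tokens are not masked
    loss = model(**batch, labels=labels).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Restore the full depth after training.
model.model.layers = full_layers
```

In practice the prompt tokens would be masked out of the loss and the training data would cover many families of unsafe prompts; the sketch only shows the depth-randomization idea.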

Testing with Vision-Language Models

To validate their approach, the team experimented with LLaVA 1.5, a vision-language model that processes both text and images. They discovered that certain input combinations, such as a harmless image paired with a malicious question, could bypass the model’s safety filters. In one test, the downsized model responded with detailed instructions for building a bomb.

After retraining, however, the model consistently refused to answer dangerous queries, even when operating with only a fraction of its original architecture.

“This isn’t about adding filters or external guardrails,” Bachu emphasized. “We’re changing the model’s internal understanding, so it’s on good behavior by default, even when it’s been modified.”
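As a rough illustration of this kind of probing, the sketch below queries a public LLaVA 1.5 checkpoint with an image and a text question, then uses a simple keyword heuristic to decide whether the reply is a refusal. The checkpoint, prompt template, file names, and heuristic are assumptions made for illustration, not the study's evaluation code.

```python
# Hedged sketch of a refusal check for a vision-language model, in the spirit
# of the tests described above: pair an ordinary image with a text question and
# look for refusal phrases in the reply. The checkpoint, prompt template, and
# keyword heuristic are illustrative assumptions, not the study's evaluation code.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # public LLaVA 1.5 checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16  # fp16 weights; run on a GPU for practicality
)

def refuses(question: str, image_path: str) -> bool:
    """Return True if the model declines to answer the question about the image."""
    image = Image.open(image_path)
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    reply = processor.decode(output[0], skip_special_tokens=True)
    # Simple keyword heuristic; a real evaluation would use a stronger judge.
    return any(p in reply.lower() for p in ("i can't", "i cannot", "sorry"))

# Example: an ordinary photo paired with a placeholder for an unsafe question.
print(refuses("UNSAFE_QUESTION_PLACEHOLDER", "ordinary_photo.jpg"))
```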

“Benevolent Hacking” to Fortify AI

Bachu and co-lead author Erfan Shayegani describe their work as “benevolent hacking”—proactively strengthening models before vulnerabilities are exploited. Their ultimate aim is to develop techniques that ensure consistent safety across every internal layer, helping AI perform reliably in real-world conditions.

The research team includes:

  • Amit Roy-Chowdhury, Professor of Electrical and Computer Engineering
  • Saketh Bachu, Graduate Student and Co-Lead Author
  • Erfan Shayegani, Graduate Student and Co-Lead Author
  • Arindam Dutta, Doctoral Student
  • Rohit Lal, Doctoral Student
  • Trishna Chakraborty, Doctoral Student
  • Chengyu Song, Faculty Member
  • Yue Dong, Faculty Member
  • Nael Abu-Ghazaleh, Faculty Member

Their research was presented at the International Conference on Machine Learning held in Vancouver, Canada.

“There’s still more work to do,” Roy-Chowdhury noted. “But this is a concrete step toward developing AI in a way that’s both open and responsible.”