Researchers develop method to potentially jailbreak any AI model relying on human feedback
Researchers from ETH Zurich have developed a method to potentially jailbreak any AI model that relies on human feedback, including large language models (LLMs), by bypassing guardrails that prevent the models from generating harmful or unwanted outputs. The technique involves poisoning the Reinforcement Learning from Human Feedback (RLHF) dataset with an attack string that forces models to output responses that would otherwise be blocked. The researchers describe the flaw as universal, but difficult to pull off as it requires participation in the human feedback process and the difficulty of the attack increases with model sizes. Further study is necessary to understand how these techniques can be scaled and how developers can protect against them.
Disclaimer: The content of this article solely reflects the author's opinion and does not represent the platform in any capacity. This article is not intended to serve as a reference for making investment decisions.
You may also like
Flash Monday: Buy crypto with a credit/debit card for zero fees
Every Monday, enjoy zero fees when using your local fiat currency with a credit or debit card ( Visa, Mastercard, Google Pay Apple Pay)! Buy Crypto Promotion period: Every Monday 8:00 PM – Tuesday 8:00 PM (UTC+8) Promotion rules Sign up for a Bitget account or log in to your existing account. Navig
SEC to receive record $8.2 billion from enforcement in fiscal 2024, mostly from Terraform Labs
CAT becomes the only BSC chain token in the top 20 Wintermute market-making meme coins