Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading technology groups including Microsoft and Meta race to find ways to protect against the dangers posed by the cutting-edge technology.
In a paper released on Monday, the San Francisco-based start-up outlined a new system called “Constitutional Classifiers”. It is a model that acts as a protective layer on top of large language models, such as the one that powers Anthropic’s Claude chatbot, and can monitor both inputs and outputs for harmful content.
The development by Anthropic, which is in talks to raise $2bn at a $60bn valuation, comes amid growing industry concern over “jailbreaking”: attempts to manipulate AI models into generating illegal or dangerous information, such as instructions for building chemical weapons.
Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while persuading businesses to adopt AI models safely. Microsoft introduced “prompt shields” last March, while Meta launched a prompt guard model in July last year; researchers quickly found ways to bypass it, but the flaws have since been fixed.
Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was for severe chemical [weapons] stuff [but] the real advantage of the method is its ability to respond quickly and adapt.”
Anthropic said it would not immediately use the system on its current Claude models, but would consider implementing it if riskier models were released in the future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”
The start-up’s proposed solution is built on a so-called “constitution” of rules that define what is allowed and restricted, and which can be adapted to capture different types of material.
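In outline, the approach resembles a pair of filters wrapped around the underlying model: one screening the user’s prompt, the other screening the generated reply, each judged against the constitution’s rules. The sketch below is purely illustrative and is not Anthropic’s implementation; the `classify` and `guarded_chat` names and the keyword check are hypothetical stand-ins for trained classifier models.

```python
# Illustrative sketch only -- not Anthropic's system. It assumes a hypothetical
# classify() call that scores text against a rule-based "constitution", and shows
# the general shape of screening both inputs and outputs around a base model.

from dataclasses import dataclass

# A "constitution": plain-language rules describing allowed and restricted content.
CONSTITUTION = [
    "Restricted: instructions for synthesising or deploying chemical weapons.",
    "Allowed: general chemistry education and publicly available safety information.",
]

@dataclass
class Verdict:
    harmful: bool
    reason: str

def classify(text: str, constitution: list[str]) -> Verdict:
    """Hypothetical classifier: score `text` against the constitution.
    In practice this would be a trained model, not keyword matching."""
    flagged = any(keyword in text.lower() for keyword in ("nerve agent", "sarin"))
    return Verdict(harmful=flagged, reason="matched restricted rule" if flagged else "ok")

def guarded_chat(prompt: str, model_generate) -> str:
    """Wrap a base model with an input classifier and an output classifier."""
    if classify(prompt, CONSTITUTION).harmful:        # screen the user's prompt
        return "Request refused by input classifier."
    response = model_generate(prompt)                  # call the underlying LLM
    if classify(response, CONSTITUTION).harmful:       # screen the model's output
        return "Response withheld by output classifier."
    return response
```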
Some jailbreak attempts are well known, such as using unusual capitalisation in the prompt or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.
To prove the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who tried to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.
Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of attempts with the classifiers in place, compared with 14 per cent without the safeguards.
Leading technology companies are trying to reduce the misuse of their models while maintaining their helpfulness. Often, when moderation measures are put in place, models can become overly cautious and reject benign requests, as with early versions of Google’s Gemini image generator or Meta’s Llama.
However, adding these defences also incurs extra costs for companies already paying large sums for the computing power needed to train and run models. Anthropic said the classifier would amount to a nearly 24 per cent increase in “inference overhead”, the cost of running the models.

Security experts have argued that the accessible nature of generative AI chatbots has enabled ordinary people with no prior knowledge to attempt to extract dangerous information.
“In 2016, the threat actor we had in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”