Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading technology groups including Microsoft and Meta race to find ways to protect against the dangers posed by the cutting-edge technology.
In a paper released on Monday, the San Francisco-based start-up outlined a new system called “Constitutional Classifiers”. It is a model that acts as a protective layer on top of large language models, such as the one that powers Anthropic’s Claude chatbot, and can monitor both inputs and outputs for harmful content.
The development by Anthropic, which is in talks to raise $2bn at a $60bn valuation, comes amid growing industry concern over “jailbreaking”: attempts to manipulate AI models into generating illegal or dangerous information, such as instructions for building chemical weapons.
Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while persuading businesses to adopt AI models safely. Microsoft introduced “prompt shields” last March, while Meta launched a prompt guard model in July last year; researchers quickly found ways to bypass it, but the flaws have since been fixed.
Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was for severe chemical [weapons] stuff [but] the real advantage of the method is its ability to respond quickly and adapt.”
Anthropic said it would not immediately use the system on its current Claude models, but would consider implementing it if riskier models were released in the future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”
The start-up’s proposed solution is built on a so-called “constitution” of rules that define what is allowed and restricted, and which can be adapted to capture different types of material.
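In outline, the approach resembles a pair of filters wrapped around the underlying model: one screening the user’s prompt, the other screening the generated reply, each judged against the constitution’s rules. The sketch below is purely illustrative and is not Anthropic’s implementation; the `classify` and `guarded_chat` names and the keyword check are hypothetical stand-ins for trained classifier models.

```python
# Illustrative sketch only -- not Anthropic's system. It assumes a hypothetical
# classify() call that scores text against a rule-based "constitution", and shows
# the general shape of screening both inputs and outputs around a base model.

from dataclasses import dataclass

# A "constitution": plain-language rules describing allowed and restricted content.
CONSTITUTION = [
    "Restricted: instructions for synthesising or deploying chemical weapons.",
    "Allowed: general chemistry education and publicly available safety information.",
]

@dataclass
class Verdict:
    harmful: bool
    reason: str

def classify(text: str, constitution: list[str]) -> Verdict:
    """Hypothetical classifier: score `text` against the constitution.
    In practice this would be a trained model, not keyword matching."""
    flagged = any(keyword in text.lower() for keyword in ("nerve agent", "sarin"))
    return Verdict(harmful=flagged, reason="matched restricted rule" if flagged else "ok")

def guarded_chat(prompt: str, model_generate) -> str:
    """Wrap a base model with an input classifier and an output classifier."""
    if classify(prompt, CONSTITUTION).harmful:        # screen the user's prompt
        return "Request refused by input classifier."
    response = model_generate(prompt)                  # call the underlying LLM
    if classify(response, CONSTITUTION).harmful:       # screen the model's output
        return "Response withheld by output classifier."
    return response
```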
Some jailbreak attempts are well known, such as using unusual capitalisation in the prompt or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.
To prove the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who tried to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.
Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of attempts with the classifiers in place, compared with 14 per cent without the safeguards.
Leading technology companies are trying to reduce the misuse of their models while maintaining their helpfulness. Often, when moderation measures are put in place, models can become overly cautious and reject benign requests, as with early versions of Google’s Gemini image generator or Meta’s Llama.
However, adding these defences also incurs extra costs for companies already paying large sums for the computing power needed to train and run models. Anthropic said the classifier would amount to a nearly 24 per cent increase in “inference overhead”, the cost of running the models.

Security experts have argued that the accessible nature of generative AI chatbots has enabled ordinary people with no prior knowledge to attempt to extract dangerous information.
“In 2016, the threat actor we had in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”