A complaint about poverty in rural China. A news report about a corrupt Communist Party member. A cry for help about corrupt police officers shaking down entrepreneurs.
These are just some of the 133,000 examples fed into a sophisticated large language model that is designed to automatically flag any piece of content considered sensitive by the Chinese government.
A leaked database seen by TechCrunch reveals that China has developed an AI system that supercharges its already formidable censorship machine, extending far beyond traditional taboos like the Tiananmen Square massacre.
The system appears primarily geared toward censoring Chinese citizens online, but it could be used for other purposes, such as improving the already extensive censorship of Chinese AI models.
Xiao Qiang, a researcher at UC Berkeley who studies Chinese censorship and who also examined the dataset, told TechCrunch that it was “clear evidence” that the Chinese government or its affiliates want to use LLMs to improve repression.
“Unlike traditional censorship mechanisms, which rely on human labor for keyword-based filtering and manual review, an LLM trained on such instructions would significantly improve the efficiency and granularity of state-led information control,” Qiang told TechCrunch.
This adds to growing evidence that authoritarian regimes are quickly adopting the latest AI technology. In February, for example, OpenAI said it had caught multiple Chinese entities using LLMs to track anti-government posts and smear Chinese dissidents.
The Chinese Embassy in Washington, D.C., told TechCrunch in a statement that it opposes “unfounded attacks and slander against China” and that China attaches great importance to developing AI ethically.
Data found in plain sight
The dataset was discovered by security researcher NetAskari, who shared a sample with TechCrunch after finding it stored in an unsecured Elasticsearch database hosted on a Baidu server.
This does not indicate any involvement from either company; all kinds of organizations store their data with these providers.
There is no indication of who, exactly, built the dataset, but records show that the data is recent, with its latest entries dating from December 2024.
An LLM for detecting dissent
In language reminiscent of the way people prompt ChatGPT, the system’s creator tasks an unnamed LLM with determining whether a piece of content has any connection to sensitive topics related to politics, social life, and the military. Such content is deemed “highest priority” and must be flagged immediately.
High-priority topics include pollution and food safety scandals, financial fraud, and labor disputes, all hot-button issues in China that sometimes lead to public protests, such as the Shifang anti-pollution protests of 2012.
Any form of “political satire” is explicitly targeted. For example, if someone uses a historical analogy to make a point about “current political figures,” that must be flagged instantly, and so must anything related to “Taiwan politics.” Military matters are extensively targeted, including reports of military movements, exercises, and weaponry.
A snippet of the dataset references prompt tokens and LLMs, confirming the system uses an AI model to do its bidding.

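The snippet itself is an image and is not reproduced here. As a purely hypothetical illustration of the pattern described above, a ChatGPT-style instruction, the content to be judged, a priority label, and token counts, a single record might look something like the sketch below; every field name and value is an assumption, not taken from the leaked data.

```python
# Hypothetical sketch only: the leaked snippet is an image and is not
# reproduced here. Field names and values are assumptions meant to
# illustrate the ChatGPT-style prompting pattern the article describes.
import json

record = {
    # An instruction in the style of a ChatGPT prompt, per the article.
    "prompt": (
        "Determine whether the following content touches on sensitive "
        "topics related to politics, social life, or the military. "
        "If it does, mark it highest priority for immediate flagging."
    ),
    # The user-generated content to be evaluated (placeholder text).
    "content": "A post complaining about corrupt local officials...",
    # Fields one might expect from an LLM pipeline; names are invented.
    "label": "highest_priority",
    "prompt_tokens": 57,
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```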
Inside the training data
From this trove of 133,000 examples that the LLM must evaluate for censorship, TechCrunch gathered 10 representative pieces of content.
Topics that could stir up social unrest are a recurring theme. One snippet, for example, is a post by a business owner complaining about corrupt local police officers shaking down entrepreneurs, a rising issue in China as its economy struggles.
Another piece of content laments rural poverty in China, describing run-down towns that have only elderly people and children left in them. There is also a news report about the Chinese Communist Party (CCP) expelling a local official for severe corruption and for believing in “superstitions” instead of Marxism.
There is extensive material related to Taiwan and military matters, such as commentary on Taiwan’s military capabilities and details about a new Chinese jet fighter. The Chinese word for Taiwan (台湾) alone is mentioned over 15,000 times in the data, a search by TechCrunch shows.
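A count like that is straightforward to reproduce on any dump of text records. As a minimal sketch, assuming the dataset is exported as newline-delimited JSON with a “content” field (both the file name and the field name are assumptions), the tally could be computed like this:

```python
# Minimal sketch: tally how often a term appears across a dataset dump.
# Assumes newline-delimited JSON records with a "content" field; the file
# name and field name are assumptions, not details from the leaked data.
import json

def count_term(path: str, term: str) -> int:
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            total += record.get("content", "").count(term)
    return total

print(count_term("dataset.jsonl", "台湾"))
```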
Subtle dissent appears to be targeted as well. One snippet included in the database is an anecdote about the fleeting nature of power that uses the popular Chinese idiom “When the tree falls, the monkeys scatter.”
Power transitions are an especially touchy topic in China, thanks to its authoritarian political system.
Built for “public opinion work”
The dataset does not include any information about its creators. But it does say that it is intended for “public opinion work,” which offers a strong clue that it is meant to serve Chinese government goals, one expert told TechCrunch.
Michael Caster, the Asia program manager of the rights organization Article 19, explained that “public opinion work” is overseen by a powerful Chinese government regulator, the Cyberspace Administration of China (CAC), and typically refers to censorship and propaganda efforts.
The end goal is to ensure that Chinese government narratives are protected online, while any alternative views are purged. Chinese President Xi Jinping has himself described the internet as the “frontline” of the CCP’s “public opinion work.”
Repression is getting smarter
The dataset examined by TechCrunch is the latest evidence that authoritarian governments are seeking to leverage AI for repressive purposes.
OpenAI released a report last month revealing that an unidentified actor, likely operating from China, used generative AI to monitor social media conversations, particularly those advocating for human rights protests against China, and forward them to the Chinese government.
Contact us
If you know more about how AI is used for state repression, you can contact Charles Rollet securely on Signal at charlesrollet.12. You can also contact TechCrunch via SecureDrop.
OpenAI also found the technology being used to generate comments highly critical of a prominent Chinese dissident, Cai Xia.
Traditionally, China’s censorship methods rely on more basic algorithms that automatically block content mentioning blacklisted terms, like “Tiananmen massacre” or “Xi Jinping,” as many users experienced when trying DeepSeek for the first time.
But newer AI technology, like LLMs, can make censorship more efficient by finding even subtle criticism at a vast scale. Some AI systems can also keep improving as they ingest more and more data.
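The gap between the two approaches is easy to see in a sketch. In the illustrative code below, the blacklist, the prompt, and the ask_llm stand-in are all assumptions; the point is simply that a keyword filter misses an idiom that an LLM can be asked to catch.

```python
# Sketch contrasting the two approaches described above. The blacklist,
# the prompt, and the ask_llm stand-in are illustrative assumptions.

BLACKLIST = ["Tiananmen massacre", "Xi Jinping"]  # example terms from the article

def keyword_censor(text: str) -> bool:
    """Traditional approach: block only on exact blacklisted terms."""
    return any(term in text for term in BLACKLIST)

def llm_censor(text: str, ask_llm) -> bool:
    """LLM approach: ask a model whether the text carries even oblique
    criticism (satire, historical analogy, idiom). ask_llm stands in
    for whatever chat-completion client the operator actually uses."""
    prompt = (
        "Does the following text criticize the government, even "
        "indirectly, through satire, historical analogy, or idiom? "
        "Answer yes or no.\n\n" + text
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

# An idiom-based post sails past the keyword filter:
post = "When the tree falls, the monkeys scatter."
print(keyword_censor(post))  # False: no blacklisted term appears
```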
“I think it’s crucial to highlight how AI-driven censorship is evolving, making state control over public discourse even more sophisticated, especially at a time when Chinese AI models such as DeepSeek are making headwaves,” Xiao, the Berkeley researcher, told TechCrunch.