2026/04/20/ai-models-marketed-as-uncensored-still-encounter
Study finds “uncensored” AI models still avoid charged words due to safety filtering embedded during pretraining
EDITOR BRIEF
Pretrain Forensics measured a behavior it calls the “flinch”: a model steers away from predicting politically or socially charged words without ever issuing a refusal. Across seven pretrained models from five labs, the researchers found that supposedly uncensored models can still rank sensitive terms far lower than open-data baselines do.
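As a rough illustration of how this kind of down-ranking could be quantified, the sketch below compares where a charged term lands in a model's next-word distribution with where it lands in a baseline's. It is not the study's method, which the brief does not describe: the function names, the placeholder vocabulary, and every probability here are assumptions invented for illustration.

```python
import math

def rank_of(token, probs):
    """Return the 1-based rank of `token` when candidates are sorted by descending probability."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(token) + 1

def log_prob_gap(token, model_probs, baseline_probs):
    """Return log p_baseline(token) - log p_model(token).

    A positive value means the model assigns the token less probability
    than the baseline does.
    """
    return math.log(baseline_probs[token]) - math.log(model_probs[token])

# Hypothetical next-word distributions for one prompt (made-up numbers).
baseline = {"charged_term": 0.40, "neutral_a": 0.30, "neutral_b": 0.20, "neutral_c": 0.10}
model = {"charged_term": 0.05, "neutral_a": 0.45, "neutral_b": 0.35, "neutral_c": 0.15}

baseline_rank = rank_of("charged_term", baseline)
model_rank = rank_of("charged_term", model)
gap = log_prob_gap("charged_term", model, baseline)

print(f"baseline rank: {baseline_rank}, model rank: {model_rank}, log-prob gap: {gap:.3f}")
```

In this toy case the model never refuses, yet the charged term falls from the top candidate to the last one. That is the pattern the brief describes: avoidance that shows up in the ranking rather than in a visible refusal.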
CONTEXT
The findings suggest that removing refusal behavior after training does not necessarily remove deeper safety biases learned during pretraining. This points to a growing distinction between visible moderation and embedded censorship, which could affect model transparency, auditing, and downstream fine-tuning reliability.
ARTICLE
Even 'uncensored' models can't say what they want