GEEK HAUS
2026/04/20/ai-models-marketed-as-uncensored-still-encounter

Study finds “uncensored” AI models still avoid charged words due to safety filtering embedded during pretraining

morgin.ai

EDITOR BRIEF

Pretrain Forensics measured a behavior it calls the flinch: a model avoids predicting politically or socially charged words without ever issuing a refusal. Across seven pretrained models from five labs, the researchers found that supposedly uncensored models can still heavily down-rank sensitive terms relative to open-data baselines.
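
To make "down-rank" concrete, here is a minimal sketch of how such a comparison could be expressed. It is not the researchers' method: the function names, the log-probability gap as a score, and the toy logits are all assumptions made for illustration. It compares where a single word lands in two next-token distributions for the same prompt, one from the model under test and one from a baseline.

```python
import math

def log_softmax(logits):
    """Convert a {token: logit} dict into {token: log-probability}."""
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}

def rank_of(token, logits):
    """1-based rank of `token` when candidates are sorted by descending logit."""
    ordered = sorted(logits, key=logits.get, reverse=True)
    return ordered.index(token) + 1

def flinch_gap(token, model_logits, baseline_logits):
    """Baseline log-prob minus model log-prob for `token` (hypothetical score).

    A positive value means the model assigns the token less probability
    than the baseline does for the same context.
    """
    base_lp = log_softmax(baseline_logits)[token]
    model_lp = log_softmax(model_logits)[token]
    return base_lp - model_lp

# Toy next-token logits for one prompt. These numbers are invented for
# illustration and are not measurements from any real model.
baseline = {"neutral": 1.0, "charged": 2.0, "other": 0.0}
model = {"neutral": 2.0, "charged": 0.5, "other": 1.0}

print("baseline rank:", rank_of("charged", baseline))
print("model rank:", rank_of("charged", model))
print("gap (nats):", round(flinch_gap("charged", model, baseline), 3))
```

In this toy case the word moves from the top of the baseline's ranking to the bottom of the model's, and no refusal appears anywhere in the output. That is the kind of quiet shift the brief describes.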

CONTEXT

The findings suggest that removing refusal behavior after training does not necessarily remove deeper safety biases learned during pretraining. This points to a growing distinction between visible moderation and embedded censorship, which could affect model transparency, auditing, and downstream fine-tuning reliability.

ARTICLE

Even 'uncensored' models can't say what they want
