
Rethinking Toxic Data in LLM Pretraining: A Co-Design Approach for Improved Steerability and Detoxification

In LLM pretraining, the quality of training data is a key determinant of model performance. A common strategy is to filter toxic content out of the training corpus to minimize harmful outputs. While this aligns with the principle that neural networks reflect their training data, it introduces trade-offs: removing toxic content can reduce the diversity and richness of the corpus, weakening the model’s ability to understand or identify toxicity and degrading performance on downstream tasks such as question answering. This creates a dilemma: retaining too much toxic data increases harmful outputs, while over-filtering limits the model’s overall capability. With the growing emphasis on post-training interventions, however, few models are deployed directly after pretraining, which suggests that this quality–quantity trade-off could be managed more effectively at later stages.
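To make the filtering step concrete, here is a minimal sketch of the standard corpus-cleaning approach the paper questions. The `score_toxicity` callable and the 0.5 threshold are illustrative assumptions, not details from the study; any off-the-shelf toxicity classifier could stand in.

```python
def filter_corpus(documents, score_toxicity, threshold: float = 0.5):
    """Keep only documents scored below a toxicity threshold.

    `score_toxicity` is a hypothetical classifier returning a score in [0, 1].
    Aggressive thresholds remove more harmful text but also shrink the
    diversity of the corpus, which is the trade-off discussed above.
    """
    return [doc for doc in documents if score_toxicity(doc) < threshold]
```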

Approaches to detoxifying LLMs generally fall into two categories: finetuning-based and decoding-based. Finetuning methods, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), align model behavior with human values or curated datasets. Although effective, they often degrade the model’s original capabilities and can be bypassed or undone with further training. Decoding-based (controlled generation) techniques, by contrast, adjust outputs at inference time using methods such as vocabulary shifting, self-debiasing, or external expert models. These strategies can reduce toxicity, but they often incur high computational cost and hurt fluency. A newer line of work instead modifies internal representations, on the assumption that linear structures in hidden states can be manipulated to produce specific behavioral outcomes.
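The sketch below illustrates why decoding-based control can hurt fluency. It is not any specific published method: it simply penalizes the logits of a hypothetical banned-token list before sampling, regardless of context, which is the crudest form of the vocabulary-shifting idea mentioned above.

```python
import torch

def detox_logits(logits: torch.Tensor, toxic_token_ids: list[int],
                 penalty: float = 10.0) -> torch.Tensor:
    """Crude decoding-time control: subtract a fixed penalty from the logits
    of tokens on a (hypothetical) toxic-word list before sampling.

    Because the penalty is applied independently of context, even benign uses
    of those tokens are suppressed, which is one source of the fluency cost.
    """
    adjusted = logits.clone()
    adjusted[..., toxic_token_ids] -= penalty
    return adjusted

# Usage (assuming `model_logits` is the next-token logit tensor):
# next_token = torch.argmax(detox_logits(model_logits, banned_ids), dim=-1)
```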

Harvard researchers re-examined data quality in LLM training by exploring a co-design of pretraining and post-training. They found that pretraining on toxic data, while increasing the toxicity of the base model, strengthens the model’s internal representation of toxicity, making it easier to suppress after training. Using OLMo-1B models trained on mixtures of clean and toxic data, they show that toxicity becomes more linearly separable and easier to control. Experiments with prompting and inference-time interventions show that detoxification improves without degrading general performance, suggesting that toxic data can lead to more controllable and robust language models.

To study the effect of toxic data on LLM pretraining, the researchers trained a series of OLMo-1B models with an increasing proportion of toxic content (from 0% to 25%) while keeping the amount of clean data constant. They found that including moderately toxic data improved general language competence (measured by MMLU) and toxicity detection (measured by ToxiGen). Probing experiments show that models trained with toxic data form stronger, more separable internal representations of toxicity. Statistical analysis and token-level visualizations further confirm that these models identify toxic content more accurately, supporting the claim that exposure to toxic examples enhances conceptual learning without significantly impairing general performance.
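A common way to measure how linearly separable a concept is in a model’s hidden states is a linear probe. The sketch below shows that general recipe under stated assumptions: the activations and binary toxicity labels are supplied by the caller, and held-out probe accuracy serves as a rough proxy for separability. It is not the authors’ exact probing setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_separability(hidden_states: np.ndarray, toxic_labels: np.ndarray) -> float:
    """Fit a linear probe on one layer's activations and return held-out accuracy.

    hidden_states: (n_examples, hidden_dim) activations extracted from a layer.
    toxic_labels:  (n_examples,) binary labels, e.g. from a toxicity classifier.
    Higher accuracy suggests toxicity is more linearly separable in that layer.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, toxic_labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```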

The study then examines whether exposure to toxic data during pretraining makes the model easier to detoxify with post-training methods. Applying Inference-Time Intervention (ITI), prompting, supervised finetuning (SFT), and DPO, the researchers found that models pretrained with up to 10% toxic data (e.g., 4chan posts) showed improved steerability. These models responded better to detoxification techniques, achieving lower toxicity with minimal performance loss. Moreover, when tested against adversarial red-teaming attacks, models pretrained on toxic data and steered with ITI showed greater robustness, suggesting that such exposure may strengthen the model’s internal representation of harmful content.
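For intuition on inference-time steering, here is a minimal ITI-style sketch: a forward hook shifts a layer’s hidden states away from a learned "toxicity direction" during generation. This is an assumption-laden simplification (ITI as published intervenes on selected attention heads with probe-derived directions), and the direction vector and strength `alpha` are placeholders the caller must supply.

```python
import torch

def add_steering_hook(layer_module: torch.nn.Module,
                      toxicity_direction: torch.Tensor,
                      alpha: float = 5.0):
    """Register a forward hook that pushes hidden states away from a
    toxicity direction at inference time (ITI-style steering sketch).

    `toxicity_direction` would typically come from a linear probe's weights;
    `alpha` controls intervention strength. Returns the hook handle so the
    intervention can be removed with handle.remove().
    """
    direction = toxicity_direction / toxicity_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        shifted = hidden - alpha * direction.to(hidden.device, hidden.dtype)
        return (shifted,) + output[1:] if isinstance(output, tuple) else shifted

    return layer_module.register_forward_hook(hook)
```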

In summary, the study challenges the assumption that excluding toxic data during pretraining necessarily improves language model quality. Through theoretical and empirical analysis with OLMo-1B models, the authors show that adding toxic data to pretraining yields more linearly separable representations of toxicity, making it easier to control after training. Although base models trained on toxic data initially produce more harmful content, detoxification techniques such as ITI work more effectively on them. Results on benchmark datasets show a better balance between toxicity reduction and retention of general capabilities. The work suggests that some “bad” data can enhance a model’s steerability and alignment.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.



Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. He is very interested in solving practical problems, and he brings a new perspective to the intersection of AI and real-life solutions.
