LongWriter-Zero: A reinforcement learning framework for ultra-long text generation without synthetic data

Introduction to the challenge of generating ultra-long text
For real-world tasks such as storytelling, legal writing, and educational materials, it is increasingly important to generate ultra-long texts spanning thousands of words. However, large language models still face significant challenges as their outputs grow longer, including length limits and quality degradation. Common failure modes include incoherence, topic drift, repetition, and poor structure. Earlier methods such as LongWriter address this with supervised fine-tuning on synthetic data, but creating such data is expensive, hard to scale, and often feels unnatural. Moreover, relying on existing LLMs to generate training data limits creativity, and typical training methods do little to improve the coherence or formatting of long outputs.
Evolution of long-form text generation methods
Recent research on long-form text generation has focused on improving coherence, personalization, and extending outputs beyond 2,000 words. Early systems such as Re3 and DOC used recursive strategies to maintain structure, while LongLaMP and related work introduced personalization into long-form generation. Suri built a large instruction-following dataset but is limited to outputs of about 5,000 tokens due to its reliance on back-translation. LongWriter pushed this to 6K-20K token outputs using supervised fine-tuning and preference optimization, although it inherits biases from its teacher model. Separately, RL has improved reasoning in LLMs such as DeepSeek-R1 and QwQ-32B, but it had not yet been applied to ultra-long text generation.
LongWriter-Zero: Reinforcement learning without synthetic data
Researchers from Tsinghua University and SUTD introduced LongWriter-Zero, an approach that uses RL to train LLMs for ultra-long text generation without relying on annotated or synthetic data. Starting from the Qwen2.5-32B base model, they apply RL with carefully designed reward models targeting text length, quality, and structure. Their framework draws inspiration from RL successes in math and coding tasks, exploring three key factors: reward design, inference-time scaling, and continual pretraining. LongWriter-Zero surpasses traditional supervised fine-tuning methods, achieving state-of-the-art performance on WritingBench and Arena-Write and even outperforming much larger models such as DeepSeek-R1.
Novel optimization strategies and benchmarks
The study introduces a reinforcement learning approach to ultra-long text generation with LLMs. Building on PPO, the researchers trained a 32B-parameter model with Group Relative Policy Optimization (GRPO) on instruction-following data with outputs of up to 14K tokens. They evaluate outputs on a new benchmark, Arena-Write, and design a reward system that balances text length, fluency, coherence, and formatting. A key insight is that having the model "think" through intermediate reasoning steps before writing leads to better structure and length control. Continual pretraining on writing-heavy data yields further gains, highlighting the importance of a strong writing-oriented foundation.
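To make the reward design and GRPO idea concrete, here is a minimal sketch, not the authors' actual implementation: a composite reward over length adherence, quality, and format, combined with group-relative advantages computed over a group of sampled completions. The weights, the saturating length term, and the toy scores are illustrative assumptions.

```python
import numpy as np

def composite_reward(text, target_len, quality_score, format_score,
                     w_len=0.4, w_quality=0.4, w_format=0.2):
    # Illustrative composite reward: length adherence + quality + format.
    # quality_score / format_score stand in for outputs of learned reward
    # models; the weights and the saturating length term are assumptions.
    length_score = min(len(text.split()) / target_len, 1.0)
    return w_len * length_score + w_quality * quality_score + w_format * format_score

def grpo_advantages(rewards, eps=1e-8):
    # GRPO-style group-relative advantage: normalize each sampled output's
    # reward against the mean/std of its own group (no value network).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: four sampled completions for the same writing prompt.
group = [
    ("word " * 1800, 0.7, 0.9),
    ("word " * 2400, 0.8, 0.6),
    ("word " * 900,  0.9, 0.8),
    ("word " * 2100, 0.6, 0.7),
]
rewards = [composite_reward(t, target_len=2000, quality_score=q, format_score=f)
           for t, q, f in group]
print(grpo_advantages(rewards))  # these advantages weight the policy-gradient update
```

In this style of optimization, each completion is judged relative to its own sampled group rather than against a learned value baseline, which keeps the setup simple for very long outputs.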
Results on long-form benchmarks
LongWriter-Zero was trained and evaluated in two stages: continual pretraining on 30 billion tokens of long-form books, followed by reinforcement learning over 150 steps with a "think" prompt that encourages reasoning before writing. It scored 8.69 on WritingBench, outperforming GPT-4o (8.16), Qwen2.5-Max (8.37), and DeepSeek-R1 (8.55), and led in five of the six domains. On Arena-Write, it reached the highest Elo score of 1447. Removing the "think" prompt or the continual pretraining stage caused significant performance drops, confirming their importance. The model also achieved a 98.2% win rate in GPT-4.1-judged comparisons, and human evaluations confirmed its strength in long-form writing.
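As a rough illustration of the "think"-before-writing mechanism, the sketch below shows a hypothetical prompt template and how a reasoning block might be separated from the final answer before scoring. The tag names and template wording are assumptions, not the paper's exact format.

```python
import re

# Hypothetical think-then-write prompt template (wording is an assumption).
THINK_TEMPLATE = (
    "{instruction}\n\n"
    "First think step by step about the outline, structure, and target length "
    "inside <think>...</think>, then write the full response."
)

def split_think(output: str):
    # Separate the reasoning block from the final answer; only the answer
    # would be scored for length, quality, and format.
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    return reasoning, answer

prompt = THINK_TEMPLATE.format(instruction="Write a 3,000-word essay on urban planning.")
reasoning, answer = split_think("<think>Outline: intro, history, zoning...</think>Urban planning is...")
```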
Conclusion and future directions for reward design
In short, LongWriter-Zero proposes a reinforcement learning approach to ultra-long text generation that avoids the need for synthetic or labeled datasets. Built on the Qwen2.5-32B base model and trained with RL from scratch, it uses reward models targeting length control, writing quality, and formatting. It achieved top scores on WritingBench (8.69) and Arena-Write (Elo 1447), outperforming GPT-4o (8.16), DeepSeek-R1 (8.55), and Qwen3-235B-A22B (Elo 1343). Human and GPT-4.1-based evaluations show win rates as high as 98.2%. However, the approach is susceptible to reward-model hacking, such as inflating length through repetition or inserting keywords like "quantum entanglement" to obtain higher scores. Addressing these limitations will require better reward design and human-in-the-loop strategies.
Check out the Paper and dataset cards. All credit for this research goes to the researchers on this project. Also, feel free to follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
