CMU researchers introduce PPP and UserVille to train proactive and personalized LLM agents
Most LLM agents are tuned to maximize task success. They solve GitHub issues or answer deep research queries, but they don't reason carefully about when to ask users questions or how to respect different interaction preferences. How can we design an LLM agent so that it knows when to ask better questions and adapts its behavior to each user?
A team of researchers from Carnegie Mellon University (CMU) and OpenHands formalizes these missing behaviors as three joint objectives, productivity, proactivity, and personalization, and optimizes them with a multi-objective reinforcement learning framework called PPP, inside a new environment named UserVille.

From task success to interaction-aware agents
The research team defines:
- Productivity as task completion quality, e.g., F1 for function localization on SWE-Bench Verified or exact match on BrowseComp-Plus (see the metric sketch after this list).
- Proactivity as asking necessary clarifying questions when the initial prompt is vague, while avoiding unnecessary questions.
- Personalization as following user-specific interaction preferences, such as conciseness, format, or language.
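As a quick illustration of these two productivity metrics, the sketch below computes a set-level F1 for function localization and a normalized exact match for short answers. The function names and normalization details are assumptions for illustration, not the benchmarks' official scorers.

```python
# Hypothetical sketch of the two productivity metrics described above.
# Names and normalization rules are illustrative assumptions.

def func_loc_f1(predicted_funcs: set[str], gold_funcs: set[str]) -> float:
    """F1 between the functions the agent localizes and the gold set."""
    if not predicted_funcs or not gold_funcs:
        return 0.0
    true_pos = len(predicted_funcs & gold_funcs)
    precision = true_pos / len(predicted_funcs)
    recall = true_pos / len(gold_funcs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def exact_match(predicted_answer: str, gold_answer: str) -> float:
    """Exact match after light normalization, as used for short-answer QA."""
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(predicted_answer) == norm(gold_answer))
```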
UserVille, an interactive environment with a preference-aware simulator
UserVille converts existing agent benchmarks into interaction-centric RL environments populated by LLM-based user simulators.
It has three stages:
- Vague prompt construction: precise task prompts are rewritten as vague prompts that keep the same intent but drop details. This creates an information asymmetry, where the simulator still sees the precise prompt while the agent only sees the vague version.
- Preference-aware user simulation: each user simulator is parameterized by preferences drawn from 20 types, covering requirements such as conciseness, the number of questions per turn, answer format, timing, language restrictions, or JSON-formatted questions. Twelve preferences are used in training and 8 are held out to test generalization.
- User-centric evaluation: after a task is completed, the simulator labels each question as low, medium, or high effort, based on whether it can be answered from the precise prompt and how easy it is to answer. The proactivity score is 1 if the overall session effort is low and 0 otherwise. The personalization score is 1 if the agent followed the preference and 0 otherwise, averaged over sessions in which the agent asked at least one question (see the scoring sketch below).
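As a rough illustration of this evaluation, the sketch below scores a session from question-level effort labels and a preference-compliance flag. The data structures and the per-question compliance flag are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of UserVille's user-centric session scoring.
# Structures and fields are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Question:
    effort: str          # "low", "medium", or "high", as labeled by the simulator
    followed_pref: bool  # did this question respect the active user preference?

def proactivity_score(questions: list[Question]) -> int:
    """1 if the whole session stays low effort for the simulated user, else 0."""
    return int(all(q.effort == "low" for q in questions))

def personalization_score(questions: list[Question]) -> float | None:
    """1 if the preference was followed throughout, 0 otherwise.
    Sessions with no questions are excluded from the average (None)."""
    if not questions:
        return None
    return float(all(q.followed_pref for q in questions))
```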
UserVille is instantiated in two domains: software engineering, using SWE-Gym for training and SWE-Bench Verified and SWE-Bench Full for evaluation, and deep research, using BrowseComp-Plus with a search plus open_page tool scaffold.


PPP, multi-objective reinforcement learning for productive, proactive, and personalized agents
The agent is implemented as a ReAct-style tool-use policy based on Seed-OSS-36B-Instruct. It can call domain tools and query the user simulator through an ask_user tool.
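To make the ask_user interaction concrete, here is a hedged sketch of such a tool backed by an LLM-based user simulator. Only the tool name ask_user and the use of GPT-5 Nano as the simulator come from the article; the tool schema, prompt, and client code are assumptions for illustration.

```python
# Illustrative ask_user tool wired to an LLM-based user simulator.
# The simulator sees the precise prompt and a preference; the agent does not.

from openai import OpenAI

client = OpenAI()

# Tool specification the agent could expose in function-calling format (assumed shape).
ASK_USER_TOOL = {
    "type": "function",
    "function": {
        "name": "ask_user",
        "description": "Ask the user a clarifying question about the task.",
        "parameters": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
}

def simulate_user(question: str, precise_prompt: str, preference: str) -> str:
    """Answer the agent's question as a preference-aware simulated user."""
    response = client.chat.completions.create(
        model="gpt-5-nano",  # the article reports GPT-5 Nano as the user simulator
        messages=[
            {"role": "system",
             "content": (f"You simulate a user. Full task: {precise_prompt}\n"
                         f"Interaction preference: {preference}\n"
                         "Answer the agent's question briefly, following the preference.")},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```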
PPP defines a trajectory-level reward

R = R_prod + R_proact + R_pers

- Productivity reward R_prod is the task metric, F1 on SWE-Func-Loc or exact match on BrowseComp-Plus.
- Proactivity reward R_proact adds +0.05 if all questions in the session are low effort, and imposes a -0.1 penalty for each medium-effort question and a -0.5 penalty for each high-effort question.
- Personalization reward R_pers adds +0.05 when the agent follows the preference, and applies a penalty defined by preference-specific rules for each violation (see the reward sketch below).
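The sketch below assembles these three terms into a single trajectory-level reward. The bonus and penalty values come from the description above; the per-violation personalization penalty is a placeholder, since the paper defines it per preference type.

```python
# Sketch of the trajectory-level PPP reward R = R_prod + R_proact + R_pers.
# Constants follow the article; the violation penalty value is an assumption.

def ppp_reward(task_metric: float,
               effort_labels: list[str],
               preference_followed: bool,
               num_violations: int,
               violation_penalty: float = 0.1) -> float:
    # Productivity: task metric, e.g. F1 (SWE-Func-Loc) or exact match (BrowseComp-Plus).
    r_prod = task_metric

    # Proactivity: +0.05 if every question is low effort, otherwise
    # -0.1 per medium-effort and -0.5 per high-effort question.
    if all(e == "low" for e in effort_labels):
        r_proact = 0.05
    else:
        r_proact = (-0.1 * effort_labels.count("medium")
                    - 0.5 * effort_labels.count("high"))

    # Personalization: +0.05 when the preference is followed, plus a
    # preference-specific penalty per violation (value assumed here).
    r_pers = (0.05 if preference_followed else 0.0) - violation_penalty * num_violations

    return r_prod + r_proact + r_pers
```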
Training uses a GRPO-based RL algorithm with the clip-higher trick and token-level policy gradient loss from DAPO, optimizing only the tokens generated by the LLM. The training environment is implemented in verl. Seed-OSS-36B-Instruct is trained for 200 steps with a batch size of 64 and a group size of 8. The maximum output length is 32k for SWE-Func-Loc, 65k for SWE-Full, and 41k for deep research. GPT-5 Nano serves as the user simulator. The SWE scaffold is based on OpenHands, and deep research uses the search and open_page tools with Qwen3-Embedding-8B as the retriever.
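For intuition, here is a minimal sketch of the two ingredients named above: a group-relative advantage and a clipped token-level policy-gradient loss that masks out tokens not generated by the LLM. The shapes, clip values, and masking convention are illustrative assumptions, not the verl implementation.

```python
# Illustrative GRPO-style group advantages with DAPO-like token-level loss masking.
# Not the verl implementation; shapes and clip values are assumptions.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) trajectory rewards, one group per prompt.
    Each trajectory's advantage is its reward normalized within its group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def token_level_pg_loss(logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        response_mask: torch.Tensor,
                        clip_low: float = 0.2,
                        clip_high: float = 0.28) -> torch.Tensor:
    """Clipped policy-gradient loss averaged over generated tokens only.
    advantages: trajectory advantages broadcast over tokens, shape like logprobs.
    response_mask: 1 for LLM-generated tokens, 0 for tool/user tokens."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_low, 1 + clip_high) * advantages
    per_token = -torch.minimum(unclipped, clipped)
    return (per_token * response_mask).sum() / response_mask.sum().clamp(min=1)
```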


Experimental results
The main evaluation (Table 2 in the paper) reports productivity, proactivity, and personalization on SWE-Bench Verified Func-Loc and BrowseComp-Plus, using vague prompts and averaging over the 20 preferences.


For the Seed-OSS-36B-Instruct base model:
- On SWE-Func-Loc: productivity 38.59, proactivity 43.70, personalization 69.07.
- On BrowseComp-Plus: productivity 18.20, proactivity 37.60, personalization 64.76.
After PPP RL training, the model reaches:
- On SWE-Func-Loc: productivity 56.26, proactivity 75.55, personalization 89.26.
- On BrowseComp-Plus: productivity 26.63, proactivity 47.69, personalization 76.85.
Relative to Seed-OSS-36B-Instruct, the average gain across all three dimensions and both datasets is 16.72 points, and PPP also outperforms GPT-5 and other GPT-series baselines on the combined metrics.
Interaction is crucial under vague prompts. On SWE-Func-Loc, F1 with a precise prompt and no interaction is 64.50; it drops to 44.11 with a vague prompt and no interaction. Adding interaction without reinforcement learning does not close this gap. With PPP training plus interaction, F1 under vague prompts improves by 21.66 points.
PPP also changes interaction behavior. Under vague prompts, the ask ratio rises from 50% to 100% on SWE-Func-Loc and from 51% to 85% on deep research, while it stays low under precise prompts. The number of questions per session increases early in training and then stabilizes, with a high proportion of low-effort questions and only a small share of high-effort ones.
Main points
- The PPP framework casts agent training as a multi-objective reinforcement learning problem, jointly optimizing productivity, proactivity, and personalization rather than task success alone.
- UserVille builds vague-prompt versions of existing benchmarks and pairs them with a preference-aware user simulator that enforces 20 different interaction preferences and labels the user's effort level.
- The total reward combines task metrics, user effort, and preference compliance, rewarding low-effort questions and penalizing medium- and high-effort questions or preference violations, implemented through a GRPO-based RL algorithm.
- Under vague prompts on SWE-Bench Func-Loc and BrowseComp-Plus, PPP-trained Seed-OSS-36B significantly improves all three metrics over the base model and GPT-5 baselines, with an average gain of about 16.72 points across dimensions and datasets.
- PPP agents generalize to unseen preferences, alternative simulators, and harder tasks such as SWE-Bench Full, learning to ask fewer but more targeted low-effort questions, especially when prompts are vague.
PPP and UserVille mark an important step toward interaction-aware LLM agents: they explicitly encode productivity, proactivity, and personalization in the reward design, enforce 20 interaction preferences through a preference-aware user simulator, and apply GRPO with DAPO-style token-level optimization within verl and OpenHands scaffolding. The improvements on SWE-Bench Func-Loc, SWE-Bench Full, and BrowseComp-Plus suggest that interaction modeling is now a core capability rather than an add-on.
Check out the Paper and the GitHub repo for more details.