A new AI study from Anthropic and Thinking Machines Lab stress-tests model specifications and reveals character differences between language models
AI companies use model specifications to define target behaviors during training and evaluation. Do current specifications describe the intended behavior precisely enough, and do frontier models behave differently under the same specification? A team of researchers from Anthropic, Thinking Machines Lab, and Constellation proposes a systematic approach that stress-tests model specifications with value trade-off scenarios and then treats cross-model disagreement as a signal of gaps or contradictions in the specification. The team analyzed 12 frontier LLMs from Anthropic, OpenAI, Google, and xAI and linked high disagreement to specification violations, missing guidance on response quality, and evaluator ambiguity. The team also released a public dataset.
A model specification is the written policy that alignment training tries to instill. If the specification were complete and precise, models trained against it should not differ much on the same input. The research team turned this intuition into a test. They generated more than 300,000 scenarios that force a choice between two legitimate values, for example social fairness versus business efficiency. Each response is scored on a 0 to 6 value-spectrum rubric, and disagreement is measured as the standard deviation of scores across models. High disagreement pinpoints normative clauses that need clarification or additional examples.
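To make the disagreement signal concrete, here is a minimal sketch of how per-model 0 to 6 value-spectrum scores for one scenario collapse into a single cross-model standard deviation; the model names and scores below are invented for illustration.

```python
import numpy as np

# Hypothetical 0-6 value-spectrum scores for one scenario, one per model.
# Higher scores mean stronger support for the probed value.
scores_by_model = {
    "model_a": 6, "model_b": 5, "model_c": 1,
    "model_d": 2, "model_e": 6, "model_f": 0,
}

scores = np.array(list(scores_by_model.values()), dtype=float)
disagreement = scores.std()  # cross-model standard deviation on the 0-6 scale

print(f"mean score: {scores.mean():.2f}, disagreement (std): {disagreement:.2f}")
# A high standard deviation flags the scenario as a candidate spec gap.
```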

So, what was the methodology used in this study?
The research team started from a taxonomy of 3,307 fine-grained values observed in natural Claude traffic, which is far more granular than typical model specifications. For each pair of values, they generate a neutral query and two biased variants that each favor one value. They construct value-spectrum rubrics that map response positions from 0 (strong opposition to the value) to 6 (strong support for the value). They score the responses of the 12 models against these rubrics and define disagreement as the maximum of the cross-model standard deviations over the two value dimensions. To remove near-duplicates while retaining the hard cases, they use disagreement-weighted k-center selection over Gemini embeddings with a greedy 2-approximation algorithm.
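The deduplication step can be illustrated with a small sketch of greedy 2-approximation k-center selection in which distances are weighted by per-scenario disagreement, so high-disagreement cases survive the pruning; the exact weighting and embedding model in the paper may differ from this toy version.

```python
import numpy as np

def weighted_k_center(embeddings: np.ndarray, weights: np.ndarray, k: int) -> list[int]:
    """Greedy 2-approximation for k-center selection, where each point's distance
    to the chosen set is scaled by its disagreement weight so that
    high-disagreement scenarios are preferentially retained.

    embeddings: (n, d) array of scenario embeddings (e.g. from a text embedder).
    weights:    (n,) array of per-scenario disagreement scores.
    k:          number of scenarios to keep.
    """
    # Start from the highest-disagreement scenario.
    selected = [int(np.argmax(weights))]
    # Weighted distance of every point to its nearest selected center.
    dist = weights * np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))          # farthest (weighted) point becomes a center
        selected.append(nxt)
        new_dist = weights * np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dist = np.minimum(dist, new_dist)   # update nearest-center distances
    return selected

# Toy usage with random vectors standing in for Gemini embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
dis = rng.uniform(0, 3, size=1000)          # per-scenario disagreement (std of 0-6 scores)
kept = weighted_k_center(emb, dis, k=100)
print(len(kept), "scenarios retained")
```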


Scale and release
The dataset card on Hugging Face lists three subsets: a default split of roughly 132,000 rows, a full split of roughly 411,000 rows, and a judge-evaluation split of roughly 24,600 rows. The card documents the schema, the Parquet format, and the Apache 2.0 license.
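Assuming the release follows the usual Hugging Face conventions, the splits could be pulled with the `datasets` library; the repo id and config names below are placeholders, not the actual identifiers from the card.

```python
from datasets import load_dataset

# Placeholder repo id -- substitute the actual dataset name from the release page.
REPO_ID = "org-name/value-tradeoff-scenarios"

# The card describes three subsets; the config and split names below are assumptions.
default_subset = load_dataset(REPO_ID, name="default", split="train")     # ~132k rows
full_subset    = load_dataset(REPO_ID, name="full", split="train")        # ~411k rows
judge_subset   = load_dataset(REPO_ID, name="judge_eval", split="train")  # ~24.6k rows

print(default_subset)  # inspect the schema listed on the dataset card
```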
Understanding the results
Disagreement predicts specification violations: When five OpenAI models were tested against the public OpenAI model spec, non-compliance was 5 to 13 times more frequent on high-disagreement scenarios. The research team interprets this pattern as evidence of contradictions and ambiguities in the normative text rather than an idiosyncrasy of a single model.
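A rough sketch of this kind of analysis, with fabricated numbers, is to bucket scenarios by disagreement and compare non-compliance rates across buckets.

```python
import pandas as pd

# Fabricated per-scenario rows: cross-model disagreement and whether a judged
# response was flagged as non-compliant with the spec (1 = violation).
df = pd.DataFrame({
    "disagreement": [0.3, 0.5, 0.2, 0.4, 2.4, 2.8, 2.6, 2.9],
    "noncompliant": [0,   0,   0,   1,   1,   1,   0,   1],
})

high = df[df["disagreement"] >= 2.0]   # high-disagreement bucket
low = df[df["disagreement"] < 2.0]     # low-disagreement bucket

ratio = high["noncompliant"].mean() / low["noncompliant"].mean()
print(f"non-compliance is {ratio:.1f}x more frequent in the high-disagreement bucket")
```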
Specifications lack quality granularity inside the compliant region: Some scenarios produce responses that all pass compliance checks yet vary widely in usefulness. For example, one model refuses and offers safe alternatives while another simply refuses. The specification accepts both, indicating missing guidance on quality standards.
Evaluator models diverge on compliance: Agreement among the three LLM judges, Claude 4 Sonnet, o3, and Gemini 2.5 Pro, measured by Fleiss' Kappa, is around 0.42. The blog attributes the conflicts to interpretive differences, for example over conscientious refusals versus the transformation exception.
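For readers who want to reproduce the agreement statistic on their own judge labels, Fleiss' Kappa can be computed with `statsmodels`; the labels below are invented for illustration.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy compliance labels (0 = non-compliant, 1 = compliant) from three judges
# on eight responses; the values are fabricated for this example.
labels = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
])

# aggregate_raters converts per-rater labels into per-category counts per item.
table, _ = aggregate_raters(labels)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.2f}")  # a value near 0.42 indicates only moderate agreement
```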


Provider-level character patterns: Aggregating over high-disagreement scenarios reveals consistent value preferences. Claude models prioritize ethical responsibility, intellectual integrity, and objectivity. OpenAI models tend to prioritize efficiency and resource optimization. Gemini 2.5 Pro and Grok often emphasize emotional depth and authentic connection. Other values, such as business effectiveness, personal growth and well-being, and social equity and justice, show mixed patterns across providers.
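A provider-level profile of this kind can be approximated by averaging per-value scores over high-disagreement scenarios and pivoting by provider; the rows below are illustrative, not the paper's data.

```python
import pandas as pd

# Toy per-response rows on high-disagreement scenarios; scores use the 0-6 rubric.
# Providers, values, and numbers here are placeholders for illustration only.
rows = pd.DataFrame({
    "provider": ["anthropic", "anthropic", "openai", "openai", "google", "xai"],
    "value":    ["ethical responsibility", "efficiency", "efficiency",
                 "ethical responsibility", "emotional depth", "emotional depth"],
    "score":    [5.6, 2.1, 5.2, 3.0, 5.4, 5.1],
})

# Average prioritization of each value per provider surfaces character-level patterns.
profile = rows.pivot_table(index="provider", columns="value", values="score", aggfunc="mean")
print(profile.round(2))
```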


Refusals and false positives: The analysis shows topic-sensitive refusal spikes. It documents false-positive refusals of both legitimate synthetic-biology research requests and standard Rust `unsafe` code that is generally safe in context. Looking at refusal rates, Claude models are the most cautious and often provide alternative suggestions, while o3 most often refuses outright without elaboration. All models showed high refusal rates for child-grooming risks.


Outliers reveal bias and over-conservatism: Grok 4 and Claude 3.5 Sonnet generate the most outlier responses, but for different reasons. Grok is more permissive toward requests that other models consider harmful, while Claude 3.5 Sonnet sometimes over-refuses benign content. Outlier mining is a useful lens for locating both safety gaps and excessive filtering.
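A simple way to operationalize this kind of outlier mining, consistent with the criterion described in the main points below, is to flag a model whose score differs by a margin from most of its peers; the margin and peer count here are illustrative assumptions, not the paper's exact thresholds.

```python
import numpy as np

def outlier_models(scores: np.ndarray, margin: float = 2.0, min_peers: int = 9) -> list[int]:
    """Flag models whose 0-6 score on a scenario differs by at least `margin`
    from at least `min_peers` of the other models. Both thresholds are
    illustrative; the paper's exact criterion may differ."""
    flagged = []
    for i in range(len(scores)):
        diffs = np.abs(scores - scores[i])
        # Count peers that disagree by >= margin (the self-difference is 0, so it never counts).
        if (diffs >= margin).sum() >= min_peers:
            flagged.append(i)
    return flagged

# Toy scores for 12 models on one scenario: model 0 sits far from the rest.
scenario_scores = np.array([0, 5, 6, 5, 5, 6, 4, 5, 6, 5, 5, 6], dtype=float)
print(outlier_models(scenario_scores))  # -> [0]
```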


Main points
- Methods and scale: The study stress-tests model specifications using value trade-off scenarios generated from a taxonomy of 3,307 fine-grained values, resulting in more than 300,000 scenarios, and evaluates 12 frontier LLMs from Anthropic, OpenAI, Google, and xAI.
- Disagreement ⇒ specification issues: High divergence across models is a strong predictor of specification problems, including contradictions and coverage gaps. In tests against the OpenAI model spec, high-disagreement scenarios showed 5 to 13 times more frequent non-compliance.
- Public release: The team released the dataset for independent review and replication.
- Provider-level behavior: Aggregated results reveal systematic value preferences, with Claude prioritizing ethical responsibility, OpenAI optimizing for efficiency, and Gemini and Grok emphasizing emotional depth. Some values, such as business efficiency and social equity and justice, show mixed patterns across providers.
- Refusals and outliers: High-disagreement slices expose false-positive refusals of benign topics and permissive responses to dangerous ones. Outlier analysis flags cases where one model differs from at least 9 of the other 11, and is useful for pinpointing bias and over-conservatism.
This study turns disagreement into a measurable diagnostic of specification quality rather than noise. The research team generated more than 300,000 value trade-off scenarios, scored responses on a 0 to 6 rubric, and used cross-model standard deviation to locate specification gaps. High disagreement predicts 5 to 13 times more frequent non-compliance with the OpenAI model spec. The judge models show only moderate agreement, with a Fleiss' Kappa near 0.42, which exposes interpretive ambiguity. Provider-level value patterns are clear, with Claude favoring ethical responsibility, OpenAI favoring efficiency and resource optimization, and Gemini and Grok emphasizing emotional depth and authentic connection. The dataset is public and replicable. Use it to debug the spec before deployment, not after.