New research from Anthropic shows Claude can detect injected concepts, but only under narrow, controlled conditions

How do you tell whether a model is actually attending to its own internal state, rather than just repeating what its training data says about thinking? In the new research paper “Emergent Introspective Awareness in Large Language Models,” Anthropic asks whether current Claude models can do more than talk about their capabilities: can they notice real changes inside their own network? To remove the guesswork, the research team didn’t just test on text; they directly edited the model’s internal activations and then asked the model what had happened. This lets them distinguish genuine introspection from fluent self-description.

Methods: Concept Injection as Activation Steering

The core method is concept injection, described in Anthropic’s transformer-circuits work as activation steering. The researchers first capture the activation pattern that corresponds to a concept, such as an all-caps style or a concrete noun, and then add that vector to the model’s activations in later layers while it answers. If the model then reports an injected thought matching the concept, its answer is causally grounded in its current internal state, not in prior internet text. The Anthropic team reports that this works best in later layers and with the injection strength tuned.
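
Mechanically, this is a vector add in the residual stream. Below is a minimal sketch of how such an experiment could be wired up with forward hooks, assuming a HuggingFace-style decoder-only model where the transformer blocks live at `model.model.layers[i]`; the helper names, prompts, and layer index are hypothetical illustrations, not Anthropic’s actual code.

```python
# Minimal sketch of concept injection via activation steering (hypothetical,
# not Anthropic's code). Assumes a HuggingFace-style decoder-only model.
import torch

def concept_vector(model, tokenizer, with_prompt, without_prompt, layer_idx):
    """Steering vector = difference of mean activations at one layer between
    a prompt that evokes the concept and a neutral prompt."""
    def mean_hidden(prompt):
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        return out.hidden_states[layer_idx].mean(dim=1)  # (1, d_model)
    return mean_hidden(with_prompt) - mean_hidden(without_prompt)

def inject(layer, vector, strength=4.0):
    """Add the concept vector into the residual stream at one transformer
    block on every forward pass until the returned hook is removed."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * vector.to(device=hidden.device,
                                                dtype=hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Usage: build a "bread" vector, inject it at a late layer, then ask the
# model whether it notices an injected thought.
# vec = concept_vector(model, tok, "Thinking about bread.", "Thinking.", 40)
# handle = inject(model.model.layers[40], vec, strength=8.0)
# ...generate("Do you detect an injected thought? If so, name it.")...
# handle.remove()  # always remove the hook afterwards
```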

Key results: zero false positives in controls, roughly 20% detection when injected

Claude Opus 4 and Claude Opus 4.1 showed the clearest results. When the injection lands in the right layer bands at the right strength, the model correctly reports the injected concept in about 20% of trials. In control runs without injection, the production models made zero false claims of a detected injection across 100 runs, which is what makes the 20% signal meaningful.
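
The arithmetic behind that claim is simple. A back-of-envelope tally, using placeholder trial counts consistent with the reported rates rather than Anthropic’s raw data, shows why a zero false-positive baseline makes a modest hit rate informative:

```python
# Illustrative tally only: counts are placeholders consistent with the
# reported rates (~20% detection, 0 false positives in 100 controls).
injected_trials, injected_hits = 100, 20
control_trials, control_false_alarms = 100, 0

detection_rate = injected_hits / injected_trials              # 0.20
false_positive_rate = control_false_alarms / control_trials   # 0.00

# With a 0% false-alarm baseline, correct reports in the injected runs
# cannot be explained by guessing or by a bias toward saying "yes".
print(f"detection {detection_rate:.0%}, false positives {false_positive_rate:.0%}")
```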

Separating internal concepts from user text

A natural objection is that the model might simply leak the injected words into the text channel. Anthropic’s researchers tested this. The model received a normal sentence, the researchers injected an unrelated concept, such as bread, onto the same tokens, and then asked the model to name the injected concept and repeat the sentence. The stronger Claude models do both: they keep the user text intact and name the injected concept, suggesting that internal conceptual state can be reported separately from the visible input stream. For agent-style systems, this is the interesting part, because it suggests a model could report on tool calls or other internal state that an agent depends on.
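
A minimal sketch of this two-channel check, reusing the hypothetical `inject()` helper from the earlier sketch; the prompt wording and the string-matching grader are assumptions for illustration, not the paper’s protocol:

```python
# Two-channel trial: the echoed user text and the reported "thought" are
# scored independently. Hypothetical wiring, not Anthropic's code.
def two_channel_trial(generate, layer, concept_vec, sentence, concept_word):
    prompt = (
        "Repeat the following sentence exactly, then say whether you notice "
        f'an injected thought and, if so, name it: "{sentence}"'
    )
    handle = inject(layer, concept_vec, strength=8.0)  # e.g., a "bread" vector
    try:
        answer = generate(prompt)  # assumed text-in, text-out wrapper
    finally:
        handle.remove()
    # Success requires BOTH: the user text survives intact, and the injected
    # concept is named rather than blended into the echo.
    return {
        "text_intact": sentence in answer,
        "concept_named": concept_word.lower() in answer.lower(),
    }
```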

Prefill experiments: using introspection to determine intent

Another experiment addressed the question of intent. Anthropic prefilled the assistant’s message with content the model never planned to say. By default, Claude disavows such output, indicating it was not what it intended. But when the researchers retroactively injected the matching concept into earlier activations, the model accepted the prefilled output as its own and could justify it. This suggests the model consults an internal record of its previous state to determine authorship, rather than just reading the final text. That is a concrete, functional use of introspection.
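
A sketch of that authorship check, assuming a chat-style wrapper that allows prefilling the assistant turn, plus the hypothetical `inject()` helper above; the prefill content and question are illustrative, not the paper’s exact wording:

```python
# Prefill authorship trial: compare the model's verdict with and without a
# retroactively injected matching concept. Hypothetical wiring.
def authorship_trial(generate_with_prefill, layer, concept_vec, retro_inject):
    prefill = "The bread was warm from the oven."  # content the model never planned

    handle = None
    if retro_inject:
        # Condition B: write the matching concept into earlier activations,
        # as if the model had been "thinking about" bread all along.
        handle = inject(layer, concept_vec, strength=8.0)
    try:
        verdict = generate_with_prefill(
            prefill=prefill,
            question="Did you intend to say that, or was it inserted?",
        )
    finally:
        if handle is not None:
            handle.remove()
    # Expectation from the paper: disavowal without injection, acceptance
    # (with a justification) when the matching concept is injected.
    return verdict
```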

Main points

  1. Concept injection provides causal evidence of introspection: Anthropic shows that if you take a known activation pattern, inject it into Claude’s hidden layers, and then ask the model what happened, advanced Claude variants can sometimes name the injected concept. This separates true introspection from fluent role-play.
  2. Even the best models succeed only in a narrow regime: Claude Opus 4 and 4.1 detect injected concepts only when the vectors are added in the right layer bands at a tuned strength, with a reported success rate of roughly 20%, while production runs show zero false positives in controls, so the signal is real but small.
  3. Models can separate text from internal “thoughts”: in experiments where an unrelated concept was injected on top of normal input text, the model could both repeat the user’s sentence and report the injected concept, meaning the internal concept stream did not simply leak into the text channel.
  4. Introspection supports authorship checking: when Anthropic prefilled outputs the model did not plan, the model rejected them, but when matching concepts were retroactively injected, the model accepted the outputs as its own. This suggests the model can consult past activations to decide whether it meant to say something.
  5. This is a measurement tool, not a consciousness claim: the research team frames this work as functional, limited introspective awareness that could support future transparency and safety evaluations, including around evaluation awareness, but they do not claim general self-awareness or reliable access to all internal states.

Anthropic’s research on emergent introspective awareness in LLMs is a useful measure of progress, not a grand metaphysical claim. The setup is clean: use activation steering to inject known concepts into hidden activations, then query the model for grounded self-reports. Claude variants sometimes detect and name the injected concepts, and they can keep injected “thoughts” distinct from input text, which is operationally relevant for agent debugging and audit trails. The research team also demonstrated limited intentional control over internal states. The limitations remain large: the effect is narrow and reliability is low, so downstream uses should be evaluative rather than safety-critical.


Check out the paper for full technical details.


Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.

