Using AI hallucinations to evaluate image realism

New research from Russia proposes an unconventional method to detect unrealistic AI-generated images – not by improving the accuracy of large vision-language models (LVLMs), but by deliberately exploiting their tendency to hallucinate.
The new approach uses LVLMs to extract multiple 'atomic facts' about an image, and then applies Natural Language Inference (NLI) to systematically measure contradictions between those statements – effectively turning the model's flaws into a diagnostic tool for detecting images that defy common sense.
Two images from the WHOOPS! dataset, alongside statements automatically generated by an LVLM. The left-hand image is realistic, producing a consistent description, while the unusual right-hand image causes the model to hallucinate, producing contradictory or false statements. Source: https://arxiv.org/pdf/2503.15948
Asked to assess the realism of the second image, the LVLM can see that something is wrong, since the depicted camel has three humps, which is anatomically impossible.
However, the LVLM initially conflates >2 humps with >2 animals, since that is the only way one would ever see three humps in a 'camel picture'. It then goes on to hallucinate something even less likely than three humps (i.e., 'two heads'), and never details the very thing that appears to have raised its suspicions – the impossible extra hump.
The researchers behind the new work found that LVLMs can perform this kind of evaluation natively, and to a standard comparable with (or better than) models that have been fine-tuned for the task. Since fine-tuning is complicated, expensive, and rather brittle in terms of downstream applicability, the discovery of a native use for one of the biggest stumbling blocks in the current AI revolution is a refreshing twist on the general trends in the literature.
Public evaluation
The authors contend that the significance of this method lies in its compatibility with open source frameworks. While advanced, high-investment models such as ChatGPT may (as the paper acknowledges) produce better results on this task, the real practical value of the literature for most of us (and especially for the hobbyist and VFX communities) is the possibility of incorporating and building on new breakthroughs in local implementations; anything consigned instead to proprietary commercial API systems is subject to withdrawal, arbitrary price rises, and censorship policies more likely to reflect a corporation's concerns than the user's needs and responsibilities.
The new paper is titled Don't Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts, and comes from five researchers across the Skolkovo Institute of Science and Technology (Skoltech), the Moscow Institute of Physics and Technology, and the Russian companies MTS AI and AIRI. The work has an accompanying GitHub page.
Method
The authors use the Israeli/US WHOOPS! dataset for the project:

Examples of impossible images from the WHOOPS! dataset. It is notable how these images combine plausible elements, and how their impossibility must be inferred from the conjunction of these incompatible facets. Source: https://whoops-benchmark.github.io/
The dataset comprises 500 synthetic images and over 10,874 annotations, specifically designed to test AI models' commonsense reasoning and compositional understanding. It was created in collaboration with designers who were asked to generate challenging images via text-to-image systems such as Midjourney and the DALL-E series – producing scenarios that would be difficult or impossible to capture naturally:

Further examples from the WHOOPS! dataset. Source: https://huggingface.co/datasets/nlphuji/whoops
The new method works in three stages: first, the LVLM (specifically LLaVA-v1.6-Mistral-7B) is prompted to generate multiple simple statements – called 'atomic facts' – describing an image. These statements are generated using Diverse Beam Search, ensuring variability across the outputs.

Diverse Beam Search produces a greater variety of caption options by optimizing for a diversity-augmented objective. Source: https://arxiv.org/pdf/1610.02424
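As a rough sketch of this first stage, the snippet below prompts a LLaVA-v1.6-Mistral-7B checkpoint through Hugging Face Transformers and decodes several diverse-beam-search candidates; the checkpoint name, prompt wording, and generation settings are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch only: generating candidate "atomic fact" descriptions with Diverse Beam Search.
# The checkpoint, prompt and hyperparameters are assumptions, not the paper's exact setup.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed Hugging Face port of LLaVA-v1.6-Mistral-7B
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("test_image.jpg")
prompt = "[INST] <image>\nDescribe this image as a list of simple factual statements. [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Diverse Beam Search: beams are split into groups and a diversity penalty
# pushes the groups apart, so the returned candidates differ from one another.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    num_beams=8,
    num_beam_groups=4,
    diversity_penalty=1.0,
    num_return_sequences=8,
)
candidates = [processor.decode(seq, skip_special_tokens=True) for seq in outputs]
```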
Next, each generated statement is systematically compared to every other statement using a Natural Language Inference model, which assigns scores reflecting whether pairs of statements entail, contradict, or are neutral toward one another.
Contradictions represent hallucinations or unrealistic elements in an image:

Schema for the detection pipeline.
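A minimal sketch of this pairwise comparison is given below, assuming the cross-encoder/nli-deberta-v3-large checkpoint (the strongest NLI model named in the paper) and treating each ordered pair of facts as a premise/hypothesis input; the loop and label handling are illustrative, not the authors' released code.

```python
# Sketch only: scoring every ordered pair of atomic facts with an NLI cross-encoder.
from itertools import permutations
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_id = "cross-encoder/nli-deberta-v3-large"  # assumed checkpoint for nli-deberta-v3-large
tokenizer = AutoTokenizer.from_pretrained(nli_id)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_id).eval()

def pairwise_nli_scores(facts):
    """Return one dict of class probabilities (entailment/contradiction/neutral) per ordered pair."""
    scores = []
    for premise, hypothesis in permutations(facts, 2):
        enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = nli_model(**enc).logits.softmax(dim=-1).squeeze(0)
        # Read class names from the model config rather than hard-coding a label order.
        scores.append({nli_model.config.id2label[i].lower(): probs[i].item()
                       for i in range(probs.shape[0])})
    return scores
```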
Finally, the method aggregates these pairwise NLI scores into a single 'realism score' that quantifies the overall coherence of the generated statements.
The researchers explored several aggregation methods, settling on a clustering-based approach: the K-means algorithm partitions the individual NLI scores into two clusters, and the centroid of the lower-valued cluster is then taken as the final metric.
Using two clusters aligns directly with the binary nature of the classification task, i.e., distinguishing realistic from unrealistic images. The logic is similar to simply selecting the lowest overall score; however, clustering allows the metric to represent the average contradiction across multiple facts rather than relying on a single outlier.
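One plausible reading of this aggregation step is sketched below using scikit-learn's KMeans; the per-pair value (entailment probability minus contradiction probability) is an assumption, since the description above only specifies that individual NLI scores are clustered and the lower cluster's centroid is kept.

```python
# Sketch only: cluster per-pair NLI scores and keep the lower cluster's centroid.
import numpy as np
from sklearn.cluster import KMeans

def realism_score(pair_scores):
    """pair_scores: dicts with 'entailment'/'contradiction'/'neutral' probabilities.
    Lower output suggests a less coherent (more 'impossible') image."""
    # Assumed per-pair value: entailment minus contradiction, so contradictory pairs score low.
    values = np.array([[p["entailment"] - p["contradiction"]] for p in pair_scores])
    if len(values) < 2:
        return float(values.mean()) if len(values) else 0.0
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
    # The centroid of the lower-valued cluster reflects the average contradiction
    # across many fact pairs rather than a single outlier.
    return float(km.cluster_centers_.min())
```

Under this reading, a realistic image whose facts mostly entail one another scores near 1, while an image that provokes contradictory statements pulls the lower centroid toward -1, and a simple threshold (or comparison within a realistic/weird pair) yields the final decision.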
Data and testing
The researchers tested their system against the WHOOPS! baseline benchmark, using a rotating test split (i.e., cross-validation). Models tested were BLIP2 FlanT5-XL and BLIP2 FlanT5-XXL on the splits, and BLIP2 FlanT5-XXL in a zero-shot setting (i.e., without additional training).
For an instruction-following baseline, the authors prompted the LVLMs with the phrase 'Is this unusual? Please briefly explain in a short sentence', which prior research has found effective for identifying unrealistic images.
The models evaluated were LLaVA 1.6 Mistral 7B, LLaVA 1.6 Vicuna 13B, and InstructBLIP in two sizes (7/13 billion parameters).
The evaluation procedure centered on 102 pairs of realistic and unrealistic ('weird') images, each pair consisting of a normal image and a counterpart that defies common sense.
Three human annotators labeled the images, reaching 92% consensus, indicating strong human agreement on what constitutes 'weirdness'. The accuracy of the evaluation method was measured by its ability to correctly distinguish realistic from unrealistic images.
The system was evaluated using three-fold cross-validation with a fixed random seed. The authors tuned the weights for entailment scores (logically agreeing statements) and contradiction scores (logically conflicting statements) on the training splits, while the 'neutral' scores were fixed at zero. The final accuracy was computed as the average across all test splits.
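Expressed as code, the per-pair weighted score implied by this setup might look like the following; the function name and signature are hypothetical, with only the zeroed neutral weight taken from the description above.

```python
# Sketch only: weighted per-pair score with the neutral class weight fixed at zero.
# w_entail and w_contra would be tuned on the training portion of each of the three folds.
def weighted_pair_score(pair, w_entail, w_contra):
    return w_entail * pair["entailment"] + w_contra * pair["contradiction"]
```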

Comparison of different NLI models and aggregation methods across five subsets of generated facts, measured by accuracy.
Regarding the initial results shown above, the paper states:
‘The [‘clust’] method stands out as one of the best performing. This implies that aggregating all contradiction scores is crucial, rather than focusing only on extreme values. In addition, the largest NLI model (nli-deberta-v3-large) outperforms all the others across all aggregation methods, suggesting that it captures the essence of the problem more effectively.’
The authors found that the best weights consistently favored contradiction over entailment, suggesting that contradictions are more informative for distinguishing unrealistic images. Their method outperformed all other zero-shot approaches tested, approaching the performance of the fine-tuned BLIP2 model:

Performance of various methods on the WHOOPS! benchmark. Fine-tuned (ft) methods appear at the top, while zero-shot (ZS) methods are listed below. Model size indicates the number of parameters, with accuracy used as the evaluation metric.
They also note, surprisingly, that InstructBLIP outperformed comparably-sized LLaVA models given the same prompt. While recognizing GPT-4o's superior accuracy, the paper emphasizes the authors' preference for demonstrating practical open source solutions, and its claim of novelty in explicitly exploiting hallucinations as a diagnostic tool seems reasonable.
Conclusion
However, the authors acknowledge their project's debt to the 2024 FaithScore outing, a collaboration between the University of Texas at Dallas and Johns Hopkins University.

An illustration of how FaithScore evaluation works: first, descriptive statements within an LVLM-generated answer are identified; next, these statements are broken down into individual atomic facts; finally, the atomic facts are compared against the input image to verify their accuracy. Underlined text highlights objective descriptive content, while blue text indicates hallucinated statements, allowing FaithScore to deliver an interpretable measure of factual correctness. Source: https://arxiv.org/pdf/2311.01477
FaithScore measures the faithfulness of LVLM-generated descriptions by verifying consistency against image content, whereas the new paper's method explicitly exploits LVLM hallucinations, using Natural Language Inference over generated facts to detect unrealistic images.
Naturally, the new work depends on the quirks of current language models and their disposition to hallucinate. If model development should ever produce a model entirely free of hallucination, even the general principles of the new work would no longer apply; however, this remains a distant prospect.
First published on Tuesday, March 25, 2025