AmbiGraph-Eval: A benchmark for resolving ambiguity in graph queries
Semantic parsing converts natural language into formal query languages such as SQL or Cypher, letting users interact with databases more intuitively. Natural language, however, is inherently ambiguous and often admits multiple valid interpretations, while query languages demand precision. Ambiguity in table queries has been explored, but graph databases pose additional challenges because of their interconnected structure: natural-language queries over nodes and relationships frequently admit multiple readings due to the structural richness and diversity of graph data. For example, a query such as “find the best-rated restaurant” can yield different results depending on whether it is interpreted as the highest average rating or the highest total review score.
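To make the restaurant example concrete, here is a minimal sketch of how one ambiguous question maps to two equally plausible Cypher queries. The labels and property names (`Restaurant`, `Review`, `rating`) are illustrative assumptions, not the benchmark's actual schema:

```python
# Hypothetical illustration: one ambiguous question, two valid Cypher readings.
# Schema names below are assumptions, not taken from AmbiGraph-Eval.

question = "Which is the best-rated restaurant?"

# Reading 1: rank restaurants by average review rating.
cypher_avg = """
MATCH (r:Restaurant)<-[:REVIEWS]-(v:Review)
RETURN r.name, avg(v.rating) AS score
ORDER BY score DESC LIMIT 1
""".strip()

# Reading 2: rank restaurants by total (summed) review score.
cypher_sum = """
MATCH (r:Restaurant)<-[:REVIEWS]-(v:Review)
RETURN r.name, sum(v.rating) AS score
ORDER BY score DESC LIMIT 1
""".strip()

# Both queries are syntactically valid but can return different answers:
# a place with a few perfect ratings wins on average, while a popular
# place with many good ratings wins on total score.
interpretations = {"average": cypher_avg, "total": cypher_sum}
print(len(interpretations))  # 2 distinct, equally plausible queries
```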
Ambiguity poses a serious risk in interactive systems because semantic-parsing failures can produce queries that diverge from the user’s intent. Such errors can trigger unnecessary data retrieval and computation, wasting time and resources. In high-stakes contexts such as real-time decision making, these issues degrade performance, raise operational costs, and reduce efficiency. LLM-based semantic parsing promises to handle complex, ambiguous queries by drawing on linguistic knowledge and interactive clarification. However, LLMs face a bias introduced by preference training: after training with human feedback, they may adopt annotators’ preferences, producing systematic misalignment with actual user intent.
Researchers from Hong Kong Baptist University, the National University of Singapore, Beverle, the University of Berlin, and Ant Group have proposed an approach to resolving ambiguity in graph queries. They formalize the concept of ambiguity in graph-database queries and divide it into three types: attribute, relationship, and attribute-relationship ambiguity. The researchers introduce AmbiGraph-Eval, a benchmark containing 560 ambiguous queries with corresponding graph-database samples for evaluating model performance. They test nine LLMs, analyze their ability to resolve ambiguity, and identify areas for improvement. The study shows that reasoning ability provides only limited advantages, underscoring the importance of understanding graph ambiguity and mastering query syntax.
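The three-way taxonomy can be sketched as a small lookup table. The example questions attached to each category are illustrative assumptions, not items from the benchmark:

```python
# A minimal sketch of the paper's three ambiguity categories.
# The example questions are invented for illustration only.

ambiguity_types = {
    "attribute": (
        "Which node property is meant? "
        "e.g. 'oldest movie' could mean release year or acquisition date"
    ),
    "relationship": (
        "Which edge or path is meant? "
        "e.g. 'movies by X' could traverse DIRECTED or ACTED_IN"
    ),
    "attribute-relationship": (
        "Both at once: the property and the path are underspecified"
    ),
}

print(len(ambiguity_types))  # 3 categories
```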
The AmbiGraph-Eval benchmark is designed to evaluate the ability of LLMs to generate syntactically correct and semantically appropriate graph queries, such as Cypher, from ambiguous natural-language input. Dataset construction proceeds in two stages: data collection and human review. Ambiguous prompts are obtained through three methods: direct extraction from graph databases, synthesis from explicit data using LLMs, and generation of entirely new scenarios by prompting LLMs. To evaluate performance, the researchers tested four closed-source LLMs (e.g., GPT-4, Claude-3.5-Sonnet) and four open-source LLMs (e.g., Qwen-2.5, Llama-3.1). Evaluation was run via API calls or on 4× NVIDIA A40 GPUs.
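A benchmark like this could be scored with a simple loop over items. The sketch below assumes a hypothetical item format (question, graph sample, set of acceptable queries) and a string-match scoring rule; none of these names come from the paper's actual code:

```python
# A hedged sketch of benchmark scoring, assuming a simple item format.
# Field names and the scoring rule are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    question: str                  # ambiguous natural-language prompt
    graph_sample: str              # serialized graph-database sample
    valid_queries: list = field(default_factory=list)  # acceptable readings

def score(item: BenchmarkItem, model_output: str) -> bool:
    """Count an output as correct if it matches any acceptable reading."""
    def norm(q: str) -> str:
        return " ".join(q.split()).lower()  # collapse whitespace, ignore case
    return norm(model_output) in {norm(q) for q in item.valid_queries}

item = BenchmarkItem(
    question="Which is the best-rated restaurant?",
    graph_sample="(:Restaurant)<-[:REVIEWS]-(:Review {rating: int})",
    valid_queries=[
        "MATCH (r:Restaurant)<-[:REVIEWS]-(v:Review) "
        "RETURN r.name, avg(v.rating) AS s ORDER BY s DESC LIMIT 1",
    ],
)
print(score(item, item.valid_queries[0]))  # True
```

In practice the paper's evaluation is richer than string matching (it must judge both syntax and semantics), but the record-and-score loop above is the general shape of such harnesses.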
Zero-shot evaluation on the AmbiGraph-Eval benchmark revealed clear differences between models when parsing ambiguous graph queries. On the attribute-ambiguity task, o1-mini excels in the same-entity (SE) setting, with GPT-4o and Llama-3.1 also performing well; GPT-4o outperforms the others on cross-entity (CE) tasks, showing stronger reasoning across entities. For relationship ambiguity, Llama-3.1 leads, while GPT-4o shows limitations on the SE task but performs well on the CE task. Attribute-relationship ambiguity is the most challenging: Llama-3.1 performs best on SE tasks, while GPT-4o leads on CE tasks. Overall, models struggle more with this combined ambiguity than with isolated attribute or relationship ambiguity.
In summary, the researchers introduced AmbiGraph-Eval, a benchmark for evaluating LLMs’ ability to resolve ambiguity in graph-database queries. Evaluating nine models reveals significant challenges in generating accurate Cypher statements, with strong reasoning skills providing only limited benefit. Core challenges include recognizing ambiguous intent, generating valid syntax, interpreting graph structures, and performing numerical aggregation, with ambiguity detection and syntax generation emerging as the main bottlenecks. To address these problems, future research could apply methods such as syntax-aware prompting and explicit ambiguity signals to strengthen models’ ambiguity resolution and syntax handling.
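One way to realize "explicit ambiguity signals" is to have the prompt instruct the model to enumerate interpretations before committing to a query. The template below is a hedged sketch of that idea; its wording is an assumption, not a template prescribed by the paper:

```python
# A hypothetical prompt template with an explicit ambiguity signal:
# the model must list plausible readings before emitting Cypher.
# The wording is an assumption, not taken from the paper.

def build_prompt(question: str, schema: str) -> str:
    return (
        "Graph schema:\n" + schema + "\n\n"
        "The question below may be ambiguous. First, list the plausible "
        "interpretations. Then pick one, justify it, and output a single "
        "syntactically valid Cypher query for it.\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "Which is the best-rated restaurant?",
    "(:Restaurant)<-[:REVIEWS]-(:Review {rating: int})",
)
print("Cypher" in prompt)  # True
```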
Check out the technical paper. Also check out our tutorials, code, and notebooks on our GitHub page. Feel free to follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to express complex AI concepts in a clear and accessible way.