# Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning

Xingyu Tan  
University of New South Wales  
Data61, CSIRO  
Sydney, Australia  
xingyu.tan@unsw.edu.au

Xiaoyang Wang*  
University of New South Wales  
Sydney, Australia  
xiaoyang.wang1@unsw.edu.au

Qing Liu Data61, CSIRO Sydney, Australia q.liu@data61.csiro.au

Xiwei Xu  
Data61, CSIRO  
Sydney, Australia  
xiwei.xu@data61.csiro.au

Xin Yuan  
Data61, CSIRO  
Sydney, Australia  
xin.yuan@data61.csiro.au

Wenjie Zhang  
University of New South Wales  
Sydney, Australia  
wenjie.zhang@unsw.edu.au

# Abstract

Large Language Models (LLMs) have achieved impressive results in various tasks but struggle with hallucination problems and lack of relevant knowledge, especially in deep complex reasoning and knowledge-intensive tasks. Knowledge Graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges like handling multi-hop reasoning, multi-entity questions, and effectively utilizing graph structures. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration, which combines the inherent knowledge of LLMs with factual knowledge from KGs. In order to improve the efficiency, PoG prunes irrelevant information from the graph exploration first and introduces efficient three-step pruning techniques that incorporate graph structures, LLM prompting, and a pre-trained language model (e.g., SBERT) to effectively narrow down the explored candidate paths. This ensures all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG innovatively utilizes graph structure to prune the irrelevant noise and represents the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets demonstrate PoG outperforms the state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an average accuracy improvement of  $18.9\%$ . Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to  $23.9\%$ .

*Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

WWW '25, Sydney, NSW, Australia

© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.  
ACM ISBN 979-8-4007-1274-6/25/04

https://doi.org/10.1145/3696410.3714892

# CCS Concepts

- Information systems  $\rightarrow$  Question answering.

# Keywords

Large Language Models; Knowledge Graph; Knowledge Graph Question Answering; Retrieval-Augmented Generation

# ACM Reference Format:

Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, and Wenjie Zhang. 2025. Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning. In Proceedings of the ACM Web Conference 2025 (WWW '25), April 28-May 2, 2025, Sydney, NSW, Australia. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3696410.3714892

# 1 Introduction

Large Language Models (LLMs) have demonstrated remarkable performance in various tasks [4, 6, 8, 16, 41]. These models leverage pre-training techniques by scaling to billions of parameters and training on extensive, diverse, and unlabeled data [33, 41]. Despite these impressive capabilities, LLMs face two well-known challenges. First, they struggle with deep and responsible reasoning when tackling complex tasks [19, 32, 39]. Second, the substantial cost of training makes it difficult to keep models updated with the latest knowledge [37, 46], leading to errors when answering questions that require specialized information not included in their training data. For example, in Figure 1(a), though models like GPT can generate reasonable answers for knowledge-specific questions, these answers may be incorrect due to outdated information or hallucination of reasoning on LLM inherent Knowledge Base (KB).

To deal with the problems of error reasoning and knowledge gaps, the plan-retrieval-answering method has been proposed [25, 27, 54]. In this approach, LLMs are prompted to decompose complex reasoning tasks into a series of sub-tasks, forming a plan. Simultaneously, external KBs are retrieved to answer each step of the plan. However, this method still has the issue of heavily relying on the reasoning abilities of LLMs rather than the faithfulness of the retrieved knowledge. The generated reasoning steps guide information selection, but answers are chosen based on the LLM's interpretation of the retrieved knowledge rather than on whether the selection leads to a correct and faithful answer.

![](images/3ef8c49a2b6412865edc61f5c0dfe64ac42585a32ea0f73c39bf8102f2192f29.jpg)  
Figure 1: Representative workflow of four LLM reasoning paradigms.

To address these challenges, incorporating external knowledge sources like Knowledge Graphs (KGs) is a promising solution to enhance LLM reasoning [26, 27, 29, 37]. KGs offer abundant factual knowledge in a structured format, serving as a reliable source to improve LLM capabilities. Knowledge Graph Question Answering (KGQA) serves as an approach for evaluating the integration of KGs with LLMs, which requires machines to answer natural language questions by retrieving relevant facts from KGs. These approaches typically involve: (1) identifying the initial entities from the question, and (2) iteratively retrieving and refining inference paths until sufficient evidence has been obtained. Despite their success, they still face challenges such as handling multi-hop reasoning problems, addressing questions with multiple topic entities, and effectively utilizing the structural information of graphs.

Challenge 1: Multi-hop reasoning problem. Current methods [15, 28, 37, 50], such as the ToG model presented in Figure 1(b), begin by exploring from each topic entity, with LLMs selecting connected knowledge triples like (France, contained_by, Europe). This process relies on the LLM's inherent understanding of these triples. However, focusing on one-hop neighbors can result in plausible but incorrect answers and prematurely exclude correct ones, especially when multi-hop reasoning is required. Additionally, multi-hop reasoning introduces significant computational overhead, making efficient pruning essential, especially in dense and large KGs.

Challenge 2: Multi-entity question. As shown in Figure 1(b), existing work [15, 28, 37, 50] typically explores KG for each topic entity independently. When a question involves multiple entities, these

entities are examined in separate steps without considering their interconnections. This approach can result in a large amount of irrelevant information in the candidate set that does not connect to the other entities in the question, leading to suboptimal results.

Challenge 3: Utilizing graph structure. Existing methods [7, 13, 46] often overlook the inherent graph structures when processing retrieved subgraphs. For example, the MindMap model in Figure 1(c) utilizes LLMs to generate text-formatted subgraphs from KG triples, converting them into graph descriptions that are fed back into the LLM to produce answers. This textual approach overlooks the inherent structural information of graphs and can overwhelm the LLM when dealing with large graphs. Additionally, during KG information selection, most methods use in-context learning by feeding triples into the LLM, ignoring the overall graph structure.

Contributions. In this paper, we introduce a novel method, Pathsover-Graph (PoG). Unlike previous studies that utilize knowledge triples for retrieval [28, 37], PoG employs knowledge reasoning paths, that contain all the topic entities in a long reasoning length, as a retrieval-augmented input for LLMs. The paths in KGs serve as logical reasoning chains, providing KG-supported, interpretable reasoning logic that addresses issues related to the lack of specific knowledge background and unfaithful reasoning paths.

To address multi-hop reasoning problem, as shown in Figure 1(d), PoG first performs question analysis, to extract topic entities from questions. Utilizing these topic entities, it decomposes the complex question into sub-questions and generates an LLM thinking indicator termed "Planning". This planning not only serves as an answering strategy but also predicts the implied relationship depths between the answer and each topic entity. The multi-hop paths are then explored starting from a predicted depth, enabling a dynamic search process. Previous approaches using planning usually retrieve information from scratch, which often confuses LLMs with source neighborhood-based semantic information. In contrast, our method ensures that LLMs follow accurate reasoning paths that directly lead to the answer.

To address multi-entity questions, PoG employs a three-phase exploration process to traverse reasoning paths from the retrieved question subgraph. All paths must contain all topic entities in the same order as they occur in the LLM thinking indicator. In terms of reasoning paths in KGs, all paths are inherently logical and faithful. Each path potentially contains one possible answer and serves as the interpretable reasoning logic. The exploration leverages the inherent knowledge of both LLM and KG.

To effectively utilize graph structure, PoG captures the question subgraph by expanding topic entities to their maximal depth neighbors, applying graph clustering and reduction to reduce graph search costs. In the path pruning phase, we select possible correct answers from numerous candidates. All explored paths undergo a three-step beam search pruning, integrating graph structures, LLM prompting, and a pre-trained language understanding model (e.g., BERT) to ensure effectiveness and efficiency. Additionally, inspired by the Graph of Thought (GoT) [4], to reduce LLM hallucination, PoG prompts LLMs to summarize the obtained Top- $W_{\text{max}}$  paths before evaluating the answer, where  $W_{\text{max}}$  is a user-defined maximum width in the path pruning phase. In summary, the advantage of PoG can be abbreviated as:

- Dynamic deep search: Guided by LLMs, PoG dynamically extracts multi-hop reasoning paths from KGs, enhancing LLM capabilities in complex knowledge-intensive tasks.  
- Interpretable and faithful reasoning: By utilizing highly question-relevant knowledge paths, PoG improves the interpretability of LLM reasoning, enhancing the faithfulness and question-relatedness of LLM-generated content.  
- Efficient pruning with graph structure integration: PoG incorporates efficient pruning techniques in both the KG and reasoning paths to reduce computational costs, mitigate LLM hallucinations caused by irrelevant noise, and effectively narrow down candidate answers.  
- Flexibility and effectiveness: a) PoG is a plug-and-play framework that can be seamlessly applied to various LLMs and KGs. b) PoG allows frequent knowledge updates via the KG, avoiding the expensive and slow updates required for LLMs. c) PoG reduces the LLMs token usage by over  $50\%$  with only a  $\pm 2\%$  difference in accuracy compared to the best-performing strategy. d) PoG achieves state-of-the-art results on all the tested KGQA datasets, outperforming the strong baseline ToG by an average of  $18.9\%$  accuracy using both GPT-3.5 and GPT-4. Notably, PoG with GPT-3.5 can outperform ToG with GPT-4 by up to  $23.9\%$ .

# 2 Related Work

KG-based LLM reasoning. KGs provide structured knowledge valuable for integration with LLMs [29]. Early studies [25, 27, 30, 52] embed KG knowledge into neural networks during pre-training or fine-tuning, but this can reduce explainability and hinder efficient knowledge updating [29]. Recent methods combine KGs with LLMs by converting relevant knowledge into textual prompts, often ignoring structural information [29, 46]. Advanced works [17, 28, 37] involve LLMs directly exploring KGs, starting from an initial entity and iteratively retrieving and refining reasoning paths until the LLM decides the augmented knowledge is sufficient. However, by starting from a single vertex and ignoring the question's position within the KG's structure, these methods overlook multiple topic entities and the explainability provided by multi-entity paths.

Reasoning with LLM prompting. LLMs have shown significant potential in solving complex tasks through effective prompting strategies. Chain of Thought (CoT) prompting [45] enhances reasoning by following logical steps in few-shot learning. Extensions like Auto-CoT [53], Complex-CoT [10], CoT-SC [44], Zero-Shot CoT [21], ToT [48], and GoT [4] build upon this approach. However, these methods often rely solely on knowledge present in training data, limiting their ability to handle knowledge-intensive or deep reasoning tasks. To solve this, some studies integrate external KBs using plan-and-retrieval methods such as CoK [25], RoG [27], and ReAct [49], decomposing complex questions into subtasks to reduce hallucinations. However, they may focus on the initial steps of sub-problems and overlook further steps of final answers, leading to locally optimal solutions instead of globally optimal ones. To address these deep reasoning challenges, we introduce dynamic multi-hop question reasoning. By adaptively determining reasoning depths for different questions, we enable the model to handle varying complexities effectively.

KG information pruning. Graphs are widely used to model complex relationships among different entities [22, 23, 35, 40]. KGs

contain vast amounts of facts [14, 42, 47], making it impractical to involve all relevant triples in the context of the LLM due to high costs and potential noise [43]. Existing methods [17, 28, 37] typically identify initial entities and iteratively retrieve reasoning paths until an answer is reached, often treating the LLM as a function executor and relying on in-context learning or fine-tuning, which is expensive. Some works attempt to reduce pruning costs. KAPING [2] projects questions and triples into the same semantic space to retrieve relevant knowledge via similarity measures. KG-GPT [20] decomposes complex questions, matches, and selects the relevant relations with sub-questions to form evidence triples. However, these methods often overlook the overall graph structure and the interrelations among multiple topic entities, leading to suboptimal pruning and reasoning performance.

# 3 Preliminary

Consider a Knowledge Graph (KG)  $\mathcal{G}(\mathcal{E},\mathcal{R},\mathcal{T})$ , where  $\mathcal{E}$ ,  $\mathcal{R}$ , and  $\mathcal{T}$  represent the set of entities, relations, and knowledge triples, respectively. Each knowledge triple  $T\in \mathcal{T}$  encapsulates the factual knowledge in  $\mathcal{G}$ , and is represented as  $T = (e_h,r,e_t)$ , where  $e_h,e_t\in \mathcal{E}$  and  $r\in \mathcal{R}$ . Given an entity set  $\mathcal{E}_S\subseteq \mathcal{E}$ , the induced subgraph of  $\mathcal{E}_S$  is denoted as  $S = (\mathcal{E}_S,\mathcal{R}_S,\mathcal{T}_S)$ , where  $\mathcal{T}_S = \{(e,r,e')\in \mathcal{T}\mid e,e'\in \mathcal{E}_S\}$ , and  $\mathcal{R}_S = \{r\in \mathcal{R}\mid (e,r,e')\in \mathcal{T}_S\}$ . Furthermore, we denote  $\mathcal{D}(e)$  and  $\mathcal{D}(r)$  as the sets of short textual descriptions for each entity  $e\in \mathcal{E}$  and each relation  $r\in \mathcal{R}$ , respectively. For example, the text description of the entity "m.0f8l9c" is  $\mathcal{D}("m.0f8l9c")=$  "France". For simplicity, in this paper, all entities and relations are referenced through their  $\mathcal{D}$  representations and transformed into natural language.

DEFINITION 1 (REASONING PATH). Given a  $KG\mathcal{G}$ , a reasoning path within  $\mathcal{G}$  is defined as a connected sequence of knowledge triples, represented as: path $_{\mathcal{G}}(e_1, e_{l+1}) = \{T_1, T_2, \dots, T_l\} = \{(e_1, r_1, e_2), (e_2, r_2, e_3), \dots, (e_l, r_l, e_{l+1})\}$ , where  $T_i \in \mathcal{T}$  denotes the  $i$ -th triple in the path and  $l$  denotes the length of the path, i.e., length  $(path_{\mathcal{G}}(e_1, e_{l+1})) = l$ .

EXAMPLE 1. Consider a reasoning path between "University" and "Student" in KG: path  $\mathcal{G}$  (University, Student) = { (University, employs, Professor), (Professor, teaches, Course), (Course, enrolled_in, Student)} and can be visualized as:

University  $\xrightarrow{\text{employs}}$  Professor  $\xrightarrow{\text{teaches}}$  Course  $\xrightarrow{\text{enrolled\_in}}$  Student.

It indicates that a "University" employs a "Professor," who teaches a "Course," in which a "Student" is enrolled. The length of the path is 3.

For any entity  $s$  and  $t$  in  $\mathcal{G}$ , if there exists a reasoning path between  $s$  and  $t$ , we say  $s$  and  $t$  can reach each other, denoted as  $s \leftrightarrow t$ . The distance between  $s$  and  $t$  in  $\mathcal{G}$ , denoted as  $dist_{\mathcal{G}}(s,t)$ , is the shortest reasoning path distance between  $s$  and  $t$ . For the non-reachable vertices, their distance is infinite. Given a positive integer  $h$ , the  $h$ -hop neighbors of an entity  $s$  in  $\mathcal{G}$  is defined as  $N_{\mathcal{G}}(s,h) = \{t \in \mathcal{E} | dist_{\mathcal{G}}(s,t) \leq h\}$ .

DEFINITION 2 (ENTITY PATH). Given a KG  $\mathcal{G}$  and a list of entities  $list_{e} = [e_{1}, e_{2}, e_{3}, \ldots, e_{l}]$ , the entity path of  $list_{e}$  is defined as a connected sequence of reasoning paths, which is denoted as  $path_{\mathcal{G}}(list_{e}) = \{path_{\mathcal{G}}(e_{1}, e_{2}), path_{\mathcal{G}}(e_{2}, e_{3}), \ldots, path_{\mathcal{G}}(e_{l-1}, e_{l})\} = \{(e_{s}, r, e_{t}) | (e_{s}, r, e_{t}) \in path_{\mathcal{G}}(e_{i}, e_{i+1}) \land 1 \leq i < l\}$ .

![](images/74d41b8e335936b6a626910e11a436a55681a6532de431bf34dc86efea57af67.jpg)  
Figure 2: Overview of the PoG architecture. Exploration: After initialization (detailed in Figure 3), the model retrieves entity paths from  $\mathcal{G}_q$  through three exploration phases. Path Pruning: PoG applies a three-step beam search to prune paths after each exploration phase. Question Answering: The pruned paths are then evaluated for question answering. If these paths do not fully answer the question, the model explores deeper paths until  $D_{max}$  is reached or moves on to the next exploration phase.

Knowledge Graph Question Answering (KGQA) is a fundamental reasoning task based on KGs. Given a natural language question  $q$  and a KG  $\mathcal{G}$ , the objective is to devise a function  $f$  that predicts answers  $a \in \text{Answer}(q)$  utilizing knowledge encapsulated in  $\mathcal{G}$ , i.e.,  $a = f(q, \mathcal{G})$ . Consistent with previous research [27, 28, 36, 37], we assume the topic entities  $\text{Topic}(q)$  mentioned in  $q$  and answer entities  $\text{Answer}(q)$  in ground truth are linked to the corresponding entities in  $\mathcal{G}$ , i.e.,  $\text{Topic}(q) \subseteq \mathcal{E}$  and  $\text{Answer}(q) \subseteq \mathcal{E}$ .

# 4 Method

PoG implements the "KG-based LLM Reasoning" by first exploring all possible faithful reasoning paths and then collaborating with LLM to perform a 3-step beam search selection on the retrieved paths. Compared to previous approaches [28, 37], our model focuses on providing more accurate and question-relevant retrieval-argument graph information. The framework of PoG is outlined in Figure 2, comprising four main components.

- Initialization. The process begins by identifying the set of topic entities from the question input, and then queries the source KG  $\mathcal{G}$  by exploring up to  $D_{\mathrm{max}}$ -hop from each topic entity to construct the evidence sub-graph  $\mathcal{G}_q$ , where  $D_{\mathrm{max}}$  is the user-defined maximum exploration depth. Subsequently, we prompt the LLM to analyze the question and generate an indicator that serves as a strategy for the answer formulation process and predicting the exploration depth  $D_{\mathrm{predict}}$ .  
- Exploration. After initialization, the model retrieves topic entity paths from  $\mathcal{G}_q$  through three exploration phases: topic entity path exploration, LLM supplement path exploration, and node expand

exploration. All reasoning paths are constrained within the depth range  $D \in [D_{\mathrm{predict}}, D_{\mathrm{max}}]$ .

- Path Pruning. Following each exploration phase, PoG employs a pre-trained LM, LLM prompting, and graph structural analysis to perform a three-step beam search. The pruned paths are then evaluated in the question answering.  
- Question Answering. Finally, LLM is prompted to assess if the pruned reasoning paths sufficiently answer the question. If not, continue exploration with deeper paths incrementally until the  $D_{\mathrm{max}}$  is exceeded or proceed to the next exploration phase.

# 4.1 Initialization

The initialization has two main stages, i.e., question subgraph detection and question analysis. The framework is shown in Figure 3.

Question subgraph detection. Given a question  $q$ , PoG initially identifies the question subgraph, which includes all the topic entities of  $q$  and their  $D_{\mathrm{max}}$ -hop neighbors.

Topic entity recognition. To identify the relevant subgraph, PoG first employs LLMs to extract the potential topic entities from the question. Following the identification, the process applies BERT-based similarity matching to align these potential entities with entities from KG. Specifically, as shown in Figure 3, we encode both the keywords and all entities from KG into dense vector embeddings as  $H_{T}$  and  $H_{\mathcal{G}}$ . We then compute a cosine similarity matrix between these embeddings to determine the matches. For each keyword, the entities with the highest similarity scores are selected to form the set  $Topic(q)$ . This set serves as the foundation for constructing the question subgraph in subsequent steps.

![](images/bb8ac572ce115805091dedc1e0dde93dbd0164c09ab418ab815b957c36f63c32.jpg)  
Figure 3: Overview of the initialization phase. Output 1: from the input question, the model identifies topic entities and prompts the LLM to decompose questions into split questions  $q_{split}$  and generate an indicator  $I_{LLM}$ . The indicator outlines a strategy for formulating the answer and predicts the exploration depth  $D_{predict}$ . Output 2: the model queries the source KG up to  $D_{max}$ -hop from identified topic entities, constructing and pruning the evidence subgraph  $G_q$ .

Subgraph detection. Upon identifying the topic entities, PoG captures the induced subgraph  $\mathcal{G}_q \subseteq \mathcal{G}$  by expanding around each entity  $e$  in Topic(q). For each entity, we retrieve knowledge triples associated with its  $D_{\mathrm{max}}$ -hop neighbors, thereby incorporating query-relevant and faithful KG information into  $\mathcal{G}_q$ . Through this process, we update  $\mathcal{E}_q$  with newly added intermediate nodes that serve as bridging pathways between the topic entities. The result subgraph,  $\mathcal{G}_q$  is defined as  $(\mathcal{E}_q, \mathcal{R}_q, \mathcal{T}_q)$ , where  $\mathcal{E}_q$  encompasses Topic(q) together with the set  $\{N_{\mathcal{G}}(e, D_{\mathrm{max}}) \mid e \in \text{Topic}(q)\}$ , effectively linking all relevant entities and their connective paths within the defined hop distance. To interact with KG, we utilize the pre-defined SPARQL queries as detailed in Appendix D.

Graph pruning. To efficiently manage information overhead and reduce computational cost, we implement graph pruning on the question subgraph  $\mathcal{G}_q$  using node and relation clustering alongside graph reduction techniques. As illustrated in Figure 3, node and relation clustering is achieved by compressing multiple nodes and their relations into supernodes, which aggregate information from the original entities and connections. For graph reduction, we employ bidirectional BFS to identify all paths connecting the topic entities. Based on these paths, we regenerate induced subgraphs that involve only the relevant connections, effectively excluding nodes and relations that lack strong relevance to the topic entities.

Question analysis. To reduce hallucinations in LLMs, the question analysis phase is divided into two parts and executed within a single LLM call using an example-based prompt (shown in Appendix E). First, the complex question  $q$  is decomposed into simpler questions based on the identified topic entities, each addressing their relationship to the potential answer. Addressing these simpler questions collectively guides the LLM to better answer the original query, thereby reducing hallucinations. Second, a LLM indicator is generated, encapsulating all topic entities and predicting the answer position within a single chain of thought derived from the original question. This indicator highlights the relationships and sequence

among the entities and answer. Based on this, a predicted depth  $D_{\text{predict}}$  is calculated, defined as the maximum distance between the predicted answer and each topic entity. An example of question analysis is shown in Figure 3 with predicted depth 2.

# 4.2 Exploration

As discussed in Section 1, identifying reasoning paths that encompass all topic entities is essential to derive accurate answers. These paths serve as interpretable chains of thought, providing both the answer and the inference steps leading to it, a feature we refer as interpretability. To optimize the discovery of such paths efficiently and accurately, the exploration process is divided into three phases: topic entity path exploration, LLM supplement path exploration, and node expand exploration. After each phase, we perform path pruning and question answering. If a sufficient path is found, the process terminates; otherwise, it advances to the next phase to explore additional paths. Due to the space limitation, the pseudo-code of exploration section is shown in Appendix A.1.

Topic entity path exploration. To reduce LLM usage and search space, PoG begins exploration from a predicted depth  $D_{\mathrm{predict}}$  rather than the maximum depth. Using the question subgraph  $\mathcal{G}_q$ , topic entities  $Topic(q)$ , LLM indicator  $I_{\mathrm{LLM}}$ , and  $D_{\mathrm{predict}}$ , PoG identifies reasoning paths containing all topic entities by iteratively adjusting the exploration depth  $D$ . Entities in  $Topic(q)$  are ordered according to  $I_{\mathrm{LLM}}$  to facilitate reasoning effectively. Starting from the predicted depth  $D = \min(D_{\mathrm{predict}}, D_{\mathrm{max}})$ , we employ a bidirectional BFS to derive all potential entity paths, which is defined as:

$$
\operatorname {P a t h s} _ {t} = \left\{p \mid | T o p i c (q) | \times (D - 1) <   l e n g t h (p) \leq | T o p i c (q) | \times D \right\},
$$

where  $p = \text{Path}_{\mathcal{G}_q}(\text{Topic}(q))$ . To reduce the complexity, a pruning strategy is employed and selects the top- $W_{\max}$  paths based on  $\text{Paths}_t, I_{\text{LLM}}$ , and split questions from Section 4.1. These paths are evaluated for sufficiency verification. If inadequate,  $D$  is incremented until  $D_{\max}$  is reached. Then the next phase commences.

LLM supplement path exploration. Traditional KG-based LLM reasoning often rephrases KG facts without utilizing the LLM's inherent knowledge. To overcome this, PoG prompts LLMs to generate predictions based on path understanding and its implicit knowledge, providing additional relevant insights. It involves generating new LLM thinking indicators  $I_{\mathrm{Sup}}$  for predicted entities  $e \in \text{Predict}(q)$ , and then using text similarity to verify and align them with  $\mathcal{E}_q \in \mathcal{G}_q$ . The supplementary entity list  $\text{Lists}_S(e) = \text{Topic}(q) + e$  is built and ranked by  $I_{\mathrm{Sup}}$  to facilitate reasoning effectively. Next, supplementary paths  $\text{Paths}_S$  are derived from  $\text{Lists}_S(e)$  in the evidence KG  $\mathcal{G}_q$  with a fixed depth  $D_{\max}$ :

$$
\operatorname {P a t h s} _ {S} = \{p \mid \text {l e n g t h} (p) \leq | T o p i c (q) | \times D _ {\max },
$$

where  $p = \text{Path}_{\mathcal{G}_q}(List_S(e))$ . These paths with new indicators are evaluated similarly to the topic entity path exploration phase. The prompting temple is shown in Appendix E.

Node expand exploration. If previous phases cannot yield sufficient paths, PoG proceeds to node expansion. Unlike previous methods [28, 37] that separately explore relations and entities, PoG explores both simultaneously, leveraging clearer semantic information for easier integration with existing paths. During the exploration, PoG expands unvisited entities by 1-hop neighbors in  $\mathcal{G}$ . New triples are merged into existing paths to form the new paths, followed by pruning and evaluation.

# 4.3 Path Pruning

As introduced in Section 2, KGs contain vast amounts of facts, making it impractical to involve all relevant triples in the LLM's context due to high costs. To address this complexity and reduce LLM overhead, we utilize a three-step beam search for path pruning. The corresponding pseudo-code can be found in Appendix A.2.

Fuzzy selection. Considering that only a small subset of the generated paths is relevant, the initial step of our beam search involves fuzzy selection by integrating a pre-trained language model (e.g. SentenceBERT [34]), to filter the irrelevant paths quickly. As shown in Figure 2, we encode the LLM indicator  $I_{\mathrm{LLM}}$  (or  $I_{\mathrm{Sup}}$ ) and all reasoning paths into vector embeddings, denoted as  $H_I$  and  $H_{Paths}$ , and calculate cosine similarities between them. The top- $W_1$  paths with the highest similarity scores are selected for further evaluation.

Precise path selection. Following the initial fuzzy selection, the number of candidate paths is reduced to  $W_{1}$ . At this stage, we prompt the LLM to select the top-  $W_{\mathrm{max}}$  reasoning paths most likely to contain the correct answer. The specific prompt used to guide LLM in selection phase can be found in Appendix E.

Branch reduced selection. Considering that paths are often represented in natural language and can be extensive, leading to high processing costs for LLMs, we implement a branch reduced selection method integrated with the graph structure. This method effectively balances efficiency and accuracy by further refining path selection. Starting with  $D = 1$ , for each entity  $e$  in the entity list, we extract the initial  $D$ -step paths from every path in the candidate set  $Paths_{c}$  into a new set  $Paths_{e}$ . If the number of  $Paths_{e}$  exceeds the maximum designated width  $W_{\mathrm{max}}$ , these paths are pruned using precise path selection. The process iterates until the number of paths in  $Paths_{c}$  reaches  $D_{\mathrm{max}}$ . For example, as illustrated in Figure 2, with  $W_{\mathrm{max}} = 1$ , only the initial step paths (depicted in green)

are extracted for further examination, while paths represented by dashed lines are pruned. This selection method enables efficient iterative selection by limiting the number of tokens and ensuring the relevance and conciseness of the reasoning paths.

Beam search strategy. Based on the three path pruning methods above, PoG can support various beam search strategies, ranging from non-reliant to fully reliant on LLMs. These strategies are selectable in a user-friendly manner, allowing flexibility based on the specific requirements of the task. We have defined four such strategies in Algorithm 2 of Appendix A.2.

# 4.4 Question Answering

Based on the pruned paths in Section 4.3, we introduce a two-step question-answering method.

Path Summarizing. To address hallucinations caused by paths with excessive or incorrect text, we develop a summarization strategy by prompting LLM to review and extract relevant triples from provided paths, creating a concise and focused path. Details of the prompts used are in Appendix E.

Question answering. Based on the current reasoning path derived from path pruning and summarizing, we prompt the LLM to first evaluate whether the paths are sufficient for answering the split question and then the main question. If the evaluation is positive, LLM is prompted to generate the answer using these paths, along with the question and question analysis results as inputs, as shown in Figures 2. The prompts for evaluation and generation are detailed in Appendix E. If the evaluation is negative, the exploration process is repeated until completion. If node expand exploration reaches its depth limit without yielding a satisfactory answer, LLM will leverage both provided and inherent knowledge to formulate a response. Additional details on the prompts can be found in Appendix E.

# 5 Experiments

Experimental settings. We evaluate PoG on five KGQA datasets, i.e., CWQ [38], WebQSP [51], GrailQA [12], SimpleQuestions [31], and WebQuestions [3]. PoG is tested against methods without external knowledge (IO, CoT[45], SC[44]) and the state-of-the-art (SOTA) approaches with external knowledge, including prompting-based and fine-tuning-based methods. Freebase [5] serves as the background knowledge graph for all datasets. Experiments are conducted using two LLMs, i.e., GPT-3.5 (GPT-3.5-Turbo) and GPT-4. Following prior studies, we use exact match accuracy (Hits@1) as the evaluation metric. Due to the space limitation, detailed experimental settings, including dataset statistics, baselines, and implementation details, are provided in Appendix C.

PoG setting. We adopt the Fuzzy + Precise Path Selection strategy in Algorithm 2 of Appendix A.2 for PoG, with  $W_{1} = 80$  for fuzzy selection. Additionally, we introduce PoG-E, which randomly selects one relation from each edge in the clustered question subgraph to evaluate the impact of graph structure on KG-based LLM reasoning.  $W_{\mathrm{max}}$  and  $D_{\mathrm{max}}$  are 3 by default for beam search.

# 5.1 Main Results

Since PoG leverages external knowledge to enhance LLM reasoning, we first compare it with other methods that utilize external knowledge. Although PoG is a training-free, prompting-based method

Table 1: Results of PoG across various datasets, compared with the state-of-the-art (SOTA) in Supervised Learning (SL) and In-Context Learning (ICL) methods. The highest scores for ICL methods are highlighted in bold, while the second-best results are underlined. The Prior FT (Fine-tuned) SOTA includes the best-known results achieved through supervised learning.  

<table><tr><td rowspan="2">Method</td><td rowspan="2">Class</td><td rowspan="2">LLM</td><td colspan="3">Multi-Hop KGQA</td><td rowspan="2">Single-Hop KGQA Simple Questions</td><td rowspan="2">Open-Domain QA WebQuestions</td></tr><tr><td>CWQ</td><td>WebQSP</td><td>GrailQA</td></tr><tr><td colspan="8">Without external knowledge</td></tr><tr><td>IO prompt[37]</td><td>-</td><td>GPT-3.5-Turbo</td><td>37.6</td><td>63.3</td><td>29.4</td><td>20.0</td><td>48.7</td></tr><tr><td>CoT[37]</td><td>-</td><td>GPT-3.5-Turbo</td><td>38.8</td><td>62.2</td><td>28.1</td><td>20.3</td><td>48.5</td></tr><tr><td>SC[37]</td><td>-</td><td>GPT-3.5-Turbo</td><td>45.4</td><td>61.1</td><td>29.6</td><td>18.9</td><td>50.3</td></tr><tr><td colspan="8">With external knowledge</td></tr><tr><td>Prior FT SOTA</td><td>SL</td><td>-</td><td>70.4[9]</td><td>85.7[27]</td><td>75.4[11]</td><td>85.8[1]</td><td>56.3[18]</td></tr><tr><td>KB-BINDER[24]</td><td>ICL</td><td>Codex</td><td>-</td><td>74.4</td><td>58.5</td><td>-</td><td>-</td></tr><tr><td>ToG/ToG-R[37]</td><td>ICL</td><td>GPT-3.5-Turbo</td><td>58.9</td><td>76.2</td><td>68.7</td><td>53.6</td><td>54.5</td></tr><tr><td>ToG-2.0[28]</td><td>ICL</td><td>GPT-3.5-Turbo</td><td>-</td><td>81.1</td><td>-</td><td>-</td><td>-</td></tr><tr><td>ToG/ToG-R[37]</td><td>ICL</td><td>GPT-4</td><td>69.5</td><td>82.6</td><td>81.4</td><td>66.7</td><td>57.9</td></tr><tr><td>PoG-E</td><td>ICL</td><td>GPT-3.5-Turbo</td><td>71.9</td><td>90.9</td><td>87.6</td><td>78.3</td><td>76.9</td></tr><tr><td>PoG</td><td>ICL</td><td>GPT-3.5-Turbo</td><td>74.7</td><td>93.9</td><td>91.6</td><td>80.8</td><td>81.8</td></tr><tr><td>PoG-E</td><td>ICL</td><td>GPT-4</td><td>78.5</td><td>95.4</td><td>91.4</td><td>81.2</td><td>82.0</td></tr><tr><td>PoG</td><td>ICL</td><td>GPT-4</td><td>81.4</td><td>96.7</td><td>94.4</td><td>84.0</td><td>84.6</td></tr></table>

and has natural disadvantages compared to fine-tuned methods trained on evaluation data. As shown in Table 1, PoG with GPT-3.5-Turbo still achieves new SOTA performance across most datasets. Additionally, PoG with GPT-4 surpasses fine-tuned SOTA across all the multi-hop and open-domain datasets by an average of  $17.3\%$  and up to  $28.3\%$  on the WebQuestions dataset. Comparing all the incontext learning (ICL) methods, PoG with GPT-3.5-Turbo surpasses all the previous SOTA methods. When comparing PoG with GPT-3.5-Turbo against SOTA using GPT-4, PoG outperforms the SOTA by an average of  $12.9\%$  and up to  $23.9\%$ . When using the same LLM, PoG demonstrates substantial improvements: with GPT-3.5-Turbo, it outperforms SOTA by an average of  $21.2\%$  and up to  $27.3\%$  on the WebQuestions dataset; with GPT-4, it outperforms SOTA by  $16.6\%$  on average and up to  $26.7\%$  on the WebQuestions dataset. Additionally, PoG with GPT-3.5-Turbo outperforms methods without external knowledge (e.g., IO, CoT, SC prompting) by  $62\%$  on GrailQA and  $60.5\%$  on Simple Questions. These results show that incorporating external knowledge graphs significantly enhances reasoning tasks. PoG-E also achieves excellent results. Under GPT-4, PoG-E surpasses all SOTA in ICL by  $14.1\%$  on average and up to  $24.1\%$  on the WebQuestions dataset. These findings demonstrate that the graph structure is crucial for reasoning tasks, particularly for complex logical reasoning. By integrating the structural information of the question within the graph, PoG enhances the deep reasoning capabilities of LLMs, leading to superior performance.

# 5.2 Ablation Study

We perform various ablation studies to understand the importance of different factors in PoG. These ablation studies are performed with GPT-3.5-Turbo on two subsets of the CWQ and WebQSP test sets, each containing 500 randomly sampled questions.

Does search depth matter? As described, PoG's dynamic deep

![](images/6a68dbe8b8268b7dcf2a298aa5accd0317e93fbaea50b69cb885b97c972001b4.jpg)

![](images/98653a2150192cd2aa981a4df0cfac770e251948fc6e132109b50c72b1dd162b.jpg)

![](images/1c7cd79a2826f7457679081f663616f590574f48ad35766befdc057ac06c16d6.jpg)  
(a) CWQ (Vary  $D_{\mathrm{max}}$  
(c) WebQSP (Vary  $D_{\mathrm{max}}$

![](images/7c1f88bb4f6ad0915270ce96a27479e73c010a2c4a45d0c9a40bce2b9b8c41d4.jpg)  
(b) CWQ(PoG)  
(d) WebQSP(PoG)  
Figure 4: The accuracy of PoG and PoG-E among CWQ and WebQSP datasets by varying different  $D_{\mathrm{max}}$ .

search is limited by  $D_{\text{max}}$ . To assess the impact of  $D_{\text{max}}$  on performance, we conduct experiments with depth from 1 to 4. The results, shown in Figures 4(a) and (c), indicate that performance improves with increased depth, but the benefits diminish beyond a depth of 3. Figures 4(b) and (d), showing which exploration phase the answer is generated from, reveal that higher depths reduce the effectiveness of both LLM-based path supplementation and node exploration. Excessive depth leads to LLM hallucinations and difficulties in managing long reasoning paths. Therefore, we set the maximum depth to 3 for experiments to balance performance and computational efficiency. Additionally, even at lower depths, PoG maintains strong performance by effectively combining the LLM's inherent knowledge with the structured information from the KG.

Table 2: Performance of PoG and PoG-E on multi-entity and single-entity questions of all datasets. The symbol ‘-’ indicates no multi-entity question inside.  

<table><tr><td>Question Set</td><td>CWQ</td><td>WebQSP</td><td>GrailQA</td><td>WebQuestions</td><td>Simple Questions</td></tr><tr><td colspan="6">PoG with GPT-3.5-Turbo</td></tr><tr><td>Single-entity</td><td>70.3</td><td>93.9</td><td>92.1</td><td>81.7</td><td>78.3</td></tr><tr><td>Multi-entity</td><td>80.2</td><td>93.1</td><td>70.7</td><td>82.8</td><td>-</td></tr><tr><td colspan="6">PoG-E with GPT-3.5-Turbo</td></tr><tr><td>Single-entity</td><td>67.5</td><td>91</td><td>88.2</td><td>76.8</td><td>80.8</td></tr><tr><td>Multi-entity</td><td>77.5</td><td>82.8</td><td>76.0</td><td>82.8</td><td>-</td></tr></table>

Table 3: The illustration of graph size reduction.  

<table><tr><td></td><td>CWQ</td><td>WebQSP</td><td>GrailQA</td><td>WebQuestions</td></tr><tr><td>Ave Entity Number</td><td>3,540,267</td><td>243,826</td><td>62,524</td><td>240,863</td></tr><tr><td>Ave Entity Number After Pruned</td><td>1,621,055</td><td>182,673</td><td>30,267</td><td>177,822</td></tr><tr><td>Ave Entity Reduction Proportion (%)</td><td>54%</td><td>25%</td><td>52%</td><td>26%</td></tr></table>

# 5.3 Effectiveness Evaluation

Effective evaluation on multi-entity questions. To evaluate PoG's performance on multi-entity questions, we report the accuracy on all test sets by categorizing questions based on the number of topic entities. The results, shown in Table 2, demonstrate that, despite the increased complexity of multi-entity questions compared to single-entity ones, PoG maintains excellent accuracy, achieving up to  $93.9\%$  on the WebQSP dataset. This underscores the effectiveness of our structure-based model in handling complex multi-entity queries. Notably, the slightly lower performance on the GrailQA dataset can be attributed to some questions lacking matched topic entities, which prevents effective reasoning using KG.

Effective evaluation on multi-hop reasoning. To assess PoG's performance on multi-hop reasoning tasks, we analyze accuracy by categorizing questions based on the length of their ground-truth SPARQL queries. We randomly sample 1,000 questions from CWQ and WebQSP datasets and determine the reasoning length of each question by counting the number of relations in their ground-truth SPARQL queries. The distribution of questions with varying reasoning lengths is illustrated in Figure 5. We evaluate the performance of PoG and PoG-E across different ground-truth lengths to understand their effectiveness under varying query complexities. As shown in Figure 6, the performance of PoG and PoG-E remains consistent across different reasoning lengths. Even at the highest length levels in the WebQSP dataset, PoG achieves excellent accuracy, reaching up to  $90\%$ . Notably, although some questions have ground-truth lengths of eight or more, PoG successfully addresses them without matching the ground-truth length, demonstrating its ability to explore novel paths by effectively combining the LLM's inherent knowledge with the structured information from the KG. These results demonstrate the effectiveness of PoG in handling complex multi-hop reasoning tasks.

Graph structure pruning. To evaluate the effectiveness of the graph pruning method proposed in Section 4.1, we conduct experiments using 200 random samples from each dataset. We report the average number of entities per question before and after graph reduction, as well as the proportion of entities reduced, in Table 3. The results indicate that up to  $54\%$  of entities in the CWQ dataset can be

![](images/0e1eab0b4e0cb8724c37bca58670ee8cef1882461b6965b8da2395e6681d2c00.jpg)  
Figure 5: The lengths of the ground-truth SPARQL queries within the CWQ and WebQSP datasets.

![](images/acfc5aa82f429329be0d1dfcdee4ce7be85031c41d892bd93a5461bfba6ebd27.jpg)  
(a) CWQ

![](images/e489d83b9d83c5b026529564f2dd0206e306a954382fb3fc28420e1fe6d1c709.jpg)  
(b) WebQSP  
Figure 6: The accuracy of PoG and PoG-E on the CWQ and WebQSP datasets, categorized by the different lengths of the ground-truth answers for each question.

pruned before path exploration. This demonstrates the effectiveness of eliminating irrelevant data from the outset.

Case study: interpretable reasoning. We also conduct the case study to demonstrate interpretability of PoG, we present three reasoning examples in Table 9 of Appendix B.5. These examples feature questions with one, two, and three entities, respectively. Through the case study, we showcase PoG's effectiveness in handling multi-entity and multi-hop tasks by providing faithful and interpretable reasoning paths that lead to accurate answers.

To further evaluate the effectiveness and efficiency of PoG, we perform additional experiments, including pruning beam search strategy ablation and prompt setting ablation (Appendix B.1), reasoning faithfulness analysis (Appendix B.2), error analysis (Appendix B.3), LLM calls cost and running time analysis (Appendix B.4), and graph reduction and path pruning case study (Appendix B.5).

# 6 Conclusion

In this paper, we introduce Paths-over-Graphs (PoG), a novel method that integrates LLMs with KGs to enable faithful and interpretable reasoning. PoG addresses complex reasoning tasks through a three-phase dynamic multi-hop path exploration, combining the inherent knowledge of LLMs with factual information from KGs. Efficiency is enhanced by graph-structured pruning and a three-step pruning process to effectively narrow down candidate paths. Extensive experiments on five public datasets demonstrate that PoG outperforms existing baselines, showcasing its superior reasoning capabilities and interoperability.

# 7 Acknowledgment

Xiaoyang Wang is supported by the Australian Research Council DP230101445 and DP240101322. Wenjie Zhang is supported by the Australian Research Council DP230101445 and FT210100303.

# References

[1] Jinheon Baek, Alham Fikri Aji, Jens Lehmann, and Sung Ju Hwang. 2023. Direct Fact Retrieval from Knowledge Graphs without Entity Linking. In ACL.  
[2] Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. arXiv preprint arXiv:2306.04136 (2023).  
[3] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP. 1533-1544.  
[4] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In AAAI, Vol. 38. 17682-17690.  
[5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD. 1247-1250.  
[6] Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).  
[7] Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. 2024. Exploring the potential of large language models (llms) in learning on graphs. ACM SIGKDD Explorations Newsletter 25, 2 (2024), 42-61.  
[8] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1-113.  
[9] Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros Polymenakos, and Andrew McCallum. 2021. Case-based Reasoning for Natural Language Queries over Knowledge Bases. In EMNLP.  
[10] Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. In ICLR.  
[11] Yu Gu, Xiang Deng, and Yu Su. 2023. Don't Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments. In ACL.  
[12] Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. 2021. Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases. In WWW. 3477-3488.  
[13] Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han. 2023. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking. arXiv preprint arXiv:2305.15066 (2023).  
[14] Lingbing Guo, Zequn Sun, and Wei Hu. 2019. Learning to exploit long-term relational dependencies in knowledge graphs. In International conference on machine learning. PMLR, 2505-2514.  
[15] Tiezheng Guo, Qingwen Yang, Chen Wang, Yanyi Liu, Pan Li, Jiawei Tang, Dapeng Li, and Yingyou Wen. 2024. Knowledgenavigator: Leveraging large language models for enhanced reasoning over knowledge graph. Complex & Intelligent Systems 10, 5 (2024), 7063-7076.  
[16] Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley, and Lina Yao. 2025. Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models. In COLING.  
[17] Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: a general framework for large language model to reason over structured data. arXiv preprint arXiv:2305.09645 (2023).  
[18] Akhil Kedia, Mohd Abbas Zaidi, and Haejun Lee. 2022. FiE: Building a Global Probability Space by Leveraging Early Fusion in Encoder for Open-Domain Question Answering. In EMNLP.  
[19] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406 (2022).  
[20] Jiho Kim, Yeonsu Kwon, Yohan Jo, and Edward Choi. 2023. Kg-gpt: A general framework for reasoning on knowledge graphs using large language models. arXiv preprint arXiv:2310.11220 (2023).  
[21] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. NeurIPS (2022).  
[22] Fan Li, Xiaoyang Wang, Dawei Cheng, Wenjie Zhang, Ying Zhang, and Xuemin Lin. 2024. Hypergraph Self-supervised Learning with Sampling-efficient Signals. arXiv preprint arXiv:2404.11825 (2024).  
[23] Fan Li, Zhiyu Xu, Dawei Cheng, and Xiaoyang Wang. 2024. AdaRisk: Risk-adaptive Deep Reinforcement Learning for Vulnerable Nodes Detection. IEEE TKDE (2024).  
[24] Tianle Li, Xueguang Ma, Alex Zhuang, Yu Gu, Yu Su, and Wenhu Chen. 2023. Few-shot In-context Learning on Knowledge Base Question Answering. In ACL.  
[25] Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. 2023. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. arXiv preprint arXiv:2305.13269 (2023).  
[26] Linhao Luo, Jiaxin Ju, Bo Xiong, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2023. Chatrule: Mining logical rules with large language models for knowledge graph reasoning. arXiv preprint arXiv:2309.01538 (2023).

[27] Linhao Luo, Yuanfang Li, Reza Haf, and Shirui Pan. 2024. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning. In ICLR.  
[28] Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, and Jian Guo. 2024. Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval. arXiv preprint arXiv:2407.10805 (2024).  
[29] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xinding Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. TKDE (2024).  
[30] Matthew E Peters, Mark Neumann, Robert L Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge enhanced contextual word representations. arXiv preprint arXiv:1909.04164 (2019).  
[31] Michael Petrochuk and Luke Zettlemoyer. 2018. SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach. In EMNLP. 554-558.  
[32] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2020. KILT: a benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252 (2020).  
[33] Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922 (2023).  
[34] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP.  
[35] Qing Sima, Jianke Yu, Xiaoyang Wang, Wenjie Zhang, Ying Zhang, and Xuemin Lin. 2024. Deep Overlapping Community Search via Subspace Embedding. arXiv preprint arXiv:2404.14692 (2024).  
[36] Haitian Sun, Tania Bedrax-Weiss, and William W Cohen. 2019. Pullnet: Open domain question answering with iterative retrieval on knowledge bases and text. arXiv preprint arXiv:1904.09537 (2019).  
[37] Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph. In ICLR.  
[38] Alon Talmor and Jonathan Berant. 2018. The Web as a Knowledge-Base for Answering Complex Questions. In NAACL.  
[39] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937 (2018).  
[40] Xingyu Tan, Jingya Qian, Chen Chen, Sima Qing, Yanping Wu, Xiaoyang Wang, and Wenjie Zhang. 2023. Higher-Order Peak Decomposition. In CIKM.  
[41] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).  
[42] Jinghao Wang, Yanping Wu, Xiaoyang Wang, Ying Zhang, Lu Qin, Wenjie Zhang, and Xuemin Lin. 2024. Efficient Influence Minimization via Node Blocking. Proc. VLDB Endow. (2024).  
[43] Kai Wang, Yuwei Xu, Zhiyong Wu, and Siqiang Luo. 2024. LLM as Prompt: Low-resource Inductive Reasoning on Arbitrary Knowledge Graphs. In ACL.  
[44] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).  
[45] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS (2022).  
[46] Yilin Wen, Zifeng Wang, and Jimeng Sun. 2024. Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models. In ACL.  
[47] Yanping Wu, Renjie Sun, Xiaoyang Wang, Dong Wen, Ying Zhang, Lu Qin, and Xuemin Lin. 2024. Efficient Maximal Frequent Group Enumeration in Temporal Bipartite Graphs. Proc. VLDB Endow. (2024).  
[48] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. NeurIPS (2024).  
[49] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).  
[50] Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. 2021. Rng-kbqa: Generation augmented iterative ranking for knowledge base question answering. arXiv preprint arXiv:2109.08678 (2021).  
[51] Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. 2016. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In ACL 201-206.  
[52] Hang Zhang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan, and Weizhu Chen. 2021. Poolingformer: Long document modeling with pooling attention. In ICML. 12437-12446.  
[53] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2023. Automatic Chain of Thought Prompting in Large Language Models. In ICLR.  
[54] Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. 2023. Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework. In ACL.

# A Algorithm

# A.1 Exploration

We summarize the comprehensive algorithmic procedure for exploration detailed in Section 4.2 as presented in Algorithm 1.

Algorithm 1: Exploration  
Input:Question subgraph  $(\mathcal{G}_q)$  , source KG  $(\mathcal{G})$  ,question and split question  $(Q = q + q_{split})$  ,topic entities  $(Topic(q))$  ,LLM indicator  $(I_{\mathrm{LLM}})$  , predict depth  $(D_{\mathrm{predict}})$  , maximum depth  $(D_{\mathrm{max}})$  , maximum width  $(W_{\mathrm{max}})$  , and path pruning case (case) Output:PoG answers  $(a(q))$  , final reasoning path  $(Paths_f(q))$  /\* Start with topic entity path exploration \*/   
1  $List_{T}\leftarrow$  Reorder  $(Topic(q),I_{\mathrm{LLM}})$ $D\gets \min (D_{\mathrm{predict}},D_{\mathrm{max}})$    
2 while  $D\leq D_{\mathrm{max}}$  do   
3  $Path_{st}\gets$  EntityPathFind  $(List_T,D,\mathcal{G}_q)$    
4  $PathPruning(Path_{st},Q,I_{\mathrm{LLM}},W_{\mathrm{max}},D_{\mathrm{max}},List_{T},\mathrm{case})$    
5 Answer,  $Path_{ST}\gets$  QuestionAnswering  $(Path_{st},Q,I_{\mathrm{LLM}})$    
6 if {"Yes"} in Answer then return Answer,  $Path_{ST}$  .   
7 else  $D\gets D + 1$    
/\* LLM supplement path exploration procedure \*/   
8  $Path_{s}\gets []$    
9 Predict  $(q)\gets$  SupplementPrediction  $(Path_{T},Q,I_{\mathrm{LLM}})$    
10 for each  $e,I_{sup(e)}\in$  Predict  $(q)$  do   
11  $Lists_{S}\gets$  Reorder  $(List_{T} + e,I_{sup(e)})$    
12  $Paths_s'\gets$  EntityPathFind  $(Lists,D_{\mathrm{max}},\mathcal{G}_q)$    
13  $Paths_s\gets$  Paths  $^+$  FuzzySelect  $(Path_{s}',I_{sup(e)},W_{\mathrm{max}})$    
14 PathPruning  $(Path_{s},Q,I_{\mathrm{LLM}},W_{\mathrm{max}},D_{\mathrm{max}},List_{S},\mathrm{case})$    
15 Answer,  $Path_{S}\gets$  QuestionAnswering  $(Path_{s},Q,I_{\mathrm{LLM}})$    
16 if "  $\{\mathrm{Yes}\} ^ { " }$  in Answer then return Answer,  $Path_{SS}$  ： /\* Node expand exploration procedure \*/   
17 Visted  $\leftarrow 0$ $D\gets 1$ $Path_{e}\gets$ $Path_{T} + Path_{S}$    
18 PathPruning  $(Path_{e},Q,I_{\mathrm{LLM}},W_{\mathrm{max}},D_{\mathrm{max}},List_{T},\mathrm{case})$  .   
19 while  $D\leq D_{\mathrm{max}}$  do   
20 for each  $e\in$  ExtractEntity  $(Path_{e})\wedge e\notin Vistd$  do   
21 Related\_edges  $=$  Find_1_hop_Eges(G,e);   
22  $Path_{se}\gets$  MergeTogether  $(Path_{e},Related\_edge)$    
23 PathPruning  $(Path_{e},Q,I_{\mathrm{LLM}},W_{\mathrm{max}},D_{\mathrm{max}},List_{T},\mathrm{case})$    
24 Answer,  $Path_{e}\gets$  QuestionAnswering  $(Path_{e},Q,I_{\mathrm{LLM}})$    
25 if {"Yes"} in Answer then return Answer,  $Path_{e}$  .   
26 else Visted  $\leftarrow$  Visted U e;  $D\gets D + 1$    
27  $Path_{s}\gets$ $Path_{T} + Path_{S} + Path_{E}$  .   
28 PathPruning  $(Path_{s},Q,I_{\mathrm{LLM}},W_{\mathrm{max}},D_{\mathrm{max}},List_{T},\mathrm{case})$    
29 Answer,  $Path_{s}\gets$  QuestionAnsweringFinal  $(Path_{s},Q,I_{\mathrm{LLM}})$    
30 Return Answer,  $Path_{s}$

# A.2 Path Pruning

We summarize the comprehensive algorithmic procedure of path pruning detailed in Section 4.3 as presented in Algorithm 2.

Algorithm 2: PathPruning  
Input: Candidate paths  $(\text{Pathsc})$  , question and split question  $(Q = q + q_{split})$  , indicator (I), maximum width  $(W_{\mathrm{max}})$  maximum depth  $(D_{\mathrm{max}})$  , entity list (list) Output: Pruned candidate paths  $(\text{Pathsc})$    
1 if Case  $=$  Fuzzy Selection Only then FuzzySelect  $(\text{Pathsc},Q,I,W_{\mathrm{max}})$  .   
3 else if Case  $=$  Fuzzy  $^+$  Precise Path Selection then FuzzySelect  $(\text{Pathsc},Q,I,W_1)$  . PrecisePathSelect  $(\text{Pathsc},Q,I,W_{\mathrm{max}})$  .   
6 else if Case  $=$  Fuzzy  $^+$  Branch Reduced Selection then FuzzySelect  $(\text{Pathsc},Q,I,W_1)$  . BranchReduceSelect  $(\text{Pathsc},Q,I,W_{\mathrm{max}},D_{\mathrm{max}},list)$  .   
9 else if Case  $=$  Fuzzy  $^+$  Branch Reduced  $^+$  Precise Path then / case  $= 3$  Step Beam Search \*/ FuzzySelect  $(\text{Pathsc},Q,I,W_1)$  BranchReduceSelect  $(\text{Pathsc},Q,I,W_2,D_{\mathrm{max}},list)$  . PrecisePathSelect  $(\text{Pathsc},Q,I,W_{\mathrm{max}})$  .   
12 Procedure BranchReduceSelect  $(\text{Pathsc},Q,I,W,\text{D}_{\mathrm{max}},list)$    
13  $D\gets 1$  , Pathse  $\leftarrow 0$  .   
14 while  $|\text{Pathsc} |\geq W\land D\leq D_{\mathrm{max}}$  do for each e in list do Pathse  $\leftarrow$  Pathse U ExtractHeadSteps  $(\text{Pathsc},e,D)$  . if  $|\text{Pathse}| > W$  then PrecisePathSelect  $(\text{Pathsc},Q,I,W)$  . Pathsc  $\leftarrow$  IntersectMatchUpdate  $(\text{Pathse},\text{Pathsc})$  . Pathse  $\leftarrow 0$  .  $D\gets D + 1$  .   
22 if  $|\text{Pathsc} | > W$  then PrecisePathSelect  $(\text{Pathsc},Q,I,W)$

# B Experiment

# B.1 Additional Ablation Study

Compare the effect of different beam search strategies. As introduced in Section 4.3, PoG supports various beam search strategies, ranging from non-reliant to fully reliant on LLMs, selectable in a user-friendly manner. To evaluate the computational cost and performance, we test four cases outlined in Algorithm 2. In the 3-Step Beam Search case, we set  $W_{2} = 20$  for internal narrowing. The Fuzzy Selection approach, as described in Section 4.3, utilizes all candidate paths and a LLM-generated indicator for encoding and comparison. We report accuracy, average LLM calls in total, and average token input during the path pruning for each beam search strategy applied to PoG and PoG-E in Table 4 and Table 5. These results indicate that PoG/PoG-E with Fuzzy and Precise Path Selection achieves the highest accuracy. Additionally, the BranchReduced Selection method, which leverages the graph structure, not only delivers excellent results but also reduces token usage by over  $50\%$  (65%) with only a  $\pm 2\%$  (±4.3%) difference in accuracy on PoG (PoG-E) compared to the best-performing strategy. Furthermore, the Fuzzy Selection method, which employs lightweight models instead of relying solely on LLMs, also demonstrates strong performance. These results validate the effectiveness of our beam search strategies and underscore the importance of structure-based faithful path reasoning.

Table 4: Performance comparison of PoG with different beam search methods on CWQ and WebQSP.  

<table><tr><td>PoG</td><td>Evaluation</td><td>CWQ</td><td>WebQSP</td></tr><tr><td rowspan="3">w/ Fuzzy Selection</td><td>Accuracy</td><td>57.1</td><td>86.4</td></tr><tr><td>Token Input</td><td>-</td><td>-</td></tr><tr><td>LLM Calls</td><td>6.8</td><td>6.5</td></tr><tr><td rowspan="3">w/ Fuzzy and BranchReduced Selection</td><td>Accuracy</td><td>79.3</td><td>93.0</td></tr><tr><td>Token Input</td><td>101,455</td><td>328,742</td></tr><tr><td>LLM Calls</td><td>9.7</td><td>9.3</td></tr><tr><td rowspan="3">w/ Fuzzy and Precise Path Selection</td><td>Accuracy</td><td>81.4</td><td>93.9</td></tr><tr><td>Token Input</td><td>216,884</td><td>617,448</td></tr><tr><td>LLM Calls</td><td>9.1</td><td>7.5</td></tr><tr><td rowspan="3">w/ 3-Step Beam Search</td><td>Accuracy</td><td>79.8</td><td>91.9</td></tr><tr><td>Token Input</td><td>102,036</td><td>369,175</td></tr><tr><td>LLM Calls</td><td>8.8</td><td>9.0</td></tr></table>

How do summary prompts affect? Inspired by GoT [4], we utilize summary prompts to reduce LLM hallucinations and decrease computational costs. To evaluate their impact, we conduct an ablation study comparing PoG and PoG-E with and without path summarization. We measure both accuracy and average token input to the LLM API during the path pruning phase to measure efficiency and effectiveness. The results, present in Tabel 6, show that using graph summaries increases accuracy by up to  $10\%$  on the CWQ dataset with PoG-E, while reducing token input by up to  $36\%$  on WebQSP. These results indicate hat path summarization effectively minimizes LLM hallucinations, enhances the LLM's understanding of the explored paths, facilitates answer retrieval, enables earlier termination of the reasoning process, and reduces costs.

Table 5: Performance comparison of PoG-E with different beam search methods among CWQ and WebQSP datasets.  

<table><tr><td>PoG-E</td><td>Evaluation</td><td>CWQ</td><td>WebQSP</td></tr><tr><td rowspan="3">w/ FuzzySelect</td><td>Accuracy</td><td>62.31</td><td>82.3</td></tr><tr><td>Token Input</td><td>-</td><td>-</td></tr><tr><td>Ave LLM Calls</td><td>6</td><td>6.3</td></tr><tr><td rowspan="3">w/ Fuzzy and BranchReduced Selection</td><td>Accuracy</td><td>71.9</td><td>88.4</td></tr><tr><td>Token Input</td><td>128,407</td><td>371,083</td></tr><tr><td>Ave LLM Calls</td><td>9.4</td><td>9.1</td></tr><tr><td rowspan="3">w/ Fuzzy and Precise Path Selection</td><td>Accuracy</td><td>80.4</td><td>91.4</td></tr><tr><td>Token Input</td><td>344,747</td><td>603,261</td></tr><tr><td>Ave LLM Calls</td><td>8.3</td><td>7.4</td></tr><tr><td rowspan="3">w/ 3-Step Beam Search</td><td>Accuracy</td><td>73.87</td><td>89.4</td></tr><tr><td>Token Input</td><td>120,159</td><td>411,283</td></tr><tr><td>Ave LLM Calls</td><td>8.3</td><td>9.1</td></tr></table>

Table 6: Performance comparison of PoG and PoG-E with and without path summarizing on CWQ and WebQSP datasets.  

<table><tr><td>Method</td><td>Evaluation</td><td>CWQ</td><td>WebQSP</td></tr><tr><td>PoG</td><td></td><td></td><td></td></tr><tr><td rowspan="2">w/ Path Summarizing</td><td>Accuracy</td><td>81.4</td><td>93.9</td></tr><tr><td>Token Input</td><td>216,884</td><td>297,359</td></tr><tr><td rowspan="2">w/o Path Summarizing</td><td>Accuracy</td><td>74.7</td><td>91.9</td></tr><tr><td>Token Input</td><td>273,447</td><td>458,545</td></tr><tr><td>PoG-E</td><td></td><td></td><td></td></tr><tr><td rowspan="2">w/ Path Summarizing</td><td>Accuracy</td><td>80.4</td><td>91.4</td></tr><tr><td>Token Input</td><td>314,747</td><td>273,407</td></tr><tr><td rowspan="2">w/o Path Summarizing</td><td>Accuracy</td><td>70.4</td><td>90.4</td></tr><tr><td>Token Input</td><td>419,679</td><td>428,545</td></tr></table>

# B.2 Reasoning Faithfulness Analysis

Overlap ratio between explored paths and ground-truth paths. We analyzed correctly answered samples from three datasets to investigate the overlap ratio between the paths  $P$  explored by PoG and the ground-truth paths  $S$  in SPARQL queries. The overlap ratio is defined as the proportion of overlapping relations to the total number of relations in the ground-truth SPARQL path:

$$
R a t i o (P) = \frac {| R e l a t i o n (P) \cap R e l a t i o n (S) |}{| R e l a t i o n (S) |},
$$

where  $Relation(P)$  denotes the set of relations in path  $P$ . Figure 7 illustrates the distribution of questions across different overlap ratios. For the WebQSP dataset, PoG achieves the highest proportion of fully overlapping paths with the ground truth, reaching approximately  $60\%$  accuracy. In contrast, PoG-E applied to the GrailQA dataset shows the highest proportion of paths with up to  $70\%$  non-overlapping relations, indicating that PoG-E explores novel paths to derive the answers. The different results between PoG and PoG-E are due to PoG-E's strategy of randomly selecting one related edge from each clustered edge. This approach highlights the effectiveness of our structure-based path exploration method in generating diverse and accurate reasoning paths.

![](images/7e326f79e4e634c4a6c8722fcf2ce1a5efdb225218b7360a0d96ec57bc87902c.jpg)

![](images/8ae5e3959779d746b6e59ec6a9f33cdc514e0d76609f5696d577b39ee6c6d90a.jpg)

![](images/5205305b0c0a5b5a69d8e1017f3ea704c34ff799ad95b97572c9de3c6eabe1c0.jpg)  
(a) CWQ (PoG)

![](images/2259aa53e7d2b45f084c70a395be2c51d9353d39afadac78a3929b68186f3984.jpg)  
(b) CWQ (PoG-E)

![](images/46c56c91a72ded5796fe9e5337dfe5c781a014298cae91bb1f09a4147a3d088b.jpg)  
(c) WebQSP (PoG)  
(e) GrailQA (PoG)

![](images/5a7f590430d485db85d1cf8513a1495c1c68d65552a9db594a6e17d309f4eec0.jpg)  
(d) WebQSP (PoG-E)  
(f) GrailQA (PoG-E)  
Figure 7: The path overlap ratio of PoG and PoG-E among CWQ, WebQSP, and GrailQA datasets.  
Figure 8: The proportions of answer evidence of PoG and PoG-E among CWQ, WebQSP, and GrailQA datasets.

Evidence of answer exploration sources. We conduct an analysis of correctly answered samples from three datasets to investigate the sources of evidence used by the LLM in generating answers, as illustrated in Figure 8. Specifically, we categorize all generated answers into three cases: KG only, LLM-inspired KG, and KG-inspired LLM. In the KG only scenario, answers are generated solely based on KG paths. The LLM-inspired KG case involves the LLM predicting an answer using its inherent knowledge and subsequently using the KG to verify its correctness. Conversely, in the KG-inspired LLM case, the paths generated by the KG are insufficient to reach the answer, and the LLM supplements the reasoning using its inherent knowledge. As shown in the figure, up to  $14\%$  of answers are generated through the KG-inspired LLM approach, and up to  $9\%$  involve LLM-inspired KG path supplementation. Compared to previous work that integrates LLM inherent knowledge with KG data[37], PoG more effectively leverages the strengths of both sources. These results demonstrate that PoG is a faithful reasoning method that primarily relies on KG-based reasoning while being supplemented by the LLM, ensuring both accuracy and interpretability in answer generation.

![](images/72c47aa876c2092683e5e5281d19a6efbbec7f961cf2a4d2e31c466985db8be2.jpg)  
CWQ(%)

![](images/54813d20cb6ad7efae91fd3e9586920af78cf5efb12273b2fd84b6f87daff7c5.jpg)  
WebQSP(%)  
(a) PoG

![](images/30a727ceb19819a6e1db42cd7f8062bae4d89c00376c58d89ebbc02a4e2e7a43.jpg)  
CWQ(%)

![](images/3551e0cbb29e0dc8019f0e9efe00e70c88f560d8d02be85f66eb15e1fa00581b.jpg)  
WebQSP(%)  
(b) PoG-E

![](images/c4587e5fc493654af3336a8ff35a794846555731b9c492a9e603d015d93d7dcf.jpg)  
GrailQA(%)

![](images/24d6f0c1a0404d1b4cce4a5765df7efccffe83e0af6e39543a82afe2703527ab.jpg)  
GrailQA(%)

# B.3 Error Analysis

To further analyze the integration of LLMs and KGs, we conduct an error analysis on the CWQ, WebQSP, and GrailQA datasets. We categorize errors into four types: (1) answer generation error, (2) refuse error, (3) format error, and (4) other hallucination errors. Note that answer generation error occurs when PoG provides an accurate reasoning path, but the LLM fails to extract the correct answer from it. The distribution of these error types is illustrated in Figure 9. The results indicate that using more powerful LLMs reduces the number of "other hallucination errors," "refuse errors," and "answer generation errors," as the model offers enhanced reasoning capabilities based on the retrieved data. Specifically, the reduction in "answer generation errors" shows the reasoning paths provided by PoG are effectively utilized by more advanced LLMs. However, we observe an increase in "format errors" with more powerful LLMs, which may be attributed to their greater creative flexibility.

![](images/dd18837a7f654786aae0b1597c8fbda7cafd2f39aa3ffeaf64dcb1b87d2726ca.jpg)  
Figure 9: The error instances and categories of PoG and PoGE in the CWQ, WebQSP, and GrailQA datasets.

# B.4 Efficiency Analysis

LLM calls cost analysis. To evaluate the cost and efficiency of utilizing LLMs, we conducted an analysis of LLM calls on the CWQ, WebQSP, and GrailQA datasets. Initially, we examined the proportion of questions answered with varying numbers of LLM calls, as depicted in Figure 10. The results indicate that the majority of questions are answered within nine LLM calls across all datasets, with approximately  $80\%$  and  $50\%$  of questions being resolved within six calls on CWQ and WebQSP, respectively. These findings demonstrate PoG's efficiency in minimizing LLM usage costs. Furthermore, we compared the average number of LLM calls required by PoG with the current SOTA method, ToG [37], as shown in Table 7. Since we utilized identical datasets for WebQSP, GrailQA, Simple Questions, and WebQuestions, we report the ToG results from their paper. The comparison reveals that PoG achieves comparable or superior accuracy while reducing the number of LLM calls by up to  $40\%$  on the GrailQA dataset compared to ToG. This improvement is attributed to PoG's dynamic exploration strategy, which avoids starting from scratch, and its effective use of graph structures to prune irrelevant information, thereby significantly decreasing computational costs.

![](images/e2b8878b077ef9766cdb8c804b4731857589aa62e87e9f9bbc398862ba751c05.jpg)  
Figure 10: The proportion of question of PoG and PoG-E by different LLM Calls among CWQ, WebQSP, and GrailQA datasets

Table 7: Average LLM calls per question of PoG and ToG among all datasets.  

<table><tr><td>Method</td><td>CWQ</td><td>WebQSP</td><td>GrailQA</td><td>Simple Questions</td><td>WebQuestions</td></tr><tr><td>PoG</td><td>10.7</td><td>8.3</td><td>6.5</td><td>6.1</td><td>9.3</td></tr><tr><td>ToG</td><td>-</td><td>11.2</td><td>10.6</td><td>8.7</td><td>10.5</td></tr></table>

Running time analysis. To further evaluate the processing efficiency of PoG, we conducted extensive evaluations focusing on average accuracy, LLM call frequency, and running time on multi-hop question datasets GrailQA and CWQ. The results, presented in Table 8, compare the performance of ToG and PoG across various beam search strategies. The data indicate that while higher accuracy slightly increases runtime, PoG effectively balances accuracy with reduced LLM call costs. Specifically, PoG reduces LLM call costs by up to  $53.5\%$  while increasing accuracy by up to  $33.4\%$ . This allows users to tailor the PoG framework to their specific needs regarding accuracy, cost, and runtime. All PoG setting provide significantly lower costs. For instance, on the GrailQA dataset, PoG with 3-step beam search reduces LLM call costs by  $39.6\%$  and improves accuracy by  $33.1\%$ , with only a  $1.14\%$  increase in runtime. A similar trend is also observed in the CWQ dataset.

Table 8: Running time evaluation of ToG and PoG with different beam search methods on CWQ and GrailQA.  

<table><tr><td>Model</td><td>Evaluation</td><td>CWQ</td><td>GrailQA</td></tr><tr><td rowspan="3">ToG</td><td>Accuracy</td><td>53.1</td><td>59.3</td></tr><tr><td>Time (s)</td><td>78.7</td><td>14.8</td></tr><tr><td>LLM Calls</td><td>21.3</td><td>10.1</td></tr><tr><td rowspan="3">PoG</td><td>Accuracy</td><td>81.4</td><td>92.7</td></tr><tr><td>Time (s)</td><td>118.9</td><td>21.4</td></tr><tr><td>LLM Calls</td><td>9.1</td><td>6.5</td></tr><tr><td>PoG</td><td>Accuracy</td><td>79.8</td><td>92.4</td></tr><tr><td rowspan="2">w/ 3-Step Beam Search</td><td>Time (s)</td><td>87.5</td><td>15.0</td></tr><tr><td>LLM Calls</td><td>8.8</td><td>6.1</td></tr><tr><td>PoG</td><td>Accuracy</td><td>57.1</td><td>83.0</td></tr><tr><td rowspan="2">w/ Fuzzy Selection</td><td>Time (s)</td><td>109.7</td><td>15.7</td></tr><tr><td>LLM Calls</td><td>6.8</td><td>4.7</td></tr></table>

# B.5 Case Study

Case study: graph reduction and path pruning. We conducted a case study using the example question presented in Figure 2 to illustrate the effects of graph pruning and path pruning on the graph structure. Figure 11(a) shows the results of graph pruning, where vertices in blue are selected as part of the question subgraph, and vertices in black are pruned. In this sample, the number of entities is reduced from 16,740 to 1,245, resulting in a  $92\%$  reduction of vertices. Figures 11(b) and 11(c) demonstrate the question subgraph induced by the blue vertices in Figure 11(a) and the results after applying fuzzy and precise path selection. In these figures, vertices in blue represent the selected entity after each pruning, vertices in yellow represent the topic entities, and the vertex in red denotes the final answer entity. From these graphs, we observe that utilizing the graph structure allows for the rapid pruning of irrelevant vertices, ensuring that the reasoning paths remain faithful and highly relevant to the question since all vertices within the question subgraph are interconnected with all topic entities, thereby maintaining the integrity and relevance of the reasoning process.

![](images/9545cf8bdc17de76b9017f4b996435616b379b0fe9cef984e4e73bbe3e7151e7.jpg)  
(a) Graph pruning

![](images/297f6ddeb580a186186adb774df102323c1f782cc37ed410d76ee54732db9fc2.jpg)  
(b) Question subgraph

![](images/5cc7882b1758b0e101f45a189909e8a5b76f2c7108e839e03609e1df7b39c94d.jpg)  
(c) Fuzzy selection  
Figure 11: Visualization of graph reduction and Path selection.

![](images/a5373a718141030c05f1d6eb563e354affd62f2009496a2df82ccca3ec09ddf7.jpg)  
(d) Precise selection

Case study: interpretable reasoning. In this section, we present Table 9, which illustrates PoG's interpretability through case studies involving questions with one, two, and three entities. These examples demonstrate PoG's effectiveness in handling multi-entity and multi-hop tasks by providing clear and understandable reasoning paths that lead to accurate answers.

# C Experiment Details

Experiment datasets. To evaluate PoG's capability in multi-hop knowledge-intensive reasoning tasks, we assess it on four KBQA datasets: three multi-hop datasets (CWQ [38], WebQSP [51], GrailQA [12]) and one single-hop dataset (SimpleQuestions [31]). Additionally, to examine PoG on more general tasks, we include an open-domain QA dataset, WebQuestions. For the evaluation of large datasets such as CWQ, GrailQA, and SimpleQuestions, we utilize a random sample of 1,000 test cases from CWQ and employ the 1,000 samples previously reported by ToG [37] to facilitate a comparison with the SOTA while also minimizing computational costs. Freebase serves as the background knowledge graph for all datasets, which encompasses approximately 88 million entities, 20,000 relations, and 126 million triples [5, 27]. The statistics of the datasets utilized in this study are detailed in Table 10.

Baselines. Inspired by ToG [37], we compare our method with standard prompting (IO), Chain-of-Thought (CoT), and Self-Consistency(SC) promptings with six in-context exemplars and "step-by-step" reasoning chains. For each dataset, we also include previous SOTA works for comparison. For a fair play, we compare with previous SOTA among all prompting-based methods and previous SOTA among all methods respectively. Since ToG is the current SOTA prompting-based method, we directly refer to their results and those of other baselines reported in their paper for comparisons.

Experimental implementation. Leveraging the plug-and-play convenience of our framework, we experiment with two LLMs: GPT-3.5 and GPT-4. We use the OpenAI API to access GPT-3.5 (GPT-3.5-turbo) and GPT-4. Aligning with ToG, we set the temperature parameter to 0.4 during the exploration process (to increase diversity) and to 0 during the reasoning process (to ensure reproducibility). The maximum token length for generation is set to 256. In all experiments, we set both  $W_{\mathrm{max}}$  and  $D_{\mathrm{max}}$  to 3 for beam search. All the experiments are conducted on a machine with Intel Xeon Gold 6248R CPU, Nvidia A5000 GPU and 512GB memory.

Table 10: Statistics of the datasets used in this paper.† denotes we randomly select 1,000 samples from the CWQ test set to create the experiment testing set due to the abundance of test samples. * denotes that we utilize the 1,000 samples reported by ToG [37] to compare with the state-of-the-art.

<table><tr><td>Dataset</td><td>Answer Format</td><td>Test</td><td>Train</td></tr><tr><td>ComplexWebQuestions (CWQ)†</td><td>Entity</td><td>1,000</td><td>27,734</td></tr><tr><td>WebQSP</td><td>Entity/Number</td><td>1,639</td><td>3,098</td></tr><tr><td>GrailQA*</td><td>Entity/Number</td><td>1,000</td><td>44,337</td></tr><tr><td>Simple Question*</td><td>Entity/Number</td><td>1,000</td><td>14,894</td></tr><tr><td>WebQuestions</td><td>Entity/Number</td><td>2,032</td><td>3,778</td></tr></table>

Table 9: Examples of faithful and interpretable reasoning of PoG.  

<table><tr><td>Field</td><td>Content</td></tr><tr><td>Question</td><td>Lou Seal is the mascot for the team that last won the World Series when?</td></tr><tr><td>Answer</td><td>2014 World Series</td></tr><tr><td>Topic Entity</td><td>Lou Seal</td></tr><tr><td>LLM indicator</td><td>&quot;Lou Seal&quot; - mascot for - team - last won - World Series - when</td></tr><tr><td rowspan="2">Split Question</td><td>split_question 1: What team is &quot;Lou Seal&quot; the mascot for?</td></tr><tr><td>split_question 2: When did the last World Series win occur for the team with Lou Seal as the mascot?</td></tr><tr><td>Reasoning Path</td><td>{Lou Seal} → sports.mascot.team → {San Francisco Giants} → sports.sports_team.championships → {2014 World Series, 2012 World Series, 2010 World Series}</td></tr><tr><td rowspan="2">PoG answer</td><td>answer: {2014 World Series}</td></tr><tr><td>reason: From the given path {Lou Seal} → sports.mascot.team → {San Francisco Giants}, {San Francisco Giants} is the team that &quot;Lou Seal&quot; serves as the mascot for. From {San Francisco Giants} → sports.sports_team.championships → {2014 World Series, 2012 World Series, 2010 World Series}, the last World Series won by the {San Francisco Giants} is in {2014}. Therefore, the provided knowledge graph path is sufficient to answer the overall question, and the answer is {2014 World Series}.</td></tr><tr><td>Question Answer</td><td>What European Union country sharing borders with Germany contains the Lejure Municipality?</td></tr><tr><td>Topic Entity</td><td>Denmark
{Germany, Lejure Municipality}</td></tr><tr><td>LLM indicator</td><td>&quot;Lejure Municipality&quot; - contained by - European Union country (answer)- shares borders with-</td></tr><tr><td rowspan="3">Split Question</td><td>&quot;Germany&quot;</td></tr><tr><td>split_question 1: What European Union country shares borders with &quot;Germany&quot;?</td></tr><tr><td>split_question 2: What country contains the Lejure Municipality?</td></tr><tr><td>Reasoning Path</td><td>{Lejure Municipality} → location/administrative_division.country → {Denmark} → loca-
tion.location.containedby → {Eurasia} → location.location.contains → {Germany},
{Lejure Municipality} → location/administrative_division_country → {Denmark} → 
baseLocations.countries.continent → {Europe} → location.location.contains → {Western 
Europe} → location.location.contains → {Germany}</td></tr><tr><td rowspan="2">PoG answer</td><td>answer: {Denmark}</td></tr><tr><td>reason: From {Denmark} → location.location.containedby → {Eurasia} → location.location.contains 
→ {Germany}, Denmark is European Union country shares borders with &quot;Germany&quot;. From {Lejure 
Municipality} → location.administrative_division-country → {Denmark}, Denmark is a country 
contains the Lejure Municipality Both paths lead to Denmark, which is the European Union country 
sharing borders with Germany and containing the Lejure Municipality. Therefore, the provided 
knowledge graph path is sufficient to answer the main question, and the answer is {Denmark}.</td></tr><tr><td>Question Answer</td><td>What Central Tiem Zone continent do the Falkland Islands belong to?</td></tr><tr><td>Topic Entity</td><td>{Falkland Islands, Continent, Central Time Zone}</td></tr><tr><td>LLM indicator</td><td>answer(&quot;Central Time Zone&quot;) - owns - answer(&quot;Continent&quot;)- located contain - &quot;Falkland Islands&quot;</td></tr><tr><td rowspan="2">Split Question</td><td>split_question 1: What continent do the Falkland Islands belong to?</td></tr><tr><td>split_question 2: What Central Time Zone does the continent belong to?</td></tr><tr><td>Reasoning Path</td><td>{Central Time Zone} → time.time-zoneLocations_in.this_time-zone → {Americas} → com-
mon.topic.not able_types → {Continent} → commontopic.notable_types → {Americas} → loca-
tion.location.con tains → {Falkland Islands}</td></tr><tr><td rowspan="2">PoG answer</td><td>answer: {Americas}</td></tr><tr><td>reason: From {Continent} → common-topic.notable_types → {Americas} → location.location.contain-
tains →{ Falkland Islands}, Americas is the continent Falkland Islands belong to. From {Central 
Time Zone} → time.time-zoneLocations_in.this_time-zone → {Americas} → common-topic.no-
table_types → {Continent}. Americas is the Central Time Zone. Therefore, the provided knowledge 
graph path is sufficient to answer the overall question, and the answer is {Americas}.</td></tr></table>

# D SPARQL

This section outlines the pre-defined SPARQL queries used for interacting with the knowledge graph and constructing graphs for our experiments.

# D.1 1-hop Entity and Relation Search

To facilitate the retrieval of 1-hop neighbors of entities within the Freebase Knowledge Graph, we have predefined a SPARQL query. This query is designed to be executed by simply substituting the appropriate ID for the query entity ID. It returns the connected entities' IDs and their associated relations' IDs, indicating whether the connected entity is at the tail or the head of the relation.

```txt
PREFIX ns: <http://rdfs.freebase.com/ns/>   
SELECT ?relation ?connectedEntity ?direction   
WHERE { ns:ID ?relation ?connectedEntity . BIND("tail" AS ?direction) } UNION { ?connectedEntity ?relation ns:ID . BIND("head" AS ?direction) }
```

# D.2 Short Textual Description

The following predefined function implements the retrieval of short textual descriptions,  $\mathcal{D}(.)$ , for converting the identifiers (IDs) of entities or relations into natural language descriptions.

```txt
PREFIX ns: <http://rdfs.freebase.com/ns/>   
SELECT DISTINCT ?tailEntity   
WHERE { ?entity ns:typeobject.name ?tailEntity . FILTER(?entity  $=$  ns:ID) }   
UNION { ?entity <http://www.w3.org/2002/07/ owlsameAs> ?tailEntity . FILTER(?entity  $=$  ns:ID)   
}
```

# D.3 1-hop Subgraph Search

To facilitate subgraph detection in Section 4.1, we implement the 1-hop subgraph detection feature by integrating SPARQL functions described in Appendix D.1 and D.2. The purpose of this function is to retrieve, in a single SPARQL query, the 1-hop neighbors of a given query with their IDs, natural language names, and connected relationships, specifying whether the connected entity is at the tail or the head of the relationship.

```txt
PREFIX ns: <http://rdfs.freebase.com/ns/>   
SELECT ?relation ?connectedEntity ?connectedEntityName ? direction   
WHERE { ns:ID ?relation ?connectedEntity . OPTIONAL { ?connectedEntity ns:typeobject.name ? name . FILTER(lang(?name)  $=$  'en') } BIND(COALESCE(?name，"Unnamed Entity") AS ?connectedEntityName) BIND("tail"AS?direction) } UNION { ?connectedEntity ?relation ns:ID . OPTIONAL { ?connectedEntity ns:typeobject.name ? name . FILTER(lang(?name）  $=$  'en') } BIND(COALESCE(?name，"Unnamed Entity") AS ?connectedEntityName) BIND("head"AS?direction) }
```

# E Prompts

In this section, we detail the prompts required for our main experimental procedures.

# Question Analysis Prompt Template

You will receive a multi-hop question, which is composed of several interconnected queries, along with a list of topic entities that serve as the main keywords for the question. Your task is to break the question into simpler parts, using each topic entity once and provide a Chain of Thought (CoT) that shows how the topic entities are related. Note: Each simpler question should explore how one topic entity connects to others or the answer. The goal is to systematically address each entity to derive the final answer.

In-Context Few-shot

Q: {Query}

Topic Entity: {Topic Entity}

A:

# LLM Supplement Prompt Template

Using the main question, a possibly uncertain chain of thought generated by a language model, some related split questions, paths from the "Related Paths" section, and main topic entities: please first provide three predicted results, and second offer three possible chains of thought that could lead to these results, using the provided knowledge paths and your own knowledge. If any answers are unclear, suggest alternative answers to fill in the gaps in the chains of thought, following the same format as the provided examples.

In-Context Few-shot

Q: {Query}

Topic Entity: {Topic Entity}

Think Indicator:{Think Indicator}

Split Question:{Split Question}

A:

where {Think Indicator}, and {Split Question} are obtained in section 4.1. An indicator example is shown in Figure 2.

# Precise Path Select Prompt Template

Given a main question, a LLM-generated thinking Cot that considers all the entities, a few split questions that you can use one by one and finally obtain the final answer, and the associated retrieved knowledge graph path, {set of entities (with id start with "m.")} -> {set of relationships} -> {set of entities(with id start with "m.")}, Please score and give me the top three lists from the candidate paths set can be highly to be the answer of the question.

In-Context Few-shot

Q: {Query}

Think Indicator:{Think Indicator}

Split Question:{Split Question}

Candidate Paths:{Candidate Paths}

A:

{Candidate Paths} denotes the retrieved reasoning paths Finalpaths to be selected in this request which are formatted as a series of structural sentences:

$$
\{e _ {0 x}, \dots , e _ {0 z} \} \to r _ {1 _ {i}} \to \{e _ {1 x}, \dots , e _ {1 z} \} \to \dots \to r _ {l _ {j}} \to \{e _ {l x}, \dots , e _ {l z} \}
$$

![](images/79e7d41897e2d76eb91b0e7d27f8168d28561a12a790e0536c019264f8b0fbb0.jpg)

$$
\left\{e _ {0 x}, \dots , e _ {0 z} \right\}\rightarrow r _ {1 _ {i}} \rightarrow \left\{e _ {1 x}, \dots , e _ {1 z} \right\}\rightarrow \dots \rightarrow r _ {l _ {j}} \rightarrow \left\{e _ {l x}, \dots , e _ {l z} \right\},
$$

where,  $i$  and  $j$  in  $r_{1_i}, r_{1_j}$  represent the  $i$ -th,  $j$ -th relation from each relation edge in the clustered question subgraph. And  $e$  is constructed by its ID and natural language name  $\mathcal{D}(ID)$ .

# Path Summarizing Prompt Template

Given a main question, an uncertain LLM-generated thinking Cot that considers all the entities, a few split questions that you can use one by one and finally obtain the final answer, the associated accuracy retrieved knowledge paths from the Related paths section, and main topic entities. Your task is to summarize the provided knowledge triple in the Related paths section and generate a chain of thoughts by the knowledge triple related to the main topic entities of the question, which will be used for generating the answer for the main question and split question further. You have to make sure you summarize correctly by using the provided knowledge triple, you can only use the entity with id from the given path and you can not skip in steps.

In-Context Few-shot

Q: {Query}

Think Indicator:{Think Indicator}

Split Question:{Split Question}

Related Paths:{Related Paths}

A:

{Related_Paths} has the same format with the {Candidate_Paths} before.

# Question Answering Evaluation Prompt Template

Given a main question, an uncertain LLM-generated thinking Cot that considers all the entities, a few split questions that you can use and finally obtain the final answer, and the associated retrieved knowledge graph path, {set of entities (with id start with "m.")} -> {set of relationships} -> {set of entities(with id start with "m.")}. Your task is to determine if this knowledge graph path is sufficient to answer the given split question first then the main question. If it's sufficient, you need to respond {Yes} and provide the answer to the main question. If the answer is obtained from the given knowledge path, it should be the entity name from the path. Otherwise, you need to response {No}, then explain the reason.

In-Context Few-shot

Q: {Query}

Think Indicator:{Think Indicator}

Split Question:{Split Question}

Related Paths:{Related Paths}

A:

# Question Answering Generation Prompt Template

Given a main question, an uncertain LLM-generated thinking Cot that considers all the entities, a few split questions that you can use one by one and finally obtain the final answer, and the associated retrieved knowledge graph path, {set of entities (with id start with "m.")} -> {set of relationships} -> {set of entities(with id start with "m.")}, Your task is to generate the answer based on the given knowledge graph path and your own knowledge.

In-Context Few-shot

Q: {Query}

Think Indicator:{Think Indicator}

Split Question:{Split Question}

Related Paths:{Related Paths}

A: