Timon Harz

December 12, 2024

How to Build a Graph RAG System: Step-by-Step Guide

Unravel the power of Graph RAG, the cutting-edge framework that connects facts to deliver better answers. Explore its potential and learn how to implement it effectively.

Graph RAG is the talk of the tech town, and chances are, you’ve heard about it too. But what exactly is Graph RAG, and why has it gained so much traction? In this article, we’ll dive into what Graph RAG is, why it’s essential, and how to implement it using LlamaIndex. Let’s get started!

The Evolution from LLMs to RAG Systems

Large Language Models (LLMs) have revolutionized AI, but they come with a significant limitation: static knowledge. These models rely solely on the data they were trained on, which can lead to hallucinations—producing incorrect or fabricated information. To overcome this, Retrieval-Augmented Generation (RAG) systems were introduced.

Unlike standalone LLMs, RAG systems retrieve information from external knowledge bases at query time. By incorporating fresh and relevant data, RAG systems generate more accurate responses. Traditional RAG systems rely on text embeddings to fetch specific pieces of information, which makes them powerful. However, they also face certain challenges.

If you’ve worked on RAG-related projects, you’ll recognize this: the quality of responses depends heavily on the clarity and precision of the input query. But an even bigger challenge is the lack of effective reasoning across multiple documents.

Why Is Reasoning Across Documents Important?

Let’s break this down with an example. Imagine asking the system:

“Who were the key contributors to the discovery of DNA’s double-helix structure, and what role did Rosalind Franklin play?”

In a traditional RAG setup, the system might retrieve the following pieces of information:

Document 1: “James Watson and Francis Crick proposed the double-helix structure in 1953.”
Document 2: “Rosalind Franklin’s X-ray diffraction images were critical in identifying DNA’s helical structure.”
Document 3: “Maurice Wilkins shared Franklin’s images with Watson and Crick, which contributed to their discovery.”

The issue? Traditional RAG systems treat these documents as isolated pieces of information. They fail to connect the dots effectively, resulting in fragmented responses such as:

“Watson and Crick proposed the structure, and Franklin’s work was important.”

This response falls short, missing the crucial relationships between contributors. That’s where Graph RAG comes in! Instead of treating information as isolated pieces, Graph RAG structures the retrieved data as a graph—where each document or fact is represented as a node, and the relationships between them are captured as edges.

Here’s how Graph RAG approaches the same query:

Nodes: Represent individual facts (e.g., “Watson and Crick proposed the structure,” “Franklin contributed critical X-ray images”).
Edges: Represent the connections between facts (e.g., “Franklin’s images → shared by Wilkins → influenced Watson and Crick”).

By reasoning through these interconnected nodes, Graph RAG generates a cohesive and insightful response, such as:

“The discovery of DNA’s double-helix structure in 1953 was primarily led by James Watson and Francis Crick. However, this breakthrough heavily relied on Rosalind Franklin’s X-ray diffraction images, which were shared with them by Maurice Wilkins.”

This ability to combine information from multiple sources and answer broader, more complex questions is what makes Graph RAG so popular.
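
To make the idea concrete, here is a minimal sketch of the DNA example as a tiny graph. It uses entities as nodes and labeled, directed edges between them; the edge labels and the find_path helper are illustrative assumptions, not part of any Graph RAG library.

```python
from collections import deque

# Entities as nodes; each entry lists (relation, target) edges.
# Labels are illustrative, paraphrased from the three example documents.
edges = {
    "Rosalind Franklin": [("produced", "X-ray diffraction images")],
    "X-ray diffraction images": [("shared by", "Maurice Wilkins")],
    "Maurice Wilkins": [("shared images with", "Watson and Crick")],
    "Watson and Crick": [("proposed", "Double-helix structure")],
}

def find_path(graph, start, goal):
    """Breadth-first search returning the chain of (source, relation, target) hops."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None  # no chain of edges connects start to goal

path = find_path(edges, "Rosalind Franklin", "Double-helix structure")
for source, relation, target in path:
    print(f"{source} --{relation}--> {target}")
```

Following the edges recovers the chain from Franklin to the final discovery, which is exactly the connection that isolated document retrieval misses.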

The Graph RAG Pipeline

Let’s dive into the Graph RAG pipeline, inspired by the approach detailed in Microsoft Research’s paper, “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.”

Step 1: Source Documents → Text Chunks
Large Language Models (LLMs) can only process a limited amount of text at once. To ensure accuracy and prevent missing critical information, large documents are broken into smaller, manageable “chunks” for analysis.
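
As a rough illustration of chunking with overlap, the sketch below splits text into fixed-size word windows. It is a simplified stand-in that works on words rather than tokens or sentences; the chunk_words function and its default sizes are assumptions chosen for readability.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into word windows; each window repeats `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = "NASA launched the spacecraft in 2023 to study the surface of Mars in detail"
chunks = chunk_words(sample)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of processing a few words twice.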

Step 2: Text Chunks → Element Instances
Each text chunk is processed to identify graph nodes (entities) and edges (relationships). For example, from a news article, the system might identify:
Node: “NASA” (entity)
Node: “spacecraft” (entity)
Edge: “launched” (relationship connecting NASA and spacecraft)
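
One way to picture the extracted elements is as small typed records. The sketch below is hand-written for illustration: the Entity and Relationship classes and the entity types are assumptions, and in the real pipeline an LLM produces this data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    entity_type: str

@dataclass(frozen=True)
class Relationship:
    source: str
    target: str
    relation: str

# Hand-written extraction result for one chunk (illustrative values).
entities = [Entity("NASA", "Organization"), Entity("Spacecraft", "Vehicle")]
relationships = [Relationship("NASA", "Spacecraft", "launched")]

# Sanity check: every relationship endpoint should refer to an extracted entity.
known = {entity.name for entity in entities}
dangling = [
    rel for rel in relationships
    if rel.source not in known or rel.target not in known
]
print(dangling)
```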

Step 3: Element Instances → Element Summaries
The identified elements are then summarized into concise descriptions. For instance:
Node summary: “NASA is a space agency responsible for space exploration missions.”
Edge summary: “NASA launched the spacecraft in 2023.”
This step ensures the graph is both detailed and easy to interpret.

Step 4: Element Summaries → Graph Communities
The graph is often too large for direct analysis, so it’s divided into smaller communities using clustering algorithms like Leiden. These communities group closely related nodes and edges. For example:
Community 1: “Space Exploration” (e.g., nodes like “NASA,” “Spacecraft,” “Mars Rover”)
Community 2: “Environmental Science” (e.g., nodes like “Climate Change,” “Carbon Emissions,” “Sea Levels”)
This organization simplifies the data, highlighting key themes and connections.
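
Once a clustering algorithm has assigned each node to a community, grouping is simple bookkeeping. The sketch below assumes the assignments already exist as (node, community id) pairs; the ids and memberships are invented to match the example above, and no clustering is performed here.

```python
from collections import defaultdict

# Hypothetical (node, community_id) assignments, as a clustering step might return them.
assignments = [
    ("NASA", 0), ("Spacecraft", 0), ("Mars Rover", 0),
    ("Climate Change", 1), ("Carbon Emissions", 1), ("Sea Levels", 1),
]

def group_by_community(pairs):
    """Map each community id to the sorted list of its member nodes."""
    grouped = defaultdict(list)
    for node, community_id in pairs:
        grouped[community_id].append(node)
    return {cid: sorted(members) for cid, members in grouped.items()}

communities = group_by_community(assignments)
for community_id, members in communities.items():
    print(community_id, members)
```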

Step 5: Graph Communities → Community Summaries
Each community is summarized to provide an overview of its content. For example:
Space Exploration community: “Key missions, discoveries, and organizations like NASA and SpaceX.”
These summaries allow the AI to answer broader questions or explore overarching themes.

Step 6: Community Summaries → Community Answers → Global Answer
The final step involves using the community summaries to answer user queries:
Query the Data: A user asks, “What are the main impacts of climate change?”
Community Analysis: The AI reviews summaries from relevant communities.
Generate Partial Answers: Each community contributes a partial answer, such as “Rising sea levels threaten coastal cities.” or “Disrupted agriculture due to unpredictable weather.”
Combine into a Global Answer: The partial answers are synthesized into a comprehensive response, such as “Climate change impacts include rising sea levels, disrupted agriculture, and an increased frequency of natural disasters.”

This pipeline is designed to produce interconnected, meaningful answers from complex datasets.
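
The final step follows a map-reduce pattern, which can be sketched as below. To keep the example runnable offline, the two LLM calls are replaced by simple stand-in functions (keyword_answer and join_answers); these stand-ins and the sample summaries are assumptions, not real model output.

```python
from typing import Callable, Dict, List

def global_answer(
    query: str,
    community_summaries: Dict[str, str],
    answer_fn: Callable[[str, str], str],
    combine_fn: Callable[[List[str]], str],
) -> str:
    """Map: one partial answer per community. Reduce: combine the non-empty ones."""
    partial_answers = [
        answer_fn(summary, query) for summary in community_summaries.values()
    ]
    return combine_fn([answer for answer in partial_answers if answer])

# Stand-in for the per-community LLM call: answers only when the topic matches.
def keyword_answer(summary: str, query: str) -> str:
    relevant = "climate" in query.lower() and "climate" in summary.lower()
    return summary if relevant else ""

# Stand-in for the aggregation LLM call.
def join_answers(answers: List[str]) -> str:
    return " ".join(answers)

summaries = {
    "environment": "Climate change raises sea levels and disrupts agriculture.",
    "space": "NASA and SpaceX run key exploration missions.",
}
result = global_answer(
    "What are the main impacts of climate change?",
    summaries,
    keyword_answer,
    join_answers,
)
print(result)
```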

Ensuring Clear and Accurate Responses

Grounding each partial answer in a community summary helps keep the final answer accurate, detailed, and easy to follow.

Step-by-Step Implementation of Graph RAG with LlamaIndex

You can either build your own Python-based implementation or leverage frameworks like LangChain or LlamaIndex. In this guide, we’ll use the LlamaIndex baseline code from their website, explained in a beginner-friendly manner. Additionally, I encountered a parsing issue in the original code, which I’ll explain along with my solution.

Step 1: Install Dependencies

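First, install the required libraries for building the pipeline:

llama-index: Provides the document, node-parsing, LLM, and property-graph components used throughout this guide.
graspologic: Used for graph algorithms such as Hierarchical Leiden for community detection.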
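
Before running the rest of the guide, it can help to check which packages are still missing. The sketch below is one way to do that; the package list is an assumption based on the imports used in this article, and you may need to pin versions for your environment.

```python
from importlib.util import find_spec

# Import name -> pip package name, based on the imports used in this guide.
REQUIRED = {
    "llama_index": "llama-index",
    "graspologic": "graspologic",
    "networkx": "networkx",
    "pandas": "pandas",
    "nest_asyncio": "nest-asyncio",
}

def missing_packages(required):
    """Return pip names for top-level modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if find_spec(module) is None
    ]

missing = missing_packages(REQUIRED)
if missing:
    print("pip install " + " ".join(missing))
else:
    print("All dependencies are importable.")
```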

Step 2: Load and Preprocess Data

Load a sample dataset of news articles and break it into smaller parts for easier processing. For this example, we’ll limit the dataset to 50 samples. Each row of data is converted into a Document object.

import pandas as pd
from llama_index.core import Document

# Load sample dataset
news = pd.read_csv("https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv")[:50]

# Convert data into LlamaIndex Document objects
documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for _, row in news.iterrows()
]

Step 3: Split Text into Nodes

To process the data effectively, use SentenceSplitter to divide documents into manageable chunks with slight overlaps to avoid missing important context.

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,  # Ensures slight overlap between chunks
)
nodes = splitter.get_nodes_from_documents(documents)

Step 4: Configure the LLM, Prompt, and Graph RAG Extractor

Set up the LLM (e.g., GPT-4) and configure the extractor to analyze chunks, extract entities, and identify relationships.

import os

from llama_index.llms.openai import OpenAI

# Replace the placeholder with your key, or set OPENAI_API_KEY in your shell instead
os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
llm = OpenAI(model="gpt-4")

The GraphRAGExtractor utilizes:

The LLM to analyze chunks.
A Prompt Template to guide the LLM in extracting entities and relationships.
A Parsing Function (parse_fn) to structure the LLM output into entities and relationships.

Here’s the process:

Each chunk (node) is analyzed by the LLM with the specified prompt.
The LLM identifies entities, their types, and relationships.
The parsing function converts this data into structured objects (EntityNode for entities and Relation for relationships).
Metadata is attached to text chunks, forming the basis for building knowledge graphs or handling queries.

Parsing Issue and Solution

In the original implementation, the parse_fn failed to extract entities and relationships reliably due to overly rigid and complex regular expressions. The LLM response often had inconsistent formatting and line breaks, which the regex patterns couldn’t handle.

Solution: Simplify the regex patterns to match the response’s key-value structure more flexibly.

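import re
from typing import List, Tuple

# Simplified regex patterns for parsing entities and relationships.
# Each field is expected as "key: value" on its own line. The last group is a
# greedy (.+) so it captures the rest of the line; a trailing lazy (.+?) would
# stop after a single character.
entity_pattern = r'entity_name:\s*(.+?)\s*entity_type:\s*(.+?)\s*entity_description:\s*(.+)'
relationship_pattern = r'source_entity:\s*(.+?)\s*target_entity:\s*(.+?)\s*relation:\s*(.+?)\s*relationship_description:\s*(.+)'

def parse_fn(response_str: str) -> Tuple[List[tuple], List[tuple]]:
    """Return (entities, relationships) as lists of string tuples."""
    entities = re.findall(entity_pattern, response_str)
    relationships = re.findall(relationship_pattern, response_str)
    return entities, relationships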

The rest of the GraphRAGExtractor class and prompt template can remain unchanged.

With this adjustment, the system can parse LLM responses more reliably, ensuring accurate extraction of entities and relationships for further processing.
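
When adapting patterns like these, it is worth testing them against a sample response before running the full pipeline. The sketch below uses a made-up response and compares two variants of an entity pattern: one ending in a lazy group and one ending in a greedy group. The sample text and variable names are assumptions for illustration.

```python
import re

# Hypothetical LLM response with one "key: value" field per line.
sample_response = """entity_name: NASA
entity_type: Organization
entity_description: NASA is a space agency responsible for space exploration missions.
"""

# Lazy final group: with nothing required after it, (.+?) stops after one character.
lazy_tail = r"entity_name:\s*(.+?)\s*entity_type:\s*(.+?)\s*entity_description:\s*(.+?)\s*"
# Greedy final group: (.+) runs to the end of the line.
greedy_tail = r"entity_name:\s*(.+?)\s*entity_type:\s*(.+?)\s*entity_description:\s*(.+)"

lazy_match = re.findall(lazy_tail, sample_response)
greedy_match = re.findall(greedy_tail, sample_response)
print(lazy_match)
print(greedy_match)
```

The lazy variant returns only the first character of the description, while the greedy variant returns the full line, so the way a pattern ends matters as much as the field names it looks for.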

import asyncio
import nest_asyncio
 
nest_asyncio.apply()
 
from typing import Any, List, Callable, Optional, Union, Dict
from IPython.display import Markdown, display
 
from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.llms.llm import LLM
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (
    DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field


class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.
 
    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.
 
    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """
 
    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int
 
    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings
 
        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)
 
        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )
 
    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"
 
    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )
 
    async def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")
 
        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []
 
        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            metadata[
                "entity_description"
            ] = description  # Not used in the current implementation. But will be useful in future work.
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=metadata
            )
            existing_nodes.append(entity_node)
 
        metadata = node.metadata.copy()
        for triple in entities_relationship:
            # parse_fn yields (source, target, relation, description), matching the prompt order
            subj, obj, rel, description = triple
            subj_node = EntityNode(name=subj, properties=metadata)
            obj_node = EntityNode(name=obj, properties=metadata)
            metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj_node.id,
                target_id=obj_node.id,
                properties=metadata,
            )
 
            existing_nodes.extend([subj_node, obj_node])
            existing_relations.append(rel_node)
 
        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node
 
    async def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))
 
        return await run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.
 
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity")
 
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
 
Format each relationship as ("relationship")
 
3. When finished, output.
 
-Real Data-
######################
text: {text}
######################
output:"""

kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)

Step 5: Build the Graph Index

The PropertyGraphIndex extracts entities and relationships from text using kg_extractor and stores them as nodes and edges in the GraphRAGStore.

import re
from llama_index.core.graph_stores import SimplePropertyGraphStore
import networkx as nx
from graspologic.partition import hierarchical_leiden
 
from llama_index.core.llms import ChatMessage


class GraphRAGStore(SimplePropertyGraphStore):
    community_summary = {}
    max_cluster_size = 5
 
    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        response = OpenAI().chat(messages)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response
 
    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)
 
    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        for node in self.graph.nodes.values():
            nx_graph.add_node(str(node))
        for relation in self.graph.relations.values():
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph
 
    def _collect_community_info(self, nx_graph, clusters):
        """Collect detailed information for each node based on their community."""
        community_mapping = {item.node: item.cluster for item in clusters}
        community_info = {}
        for item in clusters:
            cluster_id = item.cluster
            node = item.node
            if cluster_id not in community_info:
                community_info[cluster_id] = []
 
            for neighbor in nx_graph.neighbors(node):
                if community_mapping[neighbor] == cluster_id:
                    edge_data = nx_graph.get_edge_data(node, neighbor)
                    if edge_data:
                        detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                        community_info[cluster_id].append(detail)
        return community_info
 
    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)
 
    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary

from llama_index.core import PropertyGraphIndex
 
index = PropertyGraphIndex(
    nodes=nodes,
    property_graph_store=GraphRAGStore(),
    kg_extractors=[kg_extractor],
    show_progress=True,
)

Output:

Extracting paths from text: 100%|██████████| 50/50 [02:51<00:00,  3.43s/it]
Generating embeddings: 100%|██████████| 1/1 [00:01<00:00,  1.53s/it]
Generating embeddings: 100%|██████████| 4/4 [00:01<00:00,  2.27it/s]

Step 6: Detect Communities and Generate Summaries

Leverage graspologic's Hierarchical Leiden algorithm to detect communities within the graph and create summaries. Communities represent clusters of nodes (entities) that are densely interconnected but have sparse connections to other clusters. The algorithm optimizes a metric called modularity, which evaluates the quality of these divisions. In the GraphRAGStore above, this happens in build_communities(), which get_community_summaries() calls automatically the first time summaries are requested.

Note: Isolated nodes (nodes without relationships) are excluded by the Leiden algorithm, and you may see a warning about them when communities are built. This is expected when some nodes lack meaningful connections and is not a cause for concern.
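
To give a feel for what modularity measures, here is a small pure-Python sketch that scores two ways of dividing a toy graph. It implements the standard modularity formula for an undirected, unweighted graph and is not graspologic's implementation; the modularity function and the toy graph are assumptions for illustration.

```python
from collections import defaultdict

def modularity(edges, communities):
    """Modularity Q for an undirected, unweighted graph.

    Q = sum over communities of internal_edges / m - (degree_sum / (2 * m)) ** 2,
    where m is the total number of edges.
    """
    m = len(edges)
    community_of = {
        node: index
        for index, members in enumerate(communities)
        for node in members
    }
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in range(len(communities))
    )

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
single_group = modularity(edges, [{0, 1, 2, 3, 4, 5}])
print(good_split, single_group)
```

Splitting the graph along the bridge scores higher than lumping every node together, which is the kind of division a modularity-based algorithm favors.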

Step 7: Query the Graph

Utilize the GraphRAGQueryEngine to query the processed data. Upon receiving a query, the engine retrieves relevant community summaries from the GraphRAGStore. For each summary, the engine calls the generate_answer_from_summary method, using the LLM to provide a tailored response to the query. The individual responses are then combined into a coherent final output via the aggregate_answers method.

Implementation

from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM

class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    llm: LLM

    def custom_query(self, query_str: str) -> str:
        """Generate a detailed response to a query using community summaries."""
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(summary, query_str)
            for _, summary in community_summaries.items()
        ]
        return self.aggregate_answers(community_answers)

    def generate_answer_from_summary(self, community_summary, query):
        """Generate a community-specific answer using the LLM."""
        prompt = (
            f"Using this community summary: {community_summary}, "
            f"respond to the query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(role="user", content="Provide an answer based on this summary."),
        ]
        response = self.llm.chat(messages)
        return re.sub(r"^assistant:\s*", "", str(response)).strip()

    def aggregate_answers(self, community_answers):
        """Combine multiple community answers into a unified response."""
        prompt = "Summarize the following responses into a single, cohesive answer."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(role="user", content=f"Responses: {community_answers}"),
        ]
        final_response = self.llm.chat(messages)
        return re.sub(r"^assistant:\s*", "", str(final_response)).strip()

Example Query Execution

query_engine = GraphRAGQueryEngine(
    graph_store=index.property_graph_store,
    llm=llm
)

response = query_engine.query("What are news related to the financial sector?")
display(Markdown(f"{response.response}"))

Output Example

The majority of the provided summaries do not mention news related to the financial sector. However, a few exceptions include:

- Matt Pincus, through MUSIC, has invested in Soundtrack Your Brand.
- Nirmal Bang issued a Buy Rating for Tata Chemicals Ltd. (TTCH).
- Coinbase Global Inc. is involved in a legal conflict with the SEC and has undertaken financial transactions, such as issuing Convertible Senior Notes.
- Deutsche Bank recommended buying shares in Allegiant Travel and SkyWest, suggesting positive investment opportunities.
- Coinbase Global, Inc. repurchased 0.50% Convertible Senior Notes due 2026, indicating strategic financial management

This step synthesizes detailed, contextual answers while leveraging community insights for precision and clarity.
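
A back-of-the-envelope count shows where the LLM calls go in this implementation. Based on the code above, indexing makes one extraction call per chunk and one summary call per community, and each query makes one call per community plus one aggregation call. The helper names below are hypothetical, and the community count is a made-up figure.

```python
def indexing_llm_calls(num_chunks: int, num_communities: int) -> int:
    """One extraction call per chunk plus one summary call per community."""
    return num_chunks + num_communities

def query_llm_calls(num_communities: int) -> int:
    """One partial answer per community plus one aggregation call."""
    return num_communities + 1

# 50 chunks matches the progress output shown earlier; 20 communities is an assumed value.
print(indexing_llm_calls(50, 20), query_llm_calls(20))
```

Because every query touches every community summary, query cost grows with the number of communities rather than with the number of retrieved chunks.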

Wrapping Up

And that's a wrap! I hope you found this article insightful. Graph RAG is a powerful approach that enables you to tackle both precise factual queries and complex abstract questions by analyzing the relationships and structures within your data. While it’s still in its early stages and comes with challenges—particularly higher token utilization compared to traditional RAG—it represents a significant step forward. I’m excited to see how this field evolves in the future. If you have any thoughts, questions, or suggestions, feel free to share them in the comments below!
