Transforming unstructured data into structured knowledge with Retrieval-Augmented Generation
Build Knowledge Graph RAG For Data Analysis
This project designs and implements a generative AI system to automate financial market intelligence. The solution ingests real-time, unstructured data from diverse sources, including news outlets and market feeds. A structured knowledge graph is constructed using a LlamaIndex SimplePropertyGraphStore, dynamically modeling entities and their complex interrelationships. This graph serves as the foundational knowledge base for a sophisticated Retrieval-Augmented Generation (RAG) pipeline. The integrated system delivers context-aware, analytical responses to complex user queries, transforming raw data into actionable strategic insights for investment decision-making.
Complete Knowledge Graph RAG Pipeline
This comprehensive guide walks you through building an intelligent financial analysis system that transforms unstructured data into actionable insights using Knowledge Graph RAG technology.
1. Data Collection: Automated web scraping & dataset creation
2. Knowledge Graph: Entity extraction & relationship mapping
3. Intelligent Retrieval: Context-aware query processing
4. Financial Insights: Actionable investment intelligence
Tools & Tech Stack
Language: Python 3.10
Model: OpenAI
APIs & Tools: Google Cloud Custom Search API
1. Create a Virtual Environment
Open the Integrated Terminal from the top menu: View > Terminal
Create a virtual environment by running this command:
python -m venv venv
Activate the virtual environment by running this command:
.\venv\Scripts\activate
2. Install Required Libraries
This command installs all required dependencies specified in the requirements.txt file:
pip install -r requirements.txt
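The authoritative list lives in the project's requirements.txt; as a rough, illustrative guide, the pipeline described below depends on packages along these lines:
llama-index                      # PropertyGraphIndex, SimplePropertyGraphStore, SimpleDirectoryReader
llama-index-llms-openai          # OpenAI LLM integration for LlamaIndex
llama-index-embeddings-openai    # OpenAIEmbedding for vector representations
openai                           # GPT-4 search query generation
requests                         # fetching webpage content
beautifulsoup4                   # HTML parsing
nest-asyncio                     # running async indexing inside notebooks (assumption)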
3. Importing Dependencies
Every AI model is built on the shoulders of powerful libraries. This section covers the essential tools for data loading, graph processing, and LLM connectivity that form the foundation of our application.
This is where we provide our AI model with access to external intelligence. We configure the API keys for Google Search and OpenAI, while setting up the necessary environment for asynchronous operations.
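A minimal configuration sketch, assuming the API credentials live in environment variables (the names OPENAI_API_KEY, GOOGLE_API_KEY, and GOOGLE_CSE_ID are illustrative) and that nest_asyncio handles the async setup:
import os
import openai
import nest_asyncio

# Allow nested event loops so async indexing works inside notebooks (assumption: notebook environment)
nest_asyncio.apply()

# OpenAI key for query generation, embeddings, and entity extraction
openai.api_key = os.environ["OPENAI_API_KEY"]

# Google Custom Search credentials (API key + Programmable Search Engine ID)
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
GOOGLE_CSE_ID = os.environ["GOOGLE_CSE_ID"]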
search_with_google_api(query)
Functionality: Takes a search query, calls Google's Custom Search API, and returns search results as structured data.
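The project's own implementation is not reproduced here; a minimal sketch against Google's Custom Search JSON API, reusing the credential variables assumed in the configuration above, might look like this:
import requests

def search_with_google_api(query, num_results=10):
    """Query the Google Custom Search JSON API and return the raw result items."""
    params = {
        "key": GOOGLE_API_KEY,   # assumed to be configured as shown earlier
        "cx": GOOGLE_CSE_ID,
        "q": query,
        "num": num_results,
    }
    response = requests.get("https://www.googleapis.com/customsearch/v1", params=params, timeout=10)
    if response.status_code != 200:
        print(f"Search request failed (Status Code: {response.status_code})")
        return []
    # Each item carries 'title', 'link', and 'snippet', matching what the downstream functions expect
    return response.json().get("items", [])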
generate_search_queries(user_input)
Purpose: Uses GPT-4 to generate diverse search queries for comprehensive research
Integrates with: OpenAI GPT-4
def generate_search_queries(user_input):
    """
    Generates a list of 5-7 detailed and relevant search queries for financial sentiment analysis
    based on the user's input, such as a target sector, field, or region.
    """
    prompt = f"""You are a financial analyst and search query expert. Based on the following user input, generate a list of 5-7 search queries
for financial sentiment analysis based on user input. Ensure the queries cover diverse aspects of the topic, including sector-specific trends,
regional financial overviews, and broader financial landscapes. The queries should focus on extracting data relevant to sentiment
and performance analysis.

User Input: {user_input}

Strictly output the queries as a python list of strings. Do not add any additional comments."""

    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert in generating search queries for financial sentiment analysis."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=200,
    )

    # Extract and clean up the list of queries
    queries = response.choices[0].message.content.strip()
    return eval(queries)
Functionality: Takes user input (such as a sector, field, or region), uses GPT-4 to generate targeted search queries for financial sentiment analysis, and returns them as a Python list.
Input: User input describing a financial sector, field, or region
Processing:
Constructs a detailed prompt for GPT-4
Calls OpenAI's API with the prompt
Extracts and processes the response
Output: A Python list containing 5-7 targeted search queries
Integration: Works seamlessly with the search_with_google_api function to create a complete search pipeline
fetch_full_content(url)
Purpose: Extracts complete textual content from webpages by parsing all paragraph elements
Integrates with: BeautifulSoup, requests
def fetch_full_content(url):
    """Fetches the full content of a webpage given its URL."""
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
        )
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            paragraphs = soup.find_all("p")
            full_text = "\n".join([p.get_text() for p in paragraphs])
            return full_text.strip() if full_text else None
        else:
            print(f"Error: Unable to fetch content from {url} (Status Code: {response.status_code})")
            return None
    except Exception as e:
        print(f"Error fetching content from {url}: {e}")
        return None
Functionality: Takes a URL, fetches the webpage content using proper headers, extracts all paragraph text using BeautifulSoup, and returns clean, structured text for analysis.
Input: Webpage URL
Processing:
Uses realistic browser headers to avoid blocking
Implements timeout protection (10 seconds)
Parses HTML with BeautifulSoup
Extracts all paragraph (<p>) elements
Joins text content with newline separators
Output: Clean, extracted text content or None if extraction fails
Error Handling: Comprehensive exception handling with descriptive error messages
Integration: Works with search results from search_with_google_api to build complete dataset
Key Features:
Anti-blocking Measures: Uses realistic User-Agent headers to mimic browser requests
Robust Parsing: Leverages BeautifulSoup for reliable HTML parsing
Content Focus: Extracts paragraph text which typically contains the main content
Performance: 10-second timeout prevents hanging on slow responses
Error Resilience: Graceful handling of network issues and parsing errors
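A quick usage sketch of fetch_full_content (the URL is purely illustrative):
text = fetch_full_content("https://example.com/ev-market-outlook")
if text:
    print(text[:300])  # preview the first 300 characters of extracted content
else:
    print("No extractable paragraph content found.")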
def create_dataset_from_queries(queries, directory="dataset"):
    """Process search queries and save results as text files in the same directory."""
    if not os.path.exists(directory):
        os.makedirs(directory)

    file_count = 1  # To ensure unique filenames across all queries
    for query in queries:
        print(f"Processing query: {query}")
        valid_count = 0
        page_number = 1

        while valid_count < 10:
            print(f"Fetching search results, page {page_number}...")
            results = search_with_google_api(query + f"&start={page_number * 10}")
            if not results:
                print("No more results found. Try refining the query.")
                break

            for result in results:
                if valid_count >= 10:
                    break  # Stop when 10 valid documents are saved
                title = result["title"]
                link = result["link"]
                snippet = result.get("snippet", "No snippet")

                # Fetch full content of the link
                full_content = fetch_full_content(link)
                if full_content:  # Save only if content is valid
                    filename = f"{directory}/doc_{file_count}.txt"
                    with open(filename, "w", encoding="utf-8") as f:
                        f.write(f"Query: {query}\n")
                        f.write(f"Title: {title}\n")
                        f.write(f"Link: {link}\n")
                        f.write(f"Snippet: {snippet}\n\n")
                        f.write(f"Full Content:\n{full_content}")
                    print(f"Saved: {filename}")
                    valid_count += 1
                    file_count += 1
                else:
                    print(f"Skipped: {link} (No valid content)")

            page_number += 1  # Move to the next page of results

    print(f"Finished processing all queries. Total files saved: {file_count - 1}")
Functionality: Orchestrates the complete data pipeline - processes search queries, fetches multiple pages of results, extracts full content, and saves structured datasets with metadata.
Key Integration Point:
This function serves as the bridge between AI-generated queries and the web scraping pipeline, ensuring seamless data flow from conceptual queries to structured datasets.
Input: List of search queries generated by generate_search_queries()
Processing:
Creates output directory if it doesn't exist
Processes each query sequentially
Fetches multiple pages of search results using pagination
Extracts full content from each valid URL
Saves structured data files with metadata
Output: Multiple text files in the dataset directory, each containing:
Original search query
Page title and URL
Search result snippet
Full extracted content
Quality Control: Only saves documents with valid, extractable content
Pipeline Integration:
Query Processing: Takes AI-generated queries and processes them systematically
Pagination Handling: Automatically fetches multiple result pages using Google's &start= parameter
Content Validation: Integrates with fetch_full_content() to ensure data quality
Structured Storage: Saves files with consistent naming and metadata format
Progress Tracking: Provides real-time feedback on processing status
Each file contains structured information ready for knowledge graph construction and RAG pipeline processing.
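Based on the write logic above, each saved file follows this layout (placeholders shown instead of real values):
Query: <original search query>
Title: <page title>
Link: <page URL>
Snippet: <search result snippet>

Full Content:
<full extracted paragraph text>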
6. Complete System Integration & Execution
This example demonstrates the end-to-end workflow of the financial sentiment analysis system, from user input to dataset creation.
End-to-End System Execution
Purpose: Demonstrates the complete integration of all system components to transform user input into a structured dataset
# User provides financial analysis topic
user_input = "Financial sentiment analysis for the electric vehicle sector in the US"

# Step 1: Generate targeted search queries using AI
queries = generate_search_queries(user_input)
print("Generated Queries:")
for i, query in enumerate(queries, 1):
    print(f" {i}. {query}")

# Step 2: Execute complete dataset creation pipeline
create_dataset_from_queries(queries)
Workflow: This code demonstrates the complete automation pipeline that transforms a simple user request into a comprehensive dataset for financial analysis.
Step 1 - User Input: Define the financial analysis topic
user_input = "Financial sentiment analysis for the electric vehicle sector in the US"
This integrated execution demonstrates how the system transforms a simple user request into a comprehensive, structured dataset ready for knowledge graph construction and advanced financial sentiment analysis using RAG pipelines.
System Architecture Overview
1. Data Ingestion Layer: Google Custom Search API, Web Content Extraction, Query Generation (GPT-4)
2. Knowledge Graph Layer: SimplePropertyGraphStore, Entity Extraction, Relationship Mapping
3. Retrieval Layer: Graph-based Retrieval, Semantic Search, Context-Aware Queries
Data Flow: User Input → Query Generation → Web Scraping → Knowledge Graph → Intelligent Retrieval
All components are fully integrated and operational.
7. Knowledge Graph Construction with PropertyGraphIndex
Transform unstructured text into structured knowledge with entity extraction and relationship mapping
Knowledge Graph Initialization
Purpose: Creates a structured knowledge graph from the collected dataset using entity extraction and embedding models
# Initialize in-memory graph store
graph_store = SimplePropertyGraphStore()

# Load documents from dataset directory
documents = SimpleDirectoryReader("dataset").load_data()

# Create Property Graph Index with entity extraction
index = PropertyGraphIndex.from_documents(
    documents,
    embed_model=OpenAIEmbedding(model_name="text-embedding-3-small"),
    kg_extractors=[
        SchemaLLMPathExtractor(llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0))
    ],
    property_graph_store=graph_store,  # Uses in-memory store
    show_progress=True,
    use_async=True,
)

print("Knowledge Graph created successfully with SimplePropertyGraphStore!")
print("No external database required!")
Components:
SimplePropertyGraphStore: In-memory graph database for storing entities and relationships
No external database setup required
Efficient for medium-sized datasets
SimpleDirectoryReader: Loads all text files from the dataset directory
Automatically processes multiple document formats
Maintains document metadata
PropertyGraphIndex Configuration:
Embedding Model: OpenAI's text-embedding-3-small for vector representations
Entity Extraction: SchemaLLMPathExtractor with GPT-3.5-turbo
Progress Tracking: Real-time progress indicators
Async Processing: Parallel processing for faster indexing
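As configured above, SchemaLLMPathExtractor lets the LLM infer entity and relation types freely; it can also be constrained to a fixed schema via possible_entities and possible_relations. A sketch with illustrative financial labels (the specific labels are assumptions, not part of the project's configuration):
from typing import Literal
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
from llama_index.llms.openai import OpenAI

# Illustrative entity and relation labels for a financial knowledge graph
entities = Literal["COMPANY", "SECTOR", "METRIC", "REGION"]
relations = Literal["INVESTS_IN", "COMPETES_WITH", "OPERATES_IN", "REPORTS"]

kg_extractor = SchemaLLMPathExtractor(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0),
    possible_entities=entities,
    possible_relations=relations,
    strict=False,  # allow triples outside the schema rather than discarding them
)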
Knowledge Graph Structure:
Nodes: Entities extracted from financial texts (companies, sectors, metrics)
Edges: Relationships between entities (invests_in, competes_with, operates_in)
Properties: Attributes and metadata for each entity
Embeddings: Vector representations for semantic search
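To inspect the extracted nodes and edges directly, the in-memory graph store can be exported to an interactive HTML view (a sketch; requires the networkx and pyvis packages, and the output file name is arbitrary):
# Export the property graph to an interactive HTML visualization
index.property_graph_store.save_networkx_graph(name="./knowledge_graph.html")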
8. Knowledge Graph Persistence & Storage
This section covers the persistence and loading mechanisms for the knowledge graph, enabling long-term storage and retrieval.
Storage Operations
Purpose: Save the knowledge graph to disk and reload it for future sessions
# Persist the knowledge graph to storage directory
index.storage_context.persist(persist_dir="./storage")

# Load index from storage in future sessions
from llama_index.core import StorageContext, load_index_from_storage

index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)

# Alternative: Load from existing graph store
index = PropertyGraphIndex.from_existing(property_graph_store=graph_store)
Storage Operations:
Persistence: index.storage_context.persist()
Saves graph structure, embeddings, and metadata to disk
Creates organized storage directory structure
Enables session recovery without re-processing
Loading: load_index_from_storage()
Restores complete knowledge graph state
Maintains all entity relationships and embeddings
Fast initialization for subsequent sessions
Alternative Loading: PropertyGraphIndex.from_existing()
Key Benefits:
Session Continuity: No need to rebuild graph for each session
Performance: Faster startup times after initial processing
Scalability: Easy to extend with additional documents
Backup: Persistent storage for critical knowledge graphs
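On the scalability point above, new documents can be folded into an existing index without rebuilding it; a sketch assuming LlamaIndex's standard insert interface (the dataset_new directory is hypothetical):
from llama_index.core import SimpleDirectoryReader

# Load newly collected files and add them to the existing knowledge graph
new_documents = SimpleDirectoryReader("dataset_new").load_data()
for doc in new_documents:
    index.insert(doc)

# Persist again so the additions survive the session
index.storage_context.persist(persist_dir="./storage")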
9. Intelligent Retrieval from Knowledge Graph
This section demonstrates how to query the knowledge graph using sophisticated retrieval mechanisms for financial analysis.
Graph-Based Retrieval System
Purpose: Query the knowledge graph to extract relevant financial insights and relationships
# Define retriever with graph-only mode
retriever = index.as_retriever(
    include_text=False,  # Default is True
)

# Execute retrieval query
results = retriever.retrieve("What is the summary of the financial texts?")

# Display retrieved results
for record in results:
    print(record.text)
Retrieval Configuration:
Retriever Setup: index.as_retriever()
include_text=False: Focuses on graph structure only
Leverages entity relationships for retrieval
Uses semantic similarity in graph space
Query Execution: retriever.retrieve()
Processes natural language queries
Traverses knowledge graph relationships
Returns ranked results based on relevance
Result Processing: Iterates through retrieved records
Each record contains extracted entities and relationships
Structured format for easy consumption
Ready for further analysis or display
Example Query Results:
The financial texts analyze the electric vehicle sector in the US, focusing on:
- Market growth projections showing 35% CAGR through 2028
- Key players: Tesla, Ford, GM, and emerging startups like Rivian
- Investment trends: $120B committed to EV infrastructure
- Regulatory impacts: Federal tax incentives and state-level mandates
- Consumer adoption: Current 7% market share projected to reach 25% by 2030
- Supply chain challenges: Battery material shortages and manufacturing scaling
Advanced Retrieval Features:
Graph Traversal: Follows entity relationships across documents
Semantic Search: Uses embeddings for context-aware retrieval
Relationship Inference: Discovers implicit connections between entities
Multi-hop Reasoning: Chains multiple relationships for complex queries
Sample Queries for Financial Analysis:
"Show me companies investing in battery technology"
"What are the main challenges facing EV startups?"
"Compare market positions of Tesla and traditional automakers"
"Analyze the impact of government policies on EV adoption"
"Identify key growth drivers in the electric vehicle sector"
This retrieval system transforms the knowledge graph into an intelligent query engine that can provide comprehensive financial insights by leveraging the structured relationships between entities extracted from the original dataset.
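The retriever above returns raw graph records; for synthesized, natural-language answers like the example output shown earlier, the same index can also be wrapped in a query engine. A minimal sketch:
# Build a query engine on top of the knowledge graph
query_engine = index.as_query_engine(
    include_text=True,  # include source text chunks alongside graph paths for richer answers
)

response = query_engine.query("What are the main challenges facing EV startups?")
print(str(response))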