How to Build an AI-Powered Customer Support Agent with OpenAI and FAISS
In this tutorial, I'll guide you step-by-step in creating an intelligent AI Customer Support Agent using OpenAI embeddings and the FAISS vector store. The project was developed on Google Colab, with Google Drive serving as the storage solution. The provided code snippets are illustrative—you're welcome to customize them with your own datasets, alternative models, or different backend services.
Tools & Tech Stack
Python 3.8+
OpenAI API (text-embedding-3-small)
FAISS (vector similarity search)
Google Colab (optional; used here as the development environment)
Google Drive (optional; used here for storage)
Pandas, NumPy, Matplotlib
Folder Structure
cs_agent/
cs_dataset/
cs_dataset.csv
vector/
src/
helper.py
main.ipynb
requirements.txt
Let's walk through each step in detail to ensure a smooth implementation.
1. Unzipping Project Files in Google Colab
This command extracts the project files in Google Colab when working with data from Google Drive.
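The original unzip command isn't shown here. As a stand-in, here is a minimal self-contained sketch using Python's zipfile module; the archive name cs_agent.zip is an assumption, and on Colab you would point this at the archive on your mounted Drive (or simply run the shell command !unzip with that path):

```python
import os
import zipfile

# For this self-contained demo, create a small archive like one exported from Drive
with zipfile.ZipFile('cs_agent.zip', 'w') as zf:
    zf.writestr('cs_agent/requirements.txt', 'pandas\nnumpy\nfaiss-cpu\nopenai\n')

# Extract the project files into the current working directory
with zipfile.ZipFile('cs_agent.zip') as zf:
    zf.extractall('.')

print(os.path.exists('cs_agent/requirements.txt'))  # True
```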
The src/helper.py module defines the project's helper functions; call_llm(), for example, interfaces with large language models.
Step 4: Load Dataset
This code snippet reads the CSV file into a pandas DataFrame and displays a preview of the data (first few rows) to verify successful loading.
import pandas as pd

# Read CSV file into DataFrame
df = pd.read_csv('cs_dataset/cs_dataset.csv')

# Display first 5 rows
df.head()
Code Explanation:
CSV Reading: pd.read_csv() reads the CSV file and creates a DataFrame object
- Automatically handles header rows
- Infers data types by default
File Path: the path 'cs_dataset/cs_dataset.csv' specifies:
- Subdirectory containing the file
- Exact filename with .csv extension
Data Preview: df.head() displays:
- First 5 rows by default
- Column headers
- Sample data values
Expected Output:
flags instruction category intent response
0 B question about cancelling order {{Order Number}} ORDER cancel_order I've understood you have a question regarding...
1 BQZ i have a question about cancelling oorder {{Or... ORDER cancel_order I've been informed that you have a question ab...
2 BLQZ i need help cancelling puchase {{Order Number}} ORDER cancel_order I can sense that you're seeking assistance wit...
3 BL I need to cancel purchase {{Order Number}} ORDER cancel_order I understood that you need assistance with can...
4 BCELN I cannot afford this order, cancel purchase {{... ORDER cancel_order I'm sensitive to the fact that you're facing f...
Step 5: Plot Category Distribution
The following code generates a visualization of customer question categories from our dataset. Since this is purely for demonstration purposes, you may choose to skip this section.
df['category'].value_counts().plot(kind='bar')
Code Explanation:
Column Selection: df['category']
- Accesses the 'category' column from the DataFrame
Value Counts: .value_counts()
- Calculates the frequency of each unique category
- Returns a Series with categories as the index and counts as values
- Automatically sorts by frequency (descending)
Plotting: .plot(kind='bar')
- Generates a vertical bar chart with categories on the x-axis and counts on the y-axis
- Uses matplotlib under the hood
Output:
Fig. 1 - Distribution of customer support ticket categories
Step 6: Building a Category-Intent Mapping Dictionary
The following code efficiently creates a mapping between support ticket categories and their associated intents, revealing the relationship between broad issue types and specific customer needs.
from collections import defaultdict

# Create dictionary to map categories to sets of intents
category_intent_dict = defaultdict(set)

# Populate the dictionary
for category, intent in zip(df['category'], df['intent']):
    category_intent_dict[category].add(intent)

# Convert sets to lists for final output
category_intent_dict = {k: list(v) for k, v in category_intent_dict.items()}
Code Explanation:
Data Structure Choice: defaultdict(set)
- Automatically initializes new keys with empty sets
- Ensures each intent is only stored once per category
Efficient Pair Processing: zip(df['category'], df['intent'])
- Iterates through category-intent pairs without indexing
Set Operations: .add(intent)
- Automatically handles duplicate intents per category
Final Conversion: {k: list(v) for k, v in category_intent_dict.items()}
- Converts sets to lists for easier JSON serialization
- Creates a standard dictionary output
Step 7: Analyzing Text Length Patterns in Customer Support Conversations
The following code analyzes the average length of both customer instructions and agent responses, revealing key communication patterns in support interactions. This diagnostic step is optional and can be skipped if needed.
# Calculate average instruction length (in tokens)
avg_instruction_tokens = df['instruction'].apply(lambda x: len(x.split())).mean()

# Calculate average response length (in tokens)
avg_response_tokens = df['response'].apply(lambda x: len(x.split())).mean()

# Print results
print(f"Avg. token count for instructions: {avg_instruction_tokens}")
print(f"Avg. token count for responses: {avg_response_tokens}")
Code Explanation:
Token Counting: .apply(lambda x: len(x.split()))
- Splits text by whitespace and counts words
- Simple approximation of token count
Column Processing: df['instruction'] and df['response']
- Access the customer questions and agent answers
Statistical Summary: .mean()
- Calculates the average length across all entries
Formatted Output: f-strings
- Display results with clear labels
Example Output:
Avg. token count for instructions: 8.690979458172075
Avg. token count for responses: 104.78903691574874
Step 8: Generating Text Embeddings for Customer Support Analysis
The following code serves as the core component of this project, creating numerical vector embeddings of customer support instructions using OpenAI's embedding models. It utilizes the create_embeddings() method from a helper class located in the /src folder, which calls the OpenAI API to generate these vector representations. To execute this code, you must first sign in to the OpenAI Platform and create an API key, as this is required to produce vector embeddings from your customer training data.
Note: The execution time for this code varies depending on your dataset size. Processing larger training datasets will require more time to complete.
The output is an array of vectors, one per instruction, where 1536 is the dimensionality of each embedding (for text-embedding-3-small).
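The helper's internals aren't shown in this post. As a rough sketch (not necessarily the author's implementation), create_embeddings() might batch the instructions and call the official openai Python client; the batching helper, the batch size, and the OPENAI_API_KEY environment variable are assumptions here:

```python
import os

def batched(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def create_embeddings(texts, model='text-embedding-3-small', batch_size=100):
    """Embed a list of texts with the OpenAI API; returns a float32 numpy array."""
    import numpy as np         # pip install numpy
    from openai import OpenAI  # pip install openai
    client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
    vectors = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return np.array(vectors, dtype='float32')

# vectors = create_embeddings(df['instruction'].to_list())
```

With text-embedding-3-small, the resulting array would have one 1536-dimensional row per instruction.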
Step 10: Creating Efficient Vector Search Indexes with FAISS
The following instruction builds a vector search index from the embeddings generated in Step 8 and stores it on disk.
index = create_index(vectors, index_file_path='vector/faiss.index')
Code Explanation:
create_index()
- Custom function that builds a FAISS search index
Parameters:
- vectors - your embedding vectors (a numpy array)
- index_file_path - where to save the index file (optional)
Output: index
- FAISS index object ready for similarity searches
Step 11: Loading the Pre-Built Vector Database for Efficient Search
The following instruction loads the vector database created in Step 10.
index = faiss.read_index('vector/faiss.index')
Code Explanation:
faiss.read_index()
- FAISS function that loads a saved index from disk
Parameters:
- 'vector/faiss.index' - path to the saved index file
Output: index
- Reconstructed FAISS search index object
Step 12: Sample Query
This is our query/question for which we want to retrieve answers from our vector database.
query = "how can I change my order? My order number is 501"
Step 13: Performing Semantic Similarity Searches with Embeddings
This code generates an embedding for the query text, then searches the vector database (loaded in Step 11) to retrieve the most relevant matches along with their similarity scores.
distances, indices = semantic_similarity(query, index, model='text-embedding-3-small')
top_similar = df.iloc[indices[0]].reset_index(drop=True)
top_similar['distance'] = distances[0]
Code Explanation:
semantic_similarity()
- Custom function that:
  - Embeds the query text
  - Searches the FAISS index
  - Returns matches and similarity scores
Step 14: Processing and Enhancing Semantic Search Results
This code extracts the responses from the top matches, displays them in a readable table, and then passes them to a large language model to generate an enhanced answer for the customer.
# Extract responses from top matches
responses = top_similar['response'].to_list()

# Display formatted results
print(top_similar[['instruction', 'intent', 'response']].to_markdown(index=False))

# Generate enhanced LLM response
print(call_llm(query, responses))
Code Explanation:
Response Extraction: .to_list()
- Converts the responses column to a Python list
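call_llm() is the last helper whose body isn't shown. As a hedged sketch, it could pass the retrieved responses to a chat model as grounding context; the prompt wording and the gpt-4o-mini model name are assumptions, not the author's choices:

```python
import os

def call_llm(query, responses, model='gpt-4o-mini'):
    """Ask a chat model to draft an answer grounded in the retrieved responses."""
    from openai import OpenAI  # pip install openai
    client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
    context = '\n\n'.join(responses)
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {'role': 'system',
             'content': 'You are a customer support agent. '
                        'Answer using only the reference responses provided.'},
            {'role': 'user',
             'content': f'Question: {query}\n\nReference responses:\n{context}'},
        ],
    )
    return completion.choices[0].message.content
```

Grounding the model in the retrieved responses keeps its answer consistent with the tone and policies present in the training dataset, rather than letting it improvise.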