Build an AI Picture Book Generator with Stable Diffusion Models

AIM of the Project

The aim of this project is to build an AI-powered children's picture book generation system that can:

Create captivating and moral-driven children's stories that align with various Themes, Settings, Characters and Narrative Tones using advanced natural language generation models.
Generate visually consistent and high-quality illustrations for each part of the story using Stable Diffusion models.

By leveraging advanced AI models like Google Gemini and image-generation models such as Stable Diffusion, this system automates story generation and illustration design to produce interactive and imaginative picture books. The goal is to empower publishers, educators, and parents to create high-quality, customizable content for children that is both immersive and impactful.

What is AI-Powered Picture Book Generation?

AI-powered picture book generation uses artificial intelligence to create engaging, interactive children's stories along with visual prompts for illustrations. By integrating AI with text-to-image models like Stable Diffusion, this technology can generate captivating narratives and corresponding images, making it easier to produce personalized picture books that children can enjoy.

In this project, the Gemini model is utilized to craft unique stories with themes like friendship, adventure, kindness, and bravery. These stories are divided into parts with each having a visual prompt for image generation, transforming the narrative into a visual experience. The generated images are based on the protagonist's description and key moments in the story, giving each part a vibrant, illustrated look.

Setup for IPYNB

For the best experience, stay connected to the internet while running this project.

Running an IPYNB on Google Colab:

Open the Google Colab website.
Click on the "New Notebook" button.
Click the "File" menu in the new notebook and choose "Upload notebook."
Select the IPYNB file you want to upload.
Once the file is uploaded, click on the "Runtime" menu and choose "Run all" to execute all the cells in the notebook.
Colab's default Python version was 3.10 at the time of writing and may have changed since.

Important Libraries

torch: The core library of PyTorch, a deep learning framework used for high-performance tensor computation and model execution. In this project, it enables efficient computation for the Stable Diffusion pipelines. Refer to the documentation for more information.
diffusers: Provides tools for implementing and fine-tuning diffusion models like Stable Diffusion. It is used here to generate images for the story from scene prompts and protagonist descriptions. Refer to the documentation for more information.
matplotlib: A versatile library for data visualization in Python. In this project, it is used to display generated images while the code runs. Refer to the documentation for more information.
PIL (Python Imaging Library): Used for image processing tasks like saving and displaying generated images, so the illustrations can be kept in a lossless format such as PNG for use in picture books. Refer to the documentation for more information.
google.generativeai: Used to interact with Google's Generative AI models for creative storytelling. It helps generate detailed stories, protagonist descriptions, and scene prompts with imaginative and child-friendly content. Refer to the documentation for more information.

Execution Instructions

Setup:
- Install necessary libraries:
```
pip install diffusers transformers torch matplotlib Pillow google-generativeai
```
- Obtain API keys for any external services (e.g., Google Generative AI).
Implement the Gemini functionality: Run the generate_story() function, which uses Gemini to generate a story for a randomly selected theme and setting, along with a visual prompt for each scene.
Implement the Stable Diffusion functionality: Run the generate_images_from_gemini_output() function, which generates images of the protagonist and the scenes from the prompts produced by Gemini in the previous step.
Execute the entire pipeline: Run the final main block, which combines generate_story() and generate_images_from_gemini_output() and displays the final story, the scene prompts, and the generated picture book.

IMPLEMENTATION

Integrating Google Gemini for Story Generation

The Google Gemini model is a state-of-the-art generative AI tool designed to handle tasks like natural language generation and storytelling with remarkable creativity and coherence. In this project, Gemini is used to generate engaging narratives for picture books, including character descriptions, plot outlines, and scene prompts. By leveraging its advanced capabilities, we can ensure high-quality, imaginative content tailored for children's stories.

Steps to Create a Gemini API Key

Open https://aistudio.google.com/apikey
Click on the Get API Key icon.
Next, click on the Create an API Key icon, give the key a name, and copy the generated API key to a safe place for future use.

Gemini API Setup


import google.generativeai as genai
import random
# Configure the Gemini API
api_key = "Your Gemini API Key"
genai.configure(api_key=api_key)

# Initialize the Gemini model
model = genai.GenerativeModel("gemini-2.5-pro")

Code Breakdown

API Key: The api_key passed to genai.configure() should be a valid API key associated with your Google Cloud project. Ensure that the key is enabled for Google Gemini.
Model Version: "gemini-2.5-pro" is one specific version of the Gemini model. You can experiment with any other model version that is available to your API key.
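
Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported as an environment variable beforehand. The variable name GEMINI_API_KEY and the helper load_api_key are illustrative choices, not something this project defines:

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch; use whatever
    name you exported the key under.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Gemini API key."
        )
    return key


# Usage in the setup cell (not executed here):
#   genai.configure(api_key=load_api_key())
```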

Generate Your Picture Book Story and Visual Prompts

Introduction

Welcome to the "Generate Story" function!

This block of code brings creativity to life by crafting a full-fledged children's story, a detailed protagonist description, and visually engaging prompts for each scene. By leveraging Google Gemini, this function ensures that the generated story and visuals are rich in imagination and perfectly tailored for picture books.

This function is designed to:

Randomly select a theme and setting for the story.
Generate a story with a clear moral or lesson, divided into 5 logical parts.
Provide a protagonist description to maintain character consistency across scenes.
Create scene prompts for each story part, compatible with Stable Diffusion for generating high-quality illustrations.

How It Works?

Random Theme and Setting: A random selection is made from pre-defined lists of themes (e.g., friendship, bravery) and settings (e.g., forest, ocean).
Google Gemini Prompting: A detailed prompt is sent to the Gemini API to generate the story content.
Story Structure:
- A beginning, middle, and end with a clear narrative.
- A protagonist with distinct physical and personality traits.
- Logical divisions into 5 parts, each tied to a specific scene.
Scene Prompts: Highly descriptive visual prompts for use with Stable Diffusion to create illustrations.

Key Parameters

Theme: The central idea or moral of the story, chosen randomly. Examples: kindness, adventure, honesty.
Setting: The backdrop for the story, chosen randomly. Examples: castle, jungle, village.
Protagonist Description: A standalone description of the main character, including physical appearance and personality traits.
Story Structure: The story is divided into 5 parts, each highlighting a key moment.
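
Because the theme and setting are drawn at random, two runs rarely produce the same combination. When a run needs to be repeatable (for example, while debugging prompts), one option is a seeded random generator. The sketch below reuses the theme and setting lists from generate_story(); the helper name pick_theme_and_setting and its seed parameter are hypothetical additions:

```python
import random

# Same candidate lists as in generate_story()
THEMES = ["friendship", "adventure", "kindness", "bravery", "overcoming fears", "honesty"]
SETTINGS = ["forest", "castle", "ocean", "mountain", "desert", "jungle", "village", "busy city"]


def pick_theme_and_setting(seed=None):
    """Pick a (theme, setting) pair; the same seed yields the same pair."""
    rng = random.Random(seed)  # private generator, leaves the global RNG untouched
    return rng.choice(THEMES), rng.choice(SETTINGS)


theme, setting = pick_theme_and_setting(seed=7)
print(theme, "/", setting)
```

Passing seed=None keeps the original behavior of a fresh random choice on every call.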

Generate Story Function

import json # Import json for parsing

def generate_story():
    """
    Generates a children's story, protagonist description, and scene prompts using Gemini.

    Returns:
        dict: A dictionary containing the story, protagonist description, and scene prompts.
    """
    # Randomly select a theme and setting
    theme = random.choice(["friendship", "adventure", "kindness", "bravery", "overcoming fears", "honesty"])
    setting = random.choice(["forest", "castle", "ocean", "mountain", "desert", "jungle", "village", "busy city"])

    gemini_prompt = f"""
You are a creative storyteller tasked with generating simple, engaging children's stories that can be illustrated as picture books using Stable Diffusion for image generation. Each story must resemble timeless classics like "The Tortoise and the Hare," with a strong narrative and visually clear prompts. Follow these guidelines carefully:

1. **Story Generation:**
   - Create a captivating children's story based on the theme "{theme}" and set in "{setting}".
   - Use simple, kid-friendly language and imaginative yet relatable scenarios.
   - Ensure the story has:
     - One main protagonist with consistent traits throughout.
     - A clear challenge, conflict, or adventure the protagonist embarks on and resolves.
     - A moral or lesson that children can learn from.
   - Keep the story concise and divide it into 5 logical parts.

2. **Protagonist Description:**
   - Write a detailed, standalone description of the protagonist that remains consistent across all scenes:
     - Physical traits: size, age, clothing, and distinctive features.
     - Personality traits: brave, kind, curious, or determined.
     - Any unique items or accessories they carry that appear in every scene.

3. **Story Division into 5 Parts:**
   - Divide the story into exactly 5 sequential moments:
     - Each part should represent a distinct event in the narrative, progressing logically toward the resolution.
     - Clearly establish the setting, protagonist's actions, and emotions in each part.

4. **Scene Prompt Generation for Stable Diffusion:**
   - For each of the 5 parts, write a **highly descriptive visual prompt** that Stable Diffusion can use to generate images.
   - Ensure prompts are:
     - Clear, specific, and unambiguous, avoiding abstract or overly complex descriptions.
     - Focused on describing the protagonist's appearance, actions, surroundings, and emotions.
     - Consistent with the protagonist's traits and accessories described earlier.
     - Example of a well-crafted prompt:
       - "A small tortoise with a green shell and a bright red scarf stands proudly on a dirt path in a sunny forest, with trees and flowers around, while a confident hare smirks nearby."

5. **Output the following as a JSON object.** The JSON object should have the following keys:
   - 'theme': The chosen theme.
   - 'setting': The chosen setting.
   - 'protagonist_description': A single string containing the detailed protagonist description.
   - 'story_text': A single string containing the complete story divided into parts, with each part clearly separated (e.g., using newlines).
   - 'scene_prompts': A list of 5 strings, where each string is a clear visual prompt for a scene.
Example JSON output:
```json
{{
  "theme": "friendship",
  "setting": "forest",
  "protagonist_description": "A curious squirrel named Sammy, with a bushy tail and a small red scarf, who is adventurous and helpful.",
  "story_text": "Part 1: Once upon a time...\\nPart 2: Sammy met a shy rabbit...\\nPart 3: They went on an adventure...\\nPart 4: They faced a challenge...\\nPart 5: They learned about friendship...",
  "scene_prompts": [
    "A small squirrel with a red scarf explores a sunny forest, surrounded by towering trees and colorful flowers.",
    "The squirrel meets a shy rabbit near a sparkling stream, with sunlight reflecting off the water.",
    "Sammy and the rabbit discover a hidden cave, illuminated by glowing crystals.",
    "They work together to cross a treacherous river, building a bridge from fallen branches.",
    "Sammy and the rabbit share a celebratory nut, their bond of friendship stronger than ever, under a starlit sky."
  ]
}}
```
"""

    # Generate the response using the Gemini model
    response = model.generate_content(gemini_prompt)
    if not response or not response.text.strip():
        return {"error": "Gemini returned an empty or invalid response.", "theme": theme, "setting": setting}

    # Initialize gemini_output with theme and setting
    gemini_output = {"theme": theme, "setting": setting}

    try:
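        # Extract the JSON payload, stripping a ```json ... ``` fence if present
        json_start = response.text.find('```json')
        json_end = response.text.rfind('```')
        # Require a closing fence after the opening one; with no closing fence,
        # rfind returns the opening fence itself and the slice would be empty
        if json_start != -1 and json_end > json_start:
            json_string = response.text[json_start + 7:json_end].strip()
        else:
            json_string = response.text.strip()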

        # Attempt to parse the extracted JSON string
        parsed_output = json.loads(json_string)
        gemini_output["protagonist_description"] = parsed_output.get("protagonist_description")
        gemini_output["story_text"] = parsed_output.get("story_text")
        gemini_output["scene_prompts"] = parsed_output.get("scene_prompts", [])
    except json.JSONDecodeError as e:
        return {"error": f"Failed to parse Gemini response as JSON: {e} - Response: {response.text}", "theme": theme, "setting": setting}

    return gemini_output
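
The fence-stripping step inside generate_story() can also be factored into a small helper, which makes it easy to test without calling the API. The following is a sketch under the assumption that the model returns at most one fenced block and that the JSON itself contains no triple backticks; extract_json_block is a hypothetical name, not part of the project code:

```python
import json
import re

# Matches the first ``` or ```json fenced block and captures its body
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)


def extract_json_block(text):
    """Parse JSON from a model reply, with or without a Markdown code fence.

    Raises json.JSONDecodeError if the payload is not valid JSON.
    """
    match = FENCE_RE.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


reply = 'Here is the story:\n```json\n{"theme": "kindness", "scene_prompts": ["a", "b"]}\n```'
print(extract_json_block(reply)["theme"])
```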

Before Running the Code....

Dependencies: random and json ship with Python's standard library; only google.generativeai (installed as the google-generativeai package) needs to be installed and configured.
API Key: Provide the necessary API key to authenticate with Google Gemini.

Example Output

{
    "theme": "friendship",
    "setting": "forest",
    "protagonist_description": "A curious squirrel named Sammy, with a bushy tail and a small red scarf, who is adventurous and helpful.",
    "story_text": "Once upon a time...",
    "scene_prompts": [
        "A small squirrel with a red scarf explores a sunny forest, surrounded by towering trees and colorful flowers.",
        "The squirrel meets a shy rabbit near a sparkling stream, with sunlight reflecting off the water.",
        ...
    ]
}
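
Language models do not always follow the requested structure, so it can be worth checking the dictionary before handing it to the image step. A minimal sketch of such a check, assuming the keys shown in the example output and the five-part structure requested in the prompt; the function name validate_story_output is hypothetical:

```python
def validate_story_output(output, expected_scenes=5):
    """Return a list of problems found in a generate_story() result.

    An empty list means the dictionary looks usable for image generation.
    """
    # generate_story() signals failure through an "error" key
    if "error" in output:
        return [f"generation failed: {output['error']}"]

    problems = []
    for key in ("protagonist_description", "story_text"):
        value = output.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' is missing or empty")

    prompts = output.get("scene_prompts")
    if not isinstance(prompts, list):
        problems.append("'scene_prompts' is not a list")
    else:
        if len(prompts) != expected_scenes:
            problems.append(
                f"expected {expected_scenes} scene prompts, got {len(prompts)}"
            )
        if any(not isinstance(p, str) or not p.strip() for p in prompts):
            problems.append("one or more scene prompts are empty")
    return problems
```

A caller could regenerate the story whenever the returned list is non-empty.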

Image Generation Pipeline Setup

from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
import torch
from PIL import Image
import matplotlib.pyplot as plt
import time
import gc

Creating Stunning Illustrations from Story Prompts with Stable Diffusion

Introduction

This Python function, generate_images_from_gemini_output, uses a pre-trained Stable Diffusion model to generate a sequence of images from the protagonist description and scene prompts in gemini_output.

Purpose of the code

The function generates:

Protagonist Image: Based on the description of a protagonist.
Scene Images: Based on the scene prompts. Each scene is produced by transforming the protagonist's image with the image-to-image pipeline, so every scene starts from the same reference image.

It uses Stable Diffusion, a popular deep learning model for text-to-image and image-to-image generation.

Code Components

Function: generate_images_from_gemini_output

This function is responsible for generating images based on prompts.

Parameters

gemini_output (dict): A dictionary containing:
- protagonist_description: Text describing the main character.
- scene_prompts: List of textual descriptions for each scene.
model_name (str, optional): Name of the pre-trained model. Defaults to "sd-legacy/stable-diffusion-v1-5".
device (str, optional): Device to run the model, e.g., "cuda" for GPUs or "cpu" for CPUs. Defaults to "cuda".
strength (float, optional): Controls the degree of transformation for image-to-image generation. Ranges from 0.0 (no transformation) to 1.0 (maximum transformation). Defaults to 0.8.

Returns

A list of generated images (list of PIL.Image.Image objects).

Helper Function:

display_and_save_image : Displays the generated image using matplotlib and saves it to a file.

Workflow

Import Necessary Libraries
- diffusers: Provides Stable Diffusion pipelines for text-to-image and image-to-image generation.
- torch: For managing model weights and running computations on the specified device.
- gc: For explicit garbage collection to optimize memory usage.
Initialize Pipelines
- Text-to-Image Pipeline (StableDiffusionPipeline): Converts text descriptions into images.
- Image-to-Image Pipeline (StableDiffusionImg2ImgPipeline): Modifies images based on new text prompts.
Both pipelines are loaded with the specified model_name and cast to float16 for efficient memory usage.
Input Validation
Checks that protagonist_description and scene_prompts are present in gemini_output. Raises a ValueError if either is missing.
Generate Images
1. Protagonist Image:
  The protagonist description is fed into the text-to-image pipeline.
  
  The result is stored and displayed.
2. Scene Images:
  Iterates through scene_prompts, applying the image-to-image pipeline to the protagonist's image using the specified strength.
  
  Each result is stored, displayed, and saved.
Memory Management
Explicitly clears GPU memory using torch.cuda.empty_cache() and Python garbage collection (gc.collect()).
Save and Display Images
Uses the helper function display_and_save_image to visualize and store images in .png format.

Customization Options

Hyperparameters

strength: Adjust transformation intensity for image-to-image generation.
guidance_scale: Tune the balance between following the text prompt strictly or being creative (default values: 8 for protagonist image, 9 for scene images).
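
One practical consequence of strength is worth spelling out. In the diffusers image-to-image pipelines, as implemented at the time of writing, strength also scales how many denoising steps actually run: roughly int(num_inference_steps * strength), capped at num_inference_steps. The sketch below mirrors that arithmetic in plain Python so the effect can be checked without loading a model; the function name is hypothetical and the formula should be treated as an assumption about the library's current behavior:

```python
def effective_denoising_steps(num_inference_steps, strength):
    """Approximate number of denoising steps an img2img call performs."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return min(int(num_inference_steps * strength), num_inference_steps)


# With 50 inference steps, compare a few strength values
for s in (0.3, 0.5, 0.8, 1.0):
    print(s, effective_denoising_steps(50, s))
```

Lower strength therefore keeps more of the protagonist image and also runs faster, while values near 1.0 behave almost like text-to-image generation.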

Pipeline and Device

model_name: Replace with other pre-trained models for different styles or quality.
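device: Use "cpu" if a GPU is unavailable, though generation will be much slower. Half precision (float16) is often poorly supported on CPU, so you may also need to change torch_dtype to torch.float32 in that case.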
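
The device and precision choice can be made automatically. The following is a sketch, not the project's own code: pick_device_and_dtype is a hypothetical helper that returns the dtype as a string so the selection logic can be exercised without PyTorch installed.

```python
def pick_device_and_dtype(cuda_available):
    """Return (device, dtype_name) suited to the available hardware.

    float16 halves memory use on GPUs; float32 is the safer choice on CPU.
    """
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"


if __name__ == "__main__":
    try:
        import torch
        cuda_available = torch.cuda.is_available()
    except ImportError:
        # PyTorch not installed: fall back to the CPU settings
        cuda_available = False
    device, dtype_name = pick_device_and_dtype(cuda_available)
    print(device, dtype_name)
```

In the notebook, the result could be passed along as device=device and torch_dtype=getattr(torch, dtype_name).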

Prompts

Modify the protagonist_description or scene_prompts in gemini_output for different narratives or themes.

File Saving

Adjust file paths and formats in display_and_save_image.

Key Considerations

Memory Management: Stable Diffusion models require significant memory, especially on GPUs. Clearing memory explicitly with torch.cuda.empty_cache() and gc.collect() between generations helps avoid out-of-memory errors.
Input Validation: Ensures required fields are provided, avoiding runtime errors.
Dependencies:
- diffusers: Install using pip install diffusers.
- torch: Install using pip install torch.
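
The two clean-up calls are repeated after every generated image, so they can be collected in one helper. A minimal sketch, assuming it is acceptable to skip the CUDA call when no GPU (or no PyTorch install) is present; free_memory is a hypothetical name:

```python
import gc


def free_memory():
    """Run garbage collection and release cached GPU memory when possible.

    Returns True if the CUDA cache was cleared, False otherwise.
    """
    gc.collect()
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


print("CUDA cache cleared:", free_memory())
```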

Illustrations Generation using stable-diffusion-v1-5

def generate_images_from_gemini_output(gemini_output, model_name="sd-legacy/stable-diffusion-v1-5", device="cuda", strength=0.8):
    """
    Generates a sequence of images based on the output from the Gemini function.

    Args:
        gemini_output (dict): Dictionary containing the theme, setting, protagonist description, and scene prompts.
        model_name (str): Pretrained model name.
        device (str): Device to run the model on (e.g., "cuda" or "cpu").
        strength (float): Strength of transformation for image-to-image generation (0.0 to 1.0).

    Returns:
        list: List of generated images.
    """
    from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline
    import torch
    import gc

    # Initialize text-to-image pipeline
    pipe = StableDiffusionPipeline.from_pretrained(
        model_name,
        torch_dtype=torch.float16
    ).to(device)

    # Initialize image-to-image pipeline
    img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        model_name,
        torch_dtype=torch.float16
    ).to(device)

    protagonist_description = gemini_output.get("protagonist_description", "")
    scene_prompts = gemini_output.get("scene_prompts", [])

    if not protagonist_description:
        raise ValueError("Protagonist description is missing in the Gemini output.")
    if not scene_prompts:
        raise ValueError("Scene prompts are missing in the Gemini output.")

    generated_images = []

    # Step 1: Generate protagonist image
    print(f"Generating protagonist image with prompt: {protagonist_description}")
    protagonist_image = pipe(prompt=protagonist_description, guidance_scale=8).images[0]
    generated_images.append(protagonist_image)
    display_and_save_image(protagonist_image, "protagonist_image.png")

    # Free memory
    torch.cuda.empty_cache()
    gc.collect()

    # Step 2: Generate scene images
    for i, prompt in enumerate(scene_prompts):
        print(f"Generating scene image {i + 1} with prompt: {prompt}")
        current_image = img2img_pipe(
            prompt=prompt,
            image=protagonist_image,
            strength=strength,
            guidance_scale=9
        ).images[0]
        generated_images.append(current_image)
        display_and_save_image(current_image, f"scene_image_{i + 1}.png")

        # Free memory
        torch.cuda.empty_cache()
        gc.collect()

    return generated_images

def display_and_save_image(image, filename):
    """
    Displays the generated image and saves it to a file.

    Args:
        image (PIL.Image.Image): Image to be displayed and saved.
        filename (str): Path to save the image.
    """
    import matplotlib.pyplot as plt

    plt.figure(figsize=(6, 6))
    plt.imshow(image)
    plt.axis("off")  # Turn off axis
    plt.show()

    image.save(filename)
    print(f"Image saved as '{filename}'")

Main Functionality for Story Generation and Image Creation

Purpose

This block of code generates a story and corresponding images based on its content. The generated story includes a randomly selected theme and setting, as well as details about the protagonist, story text, and scene prompts.

Code Components

Entry Point Check
```
if __name__ == "__main__":
```
This condition ensures that the following block of code is executed only when the script is run directly (not when it's imported as a module in another script).
Story Generation
```
gemini_output = generate_story()
```
generate_story() Function: Generates a story for a randomly selected theme and setting and returns a dictionary containing the theme, setting, story text, protagonist description, and scene prompts.
Story Details Printing
Outputs the generated story details to the console.

Key Components:
- gemini_output['theme']: Displays the story's theme.
- gemini_output['setting']: Displays the story's setting.
- gemini_output['story_text']: Prints the main story content.
- gemini_output['protagonist_description']: Prints details about the protagonist.
- gemini_output['scene_prompts']: Iterates over scene prompts and prints each with a corresponding part number.
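
The printing step described above might look like the following sketch. It builds the text as a single string (so it can be printed or written to a file) and assumes gemini_output contains all five keys; format_story_details is a hypothetical helper name:

```python
def format_story_details(gemini_output):
    """Format the story dictionary as readable text for the console."""
    lines = [
        f"Theme: {gemini_output['theme']}",
        f"Setting: {gemini_output['setting']}",
        "",
        "Story:",
        gemini_output["story_text"],
        "",
        "Protagonist:",
        gemini_output["protagonist_description"],
        "",
        "Scene prompts:",
    ]
    # Number the prompts from 1 so they line up with the story parts
    for i, prompt in enumerate(gemini_output["scene_prompts"], start=1):
        lines.append(f"Part {i}: {prompt}")
    return "\n".join(lines)


sample = {
    "theme": "friendship",
    "setting": "forest",
    "story_text": "Once upon a time...",
    "protagonist_description": "A curious squirrel named Sammy.",
    "scene_prompts": ["Sammy explores a sunny forest.", "Sammy meets a shy rabbit."],
}
print(format_story_details(sample))
```
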
Image Generation
```
generated_images = generate_images_from_gemini_output(gemini_output)
```
generate_images_from_gemini_output() Function:
- Purpose: Creates visual representations (images) based on the details in gemini_output.
- Input: The story dictionary (gemini_output); the function reads its protagonist description and scene prompts.
- Output: A list of generated images (PIL.Image.Image objects): the protagonist image first, followed by one image per scene prompt.
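
Putting the pieces together, the main block might be organized as in the sketch below. Since generate_story() reports failures through an "error" key rather than an exception, the sketch checks for that key before starting the expensive image step. The wrapper run_pipeline is a hypothetical helper; it takes the two functions as arguments so it can be tried out with stand-ins that need neither an API key nor a GPU:

```python
def run_pipeline(story_fn, images_fn):
    """Generate a story, then its images, stopping early on a story error.

    story_fn and images_fn stand in for generate_story and
    generate_images_from_gemini_output.
    """
    gemini_output = story_fn()
    if "error" in gemini_output:
        print(f"Story generation failed: {gemini_output['error']}")
        return gemini_output, []
    generated_images = images_fn(gemini_output)
    return gemini_output, generated_images


# In the notebook (not executed here):
# if __name__ == "__main__":
#     gemini_output, generated_images = run_pipeline(
#         generate_story, generate_images_from_gemini_output
#     )
```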

Expected Generated Images

Below are the images generated by our AI Picture Book Generator using Stable Diffusion models. Each image was created based on specific prompts that describe scenes from the story about Pip, a brave mountain goat kid on an adventure.

protagonist_image.png

Prompt:

Pip is a small, fluffy white mountain goat kid with big, curious brown eyes and tiny, curved horns just beginning to sprout. He is determined and brave, always ready for an adventure. Around his neck, he always wears a bright blue, hand-knitted bell that jingles softly with every step he takes.

scene_image_1.png

Prompt:

A small, fluffy white mountain goat kid with a bright blue bell around its neck stands at the base of a huge, rocky mountain, looking up with determination at the sunlit peak. The scene is a cheerful, sunny day with green grass and wildflowers at the bottom.

scene_image_2.png

Prompt:

The small, fluffy white mountain goat kid with a blue bell carefully climbs a steep, rocky path on the side of a mountain. He is focused and determined, using his small size to navigate the difficult terrain. The morning light is soft and golden.

scene_image_3.png

Prompt:

The small, fluffy white mountain goat kid with a blue bell stands at the edge of a wide, windy chasm on a mountain path, looking thoughtfully at a fallen log that spans the gap. He looks slightly worried but hopeful. The sky is a clear blue.

scene_image_4.png

Prompt:

The small, fluffy white mountain goat kid with a blue bell is bravely and carefully walking across a narrow log bridge over a chasm high on a mountain. He is halfway across, focused and balanced, with a proud expression. The wind gently ruffles his white fur.

scene_image_5.png

Prompt:

The small, fluffy white mountain goat kid with a blue bell stands triumphantly on the very top of a mountain peak, watching a beautiful, vibrant sunrise. The sky is filled with orange, pink, and gold clouds. He looks happy and proud, his little bell shining in the morning sun.