---
title: Image-to-Story-Generation
emoji: 🚀
colorFrom: indigo
colorTo: blue
sdk: streamlit
app_file: streamlit_app.py
pinned: false
license: apache-2.0
---

Image-to-Story-Generation

Image-to-Story-Generation is a Streamlit-based application that transforms images into captivating stories and narrates them aloud. The application leverages state-of-the-art AI models for image captioning, story generation, and text-to-speech conversion.

Features

  • Image Captioning: Automatically generates captions for uploaded images using a Vision-Encoder-Decoder model.
  • Story Generation: Creates a short, coherent story based on the generated caption using a language model.
  • Text-to-Speech: Converts the generated story into audio using Google Text-to-Speech (gTTS).
  • Streamlit Interface: Provides an intuitive web interface for users to upload images and interact with the pipeline.

How It Works

  1. Upload an Image: Users upload an image via the Streamlit interface.
  2. Generate Caption: The app generates a caption for the image using a pre-trained Vision-Encoder-Decoder model.
  3. Generate Story: A short story is created based on the caption using a language model.
  4. Text-to-Speech: The story is converted into an audio file, which can be played directly in the app (a minimal code sketch of these steps follows this list).
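
In code terms, the three steps above roughly map onto two Hugging Face `transformers` pipelines plus a gTTS call. The sketch below is a minimal illustration rather than the app's exact implementation: the image path, prompt wording, and output file name are assumptions.

    # Minimal sketch of the caption -> story -> audio flow (illustrative only).
    from transformers import pipeline
    from gtts import gTTS

    # 1. Caption the image with the Vision-Encoder-Decoder model.
    captioner = pipeline("image-to-text", model="ydshieh/vit-gpt2-coco-en")
    caption = captioner("assets/example.jpg")[0]["generated_text"]  # example.jpg is hypothetical

    # 2. Expand the caption into a short story with TinyLlama.
    generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    prompt = f"Write a short story about this scene: {caption}\nStory:"
    story = generator(prompt, max_new_tokens=200, do_sample=True)[0]["generated_text"]

    # 3. Narrate the story with Google Text-to-Speech and save it as an MP3.
    gTTS(text=story, lang="en").save("outputs/story.mp3")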

Installation

Prerequisites

  • Python 3.8 or later, pip, and git
  • A Hugging Face account and API token (used in step 4 below)

Steps

  1. Clone the repository:

    git clone https://github.com/diptaraj23/Image-to-Story-Generation.git genai-storyteller
    cd genai-storyteller
    
  2. Create a virtual environment and activate it:

    python -m venv venv
    venv\Scripts\activate       # On Windows
    source venv/bin/activate    # On macOS/Linux
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Add your Hugging Face API token to .streamlit/secrets.toml (a sketch showing how the app can read it follows these steps):

    HF_TOKEN = "your_hugging_face_api_token"
    
  5. Run the Streamlit app:

    streamlit run streamlit_app.py
    
  6. Open the app in your browser at http://localhost:8501.
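
Inside the app, the token saved in step 4 is typically read through Streamlit's secrets API. A minimal sketch, assuming the top-level HF_TOKEN key shown above (how the app actually passes the token to the models may differ):

    import streamlit as st

    # Read the token stored in .streamlit/secrets.toml (step 4 above).
    hf_token = st.secrets["HF_TOKEN"]

    # It can then be passed to Hugging Face calls that require authentication, e.g.:
    # pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", token=hf_token)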

File Structure

genai-storyteller/
├── app/
│   ├── __init__.py
│   ├── captioning.py         # Image captioning logic
│   ├── storytelling.py       # Story generation logic
│   ├── tts.py                # Text-to-speech conversion
│   ├── logger.py             # Logging utility
├── assets/                   # Directory for input images
├── outputs/                  # Directory for generated audio files
├── logs/                     # Directory for log files
├── tests/                    # Unit tests for the application
│   ├── __init__.py
│   ├── test_captioning.py    # Tests for captioning
│   ├── test_story.py         # Tests for story generation
│   ├── test_tts.py           # Tests for text-to-speech
├── .streamlit/
│   ├── config.toml           # Streamlit configuration
├── .devcontainer/
│   ├── devcontainer.json     # Dev container configuration
├── requirements.txt          # Python dependencies
├── run_pipeline.py           # CLI pipeline for image-to-audio
├── streamlit_app.py          # Main Streamlit application
├── .gitignore                # Git ignore file

Usage

Streamlit Interface

  1. Upload an image in .jpg, .jpeg, or .png format.
  2. Click the "Generate Story" button to process the image.
  3. View the generated caption and story.
  4. Listen to the story in audio format (a minimal UI sketch follows this list).
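
In Streamlit terms, this flow is essentially a file uploader, a button, and an audio player. The sketch below is illustrative; the helper names stand in for whatever functions app/captioning.py, app/storytelling.py, and app/tts.py actually expose.

    import streamlit as st

    # Minimal sketch of the upload -> generate -> play flow (helper names are hypothetical).
    uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

    if uploaded is not None and st.button("Generate Story"):
        caption = generate_caption(uploaded)    # hypothetical helper (app/captioning.py)
        story = generate_story(caption)         # hypothetical helper (app/storytelling.py)
        audio_path = synthesize_speech(story)   # hypothetical helper (app/tts.py)

        st.write(f"**Caption:** {caption}")
        st.write(story)
        st.audio(audio_path)                    # play the generated audio in the browser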

Command-Line Interface

You can also run the pipeline via the command line:

python run_pipeline.py <image_filename>

Replace <image_filename> with the name of the image file in the assets/ directory.
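
For example, if assets/ contains a file named sample.jpg (a hypothetical file name), the call would be:

    python run_pipeline.py sample.jpg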

Models Used

  • Image Captioning: ydshieh/vit-gpt2-coco-en (Vision-Encoder-Decoder model)
  • Story Generation: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (Language model)
  • Text-to-Speech: Google Text-to-Speech (gTTS)

Logging

Logs are stored in the logs/ directory. The application logs both to the console and to a file named pipeline.log.
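
The console-plus-file behavior can be set up with two handlers on one logger. Below is a minimal sketch of what app/logger.py might configure; the logger name and message format are assumptions.

    import logging
    import os

    # Log to both the console and logs/pipeline.log, as described above.
    os.makedirs("logs", exist_ok=True)

    logger = logging.getLogger("pipeline")  # assumed logger name
    logger.setLevel(logging.INFO)

    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler("logs/pipeline.log")):
        handler.setFormatter(formatter)
        logger.addHandler(handler)

    logger.info("Pipeline started")  # appears in the console and in logs/pipeline.log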

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments