# Image Analysis with InternVL2

This project uses the InternVL2-40B-AWQ model for high-quality image analysis, description, and understanding. It provides a Gradio web interface where users can upload images and receive detailed analysis.

## Features

- **High-Quality Image Analysis**: Uses InternVL2-40B (4-bit quantized) for state-of-the-art image understanding
- **Multiple Analysis Types**: General description, text extraction, chart analysis, people description, and technical analysis
- **Simple UI**: User-friendly Gradio interface for easy image uploading and analysis
- **Efficient Resource Usage**: 4-bit AWQ quantization for a reduced memory footprint and faster inference

## Requirements

The application requires:

- Python 3.9+
- CUDA-compatible GPU (24GB+ VRAM recommended)
- Transformers 4.37.2+
- lmdeploy 0.5.3+
- Gradio 3.38.0
- Other dependencies listed in `requirements.txt`

## Setup

### Docker Setup (Recommended)

1. **Build the Docker image**:

   ```
   docker build -t internvl2-image-analysis .
   ```

2. **Run the Docker container**:

   ```
   docker run --gpus all -p 7860:7860 internvl2-image-analysis
   ```

### Local Setup

1. **Create a virtual environment**:

   ```
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install dependencies**:

   ```
   pip install -r requirements.txt
   ```

3. **Run the application**:

   ```
   python app_internvl2.py
   ```

## Usage

1. Open your browser and navigate to `http://localhost:7860`
2. Upload an image using the upload box
3. Choose an analysis type from the options
4. Click "Analyze Image" and wait for the results

### Analysis Types

- **General**: Provides a comprehensive description of the image content
- **Text**: Focuses on identifying and extracting text from the image
- **Chart**: Analyzes charts, graphs, and diagrams in detail
- **People**: Describes people in the image, including appearance, actions, and expressions
- **Technical**: Provides technical analysis of objects and their relationships

## Testing

To test the model directly from the command line:

```
python test_internvl2.py --image path/to/your/image.jpg --prompt "Describe this image in detail."
```

## Deployment to Hugging Face

To deploy to Hugging Face Spaces:

```
python upload_internvl2_to_hf.py
```

## Model Details

This application uses InternVL2-40B-AWQ, a 4-bit AWQ-quantized version of InternVL2-40B. The original model consists of:

- **Vision Component**: InternViT-6B-448px-V1-5
- **Language Component**: Nous-Hermes-2-Yi-34B
- **Total Parameters**: ~40B (6B vision + 34B language)

## License

This project is released under the same license as the InternVL2 model: the MIT License.

## Acknowledgements

- [OpenGVLab](https://github.com/OpenGVLab) for creating the InternVL2 models
- [Hugging Face](https://huggingface.co/) for model hosting
- [lmdeploy](https://github.com/InternLM/lmdeploy) for model optimization
- [Gradio](https://gradio.app/) for the web interface
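Each analysis type ultimately corresponds to a prompt sent to the model. The sketch below illustrates one plausible way such a mapping could be structured; the dictionary name, function name, and prompt strings are assumptions for illustration and are not taken from `app_internvl2.py`.

```python
# Hypothetical mapping from UI analysis type to model prompt.
# Names and prompt wording are illustrative, not the app's actual code.
ANALYSIS_PROMPTS = {
    "General": "Describe this image in detail.",
    "Text": "Identify and extract all text visible in this image.",
    "Chart": "Analyze any charts, graphs, or diagrams in this image in detail.",
    "People": "Describe the people in this image: appearance, actions, and expressions.",
    "Technical": "Provide a technical analysis of the objects in this image and their relationships.",
}

def build_prompt(analysis_type: str) -> str:
    """Return the prompt for the chosen analysis type, defaulting to General."""
    return ANALYSIS_PROMPTS.get(analysis_type, ANALYSIS_PROMPTS["General"])
```

A lookup with a default keeps the UI and the model interface loosely coupled: adding a new analysis type only requires a new dictionary entry, and an unrecognized type falls back to the general description rather than failing.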