---
license: other
language:
- ja
base_model:
- tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3
pipeline_tag: visual-question-answering
---

# Llama-3.1-70B-Instruct-multimodal-JP-Graph - Built with Llama

Llama-3.1-70B-Instruct-multimodal-JP-Graph is a Japanese large vision-language model. It is based on [Llama-3.1-Swallow-70B](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.3) as the language model and the image encoder of [Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

# How to use

### 1. Install LLaVA-NeXT

- First, install LLaVA-NeXT by following the instructions in the [LLaVA-NeXT repository](https://github.com/LLaVA-VL/LLaVA-NeXT):

```sh
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
```

### 2. Install dependencies

```sh
pip install flash-attn==2.6.3
pip install transformers==4.45.2
```

### 3. Modify LLaVA-NeXT

Modify the LLaVA-NeXT code as follows (a scripted sketch of these file operations is shown after the list):

- Create the `LLaVA-NeXT/llava/model/multimodal_encoder/qwen2_vl` directory and copy the contents of the attached `qwen2_vl` directory into it.
- Overwrite `LLaVA-NeXT/llava/model/multimodal_encoder/builder.py` with the attached `builder.py`.
- Copy the attached `qwen2vl_encoder.py` into `LLaVA-NeXT/llava/model/multimodal_encoder/`.
- Overwrite `LLaVA-NeXT/llava/model/language_model/llava_llama.py` with the attached `llava_llama.py`.
- Overwrite `LLaVA-NeXT/llava/model/llava_arch.py` with the attached `llava_arch.py`.
- Overwrite `LLaVA-NeXT/llava/conversation.py` with the attached `conversation.py`.
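The same modifications can be scripted. Below is a minimal Python sketch, assuming the attached files have been saved into a local `attached/` directory (a hypothetical path; adjust it to wherever you placed them) and that it is run from the directory containing the `LLaVA-NeXT` checkout. Equivalent `cp` commands work just as well.

```python
# Minimal sketch of the step-3 file operations. The "attached/" source
# directory is a hypothetical path; point it at wherever the attached files
# were saved.
import shutil

encoder_dir = "LLaVA-NeXT/llava/model/multimodal_encoder"

# Create the qwen2_vl directory inside the encoder directory and copy the
# contents of the attached qwen2_vl directory into it.
shutil.copytree("attached/qwen2_vl", f"{encoder_dir}/qwen2_vl", dirs_exist_ok=True)

# Overwrite / add the individual files.
shutil.copy("attached/builder.py", f"{encoder_dir}/builder.py")
shutil.copy("attached/qwen2vl_encoder.py", f"{encoder_dir}/qwen2vl_encoder.py")
shutil.copy("attached/llava_llama.py", "LLaVA-NeXT/llava/model/language_model/llava_llama.py")
shutil.copy("attached/llava_arch.py", "LLaVA-NeXT/llava/model/llava_arch.py")
shutil.copy("attached/conversation.py", "LLaVA-NeXT/llava/conversation.py")
```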
### 4. Inference

The following script loads the model and runs inference, asking several questions about an example chart image.

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

# Load the model.
pretrained = 'r-g2-2024/Llama-3.1-70B-Instruct-multimodal-JP-Graph-v0.1'
model_name = "llava_llama"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.eval()

# Load and preprocess the input image.
image = Image.open("./画像14.png")
inputs = image_processor(image)
pixel_values = torch.tensor(inputs['pixel_values']).to(dtype=torch.float16, device=device)
pixel_values = [pixel_values]
_image_grid_thw = torch.tensor(inputs['image_grid_thw'], dtype=torch.long)
_image_grid_thw = [_image_grid_thw]

conv_template = "llava_llama_3"

# Question 1: "By how much did the non-consolidated (単体) value increase from FY22 to FY23?"
question = DEFAULT_IMAGE_TOKEN + "\nFY22からFY23にかけて単体の値はどれくらい増加したか?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
    input_ids,
    images=pixel_values,
    image_sizes=image_sizes,
    image_grid_thws=_image_grid_thw,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)

# Question 2: "What is the consolidated (連結) value for FY2021?"
question = DEFAULT_IMAGE_TOKEN + "\nFY2021の連結の値はいくつか?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
    input_ids,
    images=pixel_values,
    image_sizes=image_sizes,
    image_grid_thws=_image_grid_thw,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)

# Question 3: "What does this chart show?"
question = DEFAULT_IMAGE_TOKEN + "\nこの図は何を表しているか?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
    input_ids,
    images=pixel_values,
    image_sizes=image_sizes,
    image_grid_thws=_image_grid_thw,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)

# Question 4: "Was net income in FY2020 negative or positive?"
question = DEFAULT_IMAGE_TOKEN + "\nFY2020の純利益はマイナスか?プラスか?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
    input_ids,
    images=pixel_values,
    image_sizes=image_sizes,
    image_grid_thws=_image_grid_thw,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)

# Question 5: "From when does non-consolidated profit exceed consolidated profit?"
question = DEFAULT_IMAGE_TOKEN + "\n単体が連結の利益を上回るのはいつからか?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
    input_ids,
    images=pixel_values,
    image_sizes=image_sizes,
    image_grid_thws=_image_grid_thw,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
```
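Each question above repeats the same prompt-building and generation steps. If you want to ask several questions about the same image, a small wrapper like the one below can reduce the repetition. This is only a sketch, not part of the original recipe: it assumes the script above has already been run and reuses the objects created there (`tokenizer`, `model`, `pixel_values`, `_image_grid_thw`, `image`, `conv_template`, `device`, and the llava imports).

```python
# Optional convenience wrapper (a sketch, not part of the original recipe).
# It reuses the objects created by the inference script above.
def ask(question_text: str) -> str:
    """Ask a single question about the already-preprocessed image."""
    question = DEFAULT_IMAGE_TOKEN + "\n" + question_text
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    input_ids = tokenizer_image_token(
        conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
    ).unsqueeze(0).to(device)
    output_ids = model.generate(
        input_ids,
        images=pixel_values,
        image_sizes=[image.size],
        image_grid_thws=_image_grid_thw,
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

# Example: "What does this chart show?"
print(ask("この図は何を表しているか?"))
```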
# Usage and License Notices

This project incorporates certain datasets and model checkpoints that are subject to their respective original license agreements. Users are hereby notified that they must adhere to all applicable terms, conditions, and restrictions set forth in such original licenses. These include, without limitation, the Qwen License Agreement applicable to "Improved using Qwen", the MIT License applicable to a model used in creating the training dataset, and the specific licenses governing the base language models associated with any derived checkpoints (including, but not limited to, the Apache License Version 2.0 for Qwen2-VL-7B-Instruct, the Meta LLaMA 3.1 Community License, and the Gemma Terms of Use). In addition, this project utilizes code from LLaVA-NeXT during training, and users must also comply with the Apache License 2.0 accordingly. This project does not introduce any additional licensing terms beyond those imposed by the original licensors.

Users are further reminded that it is their sole responsibility to ensure that their access to and use of the datasets and model checkpoints is in full compliance with all applicable laws, regulations, and license terms.