InstructVLA
Paper, data, and checkpoints for "InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation".
Weights

checkpoints/step-013500-epoch-01-loss=0.1093.pt: Complete model checkpoint for direct evaluation.
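If you want to sanity-check the downloaded weights before running the repo code, the sketch below assumes the .pt file is an ordinary torch.save artifact (a dictionary of tensors and metadata); if it instead pickles custom classes, the InstructVLA repo must be importable when you load it.

import torch

# Inspect the checkpoint on CPU without instantiating the model.
# weights_only=False allows pickled Python objects (only do this for a trusted source).
ckpt = torch.load(
    "checkpoints/step-013500-epoch-01-loss=0.1093.pt",
    map_location="cpu",
    weights_only=False,
)
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))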
Evaluation Results

results_step-013500-epoch-01-loss=0.1093_cot_1-3: Results on SimplerEnv-Instruct with multimodal reasoning.
results_step-013500-epoch-01-loss=0.1093_instruct_no_cot_1_3: Results on SimplerEnv-Instruct without multimodal reasoning.
results_step-006000-epoch-01-loss=0.1724_simpler_1-3: Results on SimplerEnv.
results_step-006000-epoch-01-loss=0.1724_simpler_cot_1-3: Results on SimplerEnv with multimodal reasoning (performance is similar to the no-reasoning setting).
vlmeval: Multimodal understanding performance.

This checkpoint supports dialogue; please check our code:
import torch
import numpy as np
from PIL import Image

from vla.instructvla_eagle_dual_sys_v2_meta_query_v2 import load, load_vla

model_path = 'outputs/release_ckpts/instructvla_finetune_v2_xlora_freeze_head_instruction--image_aug/checkpoints/step-013500-epoch-01-loss=0.1093.pt'

# Load the Stage-2 (Generalist) model in bfloat16 on GPU
model = load_vla(model_path, stage="stage2").eval().to(torch.bfloat16).cuda()

# Chat-style prompt: a system message followed by a user turn that carries an image
messages = [
    {"content": "You are a helpful assistant."},  # system message
    {
        "role": "user",
        "content": "Can you describe the main idea of this image?",
        "image": [{'np_array': np.asarray(Image.open("./asset/teaser.png"))}],
    },
]

# Preprocess the prompt into model inputs (token ids, attention mask, pixel values)
inputs = model.processor.prepare_input(dict(prompt=messages))

# Generate a response from the VLM backbone under bfloat16 autocast
autocast_dtype = torch.bfloat16
with torch.autocast("cuda", dtype=autocast_dtype, enabled=True):
    output = model.vlm.generate(
        input_ids=inputs['input_ids'].cuda(),
        attention_mask=inputs['attention_mask'].cuda(),
        pixel_values=inputs['pixel_values'].cuda(),
        max_new_tokens=200,
        output_hidden_states=False,
    )

# Decode the generated token ids back to text
response = model.processor.tokenizer.decode(output[0])
print(response)
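The decoded string may still contain special tokens from the chat template. Assuming model.processor.tokenizer is a standard Hugging Face tokenizer, they can be stripped during decoding:

# Optional: drop special tokens (e.g. end-of-sequence markers) from the decoded output
clean_response = model.processor.tokenizer.decode(output[0], skip_special_tokens=True)
print(clean_response)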
Base model
nvidia/Eagle2-2B
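For reference, the base VLM can also be pulled directly from the Hugging Face Hub. This is a minimal sketch assuming the standard transformers remote-code loading pattern; check the nvidia/Eagle2-2B model card for the exact processor and generation API.

import torch
from transformers import AutoModel, AutoProcessor

# Assumes Eagle2-2B is exposed through the Auto* classes with trust_remote_code
base_model = AutoModel.from_pretrained(
    "nvidia/Eagle2-2B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()
base_processor = AutoProcessor.from_pretrained("nvidia/Eagle2-2B", trust_remote_code=True)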