How can I use this model to get a description of a video in JavaScript?

#27
by zlelik - opened

I tried using Qwen2-VL to generate video descriptions, and it works nicely in the web demo at https://huggingface.co/spaces/Qwen/Qwen2.5-VL-72B-Instruct — but that demo runs server-side in Python. I want to do the same client-side in JavaScript.
Based on various Python examples, I came up with the following code.

import { AutoProcessor, Qwen2VLForConditionalGeneration, load_video} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.5.0';
const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const url = 'https://cdn.glitch.global/ff1f5dc9-0921-477a-82d5-e3c1157e4a98/VID_20230808_115215.mp4?v=1744834444628';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id, {
    device: "webgpu"
});
console.log("The model is loaded");

const video = await load_video(url, {fps: 30});
console.log("Video is loaded");

const messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": url,
                "min_pixels": 4 * 28 * 28,        // per-frame lower bound
                "max_pixels": 256 * 28 * 28,      // per-frame upper bound
                "total_pixels": 20480 * 28 * 28,  // budget across all frames
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
];

const text = processor.apply_chat_template(messages, {add_generation_prompt: true });

const inputs = await processor(text, null, video); // no images, one video

const outputs = await model.generate({
    ...inputs,
    max_new_tokens: 128,
});

const decoded = processor.batch_decode(
    outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
    { skip_special_tokens: true, clean_up_tokenization_spaces: false },
);
console.log(`output video: ${decoded[0]}`);

But it always returns a description consisting of the single character 'I'. The final output is always:

output video: I
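For reference, here is how I understand the pixel-budget arithmetic in my messages object (assuming min_pixels/max_pixels cap each frame and total_pixels caps the whole video, as in the Python qwen-vl-utils helpers — please correct me if that's wrong):

```javascript
// Rough frame-budget check. Assumption: max_pixels is a per-frame cap and
// total_pixels is a whole-video cap, as in the Python qwen-vl-utils helpers.
const PATCH = 28;
const maxPixelsPerFrame = 256 * PATCH * PATCH;   // per-frame cap from my message
const totalPixels = 20480 * PATCH * PATCH;       // whole-video cap from my message

// If every frame is resized up to the per-frame cap, at most this many
// frames fit in the total budget:
const maxFrames = Math.floor(totalPixels / maxPixelsPerFrame);
console.log(maxFrames); // 80
```

If that is right, then at fps: 30 even a few seconds of video would exceed 80 frames, so maybe the frames are being downscaled or dropped in a way that hurts the output — but I'm not sure this is the actual cause.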

Similar code for an image works perfectly:

import { AutoProcessor, Qwen2VLForConditionalGeneration, RawImage} from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.5.0';
const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const url = 'https://cdn.glitch.global/ff1f5dc9-0921-477a-82d5-e3c1157e4a98/3.JPG?v=1744659987837';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id, {
    device: "webgpu"
});
console.log("The model is loaded");

const image = await (await RawImage.read(url)).resize(448, 448);
console.log("Image is loaded");
    
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image." },
    ],
  },
];

const text = processor.apply_chat_template(messages, { add_generation_prompt: true });

const inputs = await processor(text, image);

const outputs = await model.generate({
    ...inputs,
    max_new_tokens: 128,
});

const decoded = processor.batch_decode(
    outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
    { skip_special_tokens: true, clean_up_tokenization_spaces: false },
);
console.log(`output image: ${decoded[0]}`);

What am I doing wrong with the video captioning?
