How can I use this model to get a description of a video in JavaScript?
#27 opened by zlelik
I tried using Qwen2-VL to create video descriptions, and it works nicely in the web demo here: https://huggingface.co/spaces/Qwen/Qwen2.5-VL-72B-Instruct. But that is a server-side Python demo; I want to do the same in JavaScript, on the client side.
Based on various Python examples, I came up with the following code:
```js
import { AutoProcessor, Qwen2VLForConditionalGeneration, load_video } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.5.0';

const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const url = 'https://cdn.glitch.global/ff1f5dc9-0921-477a-82d5-e3c1157e4a98/VID_20230808_115215.mp4?v=1744834444628';

const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id, {
  device: "webgpu",
});
console.log("The model is loaded");

const video = await load_video(url, { fps: 30 });
console.log("Video is loaded");

const messages = [
  {
    role: "user",
    content: [
      {
        type: "video",
        video: url,
        min_pixels: 4 * 28 * 28,
        max_pixels: 256 * 28 * 28,
        total_pixels: 20480 * 28 * 28,
      },
      { type: "text", text: "Describe this video." },
    ],
  },
];

const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, null, video);

const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true, clean_up_tokenization_spaces: false },
);
console.log(`output video: ${decoded[0]}`);
```
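For reference, the `min_pixels` / `max_pixels` / `total_pixels` values in the message above are expressed in multiples of 28×28 (the Qwen2-VL patch size). A quick sanity check of the budget they imply (the variable names below are my own; whether one patch maps to exactly one visual token depends on the model's internal patch merging, so treat the patch counts as upper bounds):

```javascript
// Qwen2-VL pixel budgets are written as multiples of its 28x28 patch size.
const PATCH = 28 * 28; // pixels per patch

const min_pixels = 4 * PATCH;       // at least 4 patches per frame
const max_pixels = 256 * PATCH;     // at most 256 patches per frame
const total_pixels = 20480 * PATCH; // combined budget across all sampled frames

// Upper bound on patches the whole video can contribute:
const maxPatchesTotal = total_pixels / PATCH;

console.log(min_pixels, max_pixels, maxPatchesTotal); // 3136 200704 20480
```

So at `max_pixels`, the total budget covers roughly 80 frames' worth of patches; sampling more frames than that forces the processor to shrink each frame.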
but it always returns a description consisting of a single character, 'I'. The final output is always:

```
output video: I
```
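One thing that might be worth ruling out: at `fps: 30` the clip is sampled very densely, so the frame count (and with it the video token count) can overwhelm the pixel budget. As a debugging step, a helper like the following could cap the number of frames before they are handed to the processor. This is only a sketch: `subsampleFrames` is my own name, not a transformers.js API, and it assumes `load_video` returns an array-like sequence of frames.

```javascript
// Hypothetical helper (not part of transformers.js): keep at most
// maxFrames frames, evenly spaced across the loaded sequence.
function subsampleFrames(frames, maxFrames) {
  if (frames.length <= maxFrames) return frames;
  const step = (frames.length - 1) / (maxFrames - 1);
  return Array.from({ length: maxFrames }, (_, i) => frames[Math.round(i * step)]);
}

// e.g. keep only 16 of the frames load_video returned:
// const video = subsampleFrames(await load_video(url, { fps: 30 }), 16);

// Illustration with plain numbers standing in for frames:
console.log(subsampleFrames([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 4)); // [ 0, 3, 6, 9 ]
```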
Similar code for an image works perfectly:
```js
import { AutoProcessor, Qwen2VLForConditionalGeneration, RawImage } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.5.0';

const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const url = 'https://cdn.glitch.global/ff1f5dc9-0921-477a-82d5-e3c1157e4a98/3.JPG?v=1744659987837';

const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id, {
  device: "webgpu",
});
console.log("The model is loaded");

const image = await (await RawImage.read(url)).resize(448, 448);
console.log("Image is loaded");

const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image." },
    ],
  },
];

const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, image);

const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true, clean_up_tokenization_spaces: false },
);
console.log(`output image: ${decoded[0]}`);
```
What am I doing wrong with the video captioning?