DeepseekVLHybrid
Deepseek-VL-Hybrid was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images and generate contextually relevant responses. The model uses LLaMA as its text backbone, SigLIP to encode low-resolution images, and SAM (Segment Anything Model) to encode high-resolution images, which improves its ability to capture fine-grained visual details. Deepseek-VL-Hybrid is the variant of Deepseek-VL that adds the SAM-based high-resolution image encoder.
You can find all the original Deepseek-VL-Hybrid checkpoints under the DeepSeek-community organization.
Click on the Deepseek-VL-Hybrid models in the right sidebar for more examples of how to apply Deepseek-VL-Hybrid to different vision and language tasks.
The example below demonstrates how to generate text based on an image with Pipeline or the AutoModel class.
import torch
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="deepseek-community/deepseek-vl-7b-chat",
    device=0,
    torch_dtype=torch.float16
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ]
    }
]

pipe(text=messages, max_new_tokens=20, return_full_text=False)
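The same task can be run by loading the processor and model classes directly instead of a Pipeline. The following is a minimal sketch using DeepseekVLHybridForConditionalGeneration; the prompt, image URL, and generation settings are illustrative.

import torch
from transformers import AutoProcessor, DeepseekVLHybridForConditionalGeneration

processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ]
    }
]

# apply_chat_template tokenizes the prompt and prepares the image tensors
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, max_new_tokens=20)
# strip the prompt tokens before decoding
print(processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])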
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses torchao to quantize only the weights to int4.
import torch
from transformers import TorchAoConfig, DeepseekVLHybridForConditionalGeneration, AutoProcessor

quantization_config = TorchAoConfig(
    "int4_weight_only",
    group_size=128
)

model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
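After loading, the quantized model is used exactly like the full-precision one. A minimal sketch continuing from the snippet above (the image URL and prompt are illustrative):

processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."},
        ]
    }
]

# cast the floating-point image tensors to match the bfloat16 model weights
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])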
Notes
Do inference with multiple images in a single conversation.
import torch
from transformers import DeepseekVLHybridForConditionalGeneration, AutoProcessor

model = DeepseekVLHybridForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s the difference between"},
                {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
                {"type": "text", "text": " and "},
                {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}
            ]
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
                {"type": "text", "text": "What do you see in this image?"}
            ]
        }
    ]
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    truncation=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)
DeepseekVLHybridConfig
class transformers.DeepseekVLHybridConfig
< source >( text_config: AutoConfig = None vision_config: AutoConfig = None high_res_vision_config: AutoConfig = None image_token_id: int = 100015 **kwargs )
Parameters
- text_config (`Union[AutoConfig, dict]`, optional, defaults to `LlamaConfig`) — The config object or dictionary of the text backbone.
- vision_config (`Union[AutoConfig, dict]`, optional, defaults to `SiglipVisionConfig`) — The config object or dictionary of the vision backbone.
- high_res_vision_config (`Union[AutoConfig, dict]`, optional, defaults to `SamVisionConfig`) — The config object or dictionary of the high resolution vision backbone.
- image_token_id (`int`, optional, defaults to 100015) — The index representing image tokens in the model’s token vocabulary.
This is the configuration class to store the configuration of a DeepseekVLHybridModel. It is used to instantiate a DeepseekVLHybrid model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the DeepseekVLHybrid deepseek-community/deepseek-vl-7b-chat architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import DeepseekVLHybridConfig, DeepseekVLHybridModel
>>> # Initializing a DeepseekVLHybrid deepseek-community/deepseek-vl-7b-chat style configuration
>>> configuration = DeepseekVLHybridConfig()
>>> # Initializing a model (with random weights) from the deepseek-community/deepseek-vl-7b-chat style configuration
>>> model = DeepseekVLHybridModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
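The backbone configs can also be passed as dictionaries (or config objects) to customize individual components; any keys you omit fall back to the sub-config defaults. The values below are illustrative only, a sketch of a smaller-than-default text backbone:

>>> # hypothetical, reduced-size Llama text backbone for a quick experiment
>>> configuration = DeepseekVLHybridConfig(
...     text_config={"hidden_size": 1024, "num_hidden_layers": 4, "num_attention_heads": 8, "intermediate_size": 2816},
...     image_token_id=100015,
... )
>>> model = DeepseekVLHybridModel(configuration)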
DeepseekVLHybridProcessor
class transformers.DeepseekVLHybridProcessor
< source >( image_processor tokenizer chat_template = None num_image_tokens = 576 )
Parameters
- image_processor (DeepseekVLHybridImageProcessor) — The image processor is a required input.
- tokenizer (LlamaTokenizerFast) — The tokenizer is a required input.
- chat_template (`str`, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
- num_image_tokens (`int`, optional, defaults to 576) — The number of special image tokens used as placeholders for visual content in text sequences.
Constructs a DeepseekVLHybrid processor which wraps a DeepseekVLHybrid Image Processor and a Llama tokenizer into a single processor.
DeepseekVLHybridProcessor offers all the functionalities of DeepseekVLHybridImageProcessor and LlamaTokenizerFast. See `__call__()` and decode() for more information.
batch_decode
This method forwards all its arguments to LlamaTokenizerFast’s batch_decode(). Please refer to the docstring of this method for more information.
decode
This method forwards all its arguments to LlamaTokenizerFast’s decode(). Please refer to the docstring of this method for more information.
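Because batch_decode() and decode() are thin wrappers, calling them on the processor is equivalent to calling the underlying Llama tokenizer directly. A small sketch:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

# tokenize with the wrapped Llama tokenizer, then decode through the processor
token_ids = processor.tokenizer("Describe this image.")["input_ids"]
print(processor.decode(token_ids, skip_special_tokens=True))
print(processor.tokenizer.decode(token_ids, skip_special_tokens=True))  # identical output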
DeepseekVLHybridImageProcessor
class transformers.DeepseekVLHybridImageProcessor
< source >( do_resize: bool = True size: typing.Optional[dict[str, int]] = None high_res_size: typing.Optional[dict[str, int]] = None min_size: int = 14 resample: Resampling = <Resampling.BICUBIC: 3> high_res_resample: Resampling = <Resampling.BICUBIC: 3> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None high_res_image_mean: typing.Union[float, list[float], NoneType] = None high_res_image_std: typing.Union[float, list[float], NoneType] = None do_convert_rgb: typing.Optional[bool] = None **kwargs )
Parameters
- do_resize (`bool`, optional, defaults to `True`) — Whether to resize the image’s (height, width) dimensions to the specified `size`. Can be overridden by the `do_resize` parameter in the `preprocess` method.
- size (`dict`, optional, defaults to `{"height": 384, "width": 384}`) — Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess` method.
- high_res_size (`dict`, optional, defaults to `{"height": 1024, "width": 1024}`) — Size of the high resolution output image after resizing. Can be overridden by the `high_res_size` parameter in the `preprocess` method.
- min_size (`int`, optional, defaults to 14) — The minimum allowed size for the resized image. Ensures that neither the height nor width falls below this value after resizing.
- resample (`PILImageResampling`, optional, defaults to `Resampling.BICUBIC`) — Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `resample` parameter in the `preprocess` method.
- high_res_resample (`PILImageResampling`, optional, defaults to `Resampling.BICUBIC`) — Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `high_res_resample` parameter in the `preprocess` method.
- do_rescale (`bool`, optional, defaults to `True`) — Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` parameter in the `preprocess` method.
- rescale_factor (`int` or `float`, optional, defaults to `1/255`) — Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be overridden by the `rescale_factor` parameter in the `preprocess` method.
- do_normalize (`bool`, optional, defaults to `True`) — Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess` method.
- image_mean (`float` or `list[float]`, optional, defaults to `IMAGENET_STANDARD_MEAN`) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
- image_std (`float` or `list[float]`, optional, defaults to `IMAGENET_STANDARD_STD`) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
- high_res_image_mean (`float` or `list[float]`, optional, defaults to `OPENAI_CLIP_MEAN`) — Mean to use if normalizing the high resolution image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `high_res_image_mean` parameter in the `preprocess` method.
- high_res_image_std (`float` or `list[float]`, optional, defaults to `OPENAI_CLIP_STD`) — Standard deviation to use if normalizing the high resolution image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `high_res_image_std` parameter in the `preprocess` method.
- do_convert_rgb (`bool`, optional, defaults to `True`) — Whether to convert the image to RGB.
Constructs a DeepseekVLHybrid image processor.
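A minimal sketch of running the image processor on a single image. The exact output keys come from the processor, so inspect the returned BatchFeature; the low- and high-resolution pixel values are expected to follow the sizes documented above.

from PIL import Image
import requests
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat", use_fast=False)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

features = image_processor(images=image, return_tensors="pt")
print({name: tensor.shape for name, tensor in features.items()})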
pad_to_square
< source >( image: ndarray background_color: typing.Union[int, tuple[int, int, int]] = 0 data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) → np.ndarray
Parameters
- image (`np.ndarray`) — The image to pad.
- background_color (`int` or `tuple[int, int, int]`, optional, defaults to 0) — The color to use for the padding. Can be an integer for single channel or a tuple of integers for multi-channel images. If passed as an integer in multi-channel mode, it will default to `0` in subsequent channels.
- data_format (`str` or `ChannelDimension`, optional) — The channel dimension format for the output image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - If unset, will use the same format as the input image.
- input_data_format (`str` or `ChannelDimension`, optional) — The channel dimension format for the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
Returns
np.ndarray
The padded image.
Pads an image to a square based on the longest edge.
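For example, a landscape image is padded along its shorter side until it matches the longest edge. A sketch, assuming a channels-first NumPy array; the gray background color is arbitrary:

import numpy as np
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat", use_fast=False)

image = np.zeros((3, 480, 640), dtype=np.uint8)  # (num_channels, height, width)
padded = image_processor.pad_to_square(image, background_color=(127, 127, 127))
print(padded.shape)  # expected: (3, 640, 640)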
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None high_res_size: typing.Optional[dict[str, int]] = None resample: Resampling = None high_res_resample: Resampling = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None high_res_image_mean: typing.Union[float, list[float], NoneType] = None high_res_image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None do_convert_rgb: typing.Optional[bool] = None )
Parameters
- images (`ImageInput`) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- do_resize (`bool`, optional, defaults to `self.do_resize`) — Whether to resize the image.
- size (`Dict[str, int]`, optional, defaults to `self.size`) — Dictionary in the format `{"height": h, "width": w}` specifying the size of the output image after resizing.
- high_res_size (`Dict[str, int]`, optional, defaults to `self.high_res_size`) — Dictionary in the format `{"height": h, "width": w}` specifying the size of the high resolution output image after resizing.
- resample (`PILImageResampling` filter, optional, defaults to `self.resample`) — `PILImageResampling` filter to use if resizing the image e.g. `PILImageResampling.BILINEAR`. Only has an effect if `do_resize` is set to `True`.
- high_res_resample (`PILImageResampling` filter, optional, defaults to `self.resample`) — `PILImageResampling` filter to use if resizing the image e.g. `PILImageResampling.BICUBIC`. Only has an effect if `do_resize` is set to `True`.
- do_rescale (`bool`, optional, defaults to `self.do_rescale`) — Whether to rescale the image values between [0 - 1].
- rescale_factor (`float`, optional, defaults to `self.rescale_factor`) — Rescale factor to rescale the image by if `do_rescale` is set to `True`.
- do_normalize (`bool`, optional, defaults to `self.do_normalize`) — Whether to normalize the image.
- image_mean (`float` or `List[float]`, optional, defaults to `self.image_mean`) — Image mean to use if `do_normalize` is set to `True`.
- image_std (`float` or `List[float]`, optional, defaults to `self.image_std`) — Image standard deviation to use if `do_normalize` is set to `True`.
- high_res_image_mean (`float` or `List[float]`, optional, defaults to `self.high_res_image_mean`) — Image mean to use if `do_normalize` is set to `True`.
- high_res_image_std (`float` or `List[float]`, optional, defaults to `self.high_res_image_std`) — Image standard deviation to use if `do_normalize` is set to `True`.
- return_tensors (`str` or `TensorType`, optional) — The type of tensors to return. Can be one of:
  - Unset: Return a list of `np.ndarray`.
  - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
  - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
  - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
  - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
- data_format (`ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST`) — The channel dimension format for the output image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input image.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
- do_convert_rgb (`bool`, optional, defaults to `self.do_convert_rgb`) — Whether to convert the image to RGB.
Preprocess an image or batch of images.
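Per-call arguments take precedence over the values stored in the image processor configuration. A short sketch that overrides the resize targets at call time (the values shown are the documented defaults, used here only for illustration):

from PIL import Image
import requests
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat", use_fast=False)
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# call-time arguments override the stored configuration for this call only
features = image_processor.preprocess(
    image,
    size={"height": 384, "width": 384},
    high_res_size={"height": 1024, "width": 1024},
    do_convert_rgb=True,
    return_tensors="pt",
)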
resize
< source >( image: ndarray size: typing.Union[dict[str, int], int] background_color: typing.Optional[tuple[int, int, int]] = None resample: Resampling = <Resampling.BICUBIC: 3> data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs ) → np.ndarray
Parameters
- image (`np.ndarray`) — Image to resize.
- size (`dict[str, int]` or `int`) — The size to resize the image to. If a dictionary, it should have the keys `"height"` and `"width"`.
- background_color (`tuple[int, int, int]`) — The background color to use for the padding.
- resample (`PILImageResampling`, optional, defaults to `PILImageResampling.BICUBIC`) — `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BICUBIC`.
- data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `None`: will be inferred from the input.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
Returns
np.ndarray
The resized image.
Resize an image to a dynamically calculated size.
DeepseekVLHybridImageProcessorFast
class transformers.DeepseekVLHybridImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.models.deepseek_vl_hybrid.image_processing_deepseek_vl_hybrid_fast.DeepseekVLHybridFastImageProcessorKwargs] )
Constructs a fast DeepseekVLHybrid image processor.
pad_to_square
< source >( images: torch.Tensor background_color: typing.Union[int, tuple[int, int, int]] = 0 ) → torch.Tensor
Parameters
- images (`torch.Tensor`) — The images to pad.
- background_color (`int` or `tuple[int, int, int]`, optional, defaults to 0) — The color to use for the padding. Can be an integer for single channel or a tuple of integers for multi-channel images. If passed as an integer in multi-channel mode, it will default to `0` in subsequent channels.
Returns
torch.Tensor
The padded images.
Pads an image to a square based on the longest edge.
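The fast processor operates on torch tensors, so padding can be applied to a whole batch at once. A sketch, assuming channels-first images; the shapes are illustrative:

import torch
from transformers import DeepseekVLHybridImageProcessorFast

fast_image_processor = DeepseekVLHybridImageProcessorFast.from_pretrained("deepseek-community/deepseek-vl-7b-chat")

images = torch.zeros(2, 3, 480, 640)  # (batch_size, num_channels, height, width)
padded = fast_image_processor.pad_to_square(images, background_color=0)
print(padded.shape)  # expected: torch.Size([2, 3, 640, 640])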
DeepseekVLHybridModel
class transformers.DeepseekVLHybridModel
< source >( config )
Parameters
- config (DeepseekVLHybridConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare DeepseekVLHybrid model, outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: LongTensor = None pixel_values: FloatTensor = None high_res_pixel_values: FloatTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None cache_position: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs )
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) — The tensors corresponding to the input images. Pixel values can be obtained using DeepseekVLHybridImageProcessor. See DeepseekVLHybridImageProcessor.__call__() for details (DeepseekVLHybridProcessor uses DeepseekVLHybridImageProcessor for processing images).
- high_res_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, optional) — The tensors corresponding to the high resolution input images. Pixel values can be obtained using AutoImageProcessor.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- past_key_values (`~cache_utils.Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Only a Cache instance is allowed as input; see our kv cache guide. If no `past_key_values` are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- cache_position (`torch.LongTensor` of shape `(sequence_length)`, optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to `position_ids`, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
- logits_to_keep (`Union[int, torch.Tensor]`, defaults to `0`) — If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes quite significant for long sequences or a large vocabulary size. If a `torch.Tensor`, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
The DeepseekVLHybridModel forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
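A minimal sketch of calling the base model to obtain hidden states rather than generated text. This assumes the forward pass returns a standard output object with a last_hidden_state attribute; the prompt and image URL are illustrative.

import torch
from transformers import AutoProcessor, DeepseekVLHybridModel

processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
model = DeepseekVLHybridModel.from_pretrained(
    "deepseek-community/deepseek-vl-7b-chat",
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."},
        ]
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=model.dtype)

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)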
DeepseekVLHybridForConditionalGeneration
class transformers.DeepseekVLHybridForConditionalGeneration
< source >( config: DeepseekVLHybridConfig )
forward
< source >( input_ids: LongTensor = None pixel_values: FloatTensor = None high_res_pixel_values: FloatTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None cache_position: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] )
Parameters
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`) — The tensors corresponding to the input images. Pixel values can be obtained using DeepseekVLHybridImageProcessor. See DeepseekVLHybridImageProcessor.__call__() for details (DeepseekVLHybridProcessor uses DeepseekVLHybridImageProcessor for processing images).
- high_res_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`, optional) — The tensors corresponding to the high resolution input images. Pixel values can be obtained using AutoImageProcessor.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- past_key_values (`~cache_utils.Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Only a Cache instance is allowed as input; see our kv cache guide. If no `past_key_values` are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user is expected to input only unprocessed `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, unprocessed_length)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- cache_position (`torch.LongTensor` of shape `(sequence_length)`, optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to `position_ids`, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
- use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
- logits_to_keep (`Union[int, torch.Tensor]`, defaults to `0`) — If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes quite significant for long sequences or a large vocabulary size. If a `torch.Tensor`, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
The DeepseekVLHybridForConditionalGeneration forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, DeepseekVLHybridForConditionalGeneration
>>> model = DeepseekVLHybridForConditionalGeneration.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
>>> processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-7b-chat")
>>> messages = [
... {
... "role": "user", "content": [
... {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
... {"type": "text", "text": "Where is the cat standing?"},
... ]
... },
... ]
>>> inputs = processor.apply_chat_template(
... messages,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
... add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
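Continuing from the example above, labels can be passed to compute a language-modeling loss. This is a hedged sketch: reusing the prompt as labels and masking image placeholder positions with -100 is a common convention for illustration, not a documented requirement.

>>> # illustrative only: reuse the prompt as labels and ignore image placeholder tokens in the loss
>>> labels = inputs["input_ids"].clone()
>>> labels[labels == model.config.image_token_id] = -100
>>> outputs = model(**inputs, labels=labels)
>>> outputs.loss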