HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
Abstract
Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is dri…