Paper (Hugging Face), published 2026-06-10, HF ↑23

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Authors: Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, and 9 others

Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is dri…

#multimodal #llm #vision
