We are a group of researchers focused on large multimodal models (LMMs), aiming to bring insights to the community through our research.
LLaVA-OneVision 1.5 ⭐ 754
A fully open-source family of Large Multimodal Models achieving state-of-the-art performance at substantially lower cost. Trains on native-resolution images with an end-to-end Megatron-LM-based framework supporting MoE, FP8, and long-sequence parallelization, all for under $16,000 on A100 GPUs. Outperforms Qwen2.5-VL on most benchmarks. Includes open pre-training & SFT data, training code, recipes, and full logs.
🤗 Models & Datasets | 🖥️ Demo | 📄 Tech Report
NEO ⭐ 653 ICLR 2026
NEO Series: Native Vision-Language Models built from first principles. Rethinks multimodal design by deeply integrating vision and language within a dense, monolithic architecture, rather than bolting a vision encoder onto a language model. With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming other native ones.
OneVision-Encoder ⭐ 269 CVPR 2025
A vision encoder designed around codec-aligned sparsity as a foundational principle for multimodal intelligence. Abandons uniform computation to selectively encode only 3.1%-25% of regions rich in signal entropy, consistently outperforming Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks despite using substantially fewer visual tokens.
🌐 Project Page | 📄 Paper | 🤗 Models
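The selection step behind codec-aligned sparsity, keeping only the highest-entropy regions of an image, can be sketched as follows. This is an illustrative sketch only: the patch size, the histogram-entropy score, and the keep ratio are assumptions for demonstration, not the encoder's actual scoring mechanism.

```python
import numpy as np

def select_high_entropy_patches(image, patch=16, keep_ratio=0.25):
    """Return indices of the patches with the highest intensity entropy.

    Illustrative only: a codec-aligned encoder would score regions with
    codec statistics rather than a raw 32-bin histogram entropy.
    """
    h, w = image.shape
    ph, pw = h // patch, w // patch
    patches = image[:ph * patch, :pw * patch].reshape(ph, patch, pw, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(ph * pw, patch * patch)

    # Shannon entropy of each patch's intensity histogram.
    entropies = np.empty(len(patches))
    for i, p in enumerate(patches):
        hist, _ = np.histogram(p, bins=32, range=(0, 256))
        prob = hist[hist > 0] / hist.sum()
        entropies[i] = -np.sum(prob * np.log2(prob))

    k = max(1, int(keep_ratio * len(patches)))     # e.g. keep 25% of patches
    return np.sort(np.argsort(entropies)[-k:])     # indices of kept patches
```

Only the kept patches would then be tokenized and passed to the encoder; uniform, low-entropy regions contribute no tokens at all.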
Otter ⭐ 3.3k IEEE TPAMI 2025
A multimodal model based on OpenFlamingo (the open-source version of DeepMind's Flamingo), trained on the MIMIC-IT dataset with 2.8M multimodal in-context instruction-response pairs. Demonstrates improved instruction-following and in-context learning across vision-language tasks, and served as an early exploration into instruction-tuned multimodal models.
📄 Otter Paper | 📄 MIMIC-IT Paper | 🤗 Models | 🤗 MIMIC-IT Dataset
LongVA ⭐ 402 TMLR 2025
Transfers long-context capabilities from language to vision. LongVA can process 2,000 frames or over 200K visual tokens, achieving state-of-the-art performance on Video-MME among 7B models and demonstrating that long-context capability can transfer zero-shot from language to vision.
🌐 Blog | 📄 Paper | 🤗 Models | 🎥 Demo
RelateAnything ⭐ 462
The Relate Anything Model (RAM) takes an image as input and leverages SAM to identify corresponding masks, then reasons about relationships between any detected objects. Built on the Panoptic Scene Graph Generation work (ECCV 2022).
🤗 Demo | 📦 PSG Dataset
OpenR1-Multimodal ⭐ 1.5k
A speed-run investigation of R1's paradigm applied to multimodal models. Built on top of open-r1 and trl, this project adds multimodal model training with the GRPO algorithm, open-sourcing 8K multimodal RL training examples, trained models, and training scripts for community study on multimodal reasoning.
🤗 Models | 🤗 Datasets | 📊 Wandb Logs
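The core of GRPO is replacing a learned value critic with group-relative reward normalization: for each prompt, a group of responses is sampled and each reward is scored against its own group's statistics. A minimal sketch of that advantage computation (not the project's actual trl-based training code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for GRPO.

    rewards: array of shape (num_prompts, group_size), one row per prompt,
    one column per sampled response. Each response's advantage is its
    reward standardized against its own group's mean and std, so no
    separate value network is needed.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)
```

These advantages then weight the per-token log-probability ratios in a clipped, PPO-style objective; removing the critic is what makes the recipe cheap enough for a speed-run study.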
OpenMMReasoner ⭐ 145 CVPR 2026
A fully transparent two-stage recipe (SFT + RL) for pushing the frontiers of multimodal reasoning. Constructs an 874K-sample cold-start dataset with step-by-step validation and a 74K-sample RL dataset, achieving an 11.6% improvement over Qwen2.5-VL-7B-Instruct across nine multimodal reasoning benchmarks.
📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🌐 Blog
MMSearch-R1 ⭐ 402
An end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools. Integrates both image and text search capabilities, training models to autonomously reason about when and how to invoke external search tools.
📄 Paper | 🌐 Blog | 🤗 Model | 🤗 Data
LongVT ⭐ 195 CVPR 2026
Incentivizes "Thinking with Long Videos" via native tool calling. LongVT exploits LMMs' inherent temporal grounding ability as a native video cropping tool, enabling a global-to-local reasoning loop where the model skims globally and examines relevant clips for details until answers are grounded in visual evidence.
📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🖥️ Demo | 🌐 Blog
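The global-to-local loop described above can be sketched as a simple tool-calling cycle: skim the full video once, then repeatedly crop and inspect clips until the model commits to an answer. All function and class names here are hypothetical stand-ins for illustration; the actual tool interface lives in the repo.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str            # "answer" or "crop" -- hypothetical response type
    text: str = ""
    start: float = 0.0
    end: float = 0.0

def answer_with_video_cropping(model, video, question, max_turns=4):
    """Global-to-local reasoning loop (illustrative sketch).

    `model` is assumed to expose .caption(...) and .respond(...), and
    `video` a .crop(start, end) method -- hypothetical APIs standing in
    for the native video-cropping tool call.
    """
    context = [model.caption(video)]              # coarse, global skim
    for _ in range(max_turns):
        step = model.respond(question, context)
        if step.kind == "answer":
            return step.text                      # grounded in inspected clips
        clip = video.crop(step.start, step.end)   # native cropping tool call
        context.append(model.caption(clip))       # local, detailed look
    return model.respond(question, context, force_answer=True).text
```

The loop terminates either when the model answers or when the turn budget is exhausted, at which point it is forced to answer from the evidence gathered so far.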
LMMS-Eval ⭐ 3.8k
The unified evaluation toolkit for large multimodal models, covering 100+ tasks across text, image, video, and audio. Supports 30+ models with reproducible, efficient, and statistically grounded benchmarking. Available on PyPI and translated into 17 languages.
🏠 Homepage | 📚 Documentation | 📦 PyPI
Multimodal-SAE ⭐ 183 ICCV 2025
For the first time in the multimodal domain, demonstrates that features learned by Sparse Autoencoders (SAEs) in a smaller LMM can be interpreted by a larger LMM. Provides a complete auto-interpretation pipeline for analyzing open-semantic features and steering model behavior.
📄 Paper | 🤗 Models & Data
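A sparse autoencoder of the kind analyzed here maps a model's hidden states into a much wider feature space where only a few features activate per input, then reconstructs the original activation. A minimal forward-pass sketch with random weights for illustration (real SAEs are trained with a reconstruction loss plus an L1 sparsity penalty on the features):

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One SAE forward pass over hidden states x of shape (batch, d_model).

    The ReLU encoder produces sparse activations in an overcomplete
    feature space (d_feat >> d_model); the linear decoder reconstructs x.
    """
    f = np.maximum(0.0, x @ W_enc + b_enc)   # sparse feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction of the input
    return f, x_hat

rng = np.random.default_rng(0)
d_model, d_feat = 8, 64                      # feature space is overcomplete
W_enc = rng.normal(size=(d_model, d_feat)) * 0.1
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1
x = rng.normal(size=(4, d_model))
# A negative encoder bias plus ReLU drives many activations to exactly zero.
f, x_hat = sae_forward(x, W_enc, np.full(d_feat, -0.2), W_dec, np.zeros(d_model))
```

Interpretation then works feature by feature: collect the inputs that maximally activate a given feature and ask a larger LMM to describe what they have in common.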
LMMs-Engine ⭐ 735
A simple, unified multimodal model training engine. Supports FSDP2, USP, Muon optimizer, Liger kernel, packing, and expert parallelism across models like Qwen2.5-VL, Qwen3-VL, BAGEL, WanVideo, and more. Lean, flexible, and built for hacking at scale.
EgoLife ⭐ 399 CVPR 2025
For one week, six individuals lived together, capturing every moment through AI glasses, creating the EgoLife dataset. Includes EgoGPT (omni-modal clip-level understanding) and EgoRAG (long-context QA with hierarchical memory). Built to drive the future of egocentric AI life assistants.
📄 Paper | 🌐 Project Page | 🤗 Data