We are a group of researchers focused on large multimodal models (LMMs), aiming to bring insights to the community through our research.
LLaVA-OneVision 1.5 ⭐ 754
A fully open-source family of Large Multimodal Models achieving state-of-the-art performance at substantially lower cost. Trains on native-resolution images with an end-to-end Megatron-LM-based framework supporting MoE, FP8, and long-sequence parallelization, all for under $16,000 on A100 GPUs. Outperforms Qwen2.5-VL on most benchmarks. Includes open pre-training & SFT data, training code, recipes, and full logs.
🤗 Models & Datasets | 🖥️ Demo | 📄 Tech Report
NEO ⭐ 653 ICLR 2026
NEO Series: Native Vision-Language Models built from first principles. Rethinks multimodal design by deeply integrating vision and language within a dense, monolithic architecture, rather than bolting a vision encoder onto a language model. With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming other native ones.
OneVision-Encoder ⭐ 269 CVPR 2025
A vision encoder designed around codec-aligned sparsity as a foundational principle for multimodal intelligence. Abandons uniform computation to selectively encode only 3.1%-25% of regions rich in signal entropy, consistently outperforming Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks despite using substantially fewer visual tokens.
🌐 Project Page | 📄 Paper | 🤗 Models
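The selection step behind codec-aligned sparsity, keeping only the highest-entropy regions of an image, can be sketched as follows. This is an illustrative sketch only: the patch size, the histogram-entropy score, and the keep ratio are assumptions for demonstration, not the encoder's actual scoring mechanism.

```python
import numpy as np

def select_high_entropy_patches(image, patch=16, keep_ratio=0.25):
    """Return indices of the patches with the highest intensity entropy.

    Illustrative only: a codec-aligned encoder would score regions with
    codec statistics rather than a raw 32-bin histogram entropy.
    """
    h, w = image.shape
    ph, pw = h // patch, w // patch
    patches = image[:ph * patch, :pw * patch].reshape(ph, patch, pw, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(ph * pw, patch * patch)

    # Shannon entropy of each patch's intensity histogram.
    entropies = np.empty(len(patches))
    for i, p in enumerate(patches):
        hist, _ = np.histogram(p, bins=32, range=(0, 256))
        prob = hist[hist > 0] / hist.sum()
        entropies[i] = -np.sum(prob * np.log2(prob))

    k = max(1, int(keep_ratio * len(patches)))     # e.g. keep 25% of patches
    return np.sort(np.argsort(entropies)[-k:])     # indices of kept patches
```

Only the kept patches would then be tokenized and passed to the encoder; uniform, low-entropy regions contribute no tokens at all.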
Otter ⭐ 3.3k IEEE TPAMI 2025
A multimodal model based on OpenFlamingo (the open-source version of DeepMind's Flamingo), trained on the MIMIC-IT dataset with 2.8M multimodal in-context instruction-response pairs. Demonstrates improved instruction-following and in-context learning across vision-language tasks, and served as an early exploration into instruction-tuned multimodal models.
📄 Otter Paper | 📄 MIMIC-IT Paper | 🤗 Models | 🤗 MIMIC-IT Dataset
LongVA ⭐ 402 TMLR 2025
Transfers long-context capabilities from language to vision. LongVA can process 2,000 frames or over 200K visual tokens, achieving state-of-the-art performance on Video-MME among 7B models and demonstrating that long-context capability can transfer zero-shot from language to vision.
🌐 Blog | 📄 Paper | 🤗 Models | 🎥 Demo
RelateAnything ⭐ 462
The Relate Anything Model (RAM) takes an image as input and leverages SAM to identify corresponding masks, then reasons about relationships between any detected objects. Built on the Panoptic Scene Graph Generation work (ECCV 2022).
🤗 Demo | 📦 PSG Dataset
OpenR1-Multimodal ⭐ 1.5k
A speed-run investigation of R1's paradigm applied to multimodal models. Built on top of open-r1 and trl, this project adds multimodal model training with the GRPO algorithm, open-sourcing 8K multimodal RL training examples, trained models, and training scripts for community study on multimodal reasoning.
🤗 Models | 🤗 Datasets | 📊 Wandb Logs
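The core of GRPO is replacing a learned value critic with group-relative reward normalization: for each prompt, a group of responses is sampled and each reward is scored against its own group's statistics. A minimal sketch of that advantage computation (not the project's actual trl-based training code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for GRPO.

    rewards: array of shape (num_prompts, group_size), one row per prompt,
    one column per sampled response. Each response's advantage is its
    reward standardized against its own group's mean and std, so no
    separate value network is needed.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)
```

These advantages then weight the per-token log-probability ratios in a clipped, PPO-style objective; removing the critic is what makes the recipe cheap enough for a speed-run study.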
OpenMMReasoner ⭐ 145 CVPR 2026
A fully transparent two-stage recipe (SFT + RL) for pushing the frontiers of multimodal reasoning. Constructs an 874K-sample cold-start dataset with step-by-step validation and a 74K-sample RL dataset, achieving an 11.6% improvement over Qwen2.5-VL-7B-Instruct across nine multimodal reasoning benchmarks.
📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🌐 Blog
MMSearch-R1 ⭐ 402
An end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools. Integrates both image and text search capabilities, training models to autonomously reason about when and how to invoke external search tools.
📄 Paper | 🌐 Blog | 🤗 Model | 🤗 Data
LongVT ⭐ 195 CVPR 2026
Incentivizes "Thinking with Long Videos" via native tool calling. LongVT exploits LMMs' inherent temporal grounding ability as a native video cropping tool, enabling a global-to-local reasoning loop where the model skims globally and examines relevant clips for details until answers are grounded in visual evidence.
📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🖥️ Demo | 🌐 Blog
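The global-to-local loop described above can be sketched as a simple tool-calling cycle: skim the full video once, then repeatedly crop and inspect clips until the model commits to an answer. All function and class names here are hypothetical stand-ins for illustration; the actual tool interface lives in the repo.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str            # "answer" or "crop" -- hypothetical response type
    text: str = ""
    start: float = 0.0
    end: float = 0.0

def answer_with_video_cropping(model, video, question, max_turns=4):
    """Global-to-local reasoning loop (illustrative sketch).

    `model` is assumed to expose .caption(...) and .respond(...), and
    `video` a .crop(start, end) method -- hypothetical APIs standing in
    for the native video-cropping tool call.
    """
    context = [model.caption(video)]              # coarse, global skim
    for _ in range(max_turns):
        step = model.respond(question, context)
        if step.kind == "answer":
            return step.text                      # grounded in inspected clips
        clip = video.crop(step.start, step.end)   # native cropping tool call
        context.append(model.caption(clip))       # local, detailed look
    return model.respond(question, context, force_answer=True).text
```

The loop terminates either when the model answers or when the turn budget is exhausted, at which point it is forced to answer from the evidence gathered so far.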
LMMS-Eval ⭐ 3.8k
The unified evaluation toolkit for large multimodal models, covering 100+ tasks across text, image, video, and audio. Supports 30+ models with reproducible, efficient, and statistically grounded benchmarking. Available on PyPI and translated into 17 languages.
🏠 Homepage | 📚 Documentation | 📦 PyPI
Multimodal-SAE ⭐ 183 ICCV 2025
For the first time in the multimodal domain, demonstrates that features learned by Sparse Autoencoders (SAEs) in a smaller LMM can be interpreted by a larger LMM. Provides a complete auto-interpretation pipeline for analyzing open-semantic features and steering model behavior.
📄 Paper | 🤗 Models & Data
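A sparse autoencoder of the kind analyzed here maps a model's hidden states into a much wider feature space where only a few features activate per input, then reconstructs the original activation. A minimal forward-pass sketch with random weights for illustration (real SAEs are trained with a reconstruction loss plus an L1 sparsity penalty on the features):

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One SAE forward pass over hidden states x of shape (batch, d_model).

    The ReLU encoder produces sparse activations in an overcomplete
    feature space (d_feat >> d_model); the linear decoder reconstructs x.
    """
    f = np.maximum(0.0, x @ W_enc + b_enc)   # sparse feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction of the input
    return f, x_hat

rng = np.random.default_rng(0)
d_model, d_feat = 8, 64                      # feature space is overcomplete
W_enc = rng.normal(size=(d_model, d_feat)) * 0.1
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1
x = rng.normal(size=(4, d_model))
# A negative encoder bias plus ReLU drives many activations to exactly zero.
f, x_hat = sae_forward(x, W_enc, np.full(d_feat, -0.2), W_dec, np.zeros(d_model))
```

Interpretation then works feature by feature: collect the inputs that maximally activate a given feature and ask a larger LMM to describe what they have in common.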
LMMs-Engine ⭐ 735
A simple, unified multimodal model training engine. Supports FSDP2, USP, Muon optimizer, Liger kernel, packing, and expert parallelism across models like Qwen2.5-VL, Qwen3-VL, BAGEL, WanVideo, and more. Lean, flexible, and built for hacking at scale.
EgoLife ⭐ 399 CVPR 2025
For one week, six individuals lived together, capturing every moment through AI glasses, creating the EgoLife dataset. Includes EgoGPT (omni-modal clip-level understanding) and EgoRAG (long-context QA with hierarchical memory). Built to drive the future of egocentric AI life assistants.
📄 Paper | 🌐 Project Page | 🤗 Data