LMMs-Lab (@EvolvingLMMs-Lab)

Feeling and building multimodal intelligence.


We are a group of researchers focused on large multimodal models (LMMs). Through our research, we aim to share insights with the community.



🏗️ Models & Training

A fully open-source family of Large Multimodal Models achieving state-of-the-art performance at substantially lower cost. Trained on native-resolution images with an end-to-end Megatron-LM-based framework supporting MoE, FP8, and long-sequence parallelization, all for under $16,000 on A100 GPUs. Outperforms Qwen2.5-VL on most benchmarks. Includes open pre-training and SFT data, training code, recipes, and full logs.

🤗 Models & Datasets | 🖥️ Demo | 📄 Tech Report

NEO ⭐ 653 ICLR 2026

NEO Series: Native Vision-Language Models built from first principles. Rethinks multimodal design by deeply integrating vision and language within a dense, monolithic architecture, rather than bolting a vision encoder onto a language model. With only 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming other native ones.

📄 Paper | 🤗 Models

OneVision-Encoder ⭐ 269 CVPR 2025

A vision encoder designed around codec-aligned sparsity as a foundational principle for multimodal intelligence. Rather than computing uniformly over all regions, it selectively encodes only the 3.1%-25% of regions rich in signal entropy, consistently outperforming Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks while using substantially fewer visual tokens.

🌐 Project Page | 📄 Paper | 🤗 Models

Otter ⭐ 3.3k IEEE TPAMI 2025

A multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo), trained on the MIMIC-IT dataset with 2.8M multimodal in-context instruction-response pairs. Demonstrates improved instruction-following and in-context learning capabilities across vision-language tasks and served as an early exploration into instruction-tuned multimodal models.

📄 Otter Paper | 📄 MIMIC-IT Paper | 🤗 Models | 🤗 MIMIC-IT Dataset

LongVA ⭐ 402 TMLR 2025

Transfers long-context capability from language to vision. LongVA can process 2,000 frames (over 200K visual tokens), achieving state-of-the-art performance on Video-MME among 7B models and demonstrating that long-context capability can transfer zero-shot from language to vision.

🌐 Blog | 📄 Paper | 🤗 Models | 🎥 Demo

The Relate Anything Model (RAM) takes an image as input, leverages SAM to generate segmentation masks, and then reasons about the relationships between any detected objects. Built on the Panoptic Scene Graph Generation work (ECCV 2022).

🤗 Demo | 📦 PSG Dataset


🧠 Reasoning & Reinforcement Learning

A speed-run investigation of R1's paradigm applied to multimodal models. Built on top of open-r1 and trl, this project adds multimodal model training with the GRPO algorithm, open-sourcing 8K multimodal RL training examples, trained models, and training scripts for community study of multimodal reasoning.

🤗 Models | 🤗 Datasets | 📊 Wandb Logs
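The GRPO algorithm used above replaces a learned value network (critic) with group-relative reward normalization: sample several responses per prompt, score each, and normalize the scores within the group. A minimal sketch of that core idea, not this repo's implementation (function names and reward values are illustrative):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize a group of rewards sampled for
    the same prompt to zero mean and (roughly) unit variance, so no
    separate value network is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one prompt, scored 1.0 if correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers receive positive advantage, incorrect ones negative.
```

Each token of a sampled response is then reinforced in proportion to its group-relative advantage, which is what makes the approach cheap enough for community-scale RL experiments.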

OpenMMReasoner ⭐ 145 CVPR 2026

A fully transparent two-stage recipe (SFT + RL) for pushing the frontiers of multimodal reasoning. Constructs an 874K-sample cold-start dataset with step-by-step validation and a 74K-sample RL dataset, achieving an 11.6% improvement over Qwen2.5-VL-7B-Instruct across nine multimodal reasoning benchmarks.

📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🌐 Blog

MMSearch-R1 ⭐ 402

An end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools. Integrates both image and text search capabilities, training models to autonomously reason about when and how to invoke external search tools.

📄 Paper | 🌐 Blog | 🤗 Model | 🤗 Data

LongVT ⭐ 195 CVPR 2026

Incentivizes "Thinking with Long Videos" via native tool calling. LongVT exploits LMMs' inherent temporal grounding ability as a native video cropping tool, enabling a global-to-local reasoning loop where the model skims globally and examines relevant clips for details until answers are grounded in visual evidence.

📄 Paper | 🌐 Project Page | 🤗 Models | 🤗 Data | 🖥️ Demo | 🌐 Blog
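The global-to-local loop described above can be simulated with a toy example. Everything here is an illustrative stand-in, not LongVT's actual API: a "video" is a list of frame captions, skimming is sparse sampling, and "reasoning" is keyword matching standing in for the model's tool calls.

```python
def skim(video, stride=4):
    """Global pass: sample every `stride`-th frame of the video."""
    return {i: video[i] for i in range(0, len(video), stride)}

def propose_span(skimmed, keyword, width=2):
    """Pick the first sampled frame matching the query and propose a
    dense time span around it (the 'crop this clip' tool call)."""
    for i, caption in skimmed.items():
        if keyword in caption:
            return (max(0, i - width), i + width + 1)
    return None  # nothing relevant found in the global pass

def global_to_local(video, keyword):
    """Skim globally, crop the proposed clip, then examine it densely."""
    span = propose_span(skim(video), keyword)
    if span is None:
        return None
    clip = video[span[0]:span[1]]  # tool call: dense crop of the video
    return next(c for c in clip if keyword in c)

# An event spanning frames 8-12 is caught by the sparse global pass,
# then localized exactly within the cropped clip.
video = ["sky"] * 8 + ["a red car passes"] * 5 + ["sky"] * 8
```

The real system grounds answers in visual evidence over multiple such rounds; this sketch shows only why skim-then-crop avoids densely processing every frame.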


📊 Evaluation & Analysis

LMMS-Eval ⭐ 3.8k

The unified evaluation toolkit for large multimodal models, covering 100+ tasks across text, image, video, and audio. Supports 30+ models with reproducible, efficient, and statistically grounded benchmarking. Available on PyPI and translated into 17 languages.

🏠 Homepage | 📚 Documentation | 📦 PyPI

Multimodal-SAE ⭐ 183 ICCV 2025

For the first time in the multimodal domain, demonstrates that features learned by Sparse Autoencoders (SAEs) in a smaller LMM can be interpreted by a larger LMM. Provides a complete auto-interpretation pipeline for analyzing open-semantic features and steering model behavior.

📄 Paper | 🤗 Models & Data


🔬 Training Frameworks

LMMs-Engine ⭐ 735

A simple, unified multimodal model training engine. Supports FSDP2, USP, Muon optimizer, Liger kernel, packing, and expert parallelism across models like Qwen2.5-VL, Qwen3-VL, BAGEL, WanVideo, and more. Lean, flexible, and built for hacking at scale.

🐳 Docker | 📦 PyPI


🌍 Datasets & Benchmarks

EgoLife ⭐ 399 CVPR 2025

For one week, six individuals lived together, capturing every moment with AI glasses to create the EgoLife dataset. Includes EgoGPT (omni-modal clip-level understanding) and EgoRAG (long-context QA with hierarchical memory). Built to drive the future of egocentric AI life assistants.

📄 Paper | 🌐 Project Page | 🤗 Data

Pinned

  1. lmms-eval

     One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

     Python · ⭐ 3.8k · 527 forks

  2. Otter

     🦦 Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

     Python · ⭐ 3.3k · 208 forks

  3. LongVA

     Long Context Transfer from Language to Vision

     Python · ⭐ 402 · 19 forks

  4. multimodal-sae

     [ICCV 2025] Auto-interpretation pipeline and other functionality for multimodal SAE analysis.

     Python · ⭐ 183 · 11 forks

  5. open-r1-multimodal

     A fork adding multimodal model training to open-r1

     Python · ⭐ 1.5k · 70 forks

  6. EgoLife

     [CVPR 2025] EgoLife: Towards Egocentric Life Assistant

     Python · ⭐ 401 · 19 forks
