DualVision: RGB–Infrared Multimodal Large Language Models
for Robust Visual Reasoning
CVPR Findings 2026
- Abrar Majeedi1
- Zhiyuan Ruan2
- Ziyi Zhao2
- Hongcheng Wang2
- Jianglin Lu3
- Yin Li1

1University of Wisconsin-Madison, 2Amazon, 3Northeastern University
When RGB visibility drops (e.g., at night), VLMs struggle to "see and reason," limiting their reliability in real-world applications such as autonomous driving. Infrared complements RGB by remaining robust under such degradations. DualVision efficiently fuses both modalities, enabling robust vision language modeling while reducing computation by ~75% compared to naïve fusion.
Abstract
Multimodal large language models (MLLMs) have achieved impressive performance on visual reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a long-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DualVision, a lightweight fusion module that efficiently incorporates IR–RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR–RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR–RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DualVision delivers strong empirical performance under a wide range of visual degradations.
Motivation
Although RGB imagery provides rich color and texture information, it remains vulnerable when inputs are degraded by adverse visual conditions such as low-light environments, motion- or defocus-induced blur, and non-ideal weather like rain or fog. These degradations are frequent realities in practical deployment, particularly in domains such as transportation, surveillance, and health, where robustness is paramount.
Infrared (IR) imaging offers a valuable complement: by capturing electromagnetic radiation beyond the visible spectrum, IR can remain effective in darkness, fog, and other challenging environments. Fusing RGB and IR signals provides a promising pathway towards more robust visual perception and reasoning by leveraging their complementary strengths.
Method
DualVision is a lightweight RGB-IR fusion module designed for MLLMs. Instead of interleaving tokens from IR and RGB, DualVision performs multi-scale localized cross-attention, allowing each RGB patch token to attend only to spatially corresponding IR regions. This design injects complementary IR cues where they are relevant, incurs low computational overhead, and remains compatible with many existing MLLMs.
Our key innovations include:
- 2D Local Cross-Attention: Each RGB token attends only to IR tokens within a radius-r region centered at its location, enforcing spatially aligned fusion while reducing the compute cost of global cross-modal interaction.
- Multi-Scale Design: Multiple 2D local cross-attention blocks are applied sequentially with progressively increased radii, capturing interactions across local regions with varying sizes while preserving locality.
- Computational Efficiency: Concatenating RGB and IR tokens yields 2N tokens, so self-attention cost scales as (2N)² = 4N². DualVision fuses both modalities into N tokens, keeping the cost at N²: a 4× reduction (~75% compute savings).
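To make the fusion step above concrete, the snippet below sketches radius-r 2D local cross-attention over aligned RGB and IR feature grids, plus the sequential multi-scale wrapper. All names, shapes, and the residual form are illustrative assumptions, not the released implementation; in particular, the learned query/key/value projections of a full attention layer are omitted for brevity.

```python
import numpy as np

def local_cross_attention(rgb, ir, radius):
    """Radius-r 2D local cross-attention (illustrative sketch).

    rgb, ir: (H, W, D) spatially aligned feature grids (assumed shapes).
    Each RGB token attends only to IR tokens in its (2r+1)x(2r+1) window.
    """
    H, W, D = rgb.shape
    k = 2 * radius + 1
    # Zero-pad the IR grid so border tokens also get a full window
    padded = np.pad(ir, ((radius, radius), (radius, radius), (0, 0)))
    # (H, W, D, k, k): the local IR window centered at each grid cell
    win = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(0, 1))
    win = win.reshape(H, W, D, k * k).transpose(0, 1, 3, 2)   # (H, W, k*k, D)
    # Scaled dot-product scores of each RGB token against its IR window
    scores = np.einsum('hwd,hwnd->hwn', rgb, win) / np.sqrt(D)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)                  # softmax over window
    fused = np.einsum('hwn,hwnd->hwd', attn, win)
    return rgb + fused                                        # residual IR injection

def dualvision_fuse(rgb, ir, radii=(1, 2, 4)):
    """Multi-scale variant: stack local blocks with progressively larger radii."""
    out = rgb
    for r in radii:
        out = local_cross_attention(out, ir, r)
    return out
```

Because each of the N RGB tokens attends to only (2r+1)² IR tokens rather than all N, the cross-modal interaction stays linear in N for fixed r, and the LLM still receives only N fused tokens.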
Datasets
We introduce two new datasets to support training and evaluation of IR-RGB MLLMs:
- DV-204K: A dataset of ~25K aligned IR–RGB image pairs with ~204K modality-aware question-answer annotations designed for instruction tuning.
- DV-500: A carefully curated evaluation benchmark featuring 500 IR-RGB image pairs with 500 associated QA pairs, designed for evaluating cross-modal reasoning under various degradation conditions.
Experiments and Results
Performance by Modalities (IR, RGB, RGB+IR)
IR-only performance stays flat across degradations, RGB-only performance drops sharply as degradation severity increases, and RGB+IR provides the most robust results.
Comparison of Fusion Methods
We compare several fusion strategies: addition, adaptive addition, concatenation, and our DualVision. DualVision delivers the best performance, winning in 11 of 13 evaluation settings across both clean and degraded RGB inputs.
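For reference, the baseline strategies compared above can be sketched as follows. These are generic formulations we assume for illustration, not the exact baselines from the paper; note how concatenation doubles the token count the LLM must attend over, whereas the additive variants (and DualVision) keep N tokens.

```python
import numpy as np

# rgb, ir: (N, D) token sequences from the two encoders (assumed shapes)

def fuse_add(rgb, ir):
    """Element-wise addition: merges modalities with no extra parameters."""
    return rgb + ir

def fuse_adaptive_add(rgb, ir, alpha):
    """Adaptive addition with a learned gate alpha in [0, 1] (assumed form)."""
    return alpha * rgb + (1.0 - alpha) * ir

def fuse_concat(rgb, ir):
    """Concatenation: the LLM now self-attends over 2N tokens,
    so attention cost grows by (2N)^2 / N^2 = 4x."""
    return np.concatenate([rgb, ir], axis=0)

N, D = 576, 1024  # e.g. a 24x24 patch grid (illustrative values)
rgb, ir = np.zeros((N, D)), np.zeros((N, D))
print(fuse_add(rgb, ir).shape)     # (576, 1024): token count unchanged
print(fuse_concat(rgb, ir).shape)  # (1152, 1024): token count doubled
```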
Comparison with Baselines
We benchmark against both open- and closed-source MLLMs including LLaVA 1.5 (7B), Qwen2-VL (7B), LLaVA-Next Interleave (7B), LLaMA-4 Scout, Claude Sonnet 3.5 v2, and Claude Opus 4. DualVision consistently outperforms all baselines across both clean and degraded settings.
| Method | Original | Blur (Low) | Blur (Mod.) | Blur (High) | Blur (Highest) | Darkness (Low) | Darkness (Mod.) | Darkness (High) | Darkness (Highest) | Fog (Low) | Fog (Mod.) | Fog (High) | Fog (Highest) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Without finetuning* | | | | | | | | | | | | | |
| LLaVA 1.5 (7B) | 81.2 | 79.0 | 76.0 | 73.2 | 72.6 | 74.0 | 72.8 | 73.2 | 71.2 | 81.0 | 79.2 | 78.4 | 71.4 |
| Qwen2-VL (7B) | 89.8 | 77.8 | 73.6 | 70.8 | 69.4 | 89.6 | 85.6 | 82.6 | 78.4 | 85.0 | 79.4 | 75.2 | 65.6 |
| LLaVA-Next Interleave (7B) | 88.6 | 81.4 | 78.4 | 75.2 | 73.6 | 86.4 | 86.0 | 85.6 | 81.4 | 85.0 | 83.8 | 79.8 | 73.4 |
| Claude Sonnet 3.5 v2 | 87.4 | 77.8 | 74.8 | 70.8 | 72.0 | 85.4 | 78.8 | 75.6 | 68.0 | 80.0 | 70.2 | 68.4 | 64.4 |
| *Finetuned* | | | | | | | | | | | | | |
| LLaVA 1.5 (7B) + FT | 87.15 | 83.50 | 82.13 | 81.93 | 80.32 | 86.35 | 85.94 | 85.54 | 84.94 | 85.94 | 85.94 | 84.74 | 78.71 |
| DualVision (Ours) | 88.38 | 84.77 | 84.37 | 82.36 | 82.57 | 88.38 | 87.58 | 87.58 | 86.57 | 87.98 | 87.17 | 84.97 | 80.76 |
Qualitative Results
Both methods, finetuned on DV-204K, answer correctly given clean RGB-IR inputs (left). When the RGB input is degraded (right), DualVision remains robust, while finetuned LLaVA-1.5 becomes unreliable.
Contributions
- Our work is among the first to develop MLLMs that integrate RGB and IR modalities for robust visual reasoning under visual degradations (e.g., blur, low-light, and fog).
- We introduce DualVision, a lightweight IR-RGB fusion module with enhanced robustness to degradations, while remaining compatible with existing MLLMs.
- To support training and evaluation of IR-RGB MLLMs, we create and release two datasets: DV-204K for instruction tuning and DV-500 for evaluation.
- Through extensive experiments, we demonstrate strong empirical results of DualVision under various degradations.
Citation
@inproceedings{majeedi2026dualvision,
title={DualVision: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning},
author={Abrar Majeedi and Zhiyuan Ruan and Ziyi Zhao and Hongcheng Wang and Jianglin Lu and Yin Li},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
year={2026}
}
The website template was borrowed from Michaël Gharbi.