DualVision: RGB–Infrared Multimodal Large Language Models
for Robust Visual Reasoning
CVPR Findings 2026
- Abrar Majeedi1
- Zhiyuan Ruan2
- Ziyi Zhao2
- Hongcheng Wang2
- Jianglin Lu3
- Yin Li1

1University of Wisconsin-Madison, 2Amazon, 3Northeastern University
When RGB visibility drops (e.g., at night), VLMs struggle to "see and reason," limiting their reliability in real-world applications such as autonomous driving. Infrared complements RGB by remaining robust under such degradations. DualVision efficiently fuses both modalities, enabling robust vision language modeling while reducing computation by ~75% compared to naïve fusion.
Abstract
Multimodal large language models (MLLMs) have achieved impressive performance on visual reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a long-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DualVision, a lightweight fusion module that efficiently incorporates IR–RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR–RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR–RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DualVision delivers strong empirical performance under a wide range of visual degradations.
Motivation
Although RGB imagery provides rich color and texture information, it remains vulnerable when inputs are degraded by adverse visual conditions such as low-light environments, motion- or defocus-induced blur, and non-ideal weather like rain or fog. These degradations are frequent realities in practical deployment, particularly in domains such as transportation, surveillance, and health, where robustness is paramount.
Infrared (IR) imaging offers a valuable complement: by capturing electromagnetic radiation beyond the visible spectrum, IR can remain effective in darkness, fog, and other challenging environments. Fusing RGB and IR signals provides a promising pathway towards more robust visual perception and reasoning by leveraging their complementary strengths.
Method
DualVision is a lightweight RGB-IR fusion module designed for MLLMs. Instead of interleaving tokens from IR and RGB, DualVision performs multi-scale localized cross-attention, allowing each RGB patch token to attend only to spatially corresponding IR regions. This design injects complementary IR cues where they are relevant, incurs low computational overhead, and remains compatible with many existing MLLMs.
Our key innovations include:
- 2D Local Cross-Attention: Each RGB token attends only to IR tokens within a radius-r region centered at its location, enforcing spatially aligned fusion while reducing the compute cost of global cross-modal interaction.
- Multi-Scale Design: Multiple 2D local cross-attention blocks are applied sequentially with progressively increased radii, capturing interactions across local regions with varying sizes while preserving locality.
- Computational Efficiency: Concatenating RGB and IR tokens yields 2N tokens, so self-attention cost scales as (2N)² = 4N². DualVision fuses both modalities into N tokens, keeping the cost at N²: a 4× reduction (~75% compute savings).
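To make the fusion step above concrete, the snippet below sketches radius-r 2D local cross-attention over aligned RGB and IR feature grids, plus the sequential multi-scale wrapper. All names, shapes, and the residual form are illustrative assumptions, not the released implementation; in particular, the learned query/key/value projections of a full attention layer are omitted for brevity.

```python
import numpy as np

def local_cross_attention(rgb, ir, radius):
    """Radius-r 2D local cross-attention (illustrative sketch).

    rgb, ir: (H, W, D) spatially aligned feature grids (assumed shapes).
    Each RGB token attends only to IR tokens in its (2r+1)x(2r+1) window.
    """
    H, W, D = rgb.shape
    k = 2 * radius + 1
    # Zero-pad the IR grid so border tokens also get a full window
    padded = np.pad(ir, ((radius, radius), (radius, radius), (0, 0)))
    # (H, W, D, k, k): the local IR window centered at each grid cell
    win = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(0, 1))
    win = win.reshape(H, W, D, k * k).transpose(0, 1, 3, 2)   # (H, W, k*k, D)
    # Scaled dot-product scores of each RGB token against its IR window
    scores = np.einsum('hwd,hwnd->hwn', rgb, win) / np.sqrt(D)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)                  # softmax over window
    fused = np.einsum('hwn,hwnd->hwd', attn, win)
    return rgb + fused                                        # residual IR injection

def dualvision_fuse(rgb, ir, radii=(1, 2, 4)):
    """Multi-scale variant: stack local blocks with progressively larger radii."""
    out = rgb
    for r in radii:
        out = local_cross_attention(out, ir, r)
    return out
```

Because each of the N RGB tokens attends to only (2r+1)² IR tokens rather than all N, the cross-modal interaction stays linear in N for fixed r, and the LLM still receives only N fused tokens.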
Datasets
We introduce two new datasets to support training and evaluation of IR-RGB MLLMs:
- DV-204K: A dataset of ~25K aligned IR–RGB image pairs with ~204K modality-aware question-answer annotations designed for instruction tuning.
- DV-500: A carefully curated evaluation benchmark featuring 500 IR-RGB image pairs with 500 associated QA pairs, designed for evaluating cross-modal reasoning under various degradation conditions.
Experiments and Results
Performance by Modalities (IR, RGB, RGB+IR)
IR-only performance stays flat across degradations, RGB-only performance drops sharply as degradation severity increases, and RGB+IR provides the most robust results.
Comparison of Fusion Methods
We compare several fusion strategies: addition, adaptive addition, concatenation, and our DualVision. DualVision delivers the best performance, winning in 11 of 13 evaluation settings across both clean and degraded RGB inputs.
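For reference, the baseline strategies compared above can be sketched as follows. These are generic formulations we assume for illustration, not the exact baselines from the paper; note how concatenation doubles the token count the LLM must attend over, whereas the additive variants (and DualVision) keep N tokens.

```python
import numpy as np

# rgb, ir: (N, D) token sequences from the two encoders (assumed shapes)

def fuse_add(rgb, ir):
    """Element-wise addition: merges modalities with no extra parameters."""
    return rgb + ir

def fuse_adaptive_add(rgb, ir, alpha):
    """Adaptive addition with a learned gate alpha in [0, 1] (assumed form)."""
    return alpha * rgb + (1.0 - alpha) * ir

def fuse_concat(rgb, ir):
    """Concatenation: the LLM now self-attends over 2N tokens,
    so attention cost grows by (2N)^2 / N^2 = 4x."""
    return np.concatenate([rgb, ir], axis=0)

N, D = 576, 1024  # e.g. a 24x24 patch grid (illustrative values)
rgb, ir = np.zeros((N, D)), np.zeros((N, D))
print(fuse_add(rgb, ir).shape)     # (576, 1024): token count unchanged
print(fuse_concat(rgb, ir).shape)  # (1152, 1024): token count doubled
```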
Comparison with Baselines
We benchmark against both open- and closed-source MLLMs including LLaVA 1.5 (7B), Qwen2-VL (7B), LLaVA-Next Interleave (7B), LLaMA-4 Scout, Claude Sonnet 3.5 v2, and Claude Opus 4. DualVision consistently outperforms all baselines across both clean and degraded settings.
| Method | Original | Blur (Low) | Blur (Mod.) | Blur (High) | Blur (Highest) | Darkness (Low) | Darkness (Mod.) | Darkness (High) | Darkness (Highest) | Fog (Low) | Fog (Mod.) | Fog (High) | Fog (Highest) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Without finetuning* | | | | | | | | | | | | | |
| LLaVA 1.5 (7B) | 81.2 | 79.0 | 76.0 | 73.2 | 72.6 | 74.0 | 72.8 | 73.2 | 71.2 | 81.0 | 79.2 | 78.4 | 71.4 |
| Qwen2-VL (7B) | 89.8 | 77.8 | 73.6 | 70.8 | 69.4 | 89.6 | 85.6 | 82.6 | 78.4 | 85.0 | 79.4 | 75.2 | 65.6 |
| LLaVA-Next Interleave (7B) | 88.6 | 81.4 | 78.4 | 75.2 | 73.6 | 86.4 | 86.0 | 85.6 | 81.4 | 85.0 | 83.8 | 79.8 | 73.4 |
| Claude Sonnet 3.5 v2 | 87.4 | 77.8 | 74.8 | 70.8 | 72.0 | 85.4 | 78.8 | 75.6 | 68.0 | 80.0 | 70.2 | 68.4 | 64.4 |
| *Finetuned* | | | | | | | | | | | | | |
| LLaVA 1.5 (7B) + FT | 87.15 | 83.50 | 82.13 | 81.93 | 80.32 | 86.35 | 85.94 | 85.54 | 84.94 | 85.94 | 85.94 | 84.74 | 78.71 |
| DualVision (Ours) | 88.38 | 84.77 | 84.37 | 82.36 | 82.57 | 88.38 | 87.58 | 87.58 | 86.57 | 87.98 | 87.17 | 84.97 | 80.76 |
Qualitative Results
Both methods, finetuned on DV-204K, answer correctly given clean RGB-IR inputs (left). When the RGB input is degraded (right), DualVision remains robust, while finetuned LLaVA-1.5 becomes unreliable.
Contributions
- Our work is among the first to develop MLLMs that integrate RGB and IR modalities for robust visual reasoning under visual degradations (e.g., blur, low-light, and fog).
- We introduce DualVision, a lightweight IR-RGB fusion module with enhanced robustness to degradations, while remaining compatible with existing MLLMs.
- To support training and evaluation of IR-RGB MLLMs, we create and release two datasets: DV-204K for instruction tuning and DV-500 for evaluation.
- Through extensive experiments, we demonstrate strong empirical results of DualVision under various degradations.
Citation
@inproceedings{majeedi2026dualvision,
title={DualVision: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning},
author={Abrar Majeedi and Zhiyuan Ruan and Ziyi Zhao and Hongcheng Wang and Jianglin Lu and Yin Li},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
year={2026}
}
The website template was borrowed from Michaël Gharbi.