CARES

Context-Aware Resolution Selector for Vision Language Models

Moshe Kimhi1,2,
Nimrod Shabtay2,3,
Raja Giryes3,
Chaim Baskin4,
Eli Schwartz2
1Technion
2IBM Research
3Tel-Aviv University
4Ben-Gurion University of the Negev

Overview

CARES introduces a context-aware resolution selection mechanism for Vision Language Models. Rather than processing every image at a fixed resolution, CARES adapts the input resolution to the visual demands of each task, from text recognition and document analysis to chart understanding. By matching the resolution decision to how much detail each input actually requires, CARES achieves superior accuracy-efficiency trade-offs.
9 Benchmark Tasks
63% Avg FLOPs Savings
2 Core Approaches
4 VLM Models

What is CARES?

Different visual understanding tasks require different levels of image detail. A document analysis task might need high resolution to recognize small text, while a general image classification task might work well at lower resolutions. CARES learns this mapping, dynamically selecting the optimal resolution for each input without sacrificing accuracy.

The method achieves an average of 63% FLOPs savings across diverse benchmarks (Ai2D, ChartQA, DocVQA, OCRBench, SeedBench-2, MMMU, RealWorldQA, InfoVQA, MathVista) while maintaining or improving task performance.

Technical Approach

CARES provides two complementary strategies for resolution selection:

Strategy 1: Gate-based Classifiers

Separate binary classifiers predict whether the current resolution is sufficient for accurate VLM processing; if not, a higher resolution is selected. Several variants leverage different feature extractors:

  • SigLIP Gate: Vision-language CLIP-style model
  • Multimodal Gate: Multi-resolution vision features
  • VLM Gate: Frozen intermediate VLM features
  • VLM-2 Gate: Advanced VLM feature extraction
🎯
Binary Resolution Classifier
Predicts: Sufficient? → Yes/No
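The escalation logic behind the gate can be sketched as follows. This is an illustrative toy, not the released CARES code: the resolution ladder and the `toy_gate` rule are made-up placeholders standing in for a trained gate classifier.

```python
# Illustrative sketch of the gate-based cascade: start at the lowest
# resolution and escalate until a binary gate reports "sufficient".
RESOLUTIONS = [448, 672, 896, 1344]  # hypothetical ladder; real values depend on the VLM

def select_resolution(image_id, gate):
    """Walk up the resolution ladder until the gate accepts the
    current resolution (falling back to the highest one)."""
    for res in RESOLUTIONS:
        if gate(image_id, res):
            return res
    return RESOLUTIONS[-1]

# Toy gate: pretend dense-text inputs need higher resolution.
def toy_gate(image_id, res):
    needs = {"chart.png": 672, "doc_scan.png": 896, "photo.jpg": 448}
    return res >= needs.get(image_id, 448)

print(select_resolution("photo.jpg", toy_gate))     # 448
print(select_resolution("doc_scan.png", toy_gate))  # 896
```

A natural photo exits at the cheapest resolution, while a document scan is escalated twice; this early-exit behavior is what produces the FLOPs savings reported below.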

Strategy 2: Autoregressive Prediction

Direct prediction of optimal resolution without intermediate classification. This approach uses frozen features from specialized models:

  • SmolVLM Variant: Efficient frozen SmolVLM features
  • Granite-Docling LoRA: Document-specialized Granite model with low-rank adaptation

These models are particularly effective for specialized domains like document understanding and achieve comparable performance to gate classifiers with fewer parameters.

🚀
Autoregressive Predictor
Predicts: Optimal Resolution Value
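The autoregressive variant can be sketched as emitting the resolution value directly as a short token sequence, decoded greedily. This is a hedged illustration only: `toy_next_token` is a stub standing in for the frozen SmolVLM or Granite-Docling predictor, whose actual tokenization may differ.

```python
# Sketch of autoregressive resolution prediction: greedily decode the
# resolution digit by digit from a next-token function.
def decode_resolution(next_token, max_len=4):
    tokens = []
    for _ in range(max_len):
        tok = next_token(tokens)
        if tok == "<eos>":
            break
        tokens.append(tok)
    return int("".join(tokens))

# Toy next-token function that always "predicts" 896.
def toy_next_token(prefix):
    target = "896"
    return target[len(prefix)] if len(prefix) < len(target) else "<eos>"

print(decode_resolution(toy_next_token))  # 896
```

Unlike the gate cascade, a single decoding pass yields the target resolution, which is why this variant can match gate classifiers with fewer parameters.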

Visual Method Overview

The CARES methodology involves efficient resolution selection through lightweight classifiers or direct prediction models:

CARES Method Overview
Figure: CARES method overview showing gate-based and autoregressive resolution selection strategies.

Evaluation Benchmarks

CARES is evaluated on 9 diverse visual understanding tasks: Ai2D, ChartQA, DocVQA, OCRBench, SeedBench-2, MMMU, RealWorldQA, InfoVQA, and MathVista.

Benchmark Results

CARES achieves significant FLOPs reductions (average 63%) while maintaining or improving accuracy across 9 diverse benchmarks. Results shown with CARES applied to state-of-the-art models:

Each cell reports Score / ↓FLOPs (a dash marks the baseline, with no FLOPs reduction):

| Model | Ai2D | ChartQA | DocVQA | OCRBench | SeedBench-2 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Granite-Vision 3.3-2B | 0.736 / — | 0.862 / — | 0.904 / — | 0.796 / — | 0.717 / — | 0.803 / — |
| + CARES | 0.733 / -67% | 0.870 / -69% | 0.904 / -68% | 0.795 / -68% | 0.718 / -44% | 0.804 / -63% |
| InternVL3-8B | 0.836 / — | 0.858 / — | 0.923 / — | 0.851 / — | 0.785 / — | 0.851 / — |
| + CARES | 0.836 / -66% | 0.858 / -68% | 0.923 / -69% | 0.851 / -70% | 0.785 / -44% | 0.851 / -63% |
| Qwen2.5-VL-72B | 0.866 / — | 0.874 / — | 0.955 / — | 0.752 / — | 0.807 / — | 0.851 / — |
| + CARES | 0.870 / -85% | 0.836 / -77% | 0.948 / -84% | 0.755 / -64% | 0.785 / -77% | 0.852 / -80% |
| GPT-4o | 0.780 / — | 0.556 / — | 0.801 / — | 0.770 / — | 0.757 / — | 0.733 / — |
| + CARES | 0.781 / -60% | 0.557 / -60% | 0.797 / -36% | 0.746 / -33% | 0.754 / -47% | 0.727 / -47% |

Key Findings

CARES consistently reduces computational cost by selecting optimal resolutions without sacrificing performance. The method achieves an average of 63% FLOPs savings across all models and benchmarks while maintaining comparable or better accuracy.

Qwen2.5-VL-72B shows the most impressive efficiency gains, achieving up to 85% FLOPs reduction on Ai2D while actually improving accuracy (0.866 → 0.870). This demonstrates CARES' ability to identify inputs for which high resolution is redundant and to spend compute only where the task demands it.

Resources & Models

We provide pre-trained models, datasets, and code to enable reproducible research and deployment:

🤖

SmolVLM Resolution Gate

Lightweight resolution classification model built on frozen SmolVLM features. Ideal for efficient deployment with minimal latency overhead.

View on Hugging Face →
📄

Granite-Docling AR Model

Autoregressive resolution predictor using Granite with LoRA adaptation. Specialized for document understanding and dense visual content.

View on Hugging Face →
📊

Hardness Data Mix Dataset

Comprehensive training dataset combining multiple benchmarks with resolution difficulty annotations. Enables training custom resolution selectors.

View on Hugging Face →
💻

Implementation & Code

Complete codebase with training scripts, inference pipelines, and evaluation utilities. Supports multiple VLM architectures and custom datasets.

GitHub Repository →

Quick Start

To use CARES with your own VLM:

  1. Load a pre-trained gate: Download the SmolVLM or Granite resolution selector
  2. Prepare your images: Encode images at a low reference resolution
  3. Get resolution prediction: Feed through the gate to receive predicted optimal resolution
  4. Re-encode & process: Encode at predicted resolution and feed to your VLM
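The four steps above can be wired together as in the sketch below. Everything here is a placeholder: `gate`, `encode`, and `vlm` stand in for your own gate checkpoint, image preprocessor, and VLM, and are not part of a published CARES API; the low reference resolution of 448 is likewise an assumption.

```python
# Hypothetical end-to-end Quick Start pipeline; all callables are stubs.
LOW_REF_RES = 448  # assumed cheap reference resolution

def cares_pipeline(image_path, gate, encode_image, run_vlm, prompt):
    # Steps 1-2: encode the image once at the low reference resolution.
    ref_tokens = encode_image(image_path, resolution=LOW_REF_RES)
    # Step 3: ask the gate/predictor for the optimal resolution.
    target_res = gate(ref_tokens)
    # Step 4: re-encode at the predicted resolution and run the VLM.
    tokens = encode_image(image_path, resolution=target_res)
    return run_vlm(tokens, prompt)

# Toy stand-ins so the sketch runs end to end.
encode = lambda path, resolution: (path, resolution)
gate = lambda tokens: 896
vlm = lambda tokens, prompt: f"answer@{tokens[1]}"

print(cares_pipeline("doc.png", gate, encode, vlm, "What is the total?"))
# answer@896
```

Note that the image is encoded twice: once cheaply for the gate and once at the chosen resolution for the VLM, so the gate's overhead must stay small for the overall FLOPs savings to hold.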

Citation

If you find CARES useful in your research, please cite our work:

@misc{kimhi2025carescontextawareresolutionselector,
      title={CARES: Context-Aware Resolution Selector for VLMs},
      author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
      year={2025},
      eprint={2510.19496},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}