CARES

Context-Aware Resolution Selector for Vision Language Models

Moshe Kimhi1,2,*,
Nimrod Shabtay1,3,*,
Raja Giryes3,
Chaim Baskin4,†,
Eli Schwartz1,†
* Equal contribution  |  † Equal supervision

Overview

Figure: CARES methodology overview. CARES predicts the minimal sufficient resolution for a given image-query pair, with both a classifier-based variant and an autoregressive predictor.
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97–99% of total tokens, resulting in high compute and latency even when lower-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector that predicts the minimal sufficient input resolution for a given image-query pair before the target VLM is invoked. We study two variants: a lightweight classification model built on compact VLM features that assigns the pair to a resolution bucket, and a fine-tuned tiny autoregressive VLM that predicts the required resolution directly. While the first offers a simple and efficient feature-based selector, the second provides a flexible text-generation interface that is easy to serve with standard VLM stacks. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
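The overall pipeline can be sketched in a few lines: a small selector predicts a resolution for the image-query pair, the image is downscaled accordingly, and only then is the target VLM invoked. The names (`ImageStub`, `run_with_cares`, the `selector`/`target_vlm` interfaces) are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class ImageStub:
    """Minimal stand-in for an image; only the dimensions matter here."""
    width: int
    height: int

def run_with_cares(image, query, selector, target_vlm):
    """Predict a resolution with the selector, downscale, then call the VLM."""
    # 1. Predict the minimal sufficient resolution for this image-query pair.
    resolution = selector.predict(image, query)
    # 2. Downscale so the longer side matches the prediction; never upscale.
    scale = min(1.0, resolution / max(image.width, image.height))
    resized = ImageStub(round(image.width * scale), round(image.height * scale))
    # 3. Run the target VLM on the cheaper, lower-resolution input.
    return target_vlm.generate(resized, query)
```

The key design point is that the selector runs *before* the expensive prefill of the target VLM, so its own cost must stay small relative to the savings it unlocks.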

Technical Approaches

Approach 1

Discrete Resolution Classifier

This variant uses compact VLM features to classify each image-query pair into a discrete resolution bucket. It is designed as a lightweight selector that captures the visual difficulty of the task while remaining efficient and easy to train.
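A minimal sketch of this idea, assuming a linear head over frozen compact-VLM features and a hypothetical set of resolution buckets (the actual bucket values and head architecture are not specified here):

```python
import numpy as np

# Hypothetical resolution buckets the classifier chooses among.
BUCKETS = [224, 448, 672, 896]

class ResolutionClassifier:
    """Illustrative linear head over precomputed compact-VLM features."""

    def __init__(self, feat_dim: int, rng=None):
        rng = rng or np.random.default_rng(0)
        # In practice these weights would be trained; random init here.
        self.W = rng.normal(scale=0.02, size=(feat_dim, len(BUCKETS)))
        self.b = np.zeros(len(BUCKETS))

    def predict(self, features: np.ndarray) -> int:
        """Map joint image-query features to a discrete resolution bucket."""
        logits = features @ self.W + self.b
        return BUCKETS[int(np.argmax(logits))]
```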

Approach 2

Autoregressive (AR) Predictor

This variant fine-tunes a tiny VLM to predict the target resolution autoregressively. It outputs the resolution directly as text, making it especially convenient for deployment with standard serving frameworks such as vLLM or Ollama.
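Because the AR variant emits the resolution as free text, serving it only requires a small parsing step on the generated string. A sketch of such post-processing, with the bucket list and nearest-bucket snapping being assumptions for illustration:

```python
import re

def parse_resolution(generated_text: str,
                     buckets=(224, 448, 672, 896),
                     default=896) -> int:
    """Extract the resolution from the AR predictor's text output and
    snap it to the nearest supported bucket (hypothetical post-processing)."""
    match = re.search(r"\d+", generated_text)
    if match is None:
        # Fall back to full resolution if the output contains no number.
        return default
    value = int(match.group())
    return min(buckets, key=lambda b: abs(b - value))
```

Keeping the interface purely text-in/text-out is what makes this variant drop-in compatible with standard serving stacks.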

Example Conversations

User: In which cup size category is Pakistan listed?
[Image: cup-size chart]  (process resolution: 765)
Assistant: Pakistan is listed under Cup B.

User: The pink cup in the image represents which size?
[Image: cup-size chart]  (process resolution: 386)
Assistant: The pink cup represents Cup D.

Benchmark Results

Benchmark performance and estimated prefill-stage cost savings. Cost is measured in FLOPs for local models and in $ for API models.

Each cell shows Score (relative cost change); base-model rows report score only.

| Model | Ai2D | ChartQA | DocVQA | OCRBench | SeedBench-2 | MMMU | RealWorldQA | InfoVQA | MathVista | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Granite-Vision 2B | 0.74 | 0.86 | 0.90 | 0.80 | 0.72 | 0.29 | 0.17 | 0.35 | 0.48 | 0.59 |
| + CARES | 0.73 (−67%) | 0.87 (−69%) | 0.90 (−68%) | 0.80 (−68%) | 0.72 (−44%) | 0.29 (−85%) | 0.19 (−72%) | 0.40 (−72%) | 0.48 (−22%) | 0.60 (−63%) |
| + CARES-AR | 0.71 (−81%) | 0.84 (−81%) | 0.88 (−82%) | 0.77 (−75%) | 0.72 (−10%) | 0.30 (−84%) | 0.15 (−82%) | 0.39 (−81%) | 0.44 (−25%) | 0.58 (−67%) |
| InternVL3-8B | 0.84 | 0.86 | 0.92 | 0.85 | 0.79 | 0.56 | 0.68 | 0.72 | 0.69 | 0.77 |
| + CARES | 0.84 (−66%) | 0.86 (−68%) | 0.92 (−69%) | 0.85 (−70%) | 0.79 (−44%) | 0.56 (−86%) | 0.68 (−82%) | 0.74 (−72%) | 0.69 (−22%) | 0.77 (−64%) |
| + CARES-AR | 0.84 (−86%) | 0.86 (−81%) | 0.92 (−80%) | 0.85 (−78%) | 0.72 (−84%) | 0.55 (−85%) | 0.68 (−82%) | 0.74 (−81%) | 0.68 (−31%) | 0.76 (−76%) |
| Qwen2.5-VL-72B | 0.87 | 0.87 | 0.96 | 0.75 | 0.81 | 0.62 | 0.77 | 0.73 | 0.74 | 0.79 |
| + CARES | 0.87 (−85%) | 0.84 (−77%) | 0.95 (−84%) | 0.76 (−64%) | 0.79 (−77%) | 0.62 (−86%) | 0.79 (−82%) | 0.84 (−72%) | 0.74 (−7%) | 0.80 (−70%) |
| GPT-4o | 0.78 | 0.56 | 0.80 | 0.77 | 0.76 | 0.57 | 0.61 | 0.75 | 0.64 | 0.69 |
| + CARES | 0.78 (−60%) | 0.56 (−60%) | 0.80 (−36%) | 0.75 (−33%) | 0.75 (−47%) | 0.56 (−85%) | 0.61 (−84%) | 0.73 (−76%) | 0.61 (−17%) | 0.68 (−55%) |
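For intuition on where the prefill savings come from: in patch-based vision encoders the number of visual tokens scales roughly with pixel area, so halving the resolution cuts token count (and hence prefill FLOPs, assuming visual tokens dominate the prompt) by about 4×. A sketch under those assumptions, not the paper's exact cost model:

```python
def estimated_prefill_savings(full_res: int, selected_res: int) -> float:
    """Rough relative prefill-FLOPs saving, assuming token count scales
    with pixel area and visual tokens dominate the prompt."""
    if selected_res >= full_res:
        return 0.0  # never upscale, so no saving in this case
    return 1.0 - (selected_res / full_res) ** 2
```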

Citation

```bibtex
@misc{kimhi2025carescontextawareresolutionselector,
  title={CARES: Context-Aware Resolution Selector for VLMs},
  author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
  year={2025},
  eprint={2510.19496},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```