# CARES: Context-Aware Resolution Selector for Vision Language Models
**CARES** uses compact VLM features to classify each image–query pair into a discrete resolution bucket. It is a lightweight selector that captures the visual difficulty of the task while remaining efficient and easy to train.
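The classification idea can be sketched as a linear head over a compact feature that picks one of a few discrete resolutions. The bucket values, feature size, and weights below are illustrative assumptions, not the trained CARES selector:

```python
import random

# Hypothetical resolution buckets (longest image side, in pixels).
BUCKETS = [224, 448, 672, 896, 1344]
FEAT_DIM = 64  # assumed size of the compact image+query feature

random.seed(0)
# One weight column per bucket (a stand-in for trained parameters).
W = [[random.gauss(0.0, 0.02) for _ in range(FEAT_DIM)] for _ in BUCKETS]

def select_resolution(features):
    """Return the resolution bucket with the highest linear score."""
    scores = [sum(w * f for w, f in zip(col, features)) for col in W]
    return BUCKETS[scores.index(max(scores))]

features = [random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]
print(select_resolution(features) in BUCKETS)  # True
```

At inference the chosen bucket determines how the image is downscaled before the expensive prefill of the main VLM, which is where the cost savings in the table below come from.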
**CARES-AR** fine-tunes a tiny VLM to predict the target resolution autoregressively. It emits the resolution directly as text, which makes it especially convenient to deploy with standard serving frameworks such as vLLM or Ollama.
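Because the selector's output is plain text, the serving layer only needs to parse it before resizing the image for the full model. A minimal sketch of that parsing step, assuming a `WIDTHxHEIGHT` output format (an illustrative choice, not the model's exact prompt contract):

```python
import re

def parse_resolution(generated):
    """Extract a (width, height) pair from the selector's text output.

    Returns None if no resolution-like pattern is found, so the caller
    can fall back to the model's default resolution.
    """
    m = re.search(r"(\d+)\s*x\s*(\d+)", generated)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(parse_resolution("resolution: 448x448"))  # (448, 448)
print(parse_resolution("unsure"))               # None
```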
Benchmark performance and estimated prefill-stage savings. Cost is measured in FLOPs for local models and in dollars for API models.
| Model | Ai2D Score | Ai2D Cost | ChartQA Score | ChartQA Cost | DocVQA Score | DocVQA Cost | OCRBench Score | OCRBench Cost | SeedBench-2 Score | SeedBench-2 Cost | MMMU Score | MMMU Cost | RealWorldQA Score | RealWorldQA Cost | InfoVQA Score | InfoVQA Cost | MathVista Score | MathVista Cost | Avg. Score | Avg. Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Granite-Vision 2B | 0.74 | - | 0.86 | - | 0.90 | - | 0.80 | - | 0.72 | - | 0.29 | - | 0.17 | - | 0.35 | - | 0.48 | - | 0.59 | - |
| + CARES | 0.73 | -67% | 0.87 | -69% | 0.90 | -68% | 0.80 | -68% | 0.72 | -44% | 0.29 | -85% | 0.19 | -72% | 0.40 | -72% | 0.48 | -22% | 0.60 | -63% |
| + CARES-AR | 0.71 | -81% | 0.84 | -81% | 0.88 | -82% | 0.77 | -75% | 0.72 | -10% | 0.30 | -84% | 0.15 | -82% | 0.39 | -81% | 0.44 | -25% | 0.58 | -67% |
| InternVL3-8B | 0.84 | - | 0.86 | - | 0.92 | - | 0.85 | - | 0.79 | - | 0.56 | - | 0.68 | - | 0.72 | - | 0.69 | - | 0.77 | - |
| + CARES | 0.84 | -66% | 0.86 | -68% | 0.92 | -69% | 0.85 | -70% | 0.79 | -44% | 0.56 | -86% | 0.68 | -82% | 0.74 | -72% | 0.69 | -22% | 0.77 | -64% |
| + CARES-AR | 0.84 | -86% | 0.86 | -81% | 0.92 | -80% | 0.85 | -78% | 0.72 | -84% | 0.55 | -85% | 0.68 | -82% | 0.74 | -81% | 0.68 | -31% | 0.76 | -76% |
| Qwen2.5-VL-72B | 0.87 | - | 0.87 | - | 0.96 | - | 0.75 | - | 0.81 | - | 0.62 | - | 0.77 | - | 0.73 | - | 0.74 | - | 0.79 | - |
| + CARES | 0.87 | -85% | 0.84 | -77% | 0.95 | -84% | 0.76 | -64% | 0.79 | -77% | 0.62 | -86% | 0.79 | -82% | 0.84 | -72% | 0.74 | -7% | 0.80 | -70% |
| GPT-4o | 0.78 | - | 0.56 | - | 0.80 | - | 0.77 | - | 0.76 | - | 0.57 | - | 0.61 | - | 0.75 | - | 0.64 | - | 0.69 | - |
| + CARES | 0.78 | -60% | 0.56 | -60% | 0.80 | -36% | 0.75 | -33% | 0.75 | -47% | 0.56 | -85% | 0.61 | -84% | 0.73 | -76% | 0.61 | -17% | 0.68 | -55% |
If you find this work useful, please cite:

```bibtex
@misc{kimhi2025carescontextawareresolutionselector,
  title={CARES: Context-Aware Resolution Selector for VLMs},
  author={Moshe Kimhi and Nimrod Shabtay and Raja Giryes and Chaim Baskin and Eli Schwartz},
  year={2025},
  eprint={2510.19496},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```