An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Representation geometry shapes task performance in vision-language modeling for CT enterography
Mean pooling and multi-window RGB encoding optimize vision-language performance on CT enterography, with retrieval-augmented generation substantially improving automated report severity accuracy over fine-tuning alone.