GeoArena: Evaluating Open-World Geographic Reasoning in Large Vision-Language Models

Pengyue Jia; Sharon Li; Xiangyu Zhao; Yingyi Zhang

arxiv: 2509.04334 · v5 · submitted 2025-09-04 · 💻 cs.CV

Pith reviewed 2026-05-18 18:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords GeoArena · geographic reasoning · large vision-language models · human preference evaluation · open-world inference · pairwise comparison · in-the-wild images · reasoning alignment

The pith

GeoArena tests large vision-language models on geographic reasoning by having humans compare their explanations on real-world photos instead of checking final location labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoArena as a way to judge how well AI vision-language models infer locations from everyday images by comparing the quality of their reasoning chains. It shifts the focus from whether a model names the right place to whether its explanation combines visual evidence with spatial knowledge in a plausible way. Human judges perform pairwise comparisons on thousands of in-the-wild photos, producing rankings across seventeen frontier models. This setup addresses the gap left by static label-matching tests that treat reasoning as a black box. The resulting public platform allows ongoing evaluation that aligns more closely with how humans assess geographic understanding.

Core claim

GeoArena reframes open-world geographic reasoning evaluation as a pairwise reasoning alignment task where human judges rate model explanations on in-the-wild images according to reasoning quality, evidence synthesis, and plausibility, thereby providing a dynamic benchmark that complements outcome-centric methods and reveals model behavior through thousands of judgments.

What carries the argument

GeoArena, the human-preference-based pairwise comparison framework that converts evaluation into direct head-to-head judgments of model-generated geographic explanations.
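
One common way to turn head-to-head judgments into a leaderboard is a Bradley-Terry fit. The sketch below is illustrative only: this page does not state which aggregation method GeoArena uses, and the function name `bradley_terry`, the `(winner, loser)` input format, and the model names are all assumptions. Ties are assumed to have been dropped beforehand, and every model is assumed to have at least one win.

```python
from collections import defaultdict


def bradley_terry(battles, iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths from (winner, loser) pairs using the MM update.

    Hypothetical helper for illustration; not taken from the paper.
    """
    wins = defaultdict(float)          # total wins per model
    pair_counts = defaultdict(float)   # battles per unordered pair of models
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            new[m] = wins[m] / denom if denom > 0 else strength[m]
        # Rescale so strengths sum to the number of models (fixes the free scale).
        total = sum(new.values())
        new = {m: v * len(models) / total for m, v in new.items()}
        delta = max(abs(new[m] - strength[m]) for m in models)
        strength = new
        if delta < tol:
            break
    return strength


# Made-up battle log with hypothetical model names.
battles = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")] * 1
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")] * 1
    + [("model_a", "model_c")] * 2 + [("model_c", "model_a")] * 1
)
strengths = bradley_terry(battles)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Only the ratios between fitted strengths are meaningful, which is why the sketch rescales them after each pass.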

If this is right

LVLMs can be ranked and improved according to the quality of their step-by-step geographic reasoning rather than location-name accuracy alone.
The dynamic platform supports repeated testing as new models appear without requiring new static datasets.
Detailed breakdowns of judgment factors can guide development of models that better synthesize visual cues with world knowledge.
Existing geographic benchmarks become more informative when paired with this reasoning-focused layer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pairwise format could transfer to other visual reasoning domains such as historical or ecological inference from images.
Automated approximations of human preferences might eventually scale the benchmark while preserving the core alignment signal.
Models that perform well here are likely to generalize better to real navigation or mapping tasks where explanations matter.
The work underscores the value of treating geographic reasoning as an open-ended synthesis process rather than a classification problem.

Load-bearing premise

Human judges can consistently rate model explanations on reasoning quality and plausibility without introducing systematic biases that distort the overall rankings.

What would settle it

A controlled study in which independent panels of judges produce statistically inconsistent preference rankings for the same set of model explanations on identical images would undermine the framework's reliability.
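
A minimal sketch of how such a panel-consistency check might be scored, assuming each panel's judgments have already been reduced to a full ranking of the same models. Kendall's tau is one reasonable choice of statistic here, not something the paper is known to use; `kendall_tau` and the model names are hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau between two rankings of the same items (best first, no ties)."""
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    if len(pos_a) != len(rank_a) or len(pos_b) != len(rank_b):
        raise ValueError("rankings must not contain duplicates")
    if set(pos_a) != set(pos_b):
        raise ValueError("rankings must cover the same items")
    if len(rank_a) < 2:
        raise ValueError("need at least two items to compare")

    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Same sign in both panels means the pair is ordered the same way.
        same_order = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Two hypothetical panels that disagree only on the middle pair.
panel_1 = ["model_a", "model_b", "model_c", "model_d"]
panel_2 = ["model_a", "model_c", "model_b", "model_d"]
tau = kendall_tau(panel_1, panel_2)
print(round(tau, 3))
```

A value near 1 means the panels order the models almost identically. Values near 0 or below would be the kind of inconsistency described above.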

Figures

Figures reproduced from arXiv: 2509.04334 by Pengyue Jia, Sharon Li, Xiangyu Zhao, Yingyi Zhang.

**Figure 2.** Overview of GeoArena. From §3.1 (Live Interface): to facilitate user interaction, GeoArena is an online platform that allows any user to conveniently access the leaderboard and participate in data collection through a public link.

**Figure 3.** Pair-wise Performance Comparison of Models (Win-Rate and Battle Count).

**Figure 4.** Distribution of Style Features in Model Outputs.

**Figure 5.** Case Study: Images Where Strong Models Excel but Weaker Models Fail.

**Figure 6.** Additional Case Study: Identifying the Percé Rock.

**Figure 7.** Additional Case Study: Identifying the Olympic Park, Beijing.

**Figure 8.** Additional Case Study: Identifying the Golf Course in Fiji.
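
The two quantities named in the Figure 3 caption, pairwise win rate and battle count, could be tabulated from raw battle records roughly as follows. This is a sketch under assumptions: the record format `(model_a, model_b, outcome)`, the outcome labels, and the choice to count ties in the denominator without crediting either side are all hypothetical, not taken from the paper.

```python
from collections import defaultdict


def pairwise_stats(battles):
    """Return (win_rate, battle_count), both keyed by ordered pairs (row, column).

    Each battle is (model_a, model_b, outcome) with outcome in {"a", "b", "tie"}.
    """
    wins = defaultdict(int)
    counts = defaultdict(int)
    for a, b, outcome in battles:
        counts[(a, b)] += 1
        counts[(b, a)] += 1
        if outcome == "a":
            wins[(a, b)] += 1
        elif outcome == "b":
            wins[(b, a)] += 1
        # A tie adds to the battle count but credits neither model with a win.
    win_rate = {pair: wins[pair] / n for pair, n in counts.items()}
    return win_rate, dict(counts)


# Made-up records with hypothetical model names.
battles = [
    ("m1", "m2", "a"),
    ("m1", "m2", "b"),
    ("m2", "m1", "b"),
    ("m1", "m2", "tie"),
]
win_rate, battle_count = pairwise_stats(battles)
print(win_rate[("m1", "m2")], battle_count[("m1", "m2")])
```

Reporting the battle count next to each win rate matters because a rate computed from a handful of battles is much noisier than one computed from hundreds.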

Original abstract

Geographic reasoning is a fundamental cognitive capability that requires models to infer plausible locations by synthesizing visual evidence with spatial world knowledge. Despite recent advances in large vision-language models (LVLMs), existing evaluation paradigms remain largely outcome-centric, relying on static datasets and predefined labels that are conceptually misaligned with open-world geographic inference. Such outcome-centric evaluations often focus exclusively on label matching, leaving the underlying linguistic reasoning chains as unexamined black boxes. In this work, we introduce GeoArena, a dynamic, human-preference-based evaluation framework for benchmarking open-world geographic reasoning. GeoArena reframes evaluation as a pairwise reasoning alignment task on in-the-wild images, where human judges compare model-generated explanations based on reasoning quality, evidence synthesis, and plausibility. We deploy GeoArena as a public platform and benchmark 17 frontier LVLMs using thousands of human judgments, which complements existing benchmarks and supports the development of geographically grounded, human-aligned AI systems. We further provide detailed analyses of model behavior, including reliability of human preferences and factors influencing judgments of geographic reasoning quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GeoArena adds a human-preference layer to geographic reasoning benchmarks but its claims rest on unanchored judge consistency.

The main thing to know is that this paper moves evaluation away from simple location label matching and toward pairwise human comparisons of model explanations on real photos. That reframing is the core new piece, and they back it with a public platform plus data from 17 models and thousands of judgments. The shift makes sense for open-world settings where the right answer isn't a single label but a plausible chain of visual and spatial inference. They also include some breakdown of what seems to drive the preferences, which is more than most benchmark papers deliver. That part is worth having on the record. The public deployment angle is practical too, since it could let others extend the set or rerun comparisons later.

The soft spot is exactly where the stress test flags it. The whole setup assumes humans can reliably score reasoning quality, evidence synthesis, and plausibility without systematic drift toward fluent-sounding but shallow answers. The abstract mentions reliability analyses, yet nothing visible ties those judgments to verifiable location facts or independent geographer ratings. If the checks stay internal, the benchmark risks measuring annotator taste more than actual geographic fidelity. That assumption is load-bearing, and the current description leaves it thin.

This paper is for people who build or audit LVLMs on spatial tasks and for benchmark designers who want alternatives to static datasets. Anyone tracking how evaluation methods evolve for reasoning chains would find the framework and the initial model comparisons useful. It deserves a serious referee because the problem is real and the pairwise approach is distinct enough to discuss, even with the validation gaps. I'd send it to review and specifically ask for external calibration data on the human side before accepting the results as a stable signal.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces GeoArena, a dynamic human-preference-based evaluation framework for open-world geographic reasoning in large vision-language models. It reframes evaluation as a pairwise alignment task on in-the-wild images, where human judges compare model-generated explanations according to reasoning quality, evidence synthesis, and plausibility. The work deploys the framework as a public platform, benchmarks 17 frontier LVLMs with thousands of judgments, and provides analyses of model behavior along with reliability of the collected human preferences.

Significance. If the human judgments prove reliable and free of systematic bias, the framework offers a useful complement to static, outcome-centric benchmarks by directly examining reasoning chains rather than label matching alone. The public platform and scale of judgments support community adoption and further development of geographically grounded LVLMs.

major comments (1)

[§4.2] §4.2 (Reliability of Human Preferences): The reported analyses rely on internal consistency measures such as agreement rates and repeated judgments within the same annotator pool. No calibration against verifiable geographic facts (e.g., known landmark locations) or agreement with expert geographers is described. This leaves the central claim—that the benchmark measures sound geographic inference—vulnerable to the possibility that judgments primarily reflect fluency or superficial cues rather than reasoning fidelity.
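
For readers unfamiliar with the internal consistency measures this comment refers to, the sketch below computes one standard chance-corrected agreement statistic, Cohen's kappa, for two annotators labelling the same comparisons. It is an illustration only: the page does not say which statistic §4.2 reports, and the function name and the labels `"A"`, `"B"`, `"tie"` are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators' labels on the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(x == y for x, y in zip(labels_1, labels_2)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_1, freq_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(freq_1[c] * freq_2[c] for c in freq_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up preferences from two hypothetical judges on eight comparisons.
judge_1 = ["A", "A", "B", "B", "tie", "A", "B", "A"]
judge_2 = ["A", "B", "B", "B", "tie", "A", "A", "A"]
kappa = cohens_kappa(judge_1, judge_2)
print(round(kappa, 3))
```

A high kappa shows only that annotators agree with each other. It cannot show that what they agree on is geographic fidelity rather than fluency, which is the gap the referee is pointing at.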

minor comments (2)

[Abstract and §3] The abstract and §3 could more explicitly state the exact number of judges per image pair and the criteria used to select the in-the-wild images.
[Table 1] Table 1 would benefit from an additional column reporting the number of unique images per model to clarify the scale of the evaluation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and indicate the revisions we will make.

Referee: [§4.2] §4.2 (Reliability of Human Preferences): The reported analyses rely on internal consistency measures such as agreement rates and repeated judgments within the same annotator pool. No calibration against verifiable geographic facts (e.g., known landmark locations) or agreement with expert geographers is described. This leaves the central claim—that the benchmark measures sound geographic inference—vulnerable to the possibility that judgments primarily reflect fluency or superficial cues rather than reasoning fidelity.

Authors: We thank the referee for this observation. Section 4.2 currently presents inter-annotator agreement and intra-annotator consistency on repeated items as evidence of reliability within the crowd-sourced pool. We agree that these internal measures alone leave open the possibility that preferences track fluency or superficial cues rather than geographic reasoning fidelity. In the revised manuscript we will add a calibration experiment on a held-out subset of images containing well-known landmarks with verifiable locations; we will report agreement between crowd judgments and both ground-truth coordinates and, where feasible, annotations from expert geographers. This addition directly addresses the concern while preserving the dynamic, open-world character of the benchmark. revision: yes
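
The proposed calibration against ground-truth coordinates could be scored along the following lines. This is a hedged sketch, not the authors' protocol: it assumes each response has been reduced to a predicted latitude and longitude, treats the response closer to the true location as the reference answer, and measures how often the crowd preferred that response. The field names, the function names, and the coordinates are made up.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(min(1.0, a)))


def calibration_agreement(items):
    """Share of comparisons where the crowd picked the response closer to the truth.

    Each item is a dict with "truth", "pred_a", "pred_b" as (lat, lon) tuples
    and "crowd_pick" as "a" or "b". Exact distance ties are skipped.
    """
    agree = total = 0
    for item in items:
        d_a = haversine_km(*item["pred_a"], *item["truth"])
        d_b = haversine_km(*item["pred_b"], *item["truth"])
        if d_a == d_b:
            continue
        closer = "a" if d_a < d_b else "b"
        total += 1
        agree += closer == item["crowd_pick"]
    return agree / total if total else float("nan")


# Illustrative coordinates only; not real GeoArena data.
items = [
    {"truth": (0.0, 0.0), "pred_a": (0.0, 1.0), "pred_b": (0.0, 10.0), "crowd_pick": "a"},
    {"truth": (10.0, 10.0), "pred_a": (20.0, 20.0), "pred_b": (10.0, 11.0), "crowd_pick": "a"},
]
rate = calibration_agreement(items)
print(rate)
```

Distance to the true location is an outcome measure, so this check tests only whether preferences track correct answers. Agreement with expert geographers would still be needed to assess reasoning quality itself.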

Circularity Check

0 steps flagged

No circularity: GeoArena is a new human-judgment benchmark with independent data collection.

full rationale

The paper introduces GeoArena as a fresh evaluation framework that collects new human preference judgments on model-generated geographic reasoning explanations for in-the-wild images. No equations, parameter fits, or self-citations are present that reduce any claimed result to the inputs by construction. The central contribution is the deployment of this platform and the collection of thousands of judgments across 17 models, which stands as external data rather than a renaming, self-definition, or fitted prediction. The derivation chain is therefore self-contained against the newly gathered human assessments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that human preference judgments provide a valid proxy for reasoning quality; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Human judges can reliably evaluate geographic reasoning quality, evidence synthesis, and plausibility via pairwise comparisons.
This premise underpins the entire evaluation paradigm described in the abstract.

pith-pipeline@v0.9.0 · 5717 in / 1151 out tokens · 42785 ms · 2026-05-18T18:37:23.327686+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] Scene Type: whether the image depicts an indoor or outdoor setting

[2] Text Presence: whether the image contains prominent, recognizable text

[3] Landmark Presence: whether the image features a landmark, such as a historical site or natural icon. The corresponding results are presented in Table 7. The figure comprises three doughnut charts, each illustrating the distribution of one of the annotated attributes across the GeoArena-1K dataset: Indoor/Outdoor Distribution:

[4] The first doughnut chart indicates that 94.2% of images are classified as outdoor scenes, with only 5.8% representing indoor environments. This pronounced skew toward outdoor imagery aligns with the global scope of GeoArena, where user-submitted images are likely dominated by exterior scenes captured in diverse geographic contexts

[5] Text Presence: The second doughnut chart reveals a more balanced distribution, with 54.8% of images lacking recognizable text ("no text") and 45.2% containing text ("has text"). This near-equitable split underscores the dataset’s richness, incorporating both text-free natural scenes and images with textual elements such as signs or labels. This variabilit...

[6] Landmark Presence: The third doughnut chart shows that 84.2% of images do not contain landmarks ("no landmark"), while 15.8% do ("has landmark"). The low prevalence of landmarks reflects the dataset’s emphasis on general geographic scenes rather than iconic or tourist-heavy locations, offering a broad representation of natural and urban environments world...

[7] Accuracy of the predicted location

[8] Strength of reasoning and evidence

[9] Clarity and specificity. Output only one word: “win” if Response A is better, “loss” if Response B is better, “tie” if both are equally good. We first prompt the LVLM to act as an expert evaluator for the geolocalization task. For each sample, the model is provided with the prompt and the associated images, along with the responses from two candidate mo...

[10] “El Arco” or “The Arch”. Identify Distinctive Features: The key feature here is the rock formation with an arch. Such formations are often associated with coastal areas where erosion has shaped the landscape. 2. Consider Geographical Context: The presence of a large body of water and the specific type of rock formation suggests a coastal location. The clarity of the water and the ...

[11] Distinctive lamp posts: The unique sculptural street lamps with leaf-like designs are characteristic of the Olympic Green area

[12] Large open plaza: The expansive paved public space is typical of the Olympic Forest Park’s grand pedestrian areas

[13] Modern urban planning: The organized layout and design aesthetic match Beijing’s Olympic district development

[14] Architectural style: The overall design language reflects the modernization efforts made for the 2008 Beijing Olympics. This area was specifically developed as part of Beijing’s Olympic infrastructure and remains a popular public space and tourist destination. The Olympic Forest Park is located in the Chaoyang District of Beijing and serves as both a recrea...

[15] Climate and Vegetation: The presence of dense lush greenery and palm trees indicates a tropical or subtropical climate. Regions known for such landscapes include parts of the Caribbean, Southeast Asia, and the Pacific Islands. 2. Golf Course and Infrastructure: The well-maintained golf course and the presence of what appears to be resort infrastructure (e.g...
