Pith Number

pith:CQEI36IJ

pith:2026:CQEI36IJNBHHZC2CHBUTGJ2QMZ

not attested · not anchored · not stored · refs resolved

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Gueter Josmy Faure, Hung-Ting Su, Posheng Chen, Powen Cheng, Winston H. Hsu

Vision-language models cannot reliably locate invisible functional objects from task instructions and commonsense.

arxiv:2605.14704 v1 · 2026-05-14 · cs.CV · cs.AI · cs.RO


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{CQEI36IJNBHHZC2CHBUTGJ2QMZ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
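The page does not specify the merge algorithm, so the following is only a minimal sketch of one common way to make a merge deterministic: deduplicate events by identifier, sort them by a total-order key, and fold them into a state. The event fields (`id`, `at`, `field`, `value`) and the last-writer-wins rule are hypothetical and are not taken from the Pith schema.

```python
from typing import Any


def merge_events(events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a state dict independently of their arrival order.

    Assumes (hypothetically) that two events sharing an id are identical.
    """
    # Drop duplicates so that a mirror receiving an event twice is unaffected.
    unique = {event["id"]: event for event in events}
    # Sort by (timestamp, id) so every mirror applies events in the same order.
    ordered = sorted(unique.values(), key=lambda e: (e["at"], e["id"]))
    state: dict[str, Any] = {}
    for event in ordered:
        state[event["field"]] = event["value"]  # last writer wins per field
    return state


# Hypothetical events, for illustration only.
events = [
    {"id": "e1", "at": "2026-05-17T23:38:59Z", "field": "stored", "value": False},
    {"id": "e2", "at": "2026-05-18T00:00:00Z", "field": "stored", "value": True},
    {"id": "e3", "at": "2026-05-18T00:00:00Z", "field": "anchored", "value": False},
]
print(merge_events(events))
print(merge_events(list(reversed(events))))
```

Both calls print the same state, which is the property a mirror needs in order to recompute the current state from the bundle alone.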

Claims

C1 strongest claim

The strongest baseline model (Gemini 3 Flash) achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs.

C2 weakest assumption

The semi-automatic pipeline accurately creates 855 instances that genuinely require commonsense and spatial reasoning beyond superficial visual cues, rather than introducing artifacts that explain the low model performance.

C3 one-line summary

SceneFunRI benchmark shows current VLMs struggle severely with inferring locations of invisible functional objects, with the strongest model (Gemini 3 Flash) reaching only 15.20 CAcc@75.

References

38 extracted · 38 resolved · 4 Pith anchors

[1] Image amodal completion: A survey. Computer Vision and Image Understanding, 229:103661, 2023

[2] Open-world amodal appearance completion, 2025

[3] It’s not easy being wrong: Large language models struggle with process of elimination reasoning, 2024

[4] Scaling spatial intelligence with multimodal foundation models

[5] SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes, 2024

Receipt and verification

First computed	2026-05-17T23:38:59.290291Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

14088df909684e7c8b4238693327506666bc4de6bf1c46cbd714cf846ebb3700

Aliases

arxiv: 2605.14704 · arxiv_version: 2605.14704v1 · doi: 10.48550/arxiv.2605.14704 · pith_short_12: CQEI36IJNBHH · pith_short_16: CQEI36IJNBHHZC2C · pith_short_8: CQEI36IJ
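The short aliases above are plain prefixes of the 26-character identifier. In this record, that identifier also matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. This relationship is inferred from the values on this page, not from the Pith schema, so the sketch below should be read as an assumption; the function names are illustrative.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "14088df909684e7c8b4238693327506666bc4de6bf1c46cbd714cf846ebb3700"


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    This is an assumed rule, inferred from this one record.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_id: str) -> dict[str, str]:
    """Build the pith_short_N aliases as prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_id[:n] for n in (8, 12, 16)}


pith_id = pith_id_from_hash(CANONICAL_HASH)
print(pith_id)
print(short_aliases(pith_id))
```

For this record the derived identifier is `CQEI36IJNBHHZC2CHBUTGJ2QMZ`, and the three prefixes agree with the `pith_short_8`, `pith_short_12`, and `pith_short_16` aliases listed above.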

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/CQEI36IJNBHHZC2CHBUTGJ2QMZ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 14088df909684e7c8b4238693327506666bc4de6bf1c46cbd714cf846ebb3700

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "54637346a93f1dc6c32674dbf6a01de1093a6119253cb8b203a427902373360a",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.RO"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T11:21:41Z",
    "title_canon_sha256": "fb13c269bda0be01455a9e6ad02abdb673621e71c92e40ae576b22d11b96e459"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14704",
    "kind": "arxiv",
    "version": 1
  }
}
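The same check can be run offline against the record printed above. The sketch below mirrors the serialization used in the shell pipeline (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) and hashes the result with SHA-256. Whether the digest equals the published canonical hash depends on the JSON shown here being exactly the `canonical_record` served by the resolver, which this sketch does not verify; the function name `canonical_hash` is illustrative.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 over compact, key-sorted JSON, as in the shell pipeline."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "54637346a93f1dc6c32674dbf6a01de1093a6119253cb8b203a427902373360a",
        "cross_cats_sorted": ["cs.AI", "cs.RO"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-14T11:21:41Z",
        "title_canon_sha256": "fb13c269bda0be01455a9e6ad02abdb673621e71c92e40ae576b22d11b96e459",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.14704", "kind": "arxiv", "version": 1},
}

EXPECTED = "14088df909684e7c8b4238693327506666bc4de6bf1c46cbd714cf846ebb3700"

digest = canonical_hash(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the order in which the record's fields are written does not change the digest; only the values and key names do.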