RAWild: Sensor-Agnostic RAW Object Detection via Physics-Guided Curve and Grid Modeling

Gengjia Chang; Jun Liu; Shuhong Liu; Tatsuya Harada; Xuangeng Chu; Yinqiang Zheng; Ziteng Cui

arxiv: 2605.05941 · v1 · submitted 2026-05-07 · 💻 cs.CV

Shuhong Liu, Gengjia Chang, Jun Liu, Xuangeng Chu, Yinqiang Zheng, Tatsuya Harada, Ziteng Cui

Pith reviewed 2026-05-08 14:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords: across, detection, object, sensor, sensor-agnostic, depths, framework, generalization

The pith

RAWild achieves sensor-agnostic RAW object detection by using physics-guided global-local tone mapping driven by RAW priors plus a simulation pipeline, delivering SOTA results across heterogeneous sensors and bit depths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

RAW images from cameras contain more detailed light information than the processed JPEGs most systems use, but each sensor type produces different raw numbers because of its exposure settings, color filters, and bit depth. The method splits the sensor differences into two parts: a global curve that fixes overall brightness and contrast, and a local grid that adjusts colors in different image regions. Both parts are guided by statistical patterns found in real RAW data. To train the system without needing every possible sensor, the authors built a simulator that generates fake RAW images matching many real sensor behaviors. Experiments on several RAW datasets with bit depths from 10 to 24 bits show the single trained model performs better than previous approaches when tested on single datasets, mixed datasets, and under robustness challenges.
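
To make the curve-plus-grid split concrete, here is a minimal sketch. It is not the authors' implementation: the power-law form of the global curve, the multiplicative gain grid, the bilinear interpolation, and every name (`global_curve`, `sample_grid`, `curve_and_grid`) are assumptions chosen for illustration. It works on one channel; a color version would presumably apply a separate grid per channel.

```python
from typing import List

Image = List[List[float]]  # single-channel image, values normalized to [0, 1]


def global_curve(x: float, gamma: float) -> float:
    """Global tonal correction, assumed here to be a simple power-law curve."""
    return min(max(x, 0.0), 1.0) ** gamma


def sample_grid(grid: Image, u: float, v: float) -> float:
    """Bilinearly interpolate a coarse gain grid at normalized position (u, v)."""
    gh, gw = len(grid), len(grid[0])
    fy, fx = v * (gh - 1), u * (gw - 1)
    y0, x0 = int(fy), int(fx)
    y1, x1 = min(y0 + 1, gh - 1), min(x0 + 1, gw - 1)
    ty, tx = fy - y0, fx - x0
    top = grid[y0][x0] * (1 - tx) + grid[y0][x1] * tx
    bottom = grid[y1][x0] * (1 - tx) + grid[y1][x1] * tx
    return top * (1 - ty) + bottom * ty


def curve_and_grid(img: Image, gamma: float, gain_grid: Image) -> Image:
    """Apply the global curve first, then the spatially varying local gain."""
    h, w = len(img), len(img[0])
    out: Image = []
    for y in range(h):
        v = y / (h - 1) if h > 1 else 0.0
        row = []
        for x in range(w):
            u = x / (w - 1) if w > 1 else 0.0
            toned = global_curve(img[y][x], gamma)
            row.append(min(toned * sample_grid(gain_grid, u, v), 1.0))
        out.append(row)
    return out
```

In the paper's framing, the curve parameter and the grid values would be predicted from RAW statistics rather than passed in by hand; the sketch only shows how the two corrections compose.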

Core claim

By factoring sensor-induced variations into a global tonal correction and a spatially adaptive local color adjustment, both driven by RAW distribution priors, our framework enables a single network to train jointly across heterogeneous sensors.

Load-bearing premise

That sensor-induced variations can be accurately and completely factored into global tonal correction plus spatially adaptive local color adjustment driven by RAW distribution priors, and that the physics-based simulation pipeline produces data realistic enough for cross-sensor generalization.

Original abstract

Camera sensor RAW data offers intrinsic advantages for object detection, including deeper bit depth, preserved physical information, and freedom from image signal processor (ISP) distortions. However, varying exposure conditions, spectral sensitivities, and bit depths across devices introduce substantially larger domain gaps than sRGB, making sensor-agnostic generalization a fundamental challenge. In this study, we present RAWild, a physics-guided global-local tone mapping framework for sensor-agnostic RAW object detection. By factoring sensor-induced variations into a global tonal correction and a spatially adaptive local color adjustment, both driven by RAW distribution priors, our framework enables a single network to train jointly across heterogeneous sensors. To further support cross-sensor generalization, we construct a physics-based RAW simulation pipeline that synthesizes realistic sensor outputs spanning diverse spectral sensitivities, illuminants, and sensor non-idealities. Extensive experiments across multiple RAW benchmarks covering bit depths from 10 to 24 demonstrate state-of-the-art (SOTA) performance under single-dataset, mixed-dataset, and challenging robustness settings.
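
As a rough illustration of what a physics-based RAW simulator has to do across bit depths, the sketch below maps linear scene irradiance to integer sensor codes. It is a generic textbook-style noise model, not the paper's pipeline: the Gaussian approximation to shot noise, the read-noise level, the full-well capacity, the black-level fraction, and the names `simulate_raw` and `normalize_raw` are all assumptions, and spectral sensitivities and illuminants are left out entirely.

```python
import random
from typing import List


def simulate_raw(irradiance: List[float], bit_depth: int, exposure: float = 1.0,
                 full_well: float = 10000.0, read_noise_e: float = 2.0,
                 black_level_frac: float = 0.0625, seed: int = 0) -> List[int]:
    """Map linear irradiance in [0, 1] to integer RAW codes for one hypothetical sensor."""
    rng = random.Random(seed)
    max_code = (1 << bit_depth) - 1
    black = round(black_level_frac * max_code)  # assumed black-level offset
    codes = []
    for e in irradiance:
        electrons = max(e, 0.0) * exposure * full_well
        # Shot noise: Gaussian approximation to Poisson (variance equals mean).
        electrons += rng.gauss(0.0, electrons ** 0.5)
        # Read noise: signal-independent Gaussian, in electrons.
        electrons += rng.gauss(0.0, read_noise_e)
        # Quantize to the sensor's bit depth; clamping models saturation.
        code = black + round(electrons / full_well * (max_code - black))
        codes.append(min(max(code, 0), max_code))
    return codes


def normalize_raw(codes: List[int], bit_depth: int,
                  black_level_frac: float = 0.0625) -> List[float]:
    """Undo black level and bit depth so different sensors share a common scale."""
    max_code = (1 << bit_depth) - 1
    black = round(black_level_frac * max_code)
    return [(c - black) / (max_code - black) for c in codes]
```

Running the same irradiance through `bit_depth=10` and `bit_depth=24` gives very different integer ranges that land on nearly the same normalized values, which is the simplest version of the cross-sensor gap the abstract describes.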

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RAWild's main move is factoring RAW sensor gaps into a global curve plus local grid correction driven by distribution priors, plus a simulation pipeline, so one detector can train across devices.

The thing worth knowing upfront is that the paper gives a concrete way to handle the domain shifts in RAW data, which are larger than those in sRGB. It splits the correction into a global tonal adjustment and spatially adaptive local color shifts, both derived from RAW priors, and pairs this with a physics-based simulator that generates data across bit depths, spectral responses, and some non-idealities. On that basis the authors claim that joint training works for object detection without per-sensor retraining. If the simulation holds up, this is a practical step toward keeping physical information in the pipeline instead of losing it to ISP processing.

The approach is new in how it applies curve-and-grid modeling specifically to detection rather than just to tone mapping or generic adaptation. The authors also put work into covering a wide range of sensor conditions in the simulator, which is more than a routine extension. The paper does a solid job of laying out why RAW generalization is harder and why preserving the raw signal matters for detection tasks. The experiments are described as covering single-dataset, mixed-dataset, and robustness cases on multiple benchmarks, which shows they tried to test the cross-sensor angle directly.

The soft spots are mostly around the simulation's fidelity. The central claim needs the generated data to be close enough to real sensors that the learned corrections transfer; if read-noise statistics, microlens effects, or bit-depth interactions are not fully captured, joint training on the synthetic set does not guarantee real-world results. The abstract states SOTA, but without error bars, full ablations, or an account of how the priors were derived from the data, it is hard to judge how much the gains depend on the modeling rather than on other factors. There is also the usual risk that the priors end up tuned to the evaluation distribution.

This paper is for computer vision researchers who work with raw sensor data or build systems that must run across multiple cameras without retraining. A reader focused on domain generalization or physics-informed methods in imaging would get value from the setup and the reported results. It deserves a serious referee because the problem is real, the method is defined, and the authors provide experiments across settings, even if the simulation validation will need close checking. I would send it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper presents RAWild, a physics-guided global-local tone mapping framework for sensor-agnostic RAW object detection. Sensor-induced variations are factored into a global tonal correction and a spatially adaptive local color adjustment, both driven by RAW distribution priors, enabling joint training of a single detector across heterogeneous sensors with varying exposures, spectral sensitivities, and bit depths. A physics-based RAW simulation pipeline is introduced to synthesize realistic data spanning diverse spectral sensitivities, illuminants, bit depths (10-24), and non-idealities. Experiments on multiple RAW benchmarks claim state-of-the-art performance under single-dataset, mixed-dataset, and robustness settings.

Significance. If the claims hold, the work addresses an important challenge in computer vision by enabling direct use of RAW data for object detection, preserving physical information and avoiding ISP distortions. The physics-based simulation pipeline is a notable strength for supporting cross-sensor generalization and data synthesis, provided it is empirically validated against real sensor distributions.

major comments (3)

[Abstract and simulation pipeline section] The central claim that the physics-based simulation produces data realistic enough for cross-sensor generalization after global-local corrections is load-bearing, yet the manuscript provides no quantitative validation (e.g., KL divergence, Wasserstein distance, or per-channel histogram comparisons) between simulated and real RAW captures from the same sensors. Unmodeled effects such as read-noise statistics or microlens variations could remain and undermine joint training.
[Framework description] The decomposition of sensor variations into global tonal correction plus spatially adaptive local color adjustment driven by RAW priors is presented as complete and parameter-light, but no ablation studies are referenced that isolate the contribution of each component or test whether the priors are derived independently of the evaluation data. This risks circularity in the reported generalization gains.
[Experimental results] The SOTA claims across single-dataset, mixed-dataset, and robustness settings lack reported error bars, statistical significance tests, or detailed baseline comparisons (e.g., against prior RAW or ISP-based detectors). Without these, it is impossible to confirm that the performance improvements are attributable to the proposed corrections rather than dataset specifics.
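
The simulated-versus-real comparison requested in the first major comment could be sketched as follows. This is only one way to run such a check, under stated assumptions: pixel values are already normalized to [0, 1], both histograms share the same bins, and the helper names (`histogram`, `kl_divergence`, `wasserstein_1d`) are hypothetical. The 1-D Wasserstein distance is computed from the difference of the two cumulative histograms.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], bins: int,
              lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Normalized histogram (bin masses sum to 1) over [lo, hi]."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1
    total = len(values)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-9) -> float:
    """KL(p || q) in nats, with additive smoothing so empty bins stay finite."""
    ps = [pi + eps for pi in p]
    qs = [qi + eps for qi in q]
    zp, zq = sum(ps), sum(qs)
    return sum((pi / zp) * math.log((pi / zp) / (qi / zq)) for pi, qi in zip(ps, qs))


def wasserstein_1d(p: Sequence[float], q: Sequence[float], bin_width: float) -> float:
    """1-D Wasserstein-1 distance: integral of |CDF_p - CDF_q| over shared bins."""
    cp = cq = total = 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        total += abs(cp - cq) * bin_width
    return total
```

A per-channel check would build one histogram per color channel for real and simulated captures from the same sensor and report both numbers for each channel. KL divergence is asymmetric and sensitive to the smoothing constant, so reporting it alongside the Wasserstein distance gives a more stable picture.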

minor comments (2)

[Abstract] Clarify the exact computation of 'RAW distribution priors' (e.g., whether they are per-image histograms, global statistics, or learned) in the main text, as the abstract uses the term without definition.
Ensure consistency in bit-depth reporting (10-24) between the abstract, simulation pipeline description, and experimental tables.
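
To illustrate why the first minor comment matters, the sketch below shows two of the readings the referee lists: a per-image statistic and a dataset-level statistic. Nothing here is taken from the paper. The choice of geometric mean plus percentiles, and the names `image_prior` and `dataset_prior`, are purely hypothetical stand-ins for whatever "RAW distribution priors" turns out to mean.

```python
import math
from typing import List, Sequence


def image_prior(pixels: Sequence[float], eps: float = 1e-6,
                percentiles: Sequence[int] = (1, 50, 99)) -> List[float]:
    """Per-image summary: geometric mean followed by a few percentiles."""
    xs = sorted(pixels)
    n = len(xs)
    geo_mean = math.exp(sum(math.log(x + eps) for x in xs) / n)
    pct = [xs[min(n - 1, int(p / 100 * n))] for p in percentiles]
    return [geo_mean] + pct


def dataset_prior(train_images: Sequence[Sequence[float]]) -> List[float]:
    """Dataset-level summary: mean of per-image priors over the TRAINING split only."""
    priors = [image_prior(img) for img in train_images]
    return [sum(col) / len(priors) for col in zip(*priors)]
```

The two options behave differently at test time: a per-image prior adapts to each new capture, while a dataset-level prior is fixed after training and only stays independent of the evaluation data if it is computed from training images alone.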

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We will revise the manuscript to incorporate quantitative validation of the simulation pipeline, ablation studies for the framework components, and enhanced statistical reporting in the experiments. Our point-by-point responses follow.

Referee: [Abstract and simulation pipeline section] The central claim that the physics-based simulation produces data realistic enough for cross-sensor generalization after global-local corrections is load-bearing, yet the manuscript provides no quantitative validation (e.g., KL divergence, Wasserstein distance, or per-channel histogram comparisons) between simulated and real RAW captures from the same sensors. Unmodeled effects such as read-noise statistics or microlens variations could remain and undermine joint training.

Authors: We agree that quantitative validation would strengthen the central claim. The manuscript currently supports the simulation's utility through qualitative visual comparisons and downstream detection gains across sensors. In revision, we will add quantitative metrics including KL divergence, Wasserstein distance, and per-channel histogram comparisons between simulated and real RAW data from matching sensors. We will also expand the discussion of simulation limitations to explicitly address potential unmodeled effects such as read-noise statistics and microlens variations, including any mitigation strategies or sensitivity analysis.
revision: yes

Referee: [Framework description] The decomposition of sensor variations into global tonal correction plus spatially adaptive local color adjustment driven by RAW priors is presented as complete and parameter-light, but no ablation studies are referenced that isolate the contribution of each component or test whether the priors are derived independently of the evaluation data. This risks circularity in the reported generalization gains.

Authors: We concur that isolating component contributions via ablations is essential. The revised manuscript will include new ablation experiments that separately evaluate the global tonal correction and the spatially adaptive local adjustment. We will also clarify the procedure for deriving RAW distribution priors, confirming they are computed exclusively from training splits with no overlap to evaluation data, thereby eliminating any risk of circularity.
revision: yes

Referee: [Experimental results] The SOTA claims across single-dataset, mixed-dataset, and robustness settings lack reported error bars, statistical significance tests, or detailed baseline comparisons (e.g., against prior RAW or ISP-based detectors). Without these, it is impossible to confirm that the performance improvements are attributable to the proposed corrections rather than dataset specifics.

Authors: We acknowledge the need for greater statistical rigor. In the revised experiments section, we will report error bars as standard deviations over multiple random seeds, include statistical significance tests (e.g., paired t-tests against baselines), and expand baseline comparisons to include additional prior RAW-specific and ISP-based detectors. These additions will help attribute performance gains more clearly to the physics-guided corrections.
revision: yes
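
The statistical reporting promised in the last response can be sketched with the standard library alone. This is a minimal illustration, assuming each method is trained once per random seed and that scores are paired by seed; the function names are hypothetical and no scores from the paper are used.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over seeds, for 'mean ± std' error bars."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t(ours: Sequence[float], baseline: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed score differences."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two paired scores of equal length")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all differences are identical; t statistic is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The returned statistic is compared against a Student's t critical value for the given degrees of freedom (or passed to a library that computes the p-value). With only a handful of seeds the test has little power, so the per-seed differences are worth reporting alongside it.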

Circularity Check

0 steps flagged

No circularity: framework and simulation presented as independent of target data

The abstract and claims describe a physics-guided global-local tone mapping driven by RAW distribution priors together with a separate physics-based simulation pipeline that synthesizes outputs across spectral sensitivities, illuminants, and non-idealities. No quoted equations, self-citations, or descriptions show any prediction or result reducing by construction to fitted parameters from the same evaluation data, nor any uniqueness theorem or ansatz imported solely from the authors' prior work. The central claim of joint training across heterogeneous sensors therefore rests on externally verifiable physics modeling rather than definitional or statistical self-reference, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified effectiveness of RAW distribution priors and the realism of the physics-based simulation pipeline; no explicit free parameters or invented entities are detailed in the abstract.

axioms (1)

domain assumption: Sensor-induced variations can be factored into global tonal correction and spatially adaptive local color adjustment driven by RAW distribution priors.
Invoked to enable joint training across heterogeneous sensors.

invented entities (1)

Physics-based RAW simulation pipeline (no independent evidence)
purpose: Synthesize realistic sensor outputs spanning spectral sensitivities, illuminants, and non-idealities
Introduced to support cross-sensor generalization training.

pith-pipeline@v0.9.0 · 5502 in / 1311 out tokens · 66954 ms · 2026-05-08T14:19:23.162977+00:00 · methodology
