POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP

Heemin Yang; Jiyun Won; Jungseul Ok; Sunghyun Cho; Woohyeok Kim

arxiv: 2604.06938 · v1 · submitted 2026-04-08 · 💻 cs.CV

Pith reviewed 2026-05-10 18:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords image signal processing · ISP pipeline · reinforcement learning · sequence prediction · task-aware optimization · modular computer vision · neural architecture search

The pith

Sequence-level reinforcement learning selects and tunes an entire ISP pipeline in one forward pass using only the final task reward.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents POS-ISP as a way to optimize sequences of image signal processing modules together with their parameters for better performance on specific computer vision tasks. It replaces step-by-step decisions or separate architecture searches with a single policy that outputs the full ordered pipeline at once. Training relies solely on a reward computed at the end of the pipeline, removing the need for intermediate supervision signals or repeated executions of partial pipelines. This change is shown to produce higher accuracy on downstream tasks while lowering overall training cost across several experiments. The core shift is from local, incremental optimization to a global sequence prediction problem.

Core claim

POS-ISP formulates modular ISP optimization as a global sequence prediction problem. The method uses a reinforcement learning policy to predict the entire module sequence and its parameters in a single forward pass, then optimizes the resulting pipeline with a terminal task reward. This removes the requirement for intermediate supervision and avoids redundant pipeline executions during training, yielding more stable learning and lower computational overhead than neural architecture search or step-wise RL baselines.

What carries the argument

Sequence-level RL policy that outputs the complete ordered list of ISP modules and their tunable parameters together in one inference step, scored only by the final downstream task metric.
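To make the "complete ordered list in one inference step" idea concrete, here is a minimal sketch of sampling a whole module sequence from a conditional distribution p(a_i | a_<i) and accumulating its log-probability. Everything in it is an assumption for illustration: the module names, the `toy_logits` scoring rule (a stand-in for the learned decoder), and the `STOP` token are hypothetical and are not taken from the paper. Per-module parameter prediction is omitted.

```python
import math
import random

# Hypothetical module vocabulary; the paper's actual module set is not listed here.
MODULES = ["denoise", "white_balance", "gamma", "sharpen", "STOP"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(prefix):
    """Stand-in for a learned decoder: score each candidate given the prefix."""
    # Discourage repeating a module that is already in the prefix.
    logits = [-2.0 if name in prefix else 0.0 for name in MODULES]
    # Make STOP more likely as the prefix grows.
    logits[-1] = float(len(prefix)) - 2.0
    return logits


def sample_pipeline(rng, max_len=6):
    """Sample a full module sequence and return it with its log-probability."""
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        probs = softmax(toy_logits(prefix))
        idx = rng.choices(range(len(MODULES)), weights=probs)[0]
        log_prob += math.log(probs[idx])  # log p(a_i | a_<i)
        if MODULES[idx] == "STOP":
            break
        prefix.append(MODULES[idx])
    return prefix, log_prob


rng = random.Random(0)
sequence, log_prob = sample_pipeline(rng)
print(sequence, round(log_prob, 3))
```

The returned log-probability is the quantity a terminal-reward policy gradient would weight by the final task score; no intermediate image is scored while the sequence is being sampled.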

If this is right

Task accuracy rises because the policy can learn coherent module orders rather than myopic local choices.
Training stability improves by removing per-stage decision points that accumulate variance.
Compute during optimization falls since each training step evaluates only one complete pipeline instead of multiple partial ones.
The same trained policy can be applied to different downstream tasks by swapping only the final reward function.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same global-sequence idea could be tested on other modular vision pipelines where module order strongly affects final output quality.
If the policy generalizes across tasks, retraining for a new objective might require far fewer samples than re-optimizing from scratch.
Replacing the RL policy with a differentiable surrogate could further reduce training variance while preserving the single-pass advantage.

Load-bearing premise

A single reward signal measured only after the full pipeline runs is sufficient to train a stable policy that discovers effective module sequences and parameter settings without any stepwise guidance.

What would settle it

A controlled comparison in which the sequence-level policy, after training, produces lower task accuracy or higher final latency than a well-tuned step-wise RL baseline on the same set of downstream tasks would show the global formulation does not deliver the claimed gains.

Figures

Figures reproduced from arXiv: 2604.06938 by Heemin Yang, Jiyun Won, Jungseul Ok, Sunghyun Cho, Woohyeok Kim.

**Figure 1.** Overview of the proposed method. POS-ISP aims at constructing the ISP pipeline that best performs for the downstream task. The sequence predictor predicts the image processing module sequence based on the learned policy, and the parameter predictor estimates the corresponding parameters of each module.

**Figure 2.** Detailed architecture of sequence predictor. The sequence predictor predicts the image processing module sequence based on the learned policy. Given $h_i$, an MLP-based decoder followed by a softmax layer predicts the probability distribution $\pi(a_i)$ over candidate modules, parameterizing the conditional distribution $p(a_i \mid a_{<i})$ based on the hidden state $h_i$. During the ISP search, we train the sequence pred…

**Figure 3.** Comparison of different ISP methods on object detection and instance segmentation tasks. Reference images are well-lit scenes from the LOD and LIS datasets, with brightness increased by 1.5× for visualization. More results are in the supplementary material.

**Figure 4.** Qualitative comparison on image enhancement. We use the images retouched by Expert C from the Adobe FiveK dataset as ground truth. Our method more closely matches the brightness and color tones of the ground truth.

**Figure 5.** Optimization behavior. (a) Task score on the test set over training progress. (b) (left) policy entropy convergence and (right) relative likelihood of the final pipeline.

Original abstract

Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at https://w1jyun.github.io/POS-ISP

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

POS-ISP frames ISP optimization as one-shot sequence prediction with RL and a terminal task reward, which cleanly sidesteps stepwise decisions but leaves credit assignment unresolved.

The core contribution is a reinforcement learning policy that outputs the full ISP module sequence and all parameters in a single forward pass, then trains only on the end-of-pipeline task reward. This is presented as an improvement over NAS, which has a train-inference gap, and over stepwise RL, which incurs extra cost and instability from per-stage choices. The paper earns credit for stating those drawbacks plainly and for showing that the global formulation removes the need for intermediate supervision or repeated pipeline runs during training. Experiments are claimed to deliver better task accuracy at lower cost across several downstream problems, which would be useful if the numbers hold.

The soft spot is exactly the one flagged in the stress test. A single terminal reward for sequences of five to ten heterogeneous modules, each with discrete choices and continuous parameters, gives an extremely sparse signal. Standard policy gradients will have high variance unless the authors added a strong baseline, value function, or some other mitigation that is not visible in the abstract. Without seeing the training curves, ablations on reward scaling, or comparisons to stepwise methods with the same compute budget, it is hard to judge whether the claimed stability is real or just lucky hyperparameter tuning.

The work is aimed at groups that tune ISP pipelines for specific vision tasks such as detection or classification on embedded cameras. It is worth sending to peer review because the formulation is a clear, practical shift from the cited priors and the problem itself matters for real camera systems, even if the RL details need close checking.
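The variance concern raised in the letter can be illustrated with a toy calculation. The sketch below computes the exact mean and variance of a one-sample REINFORCE gradient estimate for a single softmax logit, once with no baseline and once with a constant baseline equal to the expected reward. The three-action policy and the reward values are made-up numbers, and a constant baseline is an assumption chosen for simplicity; it is not a description of the mitigation used in the paper.

```python
def grad_stats(probs, rewards, baseline, target=0):
    """Exact mean and variance of the REINFORCE estimate for one softmax logit.

    For a softmax policy, the derivative of log pi(a) with respect to the
    logit of `target` is 1[a == target] - pi(target).
    """
    samples = []
    for action, (p, r) in enumerate(zip(probs, rewards)):
        score = (1.0 if action == target else 0.0) - probs[target]
        samples.append((p, (r - baseline) * score))
    mean = sum(p * g for p, g in samples)
    variance = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, variance


probs = [1 / 3, 1 / 3, 1 / 3]   # toy uniform policy over three candidate pipelines
rewards = [10.0, 11.0, 12.0]    # toy terminal task scores (made up)
expected_reward = sum(p * r for p, r in zip(probs, rewards))

mean_plain, var_plain = grad_stats(probs, rewards, baseline=0.0)
mean_base, var_base = grad_stats(probs, rewards, baseline=expected_reward)

print("no baseline:  ", mean_plain, var_plain)
print("with baseline:", mean_base, var_base)
```

In this toy setting the two estimators share the same mean, so the baseline leaves the gradient direction unchanged, while the variance of the baselined estimator is far smaller because the large common offset in the rewards no longer multiplies the score term.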

Referee Report

2 major / 1 minor

Summary. The paper introduces POS-ISP, a sequence-level reinforcement learning framework for task-aware optimization of modular image signal processing (ISP) pipelines. It formulates the problem as a global sequence prediction task where the entire module sequence and continuous parameters are predicted in a single forward pass, optimized end-to-end using only a terminal task-specific reward. This is positioned as an improvement over neural architecture search (which has train-inference mismatch) and step-wise RL (which suffers from instability and redundant executions), with claimed gains in downstream task performance and reduced computational cost.

Significance. If the central claims hold, the work could establish sequence-level RL as a stable paradigm for joint discrete-continuous optimization of ISP pipelines, reducing reliance on intermediate supervision and enabling more efficient task-specific adaptations in computer vision. This would be particularly valuable for applications where ISP is a bottleneck, provided the approach generalizes beyond the evaluated tasks without excessive policy gradient variance.

major comments (2)

[Abstract] The central claim that a single terminal task reward suffices to train a policy over variable-length module sequences (typically 5-10 stages with discrete choices and continuous parameters) without intermediate supervision is load-bearing, yet the description provides no details on the policy architecture, baseline, or variance-reduction techniques used to mitigate credit assignment difficulties in this sparse-reward setting.
[Abstract] The assertion of improved task performance and reduced computational cost across multiple downstream tasks cannot be evaluated, as no datasets, baselines, quantitative metrics, or ablation results are reported, leaving the experimental validation of the sequence-level formulation unassessable.

minor comments (1)

The abstract mentions a project page but does not indicate whether code, trained models, or exact experimental protocols will be released, which would be needed to verify the claimed stability and efficiency gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their feedback. The abstract is intentionally concise, with full technical and experimental details provided in the manuscript body. We address each major comment below.

Referee: [Abstract] The central claim that a single terminal task reward suffices to train a policy over variable-length module sequences (typically 5-10 stages with discrete choices and continuous parameters) without intermediate supervision is load-bearing, yet the description provides no details on the policy architecture, baseline, or variance-reduction techniques used to mitigate credit assignment difficulties in this sparse-reward setting.

Authors: We agree the abstract omits these specifics due to length constraints. Section 3 of the manuscript fully specifies the policy as a sequence model that predicts the complete variable-length module sequence and continuous parameters in a single forward pass. Training uses a REINFORCE objective with a learned baseline for variance reduction, enabling stable optimization from the terminal task reward alone without intermediate supervision or per-stage rewards. This is the core of the sequence-level formulation.

revision: no

Referee: [Abstract] The assertion of improved task performance and reduced computational cost across multiple downstream tasks cannot be evaluated, as no datasets, baselines, quantitative metrics, or ablation results are reported, leaving the experimental validation of the sequence-level formulation unassessable.

Authors: The abstract summarizes the outcome; the full manuscript reports the experiments in Section 4, including the specific datasets and tasks, direct comparisons against NAS and step-wise RL baselines, quantitative metrics for task performance and computational cost, and ablations isolating the sequence-level optimization. These results support the claims of improved performance and reduced cost.

revision: no
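For readers unfamiliar with why a baseline is admissible in the REINFORCE objective mentioned in the first response, the standard argument is short. The block below is a generic derivation, not a transcription of the manuscript's Section 3: the symbols $b$ and $\theta$ are assumed notation, $b$ is assumed not to depend on the sampled sequence $A$, and the sum over $A$ becomes an integral for the continuous parameters.

```latex
% Policy gradient with a baseline b, for a sequence A = (a_1, ..., a_k)
\nabla_\theta J(\theta)
  = \mathbb{E}_{A \sim \pi_\theta}\!\left[ \bigl(R(A) - b\bigr)\,
      \nabla_\theta \sum_{i=1}^{k} \log \pi_\theta(a_i \mid a_{<i}) \right]

% Subtracting b adds no bias, because the score function has zero mean:
\mathbb{E}_{A \sim \pi_\theta}\!\left[ b\, \nabla_\theta \log \pi_\theta(A) \right]
  = b \sum_{A} \pi_\theta(A)\, \nabla_\theta \log \pi_\theta(A)
  = b\, \nabla_\theta \sum_{A} \pi_\theta(A)
  = b\, \nabla_\theta 1
  = 0
```

The baseline therefore changes only the variance of the estimate, which is why it is a natural candidate for the sparse terminal-reward setting the referee asks about.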

Circularity Check

0 steps flagged

No circularity: new RL formulation is self-contained

The paper presents POS-ISP as a novel sequence-level RL formulation that predicts full module sequences and parameters in one forward pass, trained solely on terminal task reward. No equations or claims reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. The central method is introduced as an alternative to NAS and step-wise RL without re-deriving prior results or smuggling ansatzes; training stability is asserted as an empirical outcome rather than a definitional necessity. This is the expected non-finding for a methods paper proposing a new optimization paradigm.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based on abstract only; the central claim rests on the assumption that terminal-reward RL can discover good sequences without intermediate signals, plus standard RL training machinery whose details are not supplied.

free parameters (1)

RL policy and reward scaling hyperparameters
Typical in any RL method; values are not stated in the abstract.

axioms (1)

Domain assumption: A terminal task reward alone suffices to learn stable and effective module sequences
Explicitly invoked by the claim that intermediate supervision can be eliminated.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation to the cited Recognition theorem: unclear
Paper passage: "We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt
Relation to the cited Recognition theorem: unclear
Paper passage: $L_{\mathrm{seq}} = -\hat{\mathbb{E}}_{A \sim \pi}\left[ R(I_{\mathrm{in}}, A, \Theta) \cdot \sum_{i=1}^{k} \log \pi(a_i) \right]$
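The sequence loss quoted in the second passage is an empirical expectation over sampled pipelines, so it can be estimated by averaging over a batch of rollouts. The following is a minimal sketch of that Monte Carlo estimate, assuming each rollout is stored as a terminal reward plus the per-step log-probabilities; the function name `sequence_loss`, the data layout, and the numbers in `batch` are all hypothetical.

```python
import math


def sequence_loss(rollouts):
    """Monte Carlo estimate of L_seq = -E[R * sum_i log pi(a_i)].

    Each rollout is a pair (terminal_reward, [log pi(a_1), ..., log pi(a_k)]).
    """
    total = 0.0
    for reward, log_probs in rollouts:
        # The single terminal reward weights the log-probability of the whole sequence.
        total += reward * sum(log_probs)
    return -total / len(rollouts)


# Hypothetical batch of two sampled pipelines with made-up rewards and probabilities.
batch = [
    (0.8, [math.log(0.5), math.log(0.25)]),
    (0.2, [math.log(0.5), math.log(0.5), math.log(0.5)]),
]
loss = sequence_loss(batch)
print(loss)
```

In an automatic-differentiation framework the reward would typically be treated as a constant, so that the gradient flows only through the log-probability terms.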

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] Jiawang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, Chunhua Shen, Ming-Ming Cheng, and Ian Reid. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In NeurIPS, 2019.

[2] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input / output image pairs. In CVPR, 2011.

[3] Linwei Chen, Ying Fu, Kaixuan Wei, Dezhi Zheng, and Felix Heide. Instance segmentation in the dark. IJCV, 2023.

[4] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.

[5] Yang Hong, Kaixuan Wei, Linwei Chen, and Ying Fu. Crafting object detection in very low light. In BMVC, 2021.

[6] Yuanming Hu, Hao He, Chenxi Xu, Baoyuan Wang, and Stephen Lin. Exposure: A white-box photo post-processing framework. ACM TOG, 2018.

[7] Adam Jelley, Trevor McInroe, Sam Devlin, and Amos Storkey. Efficient offline reinforcement learning: The critic is critical. arXiv, 2024.

[8] Mengqi Lei, Siqi Li, Yihong Wu, Han Hu, You Zhou, Xinhu Zheng, Guiguang Ding, Shaoyi Du, Zongze Wu, and Yue Gao. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv, 2025.

[9] Alberto Maria Metelli, Matteo Pirotta, Daniele Calandriello, and Marcello Restelli. Safe policy iteration: A monotonically improving approximate policy iteration approach. JMLR.

[10] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. arXiv, 2017.

[11] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv, 2018.

[12] Ukcheol Shin, Kyunghyun Lee, and In So Kweon. DRL-ISP: Multi-objective camera ISP with deep reinforcement learning. In IROS, 2022.

[13] Paul Wagner. A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. In NeurIPS, 2011.

[14] Yujin Wang, Tianyi Xu, Fan Zhang, Tianfan Xue, and Jinwei Gu. AdaptiveISP: Learning an adaptive image signal processor for object detection. In NeurIPS, 2024.

[15] Ke Yu, Zexian Li, Yue Peng, Chen Change Loy, and Jinwei Gu. ReconfigISP: Reconfigurable camera image processing pipeline. In ICCV, 2021.
