FOSNet: An End-to-End Trainable Deep Neural Network for Scene Recognition
Pith reviewed 2026-05-24 20:26 UTC · model grok-4.3
The pith
Fusing object and scene cues in a CNN with a new coherence loss reaches state-of-the-art accuracy on two scene recognition benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that FOSNet, which fuses object and scene streams inside a single convolutional network and is trained with a scene coherence loss that enforces uniform sceneness, produces the highest reported accuracies: 60.14 percent on Places 2 and 90.37 percent on MIT Indoor 67.
What carries the argument
FOSNet (fusion of object and scene), which merges the object and scene feature streams, together with the scene coherence loss, which penalizes inconsistent scene labels across an image.
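One way to picture a loss of this kind is as a penalty on disagreement between neighbouring spatial cells of a class-probability map. The sketch below is an illustrative assumption, not the paper's SCL formulation (which is not reproduced on this page): `coherence_penalty`, the grid layout, and the squared-difference measure are all hypothetical choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coherence_penalty(logit_map):
    """Mean squared difference between the class probabilities of
    horizontally and vertically adjacent cells.

    logit_map: H x W grid (list of rows) where each cell is a list of
    per-class logits. This is an assumed stand-in for a coherence loss,
    not the formula used in the paper.
    """
    probs = [[softmax(cell) for cell in row] for row in logit_map]
    h, w = len(probs), len(probs[0])
    total, pairs = 0.0, 0
    for i in range(h):
        for j in range(w):
            for di, dj in ((0, 1), (1, 0)):  # right and down neighbours
                ni, nj = i + di, j + dj
                if ni < h and nj < w:
                    total += sum(
                        (a - b) ** 2
                        for a, b in zip(probs[i][j], probs[ni][nj])
                    )
                    pairs += 1
    # A 1 x 1 map has no neighbouring pairs, so there is nothing to penalize.
    return total / pairs if pairs else 0.0


# Every cell agrees on class 0: no penalty.
uniform = [[[2.0, 0.0], [2.0, 0.0]], [[2.0, 0.0], [2.0, 0.0]]]
# Left column favours class 0, right column favours class 1: positive penalty.
mixed = [[[2.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 2.0]]]
```

A map in which every cell predicts the same distribution scores zero, and the penalty grows as adjacent cells disagree, which is the behaviour the uniform-sceneness premise would reward.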
If this is right
- The fused network plus coherence loss outperforms previous scene recognition methods on Places 2 and MIT Indoor 67.
- End-to-end training lets the model learn how to combine object and scene cues without staged pipelines.
- The coherence loss exploits the constant scene class property to raise accuracy on standard benchmarks.
- The method places second on SUN 397 (77.28 percent), showing competitive results across three large datasets.
Where Pith is reading between the lines
- The same object-scene fusion idea could be tested on video or multi-view scene data where temporal consistency replaces spatial uniformity.
- If the uniform-sceneness premise holds only for certain image types, performance might degrade on photos containing multiple distinct regions.
- Adding the coherence loss to other CNN backbones might produce similar gains without redesigning the full FOSNet structure.
Load-bearing premise
The scene coherence loss improves performance because sceneness spreads evenly and the scene class remains the same throughout the image.
What would settle it
Training the same FOSNet architecture without the scene coherence loss and measuring whether accuracy on Places 2 or MIT Indoor 67 drops by more than a few percentage points would test the loss contribution directly.
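The comparison described above comes down to a difference of two top-1 accuracies measured on the same evaluation set. A minimal sketch, assuming the predictions from the two training runs are already available as lists of class indices; the function names are hypothetical, and the threshold for a meaningful drop is left to the reader.

```python
def top1_accuracy(predictions, labels):
    """Top-1 accuracy in percent for equal-length lists of class indices."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def ablation_drop(preds_with_loss, preds_without_loss, labels):
    """Accuracy (percentage points) lost when the coherence loss is removed.

    A positive value means the run trained with the loss was more accurate.
    """
    return top1_accuracy(preds_with_loss, labels) - top1_accuracy(
        preds_without_loss, labels
    )
```

A drop near zero would suggest the fusion backbone carries most of the gain, while a large positive drop would support the loss design.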
Original abstract
Scene recognition is an image recognition problem aimed at predicting the category of the place at which the image is taken. In this paper, a new scene recognition method using the convolutional neural network (CNN) is proposed. The proposed method is based on the fusion of the object and the scene information in the given image and the CNN framework is named as FOS (fusion of object and scene) Net. In addition, a new loss named scene coherence loss (SCL) is developed to train the FOSNet and to improve the scene recognition performance. The proposed SCL is based on the unique traits of the scene that the 'sceneness' spreads and the scene class does not change all over the image. The proposed FOSNet was experimented with three most popular scene recognition datasets, and their state-of-the-art performance is obtained in two sets: 60.14% on Places 2 and 90.37% on MIT indoor 67. The second highest performance of 77.28% is obtained on SUN 397.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FOSNet, a CNN architecture for scene recognition that fuses object and scene information, trained end-to-end with a novel Scene Coherence Loss (SCL). SCL is motivated by the assumption that 'sceneness' spreads and the scene class remains constant across the full image. Experiments on Places2, MIT Indoor67, and SUN397 are reported to achieve SOTA accuracies of 60.14% and 90.37% on the first two datasets, respectively, with 77.28% (second-best) on SUN397.
Significance. If the performance claims hold under rigorous validation, the fusion of object/scene cues with a coherence loss could advance scene recognition methods by exploiting image-wide consistency properties. The work would benefit from demonstrating that the reported gains are attributable to the proposed components rather than unverified assumptions.
major comments (2)
- [Abstract] Abstract: The reported final accuracies (60.14% Places2, 90.37% MIT67) are supplied with no experimental protocol, baselines, error bars, ablation studies, or training details, so the central performance claim and the contribution of FOSNet + SCL cannot be evaluated.
- [Abstract] Abstract: SCL is justified by the assumption that 'the scene class does not change all over the image', yet no validation, counter-example analysis, or ablation is provided to test whether this holds on the cited datasets (which routinely contain foreground objects, partial views, or mixed elements); this assumption is load-bearing for attributing the SOTA numbers to the loss design.
minor comments (1)
- [Abstract] Abstract: The sentence 'their state-of-the-art performance is obtained in two sets' is ambiguous and should explicitly name the two datasets that achieve SOTA.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] Abstract: The reported final accuracies (60.14% Places2, 90.37% MIT67) are supplied with no experimental protocol, baselines, error bars, ablation studies, or training details, so the central performance claim and the contribution of FOSNet + SCL cannot be evaluated.
  Authors: The abstract is length-limited and therefore omits protocol details, but the full manuscript provides them in Section 4 (implementation and training details) and Section 5 (baselines, comparisons on Places2, MIT Indoor67 and SUN397, and ablation studies on the fusion module and SCL). Single-run results are standard in this literature; error bars can be added if required. We will revise the abstract to include a short pointer to the experimental sections so that the performance claims are immediately traceable.
  revision: partial
- Referee: [Abstract] Abstract: SCL is justified by the assumption that 'the scene class does not change all over the image', yet no validation, counter-example analysis, or ablation is provided to test whether this holds on the cited datasets (which routinely contain foreground objects, partial views, or mixed elements); this assumption is load-bearing for attributing the SOTA numbers to the loss design.
  Authors: Section 3.2 motivates SCL from the whole-image labeling convention used in scene datasets. We agree that an explicit check of the assumption would improve attribution of gains. We will add a short discussion (with example images) of cases where the assumption may be violated (dominant foreground objects, mixed scenes) and an ablation that isolates the contribution of SCL versus the fusion backbone alone.
  revision: yes
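The error bars mentioned in the first response could be reported as the mean and sample standard deviation over repeated training runs. A minimal sketch, assuming one top-1 accuracy per random seed; `summarize_runs` is a hypothetical helper, and no values here come from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies.

    At least two runs are needed for a sample standard deviation.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def format_with_error_bar(accuracies, digits=2):
    """Format per-seed accuracies as 'mean ± std' for a results table."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"
```

Reporting the spread alongside the mean makes it possible to judge whether a gap between two methods exceeds run-to-run variation.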
Circularity Check
No circularity: empirical SOTA claims rest on experiments, not self-referential derivation
Full rationale
The paper introduces FOSNet architecture and scene coherence loss (SCL) motivated by an explicit assumption about scene properties ('sceneness' spreads and scene class is constant across the image). Performance numbers (60.14% Places2, 90.37% MIT67) are reported from direct experiments on standard datasets. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the abstract or provided text. The central claims are empirical results, not a derivation that reduces to its own inputs by construction; the assumption is presented as design motivation rather than a proven or fitted step.