
arxiv: 1907.07570 · v2 · pith:NVBMLGUWnew · submitted 2019-07-17 · 💻 cs.CV

FOSNet: An End-to-End Trainable Deep Neural Network for Scene Recognition

Pith reviewed 2026-05-24 20:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene recognition · FOSNet · scene coherence loss · object scene fusion · Places2 · MIT Indoor67 · SUN397 · convolutional neural network

The pith

Fusing object and scene cues in a CNN with a new coherence loss reaches state-of-the-art accuracy on two scene recognition benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FOSNet, an end-to-end CNN that integrates object and scene information from an input image to predict the scene category. It adds a scene coherence loss that trains the network to keep scene predictions consistent across the image, drawing on the property that sceneness is uniform and the scene class does not vary within a single photo. Experiments on three standard datasets produce 60.14 percent accuracy on Places 2 and 90.37 percent on MIT Indoor 67, exceeding prior methods, with a second-place 77.28 percent on SUN 397. A reader would care because reliable scene labels support downstream tasks such as robot navigation and photo organization. The entire system is designed to be trained jointly without separate stages.

Core claim

The authors claim that the FOSNet architecture fuses object and scene streams inside one convolutional network and is trained with scene coherence loss to enforce uniform sceneness, producing the highest reported accuracies of 60.14 percent on Places 2 and 90.37 percent on MIT Indoor 67.

What carries the argument

FOSNet (fusion of object and scene network), which merges object and scene feature streams, together with the scene coherence loss, which penalizes inconsistent scene predictions across an image.
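The text available here does not give the fusion equations, so the following is only a rough illustration of what merging an object stream into a scene stream through a learned gate can look like. It is a generic sigmoid gate, not the paper's fusion module; the function name `gated_fusion`, the weight layout, and all numbers are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(scene_feat, object_feat, weights, bias):
    """Scale each scene channel by a gate computed from the object feature.

    gate[i]  = sigmoid(sum_j weights[i][j] * object_feat[j] + bias[i])
    fused[i] = gate[i] * scene_feat[i]

    Assumption for illustration only: one gate per scene channel.
    """
    fused = []
    for i, s in enumerate(scene_feat):
        pre = sum(w * o for w, o in zip(weights[i], object_feat)) + bias[i]
        fused.append(sigmoid(pre) * s)
    return fused


# Toy inputs (hypothetical values, not taken from the paper).
scene_feat = [2.0, -1.0]
object_feat = [1.0, 0.0, -1.0]
weights = [
    [0.0, 0.0, 0.0],   # gate 0 ignores the objects: sigmoid(0) = 0.5
    [5.0, 0.0, -5.0],  # gate 1 opens almost fully for this object pattern
]
bias = [0.0, 0.0]

fused = gated_fusion(scene_feat, object_feat, weights, bias)
print(fused)
```

In a trainable network the weights and bias would be learned jointly with both streams, which is the sense in which the summary calls the system end-to-end.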

If this is right

  • The fused network plus coherence loss outperforms previous scene recognition methods on Places 2 and MIT Indoor 67.
  • End-to-end training lets the model learn how to combine object and scene cues without staged pipelines.
  • The coherence loss exploits the constant scene class property to raise accuracy on standard benchmarks.
  • The method places second on SUN 397, showing competitive results across three large datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same object-scene fusion idea could be tested on video or multi-view scene data where temporal consistency replaces spatial uniformity.
  • If the uniform-sceneness premise holds only for certain image types, performance might degrade on photos containing multiple distinct regions.
  • Adding the coherence loss to other CNN backbones might produce similar gains without redesigning the full FOSNet structure.

Load-bearing premise

The scene coherence loss improves performance because sceneness spreads evenly and the scene class remains the same throughout the image.
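To make the premise concrete, here is a minimal sketch of a penalty that is zero when every spatial cell predicts the same class distribution and grows when neighbouring cells disagree. The paper's actual SCL formula is not available in this text, so the squared-difference form, the grid layout, and the name `coherence_penalty` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coherence_penalty(score_grid):
    """Mean squared difference between class distributions of adjacent cells.

    score_grid[r][c] is a list of class scores for spatial cell (r, c).
    Returns 0.0 when every cell predicts the same distribution.
    Assumed form, not the paper's definition of SCL.
    """
    probs = [[softmax(cell) for cell in row] for row in score_grid]
    rows, cols = len(probs), len(probs[0])
    total, pairs = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    total += sum(
                        (p - q) ** 2
                        for p, q in zip(probs[r][c], probs[nr][nc])
                    )
                    pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical 2x2 grids with two scene classes.
uniform = [[[2.0, 0.0], [2.0, 0.0]],
           [[2.0, 0.0], [2.0, 0.0]]]
mixed = [[[2.0, 0.0], [0.0, 2.0]],   # top-right cell disagrees
         [[2.0, 0.0], [2.0, 0.0]]]

print(coherence_penalty(uniform), coherence_penalty(mixed))
```

A term of this kind would be added to the usual classification loss with some weight. An image that really does contain two scene types would be penalized even when each cell is locally right, which is why the premise is load-bearing.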

What would settle it

Training the same FOSNet architecture without the scene coherence loss and measuring whether accuracy on Places 2 or MIT Indoor 67 drops by more than a few percentage points would test the loss contribution directly.
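The comparison described above reduces to a difference of two top-1 accuracies on the same test labels. The helper below is a small sketch of that bookkeeping; the function names and the toy predictions are hypothetical and are not results from the paper.

```python
def top1_accuracy(predictions, labels):
    """Top-1 accuracy in percent for equal-length prediction and label lists."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def ablation_gap(preds_with_loss, preds_without_loss, labels):
    """Accuracy with the loss minus accuracy without it, in percentage points."""
    return top1_accuracy(preds_with_loss, labels) - top1_accuracy(
        preds_without_loss, labels
    )


# Toy data (hypothetical, for illustration only).
labels = ["kitchen", "beach", "office", "beach"]
with_scl = ["kitchen", "beach", "office", "forest"]      # 3 of 4 correct
without_scl = ["kitchen", "forest", "office", "forest"]  # 2 of 4 correct

gap = ablation_gap(with_scl, without_scl, labels)
print(gap)
```

On a real benchmark the gap would ideally be reported over several training seeds, since a single run cannot separate the effect of the loss from run-to-run variation.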

Figures

Figures reproduced from arXiv: 1907.07570 by Euntai Kim, Hongje Seong, Junhyuk Hyun.

Figure 1: An overall architecture of FOSNet. Labels visible in the diagram include ObjectNet, object feature, object score, GAP, and convolution layers.
Figure 2: Structure of ObjectNet. … not need to be labeled in the scene recognition dataset; only a pre-trained CNN trained on the object recognition dataset is enough and the relationship between scene and objects is trained in a weakly supervised way. In this paper, a new fusion method named correlative context gating (CCG) is proposed. The CCG is an extended version of the CCM and it generates more accurate scene f…
Figure 4: Scene coherence in a scene image. Even if a scene image is divided …
Figure 5: Visualization of scene coherence loss (SCL).
Figure 6: Illustration of convolution with zero padding and partial convolution.
Figure 7: Trainable fusion modules with object feature and scene feature. (a) …
Figure 8: Classification loss and SCL curves of ResNet-18 trained (a) with …
Figure 9: The class activation map (CAM) [29] results using ResNet-18. The ground truth about the scene class of the image is on top of the image. The …
Original abstract

Scene recognition is an image recognition problem aimed at predicting the category of the place at which the image is taken. In this paper, a new scene recognition method using the convolutional neural network (CNN) is proposed. The proposed method is based on the fusion of the object and the scene information in the given image and the CNN framework is named as FOS (fusion of object and scene) Net. In addition, a new loss named scene coherence loss (SCL) is developed to train the FOSNet and to improve the scene recognition performance. The proposed SCL is based on the unique traits of the scene that the 'sceneness' spreads and the scene class does not change all over the image. The proposed FOSNet was experimented with three most popular scene recognition datasets, and their state-of-the-art performance is obtained in two sets: 60.14% on Places 2 and 90.37% on MIT indoor 67. The second highest performance of 77.28% is obtained on SUN 397.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes FOSNet, a CNN architecture for scene recognition that fuses object and scene information, trained end-to-end with a novel Scene Coherence Loss (SCL). SCL is motivated by the assumption that 'sceneness' spreads and the scene class remains constant across the full image. Experiments on Places2, MIT Indoor67, and SUN397 are reported to achieve SOTA accuracies of 60.14% and 90.37% on the first two datasets, respectively, with 77.28% (second-best) on SUN397.

Significance. If the performance claims hold under rigorous validation, the fusion of object/scene cues with a coherence loss could advance scene recognition methods by exploiting image-wide consistency properties. The work would benefit from demonstrating that the reported gains are attributable to the proposed components rather than unverified assumptions.

major comments (2)
  1. [Abstract] The reported final accuracies (60.14% Places2, 90.37% MIT67) are supplied with no experimental protocol, baselines, error bars, ablation studies, or training details, so the central performance claim and the contribution of FOSNet + SCL cannot be evaluated.
  2. [Abstract] SCL is justified by the assumption that 'the scene class does not change all over the image', yet no validation, counter-example analysis, or ablation is provided to test whether this holds on the cited datasets (which routinely contain foreground objects, partial views, or mixed elements); this assumption is load-bearing for attributing the SOTA numbers to the loss design.
minor comments (1)
  1. [Abstract] The sentence 'their state-of-the-art performance is obtained in two sets' is ambiguous and should explicitly name the two datasets that achieve SOTA.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The reported final accuracies (60.14% Places2, 90.37% MIT67) are supplied with no experimental protocol, baselines, error bars, ablation studies, or training details, so the central performance claim and the contribution of FOSNet + SCL cannot be evaluated.

    Authors: The abstract is length-limited and therefore omits protocol details, but the full manuscript provides them in Section 4 (implementation and training details) and Section 5 (baselines, comparisons on Places2, MIT Indoor67 and SUN397, and ablation studies on the fusion module and SCL). Single-run results are standard in this literature; error bars can be added if required. We will revise the abstract to include a short pointer to the experimental sections so that the performance claims are immediately traceable. revision: partial

  2. Referee: [Abstract] Abstract: SCL is justified by the assumption that 'the scene class does not change all over the image', yet no validation, counter-example analysis, or ablation is provided to test whether this holds on the cited datasets (which routinely contain foreground objects, partial views, or mixed elements); this assumption is load-bearing for attributing the SOTA numbers to the loss design.

    Authors: Section 3.2 motivates SCL from the whole-image labeling convention used in scene datasets. We agree that an explicit check of the assumption would improve attribution of gains. We will add a short discussion (with example images) of cases where the assumption may be violated (dominant foreground objects, mixed scenes) and an ablation that isolates the contribution of SCL versus the fusion backbone alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical SOTA claims rest on experiments, not self-referential derivation

full rationale

The paper introduces FOSNet architecture and scene coherence loss (SCL) motivated by an explicit assumption about scene properties ('sceneness' spreads and scene class is constant across the image). Performance numbers (60.14% Places2, 90.37% MIT67) are reported from direct experiments on standard datasets. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the abstract or provided text. The central claims are empirical results, not a derivation that reduces to its own inputs by construction; the assumption is presented as design motivation rather than a proven or fitted step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no equations, methods, or implementation details are available to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5717 in / 1007 out tokens · 22048 ms · 2026-05-24T20:26:22.236518+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

  [1] S. Yang and D. Ramanan, “Multi-scale recognition with DAG-CNNs,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1215–1223.

  [2] L. Shen, Z. Lin, and Q. Huang, “Relay backpropagation for effective learning of deep convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016.

  [3] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

  [4] J. Ryu, M.-H. Yang, and J. Lim, “DFT-based transformation invariant pooling layer for visual classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 84–99.

  [5] L. Herranz, S. Jiang, and X. Li, “Scene recognition with CNNs: objects, scales and dataset bias,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 571–579.

  [6] X. Cheng, J. Lu, J. Feng, B. Yuan, and J. Zhou, “Scene recognition with objectness,” Pattern Recognition, vol. 74, pp. 474–487, 2018.

  [7] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, “Knowledge guided disambiguation for large-scale scene classification with multi-resolution CNNs,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2055–2068, 2017.

  [8] A. Bayat and M. Pomplun, “Deriving high-level scene descriptions from deep scene CNN features,” in Image Processing Theory, Tools and Applications (IPTA), 2017 Seventh International Conference on, 2017.

  [9] H. Seong, J. Hyun, H. Chang, S. Lee, S. Woo, and E. Kim, “Scene recognition via object-to-scene class conversion: end-to-end training,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), July 2019.

  [10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

  [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

  [12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.

  [13] Z. Zhao and M. Larson, “From volcano to toyshop: Adaptive discriminative region discovery for scene recognition,” in Proceedings of the 26th ACM International Conference on Multimedia (ACM MM), 2018, pp. 1760–1768.

  [14] X. Song, S. Jiang, and L. Herranz, “Multi-scale multi-feature context modeling for scene recognition in the semantic manifold,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2721–2735, 2017.

  [15] R. Wu, B. Wang, W. Wang, and Y. Yu, “Harvesting discriminative meta objects with deep CNN features for scene classification,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1287–1295.

  [16] J. Shi, H. Zhu, S. Yu, W. Wu, and H. Shi, “Scene categorization model using deep visually sensitive features,” IEEE Access, 2019.

  [17] M. Pandey and S. Lazebnik, “Scene recognition and weakly supervised object localization with deformable part-based models,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.

  [18] S. N. Parizi, J. G. Oberlin, and P. F. Felzenszwalb, “Reconfigurable models for scene recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2775–2782.

  [19] X. Sun, L. Zhang, Z. Wang, J. Chang, Y. Yao, P. Li, and R. Zimmermann, “Scene categorization using deeply learned gaze shifting kernel,” IEEE Transactions on Cybernetics, 2019.

  [20] N. Sun, W. Li, J. Liu, G. Han, and C. Wu, “Fusing object semantics and deep appearance features for scene recognition,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.

  [21] G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, “Hybrid CNN and dictionary-based models for scene recognition and domain adaptation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 6, pp. 1263–1274, 2017.

  [22] Z. Wang, L. Wang, Y. Wang, B. Zhang, and Y. Qiao, “Weakly supervised PatchNets: Describing and aggregating local patches for scene recognition,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2028–2041, 2017.

  [23] G. Nascimento, C. Laranjeira, V. Braz, A. Lacerda, and E. R. Nascimento, “A robust indoor scene recognition method based on sparse representation,” in Iberoamerican Congress on Pattern Recognition. Springer, 2017, pp. 408–415.

  [24] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

  [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

  [26] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

  [27] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1492–1500.

  [28] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.

  [29] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921–2929.

  [30] G. Liu, K. J. Shih, T. Wang, F. A. Reda, K. Sapra, Z. Yu, A. Tao, and B. Catanzaro, “Partial convolution based padding,” arXiv preprint arXiv:1811.11718, 2018.

  [31] A. Miech, I. Laptev, and J. Sivic, “Learnable pooling with context gating for video classification,” arXiv preprint arXiv:1706.06905, 2017.

  [32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning (ICML), 2015.

  [33] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3485–3492.

  [34] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 413–420.

  [35] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using Places database,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 487–495.

  [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.

  [37] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.

  [38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

  [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.

  [40] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.

  [41] S. Gross and M. Wilber, “Training and investigating residual nets,” https://github.com/facebook/fb.resnet.torch, 2016.

  [42] S. J. Pan, Q. Yang et al., “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.

  [43] L. Shen, Z. Lin, G. Sun, and J. Hu, “Places401 and Places365 models,” https://github.com/lishen-shirley/Places2-CNNs, 2016.

  [44] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7263–7271.

  [45] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

  [46] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing, “Object bank: A high-level image representation for scene classification & semantic feature sparsification,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2010, pp. 1378–1386.

Hongje Seong received the BS degree in electrical and electronic engineering from Yonsei University...