FOSNet: An End-to-End Trainable Deep Neural Network for Scene Recognition

Euntai Kim; Hongje Seong; Junhyuk Hyun

arxiv: 1907.07570 · v2 · pith:NVBMLGUWnew · submitted 2019-07-17 · 💻 cs.CV

Hongje Seong, Junhyuk Hyun, Euntai Kim

Pith reviewed 2026-05-24 20:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords: scene recognition, FOSNet, scene coherence loss, object scene fusion, Places2, MIT Indoor67, SUN397, convolutional neural network

The pith

Fusing object and scene cues in a CNN with a new coherence loss reaches state-of-the-art accuracy on two scene recognition benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FOSNet, an end-to-end CNN that integrates object and scene information from an input image to predict the scene category. It adds a scene coherence loss that trains the network to keep scene predictions consistent across the image, drawing on the property that sceneness is uniform and the scene class does not vary within a single photo. Experiments on three standard datasets produce 60.14 percent accuracy on Places 2 and 90.37 percent on MIT Indoor 67, exceeding prior methods, with a second-place 77.28 percent on SUN 397. A reader would care because reliable scene labels support downstream tasks such as robot navigation and photo organization. The entire system is designed to be trained jointly without separate stages.

Core claim

The authors claim that the FOSNet architecture fuses object and scene streams inside one convolutional network and is trained with scene coherence loss to enforce uniform sceneness, producing the highest reported accuracies of 60.14 percent on Places 2 and 90.37 percent on MIT Indoor 67.

What carries the argument

Two components carry the argument: the FOS (fusion of object and scene) Net, which merges object and scene feature streams, and the scene coherence loss, which penalizes inconsistent scene predictions across an image.
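
To make the idea of merging two streams concrete, the sketch below rescales a scene feature vector with a sigmoid gate computed from an object feature vector. It is loosely inspired by context gating [31], where a sigmoid gate rescales a feature vector; computing the gate from the object stream is an assumption made here for illustration. The abstract does not describe the fusion mechanism, so `gated_fusion`, the weight matrix `W`, the bias `b`, and the toy values are all hypothetical and should not be read as the authors' module.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(scene_feat, object_feat, W, b):
    """Rescale each scene feature by a gate derived from the object feature.

    W has len(scene_feat) rows and len(object_feat) columns; b has
    len(scene_feat) entries. All names are illustrative assumptions.
    """
    fused = []
    for i, s in enumerate(scene_feat):
        pre_activation = sum(W[i][j] * o for j, o in enumerate(object_feat)) + b[i]
        fused.append(sigmoid(pre_activation) * s)
    return fused


# Toy inputs: with zero weights and biases every gate equals 0.5.
scene = [2.0, -1.0]
obj = [1.0, 0.0, 3.0]
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b = [0.0, 0.0]
fused = gated_fusion(scene, obj, W, b)
```

In an end-to-end setting `W` and `b` would be learned jointly with both streams, which is the property the summary emphasizes; this sketch only shows the forward computation.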

If this is right

The fused network plus coherence loss outperforms previous scene recognition methods on Places 2 and MIT Indoor 67.
End-to-end training lets the model learn how to combine object and scene cues without staged pipelines.
The coherence loss exploits the constant scene class property to raise accuracy on standard benchmarks.
The method places second on SUN 397, showing competitive results across three large datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same object-scene fusion idea could be tested on video or multi-view scene data where temporal consistency replaces spatial uniformity.
If the uniform-sceneness premise holds only for certain image types, performance might degrade on photos containing multiple distinct regions.
Adding the coherence loss to other CNN backbones might produce similar gains without redesigning the full FOSNet structure.

Load-bearing premise

The scene coherence loss improves performance because sceneness spreads evenly and the scene class remains the same throughout the image.
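
One way to picture this premise: if a network emits a scene-probability vector at each spatial location, a coherence penalty can measure how far those vectors stray from their image-wide average. The sketch below uses the mean squared deviation from that average. This is an assumed stand-in, since the abstract does not give the SCL formula; `coherence_penalty` and the toy grids are hypothetical.

```python
def coherence_penalty(loc_probs):
    """Mean squared deviation of per-location predictions from their average.

    loc_probs is a list of probability vectors, one per spatial location,
    all of the same length. A value of 0 means every location agrees.
    """
    n = len(loc_probs)
    k = len(loc_probs[0])
    mean = [sum(p[c] for p in loc_probs) / n for c in range(k)]
    total = sum((p[c] - mean[c]) ** 2 for p in loc_probs for c in range(k))
    return total / (n * k)


# Four locations that all agree on the same two-class distribution.
coherent = [[0.75, 0.25], [0.75, 0.25], [0.75, 0.25], [0.75, 0.25]]
# Four locations that alternate between two conflicting distributions.
mixed = [[0.75, 0.25], [0.25, 0.75], [0.75, 0.25], [0.25, 0.75]]

penalty_coherent = coherence_penalty(coherent)
penalty_mixed = coherence_penalty(mixed)
```

The penalty is zero for the coherent grid and positive for the mixed one, which is also where the premise is exposed: an image that genuinely contains two scene-like regions would be pushed toward a single label.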

What would settle it

Training the same FOSNet architecture without the scene coherence loss and measuring whether accuracy on Places 2 or MIT Indoor 67 drops by more than a few percentage points would test the loss contribution directly.
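
The comparison described here reduces to measuring top-1 accuracy for two trained variants on the same test set and taking the difference. A minimal sketch, assuming predictions and labels are available as plain lists; `top1_accuracy`, `ablation_gap`, and the toy labels are hypothetical, and the resulting number is not a result from the paper.

```python
def top1_accuracy(preds, labels):
    """Percentage of predictions that match the ground-truth labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and of equal length")
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return 100.0 * correct / len(labels)


def ablation_gap(preds_with, preds_without, labels):
    """Accuracy of the full model minus the ablated model, in percentage points."""
    return top1_accuracy(preds_with, labels) - top1_accuracy(preds_without, labels)


# Toy data only: four test images, one error for the ablated variant.
labels = ["kitchen", "beach", "office", "forest"]
with_loss = ["kitchen", "beach", "office", "forest"]
without_loss = ["kitchen", "beach", "lobby", "forest"]
gap = ablation_gap(with_loss, without_loss, labels)
```

Both variants must share the backbone, data split, and training schedule for the gap to be attributable to the loss alone.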

Figures

Figures reproduced from arXiv: 1907.07570 by Euntai Kim, Hongje Seong, Junhyuk Hyun.

**Figure 1.** An overall architecture of FOSNet.

**Figure 2.** Structure of ObjectNet.

… not need to be labeled in the scene recognition dataset; only a pre-trained CNN trained on the object recognition dataset is enough and the relationship between scene and objects is trained in a weakly supervised way. In this paper, a new fusion method named correlative context gating (CCG) is proposed. The CCG is an extended version of the CCM and it generates more accurate scene f…

**Figure 4.** Scene coherence in a scene image. Even if a scene image is divided

**Figure 5.** Visualization of scene coherence loss (SCL).

**Figure 6.** Illustration of convolution with zero padding and partial convolution.

**Figure 7.** Trainable fusion modules with object feature and scene feature. (a)

**Figure 8.** Classification loss and SCL curves of ResNet-18 trained (a) with

**Figure 9.** The class activation map (CAM) [29] results using ResNet-18. The ground truth about the scene class of the image is on top of the image. The

Original abstract

Scene recognition is an image recognition problem aimed at predicting the category of the place at which the image is taken. In this paper, a new scene recognition method using the convolutional neural network (CNN) is proposed. The proposed method is based on the fusion of the object and the scene information in the given image and the CNN framework is named as FOS (fusion of object and scene) Net. In addition, a new loss named scene coherence loss (SCL) is developed to train the FOSNet and to improve the scene recognition performance. The proposed SCL is based on the unique traits of the scene that the 'sceneness' spreads and the scene class does not change all over the image. The proposed FOSNet was experimented with three most popular scene recognition datasets, and their state-of-the-art performance is obtained in two sets: 60.14% on Places 2 and 90.37% on MIT indoor 67. The second highest performance of 77.28% is obtained on SUN 397.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FOSNet fuses object and scene streams with a coherence loss assuming uniform scene class across the image, but the abstract supplies no baselines, ablations, or protocol to support the SOTA numbers.

The main thing here is that FOSNet combines an object stream with a scene stream in a CNN and adds a scene coherence loss motivated by the claim that sceneness spreads and the scene label stays constant over the whole image. They report 60.14% on Places2 and 90.37% on MIT Indoor67 as state of the art, with 77.28% on SUN397 coming in second. That is the concrete output the abstract gives us. The fusion idea and the loss are presented as the contributions, and the loss is tied directly to the uniformity assumption. The paper tests the usual three scene datasets, which is the expected setup for this task.

Beyond that, the abstract gives almost nothing else. There is no description of the fusion mechanism, no list of baselines, no ablation results, and no training protocol or variance numbers. Without those, the performance numbers cannot be checked against prior CNN scene models or against simpler fusions.

The uniformity assumption behind the loss also looks exposed. Many images in these datasets contain foreground objects or mixed regions, so the premise that the scene class does not change across the image is not obviously true. If that assumption does not hold, the loss may not be the source of any gains, and the SOTA attribution becomes hard to accept.

This is the kind of incremental architecture paper that scene-recognition researchers might scan for the loss formulation. A reader already working on multi-stream models could extract the idea quickly if the full text supplies the missing implementation and comparison details. I would send it to peer review because the claims are on standard benchmarks and can be tested once the experimental section is examined, even though the current description leaves the central results unverified.

Referee Report

2 major / 1 minor

Summary. The paper proposes FOSNet, a CNN architecture for scene recognition that fuses object and scene information, trained end-to-end with a novel Scene Coherence Loss (SCL). SCL is motivated by the assumption that 'sceneness' spreads and the scene class remains constant across the full image. Experiments on Places2, MIT Indoor67, and SUN397 are reported to achieve SOTA accuracies of 60.14% and 90.37% on the first two datasets, respectively, with 77.28% (second-best) on SUN397.

Significance. If the performance claims hold under rigorous validation, the fusion of object/scene cues with a coherence loss could advance scene recognition methods by exploiting image-wide consistency properties. The work would benefit from demonstrating that the reported gains are attributable to the proposed components rather than unverified assumptions.

major comments (2)

[Abstract] The reported final accuracies (60.14% Places2, 90.37% MIT67) are supplied with no experimental protocol, baselines, error bars, ablation studies, or training details, so the central performance claim and the contribution of FOSNet + SCL cannot be evaluated.
[Abstract] SCL is justified by the assumption that 'the scene class does not change all over the image', yet no validation, counter-example analysis, or ablation is provided to test whether this holds on the cited datasets (which routinely contain foreground objects, partial views, or mixed elements); this assumption is load-bearing for attributing the SOTA numbers to the loss design.

minor comments (1)

[Abstract] The sentence 'their state-of-the-art performance is obtained in two sets' is ambiguous and should explicitly name the two datasets that achieve SOTA.
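
Error bars of the kind requested in the first major comment are commonly reported as the mean and sample standard deviation of accuracy over repeated training runs with different random seeds. A minimal sketch using only the Python standard library; `summarize_runs` is a hypothetical helper, and the accuracies are placeholder values chosen for illustration, not numbers from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder accuracies for three hypothetical seeds.
placeholder_runs = [60.0, 61.0, 59.0]
mean_acc, std_acc = summarize_runs(placeholder_runs)
report = f"{mean_acc:.2f} +/- {std_acc:.2f}"
```

A gap between two methods that is smaller than this spread would not support a ranking claim on its own.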

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract] The reported final accuracies (60.14% Places2, 90.37% MIT67) are supplied with no experimental protocol, baselines, error bars, ablation studies, or training details, so the central performance claim and the contribution of FOSNet + SCL cannot be evaluated.

Authors: The abstract is length-limited and therefore omits protocol details, but the full manuscript provides them in Section 4 (implementation and training details) and Section 5 (baselines, comparisons on Places2, MIT Indoor67 and SUN397, and ablation studies on the fusion module and SCL). Single-run results are standard in this literature; error bars can be added if required. We will revise the abstract to include a short pointer to the experimental sections so that the performance claims are immediately traceable.

revision: partial

Referee: [Abstract] SCL is justified by the assumption that 'the scene class does not change all over the image', yet no validation, counter-example analysis, or ablation is provided to test whether this holds on the cited datasets (which routinely contain foreground objects, partial views, or mixed elements); this assumption is load-bearing for attributing the SOTA numbers to the loss design.

Authors: Section 3.2 motivates SCL from the whole-image labeling convention used in scene datasets. We agree that an explicit check of the assumption would improve attribution of gains. We will add a short discussion (with example images) of cases where the assumption may be violated (dominant foreground objects, mixed scenes) and an ablation that isolates the contribution of SCL versus the fusion backbone alone.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical SOTA claims rest on experiments, not self-referential derivation

The paper introduces FOSNet architecture and scene coherence loss (SCL) motivated by an explicit assumption about scene properties ('sceneness' spreads and scene class is constant across the image). Performance numbers (60.14% Places2, 90.37% MIT67) are reported from direct experiments on standard datasets. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the abstract or provided text. The central claims are empirical results, not a derivation that reduces to its own inputs by construction; the assumption is presented as design motivation rather than a proven or fitted step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no equations, methods, or implementation details are available to identify free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

[1] S. Yang and D. Ramanan, “Multi-scale recognition with DAG-CNNs,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1215–1223.

[2] L. Shen, Z. Lin, and Q. Huang, “Relay backpropagation for effective learning of deep convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2016.

[3] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[4] J. Ryu, M.-H. Yang, and J. Lim, “Dft-based transformation invariant pooling layer for visual classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 84–99.

[5] L. Herranz, S. Jiang, and X. Li, “Scene recognition with cnns: objects, scales and dataset bias,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 571–579.

[6] X. Cheng, J. Lu, J. Feng, B. Yuan, and J. Zhou, “Scene recognition with objectness,” Pattern Recognition, vol. 74, pp. 474–487, 2018.

[7] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, “Knowledge guided disambiguation for large-scale scene classification with multi-resolution cnns,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2055–2068, 2017.

[8] A. Bayat and M. Pomplun, “Deriving high-level scene descriptions from deep scene cnn features,” in Image Processing Theory, Tools and Applications (IPTA), 2017 Seventh International Conference on, 2017.

[9] H. Seong, J. Hyun, H. Chang, S. Lee, S. Woo, and E. Kim, “Scene recognition via object-to-scene class conversion: end-to-end training,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), July 2019.

[10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.

[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.

[13] Z. Zhao and M. Larson, “From volcano to toyshop: Adaptive discriminative region discovery for scene recognition,” in Proceedings of the 26th ACM International Conference on Multimedia (ACM MM), 2018, pp. 1760–1768.

[14] X. Song, S. Jiang, and L. Herranz, “Multi-scale multi-feature context modeling for scene recognition in the semantic manifold,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2721–2735, 2017.

[15] R. Wu, B. Wang, W. Wang, and Y. Yu, “Harvesting discriminative meta objects with deep cnn features for scene classification,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1287–1295.

[16] J. Shi, H. Zhu, S. Yu, W. Wu, and H. Shi, “Scene categorization model using deep visually sensitive features,” IEEE Access, 2019.

[17] M. Pandey and S. Lazebnik, “Scene recognition and weakly supervised object localization with deformable part-based models,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.

[18] S. N. Parizi, J. G. Oberlin, and P. F. Felzenszwalb, “Reconfigurable models for scene recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2775–2782.

[19] X. Sun, L. Zhang, Z. Wang, J. Chang, Y. Yao, P. Li, and R. Zimmermann, “Scene categorization using deeply learned gaze shifting kernel,” IEEE Transactions on Cybernetics, 2019.

[20] N. Sun, W. Li, J. Liu, G. Han, and C. Wu, “Fusing object semantics and deep appearance features for scene recognition,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.

[21] G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, “Hybrid cnn and dictionary-based models for scene recognition and domain adaptation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 6, pp. 1263–1274, 2017.

[22] Z. Wang, L. Wang, Y. Wang, B. Zhang, and Y. Qiao, “Weakly supervised patchnets: Describing and aggregating local patches for scene recognition,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2028–2041, 2017.

[23] G. Nascimento, C. Laranjeira, V. Braz, A. Lacerda, and E. R. Nascimento, “A robust indoor scene recognition method based on sparse representation,” in Iberoamerican Congress on Pattern Recognition. Springer, 2017, pp. 408–415.

[24] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[26] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[27] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1492–1500.

[28] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.

[29] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921–2929.

[30] G. Liu, K. J. Shih, T. Wang, F. A. Reda, K. Sapra, Z. Yu, A. Tao, and B. Catanzaro, “Partial convolution based padding,” arXiv preprint arXiv:1811.11718, 2018.

[31] A. Miech, I. Laptev, and J. Sivic, “Learnable pooling with context gating for video classification,” arXiv preprint arXiv:1706.06905, 2017.

[32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning (ICML), 2015.

[33] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3485–3492.

[34] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 413–420.

[35] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 487–495.

[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.

[37] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.

[40] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.

[41] S. Gross and M. Wilber, “Training and investigating residual nets,” https://github.com/facebook/fb.resnet.torch, 2016.

[42] S. J. Pan, Q. Yang et al., “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.

[43] L. Shen, Z. Lin, G. Sun, and J. Hu, “Places401 and places365 models,” https://github.com/lishen-shirley/Places2-CNNs, 2016.

[44] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7263–7271.

[45] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[46] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing, “Object bank: A high-level image representation for scene classification & semantic feature sparsification,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2010, pp. 1378–1386.

Hongje Seong received the BS degree in electrical and electronic engineering from Yonsei University...

[1] [1]

Multi-scale recognition with DAG-CNNs,

S. Yang and D. Ramanan, “Multi-scale recognition with DAG-CNNs,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1215–1223

work page 2015

[2] [2]

Relay backpropagation for effective learning of deep convolutional neural networks,

L. Shen, Z. Lin, and Q. Huang, “Relay backpropagation for effective learning of deep convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV) , 2016. 10

work page 2016

[3] [3]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

work page 2018

[4] [4]

Dft-based transformation invariant pooling layer for visual classiﬁcation,

J. Ryu, M.-H. Yang, and J. Lim, “Dft-based transformation invariant pooling layer for visual classiﬁcation,” in Proceedings of the European Conference on Computer Vision (ECCV) , 2018, pp. 84–99

work page 2018

[5] [5]

Scene recognition with cnns: objects, scales and dataset bias,

L. Herranz, S. Jiang, and X. Li, “Scene recognition with cnns: objects, scales and dataset bias,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016, pp. 571–579

work page 2016

[6] [6]

Scene recognition with objectness,

X. Cheng, J. Lu, J. Feng, B. Yuan, and J. Zhou, “Scene recognition with objectness,” Pattern Recognition, vol. 74, pp. 474–487, 2018

work page 2018

[7] [7]

Knowledge guided disambiguation for large-scale scene classiﬁcation with multi-resolution cnns,

L. Wang, S. Guo, W. Huang, Y . Xiong, and Y . Qiao, “Knowledge guided disambiguation for large-scale scene classiﬁcation with multi-resolution cnns,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2055– 2068, 2017

work page 2055

[8] [8]

Deriving high-level scene descriptions from deep scene cnn features,

A. Bayat and M. Pomplun, “Deriving high-level scene descriptions from deep scene cnn features,” in Image Processing Theory, Tools and Applications (IPTA), 2017 Seventh International Conference on , 2017

work page 2017

[9] [9]

Scene recognition via object-to-scene class conversion: end-to-end training,

H. Seong, J. Hyun, H. Chang, S. Lee, S. Woo, and E. Kim, “Scene recognition via object-to-scene class conversion: end-to-end training,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), July 2019

work page 2019

[10] [10]

Imagenet large scale visual recognition challenge,

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015

work page 2015

[11] [11]

The pascal visual object classes (voc) challenge,

M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisser- man, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision , vol. 88, no. 2, pp. 303–338, 2010

work page 2010

[12] [12]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755

work page 2014

[13] [13]

From volcano to toyshop: Adaptive discrim- inative region discovery for scene recognition,

Z. Zhao and M. Larson, “From volcano to toyshop: Adaptive discrim- inative region discovery for scene recognition,” in Proceedings of the 26th ACM International Conference on Multimedia (ACM MM) , 2018, pp. 1760–1768

work page 2018

[14] [14]

Multi-scale multi-feature context modeling for scene recognition in the semantic manifold,

X. Song, S. Jiang, and L. Herranz, “Multi-scale multi-feature context modeling for scene recognition in the semantic manifold,” IEEE Trans- actions on Image Processing , vol. 26, no. 6, pp. 2721–2735, 2017

work page 2017

[15] [15]

Harvesting discriminative meta objects with deep cnn features for scene classiﬁcation,

R. Wu, B. Wang, W. Wang, and Y . Yu, “Harvesting discriminative meta objects with deep cnn features for scene classiﬁcation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1287–1295

work page 2015

[16] [16]

Scene categorization model using deep visually sensitive features,

J. Shi, H. Zhu, S. Yu, W. Wu, and H. Shi, “Scene categorization model using deep visually sensitive features,” IEEE Access, 2019

work page 2019

[17] [17]

Scene recognition and weakly supervised object localization with deformable part-based models,

M. Pandey and S. Lazebnik, “Scene recognition and weakly supervised object localization with deformable part-based models,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011

work page 2011

[18] [18]

Reconﬁgurable models for scene recognition,

S. N. Parizi, J. G. Oberlin, and P. F. Felzenszwalb, “Reconﬁgurable models for scene recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2012, pp. 2775– 2782

work page 2012

[19] [19]

Scene categorization using deeply learned gaze shifting kernel,

X. Sun, L. Zhang, Z. Wang, J. Chang, Y . Yao, P. Li, and R. Zimmermann, “Scene categorization using deeply learned gaze shifting kernel,” IEEE Transactions on Cybernetics , 2019

work page 2019

[20] [20]

Fusing object semantics and deep appearance features for scene recognition,

N. Sun, W. Li, J. Liu, G. Han, and C. Wu, “Fusing object semantics and deep appearance features for scene recognition,” IEEE Transactions on Circuits and Systems for Video Technology, 2018

work page 2018

[21] [21]

Hybrid cnn and dictionary-based models for scene recognition and domain adaptation,

G.-S. Xie, X.-Y. Zhang, S. Yan, and C.-L. Liu, “Hybrid cnn and dictionary-based models for scene recognition and domain adaptation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 6, pp. 1263–1274, 2017

work page 2017

[22] [22]

Weakly supervised patchnets: Describing and aggregating local patches for scene recognition,

Z. Wang, L. Wang, Y. Wang, B. Zhang, and Y. Qiao, “Weakly supervised patchnets: Describing and aggregating local patches for scene recognition,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 2028–2041, 2017

work page 2017

[23] [23]

A robust indoor scene recognition method based on sparse representation,

G. Nascimento, C. Laranjeira, V. Braz, A. Lacerda, and E. R. Nascimento, “A robust indoor scene recognition method based on sparse representation,” in Iberoamerican Congress on Pattern Recognition. Springer, 2017, pp. 408–415

work page 2017

[24] [24]

Support-vector networks,

C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995

work page 1995

[25] [25]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778

work page 2016

[26] [26]

Densely connected convolutional networks,

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

work page 2017

[27] [27]

Aggregated residual transformations for deep neural networks,

S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1492–1500

work page 2017

[28] [28]

Places: A 10 million image database for scene recognition,

B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018

work page 2018

[29] [29]

Learning deep features for discriminative localization,

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921–2929

work page 2016

[30] [30]

Partial Convolution based Padding

G. Liu, K. J. Shih, T. Wang, F. A. Reda, K. Sapra, Z. Yu, A. Tao, and B. Catanzaro, “Partial convolution based padding,” arXiv preprint arXiv:1811.11718, 2018

work page 2018

[31] [31]

Learnable pooling with Context Gating for video classification

A. Miech, I. Laptev, and J. Sivic, “Learnable pooling with context gating for video classification,” arXiv preprint arXiv:1706.06905, 2017

work page 2017

[32] [32]

Batch normalization: Accelerating deep network training by reducing internal covariate shift,

S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning (ICML), 2015

work page 2015

[33] [33]

Sun database: Large-scale scene recognition from abbey to zoo,

J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3485–3492

work page 2010

[34] [34]

Recognizing indoor scenes,

A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 413–420

work page 2009

[35] [35]

Learning deep features for scene recognition using places database,

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2014, pp. 487–495

work page 2014

[36] [36]

Imagenet classification with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105

work page 2012

[37] [37]

Very deep convolutional networks for large-scale image recognition,

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015

work page 2015

[38] [38]

Going deeper with convolutions,

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

work page 2015

[39] [39]

Rethinking the inception architecture for computer vision,

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826

work page 2016

[40] [40]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017

work page 2017

[41] [41]

Training and investigating residual nets,

S. Gross and M. Wilber, “Training and investigating residual nets,” https://github.com/facebook/fb.resnet.torch, 2016

work page 2016

[42] [42]

A survey on transfer learning,

S. J. Pan, Q. Yang et al., “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010

work page 2010

[43] [43]

Places401 and places365 models,

L. Shen, Z. Lin, G. Sun, and J. Hu, “Places401 and places365 models,” https://github.com/lishen-shirley/Places2-CNNs, 2016

work page 2016

[44] [44]

Yolo9000: better, faster, stronger,

J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7263–7271

work page 2017

[45] [45]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997

[46] [46]

Object bank: A high-level image representation for scene classification & semantic feature sparsification,

L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing, “Object bank: A high-level image representation for scene classification & semantic feature sparsification,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2010, pp. 1378–1386.

work page 2010

Hongje Seong received the BS degree in electrical and electronic engineering from Yonsei University...