End-To-End Deep Neural Network for Salient Object Detection in Complex Environments

Published: December 15, 2023
doi: 10.3791/65554

Summary

The present protocol describes a novel end-to-end salient object detection algorithm. It leverages deep neural networks to enhance the precision of salient object detection within intricate environmental contexts.

Abstract

Salient object detection has emerged as a burgeoning area of interest within computer vision. However, prevailing algorithms exhibit diminished precision when tasked with detecting salient objects in intricate, multifaceted environments. In light of this concern, this article presents an end-to-end deep neural network that aims to detect salient objects within complex environments. Comprising two interrelated components, namely a pixel-level multiscale fully convolutional network and a deep encoder-decoder network, the proposed network integrates contextual semantics to produce visual contrast across multiscale feature maps while employing deep and shallow image features to improve the accuracy of object boundary identification. The integration of a fully connected conditional random field (CRF) model further enhances the spatial coherence and contour delineation of salient maps. The proposed algorithm is extensively evaluated against 10 contemporary algorithms on the SOD and ECSSD databases. The evaluation results demonstrate that the proposed algorithm outperforms the other approaches in precision and accuracy, thereby establishing its efficacy for salient object detection within complex environments.

Introduction

Salient object detection mimics human visual attention, swiftly identifying key image regions while suppressing background information. This technique is widely employed as a pre-processing tool in tasks such as image cropping1, semantic segmentation2, and image editing3. It streamlines tasks like background replacement and foreground extraction, improving editing efficiency and precision. Additionally, it aids in semantic segmentation by enhancing target localization. The potential of salient object detection to enhance computational efficiency and conserve memory underscores its significant research and application prospects.

Over the years, salient object detection has evolved from traditional algorithms to deep learning algorithms, with the objective of narrowing the gap between salient object detection and human visual mechanisms. This has led to the adoption of deep convolutional network models for the study of salient object detection. Borji et al.4 summarized and generalized most of the classical traditional algorithms, which rely on the underlying features of the image. Despite some improvement in detection accuracy, reliance on manual experience and cognition continues to pose challenges for salient object detection in complex environments.

Convolutional neural networks (CNNs) are widely used in the domain of salient object detection. In this context, deep convolutional neural networks update their weights through autonomous learning. Convolutional neural networks extract contextual semantics from images through cascaded convolutional and pooling layers, enabling the learning of complex image features at higher levels; these features have greater discriminative and representational power for salient object detection in different environments.

In 2016, fully convolutional neural networks5 gained significant traction as a popular approach for salient object detection, and researchers began building pixel-level salient object detection on this basis. Many models are built on existing networks (e.g., VGG166, ResNet7), aiming to enhance image representation and strengthen edge detection.

Liu et al.8 used a pre-trained neural network as the framework to compute the image globally and then refined the object boundary using a hierarchical network; the combination of the two networks forms the final deep saliency network. This was accomplished by repeatedly feeding the previously acquired salient map into the network as prior knowledge. Zhang et al.9 effectively fused image semantic and spatial information using deep networks with bidirectional information transfer from shallow to deep and from deep to shallow layers, respectively. The detection of salient objects using a mutual learning deep model was put forward by Wu et al.10; the model utilizes foreground and edge information within a convolutional neural network to facilitate the detection process. Li et al.11 employed the 'hole algorithm' in neural networks to address the problem of fixed receptive fields across different layers of deep neural networks in the context of salient object detection. However, super-pixel segmentation is used for object edge acquisition, greatly increasing the computational effort and computing time. Ren et al.12 devised a multi-scale encoder-decoder network to detect salient objects and utilized convolutional neural networks to effectively combine deep and shallow features. Although this approach resolves the challenge of boundary blurring in object detection, the multi-scale fusion of information unavoidably increases computational demands.

The literature review13 summarizes saliency detection from traditional methods to deep learning methods, clearly tracing the evolution of salient object detection from its origins to the deep learning era. Various RGB-D-based salient object detection models with good performance have been proposed in the literature14. The above literature reviews classify the various types of salient object detection algorithms, describe their application scenarios, the databases used, and the evaluation metrics, and also provide qualitative and quantitative analyses of the reviewed algorithms on those databases and metrics.

All the above algorithms have obtained remarkable results on public databases, providing a basis for salient object detection in complex environments. Although there have been numerous research achievements in this field both domestically and internationally, some issues remain to be addressed. (1) Traditional non-deep learning algorithms tend to have low accuracy because they rely on manually labeled features such as color, texture, and frequency, which are easily affected by subjective experience and perception; this makes it difficult for them to handle the intricate scenarios found in complex environments. (2) Conventional region-level detection methods can be computationally expensive, often ignore spatial consistency, and tend to detect object boundaries poorly; these issues must be addressed to enhance the precision of salient object detection. (3) Salient object detection in intricate environments poses a serious challenge for most algorithms because of variable backgrounds (similar background and foreground colors, complex background textures, etc.), uncertainties such as inconsistent object sizes, and the unclear definition of foreground and background edges.

Most current algorithms exhibit low accuracy when detecting salient objects in complex environments with similar background and foreground colors, complex background textures, and blurred edges. Although current deep learning-based salient object detection algorithms achieve higher accuracy than traditional detection methods, the underlying image features they utilize still fall short of characterizing semantic features effectively, leaving room for improvement in their performance.

In summary, this study proposes an end-to-end deep neural network for salient object detection, aiming to enhance the accuracy of detection in complex environments, improve target edges, and better characterize semantic features. The contributions of this paper are as follows: (1) The first network employs VGG16 as the base network and modifies its five pooling layers using the 'hole algorithm'11. This pixel-level multi-scale fully convolutional neural network learns image features from different spatial scales, addressing the challenge of fixed receptive fields across the various layers of deep neural networks and enhancing detection accuracy in the salient regions of interest. (2) The second network is a deep encoder-decoder network built on VGG16 that extracts deep features from the encoder and shallow features from the decoder. Fusing these features improves the detection accuracy of object boundaries and enriches semantic information, particularly in complex environments with variable backgrounds, inconsistent object sizes, and indistinct boundaries between foreground and background. (3) A fully connected conditional random field (CRF) model is integrated to augment the spatial coherence and contour precision of the salient maps. The effectiveness of this approach was evaluated on the SOD and ECSSD datasets with complex backgrounds and was found to be statistically significant.

Related work
Fu et al.15 proposed a joint approach using RGB and deep learning for salient object detection. Lai et al.16 introduced a weakly supervised model for salient object detection, learning saliency from annotations, primarily utilizing scribble labels to save annotation time. While these algorithms presented a fusion of two complementary networks for saliency object detection, they lack in-depth investigation into saliency detection under complex scenarios. Wang et al.17 designed a two-mode iterative fusion of neural network features, both bottom-up and top-down, progressively optimizing the results of the previous iteration until convergence. Zhang et al.18 effectively fused image semantic and spatial information using deep networks with bidirectional information transfer from shallow to deep and from deep to shallow layers, respectively. The detection of salient objects using a mutual learning deep model was proposed by Wu et al.19. The model utilizes foreground and edge information within a convolutional neural network to facilitate the detection process. These deep neural network-based salient object detection models have achieved remarkable performance on publicly available datasets, enabling salient object detection in complex natural scenes. Nevertheless, designing even more superior models remains an important objective in this research field and serves as the primary motivation for this study.

Overall framework
The proposed model's schematic representation, as depicted in Figure 1, is primarily derived from the VGG16 architecture, incorporating both a pixel-level multiscale fully convolutional neural network (DCL) and a deep encoder-decoder network (DEDN). The model eliminates all final pooling and fully connected layers of VGG16 while accommodating input image dimensions of W × H. The operational mechanism involves the initial processing of the input image via the DCL, facilitating the extraction of deep features, while shallow features are obtained from the DEDN networks. The amalgamation of these characteristics is subsequently subjected to a fully connected conditional random field (CRF) model, augmenting the spatial coherence and contour accuracy of the saliency maps produced.

To ascertain the model's efficacy, it underwent testing and validation on the SOD20 and ECSSD21 datasets with intricate backgrounds. After the input image passes through the DCL, feature maps of different scales with various receptive fields are obtained, and contextual semantics are combined to produce a W × H salient map with inter-dimensional coherence. The DCL employs a pair of convolutional layers with 7 × 7 kernels to substitute the final pooling layer of the original VGG16 network, enhancing the preservation of spatial information in the feature maps. Similarly, the deep encoder-decoder network (DEDN) utilizes convolutional layers with 3 × 3 kernels in the decoders and a single convolutional layer after the last decoding module. Leveraging deep and shallow features of the image, it generates a salient map with a spatial dimension of W × H, addressing the challenge of indistinct object boundaries. The study describes a pioneering technique for salient object detection that amalgamates the DCL and DEDN models into a unified network. The weights of these two deep networks are learned through a training process, and the resultant saliency maps are merged and then refined using a fully connected conditional random field (CRF). The primary objective of this refinement is to improve spatial consistency and contour localization.
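To make the overall data flow concrete, the following is a minimal sketch of how the two saliency maps might be produced, fused, and handed to CRF refinement. The module names (dcl_net, dedn_net), the linear form of the fusion, and the helper function are illustrative assumptions rather than the authors' released code; the CRF step refers to the pydensecrf package listed in the Table of Materials.

```python
import torch

def predict_saliency(image, dcl_net, dedn_net, w1, w2):
    """image: 1 x 3 x H x W tensor.
    dcl_net : pixel-level multiscale fully convolutional network (DCL)
    dedn_net: deep encoder-decoder network (DEDN)
    w1, w2  : learned fusion weights for the two saliency maps (S1, S2).
    Returns a fused H x W saliency map in [0, 1]."""
    s1 = dcl_net(image)                 # W x H map from multiscale contextual contrast
    s2 = dedn_net(image)                # W x H map from deep + shallow feature fusion
    fused = torch.sigmoid(w1 * s1 + w2 * s2)   # assumed weighted linear fusion
    return fused.squeeze()

# The fused map would then be refined with a fully connected CRF
# (e.g., via the pydensecrf package) to sharpen contours and spatial coherence.
```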

Pixel-level multiscale fully convolutional neural network
The VGG16 architecture originally contains five pooling layers, each with a stride of 2. Each pooling layer compresses the spatial size of the feature maps, allowing the number of channels to increase and more contextual information to be captured. The DCL model is inspired by literature13 and is an improvement on the VGG16 framework. In this article, a pixel-level DCL model11 is used, as shown in Figure 2, within the architecture of VGG16, a deep convolutional neural network. The first four max pooling layers are each connected to three kernels: the first kernel is 3 × 3 × 128, the second is 1 × 1 × 128, and the third is 1 × 1 × 1. So that the feature maps obtained after the first four pooling layers, each connected to the three kernels, share a uniform size equal to one-eighth of the original image, the stride of the first kernel connected to these four max pooling layers is set to 4, 2, 1, and 1, respectively.
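As an illustration of this branch structure, the following is a minimal PyTorch sketch of one side branch attached to a pooling layer. The module name, padding choices, and the ReLU placement are assumptions made for readability and are not taken from the original implementation; the channel counts in the usage comment are the standard VGG16 values.

```python
import torch.nn as nn

class DCLBranch(nn.Module):
    """One multiscale side branch: 3 x 3 x 128 -> 1 x 1 x 128 -> 1 x 1 x 1.
    The stride of the first kernel (4, 2, 1, or 1) brings each branch's output
    to one-eighth of the input resolution."""
    def __init__(self, in_channels, stride):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 128, kernel_size=3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(128, 128, kernel_size=1)
        self.conv3 = nn.Conv2d(128, 1, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)  # activation placement is an assumption

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        return self.conv3(x)   # single-channel map at 1/8 of the input resolution

# Example: branches after VGG16 pooling layers 1-4 with strides 4, 2, 1, 1
# branches = [DCLBranch(c, s) for c, s in zip([64, 128, 256, 512], [4, 2, 1, 1])]
```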

To preserve the original receptive fields of the different kernels, the "hole algorithm" (dilated convolution) proposed in literature11 is used to enlarge the kernels by inserting zeros, thereby maintaining their receptive fields. Because the four feature maps are connected to first kernels with different strides, the feature maps produced in the final stage possess identical dimensions. The four feature maps constitute a set of multi-scale features obtained from distinct scales, each representing a receptive field of a different size. The feature maps obtained from the four intermediate layers are concatenated with the final feature map derived from VGG16, generating a 5-channel output. This output is then passed through a 1 × 1 × 1 kernel with a sigmoid activation function, producing the salient map (at one-eighth of the original resolution). The map is up-sampled and enlarged using bilinear interpolation so that the resulting saliency map has the same resolution as the input image.
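A minimal sketch of how the five single-channel maps could be fused, assuming the branch outputs described above and a standard bilinear up-sampling step; the class name and argument layout are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCLFusionHead(nn.Module):
    """Concatenates the four branch maps with the final VGG16-derived map
    (5 channels total), applies a 1 x 1 kernel with sigmoid, and up-samples
    the 1/8-resolution result back to the input size via bilinear interpolation."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(5, 1, kernel_size=1)

    def forward(self, branch_maps, final_map, out_size):
        x = torch.cat(branch_maps + [final_map], dim=1)   # N x 5 x H/8 x W/8
        x = torch.sigmoid(self.fuse(x))                    # N x 1 x H/8 x W/8
        return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
```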

Deep encoder-decoder network
Similarly, the VGG16 network is employed as the backbone network. VGG16 is characterized by shallow feature maps with few channels but high resolution and deep feature maps with many channels but low resolution. Pooling layers and down-sampling increase the computational speed of the deep network at the cost of reducing the feature map resolution. To address this issue, following the analysis in literature14, the encoder network modifies the fully connected layers after the last pooling layer of the original VGG16, replacing them with two convolutional layers with 7 × 7 kernels (larger convolutional kernels increase the receptive field). Both convolutional layers are followed by a batch normalization (BN) operation and a rectified linear unit (ReLU). This adjustment yields an encoder output feature map that better preserves image spatial information.
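The following is a minimal sketch of that encoder modification, assuming the standard torchvision VGG16 feature extractor as the starting point; dropping the final pooling layer, the channel counts, and the padding are simplifying assumptions consistent with the description above.

```python
import torch.nn as nn
from torchvision import models

def build_encoder():
    """Encoder: VGG16 convolutional backbone with its fully connected layers
    replaced by two 7 x 7 convolutions, each followed by BN and ReLU."""
    vgg = models.vgg16(pretrained=True)
    backbone = vgg.features[:-1]       # conv layers; drop VGG16's final pooling layer
    head = nn.Sequential(
        nn.Conv2d(512, 512, kernel_size=7, padding=3),
        nn.BatchNorm2d(512),
        nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, kernel_size=7, padding=3),
        nn.BatchNorm2d(512),
        nn.ReLU(inplace=True),
    )
    return nn.Sequential(backbone, head)
```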

While the encoder improves the high-level image semantics used for the global localization of salient objects, it does not effectively mitigate the boundary-blurring problem of the salient object. To tackle this issue, deep features are fused with shallow features, inspired by edge detection work12, yielding the encoder-decoder network model (DEDN) shown in Figure 3. The encoder architecture comprises three kernels connected to the initial four max pooling layers, while the decoder systematically enhances the feature map resolution using the maximum values retrieved from the max pooling layers.

In this methodology for salient object detection, during the decoder phase, a convolutional layer with a 3 × 3 kernel is combined with a batch normalization layer and a rectified linear unit. At the end of the final decoding module of the decoder architecture, a single-channel convolutional layer is employed to obtain a salient map of spatial dimensions W × H. The salient map is generated through the collaborative fusion of the encoder-decoder model, that is, the complementary fusion of deep information and shallow information. This not only achieves accurate localization of the salient object and increases the receptive field but also effectively preserves image detail information and strengthens the boundary of the salient object.
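As an illustration of one decoding step, the sketch below up-samples with max-unpooling (using the indices saved by the corresponding encoder pooling layer) and then applies the 3 × 3 convolution, BN, and ReLU described above; the module name and channel handling are assumptions rather than the authors' implementation.

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One DEDN decoding step: max-unpool with the indices from the matching
    encoder pooling layer, then 3 x 3 conv + BN + ReLU."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, pool_indices, output_size):
        x = self.unpool(x, pool_indices, output_size=output_size)
        return self.relu(self.bn(self.conv(x)))

# After the last decoder block, a single-channel convolution produces the
# W x H saliency map, e.g.: nn.Conv2d(64, 1, kernel_size=3, padding=1)
```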

Integration mechanism
The encoder architecture comprises three kernels associated with the initial four max pooling layers of the VGG16 model, whereas the decoder progressively increases the resolution of the feature maps obtained from the up-sampling layers by using the maximum values retrieved from the corresponding pooling layers. A convolutional layer with a 3 × 3 kernel, a batch normalization layer, and a rectified linear unit are then applied in the decoder, followed by a single-channel convolutional layer to generate a salient map of dimensions W × H. The weights of the two deep networks are learned through alternating training cycles: the first network's parameters are kept fixed while the second network's parameters are trained, for a total of fifty cycles. During this process, the weights of the saliency maps (S1 and S2) used for fusion are updated via stochastic gradient descent. The loss function11 is:

Equation 1 (1)

In the given expression, the symbol G represents the manually labeled value, while W signifies the complete set of network parameters. The weight βi serves as a balancing factor to regulate the proportion of salient pixels versus non-salient pixels in the computation process.

The image I is characterized by three parameters: |I|, |I|−, and |I|+, which represent the total number of pixels, the number of non-salient pixels, and the number of salient pixels, respectively.

Equation 2
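Because Equations 1 and 2 are referenced only as placeholders here, the following LaTeX restates the class-balanced cross-entropy loss in the form used by the cited deep contrast learning work11; the exact symbols (g_i for the ground truth label of pixel i, I+ and I− for the salient and non-salient pixel sets) are an assumed reconstruction from the definitions of G, W, βi, |I|, |I|−, and |I|+ given above.

```latex
% Assumed reconstruction of Equation 1: class-balanced cross-entropy loss.
\begin{equation}
\mathcal{L}(W) \;=\; -\,\beta_{i}\sum_{i \in I_{+}} \log P\!\left(g_{i}=1 \mid I; W\right)
\;-\;\left(1-\beta_{i}\right)\sum_{i \in I_{-}} \log P\!\left(g_{i}=0 \mid I; W\right)
\end{equation}

% Assumed reconstruction of Equation 2: the balancing weight is the
% fraction of non-salient pixels in the image.
\begin{equation}
\beta_{i} \;=\; \frac{|I|_{-}}{|I|}
\end{equation}
```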

Since the salient maps obtained from the above two networks do not consider the coherence of neighboring pixels, a fully connected pixel-level saliency refinement model, the CRF15, is used to improve spatial coherence. Its energy equation11, which solves the binary pixel labeling problem, is as follows:

Equation 3 (2)

where L denotes the binary labeling (salient or non-salient) assigned to all pixels. The variable P(li) denotes the probability of a given pixel xi being assigned the label li, indicating the likelihood of the pixel xi being salient. Initially, P(1) = Si and P(0) = 1 − Si, where Si denotes the saliency value at pixel xi within the fused saliency map S. θi,j(li,lj) is the pairwise potential, defined as follows:

Equation 4 (3)

Among them, if li ≠ lj, then μ(li,lj) = 1; otherwise, μ(li,lj) = 0. The computation of θi,j involves two kernels: the first kernel depends on both the pixel position P and the pixel intensity I, which causes nearby pixels with similar colors to take comparable saliency values. The two parameters, σα and σβ, regulate the extent to which color similarity and spatial proximity influence the outcome. The objective of the second kernel is to eliminate isolated small regions. Energy minimization is achieved through high-dimensional filtering, which speeds up the mean-field inference of the conditional random field (CRF) distribution. After computation, the salient map denoted Scrf exhibits enhanced spatial coherence and contours with respect to the detected salient objects.
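Because Equations 3 and 4 appear only as placeholders, the following LaTeX gives the standard fully connected CRF energy and pairwise potential consistent with the description above (an appearance kernel governed by σα and σβ, plus a smoothness kernel that removes isolated regions); the weights w1, w2 and the σγ parameter of the second kernel belong to the standard formulation and are assumptions here.

```latex
% Assumed reconstruction of Equation 3: CRF energy over the binary labeling L.
\begin{equation}
E(L) \;=\; \sum_{i} -\log P(l_{i}) \;+\; \sum_{i<j} \theta_{i,j}(l_{i}, l_{j})
\end{equation}

% Assumed reconstruction of Equation 4: pairwise potential with an
% appearance kernel (position p, intensity I) and a smoothness kernel.
\begin{equation}
\theta_{i,j}(l_{i}, l_{j}) \;=\; \mu(l_{i}, l_{j})
\left[
w_{1} \exp\!\left(-\frac{\lVert p_{i}-p_{j}\rVert^{2}}{2\sigma_{\alpha}^{2}}
                 -\frac{\lVert I_{i}-I_{j}\rVert^{2}}{2\sigma_{\beta}^{2}}\right)
+ w_{2} \exp\!\left(-\frac{\lVert p_{i}-p_{j}\rVert^{2}}{2\sigma_{\gamma}^{2}}\right)
\right]
\end{equation}
```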

Experimental configurations
In this article, a deep network for salient object detection based on the VGG16 neural network is constructed using Python. The proposed model is compared with other methods using the SOD20 and ECSSD21 datasets. The SOD image database is known for its complex and cluttered backgrounds, similarity in colors between foreground and background, and small object sizes. Each image in this dataset is assigned a manually labeled ground truth for both quantitative and qualitative performance evaluation. The ECSSD dataset, on the other hand, primarily consists of images sourced from the Internet, featuring more complex and realistic natural scenes with low contrast between the image background and the salient objects.

The evaluation indexes used to compare the models in this paper include the commonly used precision-recall (P-R) curve, Fβ, and EMAE. To quantitatively assess the predicted saliency map, the P-R curve22 is generated by varying the threshold from 0 to 255 when binarizing the saliency map. Fβ is a comprehensive assessment metric, calculated from the precision and recall derived from the binarized salient map and the ground truth map:

Equation 5 (4)

where β is the weight parameter that balances precision and recall, with β2 set to 0.3. The calculation of EMAE is equivalent to computing the mean absolute error between the resulting saliency map and the ground truth map, as defined by the following expression:

Equation 6 (5)

Let Ts(u,v) denote the extracted value of the salient map at pixel (u,v), and let TG(u,v) denote the corresponding value of the ground truth map at pixel (u,v).
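Since Equations 5 and 6 are placeholders here, the following LaTeX restates the standard definitions consistent with the surrounding text (β² = 0.3, and an image of width W and height H); treat the exact typesetting as an assumed reconstruction rather than the authors' original notation.

```latex
% Assumed reconstruction of Equation 5: the F-measure with beta^2 = 0.3.
\begin{equation}
F_{\beta} \;=\; \frac{(1+\beta^{2})\,\mathrm{Precision}\times\mathrm{Recall}}
                     {\beta^{2}\,\mathrm{Precision}+\mathrm{Recall}}
\end{equation}

% Assumed reconstruction of Equation 6: mean absolute error between the
% saliency map T_S and the ground truth map T_G over all W x H pixels.
\begin{equation}
E_{\mathrm{MAE}} \;=\; \frac{1}{W \times H}\sum_{u=1}^{W}\sum_{v=1}^{H}
\left| T_{S}(u,v) - T_{G}(u,v) \right|
\end{equation}
```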

Protocol

1. Experimental setup and procedure
Load the pre-trained VGG16 model.
NOTE: The first step is to load the pre-trained VGG16 model from the Keras library6. To load a pre-trained VGG16 model in Python using popular deep learning libraries like PyTorch (see Table of Materials), follow these general steps: import torch; import torchvision.models as models; load the pre-trained V…
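A minimal sketch of that loading step with PyTorch/torchvision (the libraries listed in the Table of Materials). Whether the weights are subsequently frozen or fine-tuned is not specified in the truncated protocol text, so the eval() call below is only an assumption for inference.

```python
import torch
import torchvision.models as models

# Load the pre-trained VGG16 model (weights trained on ImageNet).
vgg16 = models.vgg16(pretrained=True)

# The convolutional backbone (vgg16.features) is what the DCL and DEDN
# networks described above build upon; the classifier head is not needed.
backbone = vgg16.features
vgg16.eval()  # assumption: inference mode; training would instead fine-tune these weights
```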

Representative Results

This study introduces an end-to-end deep neural network comprising two complementary networks: a pixel-level multi-scale fully convolutional network and a deep encoder-decoder network. The first network integrates contextual semantics to derive visual contrasts from multi-scale feature maps, addressing the challenge of fixed receptive fields in deep neural networks across different layers. The second network utilizes both deep and shallow image features to mitigate the issue of blurred boundaries in target objects. Final…

Discussion

The article introduces an end-to-end deep neural network specifically designed for the detection of salient objects in complex environments. The network is composed of two interconnected components: a pixel-level multiscale fully convolutional network (DCL) and a deep encoder-decoder network (DEDN). These components work synergistically, incorporating contextual semantics to generate visual contrasts within multiscale feature maps. Additionally, they leverage both deep and shallow image features to improve the precision …

Disclosures

The authors have nothing to disclose.

Acknowledgements

This work is supported by the 2024 Henan Provincial Higher Education Institutions Key Scientific Research Project Funding Program (Project Number: 24A520053). This study is also supported by the Specialized Creation and Integration Characteristic Demonstration Course Construction in Henan Province.

Materials

Matlab (MathWorks, Matlab R2016a): MATLAB's programming interface provides development tools for improving code quality, maintainability, and performance. It provides tools for building applications with custom graphical interfaces and for combining MATLAB-based algorithms with external applications and languages.
Processor (Intel, 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz): 64-bit Win11 processor.
PyCharm (JetBrains, PyCharm 3.0): PyCharm is a Python IDE (Integrated Development Environment). Required Python modules: matplotlib, skimage, torch, os, time, pydensecrf, opencv, glob, PIL, torchvision, numpy, tkinter.
PyTorch (Facebook, PyTorch 1.4): PyTorch is an open-source Python machine learning library based on Torch, used for natural language processing and other applications. It can be viewed both as NumPy with GPU support and as a powerful deep neural network framework with automatic differentiation.

References

  1. Wang, W. G., Shen, J. B., Ling, H. B. A deep network solution for attention and aesthetics aware photo cropping. IEEE Transactions on Pattern Analysis and Machine Intelligence. 41 (7), 1531-1544 (2018).
  2. Wang, W. G., Sun, G. L., Gool, L. V. Looking beyond single images for weakly supervised semantic segmentation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. , (2022).
  3. Mei, H. L., et al. Exploring dense context for salient object detection. IEEE Transactions on Circuits and Systems for Video Technology. 32 (3), 1378-1389 (2021).
  4. Borji, A., Itti, L. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (1), 185-207 (2012).
  5. Long, J., Shelhamer, E., Darrell, T. Fully convolutional networks for semantic segmentation. , 3431-3440 (2015).
  6. Simonyan, K., Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. , 1409-1556 (2014).
  7. He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. , 770-778 (2016).
  8. Liu, N., Han, J. Dhsnet: Deep hierarchical saliency network for salient object detection. , 678-686 (2016).
  9. Zhang, L., Dai, J., Lu, H., He, Y., Wang, G. A bi-directional message passing model for salient object detection. , 1741-1750 (2018).
  10. Wu, R., et al. A mutual learning method for salient object detection with intertwined multi-supervision. , 8150-8159 (2019).
  11. Li, G., Yu, Y. Deep contrast learning for salient object detection. , 478-487 (2019).
  12. Ren, Q., Hu, R. Multi-scale deep encoder-decoder network for salient object detection. Neurocomputing. 316, 95-104 (2018).
  13. Wang, W. G., et al. Salient object detection in the deep learning era: An in-depth survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 44 (6), 3239-3259 (2021).
  14. Zhou, T., et al. RGB-D salient object detection: A survey. Computational Visual Media. 7, 37-69 (2021).
  15. Fu, K., et al. Siamese network for RGB-D salient object detection and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence. 44 (9), 5541-5559 (2021).
  16. Lai, Q., et al. Weakly supervised visual saliency prediction. IEEE Transactions on Image Processing. 31, 3111-3124 (2022).
  17. Zhang, L., Dai, J., Lu, H., He, Y., Wang, G. A bi-directional message passing model for salient object detection. , 1741-1750 (2018).
  18. Wu, R. A mutual learning method for salient object detection with intertwined multi-supervision. , 8150-8159 (2019).
  19. Wang, W., Shen, J., Dong, X., Borji, A., Yang, R. Inferring salient objects from human fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence. 42 (8), 1913-1927 (2019).
  20. Movahedi, V., Elder, J. H. Design and perceptual validation of performance measures for salient object segmentation. , 49-56 (2010).
  21. Shi, J., Yan, Q., Xu, L., Jia, J. Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence. 38 (4), 717-729 (2015).
  22. Achanta, R., Hemami, S., Estrada, F., Susstrunk, S. Frequency-tuned salient region detection. , 1597-1604 (2009).
  23. Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M. H. Saliency detection via graph-based manifold ranking. , 3166-3173 (2013).
  24. Wei, Y., et al. Geodesic saliency using background priors. Computer Vision-ECCV 2012. , 29-42 (2012).
  25. Margolin, R., Tal, A., Zelnik-Manor, L. What makes a patch distinct. , 1139-1146 (2013).
  26. Perazzi, F., Krähenbühl, P., Pritch, Y., Hornung, A. Saliency filters: Contrast based filtering for salient region detection. , 733-740 (2012).
  27. Hou, X., Harel, J., Koch, C. Image signature: Highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence. 34 (1), 194-201 (2011).
  28. Jiang, H., et al. Salient object detection: A discriminative regional feature integration approach. , 2083-2090 (2013).
  29. Li, G., Yu, Y. Visual saliency based on multiscale deep features. , 5455-5463 (2015).
  30. Lee, G., Tai, Y. W., Kim, J. Deep saliency with encoded low level distance map and high-level features. , 660-668 (2016).
  31. Liu, N., Han, J. Dhsnet: Deep hierarchical saliency network for salient object detection. , 678-686 (2016).


Cite This Article
Wang, Y., Wang, Z. End-To-End Deep Neural Network for Salient Object Detection in Complex Environments. J. Vis. Exp. (202), e65554, doi:10.3791/65554 (2023).
