Pyramid Grafting Network (PGNet): Understanding Paper Series 1

Salient Object Detection has seen a lot of new research recently, and at CVPR 2022 we saw the paper "Pyramid Grafting Network for One-Stage High-Resolution Saliency Detection" by Xie et al. The authors propose the following:

A novel one-stage framework for Salient Object Detection (SOD) that uses a Transformer backbone and a CNN backbone to extract features.
An attention-based Cross-Model Grafting Module (CMGM).
A new Attention Guided Loss (AGL).
An Ultra-High-Resolution Saliency Detection dataset called UHRSD, containing 5,920 images at 4K-8K resolutions.

The authors claim that their new method achieves superior performance compared to the state-of-the-art methods on both existing datasets and the new one introduced.

Let's try to understand the four main components of this paper.

The reason to introduce a new framework

Most low-resolution SOD networks are designed in an Encoder-Decoder style. As the input resolution increases dramatically, the extracted features grow in size, but the receptive field determined by the network stays fixed. The receptive field therefore becomes small relative to the image, and the network can no longer capture the global semantics that are vital to the SOD task. The earlier high-resolution methods, HRSOD and DHQSOD, both suffer from inconsistent contextual semantic transfer between stages and are time-consuming to train.
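The shrinking relative receptive field can be made concrete with a little arithmetic. The sketch below assumes a hypothetical fixed receptive field of 400 pixels per side (an illustrative number, not one taken from the paper) and reports what fraction of the image side it covers as the input grows.

```python
def relative_receptive_field(receptive_field_px: int, image_side_px: int) -> float:
    """Fraction of the image side covered by a fixed receptive field (capped at 1)."""
    return min(1.0, receptive_field_px / image_side_px)


# Hypothetical receptive field size, chosen only for illustration.
RECEPTIVE_FIELD_PX = 400

for side in (352, 1024, 2048, 4096):
    ratio = relative_receptive_field(RECEPTIVE_FIELD_PX, side)
    print(f"input {side:>4}px: receptive field covers {ratio:.0%} of the image side")
```

With the network unchanged, the covered fraction falls steadily as the input side grows, which is the loss of global context described above.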

The One-Stage SOD framework

The key idea here is that the network consists of two encoders grafted together to form a single encoder and multiple decoders working in a series that effectively function as a single decoder.

Encoders

The authors propose a Swin Transformer based encoder that captures global semantic information from low-resolution images, and a ResNet-18 based CNN encoder that captures rich details from high-resolution images. During encoding, the two encoders run in parallel on images of different resolutions, capturing global semantic information and detailed information respectively.

The ResNet-18 encoder generates 5 feature maps. These are the $R_7, R_5, R_4, R_3, R_2$ layers of the ResNet model. From the Swin Transformer, the $S_1, S_2, S_3, S_4$ layers are used.

Decoder

The decoding phase can be divided into three sub-stages arranged in a staggered structure: first Swin decoding, then grafted feature decoding, and finally ResNet decoding. The feature decoded in the second sub-stage is produced by the Cross-Model Grafting Module (CMGM), where the global semantic information is grafted from the Swin branch onto the ResNet-18 branch.

Cross-Model Grafting Module

The $R_5$ and $S_2$ features are grafted together. Because of the Transformer's ability to capture information over long distances, $S_2$ carries global semantic information that is important for saliency detection. In contrast, CNNs perform well at extracting local information, so $R_5$ has relatively rich details, while earlier levels also contain a lot of background noise.

CMGM re-calculates the point-wise relationship between the ResNet feature and the Transformer feature, transferring global semantic information from the Transformer branch to the ResNet branch so as to remedy common errors. First, the CMGM flattens both feature maps; then, inspired by the multi-head self-attention mechanism, it applies layer normalization and a linear projection to each. From here it computes

$$ Y = \text{softmax}\left(f_R^q \times (f_S^k)^T\right) $$

$$ Z = Y \times f_R^v $$

where the subscripts $R$ and $S$ denote the ResNet and Swin branches, and the superscripts $q$, $k$, and $v$ denote the query, key, and value projections, as in multi-head self-attention.
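The two formulas above can be sketched in NumPy. This is a minimal single-head illustration, not the authors' implementation: it assumes both feature maps already share the same channel count and spatial size, uses random matrices `w_q`, `w_k`, `w_v` as stand-ins for the learned linear projections, and follows the formula as written, with no scaling factor. All function and variable names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_model_graft(feat_r, feat_s, w_q, w_k, w_v):
    """Single-head version of Y = softmax(f_R^q (f_S^k)^T), Z = Y f_R^v.

    feat_r, feat_s: (C, H, W) maps from the ResNet and Transformer branches,
    assumed here to share the same C, H and W.
    """
    c, h, w = feat_r.shape
    # Flatten each map into a sequence of H*W tokens with C channels.
    tokens_r = layer_norm(feat_r.reshape(c, h * w).T)
    tokens_s = layer_norm(feat_s.reshape(c, h * w).T)
    f_r_q = tokens_r @ w_q  # query from the ResNet branch
    f_s_k = tokens_s @ w_k  # key from the Transformer branch
    f_r_v = tokens_r @ w_v  # value from the ResNet branch
    y = softmax(f_r_q @ f_s_k.T, axis=-1)  # cross attention matrix, (HW, HW)
    z = y @ f_r_v                          # grafted tokens, (HW, C)
    return y, z.T.reshape(c, h, w)


# Toy inputs: random features and random stand-in projection weights.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
feat_r = rng.standard_normal((C, H, W))
feat_s = rng.standard_normal((C, H, W))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

cam, grafted = cross_model_graft(feat_r, feat_s, w_q, w_k, w_v)
print(cam.shape, grafted.shape)
```

Each row of `cam` is a probability distribution over Transformer positions for one ResNet position, and `grafted` has the same shape as the ResNet feature map, so it can be passed on to the decoder in its place.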

Attention Guided Loss

So that CMGM better serves its purpose of transferring information from the Transformer branch to the ResNet branch, the authors designed the Attention Guided Loss to supervise the Cross Attention Matrix (CAM) explicitly. The CAM should be similar to the attention matrix generated from the ground truth, because salient features have a higher similarity to each other; in other words, their dot product should have a larger activation value. The Attention Guided Loss uses weighted binary cross entropy (wBCE) to supervise the CAM generated by CMGM.
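The idea can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's exact loss: the target attention matrix is assumed to be the outer product of the flattened binary ground-truth mask (resized to the feature resolution), the predicted matrix is assumed to hold values in (0, 1), and the class-balance weighting is a hypothetical choice standing in for the paper's weights. All names are illustrative.

```python
import numpy as np


def attention_target(gt_mask):
    """Outer product of the flattened binary mask: entry (i, j) is 1 only
    when pixels i and j are both salient."""
    g = gt_mask.reshape(-1).astype(np.float64)
    return np.outer(g, g)


def weighted_bce(pred, target, eps=1e-7):
    """Binary cross entropy with a hypothetical class-balance weight:
    positives are weighted by the share of negatives and vice versa."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos_frac = target.mean()
    weights = np.where(target == 1.0, 1.0 - pos_frac, pos_frac)
    bce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float((weights * bce).sum() / max(weights.sum(), eps))


# Toy 3x3 ground-truth mask with three salient pixels.
gt = np.array([[1, 1, 0],
               [0, 1, 0],
               [0, 0, 0]])
target = attention_target(gt)           # shape (9, 9)
good_cam = np.clip(target, 0.05, 0.95)  # attention close to the target
bad_cam = 1.0 - good_cam                # inverted attention

loss_good = weighted_bce(good_cam, target)
loss_bad = weighted_bce(bad_cam, target)
print(loss_good < loss_bad)
```

An attention matrix that agrees with the ground-truth pattern receives a lower loss than one that inverts it, which is the pressure the loss is meant to put on the Cross Attention Matrix.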

Implementation Details

The whole network was trained end-to-end using stochastic gradient descent (SGD). The maximum learning rate was set to 0.003 for the Swin backbone and 0.03 for the other parameters. Momentum and weight decay were set to 0.9 and 0.0005 respectively. The batch size was set to 16 and the maximum number of epochs to 32. For data augmentation, the authors use random flipping, cropping, and multi-scale input images.
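These settings correspond to the standard SGD update with momentum and weight decay. The sketch below is a minimal NumPy illustration of one such step with two parameter groups. It assumes the common formulation in which the weight-decay term is added to the gradient before the momentum buffer is updated; the group names and toy arrays are hypothetical, and because the text gives only the maximum learning rates and not the schedule, the maximum values are used directly.

```python
import numpy as np

MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005
# Maximum learning rates from the text; the group names are hypothetical.
MAX_LR = {"swin_backbone": 0.003, "others": 0.03}


def sgd_step(param, grad, velocity, lr):
    """One SGD step with momentum and L2 weight decay.

    Returns the updated (param, velocity) pair.
    """
    grad = grad + WEIGHT_DECAY * param          # weight decay as an L2 term
    velocity = MOMENTUM * velocity + grad       # momentum buffer
    return param - lr * velocity, velocity


# Toy parameters standing in for the two groups.
params = {name: np.ones(3) for name in MAX_LR}
velocities = {name: np.zeros(3) for name in MAX_LR}
grads = {name: np.full(3, 0.5) for name in MAX_LR}

for name, lr in MAX_LR.items():
    params[name], velocities[name] = sgd_step(
        params[name], grads[name], velocities[name], lr
    )
    print(name, params[name])
```

Given the same gradient, the backbone group moves ten times less than the rest of the network, which is the usual reason for giving a pretrained backbone a smaller learning rate.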

Evaluation

The model was tested on both high-resolution and low-resolution datasets. Four metrics were chosen: Mean Absolute Error (MAE), Structural Similarity Measure $S_m$, E-measure $E_m$, and Boundary Displacement Error (BDE).
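Of these metrics, MAE is the simplest to state: it is the average absolute per-pixel difference between the predicted saliency map and the ground-truth mask, with both scaled to $[0, 1]$. A minimal NumPy sketch (the function name and toy arrays are illustrative):

```python
import numpy as np


def mean_absolute_error(pred, gt):
    """Average absolute per-pixel difference; both maps are expected in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    return float(np.abs(pred - gt).mean())


# Toy 2x2 saliency prediction and binary ground truth.
pred = np.array([[0.9, 0.2],
                 [0.1, 0.8]])
gt = np.array([[1.0, 0.0],
               [0.0, 1.0]])
mae = mean_absolute_error(pred, gt)
print(round(mae, 3))  # 0.15
```

Lower is better: a perfect prediction gives an MAE of 0.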

The authors also perform ablation studies on the composition of the model and on the grafting positions, and find that the model performs best with the backbone combination and grafting layers described above.

Limitation

The authors acknowledge that the training process is still quite demanding on GPU memory, resulting in a high training cost. They also state that images at excessive resolutions, such as 4K, need to be downsampled before being fed to the network.
