LSRFormer: Efficient Transformer Supply Convolutional Neural Networks with Global Information for Aerial Image Segmentation

Both local and global context information are essential for the semantic segmentation of aerial images. Convolutional Neural Networks (CNNs) capture local context well but cannot model global dependencies. Vision transformers (ViTs) are good at extracting global information but do not retain spatial details well. In order to leverage the advantages of these two paradigms, we integrate them into one model in this study. However, the global token interaction of ViTs incurs a high computational cost, which makes it difficult to apply them to large aerial images. To handle this problem, we propose a novel efficient ViT block named long-short-range transformer (LSRFormer). Unlike mainstream ViTs designed as backbones, LSRFormer is a pre-training-free, plug-and-play module appended after CNN stages to supplement global information. It is composed of long-range self-attention (LR-SA), short-range self-attention (SR-SA), and a multi-scale convolutional feed-forward network (MSC-FFN). LR-SA establishes long-range dependencies at the junctions of windows, and SR-SA diffuses this long-range information from the window boundaries to their interiors. MSC-FFN captures multi-scale information inside the ViT block. We append an LSRFormer block after each stage of a pure convolutional network to build a model named ConvLSR-Net. Compared with existing models that combine CNNs and ViTs, our model learns both local and global representations at all stages. In particular, ConvLSR-Net achieves state-of-the-art (SOTA) results on four challenging aerial image segmentation benchmarks, including iSAID, LoveDA, ISPRS Potsdam and Vaihingen. Code has been released at https://github.com/stdcoutzrh/ConvLSR-Net.

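A minimal PyTorch sketch of the "append a transformer block after each CNN stage" idea described in the abstract. The LSRFormerBlock below is a simplified stand-in: it uses plain global self-attention plus an MLP rather than the paper's actual LR-SA / SR-SA / MSC-FFN designs, and all module names, shapes, and hyper-parameters here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class LSRFormerBlock(nn.Module):
    """Simplified global-context block appended after a CNN stage (illustrative only)."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, C, H, W) feature map from a CNN stage
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        y = self.norm1(tokens)
        tokens = tokens + self.attn(y, y, y, need_weights=False)[0]  # global self-attention
        tokens = tokens + self.ffn(self.norm2(tokens))               # feed-forward refinement
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class ConvLSRNetSketch(nn.Module):
    """Toy encoder: each conv stage is followed by a transformer block, as in the abstract."""

    def __init__(self, dims=(32, 64, 128)):
        super().__init__()
        stages, blocks, in_ch = [], [], 3
        for d in dims:
            stages.append(nn.Sequential(nn.Conv2d(in_ch, d, 3, stride=2, padding=1),
                                        nn.BatchNorm2d(d), nn.ReLU(inplace=True)))
            blocks.append(LSRFormerBlock(d))
            in_ch = d
        self.stages, self.blocks = nn.ModuleList(stages), nn.ModuleList(blocks)

    def forward(self, x):
        feats = []
        for stage, block in zip(self.stages, self.blocks):
            x = block(stage(x))   # local CNN features plus supplementary global context
            feats.append(x)
        return feats              # multi-scale features for a segmentation decoder


if __name__ == "__main__":
    outs = ConvLSRNetSketch()(torch.randn(1, 3, 128, 128))
    print([o.shape for o in outs])
```

The key design point conveyed by the abstract is that the transformer block sits after each convolutional stage as a supplement, rather than replacing convolution as the backbone; the sketch mirrors that arrangement while leaving the efficient long/short-range attention details to the released code.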