Citation: ZHANG Z, YI H H, ZHENG J. Few-shot object detection of aerial image based on language guidance vision[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(7): 2338-2348 (in Chinese). doi: 10.13700/j.bh.1001-5965.2023.0491
This paper presents a few-shot object detection method for aerial images based on language-guided vision. It addresses the drop in detection accuracy that existing aerial-image object detectors suffer when training data are scarce and the dataset shifts, e.g., through changes in shooting angle, image quality, lighting conditions, and background environment, significant changes in object appearance, and the addition of new object categories. First, a word-region alignment branch replaces the classification branch of conventional object detection networks, and a word-region alignment classification score that fuses language and visual information is used as the classification prediction. Object detection and phrase grounding are thereby unified into a single task, leveraging language information to enhance visual detection accuracy. Furthermore, to address the accuracy fluctuations caused by changes in textual prompts during few-shot object detection, a language-visual bias network is designed to mine the association between language features and visual features; it improves the alignment between language and vision, mitigates the fluctuations, and further raises few-shot detection accuracy. Extensive experiments on UAVDT, VisDrone, AeriaDrone, VEDAI, and CARPK_PUCPR demonstrate the superior performance of the proposed method. On the UAVDT dataset, the method achieves an mAP of 14.6% at 30-shot; compared with the aerial-image detection algorithms clustered detection (ClusDet), density map guided object detection network (DMNet), global-local self-adaptive network (GLSAN), and coarse-grained density map network (CDMNet) trained on the full data, its accuracy is 0.9% higher, 0.1% lower, 2.4% lower, and 2.2% lower, respectively. On the PUCPR dataset, the approach reaches an mAP of 58.0% at 30-shot, which is 1.0%, 0.8%, 0.1%, and 0.3% higher than generic object detectors trained on the full data, namely the fully convolutional one-stage object detector (FCOS), adaptive training sample selection (ATSS), generalized focal loss V2 (GFLV2), and VarifocalNet (VFNet), respectively. These results highlight the strong few-shot generalization and transfer capabilities of the proposed method.
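To make the word-region alignment idea concrete, the following is a minimal PyTorch sketch of a GLIP-style alignment head and a simple language-visual bias module, assuming dot-product scoring between projected region features and prompt token embeddings. All module names, dimensions, and the pooling-based bias design are illustrative assumptions for this sketch, not the authors' released code.

```python
# Sketch of word-region alignment classification (GLIP-style) plus a
# language-visual bias module, as described in the abstract. Names and
# dimensions here are hypothetical, chosen only for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordRegionAlignmentHead(nn.Module):
    """Replaces the conventional classification branch: the logit for
    (region i, word j) is the scaled dot product between the projected
    region feature and the embedding of prompt token j."""

    def __init__(self, d_vis: int, d_lang: int, d_joint: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(d_vis, d_joint)
        self.lang_proj = nn.Linear(d_lang, d_joint)
        self.log_scale = nn.Parameter(torch.zeros(()))  # learnable temperature

    def forward(self, region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (N, d_vis) region/anchor features
        # word_feats:   (M, d_lang) token embeddings of the text prompt
        v = F.normalize(self.vis_proj(region_feats), dim=-1)
        w = F.normalize(self.lang_proj(word_feats), dim=-1)
        # (N, M) word-region alignment scores, used as classification logits
        return self.log_scale.exp() * v @ w.t()


class LanguageVisualBias(nn.Module):
    """Toy version of the language-visual bias idea: pool the prompt tokens
    and predict a channel-wise bias added to the visual features, so the
    alignment is less sensitive to the exact wording of the prompt."""

    def __init__(self, d_vis: int, d_lang: int):
        super().__init__()
        self.to_bias = nn.Sequential(nn.Linear(d_lang, d_vis), nn.Tanh())

    def forward(self, region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        bias = self.to_bias(word_feats.mean(dim=0))  # (d_vis,), broadcast over regions
        return region_feats + bias


# Usage: score 100 candidate regions against an 8-token prompt.
head = WordRegionAlignmentHead(d_vis=256, d_lang=768)
bias = LanguageVisualBias(d_vis=256, d_lang=768)
regions = torch.randn(100, 256)
words = torch.randn(8, 768)
logits = head(bias(regions, words), words)  # (100, 8)
```

In grounding-style heads of this kind, tokens belonging to the same category phrase are typically aggregated (e.g., max-pooled over the phrase's columns of `logits`) to recover one score per category, which is how the word-region scores would serve as a drop-in replacement for a classifier's per-class logits.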
[1] DU D W, WEN L Y, ZHU P F, et al. VisDrone-DET2020: the vision meets drone object detection in image challenge results[C]//Proceedings of the Computer Vision – ECCV 2020 Workshops. Berlin: Springer, 2020: 692-712.
[2] DU D W, QI Y K, YU H Y, et al. The unmanned aerial vehicle benchmark: object detection and tracking[C]//Proceedings of the Computer Vision – ECCV 2018. Berlin: Springer, 2018: 375-391.
[3] HSIEH M R, LIN Y L, HSU W H. Drone-based object counting by spatially regularized regional proposal network[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 4165-4173.
[4] RAZAKARIVONY S, JURIE F. Vehicle detection in aerial imagery: a small target detection benchmark[J]. Journal of Visual Communication and Image Representation, 2016, 34: 187-203. doi: 10.1016/j.jvcir.2015.11.002
[5] LI L H, ZHANG P C, ZHANG H T, et al. Grounded language-image pre-training[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 10955-10965.
[6] YANG F, FAN H, CHU P, et al. Clustered object detection in aerial images[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 8310-8319.
[7] LI C L, YANG T, ZHU S J, et al. Density map guided object detection in aerial images[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE Press, 2020: 737-746.
[8] DENG S T, LI S, XIE K, et al. A global-local self-adaptive network for drone-view object detection[J]. IEEE Transactions on Image Processing, 2020, 30: 1556-1569.
[9] DUAN C Z, WEI Z W, ZHANG C, et al. Coarse-grained density map guided object detection in aerial images[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Piscataway: IEEE Press, 2021: 2789-2798.
[10] SONG C, ZHAO J J, WANG K, et al. A review of small sample learning for intelligent perception[J]. Acta Aeronautica et Astronautica Sinica, 2020, 41(S1): 723756 (in Chinese).
[11] DUAN L J, YUAN Y, WANG W J, et al. Zero-shot object detection based on multi-modal joint semantic perception[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(2): 368-375 (in Chinese).
[12] YAN X P, CHEN Z L, XU A N, et al. Meta R-CNN: towards general solver for instance-level low-shot learning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 9576-9585.
[13] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
[14] FAN Q, ZHUO W, TANG C K, et al. Few-shot object detection with attention-RPN and multi-relation detector[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 4012-4021.
[15] SUN B, LI B H, CAI S C, et al. FSCE: few-shot object detection via contrastive proposal encoding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 7348-7358.
[16] HAN G X, MA J W, HUANG S Y, et al. Few-shot object detection with fully cross-transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 5311-5320.
[17] DENG J Y, LI X, FANG Y. Few-shot object detection on remote sensing images[EB/OL]. (2020-06-14)[2023-02-01]. http://arxiv.org/abs/2006.07826v2.
[18] CHENG G, YAN B W, SHI P Z, et al. Prototype-CNN for few-shot object detection in remote sensing images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2021, 60: 5604610.
[19] HUANG X, HE B K, TONG M, et al. Few-shot object detection on remote sensing images via shared attention module and balanced fine-tuning strategy[J]. Remote Sensing, 2021, 13(19): 3816. doi: 10.3390/rs13193816
[20] LI H G, WANG Y F, YANG L C. Meta-learning-based few-shot object detection for remote sensing images[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(8): 2503-2513 (in Chinese).
[21] ZHOU Y, HU H, ZHAO J Q, et al. Few-shot object detection via context-aware aggregation for remote sensing images[J]. IEEE Geoscience and Remote Sensing Letters, 2022, 19: 6509605.
[22] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. (2021-02-26)[2023-02-01]. http://arxiv.org/abs/2103.00020.
[23] ZHONG Y W, YANG J W, ZHANG P C, et al. RegionCLIP: region-based language-image pretraining[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 16772-16782.
[24] ZHAO S Y, ZHANG Z X, SCHULTER S, et al. Exploiting unlabeled data with vision and language models for object detection[C]//Proceedings of the Computer Vision – ECCV 2022. Berlin: Springer, 2022: 159-175.
[25] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2023-02-01]. http://arxiv.org/abs/1810.04805v2.
[26] DAI X Y, CHEN Y P, XIAO B, et al. Dynamic head: unifying object detection heads with attentions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 7369-7378.
[27] ZHOU K Y, YANG J K, LOY C C, et al. Conditional prompt learning for vision-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 16795-16804.
[28] TIAN Z, SHEN C H, CHEN H, et al. FCOS: a simple and strong anchor-free object detector[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(4): 1922-1933.
[29] ZHANG S F, CHI C, YAO Y Q, et al. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 9756-9765.
[30] LI X, WANG W H, HU X L, et al. Generalized focal loss V2: learning reliable localization quality estimation for dense object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 11627-11636.
[31] ZHANG H Y, WANG Y, DAYOUB F, et al. VarifocalNet: an IoU-aware dense object detector[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 8510-8519.