Abstract: Existing aerial image object detection methods suffer a sharp drop in accuracy when, instead of being retrained on the full samples of a new dataset, they are applied directly to aerial data whose shooting angle, image quality, lighting conditions, or background environment has changed substantially, whose objects differ markedly in appearance, or which contains new object categories. To address this problem, this paper presents a language-guided few-shot object detection method for aerial images. First, a word-region alignment branch replaces the classification branch of the conventional object detection network, and the word-region alignment score, which carries both language and visual information, is used as the predicted classification result; object detection and phrase grounding are thereby unified into a single task, and language guidance is exploited to improve visual detection accuracy. Second, to counter the accuracy fluctuations caused by changes in the input text prompt during few-shot detection, a language-visual bias network is designed to mine the association between language features and visual features, improving language-vision matching, narrowing the fluctuations, and further raising few-shot detection accuracy. Extensive experiments on the UAVDT, VisDrone, AerialDrone, VEDAI, and CARPK_PUCPR datasets demonstrate the superior performance of the proposed method. On UAVDT it achieves a mean average precision (mAP) of 14.6% at 30-shot, which differs by +0.9%, −0.1%, −2.4%, and −2.2% from the full-sample training accuracy of the aerial image detection methods clustered detection (ClusDet), density map guided object detection network (DMNet), global-local self-adaptive network (GLSAN), and coarse-grained density map network (CDMNet), respectively. On CARPK_PUCPR it achieves 58.0% mAP at 30-shot, exceeding the full-sample training accuracy of the generic object detection methods fully convolutional one-stage object detector (FCOS), adaptive training sample selection (ATSS), generalized focal loss V2 (GFLV2), and VarifocalNet (VFNet) by 1.0%, 0.8%, 0.1%, and 0.3%, respectively. These results highlight the strong few-shot generalization and transfer capability of the proposed method.
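To make the first contribution concrete, the sketch below shows, in PyTorch, how a GLIP-style [5] word-region alignment branch can replace per-class logits with region-to-word similarity scores, and how a cross-attention module could act as the language-visual bias that conditions word features on the image. This is a minimal illustration under assumed shapes and module choices; the projection layers, the use of multi-head attention, and all names such as `WordRegionAlignment` are hypothetical, not the paper's exact architecture.

```python
# Minimal sketch of a word-region alignment branch (GLIP-style [5]).
# Assumptions: `region_feats` come from the detection head, `word_feats`
# from a BERT-like text encoder; all names and dimensions are illustrative.
import torch
import torch.nn as nn

class WordRegionAlignment(nn.Module):
    """Replaces the classification branch: the classification logit for
    (region i, word j) is the dot product of their projected embeddings."""
    def __init__(self, vis_dim: int, lang_dim: int, embed_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.lang_proj = nn.Linear(lang_dim, embed_dim)
        # Hypothetical stand-in for the language-visual bias network:
        # cross-attention lets word features attend to region features,
        # making them image-conditioned before alignment scoring.
        self.bias_attn = nn.MultiheadAttention(embed_dim, num_heads=8,
                                               batch_first=True)

    def forward(self, region_feats, word_feats):
        # region_feats: (B, N, vis_dim); word_feats: (B, M, lang_dim)
        v = self.vis_proj(region_feats)            # (B, N, D)
        w = self.lang_proj(word_feats)             # (B, M, D)
        # Language-visual bias: condition word embeddings on the image,
        # so rephrasing the prompt perturbs the scores less.
        w_bias, _ = self.bias_attn(query=w, key=v, value=v)
        w = w + w_bias
        # Word-region alignment scores replace per-class logits:
        # scores[b, i, j] = <v_i, w_j>, trained with the detection loss.
        return torch.einsum('bnd,bmd->bnm', v, w)

# Usage: a prompt such as "car. truck. bus." is tokenized and encoded;
# each region's predicted class is its best-aligned word span.
feats = torch.randn(2, 100, 256)   # dummy region features
words = torch.randn(2, 8, 768)     # dummy BERT word features
print(WordRegionAlignment(256, 768)(feats, words).shape)  # (2, 100, 8)
```

With this formulation, adapting to a new aerial dataset or a new category only requires changing the text prompt and fine-tuning on a handful of shots, which is what gives the method its few-shot transfer behavior.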
Key words:
- aerial images
- few-shot
- word-region alignment branch
- language-visual bias network
- object detection
Table 1. Classification analysis of drone datasets
Table 2. Ablation experiment
All values are mAP/%.

| Dataset | Method | 1-shot | 3-shot | 5-shot | 10-shot | 30-shot |
| --- | --- | --- | --- | --- | --- | --- |
| AerialDrone[5] | Baseline | 0 | 17.5 | 23.8 | 30.5 | 33.1 |
| | Baseline + word-region alignment branch | 19.4 | 28.9 | 33.0 | 37.4 | 42.9 |
| | Baseline + all | 19.8 | 32.7 | 33.7 | 39.0 | 43.1 |
| VisDrone-DET[1] | Baseline | 0.4 | 1.1 | 2.1 | 1.3 | 1.6 |
| | Baseline + word-region alignment branch | 15.9 | 16.6 | 18.3 | 18.7 | 21.8 |
| | Baseline + all | 16.1 | 17.2 | 18.4 | 18.8 | 21.8 |
| CARPK_PUCPR[3] | Baseline | 0 | 1.4 | 11.1 | 15.3 | 30.3 |
| | Baseline + word-region alignment branch | 46.2 | 52.0 | 55.5 | 56.0 | 57.8 |
| | Baseline + all | 46.9 | 55.0 | 56.4 | 54.2 | 58.0 |
| VEDAI[4] | Baseline | 0 | 3.6 | 6.8 | 8.1 | 17.4 |
| | Baseline + word-region alignment branch | 8.7 | 11.6 | 12.7 | 13.8 | 26.8 |
| | Baseline + all | 8.2 | 13.8 | 15.1 | 15.6 | 27.1 |

Table 3. Comparison with classic aerial image object detection methods
[1] DU D W, WEN L Y, ZHU P F, et al. VisDrone-DET2020: the vision meets drone object detection in image challenge results[C]//Proceedings of the Computer Vision – ECCV 2020 Workshops. Berlin: Springer, 2020: 692-712.
[2] DU D W, QI Y K, YU H Y, et al. The unmanned aerial vehicle benchmark: object detection and tracking[M]//Computer Vision – ECCV 2018. Berlin: Springer, 2018: 375-391.
[3] HSIEH M R, LIN Y L, HSU W H. Drone-based object counting by spatially regularized regional proposal network[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2017: 4165-4173.
[4] RAZAKARIVONY S, JURIE F. Vehicle detection in aerial imagery: a small target detection benchmark[J]. Journal of Visual Communication and Image Representation, 2016, 34: 187-203. doi: 10.1016/j.jvcir.2015.11.002
[5] LI L H, ZHANG P C, ZHANG H T, et al. Grounded language-image pre-training[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 10955-10965.
[6] YANG F, FAN H, CHU P, et al. Clustered object detection in aerial images[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 8310-8319.
[7] LI C L, YANG T, ZHU S J, et al. Density map guided object detection in aerial images[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE Press, 2020: 737-746.
[8] DENG S T, LI S, XIE K, et al. A global-local self-adaptive network for drone-view object detection[J]. IEEE Transactions on Image Processing, 2020, 30: 1556-1569.
[9] DUAN C Z, WEI Z W, ZHANG C, et al. Coarse-grained density map guided object detection in aerial images[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Piscataway: IEEE Press, 2021: 2789-2798.
[10] SONG C, ZHAO J J, WANG K, et al. A review of small sample learning for intelligent perception[J]. Acta Aeronautica et Astronautica Sinica, 2020, 41(S1): 723756 (in Chinese).
[11] DUAN L J, YUAN Y, WANG W J, et al. Zero-shot object detection based on multi-modal joint semantic perception[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(2): 368-375 (in Chinese).
[12] YAN X P, CHEN Z L, XU A N, et al. Meta R-CNN: towards general solver for instance-level low-shot learning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 9576-9585.
[13] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
[14] FAN Q, ZHUO W, TANG C K, et al. Few-shot object detection with attention-RPN and multi-relation detector[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 4012-4021.
[15] SUN B, LI B H, CAI S C, et al. FSCE: few-shot object detection via contrastive proposal encoding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 7348-7358.
[16] HAN G X, MA J W, HUANG S Y, et al. Few-shot object detection with fully cross-transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 5311-5320.
[17] DENG J Y, LI X, FANG Y. Few-shot object detection on remote sensing images[EB/OL]. (2020-06-14)[2023-02-01]. http://arxiv.org/abs/2006.07826v2.
[18] CHENG G, YAN B W, SHI P Z, et al. Prototype-CNN for few-shot object detection in remote sensing images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2021, 60: 5604610.
[19] HUANG X, HE B K, TONG M, et al. Few-shot object detection on remote sensing images via shared attention module and balanced fine-tuning strategy[J]. Remote Sensing, 2021, 13(19): 3816. doi: 10.3390/rs13193816
[20] LI H G, WANG Y F, YANG L C. Meta-learning-based few-shot object detection for remote sensing images[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(8): 2503-2513 (in Chinese).
[21] ZHOU Y, HU H, ZHAO J Q, et al. Few-shot object detection via context-aware aggregation for remote sensing images[J]. IEEE Geoscience and Remote Sensing Letters, 2022, 19: 6509605.
[22] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. (2021-02-26)[2023-02-01]. http://arxiv.org/abs/2103.00020.
[23] ZHONG Y W, YANG J W, ZHANG P C, et al. RegionCLIP: region-based language-image pretraining[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 16772-16782.
[24] ZHAO S Y, ZHANG Z X, SCHULTER S, et al. Exploiting unlabeled data with vision and language models for object detection[M]//Computer Vision – ECCV 2022. Berlin: Springer, 2022: 159-175.
[25] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2019-05-24)[2023-02-01]. http://arxiv.org/abs/1810.04805v2.
[26] DAI X Y, CHEN Y P, XIAO B, et al. Dynamic head: unifying object detection heads with attentions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 7369-7378.
[27] ZHOU K Y, YANG J K, LOY C C, et al. Conditional prompt learning for vision-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 16795-16804.
[28] TIAN Z, SHEN C H, CHEN H, et al. FCOS: a simple and strong anchor-free object detector[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(4): 1922-1933.
[29] ZHANG S F, CHI C, YAO Y Q, et al. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 9756-9765.
[30] LI X, WANG W H, HU X L, et al. Generalized focal loss V2: learning reliable localization quality estimation for dense object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 11627-11636.
[31] ZHANG H Y, WANG Y, DAYOUB F, et al. VarifocalNet: an IoU-aware dense object detector[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 8510-8519.

