Robust semi-supervised video object segmentation based on dynamic embedding features
Abstract: A semi-supervised video object segmentation (VOS) method based on dynamic embedding features and an auxiliary loss function was proposed to address two issues of existing semi-supervised VOS methods: memory consumption that keeps growing during inference, and the difficulty of training on low-level pixel features alone. First, dynamic embedding features were employed to establish a constant-sized memory bank; through spatiotemporal aggregation, historical information was utilized to generate and update the dynamic embedding features, while a memory update sensor adaptively controlled the update interval of the memory bank to accommodate the different motion patterns of different videos. Second, an auxiliary loss function was utilized to guide the network at the high-level semantic feature level, enhancing model accuracy and training efficiency by offering supervision across multiple feature levels. Finally, to address the mismatching of similar objects between the foreground and background of a video, a spatiotemporal constraint module was designed, which leverages the temporal continuity of video to better capture the correlation between the previous frame's mask and the current frame. Experimental results demonstrate that the proposed method achieves 84.5% J&F on the DAVIS 2017 validation set and 82.4% J&F on the YouTube-VOS 2019 validation set.
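The abstract describes these components only at a high level. As an illustration of the memory mechanism, the following PyTorch-style sketch shows one way a constant-size memory bank with spatiotemporally aggregated dynamic embeddings could be organized; the class, its methods, and the EMA-based aggregation rule are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

class DynamicMemoryBank:
    """Minimal sketch of a constant-size memory bank (hypothetical API).

    Instead of appending a new key/value pair per memorized frame (which
    grows without bound), a fixed number of embedding slots is kept and
    refreshed from each incoming frame's features.
    """

    def __init__(self, num_slots: int, dim: int, momentum: float = 0.8):
        self.keys = torch.zeros(num_slots, dim)    # fixed-size key slots
        self.values = torch.zeros(num_slots, dim)  # fixed-size value slots
        self.momentum = momentum                   # blend of old vs. new information

    def update(self, frame_keys: torch.Tensor, frame_values: torch.Tensor):
        """Aggregate a new frame's pixel embeddings (N, dim) into the fixed slots."""
        # Soft-assign every pixel embedding to the slots it resembles most.
        assign = F.softmax(frame_keys @ self.keys.t(), dim=0)  # (N, num_slots)
        agg_k = assign.t() @ frame_keys                        # (num_slots, dim)
        agg_v = assign.t() @ frame_values
        # Exponential moving average: memory size stays constant over time.
        self.keys = self.momentum * self.keys + (1 - self.momentum) * agg_k
        self.values = self.momentum * self.values + (1 - self.momentum) * agg_v

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """Attention read: match queries against keys, return mixed values."""
        attn = F.softmax(query @ self.keys.t() / self.keys.shape[1] ** 0.5, dim=-1)
        return attn @ self.values
```

Because the number of slots is fixed, read and update cost stays constant regardless of video length, which is the property the CC column in the tables below tracks.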
Table 1. Experimental results on the DAVIS 2017 validation set (unit: %)

| Method | OL | CC | J&F | J | F |
|---|---|---|---|---|---|
| MiVOS[12] |  | × | 84.5 | 81.7 | 87.4 |
| QDMN[24] |  | × | 84.6 | 81.8 | 87.3 |
| HMMN[8] |  | × | 84.7 | 81.9 | 87.5 |
| STCN[11] |  | × | 85.3 | 82.0 | 88.6 |
| XMem[25] |  | × | 86.2 | 82.9 | 89.5 |
| OnAVOS[2] | √ | √ | 65.3 | 61.6 | 69.1 |
| OSVOS[3] | √ | √ | 60.2 | 56.6 | 63.9 |
| GCNet[16] |  | √ | 71.4 | 69.3 | 73.5 |
| Ref. [18] |  | √ | 74.6 | 73.0 | 76.1 |
| PReMVOS[21] | √ | √ | 77.8 | 73.9 | 81.7 |
| SwiftNet[17] |  | √ | 81.1 | 78.3 | 83.9 |
| SSTVOS[22] |  | √ | 82.5 | 79.9 | 85.1 |
| Ref. [23] |  | √ | 82.7 | 80.2 | 85.3 |
| SITVOS[26] |  | √ | 83.5 | 80.4 | 86.5 |
| RDE[9] |  | √ | 84.2 | 80.8 | 87.5 |
| Ours |  | √ | 84.5 | 81.2 | 87.7 |

Note: OL marks methods based on online fine-tuning ("√" = online fine-tuning is used). CC marks whether memory consumption is constant ("√" = memory usage stays constant, "×" = memory usage grows over time).
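For reference, $\mathcal{J}$ and $\mathcal{F}$ in Tables 1-6 are the standard DAVIS metrics: region similarity $\mathcal{J}$ is the intersection-over-union between the predicted mask $M$ and the ground-truth mask $G$, contour accuracy $\mathcal{F}$ is the F-measure of contour precision $P_c$ and recall $R_c$, and $\mathcal{J}\&\mathcal{F}$ is their mean:

$$ \mathcal{J} = \frac{|M \cap G|}{|M \cup G|}, \qquad \mathcal{F} = \frac{2 P_c R_c}{P_c + R_c}, \qquad \mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2}. $$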
Table 2. Experimental results on the DAVIS 2017 test set (unit: %)

| Method | 600p | CC | J&F | J | F |
|---|---|---|---|---|---|
| KMN[7] | √ | × | 77.2 | 74.1 | 80.3 |
| STCN[11] | × | × | 76.1 | 72.7 | 79.6 |
| MiVOS[12] | × | × | 76.0 | 72.6 | 79.3 |
| QDMN[24] | × | × | 77.7 | 74.2 | 81.2 |
| XMem[25] | × | × | 81.0 | 77.4 | 84.5 |
| CFBI[4] | × | √ | 74.8 | 71.1 | 78.5 |
| Ref. [23] | × | √ | 75.2 | 72.0 | 78.3 |
| CFBI+[5] | × | √ | 75.6 | 71.6 | 79.6 |
| RDE[9] | × | √ | 77.4 | 73.6 | 81.2 |
| Ours | × | √ | 78.0 | 74.3 | 81.6 |

Note: 600p indicates inference on 600p-resolution video ("√" = used, "×" = not used). CC marks whether memory consumption is constant ("√" = memory usage stays constant, "×" = memory usage grows over time).
Table 3. Experimental results on the YouTube-VOS 2019 validation set (unit: %)

| Method | CC | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}_{\mathrm{seen}}$ | $\mathcal{F}_{\mathrm{seen}}$ | $\mathcal{J}_{\mathrm{unseen}}$ | $\mathcal{F}_{\mathrm{unseen}}$ |
|---|---|---|---|---|---|---|
| STM[10] | × | 79.2 | 79.6 | 83.6 | 73.0 | 80.6 |
| MiVOS[12] | × | 80.3 | 79.3 | 83.7 | 75.3 | 82.8 |
| STCN[11] | × | 82.7 | 81.1 | 85.4 | 78.2 | 85.9 |
| XMem[25] | × | 85.5 | 84.3 | 88.6 | 80.3 | 88.6 |
| CFBI+[5] | √ | 81.0 | 80.6 | 85.1 | 75.2 | 83.0 |
| SSTVOS[22] | √ | 81.8 | 80.9 | — | 76.6 | — |
| RDE[9] | √ | 81.9 | 81.1 | 85.5 | 76.2 | 84.8 |
| Ours | √ | 82.4 | 81.3 | 85.4 | 77.2 | 85.7 |

Note: CC marks whether memory consumption is constant ("√" = memory usage stays constant, "×" = memory usage grows over time). "—" denotes a value not reported by the original paper.
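The overall $\mathcal{J}\&\mathcal{F}$ score in Table 3 follows the YouTube-VOS convention of averaging the four per-split metrics, $\mathcal{G} = (\mathcal{J}_{\mathrm{seen}} + \mathcal{F}_{\mathrm{seen}} + \mathcal{J}_{\mathrm{unseen}} + \mathcal{F}_{\mathrm{unseen}})/4$; for STCN, for example, $(81.1 + 85.4 + 78.2 + 85.9)/4 \approx 82.7$, matching the table.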
Table 4. Experimental results on the DAVIS 2016 validation set (unit: %)

| Method | CC | J&F | J | F |
|---|---|---|---|---|
| STM[10] | × | 89.3 | 88.7 | 89.9 |
| KMN[7] | × | 90.5 | 89.5 | 91.5 |
| HMMN[8] | × | 90.8 | 89.6 | 92.0 |
| MiVOS[12] | × | 91.0 | 89.7 | 92.4 |
| QDMN[24] | × | 91.0 | 90.2 | 91.7 |
| STCN[11] | × | 91.7 | 90.4 | 93.0 |
| XMem[25] | × | 91.5 | 90.4 | 92.7 |
| GCNet[16] | √ | 86.6 | 87.6 | 85.7 |
| CFBI+[5] | √ | 89.9 | 88.7 | 91.1 |
| SwiftNet[17] | √ | 90.4 | 90.7 | 90.3 |
| SITVOS[26] | √ | 90.5 | 89.5 | 91.4 |
| RDE[9] | √ | 91.1 | 89.7 | 92.5 |
| Ours | √ | 91.2 | 89.8 | 92.5 |

Note: CC marks whether memory consumption is constant ("√" = memory usage stays constant, "×" = memory usage grows over time).
Table 5. Ablation results for the spatiotemporal constraint module and the high-level semantic auxiliary loss (unit: %)

| Ablation setting | J&F | J | F |
|---|---|---|---|
| Without the spatiotemporal constraint module | 84.0 | 80.6 | 87.3 |
| Without the auxiliary loss | 83.3 | 80.2 | 86.4 |
| With both spatiotemporal constraint and auxiliary loss | 84.5 | 81.2 | 87.7 |
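Table 5 ablates the auxiliary loss supervising high-level semantic features. As a minimal sketch of how such a loss is typically combined with the main segmentation loss, the following is an illustration only: the auxiliary head, the weight of 0.4, and the module names are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class SegWithAuxLoss(nn.Module):
    """Sketch: main segmentation loss plus an auxiliary loss on deep features."""

    def __init__(self, backbone: nn.Module, seg_head: nn.Module,
                 aux_head: nn.Module, aux_weight: float = 0.4):
        super().__init__()
        self.backbone = backbone    # assumed to return (high_level_feat, decoder_feat)
        self.seg_head = seg_head    # full-resolution mask prediction
        self.aux_head = aux_head    # coarse prediction from high-level semantic features
        self.aux_weight = aux_weight
        self.ce = nn.CrossEntropyLoss()

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # masks: (B, H, W) integer class indices.
        high_feat, dec_feat = self.backbone(frames)
        main_logits = self.seg_head(dec_feat)
        aux_logits = self.aux_head(high_feat)  # supervised at lower resolution
        # Downsample the ground truth to the auxiliary prediction's resolution.
        aux_masks = nn.functional.interpolate(
            masks[:, None].float(), size=aux_logits.shape[-2:],
            mode="nearest").squeeze(1).long()
        # Total loss: main supervision plus weighted high-level guidance.
        return self.ce(main_logits, masks) + self.aux_weight * self.ce(aux_logits, aux_masks)
```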
Table 6. Ablation results for the memory update sensor (unit: %)

| Memory update strategy | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}_{\mathrm{seen/unseen}}\&\mathcal{F}_{\mathrm{seen/unseen}}$ |
|---|---|---|
| Fixed interval (3 frames) | 84.5 | 81.9 |
| Fixed interval (4 frames) | 84.2 | 82.4 |
| Memory update sensor | 84.5 | 82.4 |

Note: $\mathcal{J}_{\mathrm{seen/unseen}}\&\mathcal{F}_{\mathrm{seen/unseen}}$ denotes results on the YouTube-VOS 2019 validation set.
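Table 6 compares fixed update intervals with the adaptive memory update sensor. One plausible form of such a sensor, sketched below, drives the interval from a cheap motion estimate such as the frame-to-frame change of the predicted mask; the IoU-based cue, the threshold, and the maximum interval are assumptions for illustration, not the paper's actual rule.

```python
import torch

def should_update_memory(prev_mask: torch.Tensor, cur_mask: torch.Tensor,
                         frames_since_update: int,
                         motion_threshold: float = 0.2,
                         max_interval: int = 5) -> bool:
    """Hypothetical sensor: update sooner for fast motion, later for slow motion."""
    # IoU between consecutive predicted (boolean) masks as a cheap motion proxy.
    inter = torch.logical_and(prev_mask, cur_mask).sum().item()
    union = torch.logical_or(prev_mask, cur_mask).sum().item()
    motion = 1.0 - inter / union if union > 0 else 0.0
    # Refresh memory if the object moved a lot or the interval budget ran out.
    return motion > motion_threshold or frames_since_update >= max_interval
```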
References:
[1] PERAZZI F, KHOREVA A, BENENSON R, et al. Learning video object segmentation from static images[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 3491-3500.
[2] VOIGTLAENDER P, LEIBE B. Online adaptation of convolutional neural networks for video object segmentation[EB/OL]. (2017-08-01)[2023-06-01]. http://arxiv.org/abs/1706.09364v2.
[3] CAELLES S, MANINIS K K, PONT-TUSET J, et al. One-shot video object segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 5320-5329.
[4] YANG Z X, WEI Y C, YANG Y. Collaborative video object segmentation by foreground-background integration[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2020: 332-348.
[5] YANG Z X, WEI Y C, YANG Y. Collaborative video object segmentation by multi-scale foreground-background integration[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(9): 4701-4712.
[6] ZHANG P, HU L, ZHANG B, et al. Spatial consistent memory network for semi-supervised video object segmentation[C]//Proceedings of the DAVIS Challenge on Video Object Segmentation. Piscataway: IEEE Press, 2020: 1-4.
[7] SEONG H, HYUN J, KIM E. Kernelized memory network for video object segmentation[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2020: 629-645.
[8] SEONG H, OH S W, LEE J Y, et al. Hierarchical memory matching network for video object segmentation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2021: 12869-12878.
[9] LI M X, HU L, XIONG Z W, et al. Recurrent dynamic embedding for video object segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2022: 1322-1331.
[10] OH S W, LEE J Y, XU N, et al. Video object segmentation using space-time memory networks[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 9226-9235.
[11] CHENG H K, TAI Y W, TANG C K. Rethinking space-time networks with improved memory coverage for efficient video object segmentation[EB/OL]. (2021-10-08)[2023-06-01]. http://arxiv.org/abs/2106.05210?context=cs.CV.
[12] CHENG H K, TAI Y W, TANG C K. Modular interactive video object segmentation: interaction-to-mask, propagation and difference-aware fusion[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 5555-5564.
[13] XIE H Z, YAO H X, ZHOU S C, et al. Efficient regional memory network for video object segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 1286-1295.
[14] JI C J, CHEN Y D, CHE X. Visual words and self-attention mechanism fusion based video object segmentation method[J]. Journal of Image and Graphics, 2022, 27(8): 2444-2457 (in Chinese).
[15] ZHENG Y, CHEN Y D, HAO C Y. Video object segmentation algorithm based on consistent features[J]. Journal of Image and Graphics, 2020, 25(8): 1558-1566 (in Chinese). doi: 10.11834/jig.190571
[16] LI Y, SHEN Z R, SHAN Y. Fast video object segmentation using the global context module[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2020: 735-750.
[17] WANG H C, JIANG X L, REN H B, et al. SwiftNet: real-time video object segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 1296-1305.
[18] LIANG Y Q, LI X, JAFARI N, et al. Video object segmentation with adaptive feature bank and uncertain-region refinement[EB/OL]. (2020-10-15)[2023-06-01]. http://arxiv.org/abs/2010.07958v1.
[19] CHEN Y D, HAO C Y, YANG Z X, et al. Fast target-aware learning for few-shot video object segmentation[J]. Science China Information Sciences, 2022, 65(8): 182104. doi: 10.1007/s11432-021-3396-7
[20] CHO S, LEE H, LEE M, et al. Tackling background distraction in video object segmentation[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2022: 446-462.
[21] LUITEN J, VOIGTLAENDER P, LEIBE B. PReMVOS: proposal-generation, refinement and merging for video object segmentation[C]//Proceedings of the Asian Conference on Computer Vision. Berlin: Springer, 2019: 565-580.
[22] DUKE B, AHMED A, WOLF C, et al. SSTVOS: sparse spatiotemporal Transformers for video object segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 5908-5917.
[23] GE W B, LU X K, SHEN J B. Video object segmentation using global and instance embedding learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 16831-16840.
[24] LIU Y, YU R, YIN F, et al. Learning quality-aware dynamic memory for video object segmentation[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2022: 468-486.
[25] CHENG H K, SCHWING A G. XMem: long-term video object segmentation with an Atkinson-Shiffrin memory model[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2022: 640-658.
[26] LAN M, ZHANG J, HE F X, et al. Siamese network with interactive Transformer for video object segmentation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(2): 1228-1236. doi: 10.1609/aaai.v36i2.20009
[27] CHEN L C, PAPANDREOU G, KOKKINOS I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848. doi: 10.1109/TPAMI.2017.2699184
[28] CHEN Y D, CHEN L R, YU W B, et al. Knowledge distillation anomaly detection with multi-scale feature fusion[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(10): 1542-1549 (in Chinese).
[29] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2018: 3-19.

