Robust semi-supervised video object segmentation based on dynamic embedding feature

CHEN Yadang; ZHAO Yibing; WU Enhua

doi:10.13700/j.bh.1001-5965.2023.0354

Volume 51 Issue 7

Jul. 2025


Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(7): 2253-2261.

Citation: CHEN Y D, ZHAO Y B, WU E H. Robust semi-supervised video object segmentation based on dynamic embedding feature[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(7): 2253-2261 (in Chinese). doi: 10.13700/j.bh.1001-5965.2023.0354


CHEN Yadang^{1, 2}, ZHAO Yibing^{1, 2}, WU Enhua^{3}

1. School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
2. Wuxi Research Institute, Nanjing University of Information Science and Technology, Wuxi 214100, China
3. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

Funds:

National Natural Science Foundation of China (62473201, 62477026); Preliminary Research Project on Leading Technologies by Wuxi Industrial Innovation Research Institute


Corresponding author: E-mail: adamchen@nuist.edu.cn
Received Date: 13 Jun 2023
Accepted Date: 01 Dec 2023

Available Online: 12 Jan 2024

Publish Date: 10 Jan 2024

Abstract

A semi-supervised video object segmentation (VOS) method is proposed to address two problems: memory consumption that grows during inference, and the difficulty of training when relying solely on low-level pixel features. The method is built on dynamic embedding features and an auxiliary loss function. First, dynamic embedding features are used to establish a constant-size memory bank: historical information is aggregated spatiotemporally to generate and update the dynamic embedding features. A memory update sensor adaptively controls the update interval of the memory bank to accommodate the different motion patterns found in different videos. Second, an auxiliary loss function guides the network at the level of high-level semantic features; providing guidance at multiple feature levels improves both model accuracy and training efficiency. Finally, to address mismatches between similar objects in the foreground and background of a video, a spatial constraint module is designed that exploits the temporal continuity of video to better capture the correlation between the previous frame's mask and the current frame. Experimental results show that the proposed method achieves 84.5% J&F on the DAVIS 2017 validation set and 82.4% J&F on the YouTube-VOS 2019 validation set.
Key words:
- video object segmentation
- spatiotemporal memory network
- spatiotemporal constraint
- memory update sensing
- dynamic embedding feature
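
The abstract describes a memory bank of constant size and an update interval that adapts to the motion in each video. The minimal Python sketch below illustrates that general idea only; it is not the authors' implementation. As stand-ins, it assumes a single embedding vector blended by an exponential moving average (in place of the paper's spatiotemporal aggregation) and a mean-absolute-difference threshold (in place of the memory update sensor). The class name `FixedSizeMemory`, the function `run_demo`, and the parameters `momentum` and `change_threshold` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FixedSizeMemory:
    """Toy constant-size memory: one embedding vector, updated in place.

    Assumption: a moving average stands in for spatiotemporal aggregation,
    and a simple change threshold stands in for the memory update sensor.
    """
    dim: int
    momentum: float = 0.9           # hypothetical blending weight
    change_threshold: float = 0.1   # hypothetical trigger for an update
    embedding: List[float] = field(default_factory=list)
    last_written: List[float] = field(default_factory=list)
    updates: int = 0

    def __post_init__(self) -> None:
        self.embedding = [0.0] * self.dim
        self.last_written = [0.0] * self.dim

    def should_update(self, feature: List[float]) -> bool:
        """Update on the first frame, or when the feature has changed enough."""
        if self.updates == 0:
            return True
        diff = sum(abs(a - b) for a, b in zip(feature, self.last_written))
        return diff / self.dim > self.change_threshold

    def observe(self, feature: List[float]) -> bool:
        """Offer one frame feature; return True if the memory was updated."""
        if len(feature) != self.dim:
            raise ValueError("feature has the wrong dimension")
        if not self.should_update(feature):
            return False
        if self.updates == 0:
            self.embedding = list(feature)
        else:
            m = self.momentum
            # Blend old and new; the list length (memory size) never changes.
            self.embedding = [m * e + (1.0 - m) * f
                              for e, f in zip(self.embedding, feature)]
        self.last_written = list(feature)
        self.updates += 1
        return True


def run_demo() -> Tuple[List[bool], FixedSizeMemory]:
    """Feed four toy frame features: two near-static, then a large change."""
    memory = FixedSizeMemory(dim=2)
    frames = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.0, 1.0]]
    flags = [memory.observe(f) for f in frames]
    return flags, memory
```

In this toy setting, slow-moving content triggers few updates and abrupt changes trigger more, while the storage cost stays fixed regardless of video length. A real system would store feature maps rather than a single vector and would learn when to update.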


