2025 Vol. 51, No. 7

Image super-resolution reconstruction network based on multi-scale spatial attention guidance
CHENG Deqiang, WANG Peijie, DONG Yanqiang, KOU Qiqi, JIANG He
2025, 51(7): 2185-2195. doi: 10.13700/j.bh.1001-5965.2023.0547
Abstract:

Attention-based image super-resolution reconstruction networks typically incorporate the attention mechanism directly into the network model, ignoring the heterogeneity of attention features and treating features at different levels uniformly. To address this problem, this study designs a novel multi-scale spatial attention guidance network (SAGN), which makes the following key contributions. Firstly, an enhanced feature extraction residual block (ERB) is proposed to strengthen the representation of local information. Secondly, a multi-scale spatial attention (MSA) module is incorporated to capture spatial attention features at various scales. Lastly, an attention-guided module (AGM) is introduced to assign individualized weights to different features, facilitating effective fusion of contextual global features and suppression of redundant information. Extensive experimental results on four benchmark datasets show that SAGN outperforms standard attention structures in both subjective visual perception and objective evaluation criteria. Notably, for ×4 reconstruction, SAGN achieves a peak signal-to-noise ratio (PSNR) on average 0.05 dB higher than that of the second-best model, further underscoring its efficacy in recovering image geometric structures and fine details.
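
To make the MSA idea concrete, here is a minimal PyTorch sketch of a multi-scale spatial attention block; the three dilation rates, the channel split, and the single-layer fusion are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MultiScaleSpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Parallel 3x3 convolutions with different dilation rates capture
        # spatial context at several receptive-field scales.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels // 4, 3, padding=d, dilation=d)
            for d in (1, 2, 4)
        ])
        # Fuse the multi-scale maps into a single-channel attention map.
        self.fuse = nn.Conv2d(3 * (channels // 4), 1, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        attn = self.sigmoid(self.fuse(feats))   # (N, 1, H, W)
        return x * attn                         # re-weight spatial positions

# Usage: y = MultiScaleSpatialAttention(64)(torch.randn(1, 64, 32, 32))
```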

Fuzzy logic and adaptive strategy for infrared and visible light image fusion
YANG Yong, LIU Jiaxiang, HUANG Shuying, WANG Xiaozheng, XIA Yukun
2025, 51(7): 2196-2208. doi: 10.13700/j.bh.1001-5965.2023.0383
Abstract:

Due to different imaging mechanisms, infrared imaging can capture target information under special conditions where the target is obstructed, while visible light imaging can capture the texture details of the scene. Therefore, to obtain a fused image containing both target information and texture details, infrared and visible light imaging are generally combined to facilitate visual perception and machine recognition. Based on fuzzy logic theory, an infrared and visible light image fusion method combining multistage fuzzy discrimination and an adaptive parameter fusion strategy (MFD-APFS) was proposed. First, the infrared and visible light images were decomposed into structural patches, obtaining contrast-detail image sets reconstructed from the signal intensity component. Second, the source images and contrast-detail image sets were processed through a designed fuzzy discrimination system, generating saliency maps for each set. A second-stage fuzzy discrimination was then applied to produce a unified saliency map. Finally, guided filtering was used, with the saliency map guiding the source image to obtain multiple decision maps. The final fused image was obtained through the adaptive parameter fusion strategy. The proposed MFD-APFS method was experimentally evaluated on publicly available infrared and visible light datasets. Compared to seven mainstream fusion methods, the proposed method shows improvements in objective metrics: on the TNO dataset, SSIM-F and QAB/F were improved by 0.169 and 0.1403, respectively, and on the RoadScenes dataset, by 0.1753 and 0.0537, respectively. Furthermore, subjective visual analysis indicates that the proposed method can generate fused images with clear targets and enriched details while retaining infrared target information and visible texture information.

Spatial information-enhanced indoor multi-task RGB-D scene understanding
SUN Guodong, XIONG Chenyun, LIU Junjie, ZHANG Yang
2025, 51(7): 2209-2217. doi: 10.13700/j.bh.1001-5965.2023.0391
Abstract:

To explore 3D space, mobile robots need to obtain a large amount of scene information, including semantics, object instances, and positional relationships. The accuracy and computational complexity of scene analysis are the two main concerns on mobile terminals. Therefore, a spatial information-enhanced multi-task learning method for indoor scene understanding was proposed. This method consists of an encoder with a channel-spatial attention fusion module and a decoder with multi-task heads for semantic segmentation, panoptic (instance) segmentation, and orientation estimation. The channel-spatial attention fusion module aims to enhance the modal characteristics of RGB and depth; its spatial attention mechanism, composed of simple convolutions, accelerates convergence and, after fusion with the channel attention mechanism, further strengthens the position features of global information. The context module of the semantic branch is located after the decoder, providing strong support for pixel-level semantic classification and helping to reduce the model size. A loss function based on hard parameter sharing was designed, enabling balanced training across tasks. The influence of an appropriate lightweight backbone network and of the number of tasks on scene understanding performance was discussed. Finally, the effectiveness of the proposed multi-task learning method was evaluated on the NYUv2 and SUN RGB-D indoor datasets with newly added label annotations. Results show that panoptic segmentation accuracy is improved by 2.93% and 4.87%, respectively.
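
As a hedged illustration of how channel and spatial attention can fuse RGB and depth features, the following sketch combines squeeze-and-excitation-style channel gating with a simple convolutional spatial gate; all layer choices are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze-and-excitation style gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: a simple convolution over pooled descriptors.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, depth], dim=1)
        x = x * self.channel_gate(x)                      # re-weight channels
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        x = x * self.spatial_gate(pooled)                 # re-weight positions
        return self.project(x)                            # fused feature
```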

Coordinate-aware attention-based multi-frame self-supervised monocular depth estimation
CHENG Deqiang, FAN Shuming, QIAN Jiansheng, JIANG He, KOU Qiqi
2025, 51(7): 2218-2228. doi: 10.13700/j.bh.1001-5965.2023.0417
Abstract:

A novel coordinate-aware attention-based multi-frame self-supervised monocular depth estimation method is presented to tackle the issue of blurry depth predictions near object edges in monocular depth estimation. Firstly, a coordinate-aware attention module is proposed to enhance the output features of the bottom layer of the encoder and improve the feature utilization of the cost volume. To improve object edges in the depth prediction results, a new pixel-shuffle-based depth prediction decoder is also proposed. This decoder can efficiently separate the multi-object fusion features in low-resolution encoder features. Experimental results on the KITTI and Cityscapes datasets demonstrate that the proposed method is superior to current mainstream methods, significantly improving subjective visual effects and objective evaluation indicators, with especially better depth prediction performance around object edge details.
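
The pixel-shuffle upsampling that such a decoder builds on can be sketched as follows; the channel counts and the sigmoid disparity head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PixelShuffleUp(nn.Module):
    """Upsample 2x by predicting r*r sub-pixel channels, then rearranging."""
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)  # (N, C*r^2, H, W) -> (N, C, rH, rW)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(x))

# A depth head could stack such blocks and finish with a sigmoid disparity map:
head = nn.Sequential(PixelShuffleUp(256, 64), nn.ReLU(inplace=True),
                     PixelShuffleUp(64, 16), nn.ReLU(inplace=True),
                     nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
```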

Railway panoptic segmentation based on recursive gating enhancement and pyramid prediction
CHEN Yong, ZHOU Fangchun, ZHANG Jiaojiao
2025, 51(7): 2229-2239. doi: 10.13700/j.bh.1001-5965.2023.0492
Abstract:

A railway panoptic segmentation network based on recursive gated enhancement and pyramid prediction is proposed to address insufficient target feature extraction and blurred edge contour segmentation in panoptic segmentation of high-speed railway scenes. On the basis of the DETR panoptic segmentation model, an improved multi-scale cascaded CSP-DarkNet53 feature network is first constructed to enhance the extraction of target features from railway scenes of different scales. Then, to improve the extraction and segmentation of edge contour information and acquire richer edge features, a recursive gating and class feature augmentation module is proposed. Next, deformable attention is introduced into the encoding backbone network to further capture context information and reduce the loss of segmentation details. Finally, by improving the pyramid prediction and pixel category segmentation module, the panoptic segmentation output for railway scenes is achieved. Experimental results show that the proposed approach improves the original DETR model's panoptic quality (PQ) by 7.4%, the foreground instance targets' panoptic quality (PQTh) by 9.7%, and the background regions' panoptic quality (PQSt) by 6.6%. The proposed method performs well in panoptic segmentation of railway scenes, and its subjective evaluation is superior to that of the comparison methods.

Cascaded object drift determination network for long-term visual tracking
HOU Zhiqiang, ZHAO Jiaxin, CHEN Yu, MA Sugang, YU Wangsheng, FAN Jiulun
2025, 51(7): 2240-2252. doi: 10.13700/j.bh.1001-5965.2023.0504
Abstract:

Aiming at the problems of manually selected thresholds and poor determination performance in existing object drift determination criteria, this paper proposes a cascaded object drift determination network with adaptive threshold selection. Firstly, two cascaded sub-networks determine whether the tracking results have drifted. The results are then jointly determined by the proposed network using a static template, a long-term template, and a short-term template. A long-term and short-term template update strategy is designed to guarantee template quality and adapt to the object's changing appearance during the determination process. Finally, the proposed network is combined with the short-term tracker TransT and the global re-detection method GlobalTrack to build a long-term tracking algorithm, TransT_LT. Performance tests on four datasets (UAV20L, LaSOT, VOT2018-LT, and VOT2020-LT) demonstrate better long-term tracking performance, particularly on the UAV20L dataset, where the proposed algorithm outperforms the benchmark algorithm by 7.7% and 10.3% in tracking success rate and accuracy, respectively. The determination speed of the proposed network is 100 frames per second, which has little effect on the speed of the long-term tracking algorithm.

Robust semi-supervised video object segmentation based on dynamic embedding feature
CHEN Yadang, ZHAO Yibing, WU Enhua
2025, 51(7): 2253-2261. doi: 10.13700/j.bh.1001-5965.2023.0354
Abstract:

A semi-supervised video object segmentation (VOS) method was proposed to address the issues of increasing memory consumption during inference and the difficulty of training relying solely on low-level pixel features. The method is based on dynamic embedding features and an auxiliary loss function. First, a dynamic embedding feature was employed to establish a constant-sized memory bank. Through spatiotemporal aggregation, historical information was utilized to generate and update dynamic embedding features. Simultaneously, a memory update sensor was employed to adaptively control the update interval of the memory bank, accommodating different motion patterns in various videos. Second, an auxiliary loss function was utilized to provide the network with guidance at the high semantic feature level, enhancing model accuracy and training efficiency by offering diverse guidance across multiple feature levels. Finally, to address the issue of misalignment between similar objects in the foreground and background of videos, a spatial constraint module was designed, which leveraged the temporal continuity of videos to better capture the correlation between the mask from the previous frame and the current frame. Experimental results demonstrate that the proposed method achieves a J&F accuracy of 84.5% on the DAVIS 2017 validation set and 82.4% on the YouTube-VOS 2019 validation set.

Analysis of image and text sentiment method based on joint and interactive attention
HU Huijun, DING Ziyi, ZHANG Yaofeng, LIU Maofu
2025, 51(7): 2262-2270. doi: 10.13700/j.bh.1001-5965.2023.0365
Abstract:

Image and text sentiment in social media is an important factor affecting public opinion and is receiving increasing attention in the field of natural language processing (NLP). Current analysis of image and text sentiment in social media has mainly focused on single image-text pairs, while little attention has been given to the non-chronological and diverse image-text pairs found in atlases. To explore the sentiment consistency between images and texts in an atlas, a method for analyzing image and text sentiment in social media based on joint and interactive attention (SA-JIA) was proposed. The method used RoBERTa and a bidirectional gated recurrent unit (Bi-GRU) to extract textual expression features and ResNet50 to obtain image visual features. Joint attention was employed to identify salient regions where image and text sentiment align, obtaining new textual and visual features. Interactive attention was utilized to focus on inter-modal feature interactions and multimodal feature fusion, finally obtaining the sentiment categories. Experimental validation on the IsTS-CN and CCIR20-YQ datasets shows that the proposed method can enhance the performance of image and text sentiment analysis in social media.

Semantic information-guided multi-label image classification
HUANG Jun, FAN Haodong, HONG Xudong, LI Xue
2025, 51(7): 2271-2281. doi: 10.13700/j.bh.1001-5965.2023.0382
Abstract:

Multi-label image classification aims to predict a set of labels for a given input image. Existing studies based on semantic information either use the correlation between the semantic and visual spaces to guide feature extraction toward effective feature representations, or use the correlation between the semantic and label spaces to learn weighted classifiers that capture label correlation. Most of these works use semantic information only as auxiliary information for exploiting the visual or label space; few exploit the correlations across the semantic, visual, and label spaces simultaneously. To solve this problem, a semantic information-guided multi-label image classification (SIG-MLIC) method was proposed. SIG-MLIC simultaneously utilizes the semantic, visual, and label spaces, generating semantically specific feature representations by associating image regions with labels through a semantic-guided attention (SGA) mechanism. In addition, the semantic information of labels was used to generate a semantic dictionary with label relevance constraints to reconstruct visual features, with the normalized representation coefficients taken as the probability of label occurrence. Experimental results on three standard multi-label image classification datasets show that both the attention mechanism and the dictionary learning in SIG-MLIC effectively improve classification performance, verifying the effectiveness of the proposed method.

Multi-object tracking algorithm based on dual-branch feature enhancement and multi-level trajectory association
MA Sugang, DUAN Shuaipeng, HOU Zhiqiang, YU Wangsheng, PU Lei, YANG Xiaobao
2025, 51(7): 2282-2289. doi: 10.13700/j.bh.1001-5965.2023.0472
Abstract:

Insufficient target feature extraction and target occlusion frequently occur in single-stage multiple object tracking (MOT) algorithms, resulting in a large number of identity switches and degraded tracking performance. A multi-object tracking algorithm based on dual-branch feature enhancement and multi-level trajectory association (MTA) is proposed to solve this problem. To alleviate the excessive competition between the detection and tracking branches and to fully extract task-specific features for detection and tracking, a dual-branch feature learning network is utilized to learn the specificity and relevance of both tasks. An association matrix (AM) is introduced to predict a more accurate offset vector for data association by learning the similarity relationship between two frames. To recover lost trajectories and achieve long-term target association, a hierarchical trajectory association strategy divides detections into high-score and low-score sets and associates each with the trajectories using different matching methods. On the typical multi-object tracking datasets MOT17 and MOT20, six related algorithms, including CenterTrack and QuasiDense, are compared. The algorithm's multiple object tracking accuracy (MOTA) and identity F1 score (IDF1) on MOT17 are 68.2% and 68.5%, respectively, 2.1% and 4.3% higher than those of the benchmark algorithm CenterTrack. On MOT20, the MOTA and IDF1 values are 52.7% and 48.2%, respectively, 1.4% and 7.9% higher. The algorithm better addresses the identity switching problem and achieves excellent tracking performance in complex scenarios.
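
A condensed sketch of a two-stage high/low-score association of the kind described above is given below; the IoU cost, the 0.6 score split, and the 0.7 matching gate are illustrative assumptions rather than MTA's exact settings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """IoU between track boxes and detection boxes, both as (x1, y1, x2, y2)."""
    t = np.asarray(tracks)[:, None, :]   # (T, 1, 4)
    d = np.asarray(dets)[None, :, :]     # (1, D, 4)
    lt = np.maximum(t[..., :2], d[..., :2])
    rb = np.minimum(t[..., 2:], d[..., 2:])
    inter = np.prod(np.clip(rb - lt, 0, None), axis=-1)
    area_t = np.prod(t[..., 2:] - t[..., :2], axis=-1)
    area_d = np.prod(d[..., 2:] - d[..., :2], axis=-1)
    return inter / (area_t + area_d - inter + 1e-9)

def associate(tracks, dets, scores, high=0.6):
    high_dets = [b for b, s in zip(dets, scores) if s >= high]
    low_dets = [b for b, s in zip(dets, scores) if s < high]
    matches, unmatched = [], list(range(len(tracks)))
    for pool in (high_dets, low_dets):   # stage 1: high scores, stage 2: low
        if not pool or not unmatched:
            continue
        cost = 1.0 - iou_matrix([tracks[i] for i in unmatched], pool)
        rows, cols = linear_sum_assignment(cost)
        keep = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 0.7]
        matches += [(unmatched[r], pool[c]) for r, c in keep]
        matched_rows = {r for r, _ in keep}
        unmatched = [unmatched[i] for i in range(len(unmatched))
                     if i not in matched_rows]
    return matches, unmatched            # (track, box) pairs and lost tracks
```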

Fast multi-slice MRI reconstruction algorithm based on transform learning
DUAN Jizhong, LIU Huan
2025, 51(7): 2290-2303. doi: 10.13700/j.bh.1001-5965.2023.0561
Abstract:

Due to the significant correlation between neighboring slices in two-dimensional (2D) multi-slice magnetic resonance data, higher-quality slice images can be reconstructed by exploiting the redundancy between slices. However, 2D multi-slice magnetic resonance imaging (MRI) requires a considerable amount of time. To improve the reconstruction quality and speed of 2D multi-slice MRI images, this paper proposes a fast 2D multi-slice MRI reconstruction algorithm (FMS-JTLHTC), which introduces a joint transform learning regularization term into the multi-slice Hankel tensor completion (MS-HTC) model. The alternating direction method of multipliers is used to solve the optimization problem, the fast iterative shrinkage-thresholding algorithm is introduced to accelerate convergence, and the graphics processing unit is used to speed up computation. Experiments on four brain datasets in two different sampling modes show that the peak signal-to-noise ratio (PSNR) of the FMS-JTLHTC algorithm is improved by an average of 4.04 dB, 3.67 dB, and 2.07 dB compared to the simultaneous auto-calibrating and k-space estimation (SAKE), low-rank modeling of local k-space neighborhoods with parallel imaging data (PLORAKS), and MS-HTC algorithms, respectively, and the reconstruction speed is improved by a factor of 14 compared to the MS-HTC algorithm.
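
The fast iterative shrinkage-thresholding acceleration mentioned above can be illustrated on a generic l1-regularized least-squares problem; the paper applies the same momentum idea inside its ADMM-based tensor completion solver, so this standalone sketch is only an analogy.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista(A, b, lam=0.1, iters=200):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 with Nesterov momentum."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)
        x_new = soft_threshold(y - grad / L, lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)  # momentum extrapolation
        x, t = x_new, t_new
    return x
```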

Image sentiment analysis by combining contextual correlation
LUO Gaifang, ZHANG Hao, XU Dan
2025, 51(7): 2304-2313. doi: 10.13700/j.bh.1001-5965.2023.0345
Abstract:

Image sentiment analysis aims to analyze the emotions conveyed by visual content. A key challenge in this field is bridging the affective gap between latent visual features and abstract emotions. Existing deep learning models attempt to address this issue by directly learning discriminative high-level emotional representations globally at once, but they overlook the hierarchical relationship between features at each layer of the deep model, resulting in a lack of correlation between contextual features. Therefore, this paper proposes a context-hierarchical interaction network (CHINet) to model the correlation between contextual information and sentiment within the hierarchy. The model consists of two branches. A bottom-up branch first directly learns the global emotional representation at the high-level semantic level; then, for different feature levels within the branch, it extracts style representations and localizes potential emotion activation regions via a shallow style encoder and an emotion activation attention mechanism. The extracted features are then cascaded into a pyramid structure as the top-down branch, modeling contextual hierarchical dependencies and providing shallow visual features for emotion representation. Finally, global and local learning integrate shallow image styles with high-level semantics. Experiments show that the proposed model improves emotion recognition accuracy on the FI dataset compared with related methods, including multi-level feature fusion methods and approaches incorporating local emotional regions.

Edge-intelligent transmission optimization of emergency surveillance video based on IcD-FDRL
LI Yan, WAN Zheng, DENG Chengzhi, WANG Shengqian
2025, 51(7): 2314-2329. doi: 10.13700/j.bh.1001-5965.2023.0378
Abstract:

Emergency surveillance video transmission is a key technical means of improving emergency handling capability in scenarios such as emergency monitoring, public security incident handling, and post-disaster reconstruction, and it has gradually become a key focus of research and development in the construction of the national smart emergency system. Building on the continued development of 5G and decision-making artificial intelligence technologies in recent years, an edge-intelligent transmission architecture for emergency surveillance video was established, aimed at public safety and emergency rescue monitoring in local areas and at achieving adaptive, high-quality transmission of emergency surveillance video. Furthermore, an importance measurement method for emergency surveillance video was designed, and an intra-clustered dynamic federated deep reinforcement learning (IcD-FDRL) algorithm was proposed. The proposed IcD-FDRL-based optimization method enhances the edge-intelligent transmission of emergency surveillance video, breaks down monitoring data silos, improves learning efficiency, and realizes low-delay, low-cost, high-quality, priority transmission of important emergency surveillance video. Finally, simulation experiments were performed and compared, verifying the effectiveness of the proposed model and algorithms.

Path planning for agents based on adaptive polymorphic ant colony optimization
XING Na, DI Haotian, YIN Wenjie, HAN Yajun, ZHOU Yang
2025, 51(7): 2330-2337. doi: 10.13700/j.bh.1001-5965.2023.0432
Abstract:

In the realm of intelligent agent path planning, the ant colony algorithm stands as a prominent path-solving strategy that has garnered extensive adoption. However, conventional ant colony algorithms suffer from local optima and excessive inflection points. The adaptive polymorphic ant colony optimization algorithm, through multi-colony partitioning and collaboration, remarkably enhances search and convergence speed, thereby bolstering global search capability and avoiding entrapment in local optima. To further increase the accuracy of path planning, this research proposes an enhanced pheromone updating technique and the construction of a path-selection record table. Lastly, cubic B-spline smoothing curves are adopted to effectively reduce inflection points and achieve path smoothness. Substantiated by MATLAB and robot operating system (ROS)-Gazebo simulations, the results demonstrate the algorithm's sound feasibility in complex environments. In conclusion, the global search of the intelligent agent is greatly optimized and enhanced by the proposed adaptive polymorphic ant colony method.
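
Cubic B-spline smoothing of a planned path can be sketched in a few lines with SciPy; the waypoints and smoothing factor below are placeholders.

```python
import numpy as np
from scipy.interpolate import splprep, splev

waypoints = np.array([[0, 0], [1, 2], [3, 2], [4, 4], [6, 5]], dtype=float)

# Fit a cubic (k=3) B-spline through the planned waypoints; s > 0 lets the
# curve deviate slightly from the raw path, removing sharp inflection points.
tck, _ = splprep([waypoints[:, 0], waypoints[:, 1]], k=3, s=0.5)
u = np.linspace(0, 1, 100)
xs, ys = splev(u, tck)   # densely sampled smooth path for the agent to follow
```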

Few-shot object detection of aerial image based on language guidance vision
ZHANG Zhi, YI Huahui, ZHENG Jin
2025, 51(7): 2338-2348. doi: 10.13700/j.bh.1001-5965.2023.0491
Abstract:

This paper presents a few-shot object detection method for aerial images based on language-guided vision, addressing the decreased detection accuracy of existing aerial image object detection caused by a lack of training data and by variations within aerial image datasets, such as changes in shooting angle, image quality, lighting conditions, and background environment, significant changes in object appearance, and the addition of new object categories. First, a word-region alignment branch replaces the classification branch of conventional object detection networks, and a word-region alignment classification score combining language and visual information is used as the predicted classification result. Then, object detection and phrase grounding are unified into a single task, leveraging language information to enhance visual detection accuracy. Furthermore, to address the accuracy fluctuations caused by changes in textual prompts during few-shot object detection, a language-visual bias network is designed to mine the association between language features and visual features. This network improves the alignment between language and vision, mitigates accuracy fluctuations, and further enhances few-shot detection accuracy. Extensive experimental results on UAVDT, Visdrone, AeriaDrone, VEDAI, and CARPK_PUCPR demonstrate the superior performance of the proposed method. On the UAVDT dataset, the method achieves an mAP of 14.6% at 30-shot. Compared with the aerial image detection algorithms clustered detection (ClusDet), density map guided object detection network (DMNet), global-local self-adaptive network (GLSAN), and coarse-grained density map network (CDMNet), its detection accuracy is 0.9%, −0.1%, −2.4%, and −2.2% higher under full-data training. The mAP reaches 58.0% on the PUCPR dataset at 30-shot, which is 1.0%, 0.8%, 0.1%, and 0.3% higher than that of the full-data-trained generic object detection methods fully convolutional one-stage object detector (FCOS), adaptive training sample selection (ATSS), generalized focal loss V2 (GFLV2), and VarifocalNet (VFNET), respectively. These results highlight the robust few-shot generalization and transfer capabilities of the proposed method.

An object detection algorithm based on feature enhancement and adaptive threshold non-maximum suppression
MENG Weijun, AN Wen, MA Sugang, YANG Xiaobao
2025, 51(7): 2349-2359. doi: 10.13700/j.bh.1001-5965.2023.0534
Abstract:

To further solve the problems of object omission and repeated detection and improve the accuracy of object detection, this paper proposes an object detection algorithm based on feature enhancement and adaptive threshold non-maximum suppression (NMS). The attention-guided multi-scale context module (AMCM) is applied to the neck of the detector: building on the semantic enrichment of features by dilated convolution, cross-channel location information is captured by the attention mechanism to enhance the feature expression ability of the network. A dynamic suppression threshold is adaptively applied to the instances in a scene through the adaptive density threshold NMS (ADT-NMS), which lowers the false detection rate. In comparison to the baseline algorithm YOLOv4, the proposed approach's false detection rate on the PASCAL VOC dataset is 13.7%, a 1% decrease. The recall rate and detection accuracy increase by 0.9% and 1.7%, to 96.6% and 83.7%, respectively. The false detection rate on the KITTI dataset is 22.1%, reduced by 1.3%, and the detection accuracy and recall rate reach 83.6% and 91.8%, improved by 1.8% and 2.3%, respectively. The experimental results show that the algorithm better solves the problems of object omission and repeated detection.
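
A sketch of a density-adaptive NMS in the spirit of ADT-NMS follows; defining density as the mean IoU with other candidates and clipping the threshold to [0.5, 0.8] are assumptions for illustration.

```python
import numpy as np

def box_iou(a, b):
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.prod(np.clip(rb - lt, 0, None), axis=-1)
    area = lambda x: np.prod(x[:, 2:] - x[:, :2], axis=-1)
    return inter / (area(a)[:, None] + area(b)[None, :] - inter + 1e-9)

def adaptive_nms(boxes, scores, base_thr=0.5, max_thr=0.8):
    order = np.argsort(scores)[::-1]   # highest score first
    boxes = boxes[order]
    iou = box_iou(boxes, boxes)
    # Local density of each box: mean overlap with all other candidates.
    density = (iou.sum(axis=1) - 1.0) / max(len(boxes) - 1, 1)
    alive = np.ones(len(boxes), dtype=bool)
    keep = []
    for i in range(len(boxes)):
        if not alive[i]:
            continue
        keep.append(order[i])
        # Raise the suppression threshold in crowded regions so that heavily
        # overlapping true objects are not wrongly suppressed.
        thr = float(np.clip(base_thr + density[i], base_thr, max_thr))
        for j in range(i + 1, len(boxes)):
            if iou[i, j] > thr:
                alive[j] = False
    return keep   # indices into the original boxes/scores arrays
```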

Learning Harris Hawks optimization algorithm with signal-to-noise ratio
ZHANG Lin, SHEN Jiaying, HU Chuanlu, ZHU Donglin
2025, 51(7): 2360-2373. doi: 10.13700/j.bh.1001-5965.2023.0433
Abstract:

Aiming at the problems of insufficient population learning and adaptability in the Harris hawks optimization (HHO) algorithm, this paper proposes a signal-to-noise-ratio-based learning Harris hawks optimization (SLHHO) algorithm. Using the signal-to-noise ratio as a metric to assess individual position information, the algorithm creates a coordinated learning strategy that updates the positions of individuals within the population more realistically. It then reworks the escape distance to enhance the algorithm's capacity for adaptation and optimization seeking. With 12 benchmark functions as the standard, the proposed algorithm was compared against variants of the Harris hawks algorithm and other algorithms, and analyzed on evaluation indexes such as time complexity, diversity, and exploration versus exploitation. The results show that the SLHHO algorithm is highly competitive and feasible, and its practicality is finally verified on the pressure vessel design problem.
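
One way to picture SNR-guided coordinated learning is the toy update below, where low-SNR individuals are pulled more strongly toward the current best; the SNR definition and update form are invented for illustration and are not the paper's equations.

```python
import numpy as np

def snr_scores(pop):
    """Per-individual SNR: distance from the population mean (signal)
    relative to the population's overall dispersion (noise)."""
    mean = pop.mean(axis=0)
    signal = np.linalg.norm(pop - mean, axis=1)
    noise = pop.std() + 1e-12
    return signal / noise

def coordinated_learning_step(pop, fitness, lr=0.5):
    best = pop[np.argmin(fitness)]
    snr = snr_scores(pop)
    # High-SNR individuals (far from the crowd) keep exploring; low-SNR
    # ones are pulled more strongly toward the current best position.
    pull = lr * (1.0 / (1.0 + snr))[:, None]
    return pop + pull * (best - pop) + 0.01 * np.random.randn(*pop.shape)
```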

Adaptive search window reconstruction for low-delay video compressive sensing
SUN Renhui, LIU Hao, DENG Kailian, YAN Shuai
2025, 51(7): 2374-2383. doi: 10.13700/j.bh.1001-5965.2023.0333
Abstract:

For distributed video compressive sensing, the inter-frame multi-hypothesis prediction offers low computational complexity at the encoding end and good restoration quality for non-key frames at the decoding end. In recent years, many optimization algorithms related to it have been proposed. However, in existing algorithms, the search window of the hypothesis set is a square area whose size is empirically fixed. To further improve the quality of the hypothesis set and reduce the delay at the decoding end, this paper proposed a reconstruction algorithm that adaptively adjusted the position and size of the search window. The proposed algorithm first quickly determined the motion vector between adjacent non-key frames using the optical flow method. Then, by combining the motion vector and the motion information between the forward adjacent non-key frame and the key frame, the central block position of the search window in the key frame was determined. Finally, based on the relative position relationship between the current reconstruction block and the central block of the search window, a rectangular search window that aligned with motion changes could be determined adaptively. In the paper, several video sequences were experimentally analyzed within a low-delay framework. The results show that the proposed algorithm can effectively improve the restoration quality of non-key frames and reduce the runtime.

Efficient weakly-supervised video moment retrieval algorithm without multimodal fusion
JIANG Xun, XU Xing, SHEN Fumin, WANG Guoqing, YANG Yang
2025, 51(7): 2384-2393. doi: 10.13700/j.bh.1001-5965.2023.0379
Abstract:

The weakly-supervised video moment retrieval (WSVMR) task retrieves the start and end points of specific events from untrimmed videos based on a natural language query, using a deep learning model trained on video-text matching relationships. Most existing WSVMR algorithms use multimodal fusion mechanisms to understand video content for moment retrieval. However, the cross-modal interactions required for effective fusion are quite complex, and this process can only begin once a user query is received. These factors hinder the operational efficiency of such algorithms and limit their deployment in multimedia applications. To address this issue, a novel fusion-free multimodal alignment network (FMAN) algorithm for rapid WSVMR was proposed. By restricting the complex cross-modal interaction computation to the training phase, the algorithm enables offline encoding of both video and text data, thereby significantly improving the inference speed of video moment retrieval. Experimental results on the Charades-STA and ActivityNet-Captions datasets show FMAN's superior retrieval performance and efficiency compared to existing methods. On the retrieval performance metrics, specifically the recall rates R1 and R5, it shows average improvements of 2.66% and 1.57% on Charades-STA and of 0.19% and 3.35% on ActivityNet-Captions, respectively. Additionally, the proposed algorithm reduces online floating-point operations to no more than 1% of those of the original algorithms.

Lightweight neural network design for infrared small ship detection
TANG Wenting, LI Bo, JI Mengqi
2025, 51(7): 2394-2403. doi: 10.13700/j.bh.1001-5965.2024.0747
Abstract:

A lightweight neural network design method is proposed to efficiently represent small ships in infrared remote sensing images. To improve the representation effect of infrared dim and small targets, a method for simulating the visual receptive field adjustment mechanism that incorporates multi-scale receptive field perception and selection processes is proposed. This method is inspired by the visual attention-driven receptive field adjustment mechanism. A lightweight feature selection operator is devised to enhance the receptive field selection, and feature reuse and convolution kernel decomposition are used to optimize the multi-scale receptive field perception process in order to further increase efficiency. Experimental results on an infrared dim and small ship detection dataset show that the network detection accuracy increased by 2%, with a reduction of 2.3×106 parameters and 9.1×109 computations compared to general lightweight networks. In complex scenarios with similar ground interference, this method effectively reduces false alarms and suppresses missed detections.

Omnidirectional image quality assessment based on adaptive multi-viewport fusion
FENG Chenxi, ZHANG Di, LIN Gan, YE Long
2025, 51(7): 2404-2414. doi: 10.13700/j.bh.1001-5965.2023.0381
Abstract:

Existing omnidirectional image quality assessment (OIQA) models extract local features from each viewport independently, increasing computational complexity and making it difficult to describe the correlations between viewports using an end-to-end fusion model. To solve these issues, a quality assessment method was proposed based on feature sharing and adaptive fusion of multiple viewports. By utilizing shared backbone networks, the method transformed the viewport segmentation and computation that were independent of each other to the feature domain, enabling local feature extraction of the image through one-shot feed-forward computation. In addition, a viewport segmentation method in the feature domain using spherical uniform sampling was employed to guarantee consistent pixel density between view space and observation space, with semantic information guiding the adaptive fusion of local quality features of viewpoints. The Pearson linear correlation coefficient (PLCC) and Spearman rank order correlation coefficient (SRCC) on the compressed virtual reality image quality (CVIQ) and OIQA datasets were both above 0.96, showing superior performance compared with mainstream evaluation methods. Compared with the traditional evaluation method structural similarity index measure (SSIM), its average PLCC and average SRCC on the above two datasets were improved by 9.52% and 8.7%, respectively; compared with the latest evaluation method multi-perceptual features image quality assessment (MPFIQA), its average PLCC and average SRCC were improved by 1.71% and 1.44%, respectively.

Multi-source remote sensing image classification based on wavelet transform and parallel attention
WANG Jiayi, GAO Feng, ZHANG Tiange, GAN Yanhai
2025, 51(7): 2415-2422. doi: 10.13700/j.bh.1001-5965.2023.0329
Abstract:

Exploring the dependency relationships of multi-source remote sensing image data features to leverage the complementary advantages between different modalities has become a prominent research direction in the field of remote sensing. Existing joint classification tasks of hyperspectral and synthetic aperture radar (SAR) data face two key challenges: insufficient feature extraction and representation in images, resulting in the loss of high-frequency information, which hinders subsequent classification tasks, and limited interaction among multi-source image features and weak correlation between multimodal features. To address these challenges, research was conducted on robust representation of image features and efficient correlation of multi-source features, and a multi-source remote sensing image classification method based on wavelet transform and parallel attention mechanism (WPANet) was proposed. The feature extractor based on wavelet transform could effectively utilize frequency domain analysis techniques, capturing coarse- and fine-grained features during the process of reversible downsampling. The feature fuser based on the parallel attention mechanism comprehensively integrated the consistency and differences of multimodal remote sensing data, accomplishing the fusion and generation of highly correlated features to enhance classification accuracy. Experimental results on two real multi-source remote sensing datasets demonstrate the significant advantages of the proposed classification method. The overall accuracy on the Augsburg and Berlin datasets reaches 90.40% and 76.23%, respectively, with at least a 2.66% and 12.22% improvement in overall accuracy compared to mainstream methods like depthwise feature interaction network (DFINet) on the two datasets.
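
The reversible wavelet downsampling that motivates such a feature extractor can be demonstrated with PyWavelets; the Haar basis here is an assumption.

```python
import numpy as np
import pywt

x = np.random.rand(64, 64).astype(np.float32)   # one channel of a feature map
cA, (cH, cV, cD) = pywt.dwt2(x, 'haar')         # each band is 32x32

# Stacking the approximation and three detail bands channel-wise halves the
# resolution without losing information, keeping high-frequency content.
bands = np.stack([cA, cH, cV, cD])              # (4, 32, 32)

# The inverse transform recovers the input, confirming reversibility.
x_rec = pywt.idwt2((cA, (cH, cV, cD)), 'haar')
assert np.allclose(x, x_rec, atol=1e-5)
```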

Saliency-aware triple-regularized correlation filter algorithm for UAV object tracking
HE Bing, WANG Fasheng, WANG Xing, SUN Fuming
2025, 51(7): 2423-2436. doi: 10.13700/j.bh.1001-5965.2023.0362
Abstract:

Object tracking in unmanned aerial vehicle (UAV) scenarios has been widely applied in many real-world tasks. Different from general object tracking, UAV object tracking is more easily affected by complex environmental interferences and computational limitations. In this paper, a saliency-aware triple-regularized correlation filter (TRCF) for UAV object tracking was proposed. An efficient saliency detection algorithm was used to dynamically generate a dual-spatial regularizer to suppress boundary effects and apply a penalty to irrelevant background noise coefficients. Temporal regularization was introduced to address the filter degradation problem caused by appearance variation, providing a more robust appearance model. In addition, a lightweight deep network CF-VGG was employed to extract deep features, which were linearly fused with hand-crafted features to describe the object’s semantic information with improved tracking accuracy. Experiments were conducted on five publicly available UAV benchmark datasets. Results show that compared to the baseline methods, the proposed method demonstrates improvements on the five datasets, proving its effectiveness and robustness. The method also shows a real-time tracking speed of about 21 frames per second, making it suitable for UAV object-tracking tasks.

Aerial image stitching algorithm based on unsupervised deep learning
LIANG Zhenfeng, XIA Haiying, TAN Yumei, SONG Shuxiang
2025, 51(7): 2437-2449. doi: 10.13700/j.bh.1001-5965.2023.0366
Abstract:

Traditional image stitching approaches predominantly depend on accurate feature localization and distribution, which leads to suboptimal robustness in intricate aerial photography contexts. Consequently, a comprehensive unsupervised deep learning framework for aerial image stitching was devised, encompassing an unsupervised deep homography estimation network and an unsupervised image fusion network. First, the deep homography estimation network was employed to provide precise alignment for subsequent stitching by estimating the homographic transformation between the reference and target images. Subsequently, the image fusion network was utilized to learn the deformation patterns of aerial image stitching, generating the final stitched output. Additionally, a real dataset for unsupervised aerial image stitching was introduced to facilitate training of the framework. Comparative analysis was conducted on the proposed unmanned aerial vehicle aerial image dataset against scale-invariant feature transform (SIFT) + RANSAC, accelerated nonlinear-diffusion-based feature detection and matching (AKAZE) + boosted efficient binary local image descriptor (BEBLID), oriented FAST and rotated BRIEF (ORB) + RANSAC, and deep-learning-based image stitching algorithms. Experiments show that structural similarity (SSIM) is increased by 39.94%, peak signal-to-noise ratio (PSNR) is increased by 36.55%, and root mean square error (RMSE) is reduced by 66.09%. Moreover, the proposed method demonstrates superior visual stitching performance and robustness in real aerial scenarios compared to existing deep-learning-based and traditional image stitching methods.
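
For reference, the classical SIFT + RANSAC baseline that the framework is compared against can be sketched with OpenCV as follows; the file names and the 0.75 ratio-test threshold are placeholders.

```python
import cv2
import numpy as np

ref = cv2.imread('reference.jpg', cv2.IMREAD_GRAYSCALE)
tgt = cv2.imread('target.jpg', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
k1, d1 = sift.detectAndCompute(ref, None)
k2, d2 = sift.detectAndCompute(tgt, None)

# Ratio-test matching, then robust homography estimation with RANSAC.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(d1, d2, k=2)
        if m.distance < 0.75 * n.distance]
src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp the reference into the target frame; blending is left out for brevity.
warped = cv2.warpPerspective(ref, H, (tgt.shape[1] * 2, tgt.shape[0]))
```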

An adaptive automatic construction algorithm for sentiment dictionaries based on semantic rules
WEI Qinglan, HE Yu, SONG Jinbao
2025, 51(7): 2450-2459. doi: 10.13700/j.bh.1001-5965.2023.0367
Abstract:

Although text sentiment analyses using dictionaries are efficient and unsupervised, their accuracy relies heavily on dictionary quality. The quality of existing Chinese sentiment dictionaries in cross-domain applications needs to be improved, as manually constructed Chinese dictionaries fail to discover new words automatically and include emotionally ambiguous ones. In this paper, a domain-adaptive automatic construction algorithm for Chinese sentiment dictionaries based on semantic rules was proposed. A Chinese sentiment-fixed dictionary was constructed, which effectively eliminates sentimental ambiguity. A novel domain-adaptive method for discovering new Chinese words was proposed, enabling automatic expansion of the Chinese dictionary for general domains. An innovative unsupervised framework for recognizing word sentiment based on part-of-speech filtering and semantic rules helps achieve higher sentiment detection accuracy. Experiments prove that, on a common computer corpus, sentiment analysis using the sentiment-fixed dictionary achieves an average accuracy improvement of 9.31%, a precision improvement of 12.77%, and an F1 improvement of 7.43% compared to other general Chinese dictionaries. Meanwhile, on two datasets, the hotel and Chinese sentiment analysis corpora, the proposed automatic dictionary construction algorithm improves accuracy by an average of 7.41%, recall by 12.23%, and F1 by 9.08% compared to advanced methods.

Continual learning method based on differential feature distillation for multimodal network
HE Chiyuan, CHENG Shaoxu, XU Linfeng, MENG Fanman, WU Qingbo
2025, 51(7): 2460-2467. doi: 10.13700/j.bh.1001-5965.2023.0369
Abstract:

Continual learning has become a new research hotspot in recent years. However, in continual learning with multimodal architectures, the data are generally not fully utilized, resulting in catastrophic forgetting and obstructed learning. To address these issues, a multimodal continual learning method based on differential feature distillation was proposed. By focusing on differences in task performance between modalities, this method retains more or less old knowledge for each modality, stimulating each modality's potential to explore discriminative features from an overall perspective. Experiments on the multimodal behavior recognition dataset UESTC-MMEA-CL validated the effectiveness of this method. By the eighth task, the proposed method improved accuracy by an average of 22.0% and 20.1% over fine-tuning and learning without forgetting (LwF), respectively. Compared with the classic knowledge distillation method, the proposed method better utilizes the sensor modality, thereby significantly alleviating the catastrophic forgetting of multimodal networks.
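
A minimal sketch of per-modality feature distillation in the spirit described above follows; the normalized-MSE form and the per-modality weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def modality_distill_loss(old_feats, new_feats, weights):
    """old_feats/new_feats: dict modality -> (N, D) features from the frozen
    old model and the current model; weights: modality -> scalar in [0, 1]."""
    loss = 0.0
    for m in old_feats:
        # L2 distance between normalized features; a larger weight preserves
        # more old knowledge for modalities that performed well on past tasks.
        loss = loss + weights[m] * F.mse_loss(
            F.normalize(new_feats[m], dim=1),
            F.normalize(old_feats[m], dim=1))
    return loss
```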

Cover selection method for batch image steganography based on multivariate optimization
WANG Yangguang, YAO Yuanzhi, YU Nenghai
2025, 51(7): 2468-2477. doi: 10.13700/j.bh.1001-5965.2023.0380
Abstract:

Batch image steganography provides an effective means for covert communication on social networks by embedding secret messages into multiple cover images through cover selection. Compared with traditional image steganography, a key challenge of batch image steganography lies in designing an effective cover selection method while ensuring good undetectability performance. In this paper, a multivariate optimization-based cover selection method for batch image steganography was proposed. By jointly analyzing embedding distortion, image correlation, and embedding capacity, this method modelled the cover selection for batch image steganography as a multivariate optimization problem. Additionally, to enhance the batch image steganography robustness against the possible compression of images after steganography by social networks, a secret message fragmentation and reassembly strategy was designed. Experimental results demonstrate that the proposed method for batch image steganography achieves satisfactory performance in undetectability, embedding capacity, and robustness, providing technical support for covert communication on social networks.

Pedestrian attribute recognition algorithm based on multi-label adversarial domain adaptation
HU Qiangliang, CHEN Lin, SHANG Mingsheng
2025, 51(7): 2478-2487. doi: 10.13700/j.bh.1001-5965.2023.0386
Abstract:

Current unsupervised domain adaptation algorithms usually consider only single-label learning and fail to adapt to the multi-label classification tasks of pedestrian attribute recognition. To address this issue, a multi-label adversarial domain adaptation algorithm was proposed for pedestrian attribute recognition. First, to adapt to the domain transfer task of multiple attribute labels, a multi-label feature disentanglement module was employed to effectively disentangle the attribute-specific components of the deep features extracted by the backbone network, based on category-specific semantics. Second, to reduce the gap between attribute feature distributions in different domains, a multi-label domain discrimination module based on classifier reuse was proposed to achieve both multi-attribute domain alignment and multi-label classification, effectively exploiting the predicted discrimination information to capture the multi-mode structure of the feature distribution. Experimental results show that the proposed algorithm achieves better results than the baseline model, with improvements of 4.49%, 5.5%, 11.44%, and 5.89% in mean accuracy, accuracy, recall, and F1, respectively. The proposed algorithm provides new insight for multi-label domain adaptation learning.
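
Adversarial domain alignment of this kind is typically trained with a gradient reversal layer (GRL); whether the paper uses exactly this operator is not stated, so the snippet below is a standard illustration rather than the authors' implementation.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity in the forward pass, negated (scaled) gradient in the
        # backward pass: the feature extractor learns to fool the domain
        # discriminator, aligning feature distributions across domains.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# Usage: domain_logits = discriminator(grad_reverse(features, lam=0.5))
```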

Dual-channel vision Transformer-based image style transfer
JI Zongxing, BEI Jia, LIU Runze, REN Tongwei
2025, 51(7): 2488-2497. doi: 10.13700/j.bh.1001-5965.2023.0392
Abstract:

Image style transfer aims to adjust the visual properties of a content image based on a style reference image, preserving the original content while presenting specific styles to generate visually appealing stylized images. Most existing representative methods focus on extracting local image features without considering the encoding differences between different image domains or the importance of global contextual information. To address this issue, Bi-Trans, a novel image style transfer method based on a dual-channel vision Transformer was proposed. This method encoded the content and style image domains independently, extracting style parameter vectors to discretely represent the image style. By using a cross-attention mechanism and conditional instance normalization (CIN), the content image was calibrated to the target style domain, generating the stylized image. Experimental results demonstrate that the proposed method is superior to existing methods in terms of both content preservation and style restoration.
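
The conditional instance normalization (CIN) referenced above can be sketched as follows: instance statistics are normalized away and replaced by a scale and shift predicted from the style parameter vector; dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CIN(nn.Module):
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Linear(style_dim, channels)
        self.to_beta = nn.Linear(style_dim, channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # style: (N, style_dim) parameter vector extracted from the style image
        gamma = self.to_gamma(style).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(style).unsqueeze(-1).unsqueeze(-1)
        # Normalize per-instance statistics, then re-stylize.
        return gamma * self.norm(x) + beta
```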

A rotated content-aware retina network for SAR ship detection
WANG Ziyi, YIN Jiahao, HUANG Bobin, GAO Feng
2025, 51(7): 2498-2505. doi: 10.13700/j.bh.1001-5965.2023.0394
Abstract:

Current synthetic aperture radar (SAR) ship detection methods primarily face two challenges: 1) variable target sizes and abundant interfering factors; 2) multiple target orientations and a limited quantity of training samples. To address these issues, this paper introduces a rotated content-aware retina network for SAR ship detection, RCAR-Net. Firstly, the backbone network employs PVTv2, a multi-scale Transformer architecture, which better preserves the local continuity of feature maps while enhancing the integration of multi-scale image features. In conjunction, combining rotated bounding boxes with RetinaNet effectively reduces background redundancy and noise interference. The model's generalizability and robustness are further enhanced by the Cutout data augmentation technique, which enlarges the dataset through partial occlusion of existing samples. Finally, the efficient CARAFE operator is used to upsample low-resolution feature maps, improving multi-scale fusion while reducing computational and memory costs and maintaining detection accuracy. RCAR-Net achieves an average precision of 93.63% and 90.37% on the SSDD and HRSID SAR ship detection datasets, respectively, significantly outperforming current methods such as DPAN and PANet and demonstrating strong adaptability to changes in target size and noise interference.

Improved YOLOv7 method for small target detection in aerial photography
LIU Yinuo, ZHANG Qi, WANG Rong, LI Chong
2025, 51(7): 2506-2512. doi: 10.13700/j.bh.1001-5965.2023.0411
Abstract:

This paper proposes an improved YOLOv7-based small target detection method for aerial imagery to address the high rates of missed and false detections of current detection technologies on aerial small target detection tasks. First, a CBAM fusion attention mechanism is incorporated into the backbone network, which allocates weights across both the spatial and channel dimensions of the feature map, suppressing background interference and improving detection accuracy. Second, an SPD-Conv module replaces the original module's strided convolution and pooling layers, improving the efficiency of feature representation learning and mitigating the loss of fine-grained information in low-resolution images and small target detection. Finally, the improved YOLOv7 is evaluated on a processed DOTA aerial dataset. According to the results, it achieves 83.7% precision, 78.2% recall, and 81.5% average accuracy on the dataset, outperforming the original YOLOv7 by 3.1%. The improved algorithm effectively reduces missed and false detections, demonstrating strong performance.
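
The SPD-Conv idea can be sketched as a space-to-depth rearrangement followed by a stride-1 convolution, so downsampling discards no pixels; the channel sizes below are illustrative.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # After space-to-depth with scale 2, channels grow 4x; a stride-1
        # convolution then mixes them without losing fine-grained information.
        self.conv = nn.Conv2d(4 * in_ch, out_ch, 3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rearrange each 2x2 spatial block into the channel dimension
        # (assumes even spatial size).
        tl, bl = x[..., ::2, ::2], x[..., 1::2, ::2]
        tr, br = x[..., ::2, 1::2], x[..., 1::2, 1::2]
        return self.conv(torch.cat([tl, tr, bl, br], dim=1))

# (N, C, H, W) -> (N, out_ch, H/2, W/2) with all input pixels preserved.
```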

A dense pedestrian tracking method based on fusion features under multi-vision
HUANG Yujie, CHEN Kai, WANG Ziyuan, WANG Ziteng
2025, 51(7): 2513-2525. doi: 10.13700/j.bh.1001-5965.2023.0416
Abstract:

Many multi-object pedestrian tracking algorithms have been proposed in computer vision, and great progress has recently been made in tracking efficiency and accuracy. However, practical applications are severely hampered by the fact that most existing tracking techniques still cannot handle object occlusion and reappearance across camera perspectives. To tackle these problems in dense crowds under multiple views, a multi-target pedestrian tracking method based on fused-feature association is proposed. The feature pool is updated based on a Gaussian mixture model (GMM) to reduce feature pollution caused by dense crowds. To ensure tracking universality, the similarity threshold of target features is calculated dynamically based on K-means. The similarity of fused features is used to associate pedestrian features, with a homography constraint check to determine the addition and reappearance of pedestrians, which reduces false and missed tracking. Experiments with several algorithms on the public Shelf dataset indicate that the proposed method's average accuracy is 16.05% and 7.39% higher than that of the other methods, while its average success rate is 16.04% and 4.16% higher. The average false tracking rate over the complete video is 10.11%, showing significant gains in controlling mistracking, and the method effectively re-associates a pedestrian with the original ID after reappearance.

Semi-supervised image retrieval based on triplet hash loss
SHAO Weizhi, XIONG Siyu, PAN Lili
2025, 51(7): 2526-2537. doi: 10.13700/j.bh.1001-5965.2023.0451
Abstract:

Currently, most deep-learning-based image retrieval methods are supervised techniques, which require massive labeled data. However, labeling so much data in real applications is difficult and expensive. Furthermore, existing triplet loss functions are computed using Euclidean distance, so networks learn image similarity poorly. In this work, a novel semi-supervised hash image retrieval model (SSITL) is proposed that combines pseudo-labeling with entropy minimization, a triplet hash loss, and semi-supervised learning. Multi-stage model union and a sharpening technique are used to generate pseudo-labels, which are processed with entropy minimization to improve their confidence. A triplet hash loss based on a channel weight matrix helps SSITL learn image similarity, while triplets are chosen according to the clustering results of labeled and unlabeled data. To generate better hash codes, MixUp is used to blend two Hamming embeddings into a new Hamming embedding for image retrieval. Extensive experimental results show that, compared with other methods, SSITL improves the average retrieval accuracy by 1.2% and 0.7% on the CIFAR-10 and NUS-WIDE datasets, respectively, at similar time cost, which strongly demonstrates that SSITL is an excellent semi-supervised hashing framework for image retrieval.
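
An illustrative triplet hash loss on tanh-relaxed codes is sketched below; the paper's channel weight matrix is omitted, and the margin is an assumed value.

```python
import torch
import torch.nn.functional as F

def triplet_hash_loss(anchor, positive, negative, margin: float = 8.0):
    """anchor/positive/negative: (N, K) real-valued network outputs treated
    as relaxed binary codes after tanh."""
    a, p, n = torch.tanh(anchor), torch.tanh(positive), torch.tanh(negative)
    # For codes in {-1, 1}^K the Hamming distance is (K - <u, v>) / 2, so the
    # inner product serves as a differentiable surrogate for it.
    d_ap = (a.size(1) - (a * p).sum(dim=1)) / 2
    d_an = (a.size(1) - (a * n).sum(dim=1)) / 2
    # Push negatives at least `margin` Hamming units farther than positives.
    return F.relu(d_ap - d_an + margin).mean()
```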

Identification of induced information for personalized recommendations based on knowledge graph
NI Wenkai, PENG Shufan, DU Yanhui
2025, 51(7): 2538-2552. doi: 10.13700/j.bh.1001-5965.2023.0475
Abstract:

In the age of intelligent information, managing the recommendations of Internet information service algorithms is a crucial step in building a national Internet governance framework. Personalized recommendation is one of the important technologies of Internet information service algorithm recommendation, and knowledge graphs are widely used in personalized recommendation algorithms. At the same time, knowledge graphs and recommendation algorithms are vulnerable to data poisoning attacks, which distort recommendation results and induce information dissemination, and effective models for identifying this type of induced information are lacking. On this basis, this paper studies an induced information identification model. Based on the analysis of user historical behavior records and the evolution of user preferences, an induced information detection method based on user interests and group perception is studied: group preference modeling is performed on the historical preferences of similar user groups, and outlier analysis is applied to abnormally exposed information within groups sharing common characteristics. An induced information identification model, NGL, is constructed, which incorporates node2vec-side item representation, Gaussian mixture model (GMM) group division, and LUNAR-based anomaly detection, and which realizes induced information identification on the basis of user preference modification and recommendation result evolution reasoning. Induced information identification experiments were conducted on the RippleNet and MKR recommendation systems. The results show that the proposed NGL model outperforms existing anomaly detection models.

Camouflaged object detection network based on human visual mechanisms
ZHANG Dongdong, WANG Chunping, FU Qiang
2025, 51(7): 2553-2561. doi: 10.13700/j.bh.1001-5965.2023.0511
Abstract:

Camouflaged object detection is a new visual detection task whose goal is to identify camouflaged targets that are fully disguised in their environment, with applications in many fields. To address the failure of current camouflaged object detection algorithms to accurately and completely identify object structures and boundaries, this paper designs a bio-inspired framework based on the human visual perception process when observing camouflaged images, named the positioning and refinement network (PRNet). Res2Net is used to extract the original features of the image and mine the edge cues of the target from multi-level information. A feature enhancement module is specially designed to expand the receptive field while enriching global contextual information. Then, a positioning module utilizes a dual-attention mechanism to locate the approximate position of the target along both channel and spatial dimensions. Lastly, a refinement module leverages multiple types of information to further refine the target structure and edges by concentrating on target cues in both the foreground and background. Extensive experimental results on three widely used camouflaged object detection benchmarks demonstrate that the overall performance of the proposed network significantly outperforms 14 comparison algorithms and holds up well in a variety of complex scenarios.

A lightweight semantic VSLAM approach based on adaptive thresholding and speed optimization
QI Hao, FU Yuexin, HU Zhuhua, WU Jiaqi, ZHAO Yaochi
2025, 51(7): 2562-2572. doi: 10.13700/j.bh.1001-5965.2023.0552
Abstract:

Visual simultaneous localization and mapping (VSLAM) is a technology that uses visual and other sensors to acquire information about unknown environments; it is widely applied in fields such as autonomous driving, robotics, and augmented reality. However, pixel-level semantic segmentation of dynamic objects entails high computing costs for indoor visual SLAM, and variations in lighting make dynamic objects harder to see, potentially causing occlusions or confusion with the static surroundings. To address these challenges, a lightweight semantic VSLAM model based on adaptive thresholding and speed optimization is proposed. Initially, a lightweight one-stage object detection network, YOLOv7-tiny, is used in conjunction with the optical flow algorithm to detect dynamic regions within images and filter out unstable feature points. Additionally, the feature point extraction algorithm dynamically adjusts its threshold based on the contrast of the input images. Moreover, combining a binary bag-of-words method with a simplified optimization technique for the local mapping thread improves the system's loading and matching speed in indoor dynamic scenes. Experimental results show that the proposed algorithm effectively eliminates dynamic feature points in highly dynamic indoor scenes, improving camera positioning accuracy. The average processing speed reaches 19.8 frames per second (FPS), meeting real-time requirements in practical scenarios.
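
Filtering dynamic feature points by combining detector boxes with Lucas-Kanade optical flow, as outlined above, might look like the following sketch; the forward-backward residual threshold is an assumption.

```python
import cv2
import numpy as np

def filter_dynamic_points(prev_gray, cur_gray, points, dyn_boxes,
                          resid_thr=2.0):
    """points: (N, 1, 2) float32 corners from the previous frame;
    dyn_boxes: list of (x1, y1, x2, y2) boxes from the object detector."""
    nxt, st1, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, points, None)
    back, st2, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, nxt, None)
    # Forward-backward error: unstable tracks drift when flowed back.
    resid = np.linalg.norm(points - back, axis=2).ravel()
    ok = (st1.ravel() == 1) & (st2.ravel() == 1) & (resid <= resid_thr)

    keep = []
    for i, pt in enumerate(nxt.reshape(-1, 2)):
        if not ok[i]:
            continue
        in_dyn = any(x1 <= pt[0] <= x2 and y1 <= pt[1] <= y2
                     for x1, y1, x2, y2 in dyn_boxes)
        if not in_dyn:
            keep.append(i)   # stable, static point usable for pose tracking
    return keep
```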