Video captioning based on vision transformer and reinforcement learning.

Researchers

Journal

Modalities

Models

Abstract

Global encoding of visual features is important for improving description accuracy in video captioning. In this paper, we propose a video captioning method that combines the Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoder block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves by 2.9%, 1.4%, 0.9% and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively. © 2022 Zhao et al.
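To illustrate what "global encoding" means for the encoding step, here is a minimal sketch of single-head scaled dot-product self-attention, the core operation inside a Transformer encoder block. It is not the authors' implementation: the toy 4-dimensional frame features, the function names, and the use of identity query/key/value projections (a real ViT encoder block learns these projections and adds multi-head attention, layer normalization, and a feed-forward layer) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections are used for queries, keys and
    values, so each frame vector plays all three roles. Returns the
    encoded frames and the attention weights.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    encoded, weights = [], []
    for query in frames:
        # Compare this frame against every frame in the clip.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in frames
        ]
        w = softmax(scores)
        # Weighted sum over all frames: every output sees the whole clip.
        out = [
            sum(w[j] * frames[j][c] for j in range(len(frames)))
            for c in range(dim)
        ]
        weights.append(w)
        encoded.append(out)
    return encoded, weights


# Hypothetical per-frame features (real CNN features would be much larger).
frame_features = [
    [1.0, 0.0, 0.5, 0.2],
    [0.9, 0.1, 0.4, 0.3],
    [0.0, 1.0, 0.2, 0.8],
]
encoded, attn = self_attention(frame_features)
```

Each row of `attn` is a probability distribution over all frames, so every encoded frame mixes information from the entire clip rather than from a local window. In a pipeline like the one described above, encoded features of this kind would then be passed to the LSTM decoder.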



