ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

¹Temple University, Philadelphia, PA, U.S.A.
²CompVis @ LMU Munich, MCML, Germany

Image-Text Embedding Space

Teaser figure for ActAlign

ActAlign significantly improves zero-shot fine-grained action recognition by modeling actions as structured language sequences. By aligning sub-action descriptions with video frames (green vs. red paths), we achieve more accurate predictions without requiring any video-text training data.

Abstract

We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8× fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.

Methodology

Method overview

Our ActAlign Method Overview.
(1) Sub-action Generation: Given a set of fine-grained actions within a domain (e.g., Basketball Tactics), we prompt an LLM to decompose each action (e.g., Hookshot, JumpShot, Dunk) into a temporal sequence of sub-actions.
(2) Temporal Alignment: Video frames are encoded by a frozen pretrained vision encoder and smoothed via a moving-average filter. Simultaneously, each sub-action is encoded by the text encoder. We compute a cosine-similarity matrix between frame and sub-action embeddings, then apply Dynamic Time Warping (DTW) to find the optimal alignment path and a normalized alignment score.
(3) Class Prediction: We repeat this process for each candidate action m, compare the normalized alignment scores, and select the action sequence with the highest score as the final prediction. Illustrative code sketches of these steps are given below.
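
As a rough illustration of step (1), the snippet below shows one way to prompt an LLM for an ordered sub-action script. The prompt wording, the generate_subactions helper, and the call_llm callable are hypothetical placeholders, not the exact prompt or code used in the paper.

# Hypothetical sketch of sub-action generation (step 1).
# `call_llm` stands in for any chat-completion client you have available.
SUBACTION_PROMPT = (
    "You are an expert in {domain}. Decompose the action '{action}' into a short, "
    "ordered list of visually distinctive sub-actions, one per line."
)

def generate_subactions(action, domain, call_llm):
    """Return an ordered list of sub-action descriptions for one candidate action."""
    response = call_llm(SUBACTION_PROMPT.format(domain=domain, action=action))
    # Keep non-empty lines and strip any bullets or numbering the LLM adds.
    return [line.strip().lstrip("-*0123456789. ").strip()
            for line in response.splitlines() if line.strip()]

Steps (2) and (3) can likewise be sketched in a few lines of NumPy, assuming the frame embeddings (T x D) and each class's sub-action text embeddings (K x D) have already been produced by a frozen image-text encoder such as SigLIP and L2-normalized. The function names, the smoothing window, and the length-normalized DTW variant shown here are illustrative assumptions; the paper's exact formulation may differ.

import numpy as np

def smooth_frames(frame_emb, window=5):
    """Moving-average smoothing of frame embeddings (T x D) along time, then re-normalize."""
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, frame_emb)
    return smoothed / (np.linalg.norm(smoothed, axis=1, keepdims=True) + 1e-8)

def dtw_alignment_score(sim):
    """DTW over a T x K frame-by-sub-action similarity matrix.

    Accumulates similarity along the best monotonic path and normalizes by the
    path length so scores are comparable across classes with different numbers
    of sub-actions (one common normalization choice).
    """
    T, K = sim.shape
    acc = np.zeros((T, K))                # best accumulated similarity ending at (i, j)
    steps = np.zeros((T, K), dtype=int)   # length of that best path
    for i in range(T):
        for j in range(K):
            if i == 0 and j == 0:
                prev_score, prev_steps = 0.0, 0
            else:
                candidates = []
                if i > 0:
                    candidates.append((acc[i - 1, j], steps[i - 1, j]))
                if j > 0:
                    candidates.append((acc[i, j - 1], steps[i, j - 1]))
                if i > 0 and j > 0:
                    candidates.append((acc[i - 1, j - 1], steps[i - 1, j - 1]))
                prev_score, prev_steps = max(candidates)
            acc[i, j] = prev_score + sim[i, j]
            steps[i, j] = prev_steps + 1
    return float(acc[-1, -1] / steps[-1, -1])

def classify(frame_emb, class_subactions):
    """Pick the class whose ordered sub-action embeddings align best with the video.

    `class_subactions` maps class name -> (K x D) L2-normalized text embeddings.
    """
    frames = smooth_frames(frame_emb)
    scores = {name: dtw_alignment_score(frames @ text_emb.T)
              for name, text_emb in class_subactions.items()}
    return max(scores, key=scores.get), scores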

Alignment Example

Example alignment

Comparison of similarity heatmaps and DTW alignment paths for a correct classification (right) versus an incorrect prediction (middle). The correct class exhibits clearer segmentation and higher alignment quality. Please refer to the paper for the corresponding sub-action scripts.
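
A figure of this kind can be reproduced by plotting the frame-by-sub-action similarity matrix and overlaying the DTW path. Below is a minimal matplotlib sketch, assuming sim is the T x K similarity matrix and path is the list of (frame, sub-action) index pairs recovered by a DTW traceback (not included in the alignment sketch above); plot_alignment is an illustrative helper, not code from the paper.

import matplotlib.pyplot as plt
import numpy as np

def plot_alignment(sim, path, subactions, title=""):
    """Frame-by-sub-action similarity heatmap with the DTW path overlaid."""
    fig, ax = plt.subplots(figsize=(4, 6))
    im = ax.imshow(sim, aspect="auto", cmap="viridis")
    frame_idx, sub_idx = zip(*path)           # path entries are (frame, sub-action)
    ax.plot(sub_idx, frame_idx, color="white", linewidth=2, label="DTW path")
    ax.set_xticks(range(len(subactions)))
    ax.set_xticklabels(subactions, rotation=45, ha="right", fontsize=7)
    ax.set_xlabel("sub-actions")
    ax.set_ylabel("video frames")
    ax.set_title(title)
    ax.legend(loc="upper left")
    fig.colorbar(im, ax=ax, label="cosine similarity")
    fig.tight_layout()
    plt.show()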

Quantitative Results

Quantitative Results Table

Zero-shot classification results on ActionAtlas (Salehi et al. 2024) under context-rich (T=0.2) prompting.
ActAlign achieves state-of-the-art Top-1, Top-2, and Top-3 accuracy, outperforming all baselines and billion-parameter video–language models without any video–text supervision.
These results highlight the effectiveness of structured sub-action alignment over flat representations such as mean-pooling, and underscore the open-set recognition capability of image-text models.
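
For reference, the flat mean-pooling baseline mentioned above can be sketched as follows: average the frame embeddings into a single video vector and compare it against one class-level text embedding, discarding temporal order. This is an illustration of the baseline idea under our assumptions, not the exact baseline implementation evaluated in the table.

import numpy as np

def mean_pool_score(frame_emb, class_text_emb):
    """Flat baseline: cosine similarity between the mean-pooled video embedding
    and a single class-level text embedding (no temporal structure)."""
    video = frame_emb.mean(axis=0)
    video = video / (np.linalg.norm(video) + 1e-8)
    text = class_text_emb / (np.linalg.norm(class_text_emb) + 1e-8)
    return float(video @ text)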

BibTeX

If you find our work useful, please consider citing:
@misc{aghdam2025actalignzeroshotfinegrainedvideo,
  title={ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment},
  author={Amir Aghdam and Vincent Tao Hu},
  year={2025},
  eprint={2506.22967},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.22967},
}