ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

¹Temple University, Philadelphia, PA, U.S.A.
²CompVis @ LMU Munich, MCML, Germany

Image-Text Embedding Space

Teaser figure for ActAlign

ActAlign significantly improves zero-shot fine-grained action recognition by modeling actions as structured language sequences. By aligning sub-action descriptions with video frames (green vs. red paths), we achieve more accurate predictions without requiring any video-text training data.

Abstract

We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8× fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.

Methodology

Method overview

Our ActAlign Method Overview.
(1) Sub-action Generation: Given a set of fine-grained actions within a domain (e.g., Basketball Tactics), we prompt an LLM to decompose each action (e.g., Hookshot, JumpShot, Dunk) into a temporal sequence of sub-actions.
(2) Temporal Alignment: Video frames are encoded by a frozen pretrained vision encoder and smoothed via a moving-average filter. Simultaneously, each sub-action is encoded by the text encoder. We compute a cosine-similarity matrix between frame and sub-action embeddings, then apply Dynamic Time Warping (DTW) to find the optimal alignment path and a normalized alignment score.
(3) Class Prediction: We repeat this process for each candidate action m, compare the normalized alignment scores, and select the action sequence with the highest score as the final prediction. Illustrative code sketches of these steps are given below.
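
As a rough illustration of step (1), the snippet below shows one way to prompt an LLM for an ordered sub-action script. The prompt wording, the generate_subactions helper, and the call_llm callable are hypothetical placeholders, not the exact prompt or code used in the paper.

# Hypothetical sketch of sub-action generation (step 1).
# `call_llm` stands in for any chat-completion client you have available.
SUBACTION_PROMPT = (
    "You are an expert in {domain}. Decompose the action '{action}' into a short, "
    "ordered list of visually distinctive sub-actions, one per line."
)

def generate_subactions(action, domain, call_llm):
    """Return an ordered list of sub-action descriptions for one candidate action."""
    response = call_llm(SUBACTION_PROMPT.format(domain=domain, action=action))
    # Keep non-empty lines and strip any bullets or numbering the LLM adds.
    return [line.strip().lstrip("-*0123456789. ").strip()
            for line in response.splitlines() if line.strip()]

Steps (2) and (3) can likewise be sketched in a few lines of NumPy, assuming the frame embeddings (T x D) and each class's sub-action text embeddings (K x D) have already been produced by a frozen image-text encoder such as SigLIP and L2-normalized. The function names, the smoothing window, and the length-normalized DTW variant shown here are illustrative assumptions; the paper's exact formulation may differ.

import numpy as np

def smooth_frames(frame_emb, window=5):
    """Moving-average smoothing of frame embeddings (T x D) along time, then re-normalize."""
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, frame_emb)
    return smoothed / (np.linalg.norm(smoothed, axis=1, keepdims=True) + 1e-8)

def dtw_alignment_score(sim):
    """DTW over a T x K frame-by-sub-action similarity matrix.

    Accumulates similarity along the best monotonic path and normalizes by the
    path length so scores are comparable across classes with different numbers
    of sub-actions (one common normalization choice).
    """
    T, K = sim.shape
    acc = np.zeros((T, K))                # best accumulated similarity ending at (i, j)
    steps = np.zeros((T, K), dtype=int)   # length of that best path
    for i in range(T):
        for j in range(K):
            if i == 0 and j == 0:
                prev_score, prev_steps = 0.0, 0
            else:
                candidates = []
                if i > 0:
                    candidates.append((acc[i - 1, j], steps[i - 1, j]))
                if j > 0:
                    candidates.append((acc[i, j - 1], steps[i, j - 1]))
                if i > 0 and j > 0:
                    candidates.append((acc[i - 1, j - 1], steps[i - 1, j - 1]))
                prev_score, prev_steps = max(candidates)
            acc[i, j] = prev_score + sim[i, j]
            steps[i, j] = prev_steps + 1
    return float(acc[-1, -1] / steps[-1, -1])

def classify(frame_emb, class_subactions):
    """Pick the class whose ordered sub-action embeddings align best with the video.

    `class_subactions` maps class name -> (K x D) L2-normalized text embeddings.
    """
    frames = smooth_frames(frame_emb)
    scores = {name: dtw_alignment_score(frames @ text_emb.T)
              for name, text_emb in class_subactions.items()}
    return max(scores, key=scores.get), scores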

Alignment Example

Example alignment

Comparison of similarity heatmaps and DTW alignment paths for a correct classification (right) versus an incorrect prediction (middle). The correct class exhibits clearer segmentation and higher alignment quality. Please refer to the paper for the corresponding sub-action scripts.
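
A figure of this kind can be reproduced by plotting the frame-by-sub-action similarity matrix and overlaying the DTW path. Below is a minimal matplotlib sketch, assuming sim is the T x K similarity matrix and path is the list of (frame, sub-action) index pairs recovered by a DTW traceback (not included in the alignment sketch above); plot_alignment is an illustrative helper, not code from the paper.

import matplotlib.pyplot as plt
import numpy as np

def plot_alignment(sim, path, subactions, title=""):
    """Frame-by-sub-action similarity heatmap with the DTW path overlaid."""
    fig, ax = plt.subplots(figsize=(4, 6))
    im = ax.imshow(sim, aspect="auto", cmap="viridis")
    frame_idx, sub_idx = zip(*path)           # path entries are (frame, sub-action)
    ax.plot(sub_idx, frame_idx, color="white", linewidth=2, label="DTW path")
    ax.set_xticks(range(len(subactions)))
    ax.set_xticklabels(subactions, rotation=45, ha="right", fontsize=7)
    ax.set_xlabel("sub-actions")
    ax.set_ylabel("video frames")
    ax.set_title(title)
    ax.legend(loc="upper left")
    fig.colorbar(im, ax=ax, label="cosine similarity")
    fig.tight_layout()
    plt.show()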

Quantitative Results

Quantitative Results Table

Zero-shot classification results on ActionAtlas (Salehi et al. 2024) under context-rich (T=0.2) prompting.
ActAlign achieves state-of-the-art Top-1, Top-2, and Top-3 accuracy, outperforming all baselines and billion-parameter video–language models without any video–text supervision.
These results highlight the effectiveness of structured sub-action alignment over flat representations such as mean-pooling, and underscore the open-set recognition capability of image-text models.
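
For reference, the flat mean-pooling baseline mentioned above can be sketched as follows: average the frame embeddings into a single video vector and compare it against one class-level text embedding, discarding temporal order. This is an illustration of the baseline idea under our assumptions, not the exact baseline implementation evaluated in the table.

import numpy as np

def mean_pool_score(frame_emb, class_text_emb):
    """Flat baseline: cosine similarity between the mean-pooled video embedding
    and a single class-level text embedding (no temporal structure)."""
    video = frame_emb.mean(axis=0)
    video = video / (np.linalg.norm(video) + 1e-8)
    text = class_text_emb / (np.linalg.norm(class_text_emb) + 1e-8)
    return float(video @ text)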

BibTeX

If you find our work useful, please consider citing:
@misc{aghdam2025actalignzeroshotfinegrainedvideo,
  title={ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment},
  author={Amir Aghdam and Vincent Tao Hu},
  year={2025},
  eprint={2506.22967},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.22967},
}