TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models
Published in Under Review, 2025
We propose a visual feature reuse method that enhances the baseline performance of Vision-Language-Action (VLA) models without incurring additional training cost. The approach applies the locality principle to the VLA setting and experimentally validates the feasibility of visual feature reuse in VLA models.
Key Contributions:
- Proposed a visual feature reuse method for VLA models
- Enhanced baseline performance without additional training cost
- Validated feasibility through experiments grounded in the locality principle
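The idea behind temporal token fusion can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not the paper's implementation: it assumes vision tokens correspond to fixed-size image patches and uses a simple per-patch pixel-difference test to decide which of the previous frame's tokens to reuse (temporal locality), keeping fresh tokens only where the scene changed. The function name, patch size, and threshold are all hypothetical.

```python
import numpy as np

def temporal_token_fusion(prev_tokens, curr_tokens, prev_frame, curr_frame,
                          patch_size=16, threshold=0.05):
    """Illustrative sketch of temporal token fusion (not the paper's code).

    Tokens whose image patch barely changed between consecutive frames
    reuse the previous frame's token; the rest keep the current one.
    """
    H, W = curr_frame.shape[:2]
    n_rows, n_cols = H // patch_size, W // patch_size
    # Mean absolute pixel difference per patch, one value per vision token.
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    patch_diff = diff.reshape(n_rows, patch_size, n_cols, patch_size, -1).mean(axis=(1, 3, 4))
    # Static patches (small change) reuse cached tokens: no recomputation needed.
    reuse_mask = (patch_diff < threshold * 255).reshape(-1)
    fused = np.where(reuse_mask[:, None], prev_tokens, curr_tokens)
    return fused, reuse_mask
```

In a real VLA pipeline the decision signal could come from pixel-level attention rather than raw differences, but the principle is the same: consecutive frames are largely redundant, so most tokens can be carried over at no extra training cost.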
Personal Contribution: Fourth author, focused on applying the locality principle and on experimental validation.
Recommended citation: Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Songfang Huang, Huiling Duan. (2025). "TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models." Under Review. (Fourth Author)
Download Paper
