TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models
Published in Under Review, 2025
We propose a visual feature reuse method that enhances the baseline performance of Vision-Language-Action (VLA) models without incurring additional training cost. The approach applies the locality principle to the VLA setting and experimentally validates the feasibility of visual feature reuse in VLA models.
Key Contributions:
- Proposed a visual feature reuse method for VLA models
- Enhanced baseline performance without additional training cost
- Validated feasibility through experiments grounded in the locality principle
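The idea behind temporal token fusion can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not the paper's implementation: it assumes vision tokens correspond to fixed-size image patches and uses a simple per-patch pixel-difference test to decide which of the previous frame's tokens to reuse (temporal locality), keeping fresh tokens only where the scene changed. The function name, patch size, and threshold are all hypothetical.

```python
import numpy as np

def temporal_token_fusion(prev_tokens, curr_tokens, prev_frame, curr_frame,
                          patch_size=16, threshold=0.05):
    """Illustrative sketch of temporal token fusion (not the paper's code).

    Tokens whose image patch barely changed between consecutive frames
    reuse the previous frame's token; the rest keep the current one.
    """
    H, W = curr_frame.shape[:2]
    n_rows, n_cols = H // patch_size, W // patch_size
    # Mean absolute pixel difference per patch, one value per vision token.
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    patch_diff = diff.reshape(n_rows, patch_size, n_cols, patch_size, -1).mean(axis=(1, 3, 4))
    # Static patches (small change) reuse cached tokens: no recomputation needed.
    reuse_mask = (patch_diff < threshold * 255).reshape(-1)
    fused = np.where(reuse_mask[:, None], prev_tokens, curr_tokens)
    return fused, reuse_mask
```

In a real VLA pipeline the decision signal could come from pixel-level attention rather than raw differences, but the principle is the same: consecutive frames are largely redundant, so most tokens can be carried over at no extra training cost.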
Personal Contribution: Fourth author, focused on applying the locality principle and on experimental validation.
Recommended citation: Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Songfang Huang, Huiling Duan. (2025). "TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models." Under Review. (Fourth Author)
Download Paper
