Mitigating Attention Collapse via Mean-Deviation Constrained Optimization

1AI Graduate School, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea
*Cho Chun Shik Graduate School of Mobility, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea

Abstract

Attention mechanisms are widely used in deep learning to compute contextual representations, but they are prone to collapse when attention weights concentrate excessively on a few tokens, potentially degrading model performance. We propose Mean-Deviation Constrained Attention (MDCA), an optimization-based attention mechanism that constrains the mean deviation of the attention weights to mitigate attention collapse. The constraint is formulated as an inequality condition and is efficiently handled using the Augmented Lagrangian Method (ALM), enabling explicit control over attention concentration. Unlike heuristic approaches such as dropout or temperature scaling, our method introduces a principled regularization framework grounded in constrained optimization. We evaluate the proposed method on two tasks: (i) selective attention for handwriting classification on the Badge-MNIST dataset, in comparison with standard baselines including vanilla attention, entropy regularization, and temperature scaling; and (ii) imitation learning on the nuPlan dataset, compared with a representative state-of-the-art planner. On Badge-MNIST, our method improves attention selectivity and accuracy across seeds. On nuPlan, it improves driving safety in both reactive closed-loop and open-loop evaluation while maintaining modest computational overhead.
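To make the construction concrete, the following is a minimal PyTorch sketch of how a mean-deviation inequality constraint on attention weights could be enforced with an Augmented Lagrangian penalty. It is not the authors' implementation: the class name MDCAPenalty, the hyperparameters eps and rho, the use of a single shared multiplier (rather than one per constraint), and the update schedule are all illustrative assumptions.

```python
import torch


def mean_deviation(attn: torch.Tensor) -> torch.Tensor:
    """Mean absolute deviation of each attention row from its row mean.

    A uniform attention row has zero mean deviation, while a collapsed
    (one-hot) row attains the maximum for a given number of tokens.
    """
    return (attn - attn.mean(dim=-1, keepdim=True)).abs().mean(dim=-1)


class MDCAPenalty(torch.nn.Module):
    """ALM penalty for the inequality constraint mean_deviation(attn) - eps <= 0.

    Uses the standard augmented-Lagrangian term for inequalities,
        (rho/2) * max(0, lam/rho + g)^2 - lam^2 / (2*rho),
    with dual ascent lam <- max(0, lam + rho * g). A single scalar
    multiplier averaged over the batch is a simplification for brevity.
    """

    def __init__(self, eps: float = 0.05, rho: float = 1.0):
        super().__init__()
        self.eps, self.rho = eps, rho
        # Lagrange multiplier; updated outside the gradient step.
        self.register_buffer("lam", torch.tensor(0.0))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        g = mean_deviation(attn) - self.eps            # constraint residual g <= 0
        t = torch.clamp(self.lam / self.rho + g, min=0.0)
        return (0.5 * self.rho * t**2 - self.lam**2 / (2 * self.rho)).mean()

    @torch.no_grad()
    def update_multiplier(self, attn: torch.Tensor) -> None:
        # Dual-ascent update on the multiplier, averaged over the batch.
        g = (mean_deviation(attn) - self.eps).mean()
        self.lam.copy_(torch.clamp(self.lam + self.rho * g, min=0.0))


if __name__ == "__main__":
    torch.manual_seed(0)
    scores = torch.randn(2, 4, 8, 8)             # (batch, heads, queries, keys)
    attn = torch.softmax(scores, dim=-1)
    penalty = MDCAPenalty(eps=0.05, rho=1.0)
    reg = penalty(attn)                          # added to the task loss in training
    penalty.update_multiplier(attn.detach())     # periodic dual-ascent step
    print(float(reg), float(penalty.lam))
```

In a training loop, the penalty term would be added to the task loss before backpropagation, with update_multiplier called periodically (e.g., once per epoch) so the multiplier tracks how strongly the constraint is violated.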