CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally Paper • 2502.03566 • Published Feb 5 • 2