arXiv:2306.13643

LightGlue: Local Feature Matching at Light Speed

Published on Jun 23, 2023

Abstract

We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient in terms of both memory and computation, more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at https://github.com/cvg/LightGlue.

Community

Introduces LightGlue, a set of modifications to SuperGlue (local feature matching with graph neural networks) that make it faster (more memory- and compute-efficient), more accurate, and easier to train, built on an efficient transformer. Inference is faster on image pairs that are easy to match.

- Each attention unit is a residual MLP that concatenates a point's state with a message: a learned weighted average over all states of an image (the same image for self-attention, the other image for cross-attention).
- Self-attention: each state is projected to a key and a query with linear transformations; the attention score between two points uses rotary encodings of their relative position.
- Cross-attention: each element is projected to a key only, and the score between a point in one image and a point in the other is the product of their keys. This score matrix is symmetric, so a single computation serves both directions of attention at reduced cost.
- Assignment: pairwise similarity scores come from linearly projected points of both images; per-point matchability is a sigmoid over a linear score. The partial assignment matrix is the product of the matchabilities and the similarity scores: a matchable pair should be more similar to each other than to any other point in either image.
- Adaptive depth: a confidence classifier provides an exit criterion; points that are confidently matched or confidently unmatchable are discarded at each layer.
- Two-stage training: first train to predict correspondences, then train the confidence classifier. Correspondences are supervised with ground-truth labels from two-view transformations, minimizing the log-likelihood of the assignment at each layer so that correct matches can be predicted early.
- Results: lower training loss and higher recall than SuperGlue (easier to train). Pre-trained with synthetic homographies on 1M images, fine-tuned on MegaDepth (excluding IMC scenes). Beats LoFTR on HPatches homography estimation (with both RANSAC and DLT), improves relative pose estimation on MegaDepth using DISK and SuperPoint features (over NN+mutual check, SuperGlue, and other matchers), and outperforms the dense matchers LoFTR and MatchFormer with the help of LO-RANSAC; only ASpanFormer is more accurate, but it is slower. More results (including IMC), implementation details, local features used for training, etc. are in the appendix.
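The two attention scores described above can be sketched as follows. This is a minimal single-head NumPy illustration under my own assumptions (2D keypoint positions, a learned frequency matrix for the rotary encoding), not the official implementation; all weight names are hypothetical.

```python
import numpy as np

def rotate_half(x):
    # Pair up feature dimensions and rotate each pair by 90 degrees,
    # the standard trick for applying rotary encodings.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2], out[..., 1::2] = -x2, x1
    return out

def apply_rotary(v, pos, freqs):
    # Turn each 2D position into per-pair rotation angles, then rotate
    # the feature pairs of v by those angles.
    angles = pos @ freqs                       # (n, d/2)
    cos = np.repeat(np.cos(angles), 2, axis=-1)
    sin = np.repeat(np.sin(angles), 2, axis=-1)
    return v * cos + rotate_half(v) * sin

def self_attention_scores(x, pos, Wq, Wk, freqs):
    # Rotating both query and key makes q_i . k_j depend only on the
    # *relative* position p_i - p_j, as in LightGlue's self-attention.
    q = apply_rotary(x @ Wq, pos, freqs)
    k = apply_rotary(x @ Wk, pos, freqs)
    return q @ k.T

def cross_attention_scores(xA, xB, Wk):
    # Cross-attention projects each image's states to keys only; the
    # score k_i . k_j is symmetric, so the same matrix (transposed)
    # serves attention in both directions.
    return (xA @ Wk) @ (xB @ Wk).T
```

A quick consequence worth noting: translating all keypoints of an image by the same offset leaves the rotary self-attention scores unchanged, and the cross-attention scores from image B to A are exactly the transpose of those from A to B.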
From ETHz (PE Sarlin), Microsoft.
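The partial-assignment head summarized above (matchability gates times a dual softmax over pairwise similarities) can be sketched like this. Again a simplified NumPy illustration under my own assumptions (shared projection weights, a single matchability vector); names and the mutual-nearest-neighbour extraction threshold are hypothetical, not the official code.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def partial_assignment(xA, xB, W, w_match):
    # Pairwise similarity of linearly projected states from both images.
    S = (xA @ W) @ (xB @ W).T                  # (nA, nB)
    # Per-point matchability: sigmoid over a linear score.
    mA = sigmoid(xA @ w_match)                 # (nA,)
    mB = sigmoid(xB @ w_match)                 # (nB,)
    # Dual softmax: a pair gets a high score only if each point prefers
    # the other over every alternative in both images.
    return mA[:, None] * mB[None, :] * softmax(S, axis=1) * softmax(S, axis=0)

def extract_matches(P, threshold=0.1):
    # Keep mutual nearest neighbours in the assignment matrix that
    # exceed a confidence threshold.
    rows, cols = P.argmax(axis=1), P.argmax(axis=0)
    return [(i, j) for i, j in enumerate(rows)
            if cols[j] == i and P[i, j] > threshold]
```

Because each softmax normalizes to one and the matchability gates lie in (0, 1), every row and column of the resulting matrix sums to at most one, which is what makes it a *partial* assignment: points may remain unmatched.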

Links: PapersWithCode, GitHub, arXiv

LightGlue: Fast & Efficient Local Feature Matching Revolution

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix
