MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Abstract
This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M. Moreover, the MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.
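As a rough illustration of the block-wise weight sharing mentioned in the abstract, the sketch below applies each block several times in a row before moving on to the next one, so the stored weights stay the same while the effective depth grows. This is one reading of "immediate" sharing, not the authors' code; the function name `run_with_immediate_sharing` and the toy blocks are hypothetical.

```python
def run_with_immediate_sharing(x, blocks, repeats=2):
    """Apply each block `repeats` times back to back (assumed reading of
    'immediate block-wise weight sharing'); weights are stored once per block."""
    for block in blocks:
        for _ in range(repeats):
            x = block(x)  # same block object, hence same weights, on every repeat
    return x


def effective_depth(blocks, repeats=2):
    """Number of block applications in one forward pass."""
    return len(blocks) * repeats


# Toy stand-ins for transformer blocks: any callable mapping x -> x works here.
toy_blocks = [lambda x: x + 1, lambda x: x * 2]

if __name__ == "__main__":
    print("stored blocks:", len(toy_blocks))
    print("effective depth:", effective_depth(toy_blocks))
    print("output:", run_with_immediate_sharing(1, toy_blocks))
```

Because a repeated block runs immediately after its first use, its weights could plausibly stay in fast memory between the two applications, which would fit the abstract's claim of only marginal latency overhead.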
Community
It would be interesting to see a comparison to small encoder-decoder models like instructionRoBERTa or flan-T5.
As a GPU poor I find this paper interesting and I am excited to try them out.
My questions are:
Have you considered distilling the Phi-2 2.7B model into a smaller 350M model?
How does the design change affect the in-context learning ability of these models?
Do existing toolchains such as PEFT and LoRA, and quantization techniques like AWQ, EXL2 and GPTQ, work on these models?
Why not distill from a larger model?
The model weights are now publicly available: https://huggingface.co/collections/facebook/mobilellm-6722be18cb86c20ebe113e95
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Head-wise Shareable Attention for Large Language Models (2024)
- Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers (2024)
- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs (2024)
- BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models (2024)
- Rethinking Optimization and Architecture for Tiny Language Models (2024)
If it can be downloaded I would like to test it on my device
Looking forward to trying it! Layer sharing saves only memory, not computation, so here is a thought on combining it with LoRA: fine-tune the shared layers with a per-layer low-rank update. Each layer then has its own effective weights, while the parameter count increases only slightly.
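A minimal sketch of that idea in plain Python, assuming a single square weight matrix shared across several layers: each layer keeps its own low-rank pair `A`, `B` and uses `base + B @ A` as its effective weight. The class `SharedLinearWithLoRA` and its methods are hypothetical names for illustration, not part of PEFT or of the released models.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class SharedLinearWithLoRA:
    """One base weight reused by `n_repeats` layers, plus a low-rank delta per layer."""

    def __init__(self, dim, rank, n_repeats, seed=0):
        rng = random.Random(seed)
        self.dim, self.rank, self.n_repeats = dim, rank, n_repeats
        # Shared base weight, stored once.
        self.base = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(dim)]
        # Usual LoRA initialisation: A random, B zero, so every delta starts at zero.
        self.A = [[[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(rank)]
                  for _ in range(n_repeats)]
        self.B = [[[0.0] * rank for _ in range(dim)] for _ in range(n_repeats)]

    def effective_weight(self, layer_idx):
        """Weight used by one layer: shared base plus that layer's B @ A."""
        return add(self.base, matmul(self.B[layer_idx], self.A[layer_idx]))

    def shared_params(self):
        return self.dim * self.dim

    def extra_params(self):
        return self.n_repeats * 2 * self.dim * self.rank


if __name__ == "__main__":
    demo = SharedLinearWithLoRA(dim=64, rank=2, n_repeats=2)
    print("shared parameters:", demo.shared_params())
    print("extra LoRA parameters:", demo.extra_params())
```

With a small rank, the extra parameters scale as `2 * dim * rank` per layer rather than `dim * dim`, which is where the saving over fully untied layers would come from.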
Interesting. If the findings hold true for all small LLMs, then it is very possible to cut down encoder-decoder model size by applying layer sharing to the decoder part of the model. Model size has always been an issue for encoder-decoder models.
Could someone share a model config that reproduces the paper's parameter counts from the number of layers, heads, key-value heads and embedding dimension given in the paper?
I used a Llama config and additionally set tie_word_embeddings=True, but I don't get the same number of parameters. Am I missing something?
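One way to debug a mismatch like this is to count parameters by hand. The sketch below assumes a standard Llama-style layout (bias-free attention and gated MLP projections, two RMSNorm weights per layer, one final norm) and may not match the authors' exact implementation. Note that the count also depends on the vocabulary size and the feed-forward hidden size (`intermediate_size` in a Llama config), which the question above does not list. The function name and the example values are placeholders for illustration, not figures taken from this page.

```python
def llama_style_param_count(vocab_size, dim, n_layers, n_heads, n_kv_heads,
                            ffn_hidden, tie_embeddings=True):
    """Parameter count for an assumed Llama-style decoder with grouped-query attention."""
    head_dim = dim // n_heads
    kv_dim = n_kv_heads * head_dim          # K and V are narrower under GQA
    attn = dim * dim + 2 * dim * kv_dim + dim * dim   # q, k, v, o projections
    mlp = 3 * dim * ffn_hidden              # gate, up and down projections
    norms = 2 * dim                         # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embed = vocab_size * dim
    lm_head = 0 if tie_embeddings else vocab_size * dim  # tied head adds nothing
    return embed + n_layers * per_layer + dim + lm_head  # + final norm


if __name__ == "__main__":
    # Placeholder values; substitute the ones from the paper's architecture table.
    cfg = dict(vocab_size=32000, dim=576, n_layers=30,
               n_heads=9, n_kv_heads=3, ffn_hidden=1536)
    print("tied:  ", llama_style_param_count(**cfg, tie_embeddings=True))
    print("untied:", llama_style_param_count(**cfg, tie_embeddings=False))
```

Comparing the tied and untied totals shows how much of a small model's budget the embedding table takes, and varying `ffn_hidden` is a quick way to check whether that is the missing setting.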
Secondly, the authors didn't mention the pretraining dataset they used. IMHO, controlling for that would be a better setup to measure the effect of model parameters.