llama.cpp support when?
Great model and use case. I want to try it locally on my Mac!
Thank you for your interest in Phi-4-multimodal!
Our Hugging Face friends have just shared that the model can run with llama.cpp.
All you need is:
brew install llama.cpp
Followed by:
llama-cli -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF:Q8_0
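If you'd rather drive that same GGUF from Python instead of the CLI, the llama-cpp-python bindings can pull it straight from Hugging Face. A minimal sketch, assuming the Q8_0 filename pattern below matches a file in the repo (check the repo for the exact name):

from llama_cpp import Llama

# Download the quantized GGUF from the Hugging Face repo and load it.
# The filename glob and context size are assumptions; adjust as needed.
llm = Llama.from_pretrained(
    repo_id="bartowski/microsoft_Phi-4-mini-instruct-GGUF",
    filename="*Q8_0.gguf",
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a one-line summary of llama.cpp."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])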
That's a different model (Phi-4-mini-instruct, text-only). Not multimodal.
Yes, the multimodal model needs support from the llama.cpp community.
Does llama.cpp even support multimodal input?
No, it currently supports either a text model or a vision model, but not a combined multimodal one. You have to run vision models through a custom example that llama.cpp adds for each vision model that comes out.
Until llama.cpp's multimodal support is fully implemented, running via ONNX [1] may be an option (rough sketch after the links below).
Edit: mistral.rs [2] also seems to support it.
[1] https://huggingface.co/microsoft/Phi-4-multimodal-instruct-onnx
[2] https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI4MM.md
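For the ONNX route, onnxruntime-genai exposes a multimodal processor. A rough sketch modeled on the onnxruntime-genai examples for the Phi vision models; the model path, prompt template, and image file are placeholders, and method names can differ between package versions:

import onnxruntime_genai as og

# Path to the downloaded ONNX export from [1] (placeholder).
model = og.Model("path/to/Phi-4-multimodal-instruct-onnx")
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

# Phi-style chat template with an image placeholder (assumed format).
prompt = "<|user|><|image_1|>Describe this image.<|end|><|assistant|>"
images = og.Images.open("example.jpg")
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=2048)

# Stream the generated tokens to stdout.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(token), end="", flush=True)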
Another option here for self-hosting the model.
Here's an example with a chat-completion endpoint:
https://github.com/anastasiosyal/phi4-multimodal-instruct-server
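Assuming that server exposes an OpenAI-style chat-completion route, you can call it with plain requests. The host, port, and /v1/chat/completions path below are assumptions; check the repo's README for the real values.

import base64
import requests

# Encode a local image as a data URL (the image file name is a placeholder).
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # endpoint is an assumption
    json={
        "model": "phi-4-multimodal-instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])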
llama.cpp is refactoring its vision API:
https://github.com/ggml-org/llama.cpp/pull/11292
I got the first PoC working on llama.cpp: https://github.com/ggml-org/llama.cpp/pull/12274
And dear Microsoft Phi-4 team, the Python code is very frustrating to look at. Your image_token_compression code makes no sense and all the comments are wrong; I had to reinvent it myself. This is the worst Python code I have seen since I graduated.
Edit: sorry for being quite noisy and impolite. As much as I appreciate the research that went into this model, I also expected the same quality and transparency to be reflected in the code. Please be more thoughtful about the code next time 🙏