llama.cpp support when?

#7 opened by alanzhuly

Great model and use case. I want to try it locally on my Mac!

Microsoft org

Thank you for your interest in Phi-4-multimodal!
Our Hugging Face friend has just shared that the model can run with llama.cpp.

All you need is:

brew install llama.cpp

Followed by:

llama-cli -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF:Q8_0
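If you'd rather have a local endpoint than the interactive CLI, the same -hf download syntax should also work with llama-server, which serves a small web UI and an OpenAI-compatible chat completions route (same repo and quant tag as above):

llama-server -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF:Q8_0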

That's a different model, though. It's not multimodal.

Microsoft org

Yes, the multimodal model needs support from the llama.cpp community.

Does llama.cpp even support multimodal input?

No, they currently support either a text model or a vision model, but not a combined multimodal model. You have to run vision models through a custom script that llama.cpp creates for each vision model that comes out.
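For context, here is roughly how one of those per-model runners is invoked, using the existing LLaVA example; model.gguf, mmproj.gguf and photo.jpg are placeholders for the text weights, the vision projector and your input image:

llama-llava-cli -m model.gguf --mmproj mmproj.gguf --image photo.jpg -p "Describe this image."

Phi-4-multimodal would need equivalent support added before anything similar works for it.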

In the meantime, until llama.cpp's multimodal support is fully implemented, running via ONNX [1] may be an option (a minimal starting sketch follows the links below).
Edit: mistral.rs [2] seems to support it as well.

[1] https://huggingface.co/microsoft/Phi-4-multimodal-instruct-onnx
[2] https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI4MM.md
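If you want to try the ONNX route, a minimal starting point (assuming you then follow the generation example in that model card; the local folder name is just a placeholder) is:

# ONNX Runtime generate() API package
pip install onnxruntime-genai

# download the ONNX model files locally
huggingface-cli download microsoft/Phi-4-multimodal-instruct-onnx --local-dir Phi-4-multimodal-instruct-onnx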

Another option here for self-hosting the model. Here's an example with a chat completions endpoint:
https://github.com/anastasiosyal/phi4-multimodal-instruct-server
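Assuming that server exposes an OpenAI-style chat completions route on its default port (both are assumptions here, so check the repo's README for the exact path and port), a quick text-only smoke test could look like:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "What can you do?"}]}'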

llama.cpp is refactoring its vision API:
https://github.com/ggml-org/llama.cpp/pull/11292

I got a first PoC running on llama.cpp: https://github.com/ggml-org/llama.cpp/pull/12274

And dear Microsoft Phi-4 team, the Python code is very frustrating to look at. Your image_token_compression code makes no sense and all the comments are wrong; I had to reinvent it myself. This is the worst Python code I have seen since I graduated.

Edit: sorry for being quite noisy and impolite. As much as I appreciate the research that went into this model, I also expected the same level of quality and transparency to be reflected in the code. Please be more thoughtful about the code next time 🙏
