llama.cpp support when?
Great model and use case. I want to try it locally on my Mac!
Thank you for your interest in Phi-4-multimodal!
Our Hugging Face friends have just shared that the model can run with llama.cpp.
All you need is:
brew install llama.cpp
Followed by:
llama-cli -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF:Q8_0
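If you'd rather drive that same GGUF from Python instead of the CLI, the llama-cpp-python bindings can pull it straight from Hugging Face. A minimal sketch, assuming the Q8_0 filename pattern below matches a file in the repo (check the repo for the exact name):

from llama_cpp import Llama

# Download the quantized GGUF from the Hugging Face repo and load it.
# The filename glob and context size are assumptions; adjust as needed.
llm = Llama.from_pretrained(
    repo_id="bartowski/microsoft_Phi-4-mini-instruct-GGUF",
    filename="*Q8_0.gguf",
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a one-line summary of llama.cpp."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])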
That's a different model (Phi-4-mini-instruct, text-only). Not multimodal.
Yes, the multimodal model needs support from the llama.cpp community.
Does llama.cpp even support multimodal input?
No, it currently supports either a text model or a vision model, but not a combined multimodal one. You have to run vision models through a custom example that llama.cpp adds for each vision model that comes out.
Until llama.cpp's multimodal support is fully implemented, running via ONNX [1] may be an option (rough sketch after the links below).
Edit: mistral.rs [2] also seems to support it.
[1] https://huggingface.co/microsoft/Phi-4-multimodal-instruct-onnx
[2] https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI4MM.md
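For the ONNX route, onnxruntime-genai exposes a multimodal processor. A rough sketch modeled on the onnxruntime-genai examples for the Phi vision models; the model path, prompt template, and image file are placeholders, and method names can differ between package versions:

import onnxruntime_genai as og

# Path to the downloaded ONNX export from [1] (placeholder).
model = og.Model("path/to/Phi-4-multimodal-instruct-onnx")
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

# Phi-style chat template with an image placeholder (assumed format).
prompt = "<|user|><|image_1|>Describe this image.<|end|><|assistant|>"
images = og.Images.open("example.jpg")
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=2048)

# Stream the generated tokens to stdout.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(token), end="", flush=True)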
Another option here for self-hosting the model.
Here's an example with a chat-completion endpoint:
https://github.com/anastasiosyal/phi4-multimodal-instruct-server
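Assuming that server exposes an OpenAI-style chat-completion route, you can call it with plain requests. The host, port, and /v1/chat/completions path below are assumptions; check the repo's README for the real values.

import base64
import requests

# Encode a local image as a data URL (the image file name is a placeholder).
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # endpoint is an assumption
    json={
        "model": "phi-4-multimodal-instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])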
llama.cpp is refactoring its vision API:
https://github.com/ggml-org/llama.cpp/pull/11292
I got the first PoC working on llama.cpp: https://github.com/ggml-org/llama.cpp/pull/12274
And dear Microsoft Phi-4 team, the Python code is very frustrating to look at. Your image_token_compression code makes no sense and all the comments are wrong; I had to reinvent it myself. This is the worst Python code I have seen since I graduated.
Edit: sorry for being quite noisy and impolite. As much as I appreciate the research that went into this model, I also expected the same quality and transparency to be reflected in the code. Please be more thoughtful about the code next time 🙏