Issue running the model on Ollama

#1
by vperrinfr - opened

I am trying to run the granite-3b-code-instruct-GGUF model via Ollama, and I get an error during execution.

Error: llama runner process has terminated: signal: abort trap error:done_getting_tensors: wrong number of tensors; expected 514, got 418

In server.log, I can also see a strange entry that seems to contain invalid characters:
tokenizer.ggml.merges arr[str,48891] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...

Any ideas? Could it be linked to how the GGUF file was generated?

Thanks

server.log content:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBR9B03F/VMtH3VWyPUFB62BLM4TflaZi/IeFPFb9Lpt
2024/05/31 12:11:36 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-31T12:11:36.215+02:00 level=INFO source=images.go:729 msg="total blobs: 0"
time=2024-05-31T12:11:36.216+02:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
time=2024-05-31T12:11:36.218+02:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.1.39)"
time=2024-05-31T12:11:36.220+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/s2/7qnwtxp15mngkms4lmj0v0qc0000gn/T/ollama2163464816/runners
time=2024-05-31T12:11:36.316+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-05-31T12:11:36.316+02:00 level=INFO source=types.go:71 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="32.0 GiB" available="0 B"
[GIN] 2024/05/31 - 12:12:03 | 200 | 746.844µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/31 - 12:12:14 | 201 | 7.380824655s | 127.0.0.1 | POST "/api/blobs/sha256:5bd783ab3925f425f17764fd34c1f7119fb64a023ccf9dd48654c3c3f252a8ff"
[GIN] 2024/05/31 - 12:12:22 | 200 | 7.957783124s | 127.0.0.1 | POST "/api/create"
[GIN] 2024/05/31 - 12:12:43 | 200 | 38.846µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/31 - 12:12:43 | 200 | 939.012µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/05/31 - 12:12:43 | 200 | 378.844µs | 127.0.0.1 | POST "/api/show"
time=2024-05-31T12:12:44.277+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=0 memory.available="0 B" memory.required.full="2.8 GiB" memory.required.partial="268.9 MiB" memory.required.kv="640.0 MiB" memory.weights.total="2.0 GiB" memory.weights.repeating="1.9 GiB" memory.weights.nonrepeating="98.4 MiB" memory.graph.full="152.0 MiB" memory.graph.partial="204.4 MiB"
time=2024-05-31T12:12:44.282+02:00 level=INFO source=server.go:338 msg="starting llama server" cmd="/var/folders/s2/7qnwtxp15mngkms4lmj0v0qc0000gn/T/ollama2163464816/runners/cpu_avx2/ollama_llama_server --model /Users/vperrin/.ollama/models/blobs/sha256-5bd783ab3925f425f17764fd34c1f7119fb64a023ccf9dd48654c3c3f252a8ff --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 53541"
time=2024-05-31T12:12:44.295+02:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-31T12:12:44.295+02:00 level=INFO source=server.go:526 msg="waiting for llama runner to start responding"
time=2024-05-31T12:12:44.296+02:00 level=INFO source=server.go:564 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=2986 commit="74f33adf" tid="0x7ff84b8c3100" timestamp=1717150364
INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x7ff84b8c3100" timestamp=1717150364 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="53541" tid="0x7ff84b8c3100" timestamp=1717150364
time=2024-05-31T12:12:44.799+02:00 level=INFO source=server.go:564 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 26 key-value pairs and 514 tensors from /Users/vperrin/.ollama/models/blobs/sha256-5bd783ab3925f425f17764fd34c1f7119fb64a023ccf9dd48654c3c3f252a8ff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = granite-3b-code-instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 2048
llama_model_loader: - kv 4: llama.embedding_length u32 = 2560
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 10240
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: llama.vocab_size u32 = 49152
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 80
llama_model_loader: - kv 13: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 14: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = refact
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,49152] = ["<|endoftext|>", "", "<f...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,49152] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,48891] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 24: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q4_K: 192 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 19/49152 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 49152
llm_load_print_meta: n_merges = 48891
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 80
llm_load_print_meta: n_embd_head_k = 80
llm_load_print_meta: n_embd_head_v = 80
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2560
llm_load_print_meta: n_embd_v_gqa = 2560
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 3.48 B
llm_load_print_meta: model size = 1.98 GiB (4.89 BPW)
llm_load_print_meta: general.name = granite-3b-code-instruct
llm_load_print_meta: BOS token = 0 '<|endoftext|>'
llm_load_print_meta: EOS token = 0 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: PAD token = 0 '<|endoftext|>'
llm_load_print_meta: LF token = 145 'Ä'
llm_load_print_meta: EOT token = 0 '<|endoftext|>'
llm_load_tensors: ggml ctx size = 0.23 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 514, got 418
llama_load_model_from_file: exception loading model
libc++abi: terminating due to uncaught exception of type std::runtime_error: done_getting_tensors: wrong number of tensors; expected 514, got 418
time=2024-05-31T12:12:45.049+02:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: abort trap error:done_getting_tensors: wrong number of tensors; expected 514, got 418"
[GIN] 2024/05/31 - 12:12:45 | 500 | 1.327839226s | 127.0.0.1 | POST "/api/chat"
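
As a quick sanity check on the numbers in the log above, here is a minimal Python sketch that tallies them. The names `LOG_EXCERPT` and `summarize` are made up for illustration, and the excerpt contains only lines copied from the log. It checks whether the per-type tensor counts add up to the count the loader expected, and how the shortfall relates to `llama.block_count`.

```python
import re

# A few lines copied verbatim from the server.log above.
LOG_EXCERPT = """\
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q4_K: 192 tensors
llama_model_loader: - type q6_K: 33 tensors
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 514, got 418
"""


def summarize(log: str) -> dict:
    """Tally tensor counts reported in a llama.cpp loader log excerpt."""
    # Tensor counts per storage type, e.g. {"f32": 289, ...}.
    per_type = {
        m.group(1): int(m.group(2))
        for m in re.finditer(r"- type (\w+):\s+(\d+) tensors", log)
    }
    # Counts from the error message: tensors in the file vs. tensors loaded.
    expected, got = map(int, re.search(r"expected (\d+), got (\d+)", log).groups())
    blocks = int(re.search(r"llama\.block_count\s+u32\s*=\s*(\d+)", log).group(1))
    missing = expected - got
    return {
        "in_file": sum(per_type.values()),
        "expected": expected,
        "got": got,
        "missing": missing,
        # (quotient, remainder) of the shortfall over the number of blocks.
        "missing_per_block": divmod(missing, blocks),
    }


if __name__ == "__main__":
    for key, value in summarize(LOG_EXCERPT).items():
        print(f"{key}: {value}")
```

The per-type counts sum to 514, so the file agrees with its own header, and the shortfall of 96 divides evenly into 3 tensors for each of the 32 blocks. That is consistent with the runner skipping some per-layer tensors it does not handle for the `llama` architecture, although the log alone does not show which tensors those are.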

The 3B and 8B (instruct) models are not yet supported in Ollama. You have to wait until the next release or try this Llamafile I created: https://huggingface.co/sroecker/granite-3b-code-instruct-llamafile/tree/main
Just make it executable (chmod +x) and run it.

IBM Granite org

Yeah, currently this model only works with llama.cpp.
It's not working with LM Studio or Ollama.
Maybe it will be fixed when they update to a new llama.cpp release? (Not sure.)
