Using this model in LM Studio does not complete (stuck at PromptProcessing: 99.9517)

#15
by sheyenrath - opened

For this image (paper.png):

And this prompt:

Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally.
Do not hallucinate.
RAW_TEXT_START
Page dimensions: 612.0x792.0
[50x111]at
[50x111]https://molmo.allenai.org
[50x123]lect model weights, inference code, and demo are available
[50x171]like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic
[50x195]forms others in the class of open weight and data models
[50x207]class 72B model within the Molmo family not only outper-
[50x243]for the model architecture details, a well-tuned training
[50x279]we also introduce a diverse dataset mixture for fine-tuning
[50x291]descriptions. To enable a wide array of user interactions,
[50x303]lected entirely from human annotators using speech-based
[50x327]are state-of-the-art in their class of openness. Our key inno-
[50x339]from scratch. We present Molmo, a new family of VLMs that
[50x350]dational knowledge about how to build performant VLMs
[50x362]open ones. As a result, the community is still missing foun-
[50x386]on synthetic data from proprietary VLMs to achieve good
[50x398]prietary. The strongest open-weight models rely heavily
[56x492]Noah A. Smith
[62x147]We will be releasing all of our model weights, captioning
[62x410]Today's most advanced multimodal models remain pro-
[64x84]∗
[65x604]Jae Sung Park
[69x81]Equal contribution
[78x548]Arnavi Chheda
[79x506]Caitlin Wittlif
[81x562]Rose Hendrix
[87x520]Jon Borchardt
[104x590]Jiasen Lu
[128x496]†
[132x496]ψ
[132x496]Hannaneh Hajishirzi
[132x609]Mohammadreza Salehi
[137x538]†
[137x538]Pete Walsh
[146x510]Carissa Schoenick
[147x566]†
[147x566]Favyen Bastani
[150x595]Taira Anderson
[151x552]†
[152x581]Ajay Patel
[155x524]†
[156x623]ψ
[156x623]Christopher Clark
[166x468]†
[170x464]Allen Institute for AI
[176x645]for State-of-the-Art Multimodal Models
[204x538]†
[204x538]Chris Newell
[206x663]Open Weights and Open Data
[221x581]†
[221x581]Mark Yatskar
[233x552]Sam Skjonsberg
[233x552]†
[239x681]Molmo and PixMo
[240x566]Eli VanderBilt
[240x566]†
[243x595]†
[246x496]†
[250x496]ψ
[253x510]†
[253x510]Oscar Michel
[254x524]†
[254x524]Jen Dumas
[263x623]Sangho Lee
[263x623]∗†
[264x609]Niklas Muennighoff
[280x538]Piper Wolters
[280x538]†
[304x464]University of Washington
[305x581]†
[305x581]Chris Callison-Burch
[309x117]and
[309x117]with released model weights
[309x81]achieved with a simple training pipeline in which we con-
[309x105]training data without any reliance on synthetic data from
[309x169]community is still missing foundational knowledge about
[309x181]tively
[309x192]image captions. The resulting VLMs, therefore, are effec-
[309x204]which uses GPT-4V [25] to generate a large set of detailed
[309x216]e.g
[309x228]reliance on
[309x240](
[309x252]less open data: the training data may either be proprietary
[309x300]open
[309x300]capabilities in
[309x340]model weights, data, nor code being publicly released.
[309x352]models (VLMs), however, remain proprietary with neither
[309x376]age descriptions and accurately answering complex visual
[309x400]images in addition to text have resulted in impressive mul-
[309x412]Extensions to large language models (LLMs) that process
[312x240]e.g
[316x129]anguage
[321x216]., models are trained on datasets like ShareGPT4V [7]
[321x141]Molmo
[321x141]In this work, we present the
[324x240]., [5]) or, in cases where it is released, there is a heavy
[328x566]Nathan Lambert
[329x595]Kiana Ehsani
[330x552]Michael Schmitz
[330x552]†
[333x496]†
[337x496]Ali Farhadi
[337x510]Ranjay Krishna
[337x510]†
[343x623]†
[343x623]Rohun Tripathi
[351x681]:
[355x228]data generated by proprietary systems,
[368x129]del) family of state-of-the-art open VLMs
[369x300]models. Early works, exemplified by
[382x609]Kyle Lo
[406x524]†
[406x524]Sophie Lebrecht
[407x496]†
[411x496]ψ
[425x566]†
[426x581]Andrew Head
[427x117]released vision-language
[431x552]†
[431x552]Aaron Sarnat
[435x510]ψ
[435x510]Luca Weihs
[436x623]†
[436x623]Yue Yang
[436x141](
[441x609]Luca Soldaini
[441x609]†
[443x538]Kuo-Hao Zeng
[443x538]†
[487x595]†
[504x524]†
[511x566]†
[513x510]†
[513x552]†
[531x141]pen
[533x496]ψ

RAW_TEXT_END

Input token length = 2053
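
For context, the RAW_TEXT block follows a simple anchor format: page dimensions on the first line, then one `[XxY]text` entry per extracted text span. Below is a minimal sketch of how a block in this shape could be assembled; the function and parameter names are my own illustration, not the model's official preprocessing code.

```python
# Illustrative sketch only: builds a RAW_TEXT anchor block shaped like the
# one quoted above. Names here are assumptions, not the model's tooling.

def build_anchor_block(page_width: float, page_height: float, spans) -> str:
    """spans: iterable of (x, y, text) tuples extracted from the PDF page."""
    lines = [f"Page dimensions: {page_width}x{page_height}"]
    # Entries in the quoted block appear sorted by x, then y.
    for x, y, text in sorted(spans):
        lines.append(f"[{x:.0f}x{y:.0f}]{text}")
    return "\n".join(lines)

PROMPT = (
    "Below is the image of one page of a document, as well as some raw "
    "textual content that was previously extracted for it. Just return the "
    "plain text representation of this document as if you were reading it "
    "naturally.\nDo not hallucinate.\n"
    "RAW_TEXT_START\n{anchors}\nRAW_TEXT_END"
)

anchors = build_anchor_block(612.0, 792.0, [
    (50, 111, "at"),
    (50, 111, "https://molmo.allenai.org"),
])
print(PROMPT.format(anchors=anchors))
```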

LM Studio shows the following log output:

2025-03-08 11:43:53 [DEBUG] 
Sampling params:	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.100
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
2025-03-08 11:43:53 [DEBUG] 
sampling: 
logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 2108
2025-03-08 11:43:53 [DEBUG] 
About to embed image
2025-03-08 11:43:56 [DEBUG] 
BeginProcessingPrompt
2025-03-08 11:43:57 [DEBUG] 
PromptProcessing: 24.6747
2025-03-08 11:43:57 [DEBUG] 
PromptProcessing: 49.3494
2025-03-08 11:43:57 [DEBUG] 
PromptProcessing: 74.0241
2025-03-08 11:43:58 [DEBUG] 
PromptProcessing: 98.6988
2025-03-08 11:43:58 [DEBUG] 
Failed to find image for token at index 20
2025-03-08 11:43:58 [DEBUG] 
PromptProcessing: 99.9518

And it stays stuck at 99.xxxx; it never completes.
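
For anyone who wants to try reproducing this outside the chat UI, here is a minimal sketch against LM Studio's OpenAI-compatible local server. The port, API key, model identifier, and file paths are assumptions; substitute whatever your LM Studio instance shows.

```python
# Rough reproduction sketch, not a verified fix. Assumes the LM Studio local
# server is running on its default port and the model is already loaded.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("paper.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# raw_text.txt holds the RAW_TEXT anchor block quoted above.
with open("raw_text.txt") as f:
    raw_text = f.read()

prompt = (
    "Below is the image of one page of a document, as well as some raw "
    "textual content that was previously extracted for it. Just return the "
    "plain text representation of this document as if you were reading it "
    "naturally.\nDo not hallucinate.\n"
    f"RAW_TEXT_START\n{raw_text}\nRAW_TEXT_END"
)

response = client.chat.completions.create(
    model="your-model-identifier-here",  # as shown in LM Studio
    temperature=0.1,  # matches temp = 0.100 in the log above
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

If the same request hangs here too, that would point at the model/runtime rather than the chat UI.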
