Anthracite
Community Activity Feed

AI & ML interests: none defined yet.

Recent Activity (anthracite-org)

Nitral-AI posted an update 4 days ago
That moment when you spend 5 days babysitting training runs, only for Colab Pro+ to randomly disconnect the environment at every chance with zero error indication of any kind (it just disconnects without an error). I nuke the session from the interface, but it keeps eating my Colab credits while still reporting to wandb. There's no way of saving the models when this happens, since it also nukes the code I had set up to auto-execute. And since the sessions 'exist' but at the same time don't exist, I can't close them and have to wait until they auto-timeout after 24 hours. Guess I won't be using Colab for 'quick' test trains anymore. Thanks, Google, for burning through the very little model training budget I had for the month.
grimjim posted an update 6 days ago
I've arrived at an interesting result on the current Open LLM leaderboard.
open-llm-leaderboard/open_llm_leaderboard
After narrowing the model filter to 8-9B parameters, my recent merge of o1 reasoning models achieved the highest MATH eval result of any Llama 3.x 8B model currently on the board, hitting 33.99% and placing 973/2795 overall.
grimjim/HuatuoSkywork-o1-Llama-3.1-8B

Unfortunately, I need more information to evaluate the parent models used in the merge.
The Skywork/Skywork-o1-Open-Llama-3.1-8B model scored 0% on the MATH eval and placed 2168/2795, which I suspect was due to output formatting baked too hard into the model; the merge achieved a significant uplift in every benchmark across the board.
Unfortunately, FreedomIntelligence/HuatuoGPT-o1-8B had not been benchmarked as of this post, so I am unable to assess relative performance. Nevertheless, it is intriguing that an ostensibly medical o1 model appears to have contributed a sizable MATH boost.
grimjim posted an update 10 days ago
I'm (finally) releasing a Python script that trims excess weights in Gemma2 full-weight models that were bloated by ~1B parameters due to an early mergekit bug.
https://github.com/jim-plus/Gemma2-mergekit-remediation

I'd noticed something was off when merges of Gemma2 9B models ended up having ~10B parameters. The current mergekit package is fine, but there are still bloated models on HF that could stand to be fixed.

The script assumes that it will be run from the same directory as the model weights, and will trim the unnecessary lm_head.weight tensor and corresponding index entry.
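
For reference, here is a minimal sketch of that kind of cleanup, assuming sharded safetensors weights with a model.safetensors.index.json in the working directory; this is not the released script, just an illustration of the idea:

```python
# Minimal sketch (not the released script): drop lm_head.weight from sharded
# safetensors weights and from model.safetensors.index.json in the current directory.
import json
from safetensors.torch import load_file, save_file

INDEX = "model.safetensors.index.json"
TARGET = "lm_head.weight"

with open(INDEX) as f:
    index = json.load(f)

shard = index["weight_map"].pop(TARGET, None)  # remove the index entry
if shard is None:
    print(f"{TARGET} not present; nothing to trim.")
else:
    tensors = load_file(shard)                 # load the shard holding the tensor
    removed = tensors.pop(TARGET)
    save_file(tensors, shard, metadata={"format": "pt"})  # rewrite the shard without it
    # keep the reported total size consistent with the trimmed weights
    if "metadata" in index and "total_size" in index["metadata"]:
        index["metadata"]["total_size"] -= removed.numel() * removed.element_size()
    with open(INDEX, "w") as f:
        json.dump(index, f, indent=2)
    print(f"Removed {TARGET} ({removed.numel():,} parameters) from {shard}.")
```
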
"repetitive" (#9, 3 comments) opened 19 days ago by Utochi
grimjim posted an update 26 days ago
A reminder that literal base models are valid choices for the base model in task arithmetic merges. Each Instruct or fine-tuned model then becomes a task vector against the base model. The example merge formula used can be found via the model page below; a sketch of the underlying arithmetic follows the link.
grimjim/Magnolia-v3-12B
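
For illustration, the sketch below shows the per-tensor arithmetic that task arithmetic performs when a literal base model is used. The checkpoints and weights are placeholders (one is outright hypothetical), not the actual Magnolia-v3-12B recipe, which mergekit produces from its own config:

```python
# Minimal sketch of task arithmetic over state dicts (illustrative weights and
# checkpoints; not the actual Magnolia-v3-12B recipe, which mergekit produces).
import torch
from transformers import AutoModelForCausalLM

def task_arithmetic(base_sd, finetuned_sds, weights):
    """merged = base + sum_i w_i * (finetuned_i - base), tensor by tensor."""
    merged = {}
    for name, base_t in base_sd.items():
        delta = sum(w * (sd[name].float() - base_t.float())
                    for sd, w in zip(finetuned_sds, weights))
        merged[name] = (base_t.float() + delta).to(base_t.dtype)
    return merged

# Placeholder checkpoints: a literal base model plus two fine-tunes of it.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Base-2407", torch_dtype=torch.bfloat16)
ft_a = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407", torch_dtype=torch.bfloat16)
ft_b = AutoModelForCausalLM.from_pretrained(
    "some-org/another-nemo-finetune", torch_dtype=torch.bfloat16)  # hypothetical repo

merged_sd = task_arithmetic(base.state_dict(),
                            [ft_a.state_dict(), ft_b.state_dict()],
                            weights=[0.8, 0.5])  # illustrative weights
base.load_state_dict(merged_sd)
base.save_pretrained("merged-task-arithmetic-12B")
```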

"It's really good." (#4, 1 comment) opened about 1 month ago by FistfulSteel
"2.75bpw?" (#2, 2 comments) opened about 1 month ago by Darkknight535
grimjim posted an update about 2 months ago
Speculative decoding only requires that the tokenizers for the two LLMs line up; the model architectures do not have to be otherwise compatible. As a proof of concept, I used exllamav2 to run Llama 3.2 1B Instruct (at 6bpw, for speed) as the draft model to accelerate a target model that is a Llama 3 8B merge of Instruct models (at 8bpw, for accuracy). The difference between the tokenizers was minor enough to allow this. With 8k context length allocated for each model, both fit in under 13GB VRAM.
https://github.com/turboderp/exllamav2
meta-llama/Llama-3.2-1B-Instruct
grimjim/llama-3-Nephilim-v3-8B
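
To make the mechanics concrete without reproducing the exllamav2 wiring, here is a minimal greedy draft-then-verify loop using Hugging Face transformers; the window size k, the greedy acceptance rule, and the use of transformers instead of exllamav2 are illustrative assumptions rather than the post's actual setup:

```python
# Minimal greedy sketch of speculative decoding (illustrative only; the post's
# actual setup used exllamav2 with quantized models, not this transformers loop).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT = "meta-llama/Llama-3.2-1B-Instruct"   # small draft model (as in the post)
TARGET = "grimjim/llama-3-Nephilim-v3-8B"    # larger target model (as in the post)

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(TARGET)
draft = AutoModelForCausalLM.from_pretrained(DRAFT, torch_dtype=torch.bfloat16).to(device)
target = AutoModelForCausalLM.from_pretrained(TARGET, torch_dtype=torch.bfloat16).to(device)

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 200, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1) Draft model proposes k tokens greedily.
        proposal = draft.generate(ids, max_new_tokens=k, do_sample=False,
                                  pad_token_id=tok.eos_token_id)
        drafted = proposal[:, ids.shape[1]:]
        # 2) Target model scores prompt + drafted tokens in a single forward pass.
        logits = target(proposal).logits
        # Target's greedy prediction at each drafted position.
        preds = logits[:, ids.shape[1] - 1:-1, :].argmax(dim=-1)
        # 3) Accept the longest prefix where draft and target agree, then append
        #    the target's own token at the first disagreement.
        agree = (preds == drafted).long().cumprod(dim=-1)
        n_accept = int(agree.sum())
        ids = torch.cat([ids, drafted[:, :n_accept]], dim=-1)
        if n_accept < drafted.shape[1]:
            ids = torch.cat([ids, preds[:, n_accept:n_accept + 1]], dim=-1)
    return tok.decode(ids[0, prompt_len:], skip_special_tokens=True)

print(speculative_generate("Write a short story about a lighthouse keeper."))
```

Recent versions of transformers also expose this idea natively as assisted generation (target.generate(..., assistant_model=draft)), and exllamav2 has its own draft-model support; the loop above only exists to show why matching tokenizers are the one hard requirement.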

The proof-of-concept Python script compared a zero-shot creative task of writing a story limited to 500 tokens. Speculative decoding cut generation time by roughly one third relative to conventional decoding (e.g., throughput rose from 31 tokens/sec to 46 tokens/sec), and the result was consistent over a few runs. While not a statistically rigorous benchmark, this suggests that smaller models aimed at edge computing can serve effectively as draft models in the general case.

It is straightforward to consult the literature to affirm that fine-tuning draft models can be a way of inducing behavioral change in target models, in a manner not unlike how samplers can be used to induce changes. I speculate that the impact of a fine-tuned draft model would be on par with a LoRA (Low-Rank Adaptation), as the target model retains veto power. The small size of draft model candidates means that more people can perform local full fine-tuning.

It is intuitively obvious that a distilled model can be used as a draft model for the larger teacher model so long as tokenizers line up; e.g., a distilled 8B model can draft for a 70B teacher model. Perhaps Llama-3.1-SuperNova-Lite 8B could effectively draft for the original Llama-3.1-405B-Instruct model.
arcee-ai/Llama-3.1-SuperNova-Lite
meta-llama/Llama-3.1-405B-Instruct

"Possibly get a 2.75bpw?" (#2, 1 comment) opened about 2 months ago by Adzeiros
lucyknada, in anthracite-org/magnum-v4-123b about 2 months ago:
"add Datasets to readme metadata" (#3) opened about 2 months ago by Delta-Vector
"Update README.md" (#1) opened about 2 months ago by Delta-Vector
