Maximizing Model Performance for All Quants Types And Full-Precision using Samplers, Advance Samplers and Parameters Guide
(Updated: "INDEX", and added "Generation Steering" section ; notes on Roleplay/Simulation added, Screenshots of parameters/samplers added in quick reference section.)
This document includes detailed information, references, and notes for general parameters, samplers and advanced samplers to get the most out of your model's abilities including notes / settings for the most popular AI/LLM app in use (LLAMACPP, KoboldCPP, Text-Generation-WebUI, LMStudio, Sillytavern, Ollama and others).
These settings / suggestions can be applied to all models including GGUF, EXL2, GPTQ, HQQ, AWQ and full source/precision.
It also includes critical settings for Class 3 and Class 4 models at this repo - DavidAU - to enhance and control generation for specific as a well as outside use case(s) including role play, chat and other use case(s).
The settings discussed in this document can also fix a number of model issues (any model, any repo) such as:
- "Gibberish"
- Generation length (including out of control generation)
- Chat quality / Multi-Turn convos.
- Multi-turn / COT / and other multi prompt/answer generation
- Letter, word, phrase, paragraph repeats
- Coherence
- Instruction following
- Creativeness or lack there of or .. too much - purple prose.
- Low quant (ie q2k, iq1s, iq2s) issues.
- General output quality.
- Role play related issues.
Likewise ALL the setting (parameters, samplers and advanced samplers) below can also improve model generation and/or general overall "smoothness" / "quality" of model operation:
- all parameters and samplers available via LLAMACPP (and most apps that run / use LLAMACPP - including Lmstudio, Ollama, Sillytavern and others.)
- all parameters (including some not in Lllamacpp), samplers and advanced samplers ("Dry", "Quadratic", "Microstat") in oobabooga/text-generation-webui including llamacpp_HF loader (allowing a lot more samplers)
- all parameters (including some not in Lllamacpp), samplers and advanced samplers ("Dry", "Quadratic", "Microstat") in SillyTavern / KoboldCPP (including Anti-slop filters)
Even if you are not using my models, you may find this document useful for any model (any quant / full source / any repo) available online.
If you are currently using model(s) - from my repo and/or others - that are difficult to "wrangle" then you can apply "Class 3" or "Class 4" settings to them.
This document will be updated over time too and is subject to change without notice.
Please use the "community tab" for suggestions / edits / improvements.
IMPORTANT:
Every parameter, sampler and advanced sampler here affects per token generation and overall generation quality.
This effect is cumulative especially with long output generation and/or multi-turn (chat, role play, COT).
Likewise because of how modern AIs/LLMs operate the previously generated (quality) of the tokens generated affect the next tokens generated too.
You will get higher quality operation overall - stronger prose, better answers, and a higher quality adventure.
PS: Running a 70B model?
You may want to see this document:
INDEX
How to Use this document:
Review quant(s) information to select quant(s) to download, then review "Class 1,2,3..." for specific information on models followed by "Source Files...APPS to run LLMs/AIs".
"TESTING / Default / Generation Example PARAMETERS AND SAMPLERS" are the basic defaults for parameters, and samplers - the bare minimums. You should always set these first.
The optional section "Generational Control And Steering of a Model / Fixing Model Issues on the Fly" covers methods to manually steer / edit / modify generation (as well as fixes) for any model.
"Quick reference" will state the best parameter settings for each "Class" of model(s) to get the best operation and/or good defaults to use to get started. If you came to this page from a repo card on my repo -DavidAU- the "class" of the model would have been stated just before you came to this page.
The detailed sections about parameters - Section 1 a,b,c and section 2 will help tune the model(s) operation.
The "DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS" section after this covers and links to more information about "tuning" your model(s). These cover theory, hints, tips and tricks, and observations and how to fine control CLASS 3/4 models directly.
All information about parameters, samplers and advanced samplers applies to ALL models, regardless of repo(s) you download them from.
QUANTS: - QUANTS Detailed information. - IMATRIX Quants - QUANTS GENERATIONAL DIFFERENCES: - ADDITIONAL QUANT INFORMATION - ARM QUANTS / Q4_0_X_X - NEO Imatrix Quants / Neo Imatrix X Quants - CPU ONLY CONSIDERATIONSClass 1, 2, 3 and 4 model critical notes
SOURCE FILES for my Models / APPS to Run LLMs / AIs: - TEXT-GENERATION-WEBUI - KOBOLDCPP - SILLYTAVERN - Lmstudio, Ollama, Llamacpp, Backyard, and OTHER PROGRAMS - Roleplay and Simulation Programs/Notes on models.
TESTING / Default / Generation Example PARAMETERS AND SAMPLERS - Basic settings suggested for general model operation.
Generational Control And Steering of a Model / Fixing Model Issues on the Fly - Multiple Methods to Steer Generation on the fly - On the fly Class 3/4 Steering / Generational Issues and Fixes (also for any model/type) - Advanced Steering / Fixing Issues (any model, any type) and "sequenced" parameter/sampler change(s) - "Cold" Editing/Generation
Quick Reference Table / Parameters, Samplers, Advanced Samplers - Quick setup for all model classes for automated control / smooth operation. - Screenshots for multiple LLM/AI apps of parameters/samplers - Section 1a : PRIMARY PARAMETERS - ALL APPS - Section 1b : PENALITY SAMPLERS - ALL APPS - Section 1c : SECONDARY SAMPLERS / FILTERS - ALL APPS - Section 2: ADVANCED SAMPLERS
DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS: - DETAILS on PARAMETERS / SAMPLERS - General Parameters - The Local LLM Settings Guide/Rant - LLAMACPP-SERVER EXE - usage / parameters / samplers - DRY Sampler - Samplers
- Creative Writing - Benchmarking-and-Guiding-Adaptive-Sampling-DecodingADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)
QUANTS:
Please note that smaller quant(s) IE: Q2K, IQ1s, IQ2s and some IQ3s (especially those of models size 8B parameters or less) may require additional adjustment(s). For these quants you may need to increase the "penalty" sampler(s) and/or advanced sampler(s) to compensate for the compression damage of the model.
For models of 20B parameters and higher, generally this is not a major concern as the parameters can make up for compression damage at lower quant levels (IE Q2K+, but at least Q3 ; IQ2+, but at least IQ3+).
IQ1s: Generally IQ1_S rarely works for models less than 30B parameters. IQ1_M is however almost twice as stable/usable relative to IQ1_S.
Generally it is recommended to run the highest quant(s) you can on your machine ; but at least Q4KM/IQ4XS as a minimum for models 20B and lower.
The smaller the size of model, the greater the contrast between the smallest quant and largest quant in terms of operation, quality, nuance and general overall function.
There is an exception to this , see "Neo Imatrix" below and "all quants" (cpu only operation).
IMATRIX:
Imatrix quants generally improve all quants, and also allow you to use smaller quants (less memory, more context space) and retain quality of operation.
IE: Instead of using a q4KM, you might be able to run an IQ3_M and get close to Q4KM's quality, but at a higher token per second speed and have more VRAM for context.
Recommended Quants - ALL:
This covers both Imatrix and regular quants.
Imatrix can be applied to any quant - "Q" or "IQ" - however, IQ1s to IQ3_S REQUIRE an imatrix dataset / imatrixing process before quanting.
This chart shows the order in terms of "BPW" for each quant (mapped below with relative "strength" to one another) with "IQ1_S" with the least, and "Q8_0" (F16 is full precision) with the most:
IQ1_S | IQ1_M IQ2_XXS | IQ2_XS | Q2_K_S | IQ2_S | Q2_K | IQ2_M IQ3_XXS | Q3_K_S | IQ3_XS | IQ3_S | IQ3_M | Q3_K_M | Q3_K_L Q4_K_S | IQ4_XS | IQ4_NL | Q4_K_M Q5_K_S | Q5_K_M Q6_K Q8_0 F16
More BPW mean better quality, but higher VRAM requirements (and larger file size) and lower tokens per second. The larger the model in terms of parameters the lower the size of quant you can run with less quality losses. Note that "quality losses" refers to both instruction following and output quality.
Differences (quality) between quants at lower levels are larger relative to higher quants differences.
The Imatrix process has NO effect on Q8 or F16 quants.
F16 is full precision, just in GGUF format.
QUANTS GENERATIONAL DIFFERENCES:
Higher quants will have more detail, nuance and in some cases stronger "emotional" levels. Characters will also be more "fleshed out" too. Sense of "there" will also increase.
Likewise for any use case -> higher quants nuance (both instruction following AND output generation) will be higher.
"Nuance" is critical for both understanding, as well as the quality of the output generation.
To put this another way, "nuance" is lost as the full precision model is more and more compressed (lower and lower quants).
Some of this can be counteracted by parameters and/or Imatrix (as noted earlier).
IQ4XS / IQ4NL quants:
Due to the unusual nature of this quant (mixture/processing), generations from it will be different then other quants.
These quants can also be "quanted" with or without an Imatrix.
You may want to try it / compare it to other quant(s) output.
Special note on Q2k/Q3 quants:
You may need to use temp 2 or lower with these quants (1 or lower for q2k). Just too much compression at this level, damaging the model.
IQ quants (and Imatrix versions of q2k/q3) perform better at these "BPW" levels.
Rep pen adjustments may also be required to get the most out a model at this/these quant level(s).
ADDITONAL QUANT INFORMATION:
Click here for details
A great write up with charts showing various performances is provided by Artefact2 here
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart:
But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
The I-quants are not compatible with Vulcan, which is also AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulcan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
ARM QUANTS / Q4_0_X_X:
These are new quants that are specifically for computers/devices that can run "ARM" quants. If you try to run these on a "non arm" machine/device, the token per second will be VERY SLOW.
Q4_0_X_X information
These are NOT for Metal (Apple) or GPU (nvidia/AMD/intel) offloading, only ARM chips (and certain AVX2/AVX512 CPUs).
If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request
To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).
If you're using a CPU that supports AVX2 or AVX512 (typically server CPUs and AMD's latest Zen5 CPUs) and are not offloading to a GPU, the Q4_0_8_8 may offer a nice speed as well:
Click to view benchmarks on an AVX2 system (EPYC7702)
model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
---|---|---|---|---|---|---|---|
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation
NEO Imatrix Quants / Neo Imatrix X Quants
NEO Imatrix quants are specialized and specifically "themed" datasets used to slightly alter the weights in a model. All Imatrix datasets do this to some degree or another, however NEO Imatrix datasets are content / theme specific and have been calibrated to have maximum effect on a model (relative to standard Imatrix datasets). Calibration was made possible after testing 50+ standard Imatrix datasets, and carefully modifying them and testing the resulting changes to determine the exact format and content which has the maximum effect on a model via the Imatrix process.
Please keep in mind that the Imatrix process (at it strongest) only "tints" a model and/or slightly changes its bias(es).
Here are some Imatrix Neo Models:
[ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ]
[ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ] (this is an X-Quant)
[ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-SI-FI-GGUF ]
[ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-WEE-HORROR-GGUF ]
[ https://huggingface.co/DavidAU/L3-8B-Stheno-v3.2-Ultra-NEO-V1-IMATRIX-GGUF ]
Suggestions for Imatrix NEO quants:
- The LOWER the quant the STRONGER the Imatrix effect is, and therefore the stronger the "tint" so to speak
- Due to the unique nature of this project, quants IQ1s to IQ4s are recommended for maximum effect with IQ4_XS the most balanced in terms of power and bits.
- Secondaries are Q2s-Q4s. Imatrix effect is still strong in these quants.
- Effects diminish quickly from Q5s and up.
- Q8/F16 there is no change (as the Imatrix process does not affect this quant), and therefore not included.
CPU ONLY CONSIDERATIONS:
This section DOES NOT apply to most "Macs" because of the difference in O/S Memory, Vram and motherboard VS other frameworks.
Running quants on CPU will be a lot slower than running them on a video card(s).
In this special case however it may be preferred to run AS SMALL a quant as possible for token per second generation reasons.
On a top, high end (and relatively new) CPU expect token per second speeds to be 1/4 (or less) a standard middle of the road video card.
Older machines/cpus will be a lot slower - but models will STILL run on these as long as you have enough ram.
Here are some rough comparisons:
On my video card (Nvidia 16GB 4060TI) I get 160-190 tokens per second with 1B LLama 3.2 Instruct, CPU speeds are 50-60 token per second.
On my much older machine (8 years old)(2 core), token per second speed (same 1B model) is in the 10ish token per second (CPU).
Roughly 8B-12B models are limit for CPU only operation (in terms of "usable" tokens/second) - at the moment.
This is changing as new cpus come out, designed for AI usage.
Class 1, 2, 3 and 4 model critical notes:
Some of the models at my repo are custom designed / limited use case models. For some of these models, specific settings and/or samplers (including advanced) are recommended for best operation.
As a result I have classified the models as class 1, class 2, class 3 and class 4.
Each model is "classed" on the model card itself for each model.
Generally all models (mine and other repos) fall under class 1 or class 2 and can be used when just about any sampler(s) / parameter(s) and advanced sampler(s).
Class 3 requires a little more adjustment because these models run closer to the ragged edge of stability. The settings for these will help control them better, especially for chat / role play and/or other use case(s). Generally speaking, this helps them behave better overall.
Class 4 are balanced on the very edge of stability. These models are generally highly creative, for very narrow use case(s), and closer to "human prose" than other models and/or operate in ways no other model(s) operate offering unique generational abilities. With these models, advanced samplers are used to "bring these bad boys" inline which is especially important for chat and/or role play type use cases AND/OR use case(s) these models were not designed for.
For reference here are some Class 3/4 models:
[ https://huggingface.co/DavidAU/L3-Stheno-Maid-Blackroot-Grand-HORROR-16B-GGUF ]
(note Grand Horror Series contain class 2,3 and 4 models)
[ https://huggingface.co/DavidAU/L3-DARKEST-PLANET-16.5B-GGUF ]
(note Dark Planet Series contains Class 1, 2 and Class 3/4 models)
[ https://huggingface.co/DavidAU/MN-DARKEST-UNIVERSE-29B-GGUF ]
(this model has exceptional prose abilities in all areas)
[ https://huggingface.co/DavidAU/MN-GRAND-Gutenberg-Lyra4-Lyra-23.5B-GGUF ]
(note Grand Guttenberg Madness/Darkness (12B) are class 1 models, but compressed versions of 23.5B)
Although Class 3 and Class 4 models will work when used within their specific use case(s), standard parameters and settings on the model card, I recognize that users want either a smoother experience and/or want to use these models for other than intended use case(s) and that is in part why I created this document.
The goal here is to use parameters to raise/lower the power of the model and samplers to "prune" (and/or in some cases enhance) operation.
With that being said, generation "examples" (at my repo) are created using the "Primary Testing Parameters" (top of this document) settings regardless of the "class" of the model and no advanced settings, parameters, or samplers.
However, for ANY model regardless of "class" or if it is at my repo, you can now take performance to the next level with the information contained in this document.
Side note:
There are no "Class 5" models published... yet.
SOURCE FILES for my Models / APPS to Run LLMs / AIs:
Source files / Source models of my models are located here (also upper right menu on this page):
You will need the config files to use "llamacpp_HF" loader ("text-generation-webui") [ https://github.com/oobabooga/text-generation-webui ]
You can also use the full source in "text-generation-webui" too.
As an alternative you can use GGUFs directly in "KOBOLDCPP" / "SillyTavern" without the "config files" and still use almost all the parameters, samplers and advanced samplers.
Parameters, Samplers and Advanced Samplers
In section 1 a,b, and c, below are all the LLAMA_CPP parameters and samplers.
I have added notes below each one for adjustment / enhancement(s) for specific use cases.
TEXT-GENERATION-WEBUI
In section 2, will be additional samplers, which become available when using "llamacpp_HF" loader in https://github.com/oobabooga/text-generation-webui AND/OR https://github.com/LostRuins/koboldcpp ("KOBOLDCPP").
The "llamacpp_HF" (for "text-generation-webui") only requires the GGUF you want to use plus a few config files from "source repo" of the model.
(this process is automated with this program, just enter the repo(s) urls -> it will fetch everything for you)
This allows access to very advanced samplers in addition to all the parameters / samplers here.
KOBOLDCPP:
Note that https://github.com/LostRuins/koboldcpp also allows access to all LLAMACPP parameters/samplers too as well as additional advanced samplers too.
You can use almost all parameters, samplers and advanced samplers using "KOBOLDCPP" without the need to get the source config files (the "llamacpp_HF" step).
Note: This program has one of the newest samplers called "Anti-slop" which allows phrase/word banning at the generation level.
SILLYTAVERN:
Note that https://github.com/SillyTavern/SillyTavern also allows access to all LLAMACPP parameters/samplers too as well as additional advanced samplers too.
You can use almost all parameters, samplers and advanced samplers using "SILLYTAVERN" without the need to get the source config files (the "llamacpp_HF" step).
For CLASS3 and CLASS4 the most important setting is "SMOOTHING FACTOR" (Quadratic Smoothing) ; information is located on this page:
https://docs.sillytavern.app/usage/common-settings/
Critical Note:
Silly Tavern allows you to "connect" (via API) to different AI programs/apps like Koboldcpp, Llamacpp (server), Text Generation Webui, Lmstudio, Ollama ... etc etc.
You "load" a model in one of these, then connect Silly Tavern to the App via API. This way you can use any model, and Sillytavern becomes the interface between the AI model and you directly. Sillytavern opens an interface in your browser.
In Sillytavern you can then adjust parameters, samplers and advanced samplers ; there are also PRESET parameter/samplers too and you can save your favorites too.
Currently, at time of this writing, connecting Silly Tavern via KoboldCPP or Text Generation Webui will provide the most samplers/parameters.
However for some, connecting to Lmstudio, LlamaCPP, or Ollama may be preferred.
You may also want to check out how to connect SillyTavern to local AI "apps" running on your pc here:
https://docs.sillytavern.app/usage/api-connections/
Lmstudio, Ollama, Llamacpp, and OTHER PROGRAMS
Other programs like https://www.LMStudio.ai allows access to most of STANDARD samplers, where as others (llamacpp only here) you may need to add to the json file(s) for a model and/or template preset.
In most cases all llama_cpp parameters/samplers are available when using API / headless / server mode in "text-generation-webui", "koboldcpp", "Sillytavern", "Olama", and "LMStudio" (as well as other apps too).
You can also use llama_cpp directly too. (IE: llama-server.exe) ; see :
https://github.com/ggerganov/llama.cpp
(scroll down on the main page for more apps/programs to use GGUFs too that connect to / use the LLAMA-CPP package.)
Special note:
It appears "DRY" / "XTC" samplers has been added to LLAMACPP and SILLYTAVERN.
It is available (Llamacpp) via "server.exe / llama-server.exe". Likely this sampler will also become available "downstream" in applications that use LLAMACPP in due time.
[ https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md ]
Operating Systems:
Most AI/LLM apps operate on Windows, Mac, and Linux.
Mobile devices (and O/S) are in many cases also supported.
Roleplay and Simulation Programs/Notes on models.
Text Generation Webui, KoboldCPP, and Silly Tavern (and AI/LLM apps connected via Silly Tavern) can all do roleplay / simulation AS WELL as "chat" and other creative activities.
LMStudio (the app here directly), Ollama and other LLM/AI apps are for general usage, however they can be connected to Silly Tavern via API too.
Backyard ( https://backyard.ai/ ) is software that is dedicated primarily to Roleplay / Simulation, however it can not be (at time of this writing) connected via API to Silly Tavern at this time.
If you are using Backyard app, see special notes for "roleplay / simulation" and where applicable, "BACKYARD APP" for specific notes on using these app.
Models that are Class 3/4 :
Some of my models that are rated Class 3 or 4 maybe a little more challenging to operate with roleplay, especially if you can not access / control certain samplers.
How to handle this issue is addressed in "Generational Steering" section (you control it) as well as Quick Reference, and Detailed Parameters, Samplers and Advanced Samplers Sections (automated control).
Also, some of my models are available in multiple "classes", IE Dark Planet, and Grand Gutenberg.
In these cases, Dark Planet 8B versions and Grand Gutenberg 12B ("Darkness" / "Madness") are class 1 - any use case, including role play and simulation.
Likewise Darkest Planet 16.5B and Grand Gutenberg 23/23.5B are class 3 - great at roleplay/simulation, but need a bit more steering and/or parameter/samplers adjustments to work flawlessly for this use case.
Note: Dark Planet 8B (class 1) is also a compressed version of Grand Horror 16B (a full on class 4)
TESTING / Generation Example PARAMETERS AND SAMPLERS
Primary Testing Parameters I use, including use for output generation examples at my repo:
Ranged Parameters:
temperature: 0 to 5 ("temp")
repetition_penalty : 1.02 to 1.15 ("rep pen")
Set parameters:
top_k:40
min_p:0.05
top_p: 0.95
repeat-last-n: 64 (also called: "repetition_penalty_range" / "rp range" )
I do not set any other settings, parameters or have samplers activated when generating examples.
Everything else is "zeroed" / "disabled".
IMPORTANT:
These parameters/settings are considered both safe and default and in most cases available to all users in all AI/LLM apps.
You should set these as noted first. I would say these are the minimum settings to use to get good model operation.
Note for Class 3/Class 4 models settings/samplers (discussed below) "repeat-last-n" is a CRITICAL setting.
BACKYARD APP:
In "Backyard" app, "repetition_penalty_range" is called "Repeat Penalty Tokens" (set on the "character card").
For class 3/4 models (if using with Backyard app), set this to 64 OR LESS.
Generational Control And Steering of a Model / Fixing Model Issues on the Fly
Multiple Methods to Steer Generation on the fly
Now that you have the basic parameters and samplers from the previous section, I will cover Generational Control and Steering.
This section is optional and covers how to manually STEER generation(s) - ANY MODEL, ANY TYPE.
This section (in part) will also cover how to deal with Class 3/4 model issues directly, as well as general issues than can happen with any "class" of model during generation IF you want to control them manually as the "Quick Reference" and/or "Detailed Parameters, Samplers, and Advanced Samplers" will cover how to deal with any generation issue(s) automatically.
There is a very important concept that must be covered first:
The output/generation/answer to your prompt/instructions BECOMES part of your "prompt" after you click STOP, and then click on "CONTINUE".
Likewise is true in multi-turn chat, role play, or in a "chat window" so to speak.
Your prompts AND the model's "answers"/"generation" all become part of the "ROADMAP" for the model to use in whatever journey you are on.
When you hit "REGEN" this nullifies only the last "generation" - not the prompt before it, nor the prompt(s)/generation(s) in the same chat.
The part I will cover here is once a generation has started, from a single prompt (no other prompts/generations in the chat).
So lets start with a prompt (NOTE: this prompt has no "steering" in the instructions):
Start a 1000 word scene (vivid horror, 1st person, include thoughts) with: The sky scraper swayed, as she watched the window in front of her on the 21 floor explode...
Generation starts ... and then ends.
It could be 500 words, to 4000+...
Then you hit regen however many times to get a "good" generation.
There is a better way.
Generation starts... 200 words in you think... this is not going in the right direction.
Do you hit stop? Then regen?
There are a lot more options:
1 - Hit Stop.
2 - Select "EDIT" -> Edit out the part(s) you don't want AND/OR add in STEERING "text" (statement, phrase, paragraph, even a single word) (anywhere in the "generation" text).
3 - Hit Continue.
Once you hit "continue" the change(s) you made will now steer the models choices.
The LAST edit (bottom of the generation) will have the most impact. However ALL EDITS will affect generation as these become part of the generational "ROADMAP".
You can repeat this process at will.
Eventually the model will come to a "natural" stopping point.
If you want to model to continue past this model, delete a few lines AND "steer" it.
These methods apply to all generation types - not just a "scene" or "story", but "programming code", "article", "conclusions", "analytics", ... you name it.
Notes:
- For Text Generation Webui, you can transfer your "chat" to "notebook" for easy Stop/Edit/Continue function.
- For KoboldCPP -> This is built in.
- For Silly Tavern -> This is built in.
- For LMStudio -> This is built in.
- For API (direct control) you have to send the "chat" elements back to the "server" with the "edits" (send the whole "revised" chat as a json payload).
On the fly Class 3/4 Steering / Generational Issues and Fixes (also for any model/type):
Generational issues can occur such as letter(s), word(s), phrase(s), paragraph repeat(s), "rants" etc etc which can occur at any point during generation.
This can happen to ANY model, any type ; however with Class 3/4 models there is a higher chance this will occur because of how these models operate.
The "Quick Reference" and Detailed Parameters, Samplers and Advanced Samplers (below) cover how to set the model "controls" to do this automatically.
However, sometimes these settings MAY trim too much (ie creativity, "madness", nuance, emotion, even the "right answer(s) etc etc) sometimes, so I will show you how to address these issues directly.
If you have a letter(s) and/or word(s) repeat:
- Stop generation, edit out this, and back ONE OR TWO lines (delete)
- Hit continue.
- Better: Do these steps, and add "steering" (last line -> word, phrase, sentence)
If you have single or multiple paragraph repeat(s):
- Stop generation, edit out all the paragraph(s), and back ONE OR TWO lines OR last NON repeating paragraph (delete)
- Hit continue.
- Better: Do these steps, and add "steering" (last line -> word, phrase, sentence or paragraph)
In each case we are BREAKING the "condition(s)" that lead (or lead into) to the repeat(s).
If you have "rants" and/or "model has lost its mind":
- Stop generation, edit out all the paragraph(s), and back AS FAR as possible to where is appears the rant/mind loss occured (delete ALL) and delete one additional paragraph / 2 or more sentences.
- Hit continue.
- Better: Do these steps, and add "steering" (last line -> word, phrase, sentence or paragraph).
Class 3/4 model additional note:
With these classes of model, you MAY need to "edit" / "revise" further back than one or two lines / one paragraph - they sometimes need just a little more editing.
Another option is using "Cold" Editing/Generation explained below.
Advanced Steering / Fixing Issues (any model, any type) and "sequenced" parameter/sampler change(s)
This will drastically (depending on changes you make) change up "Continue(d)" generation(s):
- Do the edits above (steering and/or "steering fixes"), but before you click "Continue" (after your "Edit(s)"), adjust the parameter(s), sampler(s) and advanced sampler(s) settings.
- Once you do this BEFORE hitting "Continue" your new settings will be applied to all generation from your new "Continue" point.
- You can repeat this process at will.
- You can also hit "stop", make NO EDIT(S), adjust the parameter(s), sampler(s) and advanced sampler(s) settings and hit "Continue" and the new settings will take effect from the "stop point" going forward.
"Cold" Editing/Generation
Let say you have a generation, but you want to edit it later IN A NEW CHAT.
Sometimes you can just copy/paste the generation and the model MAY get the "IDEA" and continue the generation without a prompt or direction.
However this does not always work.
So you need something along these lines (adjust accordingly):
Instructions: Continue this scene, using vivid and graphic details.
SCENE:
(previous generation)
Note the structure, layout and spacing.
If it was programming code:
Instructions: Continue this javascript, [critical instructions here for "code" goals]
JAVASCRIPT:
(previous generation)
You may want to include the ENTIRE prior prompt (with some modifications) used in the first generation:
Instructions: Continue the scene below (vivid horror, 1st person, include thoughts) with: The sky scraper swayed, as she watched the window in front of her on the 21 floor explode...
SCENE:
(previous generation)
NOTE:
You may want to modify the instructions to provide a "steering" continue point and/or "goal" for the generation to the model has some idea how to proceed.
Quick Reference Table - Parameters, Samplers, Advanced Samplers
Compiled by: "EnragedAntelope" ( https://huggingface.co/EnragedAntelope || https://github.com/EnragedAntelope )
This section will get you started - especially with class 3 and 4 models - and the detail section will cover settings / control in more depth below.
Please see sections below this for advanced usage, more details, settings, notes etc etc.
IMPORTANT NOTES:
Not all parameters, samplers and advanced samplers are listed in this quick reference section. Scroll down to see all of them in following sections.
Likewise there may be some "name variation(s)" - in other LLM/AI apps - this is addressed in the detailed sections.
I have added Screenshots of settings for Class 1-2, Class 3 and Class 4 are below this chart for Koboldcpp, SillyTavern and Text Gen Webui.
# LLM Parameters Reference Table| Parameter | Description |
|----------- |-------------|
| Primary Parameters |
| temperature | Controls randomness of outputs (0 = deterministic, higher = more random). Range: 0-5 |
| top-p | Selects tokens with probabilities adding up to this number. Higher = more random results. Default: 0.9 |
| min-p | Discards tokens with probability smaller than this value × probability of most likely token. Default: 0.1 |
| top-k | Selects only top K most likely tokens. Higher = more possible results. Default: 40 |
| Penalty Samplers |
| repeat-last-n | Number of tokens to consider for penalties. Critical for preventing repetition. Default: 64 (Class 3/4 - but see notes) |
| repeat-penalty | Penalizes repeated token sequences. Range: 1.0-1.15. Default: 1.0 |
| presence-penalty | Penalizes token presence in previous text. Range: 0-0.2 for Class 3, 0.1-0.35 for Class 4 |
| frequency-penalty | Penalizes token frequency in previous text. Range: 0-0.25 for Class 3, 0.4-0.8 for Class 4 |
| penalize-nl | Penalizes newline tokens. Generally unused. Default: false |
| Secondary Samplers |
| mirostat | Controls perplexity during sampling. Modes: 0 (off), 1, or 2 |
| mirostat-lr | Mirostat learning rate. Default: 0.1 |
| mirostat-ent | Mirostat target entropy. Default: 5.0 |
| dynatemp-range | Range for dynamic temperature adjustment. Default: 0.0 |
| dynatemp-exp | Exponent for dynamic temperature scaling. Default: 1.0 |
| tfs | Tail free sampling - removes low-probability tokens. Default: 1.0 |
| typical | Selects tokens more likely than random given prior text. Default: 1.0 |
| xtc-probability | Probability of token removal. Range: 0-1 |
| xtc-threshold | Threshold for considering token removal. Default: 0.1 |
| Advanced Samplers |
| dry_multiplier | Controls DRY (Don't Repeat Yourself) intensity. Range: 0.8-1.12+ Class 3 (Class 4 is higher) |
| dry_allowed_length | Allowed length for repeated sequences in DRY. Default: 2 |
| dry_base | Base value for DRY calculations. Range: 1.15-1.75+ for Class 4 |
| smoothing_factor | Quadratic sampling intensity. Range: 1-3 for Class 3, 3-5+ for Class 4 |
| smoothing_curve | Quadratic sampling curve. Range: 1 for Class 3, 1.5-2 for Class 4 |
Notes
- For Class 3 and 4 models, using both DRY and Quadratic sampling is recommended (see advanced/detailed samplers below on how to control the model here directly)
- Lower quants (Q2K, IQ1s, IQ2s) may require stronger settings due to compression damage
- Parameters interact with each other, so test changes one at a time
- Always test with temperature at 0 first to establish a baseline
SCREENSHOTS (right click-> open in new window) of Class 1-2, Class 3, and Class 4 for KoboldCPP, Silly Tavern and Text Gen Webui.
NOTES:
These cover basic/default settings PER CLASS. See "quick range" of settings above, and full range with details on how to "fine tune" model operation below.
This is especially important for fine control of Class 3 and Class 4 models ; sometimes you can use class 2 or 3 settings for class 3 and even class 4 models.
It is your use CASE(s) / smooth operation requirements that determine which settings will work best.
You should not apply class 3 or class 4 settings on a class 1 or class 2 model - this might limit model operation and usually class 1/2 models do not require this level of control.
CLASS 3/4 MODELS:
If you are using a class 3 or class 4 model for use case(s) such as role play, multi-turn, chat etc etc, it is suggested to activate / set all samplers for class 3 but may be required for class 4 models.
Likewise for fine control of a class 3/4 via "DRY" and "Quadratic" samplers is detailed below. These allow you to dial up or dial down the model's raw power directly.
ROLEPLAY / SIMULATION NOTES:
If you are using a model (regardless of "class") for these uses cases, you may need to LOWER "temp" to get better instruction following.
Instruction following issues can cascade over the "adventure" if the temp is set too high for the specific model(s) you are using.
Likewise you may want to set MAXIMUM output tokens (a hard limit how much the model can output) to much lower values such as 128 to 300.
(This will assist with steering, and stop the model from endlessly "yapping")
MICROSTAT Sampler - IMPORTANT:
Make sure to review MIROSTAT sampler settings below, due to behaviour of this specific sampler / affect on parameters/other samplers which varies from app to app too.
Section 1a : PRIMARY PARAMETERS - ALL APPS:
These parameters will have SIGNIFICANT effect on prose, generation, length and content; with temp being the most powerful.
Keep in mind the biggest parameter / random "unknown" is your prompt.
A word change, rephrasing, punctation , even a comma, or semi-colon can drastically alter the output, even at min temp settings. CAPS also affect generation too.
Likewise the size, and complexity of your prompt impacts generation too ; especially clarity and direction.
Special note:
Pre-prompts / system role are not discussed here. Many of the model repo cards (at my repo) have an optional pre-prompt you can use to aid generation (and can impact instruction following too).
Some of my newer models repo cards use a limited form of this called a "prose control" (discussed and shown by example).
Roughly a pre-prompt / system role is embedded during each prompt and can act as a guide and/or set of directives for processing the prompt and/or containing generation instructions.
A prose control is a simplified version of this, which precedes the main prompt(s) - but the idea / effect is relatively the same (pre-prompt/system role does have a slightly higher priority however).
I strongly suggest you research these online, as they are a powerful addition to your generation toolbox.
They are especially potent with newer model archs due to newer model types having stronger instruction following abilities AND increase context too.
PRIMARY PARAMETERS:
temp / temperature
temperature (default: 0.8)
Primary factor to control the randomness of outputs. 0 = deterministic (only the most likely token is used). Higher value = more randomness.
Range 0 to 5. Increment at .1 per change.
Too much temp can affect instruction following in some cases and sometimes not enough = boring generation.
Newer model archs (L3,L3.1,L3.2, Mistral Nemo, Gemma2 etc) many times NEED more temp (1+) to get their best generations.
ROLEPLAY / SIMULATION NOTE:
If you are using a model (regardless of "class") for these uses cases, you may need to LOWER temp to get better instruction following.
top-p
top-p sampling (default: 0.9, 1.0 = disabled)
If not set to 1, select tokens with probabilities adding up to less than this number. Higher value = higher range of possible random results.
Dropping this can simplify word choices but this works in conjunction with "top-k"
I use default of: .95 ;
min-p
min-p sampling (default: 0.1, 0.0 = disabled)
Tokens with probability smaller than (min_p) * (probability of the most likely token) are discarded.
I use default: .05 ;
Careful adjustment of this parameter can result in more "wordy" or "less wordy" generation but this works in conjunction with "top-k".
top-k
top-k sampling (default: 40, 0 = disabled)
Similar to top_p, but select instead only the top_k most likely tokens. Higher value = higher range of possible random results.
Bring this up to 80-120 for a lot more word choice, and below 40 for simpler word choices.
As this parameter operates in conjunction with "top-p" and "min-p" all three should be carefully adjusted one at a time.
NOTE - "CORE" Testing with "TEMP":
For an interesting test, set "temp" to 0 ; this will give you the SAME generation for a given prompt each time.
Then adjust a word, phrase, sentence etc in your prompt, and generate again to see the differences.
(you should use a "fresh" chat for each generation)
Keep in mind this will show model operation at its LEAST powerful/creative level and should NOT be used to determine if the model works for your use case(s).
Then test your prompt(s) "at temp" to see the model in action. (5-10 generations recommended)
You can also use "temp=0" to test different quants of the same model to see generation differences. (roughly minor "BIAS" changes which reflect math changes due to compress/mixtures differences between quants).
Another option is testing different models (at temp=0 AND of the same quant) to see how each handles your prompt(s).
Then test "at temp" with your prompt(s) to see the MODELS in action. (5-10 generations recommended)
Section 1b : PENALITY SAMPLERS - ALL APPS:
These samplers "trim" or "prune" output in real time.
The longer the generation, the stronger overall effect but that all depends on "repeat-last-n" setting.
For creative use cases, these samplers can alter prose generation in interesting ways.
Penalty parameters affect both per token and part of OR entire generation (depending on settings / output length).
CLASS 4: For these models it is important to activate / set all samplers as noted for maximum quality and control.
PRIMARY:
repeat-last-n
last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size) ("repetition_penalty_range" in oobabooga/text-generation-webui , "rp_range" in kobold)
THIS IS CRITICAL.
Too high you can get all kinds of issues (repeat words, sentences, paragraphs or "gibberish"), especially with class 3 or 4 models.
Likewise if you change this parameter it will drastically alter the output.
This setting also works in conjunction with all other "rep pens" below.
This parameter is the "RANGE" of tokens looked at for the samplers directly below.
BACKYARD APP:
In "Backyard" app, "repetition_penalty_range" is called "Repeat Penalty Tokens" (set on the "character card").
For class 3/4 models (if using with Backyard app), set this to 64 OR LESS.
SECONDARIES:
repeat-penalty
penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled) (commonly called "rep pen")
Generally this is set from 1.0 to 1.15 ; smallest increments are best IE: 1.01... 1,.02 or even 1.001... 1.002.
This affects creativity of the model over all, not just how words are penalized.
presence-penalty
repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
Generally leave this at zero IF repeat-last-n is 512-1024 or less. You may want to use this for higher repeat-last-n settings.
CLASS 3: 0.05 to .2 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
CLASS 4: 0.1 to 0.35 may assist generation BUT SET "repeat-last-n" to 64.
frequency-penalty
repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
Generally leave this at zero IF repeat-last-n is 512 or less. You may want to use this for higher repeat-last-n settings.
CLASS 3: 0.25 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
CLASS 4: 0.4 to 0.8 may assist generation BUT SET "repeat-last-n" to 64.
penalize-nl
penalize newline tokens (default: false)
Generally this is not used.
Section 1c : SECONDARY SAMPLERS / FILTERS - ALL APPS:
In some AI/LLM apps, these may only be available via JSON file modification and/or API.
For "text-gen-webui", "Koboldcpp" these are directly accessible ; other programs/app this varies.
Sillytavern:
If the apps support (Sillytavern is connected to via API) these parameters/samplers then you can access them via Silly Tavern's parameter/sampler panel. So if you are using Text-Gen-Webui, Koboldcpp, LMStudio, Llamacpp, Ollama (etc) you can set/change/access all or most of these.
i) OVERALL GENERATION CHANGES (affect per token as well as over all generation):
mirostat
Use Mirostat sampling. "Top K", "Nucleus", "Tail Free" (TFS) and "Locally Typical" (TYPICAL) samplers are ignored if used. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
Paper: https://arxiv.org/abs/2007.14966
CRITICAL:
If you activate Mirostat when using "LLAMAcpp SERVER" and/or some LLAMA_CPP based apps this will VOID/DISABLE all parameters (excluding "penalties", "logit_bias" ) AND all other SAMPLERS except "temp" parameter plus the following:
V1: n_vocab(model) (this is set internally by llamacpp), seed, mirostat_tau, mirostat_eta
V2: seed, mirostat_tau, mirostat_eta
For Koboldcpp:
"DRY" sampling is NOT blocked, and a version of "top_k" (3000) is used (but Mirostat does NOT block "Anti-Slop" , BUT does block "penalities" parameters (unlike Llamacpp - which does not) ).
For Text Generation UI:
No blocking occurs. Note that ONLY Mirostat 2 is available. (other parameters/samplers should work without issue)
Note this is subject to change by LLAMAcpp, Koboldcpp, Text Generation UI and other AI/LLM app makers at any time.
("seed" is usually a random value. (default) ; this parameter can be set in some AI/LLM apps to control Mirostat output more closely.)
"mirostat-lr"
Mirostat learning rate, parameter eta (default: 0.1) " mirostat_tau "
mirostat_tau: 5-8 is a good value.
"mirostat-ent"
Mirostat target entropy, parameter tau (default: 5.0) " mirostat_eta "
mirostat_eta: 0.1 is a good value.
Activates the Mirostat sampling technique. It aims to control perplexity during sampling. See the paper. ( https://arxiv.org/abs/2007.14966 )
This is the big one ; activating this will help with creative generation. It can also help with stability. Also note which samplers are disabled/ignored here, and that "mirostat_eta" is a learning rate.
This is both a sampler (and pruner) and enhancement all in one.
It also has two modes of generation "1" and "2" - test both with 5-10 generations of the same prompt. Make adjustments, and repeat.
CLASS 3: models it is suggested to use this to assist with generation (min settings).
CLASS 4: models it is highly recommended with Microstat 1 or 2 + mirostat_tau @ 6 to 8 and mirostat_eta at .1 to .5
Dynamic Temperature
"dynatemp-range "
dynamic temperature range (default: 0.0, 0.0 = disabled)
"dynatemp-exp"
dynamic temperature exponent (default: 1.0)
In: oobabooga/text-generation-webui (has on/off, and high / low) :
Activates Dynamic Temperature. This modifies temperature to range between "dynatemp_low" (minimum) and "dynatemp_high" (maximum), with an entropy-based scaling. The steepness of the curve is controlled by "dynatemp_exponent".
This allows the model to CHANGE temp during generation. This can greatly affect creativity, dialog, and other contrasts.
For Koboldcpp a converter is available and in oobabooga/text-generation-webui you just enter low/high/exp.
CLASS 4 only: Suggested this is on, with a high/low of .8 to 1.8 (note the range here of "1" between high and low); with exponent to 1 (however below 0 or above work too)
To set manually (IE: Api, lmstudio, Llamacpp, etc) using "range" and "exp" ; this is a bit more tricky: (example is to set range from .8 to 1.8)
1 - Set the "temp" to 1.3 (the regular temp parameter)
2 - Set the "range" to .500 (this gives you ".8" to "1.8" with "1.3" as the "base")
3 - Set exp to 1 (or as you want).
This is both an enhancement and in some ways fixes issues in a model when too little temp (or too much/too much of the same) affects generation.
ii) PER TOKEN CHANGES:
tfs
Tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
Tries to detect a tail of low-probability tokens in the distribution and removes those tokens. The closer to 0, the more discarded tokens. ( https://www.trentonbricken.com/Tail-Free-Sampling/ )
typical
Locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
If not set to 1, select only tokens that are at least this much more likely to appear than random tokens, given the prior text.
XTC
"xtc-probability"
xtc probability (default: 0.0, 0.0 = disabled)
Probability that the removal will actually happen. 0 disables the sampler. 1 makes it always happen.
"xtc-threshold"
xtc threshold (default: 0.1, 1.0 = disabled)
If 2 or more tokens have probability above this threshold, consider removing all but the last one.
XTC is a new sampler, that adds an interesting twist in generation. Suggest you experiment with this one, with other advanced samplers disabled to see its affects.
l, logit-bias TOKEN_ID(+/-)BIAS
modifies the likelihood of token appearing in the completion,
i.e. --logit-bias 15043+1
to increase likelihood of token ' Hello', or --logit-bias 15043-1
to decrease likelihood of token ' Hello'
This may or may not be available. This requires a bit more work.
Note: +- range is 0 to 100.
IN "oobabooga/text-generation-webui" there is "TOKEN BANNING":
This is a very powerful pruning method; which can drastically alter output generation.
I suggest you get some "bad outputs" ; get the "tokens" (actual number for the "word" / part word) then use this.
Careful testing is required, as this can have unclear side effects.
SECTION 2: ADVANCED SAMPLERS - "text-generation-webui" / "KOBOLDCPP" / "SillyTavern" (see note 1 below):
Additional Parameters / Samplers, including "DRY", "QUADRATIC" and "ANTI-SLOP".
Note #1 :
You can use these samplers via Sillytavern IF you use either of these APPS (Koboldcpp/Text Generation Webui/App supports them) to connect Silly Tavern to their API.
Other Notes:
Hopefully ALL these samplers / controls will be LLAMACPP and available to all users via AI/LLM apps soon.
"DRY" sampler has been added to Llamacpp as of the time of this writing (and available via SERVER/LLAMA-SERVER.EXE) and MAY appear in other "downstream" apps that use Llamacpp.
INFORMATION ON THESE SAMPLERS:
For more info on what they do / how they affect generation see:
https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
(also see the section above "Additional Links" for more info on the parameters/samplers)
ADVANCED SAMPLERS - PART 1:
Keep in mind these parameters/samplers become available (for GGUFs) in "oobabooga/text-generation-webui" when you use the llamacpp_HF loader.
Most of these are also available in KOBOLDCPP too (via settings -> samplers) after start up (no "llamacpp_HF loader" step required).
I am not going to touch on all of samplers / parameters, just the main ones at the moment.
However, you should also check / test operation of (these are in Text Generation WebUI, and may be available via API / In Sillytavern (when connected to Text Generation Webui)):
a] Affects per token generation:
- top_a
- epsilon_cutoff - see note #4
- eta_cutoff - see note #4
- no_repeat_ngram_size - see note #1.
b] Affects generation including phrase, sentence, paragraph and entire generation:
- no_repeat_ngram_size - see note #1.
- encoder_repetition_penalty "Hallucinations filter" - see note #2.
- guidance_scale (with "Negative prompt" ) => this is like a pre-prompt/system role prompt - see note #3.
- Disabling (BOS TOKEN) this can make the replies more creative.
- Custom stopping strings
Note 1:
"no_repeat_ngram_size" appears in both because it can impact per token OR per phrase depending on settings. This can also drastically affect sentence, paragraph and general flow of the output.
Note 2:
This parameter if set to LESS than 1 causing the model to "jump" around a lot more , whereas above 1 causes the model to focus more on the immediate surroundings.
If the model is crafting a "scene", a setting of less than 1 causes the model to jump around the room, outside, etc etc ; if less than 1 then it focuses the model more on the moment, the immediate surroundings, the POV character and details in the setting.
Note 3:
This is a powerful method to send instructions / directives to the model on how to process your prompt(s) each time. See [ https://arxiv.org/pdf/2306.17806 ]
Note 4:
These control selection of tokens, in some case providing more relevant and/or more options. See [ https://arxiv.org/pdf/2210.15191 ]
MAIN ADVANCED SAMPLERS PART 2 (affects per token AND overall generation):
What I will touch on here are special settings for CLASS 3 and CLASS 4 models (for the first TWO samplers).
For CLASS 3 you can use one, two or both.
For CLASS 4 using BOTH are strongly recommended, or at minimum "QUADRATIC SAMPLING".
These samplers (along with "penalty" settings) work in conjunction to "wrangle" the model / control it and get it to settle down, important for Class 3 but critical for Class 4 models.
For other classes of models, these advanced samplers can enhance operation across the board.
For Class 3 and Class 4 the goal is to use the LOWEST settings to keep the model inline rather than "over prune it".
You may therefore want to experiment to with dropping the settings (SLOWLY) for Class3/4 models from suggested below.
DRY:
Dry ("Don't Repeat Yourself") affects repetition (and repeat "penalty") at the word, phrase, sentence and even paragraph level. Read about "DRY" above, in the "Additional Links" links section above.
Class 3:
dry_multiplier: .8
dry_allowed_length: 2
dry_base: 1
Class 4:
dry_multiplier: .8 to 1.12+
dry_allowed_length: 2 (or less)
dry_base: 1.15 to 1.75+
Dial the "dry_muliplier" up or down to "reign in" or "release the madness" so to speak from the core model.
For Class 4 models this is used to control some of the model's bad habit(s).
For more information on "DRY":
https://github.com/oobabooga/text-generation-webui/pull/5677
https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/
QUADRATIC SAMPLING: AKA "Smoothing"
This sampler alters the "score" of ALL TOKENS at the time of generation and as a result affects the entire generation of the model. See "Additional Links" links section above for more information.
Class 3:
smoothing_factor: 1 to 3
smoothing_curve: 1
Class 4:
smoothing_factor: 3 to 5 (or higher)
smoothing_curve: 1.5 to 2.
Dial the "smoothing factor" up or down to "reign in" or "release the madness" so to speak.
In Class 3 models, this has the effect of modifying the prose closer to "normal" with as much or little (or a lot!) touch of "madness" from the root model.
In Class 4 models, this has the effect of modifying the prose closer to "normal" with as much or little (or a lot!) touch of "madness" from the root model AND wrangling in some of the core model's bad habits.
For more information on Quadratic Samplings:
https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
ANTI-SLOP - Kolbaldcpp only
Hopefully this powerful sampler will soon appear in all LLM/AI apps.
You can access this in the KoboldCPP app, under "context" -> "tokens" on the main page of the app after start up.
You can also access in SillyTavern if you use KoboldCPP as your "API" connected app too.
This sampler allows banning words and phrases DURING generation, forcing the model to "make another choice".
This is a game changer in custom real time control of the model.
For more information on ANTI SLOP project (owner runs EQBench):
https://github.com/sam-paech/antislop-sampler
FINAL NOTES:
Keep in mind that these settings/samplers work in conjunction with "penalties" ; which is especially important for operation of CLASS 4 models for chat / role play and/or "smoother operation".
For Class 3 models, "QUADRATIC" will have a slightly stronger effect than "DRY" relatively speaking.
If you use Mirostat sampler, keep in mind this will interact with these two advanced samplers too.
And...
Smaller quants may require STRONGER settings (all classes of models) due to compression damage, especially for Q2K, and IQ1/IQ2s.
This is also influenced by the parameter size of the model in relation to the quant size.
IE: a 8B model at Q2K will be far more unstable relative to a 20B model at Q2K, and as a result require stronger settings.
DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:
Most AI / LLM apps allow saving a "profile" parameters and samplers - "favorite" settings.
Text Generation Web Ui, Koboldcpp, Silly Tavern all have this feature and also "presets" (parameters/samplers set already) too.
Other AI/LLM apps also have this feature to varying degrees too.
DETAILS on PARAMETERS / SAMPLERS:
For additional details on these samplers settings (including advanced ones) you may also want to check out:
https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
(NOTE: Not all of these "options" are available for GGUFS, including when you use "llamacpp_HF" loader in "text-generation-webui" )
Additional Links (on parameters, samplers and advanced samplers):
A Visual Guide of some top parameters / Samplers in action which you can play with and see how they interact:
https://artefact2.github.io/llm-sampling/index.xhtml
General Parameters:
https://arxiv.org/html/2408.13586v1
The Local LLM Settings Guide/Rant (covers a lot of parameters/samplers - lots of detail)
https://rentry.org/llm-settings
LLAMACPP-SERVER EXE - usage / parameters / samplers:
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
DRY
- https://github.com/oobabooga/text-generation-webui/pull/5677
- https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
- https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/
Samplers:
https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
https://huggingface.co/LWDCLS/LLM-Discussions/discussions/2
https://huggingface.co/Virt-io/SillyTavern-Presets
Creative Writing :
https://www.reddit.com/r/LocalLLaMA/comments/1c36ieb/comparing_sampling_techniques_for_creative/
Benchmarking-and-Guiding-Adaptive-Sampling-Decoding
https://github.com/ZhouYuxuanYX/Benchmarking-and-Guiding-Adaptive-Sampling-Decoding-for-LLMs
NOTE:
I have also added notes too in the sections below for almost all parameters, samplers, and advanced samplers as well.
OTHER:
Depending on the AI/LLM "apps" you are using, additional reference material for parameters / samplers may also exist.
ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)
1 - Set temp to 0 (zero) and set your basic parameters, and use a prompt to get a "default" generation. A creative prompt will work better here.
2 - If you want to test basic parameter changes, test ONE at a time, then compare output (answer quality, word choice, sentence size/construction, general output qualities) to your "default" generation.
3 - Then start testing TWO parameters at a time, and comparing again. Keep in mind parameters (all) interact with each other.
4 - Samplers -> Reset your basic parameters, (temp still at zero) and test each one of these, one at a time. Then adjust settings, test again.
5 - Once you have an "idea" of how each affects your "test prompt" , now test at "temp" (not zero). It may take five to ten generation to get a rough idea.
Yes, testing is a lot of work - but once you get all the parameter(s) and/or sampler(s) dialed in - it is worth it.
IMPORTANT: Use a "fresh chat" PER TEST (you will contaminate the results otherwise). Never use the same chat for multiple tests -> exception: Regens.
Keep in mind that parameters, samplers and advanced samplers can affect the model on a per token generation basis AND/OR on a multi-token / phrase / sentence / paragraph and even complete generation basis.
Everything is cumulative here regardless if the parameter/sampler affects per token or multi-token basis because of how models "look back" to see what was generated in some cases.
And of course... each model will be different too.
All that being said, it is a good idea to have specific generation quality "goals" in mind.
Likewise, at my repo, I post example generations so you can get an idea (but not complete picture) of a model's generation abilities.
The best way to control generation is STILL with your prompt(s) - including pre-prompts/system role. The latest gen models (and archs) have very strong instruction following so many times better (or just included!) instructions in your prompts can make a world of difference.
Not sure if the model understands your prompt(s)?
Ask it ->
"Check my prompt below and tell me how to make it clearer?" (prompt after this line)
"For my prompt below, explain the steps you wound take to execute it" (prompt after this line)
This will help the model fine tune your prompt so IT understands it.
However sometimes parameters and/or samplers are required to better "wrangle" the model and getting to perform to its maximum potential and/or fine tune it to your use case(s).