Upload 11 files

- .gitattributes +3 -0
- GUIDELINES.md +140 -0
- LICENSE +21 -0
- chat.py +45 -0
- generate.py +128 -0
- get_log_likelihood.py +96 -0
- imgs/LLaDA_vs_LLaMA.svg +2772 -0
- imgs/LLaDA_vs_LLaMA_chat.svg +2665 -0
- imgs/diff_remask.gif +3 -0
- imgs/sample.png +3 -0
- imgs/transformer1.png +0 -0
- imgs/transformer2.png +3 -0
.gitattributes
CHANGED
```diff
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+imgs/diff_remask.gif filter=lfs diff=lfs merge=lfs -text
+imgs/sample.png filter=lfs diff=lfs merge=lfs -text
+imgs/transformer2.png filter=lfs diff=lfs merge=lfs -text
```
GUIDELINES.md
ADDED
# Guidelines

Here, we provide guidelines for the model architecture, pre-training, SFT, and inference of LLaDA.

## Model Architecture

LLaDA employs a Transformer encoder as the network architecture for its mask predictor.
In terms of trainable parameters, the Transformer encoder is identical to the Transformer
decoder. Starting from an autoregressive model, we derive the backbone of LLaDA by simply
removing the causal mask from the self-attention mechanism, as illustrated below.

<div style="display: flex; justify-content: center; flex-wrap: wrap; gap: 50px;">
    <img src="imgs/transformer1.png" style="width: 90%;" />
    <img src="imgs/transformer2.png" style="width: 90%;" />
</div>

In addition, LLaDA designates a reserved token as the mask token (i.e., 126336).
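To make this concrete, here is a minimal sketch of scaled dot-product self-attention (our own illustration, not code from the LLaDA repository): the autoregressive backbone and LLaDA's mask predictor differ only in whether the causal mask is applied.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v, causal: bool):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    if causal:
        # Autoregressive decoder: position i attends only to positions <= i.
        seq_len = q.shape[-2]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
    # LLaDA's mask predictor simply omits the causal mask, so every
    # position attends to the full sequence (bidirectional attention).
    return F.softmax(scores, dim=-1) @ v
```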
## Pre-training
The pre-training of LLaDA is straightforward and simple. Starting from an existing
autoregressive model training codebase, only a few lines need to be modified.
We provide the core code (i.e., the loss computation) here.

```python
import torch
import torch.nn.functional as F

def forward_process(input_ids, eps=1e-3):
    b, l = input_ids.shape
    t = torch.rand(b, device=input_ids.device)
    p_mask = (1 - eps) * t + eps
    p_mask = p_mask[:, None].repeat(1, l)

    masked_indices = torch.rand((b, l), device=input_ids.device) < p_mask
    # 126336 is used for [MASK] token
    noisy_batch = torch.where(masked_indices, 126336, input_ids)
    return noisy_batch, masked_indices, p_mask

# The data is an integer tensor of shape (b, 4096),
# where b represents the batch size and 4096 is the sequence length.
input_ids = batch["input_ids"]

# We set 1% of the pre-training data to a random length that is uniformly sampled from the range [1, 4096].
# The following implementation is not elegant and involves some data waste.
# However, the data waste is minimal, so we ignore it.
if torch.rand(1) < 0.01:
    random_length = torch.randint(1, input_ids.shape[1] + 1, (1,))
    input_ids = input_ids[:, :random_length]

noisy_batch, masked_indices, p_mask = forward_process(input_ids)
logits = model(input_ids=noisy_batch).logits

# Dividing each token's loss by p_mask importance-weights the masked positions,
# giving an unbiased Monte Carlo estimate of the training objective.
token_loss = F.cross_entropy(logits[masked_indices], input_ids[masked_indices], reduction='none') / p_mask[masked_indices]
loss = token_loss.sum() / (input_ids.shape[0] * input_ids.shape[1])
```
## SFT
First, please refer to Appendix B.1 for the preprocessing of the SFT data. After preprocessing,
the data format is as follows. For simplicity, we treat each word as a token and set the batch size to 2
in the following visualization; a minimal packing sketch follows the example.

```
input_ids:
<BOS><start_id>user<end_id>\nWhat is the capital of France?<eot_id><start_id>assistant<end_id>\nParis.<EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS>
<BOS><start_id>user<end_id>\nWhat is the capital of Canada?<eot_id><start_id>assistant<end_id>\nThe capital of Canada is Ottawa, located in Ontario.<EOS>

prompt_lengths:
[17, 17]
```
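For illustration only, this format can be produced along the following lines. This is a sketch under our own assumptions, not the Appendix B.1 code; `EOS_ID` and the helper are hypothetical, and tokenizer BOS handling is glossed over.

```python
import torch

EOS_ID = 126081  # <EOS>; placeholder constant for this sketch

def pack_sft_batch(pairs, tokenizer):
    """pairs: list of (prompt_text, answer_text), with the prompt already wrapped in the chat template."""
    prompt_ids = [tokenizer(p)['input_ids'] for p, _ in pairs]
    full_ids = [pi + tokenizer(a)['input_ids'] for pi, (_, a) in zip(prompt_ids, pairs)]
    max_len = max(len(ids) for ids in full_ids)
    # Pad every sequence to the same length with <EOS>, as in the visualization above.
    input_ids = torch.tensor([ids + [EOS_ID] * (max_len - len(ids)) for ids in full_ids])
    prompt_lengths = torch.tensor([len(pi) for pi in prompt_ids])
    return {"input_ids": input_ids, "prompt_lengths": prompt_lengths}
```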
After preprocessing the SFT data, we can obtain the SFT code by making simple modifications to the pre-training code.
The key difference from pre-training is that SFT does not add noise to the prompt.
```python
input_ids, prompt_lengths = batch["input_ids"], batch["prompt_lengths"]

noisy_batch, _, p_mask = forward_process(input_ids)

# Do not add noise to the prompt
token_positions = torch.arange(noisy_batch.shape[1], device=noisy_batch.device).expand(noisy_batch.size(0), noisy_batch.size(1))
prompt_mask = (token_positions < prompt_lengths.unsqueeze(1))
noisy_batch[prompt_mask] = input_ids[prompt_mask]

# Calculate the answer length (including the padded <EOS> tokens)
prompt_mask = prompt_mask.to(torch.int64)
answer_lengths = torch.sum((1 - prompt_mask), dim=-1, keepdim=True)
answer_lengths = answer_lengths.repeat(1, noisy_batch.shape[1])

masked_indices = (noisy_batch == 126336)

logits = model(input_ids=noisy_batch).logits

token_loss = F.cross_entropy(logits[masked_indices], input_ids[masked_indices], reduction='none') / p_mask[masked_indices]
# Normalize by each sequence's answer length so every example contributes equally.
ce_loss = torch.sum(token_loss / answer_lengths[masked_indices]) / input_ids.shape[0]
```
## Sampling
Overall, we categorize LLaDA's sampling process into three types: fixed-length, semi-autoregressive-origin, and semi-autoregressive-padding.
**It is worth noting that the semi-autoregressive-origin method was not mentioned in our paper, nor did we provide the corresponding code.**
However, we include it here because we believe that sharing both our failures and insights from the exploration process is valuable.
These three sampling methods are illustrated in the figure below.

<div style="display: flex; justify-content: center; flex-wrap: wrap; gap: 50px;">
    <img src="imgs/sample.png" style="width: 100%;" />
</div>

For each step in the above three sampling processes, as detailed in Section 2.4 of our paper, the mask predictor
first predicts all masked tokens simultaneously. Then, a certain proportion of these predictions are remasked.
To determine which predicted tokens should be remasked, we can adopt one of two strategies: *random remasking* or
*low-confidence remasking*. Notably, both remasking strategies can be applied to all three sampling processes
mentioned above; a single-step sketch of low-confidence remasking follows.
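The sketch below condenses the low-confidence strategy into one sampling step. It is a simplified, greedy (temperature 0) reading of the `generate.py` in this upload; see that file for the full block-wise loop with temperature and classifier-free guidance.

```python
import torch
import torch.nn.functional as F

def low_confidence_remask_step(model, x, mask_id, num_transfer):
    """One step: predict all masked tokens, keep only the `num_transfer`
    most confident predictions, and leave the rest masked."""
    mask_index = (x == mask_id)
    logits = model(x).logits
    x0 = torch.argmax(logits, dim=-1)                          # predictions for every position
    p = F.softmax(logits.to(torch.float64), dim=-1)
    x0_p = torch.gather(p, -1, x0.unsqueeze(-1)).squeeze(-1)   # confidence of each prediction
    confidence = torch.where(mask_index, x0_p, -float('inf'))  # only masked slots compete
    for j in range(x.shape[0]):
        _, keep = torch.topk(confidence[j], k=num_transfer)
        x[j, keep] = x0[j, keep]                               # accept top-k; the rest stay masked
    return x
```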
For the LLaDA-Base model, we apply low-confidence remasking to the three sampling processes mentioned above.
We find that fixed-length and semi-autoregressive-padding achieve similar results, whereas semi-autoregressive-origin
performs slightly worse.
For the LLaDA-Instruct model, the situation is slightly more complex.

First, if the semi-autoregressive-origin method is used, the Instruct model performs poorly.
This is because, during SFT, each sequence is a complete sentence (whereas in pre-training,
many sequences are truncated sentences). As a result, during sampling, given a generation length, regardless of whether it is
long or short, the Instruct model tends to generate a complete sentence. Unlike the Base model, it does not encounter cases
where a sentence is only partially generated and needs to be continued.

When performing fixed-length sampling with a long answer length (e.g., greater than 512),
we find that low-confidence remasking results in an unusually high proportion of `<EOS>` tokens in
the generated sentences, which severely impacts the model's performance. In contrast, this
issue does not arise when random remasking is used.

Furthermore, since low-confidence remasking achieved better results with the Base model, we also hoped that it could be applied to
the Instruct model. We found that combining low-confidence remasking with semi-autoregressive-padding effectively mitigates
the issue of generating an excessively high proportion of `<EOS>` tokens. Moreover, this combination achieves
slightly better results than random remasking with fixed-length sampling.
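In terms of the `generate` function in this upload's `generate.py`, these configurations map onto its arguments roughly as follows (the step counts are illustrative; semi-autoregressive-origin has no corresponding code, as noted above):

```python
from generate import generate

# model and prompt prepared as in generate.py's main().

# Fixed-length sampling: a single block spanning the whole answer.
out = generate(model, prompt, steps=128, gen_length=128, block_length=128,
               remasking='low_confidence')  # or remasking='random'

# Semi-autoregressive-padding: block_length < gen_length, so blocks are
# generated left to right; combined with low-confidence remasking, this is
# the configuration that works well for the Instruct model.
out = generate(model, prompt, steps=128, gen_length=128, block_length=32,
               remasking='low_confidence')
```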
You can find more details about the sampling method in our paper.
LICENSE
ADDED
MIT License

Copyright (c) 2025 NieShenRuc

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
chat.py
ADDED
```python
import torch

from generate import generate
from transformers import AutoTokenizer, AutoModel


def chat():
    device = 'cuda'
    model = AutoModel.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True, torch_dtype=torch.bfloat16).to(device).eval()
    tokenizer = AutoTokenizer.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True)

    gen_length = 128
    steps = 128
    print('*' * 66)
    print(f'** Answer Length: {gen_length} | Sampling Steps: {steps} **')
    print('*' * 66)

    conversation_num = 0
    while True:
        user_input = input("Enter your question: ")

        m = [{"role": "user", "content": user_input}]
        user_input = tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=False)
        input_ids = tokenizer(user_input)['input_ids']
        input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)

        if conversation_num == 0:
            prompt = input_ids
        else:
            # Append the new turn to the running conversation, dropping its leading <BOS>.
            prompt = torch.cat([prompt, input_ids[:, 1:]], dim=1)

        out = generate(model, prompt, steps=steps, gen_length=gen_length, block_length=32, temperature=0., cfg_scale=0., remasking='low_confidence')

        answer = tokenizer.batch_decode(out[:, prompt.shape[1]:], skip_special_tokens=True)[0]
        print(f"Bot's reply: {answer}")

        # Remove the <EOS> padding (token id 126081) before the next turn.
        prompt = out[out != 126081].unsqueeze(0)
        conversation_num += 1
        print('-----------------------------------------------------------------------')


if __name__ == "__main__":
    chat()
```
ADDED
@@ -0,0 +1,128 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
```python
import torch
import numpy as np
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel


def add_gumbel_noise(logits, temperature):
    '''
    The Gumbel max is a method for sampling categorical distributions.
    According to arXiv:2409.02908, for MDM, low-precision Gumbel Max improves perplexity score but reduces generation quality.
    Thus, we use float64.
    '''
    logits = logits.to(torch.float64)
    noise = torch.rand_like(logits, dtype=torch.float64)
    gumbel_noise = (- torch.log(noise)) ** temperature
    return logits.exp() / gumbel_noise


def get_num_transfer_tokens(mask_index, steps):
    '''
    In the reverse process, the interval [0, 1] is uniformly discretized into `steps` intervals.
    Furthermore, because LLaDA employs a linear noise schedule (as defined in Eq. (8)),
    the expected number of tokens transitioned at each step should be consistent.

    This function precomputes the number of tokens that need to be transitioned at each step.
    '''
    mask_num = mask_index.sum(dim=1, keepdim=True)

    base = mask_num // steps
    remainder = mask_num % steps

    num_transfer_tokens = torch.zeros(mask_num.size(0), steps, device=mask_index.device, dtype=torch.int64) + base

    # Distribute the remainder over the first few steps.
    for i in range(mask_num.size(0)):
        num_transfer_tokens[i, :remainder[i]] += 1

    return num_transfer_tokens


@torch.no_grad()
def generate(model, prompt, steps=128, gen_length=128, block_length=128, temperature=0.,
             cfg_scale=0., remasking='low_confidence', mask_id=126336):
    '''
    Args:
        model: Mask predictor.
        prompt: A tensor of shape (1, l).
        steps: Sampling steps, less than or equal to gen_length.
        gen_length: Generated answer length.
        block_length: Block length, less than or equal to gen_length. If less than gen_length, semi-autoregressive remasking is used.
        temperature: Categorical distribution sampling temperature.
        cfg_scale: Unsupervised classifier-free guidance scale.
        remasking: Remasking strategy. 'low_confidence' or 'random'.
        mask_id: The token id of [MASK] is 126336.
    '''
    x = torch.full((1, prompt.shape[1] + gen_length), mask_id, dtype=torch.long).to(model.device)
    x[:, :prompt.shape[1]] = prompt.clone()

    prompt_index = (x != mask_id)

    assert gen_length % block_length == 0
    num_blocks = gen_length // block_length

    assert steps % num_blocks == 0
    steps = steps // num_blocks

    for num_block in range(num_blocks):
        block_mask_index = (x[:, prompt.shape[1] + num_block * block_length: prompt.shape[1] + (num_block + 1) * block_length] == mask_id)
        num_transfer_tokens = get_num_transfer_tokens(block_mask_index, steps)
        for i in range(steps):
            mask_index = (x == mask_id)
            if cfg_scale > 0.:
                # Classifier-free guidance: the unconditional branch masks out the prompt.
                un_x = x.clone()
                un_x[prompt_index] = mask_id
                x_ = torch.cat([x, un_x], dim=0)
                logits = model(x_).logits
                logits, un_logits = torch.chunk(logits, 2, dim=0)
                logits = un_logits + (cfg_scale + 1) * (logits - un_logits)
            else:
                logits = model(x).logits

            logits_with_noise = add_gumbel_noise(logits, temperature=temperature)
            x0 = torch.argmax(logits_with_noise, dim=-1)  # b, l

            if remasking == 'low_confidence':
                p = F.softmax(logits.to(torch.float64), dim=-1)
                x0_p = torch.squeeze(
                    torch.gather(p, dim=-1, index=torch.unsqueeze(x0, -1)), -1)  # b, l
            elif remasking == 'random':
                x0_p = torch.rand((x0.shape[0], x0.shape[1]), device=x0.device)
            else:
                raise NotImplementedError(remasking)

            # Tokens beyond the current block are never unmasked in this step.
            x0_p[:, prompt.shape[1] + (num_block + 1) * block_length:] = -np.inf

            x0 = torch.where(mask_index, x0, x)
            confidence = torch.where(mask_index, x0_p, -np.inf)

            transfer_index = torch.zeros_like(x0, dtype=torch.bool, device=x0.device)
            for j in range(confidence.shape[0]):
                _, select_index = torch.topk(confidence[j], k=num_transfer_tokens[j, i])
                transfer_index[j, select_index] = True
            x[transfer_index] = x0[transfer_index]

    return x


def main():
    device = 'cuda'

    model = AutoModel.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True, torch_dtype=torch.bfloat16).to(device).eval()
    tokenizer = AutoTokenizer.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True)

    prompt = "Lily can run 12 kilometers per hour for 4 hours. After that, she runs 6 kilometers per hour. How many kilometers can she run in 8 hours?"

    # Add special tokens for the Instruct model. The Base model does not require the following two lines.
    m = [{"role": "user", "content": prompt}, ]
    prompt = tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=False)

    input_ids = tokenizer(prompt)['input_ids']
    input_ids = torch.tensor(input_ids).to(device).unsqueeze(0)

    out = generate(model, input_ids, steps=128, gen_length=128, block_length=32, temperature=0., cfg_scale=0., remasking='low_confidence')
    print(tokenizer.batch_decode(out[:, input_ids.shape[1]:], skip_special_tokens=True)[0])


if __name__ == '__main__':
    main()
```
get_log_likelihood.py
ADDED
```python
import torch
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModel


def forward_process(batch, prompt_index, mask_id):
    b, l = batch.shape

    target_len = (l - prompt_index.sum()).item()
    k = torch.randint(1, target_len + 1, (), device=batch.device)

    # Stratified sampling of mask counts: spread the b samples evenly over [1, target_len].
    x = torch.round(torch.linspace(float(k), k + (b - 1) * (target_len / b), steps=b, device=batch.device)).long()
    x = ((x - 1) % target_len) + 1
    assert x.min() >= 1 and x.max() <= target_len

    indices = torch.arange(target_len, device=batch.device).repeat(b, 1)
    is_mask = indices < x.unsqueeze(1)
    for i in range(b):
        is_mask[i] = is_mask[i][torch.randperm(target_len)]

    # Never mask the prompt positions.
    is_mask = torch.cat((torch.zeros(b, prompt_index.sum(), dtype=torch.bool, device=batch.device), is_mask), dim=1)
    noisy_batch = torch.where(is_mask, mask_id, batch)

    # Return the masked batch and the mask ratio
    return noisy_batch, (x / target_len).unsqueeze(1).repeat(1, l)


def get_logits(model, batch, prompt_index, cfg_scale, mask_id):
    if cfg_scale > 0.:
        assert len(prompt_index) == batch.shape[1]
        prompt_index = prompt_index.unsqueeze(0).repeat(batch.shape[0], 1)
        un_batch = batch.clone()
        un_batch[prompt_index] = mask_id
        batch = torch.cat([batch, un_batch])

    logits = model(batch).logits

    if cfg_scale > 0.:
        logits, un_logits = torch.chunk(logits, 2, dim=0)
        logits = un_logits + (cfg_scale + 1) * (logits - un_logits)
    return logits


@torch.no_grad()
def get_log_likelihood(model, prompt, answer, mc_num=128, batch_size=16, cfg_scale=0., mask_id=126336):
    '''
    Args:
        model: Mask predictor.
        prompt: A tensor of shape (l1).
        answer: A tensor of shape (l2).
        mc_num: Monte Carlo estimation times.
                As detailed in Appendix B.5, since MMLU, CMMLU, and C-EVAL only require the likelihood of a single token, a
                single Monte Carlo estimate is sufficient for these benchmarks. For all other benchmarks, we find that 128
                Monte Carlo samples are adequate to produce stable results.
        batch_size: Mini batch size.
        cfg_scale: Unsupervised classifier-free guidance scale.
        mask_id: The token id of [MASK] is 126336.
    '''
    seq = torch.concatenate([prompt, answer])[None, :]
    seq = seq.repeat((batch_size, 1)).to(model.device)
    prompt_index = torch.arange(seq.shape[1], device=model.device) < len(prompt)

    loss_ = []
    for _ in range(mc_num // batch_size):
        perturbed_seq, p_mask = forward_process(seq, prompt_index, mask_id)
        mask_index = perturbed_seq == mask_id

        logits = get_logits(model, perturbed_seq, prompt_index, cfg_scale, mask_id)

        loss = F.cross_entropy(logits[mask_index], seq[mask_index], reduction='none') / p_mask[mask_index]
        loss = loss.sum() / batch_size

        loss_.append(loss.item())

    return - sum(loss_) / len(loss_)


def main():
    device = 'cuda'

    model = AutoModel.from_pretrained('GSAI-ML/LLaDA-8B-Base', trust_remote_code=True, torch_dtype=torch.bfloat16).to(device).eval()
    tokenizer = AutoTokenizer.from_pretrained('GSAI-ML/LLaDA-8B-Base', trust_remote_code=True)

    # This prompt and answer are from the HellaSwag dataset.
    prompt = 'Roof shingle removal: A man is sitting on a roof. He'
    answer = ' is using wrap to wrap a pair of skis.'

    prompt = torch.tensor(tokenizer(prompt)['input_ids']).to(device)
    answer = torch.tensor(tokenizer(answer)['input_ids']).to(device)
    print(get_log_likelihood(model, prompt, answer, mc_num=128))


if __name__ == '__main__':
    main()
```
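As a usage sketch (our own illustration; the second candidate ending is hypothetical), a HellaSwag-style multiple-choice item can be scored by estimating the log-likelihood of each candidate ending and taking the argmax:

```python
# Assumes model, tokenizer, and device are set up as in main() above.
candidates = [' is using wrap to wrap a pair of skis.',
              ' is removing shingles from the roof.']
prompt_ids = torch.tensor(tokenizer('Roof shingle removal: A man is sitting on a roof. He')['input_ids']).to(device)
scores = []
for ending in candidates:
    answer_ids = torch.tensor(tokenizer(ending)['input_ids']).to(device)
    scores.append(get_log_likelihood(model, prompt_ids, answer_ids, mc_num=128))
print(candidates[scores.index(max(scores))])  # highest-likelihood ending
```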
imgs/LLaDA_vs_LLaMA.svg
ADDED

imgs/LLaDA_vs_LLaMA_chat.svg
ADDED

imgs/diff_remask.gif
ADDED (stored with Git LFS)

imgs/sample.png
ADDED (stored with Git LFS)

imgs/transformer1.png
ADDED

imgs/transformer2.png
ADDED (stored with Git LFS)