|
# Guidelines |
|
Here, we provide guidelines for the model architecture, pre-training, SFT, and inference of LLaDA. |
|
|
|
## Model Architecture |
|
|
|
LLaDA employs a Transformer Encoder as the network architecture for its mask predictor. In terms of trainable parameters, the Transformer Encoder is identical to the Transformer Decoder. Starting from an autoregressive model, we derive the backbone of LLaDA by simply removing the causal mask from the self-attention mechanism, as illustrated below.
|
|
|
<div style="display: flex; justify-content: center; flex-wrap: wrap; gap: 50px;"> |
|
<img src="imgs/transformer1.png" style="width: 90%;" /> |
|
<img src="imgs/transformer2.png" style="width: 90%;" /> |
|
</div> |
|
|
|
In addition, LLaDA designates a reserved token as the mask token (i.e., 126336). |
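To make the change concrete, here is a minimal, hypothetical sketch of a bidirectional self-attention layer in PyTorch. The class and argument names are illustrative and this is not LLaDA's actual implementation; the point is only that dropping the causal mask (`is_causal=False`) leaves the trainable parameters unchanged while letting every token attend to the full sequence.

```python
import torch
import torch.nn.functional as F
from torch import nn


class BidirectionalSelfAttention(nn.Module):
    """Hypothetical sketch: same trainable parameters as a decoder's
    self-attention, but without the causal mask."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, l, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, l, d) -> (b, n_heads, l, head_dim)
        q, k, v = (t.view(b, l, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        # is_causal=False: every token attends to the full sequence,
        # which is what allows all masked positions to be predicted at once.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=False)
        return self.out(y.transpose(1, 2).reshape(b, l, d))
```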
|
|
|
|
|
## Pre-training |
|
The pre-training of LLaDA is straightforward: starting from the training code of an existing autoregressive model, only a few lines need to be modified. We provide the core code (i.e., the loss computation) below.
|
|
|
```python
import torch
import torch.nn.functional as F

def forward_process(input_ids, eps=1e-3):
    b, l = input_ids.shape
    # Sample a masking ratio per sequence and clamp it to [eps, 1].
    t = torch.rand(b, device=input_ids.device)
    p_mask = (1 - eps) * t + eps
    p_mask = p_mask[:, None].repeat(1, l)

    # Mask each token independently with probability p_mask.
    masked_indices = torch.rand((b, l), device=input_ids.device) < p_mask
    # 126336 is used for the [MASK] token.
    noisy_batch = torch.where(masked_indices, 126336, input_ids)
    return noisy_batch, masked_indices, p_mask

# The data is an integer tensor of shape (b, 4096),
# where b represents the batch size and 4096 is the sequence length.
input_ids = batch["input_ids"]

# We set 1% of the pre-training data to a random length that is uniformly sampled from the range [1, 4096].
# The following implementation is not elegant and involves some data waste.
# However, the data waste is minimal, so we ignore it.
if torch.rand(1) < 0.01:
    random_length = torch.randint(1, input_ids.shape[1] + 1, (1,))
    input_ids = input_ids[:, :random_length]

noisy_batch, masked_indices, p_mask = forward_process(input_ids)
logits = model(input_ids=noisy_batch).logits

# Cross-entropy on the masked positions only, reweighted by 1 / p_mask,
# then averaged over all tokens in the batch.
token_loss = F.cross_entropy(logits[masked_indices], input_ids[masked_indices], reduction='none') / p_mask[masked_indices]
loss = token_loss.sum() / (input_ids.shape[0] * input_ids.shape[1])
```
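To build intuition, `forward_process` can be inspected on a toy batch. The token ids below are arbitrary placeholders, purely for illustration:

```python
# Toy, hypothetical example: inspect what forward_process produces.
toy_input = torch.randint(0, 1000, (2, 8))       # (batch=2, length=8), arbitrary ids
noisy, masked, p = forward_process(toy_input)

print(p[:, 0])   # one masking probability per sequence, constant across its tokens
print(masked)    # True where a token was replaced by the mask token
print(noisy)     # the corrupted batch: masked positions now contain 126336
```

Each sequence gets its own masking ratio, so a batch mixes lightly and heavily corrupted sequences; the `1 / p_mask` term in the loss reweights each masked token according to how heavily its sequence was corrupted.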
|
|
|
## SFT |
|
First, please refer to Appendix B.1 for the preprocessing of the SFT data. After preprocessing, the data format is as follows. For simplicity, we treat each word as a token and set the batch size to 2 in the following visualization.
|
```
input_ids:
<BOS><start_id>user<end_id>\nWhat is the capital of France?<eot_id><start_id>assistant<end_id>\nParis.<EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS><EOS>
<BOS><start_id>user<end_id>\nWhat is the capital of Canada?<eot_id><start_id>assistant<end_id>\nThe capital of Canada is Ottawa, located in Ontario.<EOS>

prompt_lengths:
[17, 17]
```
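For reference, a batch like the one above could be assembled along the following lines. This is a hypothetical sketch, not the preprocessing code from Appendix B.1; the `tokenizer` interface and `eos_id` argument are placeholders.

```python
import torch

def build_sft_batch(pairs, tokenizer, eos_id):
    """Hypothetical sketch: tokenize chat-formatted (prompt, response) pairs,
    pad every response with <EOS> so all sequences share the same length,
    and record each prompt length."""
    input_ids, prompt_lengths = [], []
    for prompt, response in pairs:
        p = tokenizer(prompt)["input_ids"]               # <BOS> ... <end_id>\n
        r = tokenizer(response)["input_ids"] + [eos_id]  # response + <EOS>
        input_ids.append(p + r)
        prompt_lengths.append(len(p))

    max_len = max(len(seq) for seq in input_ids)
    # Pad shorter sequences with <EOS>; the padded <EOS> tokens are treated as
    # part of the response and contribute to the SFT loss below.
    input_ids = [seq + [eos_id] * (max_len - len(seq)) for seq in input_ids]
    return torch.tensor(input_ids), torch.tensor(prompt_lengths)
```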
|
|
|
After preprocessing the SFT data, we can obtain the SFT code by making simple modifications to the pre-training code. |
|
The key difference from pre-training is that SFT does not add noise to the prompt. |
|
```python
input_ids, prompt_lengths = batch["input_ids"], batch["prompt_lengths"]

noisy_batch, _, p_mask = forward_process(input_ids)

# Do not add noise to the prompt.
token_positions = torch.arange(noisy_batch.shape[1], device=noisy_batch.device).expand(noisy_batch.size(0), noisy_batch.size(1))
prompt_mask = (token_positions < prompt_lengths.unsqueeze(1))
noisy_batch[prompt_mask] = input_ids[prompt_mask]

# Calculate the answer length (including the padded <EOS> tokens).
prompt_mask = prompt_mask.to(torch.int64)
answer_lengths = torch.sum((1 - prompt_mask), dim=-1, keepdim=True)
answer_lengths = answer_lengths.repeat(1, noisy_batch.shape[1])

masked_indices = (noisy_batch == 126336)

logits = model(input_ids=noisy_batch).logits

# Same reweighted cross-entropy as in pre-training, but normalized by the
# answer length of each sequence instead of the full sequence length.
token_loss = F.cross_entropy(logits[masked_indices], input_ids[masked_indices], reduction='none') / p_mask[masked_indices]
ce_loss = torch.sum(token_loss / answer_lengths[masked_indices]) / input_ids.shape[0]
```
|
|
|
## Sampling |
|
Overall, we categorize LLaDA's sampling process into three types: fixed-length, semi-autoregressive-origin, and semi-autoregressive-padding. |
|
**It is worth noting that the semi-autoregressive-origin method was not mentioned in our paper, nor did we provide the corresponding code**. |
|
However, we include it here because we believe that sharing both our failures and insights from the exploration process is valuable. |
|
These three sampling methods are illustrated in the figure below. |
|
|
|
|
|
<div style="display: flex; justify-content: center; flex-wrap: wrap; gap: 50px;"> |
|
<img src="imgs/sample.png" style="width: 100%;" /> |
|
</div> |
|
|
|
For each step in the above three sampling processes, as detailed in Section 2.4 of our paper, the mask predictor first predicts all masked tokens simultaneously. Then, a certain proportion of these predictions is remasked. To determine which predicted tokens should be re-masked, we can adopt one of two strategies: *random remasking* or *low-confidence remasking*. Notably, both remasking strategies can be applied to all three sampling processes mentioned above.
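As an illustration of one such step, here is a simplified, hypothetical sketch; it is not the code used in our experiments. In practice, `num_to_reveal` follows a schedule over the sampling steps, and the `candidate_mask` argument (used by the semi-autoregressive sketch later) restricts which positions may be revealed.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336

@torch.no_grad()
def denoise_step(model, x, num_to_reveal, strategy="low_confidence", candidate_mask=None):
    """Hypothetical sketch of one sampling step: predict every masked position,
    reveal `num_to_reveal` of them according to the chosen strategy, and keep
    the rest masked for later steps."""
    masked = (x == MASK_ID)
    candidates = masked if candidate_mask is None else masked & candidate_mask

    logits = model(input_ids=x).logits
    conf, pred = F.softmax(logits, dim=-1).max(dim=-1)   # per-token confidence and argmax

    if strategy == "low_confidence":
        score = conf                      # reveal the most confident predictions
    else:                                 # random remasking
        score = torch.rand_like(conf)     # reveal a random subset
    score = torch.where(candidates, score, torch.full_like(score, -1.0))

    # Positions with the highest scores are revealed; everything else stays [MASK].
    keep = torch.topk(score, k=num_to_reveal, dim=-1).indices
    keep_mask = torch.zeros_like(masked)
    keep_mask[torch.arange(x.shape[0], device=x.device).unsqueeze(1), keep] = True
    return torch.where(keep_mask, pred, x)
```

Repeatedly applying this step until no `[MASK]` tokens remain corresponds to fixed-length sampling.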
|
|
|
For the LLaDA-Base model, we apply low-confidence remasking to the three sampling processes mentioned above. We find that fixed-length and semi-autoregressive-padding achieve similar results, whereas semi-autoregressive-origin performs slightly worse.
|
|
|
For the LLaDA-Instruct model, the situation is slightly more complex. |
|
|
|
First, if the semi-autoregressive-origin method is used, the Instruct model performs poorly. This is because, during SFT, each sequence is a complete sentence (whereas in pre-training, many sequences are truncated sentences). As a result, during sampling, regardless of whether the given generation length is long or short, the Instruct model tends to produce a complete sentence. Unlike the Base model, it does not encounter cases where a sentence is only partially generated and needs to be continued.
|
|
|
When performing fixed-length sampling with a large answer length (e.g., greater than 512), we find that low-confidence remasking results in an unusually high proportion of `<EOS>` tokens in the generated sentences, which severely impacts the model's performance. In contrast, this issue does not arise when random remasking is used.
|
|
|
Furthermore, since low-confidence remasking achieved better results for the Base model, we also hoped that it could be applied to the Instruct model. We found that combining low-confidence remasking with semi-autoregressive-padding effectively mitigates the issue of generating an excessively high proportion of `<EOS>` tokens. Moreover, this combination achieves slightly better results than random remasking with fixed-length sampling.
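For concreteness, continuing the hypothetical `denoise_step` sketch above (and reusing `MASK_ID`), semi-autoregressive-padding with low-confidence remasking could be wired up roughly as follows. The block size, step count, and generation length are illustrative defaults, not our exact settings.

```python
@torch.no_grad()
def generate_semi_ar_padding(model, prompt_ids, gen_length=128, block_length=32, steps_per_block=8):
    """Hypothetical sketch: append `gen_length` [MASK] tokens after the prompt,
    then denoise them block by block from left to right."""
    b, prompt_len = prompt_ids.shape
    x = torch.full((b, prompt_len + gen_length), MASK_ID, dtype=torch.long, device=prompt_ids.device)
    x[:, :prompt_len] = prompt_ids

    assert block_length % steps_per_block == 0
    per_step = block_length // steps_per_block

    for start in range(prompt_len, prompt_len + gen_length, block_length):
        # Only positions inside the current block may be revealed; the remaining
        # blocks stay fully masked (the "padding") until their turn.
        block = torch.zeros_like(x, dtype=torch.bool)
        block[:, start:start + block_length] = True
        for _ in range(steps_per_block):
            x = denoise_step(model, x, num_to_reveal=per_step, candidate_mask=block)

    # Generated region; trailing <EOS> tokens can be stripped when decoding.
    return x[:, prompt_len:]
```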
|
|
|
You can find more details about the sampling method in our paper. |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|