---
library_name: transformers
license: cc-by-nc-4.0
tags:
- creative-writing
- creative-writer
- multiplicative-lora
---

An experimental model, fine-tuned using the ["multiplicative-LoRA" method](#the-multiplicative-lora-method) on [c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01).

Other experimental models, based on `creative-writer-v0.1-alfa-35b`, that attempt to encourage more diverse/creative text generation:

- [creative-writer-v0.1-bravo-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-bravo-35b) - Scaled the pre-softmax logits by `1.1` during training (and then reset after training).
- [creative-writer-v0.1-charlie-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-charlie-35b) - Scaled the pre-softmax logits by `0.9` during training (and didn't reset after training).
- [creative-writer-v0.1-delta-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-delta-35b) - Trained using [Focal Loss](https://arxiv.org/abs/1708.02002) with `gamma=2` (instead of stock [Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)) - see the illustrative sketch below.
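
For reference, here is a rough, self-contained sketch of what these two tweaks look like (scaling the pre-softmax logits, and Focal Loss in place of Cross Entropy). This is an illustrative re-implementation, not the actual training code:

```python
import torch
import torch.nn.functional as F

def scaled_ce_loss(logits, labels, scale=1.1):
    # bravo/charlie: multiply the pre-softmax logits by a constant during training
    # (scale > 1 sharpens the training distribution, scale < 1 flattens it).
    return F.cross_entropy(logits * scale, labels)

def focal_loss(logits, labels, gamma=2.0):
    # delta: Focal Loss down-weights tokens the model already predicts confidently:
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t)
    log_probs = F.log_softmax(logits, dim=-1)
    log_p_t = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    p_t = log_p_t.exp()
    return (-(1.0 - p_t) ** gamma * log_p_t).mean()

logits = torch.randn(8, 32000)           # (tokens, vocab_size)
labels = torch.randint(0, 32000, (8,))   # next-token targets
print(scaled_ce_loss(logits, labels).item(), focal_loss(logits, labels).item())
```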

---

# The Multiplicative-LoRA method

Uses:

`h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x`

instead of the normal "additive-LoRA" method of:

`h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x`

I only apply this to the `down_proj` matrices, and skip the last layer's `down_proj` matrix in the same way as [creative-writing-control-vectors-v3.0](https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0).
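
For intuition, here is a minimal, self-contained sketch of a multiplicative-LoRA layer (the class and variable names are made up for illustration; the actual training uses the PEFT hack below, and the `lora_alpha/r` scaling is omitted since the run below uses `lora_alpha = lora_rank`):

```python
import torch
import torch.nn as nn

class MultiplicativeLoRALinear(nn.Module):
    """Computes h = W x + B A (W x), i.e. the low-rank update acts on the
    frozen layer's *output* rather than on its input."""

    def __init__(self, base: nn.Linear, r: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        # lora_A maps from out_features (not in_features) because it is
        # applied to the base layer's output.
        self.lora_A = nn.Linear(base.out_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # start as the identity (no change)

    def forward(self, x):
        h = self.base(x)                         # tensor @ x
        return h + self.lora_B(self.lora_A(h))   # + lora_B @ lora_A @ tensor @ x
```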

This currently requires hacking [PEFT's layer.py](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py) like so:

```python
# lora_A now takes out_features (not in_features) as its input dimension,
# since it will be applied to the base layer's output rather than its input.
#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)
```

and:

```python
# Feed the base layer's output (result) into the LoRA instead of its input (x);
# this is what turns the usual additive update into the multiplicative one above.
#x = x.to(lora_A.weight.dtype)
temp = result.to(lora_A.weight.dtype)

if not self.use_dora[active_adapter]:
    #result = result + lora_B(lora_A(dropout(x))) * scaling
    result = result + lora_B(lora_A(dropout(temp))) * scaling
```

Then, to merge, you need to hack [qlora-pipe's merge_lora.py](https://github.com/tdrussell/qlora-pipe/blob/main/merge_lora.py) to use:

```python
old_type = tensor.dtype
tensor = tensor.to(torch.float32)
tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
tensor = tensor.to(old_type)
```
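
As a quick sanity check, the merged weight `(I + scale * lora_B @ lora_A) @ tensor` reproduces the runtime computation `tensor @ x + lora_B @ lora_A @ tensor @ x` exactly. A small standalone example (shapes chosen arbitrarily):

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 16, 12, 4
scale = 1.0

W = torch.randn(d_out, d_in)          # frozen weight ('tensor' above)
A = torch.randn(r, d_out) * 0.01      # lora_A (maps from the output dimension)
B = torch.randn(d_out, r) * 0.01      # lora_B
x = torch.randn(d_in)

h_runtime = W @ x + scale * B @ (A @ (W @ x))   # forward pass with the adapter
W_merged = W + scale * B @ A @ W                # merged weight: (I + scale*B@A) @ W
h_merged = W_merged @ x

print(torch.allclose(h_runtime, h_merged, atol=1e-5))  # True
```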

---

# Training

- Training took just under 4 days on dual A6000 GPUs connected via NVLink, using [qlora-pipe](https://github.com/tdrussell/qlora-pipe).
- The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens), using the same `dataset_combination_mode = 'concatenate'` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter) (roughly sketched below).
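
The idea behind concatenating is that the books are joined into one long token stream and sliced into fixed-length training sequences, rather than padding/truncating each book separately. A rough, illustrative sketch of that idea only (not qlora-pipe's actual data pipeline; the paths mirror the config below):

```python
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/mnt/data/c4ai-command-r-v01')
sequence_len = 8192

# Tokenize every book and append it to one long stream.
token_stream = []
for path in sorted(Path('/mnt/data/datasets/ebooks').glob('*.txt')):
    token_stream.extend(tokenizer(path.read_text(), add_special_tokens=False)['input_ids'])

# Slice the stream into fixed-length training sequences.
sequences = [token_stream[i:i + sequence_len]
             for i in range(0, len(token_stream) - sequence_len + 1, sequence_len)]
print(f'{len(token_stream):,} tokens -> {len(sequences):,} sequences of {sequence_len}')
```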

## `config_creative_writer.toml`

```toml
# Paths
model = '/mnt/data/c4ai-command-r-v01'
output_dir = '/mnt/data/creative-writer-v0.1-alfa-35b'

# Lora configuration
lora_rank = 64
lora_alpha = 64
lora_dropout = 0.0
target_modules = ['down_proj']
layers_to_transform = '0:38'  # skip last layer

# Optimization configuration
epochs = 1
lr_scheduler = 'constant'
warmup_steps = 100
batch_size_tokens = 8192

# Performance settings
pipeline_stages = 2
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
model_weight_dtype = 'bfloat16'
lora_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Resume a prior run
resume_from_checkpoint = false

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 1

[optimizer]
type = 'adamw_kahan'
lr = 5e-6
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.01

[[datasets]]
name = 'books'
dataset_type = 'textfile'
dataset_path = '/mnt/data/datasets/ebooks/*.txt'
sequence_len = 8192
eval_size = 0.01
```

## `ds_creative_writer.json`

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
```