---
license: mit
datasets:
- glue
language:
- en
metrics:
- accuracy
- f1
- spearmanr
- pearsonr
- matthews_correlation
base_model: google-bert/bert-base-uncased
pipeline_tag: text-classification
tags:
- adapter
- low-rank
- fine-tuning
- LoRA
- DiffLoRA
eval_results: "Refer to GLUE experiments in the examples folder"
view_doc: "https://huggingface.co/nozomuteruyo14/Diff_LoRA"
---

# Model Card for DiffLoRA


DiffLoRA is an adapter architecture that extends conventional low-rank adaptation (LoRA) by fine-tuning a pre-trained model with differential low-rank matrices. Instead of updating all model parameters, DiffLoRA trains only a small set of low-rank matrices, which keeps the number of trainable parameters small and makes fine-tuning efficient.

## Model Details

### Model Description

DiffLoRA is an original method developed by the author, inspired by conceptual ideas from the Differential Transformer paper (https://arxiv.org/abs/2410.05258). It decomposes the weight update into two components, a positive and a negative contribution, enabling a more fine-grained adjustment than traditional LoRA. The output of a single layer is computed as:

$$
y = W x + \Delta y
$$


where:

$$
x \in \mathbb{R}^{d_{in}} 
$$
is the input vector (or each sample in a batch).

$$
W \in \mathbb{R}^{d_{out} \times d_{in}}
$$
is the fixed pre-trained weight matrix.

$$
\Delta y 
$$
is the differential update computed as:


$$
\Delta y = \frac{\alpha}{r} \Big( x' A_{\text{pos}} B_{\text{pos}} - \tau \, x' A_{\text{neg}} B_{\text{neg}} \Big)
$$

with:  

$$
x'
$$
being the input after dropout (or another regularization).

$$
A_{\text{pos}} \in \mathbb{R}^{d_{in} \times r} 
\quad \text{and} \quad
B_{\text{pos}} \in \mathbb{R}^{r \times d_{out}}
$$
capturing the positive contribution.

$$
A_{\text{neg}} \in \mathbb{R}^{d_{in} \times r}
\quad \text{and} \quad
B_{\text{neg}} \in \mathbb{R}^{r \times d_{out}}
$$
capturing the negative contribution.


$$
\tau \in \mathbb{R}
$$
is a learnable scalar that balances the two contributions.

$$
\alpha 
$$
is a scaling factor.

$$
r
$$
is the chosen rank.

For computational efficiency, the two low-rank components are fused via concatenation:

$$
\text{combined\_A} = \big[ A_{\text{pos}}, A_{\text{neg}} \big] \in \mathbb{R}^{d_{in} \times 2r}
$$

$$
\text{combined\_B} = \begin{bmatrix} B_{\text{pos}} \\ -\tau \, B_{\text{neg}} \end{bmatrix} \in \mathbb{R}^{2r \times d_{out}}
$$

The update is then calculated as:

$$
\text{update} = x' \cdot \text{combined\_A} \cdot \text{combined\_B}
$$

resulting in the final output:

$$
y = W x + \frac{\alpha}{r} \, \text{update}.
$$
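The fused form above translates directly into a small PyTorch module. The sketch below is a minimal illustration, not the repository's actual implementation: the class name `DiffLoRALinear`, its constructor arguments, and the zero-initialization of the `B` factors are assumptions made for clarity.

```python
import torch
import torch.nn as nn


class DiffLoRALinear(nn.Module):
    """Minimal DiffLoRA-style wrapper around a frozen nn.Linear (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0, dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # W (and bias) stay frozen
            p.requires_grad = False

        d_in, d_out = base.in_features, base.out_features
        self.r, self.alpha = r, alpha
        self.dropout = nn.Dropout(dropout)

        # Positive and negative low-rank factors; zero-initializing B makes the adapter
        # start as a no-op, mirroring common LoRA practice (an assumption in this sketch).
        self.A_pos = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B_pos = nn.Parameter(torch.zeros(r, d_out))
        self.A_neg = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B_neg = nn.Parameter(torch.zeros(r, d_out))
        self.tau = nn.Parameter(torch.tensor(1.0))  # learnable balance between the two branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_drop = self.dropout(x)
        # Fuse the two branches: combined_A is (d_in x 2r), combined_B is (2r x d_out).
        combined_A = torch.cat([self.A_pos, self.A_neg], dim=1)
        combined_B = torch.cat([self.B_pos, -self.tau * self.B_neg], dim=0)
        update = x_drop @ combined_A @ combined_B
        return self.base(x) + (self.alpha / self.r) * update
```

Only the low-rank factors and the scalar tau carry gradients, so each wrapped layer adds roughly 2·r·(d_in + d_out) + 1 trainable parameters.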

- **Developed by:** Nozomu Fujisawa in Kondo Lab
- **Model type:** Differential Low-Rank Adapter (DiffLoRA)
- **Language(s) (NLP):** en
- **License:** MIT
- **Finetuned from model:** bert-base-uncased

### Model Sources

- **Repository:** [https://huggingface.co/nozomuteruyo14/Diff_LoRA](https://huggingface.co/nozomuteruyo14/Diff_LoRA)
- **Paper:** DiffLoRA is inspired by ideas from the Differential Transformer (https://arxiv.org/abs/2410.05258), but it is an original method developed by the author.

## Uses

### Direct Use

DiffLoRA is intended to be integrated as an adapter module into pre-trained transformer models. It allows efficient fine-tuning by updating only a small number of low-rank parameters, making it ideal for scenarios where computational resources are limited.

### Out-of-Scope Use

DiffLoRA is not designed for training models from scratch, nor is it recommended for tasks where full parameter updates are necessary. It is optimized for transformer-based NLP tasks and may not generalize well to non-NLP domains. In addition, only a limited set of base models is currently supported.

## Bias, Risks, and Limitations

While DiffLoRA offers a parameter-efficient fine-tuning approach, it inherits limitations from its base models (e.g., BERT, MiniLM). It may not capture all domain-specific nuances when only a limited number of parameters are updated. Users should carefully evaluate performance and consider potential biases in their applications.

### Recommendations

Users should:
- Experiment with different rank r and scaling factor α values.
- Compare DiffLoRA with other adapter techniques.
- Be cautious about over-relying on the adapter when full model adaptation might be necessary.

## How to Get Started with the Model

To integrate DiffLoRA into your fine-tuning workflow, check the example script in the `examples/run_glue_experiment.py` file.
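If you want a self-contained starting point before looking at the script, the hedged sketch below shows one way an adapter like the `DiffLoRALinear` class sketched above could be injected into the attention projections of `google-bert/bert-base-uncased`. The choice of query/value projections and the rank/alpha values are illustrative assumptions, not the settings used in `examples/run_glue_experiment.py`.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze everything; only adapter parameters and the classification head remain trainable.
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

# Wrap the query/value projections of every encoder layer with the illustrative adapter.
for layer in model.bert.encoder.layer:
    attn = layer.attention.self
    attn.query = DiffLoRALinear(attn.query, r=8, alpha=16.0)
    attn.value = DiffLoRALinear(attn.value, r=8, alpha=16.0)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```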

## Training Details

### Training Data

This implementation has been demonstrated on GLUE tasks using the Hugging Face Datasets library.

### Training Procedure

DiffLoRA is applied by freezing the base model weights and updating only the low-rank adapter parameters. The procedure involves:
- Preprocessing text inputs (concatenating multiple text columns if necessary), as sketched below.
- Injecting DiffLoRA adapters into target linear layers.
- Fine-tuning on a downstream task while the base model remains frozen.
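For concreteness, a minimal version of the preprocessing step for a GLUE sentence-pair task such as MRPC could look like the following. The column names come from the Hugging Face `glue` dataset; the `max_length` value is an arbitrary choice for illustration.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("glue", "mrpc")  # sentence-pair task; SST-2 and similar tasks have a single text column
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

def preprocess(batch):
    # Pair the two text columns so the tokenizer inserts the separator token between them.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True, max_length=128)

encoded = raw.map(preprocess, batched=True)
```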

#### Training Hyperparameters

- **Training regime:** Fine-tuning with frozen base weights; only adapter parameters are updated.
- **Learning rate:** 2e-5 (example)
- **Batch size:** 32 per device
- **Epochs:** 3 (example)
- **Optimizer:** AdamW with weight decay
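For reference, these example values map onto a `transformers` `Trainer` configuration roughly as follows. The output directory, the weight-decay value, and the reuse of `model`, `tokenizer`, and `encoded` from the earlier sketches are assumptions; `examples/run_glue_experiment.py` remains the authoritative setup.

```python
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="difflora-glue",        # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,                 # Trainer uses AdamW with this weight decay by default
)

trainer = Trainer(
    model=model,                                        # adapter-injected model from the earlier sketch
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),   # pad dynamically per batch
)
trainer.train()
```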

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

GLUE validation sets are used for evaluation.

#### Factors

Evaluations are performed across multiple GLUE tasks to ensure comprehensive performance analysis.

#### Metrics

Evaluation metrics include accuracy, F1 score, Pearson correlation, and Spearman correlation, depending on the task.
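In code, the task-appropriate metric combination can be obtained from the `evaluate` library's GLUE metric and passed to the `Trainer` through `compute_metrics`; the subtask name below is just an example.

```python
import numpy as np
import evaluate

glue_metric = evaluate.load("glue", "mrpc")  # accuracy/F1 for MRPC; other subtasks select their own metrics

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # for STS-B (regression) use logits.squeeze() instead
    return glue_metric.compute(predictions=predictions, references=labels)
```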

### Results

For detailed evaluation results, please refer to the GLUE experiment script in the `examples` directory.

#### Summary

DiffLoRA achieves faster convergence and competitive performance on GLUE tasks compared to other parameter-efficient fine-tuning methods.

## Citation

Paper: in preparation.

## Model Card Contact

For any questions regarding this model card, please contact: [nozomu_[email protected]]