---
license: mit
datasets:
- glue
language:
- en
metrics:
- accuracy
- f1
- spearmanr
- pearsonr
- matthews_correlation
base_model: google-bert/bert-base-uncased
pipeline_tag: text-classification
tags:
- adapter
- low-rank
- fine-tuning
- LoRA
- DiffLoRA
eval_results: "Refer to GLUE experiments in the examples folder"
view_doc: "https://huggingface.co/nozomuteruyo14/Diff_LoRA"
---
# Model Card for DiffLoRA
DiffLoRA is an adapter architecture that extends conventional low-rank adaptation (LoRA) by fine-tuning a pre-trained large-scale model through differential low-rank matrices. Instead of updating all model parameters, DiffLoRA trains only a small set of low-rank matrices, enabling efficient fine-tuning with far fewer trainable parameters.
## Model Details
### Model Description
DiffLoRA is an original method developed by the author and is inspired by the conceptual ideas from the Differential Transformer paper (https://arxiv.org/abs/2410.05258). It decomposes the weight update into two components—positive and negative contributions—enabling a more fine-grained adjustment than traditional LoRA. The output of a single layer is computed as:
$$
y = W x + \Delta y
$$
where:
- $x \in \mathbb{R}^{d_{in}}$ is the input vector (or each sample in a batch),
- $W \in \mathbb{R}^{d_{out} \times d_{in}}$ is the fixed pre-trained weight matrix,
- $\Delta y$ is the differential update, computed as

$$
\Delta y = \frac{\alpha}{r} \Big( x' A_{\text{pos}} B_{\text{pos}} - \tau \, x' A_{\text{neg}} B_{\text{neg}} \Big)
$$

Here,
- $x'$ is the input after dropout (or another regularization),
- $A_{\text{pos}} \in \mathbb{R}^{d_{in} \times r}$ and $B_{\text{pos}} \in \mathbb{R}^{r \times d_{out}}$ capture the positive contribution,
- $A_{\text{neg}} \in \mathbb{R}^{d_{in} \times r}$ and $B_{\text{neg}} \in \mathbb{R}^{r \times d_{out}}$ capture the negative contribution,
- $\tau \in \mathbb{R}$ is a learnable scalar that balances the two contributions,
- $\alpha$ is a scaling factor, and
- $r$ is the chosen rank.
For computational efficiency, the two low-rank components are fused via concatenation:
$$
\text{combined\_A} = \big[ A_{\text{pos}}, A_{\text{neg}} \big] \in \mathbb{R}^{d_{in} \times 2r}
$$
$$
\text{combined\_B} = \begin{bmatrix} B_{\text{pos}} \\ -\tau \, B_{\text{neg}} \end{bmatrix} \in \mathbb{R}^{2r \times d_{out}}
$$
The update is then calculated as:
$$
\text{update} = x' \cdot \text{combined\_A} \cdot \text{combined\_B}
$$
resulting in the final output:
$$
y = W x + \frac{\alpha}{r} \, \text{update}.
$$
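To make the fused formulation concrete, the sketch below implements a DiffLoRA-style linear layer in plain PyTorch. It is an illustration of the equations above, not the repository's actual module: the class name, the zero initialization of the $B$ matrices (borrowed from standard LoRA practice so the adapter starts as a no-op), and the initial value of $\tau$ are assumptions.

```python
import math

import torch
import torch.nn as nn


class DiffLoRALinear(nn.Module):
    """Illustrative DiffLoRA adapter around a frozen nn.Linear (not the official implementation)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0, dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # W (and bias) stay frozen

        d_in, d_out = base.in_features, base.out_features
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

        # Positive and negative low-rank factors: A in R^{d_in x r}, B in R^{r x d_out}.
        self.A_pos = nn.Parameter(torch.empty(d_in, r))
        self.B_pos = nn.Parameter(torch.zeros(r, d_out))
        self.A_neg = nn.Parameter(torch.empty(d_in, r))
        self.B_neg = nn.Parameter(torch.zeros(r, d_out))
        nn.init.kaiming_uniform_(self.A_pos, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.A_neg, a=math.sqrt(5))

        # Learnable scalar tau balancing the positive and negative contributions.
        self.tau = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_prime = self.dropout(x)
        # Fuse the two branches: combined_A in R^{d_in x 2r}, combined_B in R^{2r x d_out}.
        combined_A = torch.cat([self.A_pos, self.A_neg], dim=1)
        combined_B = torch.cat([self.B_pos, -self.tau * self.B_neg], dim=0)
        update = x_prime @ combined_A @ combined_B
        return self.base(x) + self.scaling * update
```

Because the base `nn.Linear` is frozen inside the wrapper, only `A_pos`, `B_pos`, `A_neg`, `B_neg`, and `tau` receive gradients.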
- **Developed by:** Nozomu Fujisawa (Kondo Lab)
- **Model type:** Differential Low-Rank Adapter (DiffLoRA)
- **Language(s) (NLP):** en
- **License:** MIT
- **Finetuned from model:** google-bert/bert-base-uncased
### Model Sources
- **Repository:** [https://huggingface.co/nozomuteruyo14/Diff_LoRA](https://huggingface.co/nozomuteruyo14/Diff_LoRA)
- **Paper:** DiffLoRA is inspired by ideas from the Differential Transformer (https://arxiv.org/abs/2410.05258), but it is an original method developed by the author.
## Uses
### Direct Use
DiffLoRA is intended to be integrated as an adapter module into pre-trained transformer models. It allows efficient fine-tuning by updating only a small number of low-rank parameters, making it ideal for scenarios where computational resources are limited.
### Out-of-Scope Use
DiffLoRA is not designed for training models from scratch, nor is it recommended for tasks where full parameter updates are necessary. It is optimized for transformer-based NLP tasks and may not generalize well to non-NLP domains. In addition, only a limited set of base models is currently supported.
## Bias, Risks, and Limitations
While DiffLoRA offers a parameter-efficient fine-tuning approach, it inherits limitations from its base models (e.g., BERT, MiniLM). It may not capture all domain-specific nuances when only a limited number of parameters are updated. Users should carefully evaluate performance and consider potential biases in their applications.
### Recommendations
Users should:
- Experiment with different values of the rank $r$ and the scaling factor $\alpha$.
- Compare DiffLoRA with other adapter techniques.
- Be cautious about over-relying on the adapter when full model adaptation might be necessary.
## How to Get Started with the Model
To integrate DiffLoRA into your fine-tuning workflow, see the example script at `examples/run_glue_experiment.py`.
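As a rough illustration of how such adapters can be injected (the helper utilities in the repository may differ), the snippet below wraps BERT's query and value projections with the `DiffLoRALinear` sketch from above; `num_labels=2`, the rank, and the choice of target modules are example settings.

```python
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased", num_labels=2
)

# Wrap the query/value projections of every attention block with DiffLoRA adapters.
TARGET_SUFFIXES = ("attention.self.query", "attention.self.value")

for name, module in list(model.named_modules()):
    if isinstance(module, nn.Linear) and name.endswith(TARGET_SUFFIXES):
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name)
        setattr(parent, child_name, DiffLoRALinear(module, r=8, alpha=16.0))

# Keep only the adapter parameters and the classification head trainable.
for name, param in model.named_parameters():
    param.requires_grad = any(
        key in name for key in ("A_pos", "B_pos", "A_neg", "B_neg", "tau", "classifier")
    )
```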
## Training Details
### Training Data
This implementation has been demonstrated on GLUE tasks using the Hugging Face Datasets library.
### Training Procedure
DiffLoRA is applied by freezing the base model weights and updating only the low-rank adapter parameters. The procedure involves:
- Preprocessing text inputs (concatenating multiple text columns if necessary); see the sketch below.
- Injecting DiffLoRA adapters into target linear layers.
- Fine-tuning on a downstream task while the base model remains frozen.
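For a sentence-pair task such as MRPC, the preprocessing step reduces to tokenizing the two text columns together. A minimal sketch using the Hugging Face Datasets library (column names are those of the GLUE MRPC split):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

def preprocess(batch):
    # MRPC provides two text columns; single-sentence tasks pass only one.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True, max_length=128)

encoded = raw.map(preprocess, batched=True)
```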
#### Training Hyperparameters
- **Training regime:** Fine-tuning with frozen base weights; only adapter parameters are updated.
- **Learning rate:** 2e-5 (example)
- **Batch size:** 32 per device
- **Epochs:** 3 (example)
- **Optimizer:** AdamW with weight decay
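Expressed with the Hugging Face `Trainer`, and continuing the sketches above, these example hyperparameters look roughly as follows (the weight-decay value is illustrative; the actual experiment script may configure things differently):

```python
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="difflora-mrpc",
    learning_rate=2e-5,                 # example value from the list above
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,                  # illustrative; Trainer's default optimizer is AdamW
)

trainer = Trainer(
    model=model,                        # model with DiffLoRA adapters injected (see above)
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```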
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
GLUE validation sets are used for evaluation.
#### Factors
Evaluations are performed across multiple GLUE tasks to ensure comprehensive performance analysis.
#### Metrics
Evaluation metrics include accuracy, F1 score, Matthews correlation, and Pearson and Spearman correlations, depending on the task.
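A task-dependent metric function can be obtained from the `evaluate` library's GLUE loader; for MRPC, for example, it returns accuracy and F1, and it can be passed to the `Trainer` sketched above via `compute_metrics`:

```python
import numpy as np
import evaluate

# GLUE metric bundle for the chosen task (accuracy + F1 for MRPC).
metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # For regression tasks such as STS-B, use logits.squeeze() instead of argmax.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```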
### Results
For detailed evaluation results, please refer to the GLUE experiment script in the `examples` directory.
#### Summary
DiffLoRA achieves faster convergence and competitive performance on GLUE tasks compared to other parameter-efficient fine-tuning methods.
## Citation
A paper describing DiffLoRA is currently being written; citation information will be added here once it is available.
## Model Card Contact
For any questions regarding this model card, please contact: [nozomu_[email protected]]