---
license: mit
datasets:
  - glue
language:
  - en
metrics:
  - accuracy
  - f1
  - spearmanr
  - pearsonr
  - matthews_correlation
base_model: google-bert/bert-base-uncased
pipeline_tag: text-classification
tags:
  - adapter
  - low-rank
  - fine-tuning
  - LoRA
  - DiffLoRA
eval_results: Refer to GLUE experiments in the examples folder
view_doc: https://huggingface.co/nozomuteruyo14/Diff_LoRA
---

Model Card for DiffLoRA

DiffLoRA is an adapter architecture that extends conventional low-rank adaptation (LoRA) by fine-tuning a pre-trained large-scale model with differential low-rank matrices. Instead of updating all model parameters, DiffLoRA updates only a small set of low-rank matrices, which allows for efficient fine-tuning with a much smaller number of trainable parameters.

Model Details

Model Description

DiffLoRA is an original method developed by the author, inspired by the conceptual ideas of the Differential Transformer paper (https://arxiv.org/abs/2410.05258). It decomposes the weight update into two components, a positive and a negative contribution, enabling more fine-grained adjustment than traditional LoRA. The output of a single layer is computed as:

$$y = W x + \Delta y$$

where:

$x \in \mathbb{R}^{d_{in}}$ is the input vector (or each sample in a batch).

$W \in \mathbb{R}^{d_{out} \times d_{in}}$ is the fixed pre-trained weight matrix.

$\Delta y$ is the differential update, computed as:

$$\Delta y = \frac{\alpha}{r} \Big( x' A_{\text{pos}} B_{\text{pos}} - \tau \, x' A_{\text{neg}} B_{\text{neg}} \Big)$$

with:

$x'$ being the input after dropout (or another regularization).

$A_{\text{pos}} \in \mathbb{R}^{d_{in} \times r}$ and $B_{\text{pos}} \in \mathbb{R}^{r \times d_{out}}$ capturing the positive contribution.

$A_{\text{neg}} \in \mathbb{R}^{d_{in} \times r}$ and $B_{\text{neg}} \in \mathbb{R}^{r \times d_{out}}$ capturing the negative contribution.

$\tau \in \mathbb{R}$ is a learnable scalar that balances the two contributions.

$\alpha$ is a scaling factor.

$r$ is the chosen rank.

For computational efficiency, the two low-rank components are fused via concatenation:

$$\text{combined\_A} = \big[ A_{\text{pos}},\ A_{\text{neg}} \big] \in \mathbb{R}^{d_{in} \times 2r}$$

$$\text{combined\_B} = \begin{bmatrix} B_{\text{pos}} \\ -\tau \, B_{\text{neg}} \end{bmatrix} \in \mathbb{R}^{2r \times d_{out}}$$

The update is then calculated as:

$$\text{update} = x' \cdot \text{combined\_A} \cdot \text{combined\_B}$$

resulting in the final output:

$$y = W x + \frac{\alpha}{r} \, \text{update}$$
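
For illustration, below is a minimal PyTorch sketch of a linear layer implementing this fused update. The class name `DiffLoRALinear`, the initialization choices, and the default hyperparameters are illustrative assumptions, not the exact implementation shipped in this repository.

```python
import torch
import torch.nn as nn


class DiffLoRALinear(nn.Module):
    """Sketch: wraps a frozen nn.Linear (the pre-trained W) and adds the DiffLoRA update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0, dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pre-trained weight W stays frozen

        d_in, d_out = base.in_features, base.out_features
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

        # Positive and negative low-rank factors: A in R^{d_in x r}, B in R^{r x d_out}
        self.A_pos = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B_pos = nn.Parameter(torch.zeros(r, d_out))
        self.A_neg = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B_neg = nn.Parameter(torch.zeros(r, d_out))
        self.tau = nn.Parameter(torch.tensor(1.0))  # learnable balance scalar tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_drop = self.dropout(x)  # x' in the formulas above
        # Fuse the two branches: combined_A is d_in x 2r, combined_B is 2r x d_out
        combined_A = torch.cat([self.A_pos, self.A_neg], dim=1)
        combined_B = torch.cat([self.B_pos, -self.tau * self.B_neg], dim=0)
        update = x_drop @ combined_A @ combined_B
        return self.base(x) + self.scaling * update
```

Initializing both $B$ matrices to zero keeps the adapter's initial update at zero, following the usual LoRA convention, so training starts from the unchanged pre-trained model.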

  • Developed by: Nozomu Fujisawa in Kondo Lab
  • Model type: Differential Low-Rank Adapter (DiffLoRA)
  • Language(s) (NLP): en
  • License: MIT
  • Finetuned from model: google-bert/bert-base-uncased

Uses

Direct Use

DiffLoRA is intended to be integrated as an adapter module into pre-trained transformer models. It allows efficient fine-tuning by updating only a small number of low-rank parameters, making it ideal for scenarios where computational resources are limited.

Out-of-Scope Use

DiffLoRA is not designed for training models from scratch, nor is it recommended for tasks where full parameter updates are necessary. It is optimized for transformer-based NLP tasks and may not generalize well to non-NLP domains. In addition, only a limited set of base models is currently supported.

Bias, Risks, and Limitations

While DiffLoRA offers a parameter-efficient fine-tuning approach, it inherits limitations from its base models (e.g., BERT, MiniLM). It may not capture all domain-specific nuances when only a limited number of parameters are updated. Users should carefully evaluate performance and consider potential biases in their applications.

Recommendations

Users should:

  • Experiment with different rank $r$ and scaling factor $\alpha$ values.
  • Compare DiffLoRA with other adapter techniques.
  • Be cautious about over-relying on the adapter when full model adaptation might be necessary.

How to Get Started with the Model

To integrate DiffLoRA into your fine-tuning workflow, check the example script in the examples/run_glue_experiment.py file.
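
As a starting point before running that script, the following sketch shows one way to inject adapters into a BERT classifier and check how few parameters remain trainable. It reuses the `DiffLoRALinear` sketch from the Model Description section; the targeted module names (`query`, `value`) follow BERT's attention layout and are an assumption, not necessarily what the example script targets.

```python
import torch
from transformers import AutoModelForSequenceClassification

model_name = "google-bert/bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Freeze everything except the classification head ...
for p in model.parameters():
    p.requires_grad = False
for p in model.classifier.parameters():
    p.requires_grad = True

# ... then swap the attention query/value projections for DiffLoRA adapters.
for module in list(model.modules()):
    for child_name, child in list(module.named_children()):
        if isinstance(child, torch.nn.Linear) and child_name in ("query", "value"):
            setattr(module, child_name, DiffLoRALinear(child, r=8, alpha=16.0))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```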

Training Details

Training Data

This implementation has been demonstrated on GLUE tasks using the Hugging Face Datasets library.

Training Procedure

DiffLoRA is applied by freezing the base model weights and updating only the low-rank adapter parameters. The procedure involves:

  • Preprocessing text inputs (concatenating multiple text columns if necessary), as sketched after this list.
  • Injecting DiffLoRA adapters into target linear layers.
  • Fine-tuning on a downstream task while the base model remains frozen.
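
A minimal preprocessing sketch for the first step, using MRPC (a two-column GLUE task) as an example; the column names differ per task, and the tokenization settings here are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("glue", "mrpc")  # MRPC provides "sentence1" and "sentence2" columns
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

def preprocess(batch):
    # Encode the task's text columns together as a single input pair.
    return tokenizer(
        batch["sentence1"],
        batch["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

encoded = raw.map(preprocess, batched=True)
```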

Training Hyperparameters

  • Training regime: Fine-tuning with frozen base weights; only adapter parameters are updated (see the Trainer sketch after this list).
  • Learning rate: 2e-5 (example)
  • Batch size: 32 per device
  • Epochs: 3 (example)
  • Optimizer: AdamW with weight decay
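
Under those example settings, a Hugging Face Trainer configuration might look like the following sketch. It assumes the adapter-injected `model` and the `encoded` dataset from the earlier sketches, and the output directory name is arbitrary; AdamW with weight decay is the Trainer's default optimizer.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="difflora-glue",     # illustrative output path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,              # applied by the default AdamW optimizer
)

trainer = Trainer(
    model=model,                    # the adapter-injected model from the earlier sketch
    args=training_args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```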

Evaluation

Testing Data, Factors & Metrics

Testing Data

GLUE validation sets are used for evaluation.

Factors

Evaluations are performed across multiple GLUE tasks to ensure comprehensive performance analysis.

Metrics

Evaluation metrics include accuracy, F1 score, Pearson correlation, and Spearman correlation, depending on the task.
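
As a sketch, the per-task metrics can be computed with the `evaluate` library; the task name and the argmax step below assume a classification task (for the STS-B regression task the raw predictions would be used instead).

```python
import numpy as np
import evaluate

metric = evaluate.load("glue", "mrpc")  # loads the metric(s) matching the chosen GLUE task

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # classification; use logits.squeeze() for STS-B
    return metric.compute(predictions=predictions, references=labels)
```

If used with the Trainer, this function can be passed via its compute_metrics argument.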

Results

For detailed evaluation results, please refer to the GLUE experiment script in the examples directory.

Summary

In the GLUE experiments provided in the examples folder, DiffLoRA converges faster than, and performs competitively with, other parameter-efficient fine-tuning methods.

Citation

Paper: in preparation.

Model Card Contact

For any questions regarding this model card, please contact: [[email protected]]