RNAMamba-14M
This model is a small Mamba based model trained from scratch on 1.96 million sequences (1.56 billion bases) extracted from RNAcentral's active sequences FASTA file for release 24 (March 2024).
This is intended to be a sequence embedding model for downstream processing of ncRNA sequences. It is trained with a masked language modelling objective, and a context size of 8,192 nucleotides. This particular model has the MLM head stripped off and so should be almost ready to use for embedding. The dataset has sequences ranging in length from 10 to 8192, so the model should be pretty good at handling sequences in that range. This is a deliberately small model with only 14.1 million parameters (8 hidden layers, hidden dim 512, intermediate size 1024) such that it will run fast without a GPU. We may train something bigger if it looks like these embeddings are not good enough.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
Framework versions
- Transformers 4.39.3
- Pytorch 2.2.2+cu118
- Datasets 2.18.0
- Tokenizers 0.15.2
- Downloads last month
- 5