nomic 1.5 base Financial Matryoshka

This is a sentence-transformers model finetuned from nomic-ai/nomic-embed-text-v1.5 on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: nomic-ai/nomic-embed-text-v1.5
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CatSchroedinger/nomic-v1.5-financial-matryoshka")
# Run inference
sentences = [
    'What is the maximum duration for patent term restoration for pharmaceutical products in the U.S.?',
    "Patent term restoration for a single patent for a pharmaceutical product is provided to U.S. patent holders to compensate for a portion of the time invested in clinical trials and the U.S. Food and Drug Administration (FDA). There is a five-year cap on any restoration, and no patent's expiration date may be extended beyond 14 years from FDA approval.",
    'Using AI technologies, our Tax Advisor offering leverages information generated from our ProConnect Tax Online and Lacerte offerings to enable year-round tax planning services and communicate tax savings strategies to clients.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Information Retrieval

| Metric              | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
|---------------------|---------|---------|---------|---------|--------|
| cosine_accuracy@1   | 0.7286  | 0.7143  | 0.7014  | 0.7071  | 0.67   |
| cosine_accuracy@3   | 0.8514  | 0.8414  | 0.8386  | 0.8343  | 0.8086 |
| cosine_accuracy@5   | 0.8886  | 0.8857  | 0.88    | 0.8743  | 0.8514 |
| cosine_accuracy@10  | 0.92    | 0.9214  | 0.9243  | 0.9243  | 0.8986 |
| cosine_precision@1  | 0.7286  | 0.7143  | 0.7014  | 0.7071  | 0.67   |
| cosine_precision@3  | 0.2838  | 0.2805  | 0.2795  | 0.2781  | 0.2695 |
| cosine_precision@5  | 0.1777  | 0.1771  | 0.176   | 0.1749  | 0.1703 |
| cosine_precision@10 | 0.092   | 0.0921  | 0.0924  | 0.0924  | 0.0899 |
| cosine_recall@1     | 0.7286  | 0.7143  | 0.7014  | 0.7071  | 0.67   |
| cosine_recall@3     | 0.8514  | 0.8414  | 0.8386  | 0.8343  | 0.8086 |
| cosine_recall@5     | 0.8886  | 0.8857  | 0.88    | 0.8743  | 0.8514 |
| cosine_recall@10    | 0.92    | 0.9214  | 0.9243  | 0.9243  | 0.8986 |
| cosine_ndcg@10      | 0.8269  | 0.82    | 0.8144  | 0.8165  | 0.7842 |
| cosine_mrr@10       | 0.7968  | 0.7873  | 0.7791  | 0.7822  | 0.7478 |
| cosine_map@100      | 0.8006  | 0.791   | 0.7827  | 0.7856  | 0.7519 |
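
The scores above are standard information-retrieval metrics computed on a held-out split at each Matryoshka dimension. As a hedged sketch, numbers of this kind can be produced with the Sentence Transformers InformationRetrievalEvaluator; the queries, corpus, and relevant_docs dictionaries below are placeholders for an evaluation split, which is not bundled with this card:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("CatSchroedinger/nomic-v1.5-financial-matryoshka")

# Placeholder evaluation data: {query_id: text}, {doc_id: text}, {query_id: {relevant doc_ids}}
queries = {"q1": "What is the maximum duration for patent term restoration for pharmaceutical products in the U.S.?"}
corpus = {"d1": "Patent term restoration for a single patent for a pharmaceutical product is provided to U.S. patent holders ..."}
relevant_docs = {"q1": {"d1"}}

for dim in (768, 512, 256, 128, 64):
    evaluator = InformationRetrievalEvaluator(
        queries,
        corpus,
        relevant_docs,
        truncate_dim=dim,
        name=f"dim_{dim}",
    )
    print(evaluator(model))  # accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100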

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 6,300 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    |         | anchor                                            | positive                                           |
    |---------|---------------------------------------------------|----------------------------------------------------|
    | type    | string                                            | string                                             |
    | details | min: 9 tokens, mean: 20.49 tokens, max: 51 tokens | min: 6 tokens, mean: 45.72 tokens, max: 687 tokens |
  • Samples:
    | anchor | positive |
    |--------|----------|
    | What limitations are associated with using non-GAAP financial measures such as contribution margin and adjusted income from operations? | Further, these metrics have certain limitations, as they do not include the impact of certain expenses that are reflected in our consolidated statements of operations. |
    | What type of firm is PricewaterhouseCoopers LLP as mentioned in the financial statements? | PricewaterhouseCoopers LLP, mentioned as the independent registered public accounting firm with PCAOB ID 238, prepared the report on the consolidated financial statements. |
    | What pages contain the financial Statements and Supplementary Data in IBM's 2023 Annual Report? | The Financial Statements and Supplementary Data for IBM's 2023 Annual Report are found on pages 44 through 121. |
  • Loss: MatryoshkaLoss with these parameters (a construction sketch follows this list):
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

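In code, the configuration above corresponds to wrapping MultipleNegativesRankingLoss in MatryoshkaLoss from Sentence Transformers. A minimal sketch, assuming the base model is loaded with trust_remote_code=True as in the Nomic model card (needed for the custom NomicBertModel code):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Inner contrastive loss over (anchor, positive) pairs with in-batch negatives
inner_loss = MultipleNegativesRankingLoss(model)

# Matryoshka wrapper: the same loss is applied to the embedding truncated to each dimension
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])
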
Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
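
For context, a hedged sketch of how these non-default values map onto a Sentence Transformers v3 training run. The output_dir is a placeholder, and save_strategy="epoch" is an assumption added so that load_best_model_at_end has matching save and eval strategies:

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="nomic-v1.5-financial-matryoshka",  # placeholder
    num_train_epochs=4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=False,
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed; must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)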

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch  | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_512_cosine_ndcg@10 | dim_256_cosine_ndcg@10 | dim_128_cosine_ndcg@10 | dim_64_cosine_ndcg@10 |
|--------|------|---------------|------------------------|------------------------|------------------------|------------------------|-----------------------|
| 0.8122 | 10   | 11.5729       | -                      | -                      | -                      | -                      | -                     |
| 1.0    | 13   | -             | 0.7995                 | 0.7976                 | 0.7929                 | 0.7889                 | 0.7646                |
| 1.5685 | 20   | 3.4999        | -                      | -                      | -                      | -                      | -                     |
| 2.0    | 26   | -             | 0.8207                 | 0.8189                 | 0.8099                 | 0.8090                 | 0.7825                |
| 2.3249 | 30   | 2.8578        | -                      | -                      | -                      | -                      | -                     |
| 3.0    | 39   | -             | 0.8267                 | 0.8218                 | 0.8151                 | 0.8168                 | 0.7826                |
| 3.0812 | 40   | 2.0904        | -                      | -                      | -                      | -                      | -                     |
| 3.7310 | 48   | -             | 0.8269                 | 0.8200                 | 0.8144                 | 0.8165                 | 0.7842                |
  • The saved checkpoint corresponds to the final row (epoch 3.7310, step 48), whose ndcg@10 values match the Evaluation section above.

Framework Versions

  • Python: 3.11.5
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.3.0
  • Datasets: 2.19.1
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}