Exploring Hard Negative Mining with NV-Retriever in Korean Financial Text

Community Article · Published January 12, 2025

This article is written by Yewon Hwang, with advisory contributions from Hanwool Lee.

Modern text embedding models—post SimCSE[1]—have widely relied on contrastive learning to fine-tune sentence embeddings. The contrastive approach pulls semantically similar sentences closer together and pushes dissimilar sentences further apart in the embedding space. Yet what counts as “similar” or “dissimilar” depends heavily on having the right positive and negative pairs.

In particular, negative pairs that are too easy or truly random (such as picking any random sentence from a large corpus) might not offer enough training signal to the model. The model easily recognizes them as dissimilar and learns little from them. Hard negatives, on the other hand, are tricky sentence pairs that share some superficial similarity but are ultimately unrelated in meaning—making them invaluable for model performance.

In this blog post, we focus on:

  1. The rationale behind Hard Negative Mining,
  2. NV-Retriever[7] as a “positive-aware” approach,
  3. Experimental application in a Korean financial domain setting.

1. Contrastive Learning & the Hard Negative Problem

1.1 Quick background

SimCSE[1] ignited a wave of sentence-embedding approaches based on contrastive learning. The principle:

  • Positive pairs (semantically close) are pulled together in embedding space,
  • Negative pairs (semantically distant) are pushed away.

To accomplish this effectively, a well-constructed dataset with positive and negative examples is crucial. If the negative pairs are trivially random (e.g., “고양이가 좋아하는 음식은?(What's your cat's favorite food?)” vs “넷플릭스의 설립연도는 1997년이다.(Netflix was founded in 1997.)”), the model sees them as obviously unrelated and gains minimal benefit. Such “easy negatives” do not help refine embedding distinctions.

Hence, Hard Negatives—pairs the model struggles to distinguish from real positives—drive more nuanced learning. However, systematically finding these hard negatives remains challenging.
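To ground the discussion, here is a minimal sketch of an InfoNCE-style contrastive loss for a single query, its positive, and a set of mined negatives (illustrative only; it assumes L2-normalized embeddings, and tau is the softmax temperature):

import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, tau=0.05):
    # query: (d,), positive: (d,), negatives: (k, d); all L2-normalized
    pos_sim = (query * positive).sum(-1, keepdim=True)  # (1,)
    neg_sims = negatives @ query                        # (k,)
    logits = torch.cat([pos_sim, neg_sims]) / tau       # (k+1,)
    target = torch.zeros(1, dtype=torch.long)           # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

Because very dissimilar negatives receive near-zero softmax weight, easy negatives contribute almost no gradient, which is exactly why harder negatives matter.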

1.2 Earlier attempts & limitations

| Approach | Explanation | Weaknesses |
| --- | --- | --- |
| Naive top-k | Pick the top-k most similar passages (excluding the known positive) | High chance of introducing false negatives |
| Top-k shifted by N | Skip the top N hits, then pick the top k | Ignores similarity scores beyond an absolute rank cutoff; can lose valuable negatives or keep false ones |
| Top-k abs | Exclude negative passages above a fixed similarity threshold | Heavily reliant on a hyper-sensitive threshold |

Additionally, BM25 or naive methods from DPR[4], ANCE[5], etc. sometimes yield a large portion of “false negatives.” RocketQA[6] found that on MS-MARCO data, nearly 70% of BM25-based “hard negatives” were in fact positives upon manual inspection.
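To make the table above concrete, here is a minimal sketch of the three baseline strategies over a vector of query-passage similarity scores (the function names and signatures are ours, for illustration only):

import numpy as np

def ranked_candidates(sims, pos_idx):
    # All passage indices except the known positive, most similar first
    return [i for i in np.argsort(-sims) if i != pos_idx]

def naive_topk(sims, pos_idx, k):
    return ranked_candidates(sims, pos_idx)[:k]

def topk_shifted_by_n(sims, pos_idx, k, n):
    # Skip the n most similar candidates, then take k
    return ranked_candidates(sims, pos_idx)[n:n + k]

def topk_abs(sims, pos_idx, k, max_sim):
    # Keep only candidates below an absolute similarity threshold
    return [i for i in ranked_candidates(sims, pos_idx) if sims[i] < max_sim][:k]

All three ignore where the positive itself scores, which is exactly the gap NV-Retriever's positive-aware thresholds close.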


2. NV-Retriever: Positive-aware Hard Negatives

NV-Retriever[7] proposes a positive-aware negative mining approach where each query’s positive similarity guides the maximum negative similarity threshold:

  1. Pick a larger Teacher Model (e.g., e5-based or Mistral-based).

  2. Encode queries and passages with the teacher embeddings.

  3. For each query q with a known positive pair:

    • Get the positive score (pos_score).
    • Define a max negative similarity threshold:
      • Top-K MarginPos: max_neg_score_threshold = pos_score - absolute_margin
      • Top-K PercPos: max_neg_score_threshold = pos_score * percentage_margin
  4. Among the filtered negative candidates, select the top-k as Hard Negatives (see the sketch below).
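Both filters reduce to a single line. A minimal sketch, with parameter names following the pseudo-formulas above (the default values are illustrative):

def max_neg_score_threshold(pos_score, method="percpos",
                            absolute_margin=0.05, percentage_margin=0.95):
    # Positive-aware upper bound on how similar a negative may be
    if method == "marginpos":               # Top-K MarginPos
        return pos_score - absolute_margin
    return pos_score * percentage_margin    # Top-K PercPos

Candidates scoring at or below this bound are kept, and the top-k among them become the Hard Negatives.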

In the original NV-Retriever experiments, the best performance emerged using:

  • Teacher Model = Mistral
  • Mining Method = TopK-PercPos with margin = 0.95

3. Korean Financial Domain Experiments

After reading about NV-Retriever, we applied a similar methodology to build a Korean financial text-embedding dataset and see if the positive-aware approach generalizes. Below is an outline of our process:

3.1 Teacher Model & Base Model

  • Teacher Model candidates:

    • BM25 (Okapi)
      - Though it performed poorly in the original NV-Retriever, we wanted to see if the keyword-based approach might do better in a domain heavily reliant on financial keywords.
    • bge-m3 (BAAI/bge-m3): a multilingual embedding model (568M params)
    • KURE-v1 (nlpai-lab/KURE-v1): a Korean-finetuned version of bge-m3
  • Base Model candidates (for eventual fine-tuning):

3.2 Data

We used two main data types:

  1. QA Dataset
    - BCCard/BCCard-Finance-Kor-QnA
    - (Query - Answer) pairs as positive

  2. Non-QA Dataset
    - Crawled Naver Finance news articles (2024)
    - (Title - Passage) pairs as positive

Note: The textual content is all in Korean. For example:

(Korean)
Query : "미성년 자녀에게 증여한 재산이 상속세에 포함되나요?"
Answer : "미성년 자녀에게 증여한 재산은 상속세 계산 시 포함될 수 있습니다. ... "
(English)
Query : “Is property given to minor children included in the estate tax?”
Answer : “Property given to minor children can be included in the calculation of inheritance tax. ... ”
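For reference, a minimal sketch of pulling the QA positives from the Hub with the datasets library (the split and the Query/Answer column names are assumptions; verify them against the dataset card):

from datasets import load_dataset

# QA positives: (Query, Answer) pairs from the BCCard dataset
# NOTE: split and column names are assumed, not confirmed
qa = load_dataset("BCCard/BCCard-Finance-Kor-QnA", split="train")
pairs = [(ex["Query"], ex["Answer"]) for ex in qa]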

3.3 Hard Negative Mining

  • Mining Method = TopK-PercPos with percentage_margin = 0.95
  • Each query can retrieve up to 4 Hard Negatives.

Below is a code snippet using BM25 for demonstration (normalize_scores min-max scales the raw BM25 scores, and tokenizing is a user-defined tokenizer for Korean text):

import pandas as pd
from tqdm import tqdm

def normalize_scores(scores):
    # Min-max scale raw BM25 scores to [0, 1] so the percentage margin is meaningful
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-9)

def mine_hard_negatives(data, bm25, max_neg=4):
    # data: DataFrame with 'Query' and 'Answer' columns;
    # bm25: a BM25 (Okapi) index built over the Answer column, row-aligned with data
    results = []
    for index, row in tqdm(data.iterrows(), total=len(data)):
        query = row['Query']
        positive_answer = row['Answer']

        # BM25 scores of the query against every answer in the corpus
        scores = bm25.get_scores(tokenizing(query))  # tokenizing is user-defined
        normalized_scores = normalize_scores(scores)

        # The answer at the same row index is the known positive
        pos_score = normalized_scores[index]
        max_neg_score_threshold = pos_score * 0.95  # TopK-PercPos filter

        negative_candidates = [
            (i, normalized_scores[i])
            for i in range(len(scores))
            if normalized_scores[i] <= max_neg_score_threshold and i != index
        ]
        # Hardest first: highest similarity that still clears the threshold
        negative_candidates.sort(key=lambda x: x[1], reverse=True)
        hard_negatives = negative_candidates[:max_neg]

        for neg in hard_negatives:
            results.append({
                'Query': query,
                'Positive Answer': positive_answer,
                'Hard Negative': data.iloc[neg[0]]['Answer'],
                'Positive Score': pos_score,
                'Negative Score': neg[1]
            })
    return pd.DataFrame(results)
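The same TopK-PercPos loop with an embedding teacher is a small variation. Here is a minimal sketch using bge-m3 through sentence-transformers (we assume the model loads via SentenceTransformer, and that queries and answers are row-aligned lists so index i forms a positive pair):

from sentence_transformers import SentenceTransformer

def mine_with_embedding_teacher(queries, answers, max_neg=4, margin=0.95):
    model = SentenceTransformer("BAAI/bge-m3")
    q_emb = model.encode(queries, normalize_embeddings=True)
    a_emb = model.encode(answers, normalize_embeddings=True)
    sims = q_emb @ a_emb.T  # cosine similarity matrix (queries x answers)

    mined = []
    for i in range(len(queries)):
        pos_score = sims[i, i]            # aligned positive pair
        threshold = pos_score * margin    # TopK-PercPos filter
        candidates = [(j, sims[i, j]) for j in range(len(answers))
                      if j != i and sims[i, j] <= threshold]
        candidates.sort(key=lambda x: x[1], reverse=True)
        mined.append([answers[j] for j, _ in candidates[:max_neg]])
    return mined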

3.4 Results & Observations

For QA sets (e.g., BCCard Q&A):

  • BM25 often yielded extreme 0 or 1 similarity scores after min-max scaling. Hard Negative sampling became somewhat meaningless when pos_score ≈ 1 or pos_score ≈ 0.
  • bge-m3 and KURE-v1 produced more stable similarity distributions, so Hard Negatives were more realistically mined.

For the NonQA news dataset:

  • Overall, positive similarity scores were lower. The text can be quite long and more topically diverse (e.g., multiple topics in a single article).
  • Distinguishing false negatives from genuinely negative pairs is harder.
    • Example: Two different news paragraphs could each mention “뉴욕증시 혼조세” (a mixed New York stock market) plus something else entirely—should they be negative or semantically overlapping?

Conclusion: We found embedding-based teachers (bge-m3, KURE-v1) outperformed BM25 in Hard Negative curation. However, the more domain- and topic-diverse the data, the more complicated it is to define “truly negative” pairs. Careful data curation or explicit type/metadata labeling helps reduce false negatives.


4. Summary

  1. Contrastive Learning thrives on well-chosen negatives. Random negative sampling can limit the model’s potential.
  2. NV-Retriever addresses naive negative mining’s shortcomings by setting an upper bound on negative similarity relative to the positive.
  3. In the Korean financial domain, we tested BM25 vs. embedding-based teacher models. BM25 frequently gave extreme similarity values, undermining Hard Negative selection. bge-m3 and KURE-v1 yielded more stable distributions.
  4. News-based NonQA sets introduced more complexity, underscoring the importance of data “type labeling” to avoid excessive false negatives.

Despite these challenges, NV-Retriever’s “positive-aware threshold” approach proved a solid improvement over older “top-k” methods. We remain convinced that continuing to refine negative sampling will yield further gains in embedding quality.


References

[1]: SimCSE: Simple Contrastive Learning of Sentence Embeddings ( https://arxiv.org/abs/2104.08821 )
[2]: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere ( https://arxiv.org/abs/2005.10242 )
[3]: A large annotated corpus for learning natural language inference ( https://arxiv.org/abs/1508.05326 )
[4]: Dense Passage Retrieval for Open-Domain Question Answering (DPR) ( https://arxiv.org/abs/2004.04906 )
[5]: Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE) ( https://arxiv.org/abs/2007.00808 )
[6]: RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering ( https://arxiv.org/abs/2010.08191 )
[7]: NV-Retriever: Improving Text Embedding Models with Effective Hard-negative Mining ( https://arxiv.org/abs/2407.15831 )


Thank you for reading! If you have questions or insights about Hard Negative Mining for Korean text embeddings—particularly in specialized domains like finance—feel free to leave a comment or share your own experience.

NMIXX-Financial-NLP-Lab is an open-source financial natural language processing research lab supported by ModuLabs. The lab is dedicated to ongoing research in financial NLP. Please check out and support our work at https://huggingface.co/nmixx-fin and https://github.com/nmixx-fin!

Citation

@misc{hwang2025nvretriever,
  author       = {Yewon Hwang and Hanwool Lee},
  title        = {Exploring Hard Negative Mining with NV-Retriever in Korean Financial Text},
  year         = {2025},
  url          = {https://huggingface.co/blog/Albertmade/nvretriever-into-financial-text},
  note         = {Hugging Face Blog Post}
}