# CodeModernBERT-Owl

## 概要 / Overview

🦉 **CodeModernBERT-Owl**: 高精度なコード検索&コード理解モデル / A high-accuracy code search & code understanding model
CodeModernBERT-Owl is a model pretrained from scratch for code search and code understanding tasks.
Compared to previous versions such as CodeHawks-ModernBERT and CodeMorph-ModernBERT, it adds Rust support and improves search accuracy for Python, PHP, Java, JavaScript, Go, and Ruby.

## 🛠 主な特徴 / Key Features

✅ Supports long sequences of up to 2,048 tokens (vs. the 512-token limit of Microsoft's CodeBERT and GraphCodeBERT)
✅ Optimized for code search, code understanding, and code clone detection
✅ Additionally fine-tuned on open-source GitHub repositories (Java, Rust)
✅ Achieves the highest accuracy among the CodeHawks/CodeMorph series
✅ Multi-language support: Python, PHP, Java, JavaScript, Go, Ruby, and Rust

## 📊 モデルパラメータ / Model Parameters

| パラメータ / Parameter | 値 / Value |
|---|---|
| vocab_size | 50,000 |
| hidden_size | 768 |
| num_hidden_layers | 12 |
| num_attention_heads | 12 |
| intermediate_size | 3,072 |
| max_position_embeddings | 2,048 |
| type_vocab_size | 2 |
| hidden_dropout_prob | 0.1 |
| attention_probs_dropout_prob | 0.1 |
| local_attention_window | 128 |
| rope_theta | 160,000 |
| local_attention_rope_theta | 10,000 |
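
These values can be verified against the published configuration. A minimal sketch using the standard Transformers API (the attribute names follow the ModernBERT config; printing the full `config` object is the surest check):

```python
from transformers import AutoConfig

# Fetch the model configuration from the Hub and spot-check a few fields.
config = AutoConfig.from_pretrained("Shuu12121/CodeModernBERT-Owl")
print(config.hidden_size)              # expected: 768
print(config.num_hidden_layers)        # expected: 12
print(config.max_position_embeddings)  # expected: 2048
```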

## 💻 モデルの使用方法 / How to Use

This model can be easily loaded using the Hugging Face Transformers library.
⚠️ Requires transformers >= 4.48.0
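If your environment has an older version, upgrade first, e.g. `pip install -U "transformers>=4.48.0"`.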
🔗 Colab Demo (in the notebook, replace the model name with `CodeModernBERT-Owl`)

### モデルのロード / Load the Model

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeModernBERT-Owl")
```
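
Because the checkpoint is loaded with a masked-language-modeling head, masked-token prediction is a quick sanity check. A minimal sketch, with an illustrative snippet and top-5 readout (neither comes from the original card):

```python
import torch

# Mask one token in a small code snippet and ask the MLM head to fill it in.
code = f"def add(a, b):\n    return a {tokenizer.mask_token} b"
inputs = tokenizer(code, return_tensors="pt")
inputs.pop("token_type_ids", None)  # ModernBERT does not use token_type_ids

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and read out the 5 most likely tokens.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
top_ids = logits[mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```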

### コード埋め込みの取得 / Get Code Embeddings

```python
import torch

def get_embedding(text, model, tokenizer, device="cuda"):
    model.to(device)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # ModernBERT does not use token_type_ids, so drop them if the tokenizer adds them
    if "token_type_ids" in inputs:
        inputs.pop("token_type_ids")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.model(**inputs)  # base encoder beneath the MLM head
    embedding = outputs.last_hidden_state[:, 0, :]  # first ([CLS]) token embedding
    return embedding

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)  # torch.Size([1, 768])
```
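
With `get_embedding` in place, code search reduces to a nearest-neighbor lookup over embeddings. A minimal sketch using cosine similarity (the query and candidate snippets are made-up examples):

```python
import torch
import torch.nn.functional as F

# Rank a handful of candidate snippets against a natural-language query.
query_emb = get_embedding("sort a list of integers", model, tokenizer)
candidates = [
    "def bubble_sort(arr):\n    ...",
    "def read_file(path):\n    ...",
    "def binary_search(arr, x):\n    ...",
]
cand_embs = torch.cat([get_embedding(c, model, tokenizer) for c in candidates])

scores = F.cosine_similarity(query_emb, cand_embs)  # one score per candidate
best = scores.argmax().item()
print(candidates[best], scores[best].item())
```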

## 🔍 評価結果 / Evaluation Results

### データセット / Dataset

📌 Tested on `code_x_glue_ct_code_to_text` with a candidate pool size of 100.
📌 Rust-specific evaluations were conducted using `Shuu12121/rust-codesearch-dataset-open`.
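
The scores below come from ranking each query's one correct snippet against the other 99 candidates in its pool; the Rust section reports MRR (mean reciprocal rank) explicitly. A minimal sketch of MRR under the usual layout where query *i*'s correct snippet is code *i* (the dot-product scoring is an assumption):

```python
import torch

def mean_reciprocal_rank(query_embs, code_embs):
    """query_embs, code_embs: (N, dim) tensors; query i's correct snippet is code i.
    With N = 100 this matches a candidate pool size of 100."""
    sims = query_embs @ code_embs.T                 # (N, N) score matrix
    ranking = sims.argsort(dim=1, descending=True)  # candidates sorted per query
    # 1-based rank at which the correct snippet (index i) appears for query i
    ranks = (ranking == torch.arange(len(sims)).unsqueeze(1)).nonzero()[:, 1] + 1
    return (1.0 / ranks.float()).mean().item()
```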

### 📈 主要な評価指標の比較(同一シード値)/ Key Evaluation Metrics (Same Seed)

| 言語 / Language | CodeModernBERT-Owl | CodeHawks-ModernBERT | Salesforce CodeT5+ | Microsoft CodeBERT | GraphCodeBERT |
|---|---|---|---|---|---|
| Python | 0.8793 | 0.8551 | 0.8266 | 0.5243 | 0.5493 |
| Java | 0.8880 | 0.7971 | 0.8867 | 0.3134 | 0.5879 |
| JavaScript | 0.8423 | 0.7634 | 0.7628 | 0.2694 | 0.5051 |
| PHP | 0.9129 | 0.8578 | 0.9027 | 0.2642 | 0.6225 |
| Ruby | 0.8038 | 0.7469 | 0.7568 | 0.3318 | 0.5876 |
| Go | 0.9386 | 0.9043 | 0.8117 | 0.3262 | 0.4243 |
✅ Achieves the highest score in every language in the comparison above.
✅ Java accuracy improved substantially (0.7971 → 0.8880) through additional fine-tuning on GitHub data.
✅ Outperforms the previous models especially in PHP and Go.

### 📊 Rust(独自データセット)/ Rust Performance (Custom Dataset)

| 指標 / Metric | CodeModernBERT-Owl |
|---|---|
| MRR | 0.7940 |
| MAP | 0.7940 |
| R-Precision | 0.7173 |

#### 📌 K別評価指標 / Evaluation Metrics by K

| K | Recall@K | Precision@K | NDCG@K | F1@K | Success Rate@K | Query Coverage@K |
|---|---|---|---|---|---|---|
| 1 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 | 0.7173 |
| 5 | 0.8913 | 0.7852 | 0.8118 | 0.8132 | 0.8913 | 0.8913 |
| 10 | 0.9333 | 0.7908 | 0.8254 | 0.8230 | 0.9333 | 0.9333 |
| 50 | 0.9887 | 0.7938 | 0.8383 | 0.8288 | 0.9887 | 0.9887 |
| 100 | 1.0000 | 0.7940 | 0.8401 | 0.8291 | 1.0000 | 1.0000 |
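
Note that with exactly one relevant snippet per query, Recall@K, Success Rate@K, and Query Coverage@K all reduce to the same quantity, which is why those columns coincide above. A minimal sketch of that shared computation (same layout assumptions as the MRR sketch):

```python
import torch

def recall_at_k(query_embs, code_embs, k):
    """Fraction of queries whose correct snippet (index i) lands in the top-k."""
    sims = query_embs @ code_embs.T
    topk = sims.topk(k, dim=1).indices  # (N, k) candidate indices per query
    hits = (topk == torch.arange(len(sims)).unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```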

## 📝 結論 / Conclusion

✅ Top performance across all evaluated languages
✅ Rust support successfully added through dataset augmentation
✅ Further performance improvements are possible with better datasets

## 📜 ライセンス / License

📄 Apache License 2.0

## 📧 連絡先 / Contact

📩 For any questions, please contact:
📧 [email protected]