This model is a fine-tuned version of huggingface/CodeBERTa-small-v1, trained on the cakiki/rosetta-code dataset to identify the 26 programming languages listed below.

Training Details:

The model was trained for 25 epochs on Azure on nearly 26,000 data points covering the 26 programming languages listed below,
extracted from a dataset that contains 1,006 programming languages in total.

Programming languages this model can detect:

  1. 'ARM Assembly'
  2. 'AppleScript'
  3. 'C'
  4. 'C#'
  5. 'C++'
  6. 'COBOL'
  7. 'Erlang'
  8. 'Fortran'
  9. 'Go'
  10. 'Java'
  11. 'JavaScript'
  12. 'Kotlin'
  13. 'Lua'
  14. 'Mathematica/Wolfram Language'
  15. 'PHP'
  16. 'Pascal'
  17. 'Perl'
  18. 'PowerShell'
  19. 'Python'
  20. 'R'
  21. 'Ruby'
  22. 'Rust'
  23. 'Scala'
  24. 'Swift'
  25. 'Visual Basic .NET'
  26. 'jq'
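
Selecting this language subset from a larger dataset could look roughly like the sketch below. It is not the author's actual preprocessing: the `language_name` and `code` field names are assumptions about the cakiki/rosetta-code schema, `filter_supported` and `examples_per_language` are hypothetical helpers, and the rows are toy data.

```python
from collections import Counter

# The 26 labels listed above.
SUPPORTED_LANGUAGES = [
    "ARM Assembly", "AppleScript", "C", "C#", "C++", "COBOL", "Erlang",
    "Fortran", "Go", "Java", "JavaScript", "Kotlin", "Lua",
    "Mathematica/Wolfram Language", "PHP", "Pascal", "Perl", "PowerShell",
    "Python", "R", "Ruby", "Rust", "Scala", "Swift", "Visual Basic .NET", "jq",
]


def filter_supported(records, language_key="language_name"):
    """Keep only records whose language is one of the 26 supported labels.

    `records` is any iterable of dicts; `language_key` is an assumed field name.
    """
    supported = set(SUPPORTED_LANGUAGES)
    return [r for r in records if r.get(language_key) in supported]


def examples_per_language(records, language_key="language_name"):
    """Count how many examples each language contributes."""
    return Counter(r[language_key] for r in records)


# Toy records standing in for dataset rows (not real data).
rows = [
    {"language_name": "Python", "code": "print('hi')"},
    {"language_name": "Haskell", "code": "main = putStrLn \"hi\""},
    {"language_name": "Go", "code": "package main"},
    {"language_name": "Python", "code": "x = 1"},
]
kept = filter_supported(rows)
print(examples_per_language(kept))
```

The same two functions can be applied to rows loaded with the `datasets` library, provided the field names are adjusted to match the real schema.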

Training setup and results for 25 epochs:

  • Training Computer Configuration:
    • GPU: 1x Nvidia Tesla T4
    • VRAM: 16 GB
    • RAM: 112 GB
    • Cores: 6
  • Training time: exactly 7 hours for 25 epochs
  • Training Hyper-parameters: [image: training detail.png]

Inference Code

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
model_name = 'philomath-1209/programming-language-identification'
loaded_tokenizer = AutoTokenizer.from_pretrained(model_name)
loaded_model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Use a GPU when one is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
text = """
PROGRAM Triangle
   IMPLICIT NONE
   REAL :: a, b, c, Area
   PRINT *, 'Welcome, please enter the&
            &lengths of the 3 sides.'
   READ *, a, b, c
   PRINT *, 'Triangle''s area:  ', Area(a,b,c)
  END PROGRAM Triangle
  FUNCTION Area(x,y,z)
   IMPLICIT NONE
   REAL :: Area            ! function type
   REAL, INTENT( IN ) :: x, y, z
   REAL :: theta, height
   theta = ACOS((x**2+y**2-z**2)/(2.0*x*y))
   height = x*SIN(theta); Area = 0.5*y*height
  END FUNCTION Area

"""
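# Run the model and the tokenized input on the same device.
loaded_model.to(device)
inputs = loaded_tokenizer(text, return_tensors="pt", truncation=True).to(device)
with torch.no_grad():
    logits = loaded_model(**inputs).logits
# Pick the highest-scoring class and map it back to its language name.
predicted_class_id = logits.argmax().item()
print(loaded_model.config.id2label[predicted_class_id])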
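
The snippet above yields only the single most likely label. To see how confident the model is, the logits can be turned into probabilities and ranked. The following is a minimal sketch in plain Python, assuming the logits have first been converted to a flat list (for example with `logits[0].tolist()`). `top_k_labels` is a hypothetical helper, and the three-label mapping is toy data, not the model's real `id2label`.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs.

    `logits` is a flat list of floats, one per class;
    `id2label` maps class index -> label name.
    """
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy example with made-up logits and labels.
toy_id2label = {0: "C", 1: "Fortran", 2: "Python"}
toy_logits = [0.5, 3.0, 1.0]
ranking = top_k_labels(toy_logits, toy_id2label, k=2)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

With the real model, the same helper would take `logits[0].tolist()` and `loaded_model.config.id2label` as its arguments.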

Optimum with ONNX inference

Loading the ONNX model requires the 🤗 Optimum library to be installed.

pip install transformers optimum[onnxruntime]

import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_path = "philomath-1209/programming-language-identification"

tokenizer = AutoTokenizer.from_pretrained(model_path, subfolder="onnx")
model = ORTModelForSequenceClassification.from_pretrained(model_path, export=False, subfolder="onnx")

text = """
  PROGRAM Triangle
     IMPLICIT NONE
     REAL :: a, b, c, Area
     PRINT *, 'Welcome, please enter the&
              &lengths of the 3 sides.'
     READ *, a, b, c
     PRINT *, 'Triangle''s area:  ', Area(a,b,c)
    END PROGRAM Triangle
    FUNCTION Area(x,y,z)
     IMPLICIT NONE
     REAL :: Area            ! function type
     REAL, INTENT( IN ) :: x, y, z
     REAL :: theta, height
     theta = ACOS((x**2+y**2-z**2)/(2.0*x*y))
     height = x*SIN(theta); Area = 0.5*y*height
    END FUNCTION Area

"""
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# Pick the highest-scoring class and map it back to its language name.
predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])
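
Both snippets pass `truncation=True`, so only the beginning of a long file reaches the model. One possible workaround, sketched here under stated assumptions, is to split the source into line-based chunks, classify each chunk, and take a majority vote. `chunk_lines`, `vote_language`, and the `classify` callable are hypothetical names; `classify` stands for any function that wraps the inference steps above and returns a label string. The default of 40 lines per chunk is an arbitrary choice, not a tuned value.

```python
from collections import Counter


def chunk_lines(source, lines_per_chunk=40):
    """Split source text into chunks of at most `lines_per_chunk` lines."""
    lines = source.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def vote_language(source, classify, lines_per_chunk=40):
    """Classify each non-blank chunk and return the most common label.

    Returns None when the source has no non-blank content.
    """
    chunks = [c for c in chunk_lines(source, lines_per_chunk) if c.strip()]
    if not chunks:
        return None
    votes = Counter(classify(chunk) for chunk in chunks)
    return votes.most_common(1)[0][0]


# Stand-in classifier for demonstration only; a real one would call the model.
def fake_classify(chunk):
    return "Python" if "def " in chunk else "C"


sample = "def f():\n    pass\n" * 3 + "int main() {}\n"
print(vote_language(sample, fake_classify, lines_per_chunk=2))
```

A majority vote treats every chunk equally; weighting chunks by their predicted probability is another option, at the cost of a slightly more involved `classify` interface.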

Model size: 83.5M params (Safetensors, tensor type F32)