Model Card for DoNUT Model
This model card provides details about the DoNUT model fine-tuned for document question answering (docQA) on a synthetically generated dataset.
Model Details
Model Description
The DoNUT model is a document question answering model fine-tuned to answer questions about tax forms, specifically 1099-DIV, 1099-INT, W2, and W3 forms. It was trained on a synthetically generated dataset to identify and extract information from these forms accurately.
Developed by: CALM.ai
Model type: Question Answering (QA)
Language(s) (NLP): English
License: Apache-2.0
Finetuned from model: naver-clova-ix/donut-base
Model Sources
- Repository: naver-clova-ix/donut-base
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
Direct Use
The model can be directly used for querying tax forms and extracting information from them. Users can interact with the extracted information using the llama-3 LLM, which provides a better understanding of the forms and allows for simple mathematical operations on some fields.
General Purpose Use
The model can also be used as a general-purpose document question answering system. It can parse various types of documents, such as textbooks, magazines, articles, and technical papers, providing users with relevant information and insights.
Downstream Use
The model can be further fine-tuned for specific use cases or integrated into larger document processing systems. It can also classify uploaded documents as form documents (1099-DIV, 1099-INT, W2, W3) or non-form documents, so that non-form documents such as textbooks, magazines, articles, and technical papers can be handled as general-purpose input.
Out-of-Scope Use
The model is not suitable for non-tax-related documents and may not perform well on handwritten or poorly scanned forms.
Bias, Risks, and Limitations
Because the training data is synthetic, the model may carry biases from the data generation process and may not generalize well to real-world forms. It may also struggle with handwritten or poorly scanned forms.
How to Get Started with the Model
To get started with the model, you can use the following code:
Installing required libraries
!pip install -q transformers datasets
Loading the Dataset
from datasets import load_dataset
# token=True reads your saved Hugging Face login; it replaces the deprecated use_auth_token argument
dataset = load_dataset("calm-ai/Multiple_financial_forms", split="test", token=True)
Loading the Model
from transformers import DonutProcessor, VisionEncoderDecoderModel
processor = DonutProcessor.from_pretrained("calm-ai/donut-base-finetuned-forms-v1")
model = VisionEncoderDecoderModel.from_pretrained("calm-ai/donut-base-finetuned-forms-v1")
Use the model for inference
import re
import torch

# run on the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

def process_document(image):
    # prepare encoder inputs
    pixel_values = processor(image, return_tensors="pt").pixel_values
    # prepare decoder inputs
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    # generate answer (greedy decoding, no gradients needed at inference time)
    with torch.no_grad():
        outputs = model.generate(
            pixel_values.to(device),
            decoder_input_ids=decoder_input_ids.to(device),
            max_length=model.decoder.config.max_position_embeddings,
            early_stopping=True,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            num_beams=1,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )
    # postprocess: strip special tokens, then convert the tag sequence to JSON
    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # remove first task start token
    return processor.token2json(sequence)

# you can change the index (0-99) to check the parsed information for other samples
image = dataset[20]['image']
parsed = process_document(image)
print(parsed)
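The parsed result returned by `processor.token2json` can be nested, which makes individual fields awkward to look up. The following is a minimal sketch, assuming the output is a structure of dicts, lists, and strings; the helper `flatten_fields` and the field names in the example are hypothetical and are not taken from the model's actual output schema.

```python
def flatten_fields(parsed, prefix=""):
    """Flatten nested dicts/lists into a single dict keyed by dotted paths."""
    flat = {}
    if isinstance(parsed, dict):
        for key, value in parsed.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten_fields(value, path))
    elif isinstance(parsed, list):
        for index, value in enumerate(parsed):
            flat.update(flatten_fields(value, f"{prefix}[{index}]"))
    else:
        # leaf value (usually a string extracted from the form)
        flat[prefix] = parsed
    return flat


# Hypothetical parsed form; real field names depend on the fine-tuned model.
example = {
    "employer": {"name": "Acme Corp", "ein": "12-3456789"},
    "wages": "50000.00",
}
flat = flatten_fields(example)
print(flat)
```

A flat mapping like this is easier to hand to a downstream step, for example when converting selected fields to numbers for simple arithmetic.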
Training Details
Training Data
The model was trained on a synthetically generated dataset of 4,000 tax forms (1099-DIV, 1099-INT, W2, W3), with every field populated using the Faker library.
Training Procedure
Preprocessing
The forms were preprocessed to extract the text and annotation information used for training.
Training Hyperparameters
- Training regime: Fine-tuning on the DoNUT model
- Optimizer: Adam
- Learning rate: 5e-5
- Batch size: 8
Speeds, Sizes, Time
Training time: 3 epochs
Speed: 6s
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was evaluated on a separate set of tax forms not seen during training.
Factors
The evaluation was disaggregated by form type (1099-div, 1099-int, w2, w3).
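Disaggregating by form type amounts to grouping per-document scores by form and averaging within each group. The sketch below illustrates this under stated assumptions: the helper `mean_score_by_form_type` is hypothetical, and the scores in the example are made-up placeholders, not results from this model.

```python
from collections import defaultdict


def mean_score_by_form_type(records):
    """Average a per-document score within each form type.

    `records` is an iterable of (form_type, score) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for form_type, score in records:
        totals[form_type] += score
        counts[form_type] += 1
    return {form_type: totals[form_type] / counts[form_type] for form_type in totals}


# Placeholder scores for illustration only.
records = [("w2", 0.02), ("w2", 0.04), ("1099-int", 0.10)]
per_type = mean_score_by_form_type(records)
print(per_type)
```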
Metrics
Val_edit_distance: 0.0434
Edit distance measures how different two strings are: it is the minimum number of edit operations (typically single-character insertions, deletions, and substitutions) required to transform one string into the other. In document parsing and generation, it measures how closely the generated output matches the ground truth; Val_edit_distance is this value computed on the validation set.
Validation edit distance is a suitable metric here for the following reasons:
- Quantifies Accuracy: Edit distance provides a quantitative measure of how similar the generated JSON output is to the ground truth. A lower edit distance indicates a more accurate output.
- Handles Variability: Edit distance is robust to variations in the generated output that may still be considered correct. For example, minor differences in formatting or word choice result in a small edit distance and may still be acceptable.
- Easy Interpretation: The value is easy to interpret, with smaller values indicating higher similarity between the generated and ground truth outputs.
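To make the metric concrete, here is a minimal pure-Python sketch of a normalized edit distance. It assumes Levenshtein distance divided by the length of the longer string, which yields a value between 0 and 1; the exact normalization behind the reported Val_edit_distance is not stated in this card, so treat this as an illustration and not as the evaluation code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer string's length (assumed normalization)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return edit_distance(prediction, reference) / longest


print(normalized_edit_distance("abcd", "abcf"))
```

Under this assumed normalization, identical strings score 0.0 and strings with nothing in common score 1.0.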
Results
Accuracy: 97%
Summary
To our knowledge, our fine-tuned DoNUT model is the only open-source model capable of extracting information from tax forms such as 1099-DIV, 1099-INT, W2, and W3, achieving an accuracy of 97%.
Technical Specifications
Compute Infrastructure
- GPU requirements: 4 GB (minimum)
- System RAM: 8 GB (minimum)
Model Card Authors
Abhishek A Chandan V K Likhith V Monish M
Model Card Contact