---
library_name: transformers
pipeline_tag: image-to-text
datasets:
- Mouwiya/image-in-Words400
---

# BLIP Image Captioning

## Model Description

BLIP_image_captioning is an image-captioning model based on the BLIP (Bootstrapping Language-Image Pre-training) architecture. It was fine-tuned on the "image-in-Words400" dataset, which pairs images with descriptive captions, and combines visual and textual features to generate accurate, contextually relevant captions for input images.

## Model Details

- **Model Architecture**: BLIP (Bootstrapping Language-Image Pre-training)
- **Base Model**: Salesforce/blip-image-captioning-base
- **Fine-tuning Dataset**: Mouwiya/image-in-Words400
- **Number of Parameters**: 109 million

## Training Data

The model was fine-tuned on a shuffled subset of the **Mouwiya/image-in-Words400** dataset. A total of 400 examples were used during fine-tuning to allow for faster iteration and development.
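
The exact preprocessing script is not part of this card; a minimal sketch of how such a subset could be prepared with the `datasets` library is shown below. The split name and shuffle seed are illustrative assumptions, not the values used for the released model.

```python
from datasets import load_dataset

# Load the captioning dataset from the Hugging Face Hub
# (the "train" split name is an assumption about the dataset layout)
dataset = load_dataset("Mouwiya/image-in-Words400", split="train")

# Shuffle and keep 400 examples for faster iteration; the seed is illustrative
subset = dataset.shuffle(seed=42).select(range(400))
print(subset)
```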
## Training Procedure

The model was fine-tuned with the following settings; a minimal sketch of this setup appears after the list.

- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 3
- **Evaluation Metric**: BLEU Score
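
For orientation only, a minimal PyTorch fine-tuning sketch using these settings might look like the following. The dataset column names (`image`, `caption`), the label construction, and the absence of a learning-rate schedule, label masking, or device handling are simplifying assumptions; this is not the exact training script behind the released checkpoint.

```python
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

# Start from the base checkpoint listed under "Model Details"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# The 400-example subset described under "Training Data" (split name and seed are assumptions)
train_subset = (
    load_dataset("Mouwiya/image-in-Words400", split="train")
    .shuffle(seed=42)
    .select(range(400))
)

def collate_fn(batch):
    # "image" and "caption" are assumed column names for the dataset schema
    images = [example["image"] for example in batch]
    captions = [example["caption"] for example in batch]
    inputs = processor(images=images, text=captions, padding=True, return_tensors="pt")
    # Common BLIP captioning setup: the caption tokens also serve as the labels
    # (padding tokens are not masked out here, to keep the sketch short)
    inputs["labels"] = inputs["input_ids"].clone()
    return inputs

loader = DataLoader(train_subset, batch_size=16, shuffle=True, collate_fn=collate_fn)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)   # returns a captioning loss when labels are provided
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```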
## Usage

To use this model for image captioning, you can load it using the Hugging Face `transformers` library and perform inference as shown below:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load the processor and model
model_name = "Mouwiya/BLIP_image_captioning"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Fetch an example image and prepare it for the model
image_url = "URL_OF_THE_IMAGE"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Generate and decode a caption
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
```
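
As an alternative to the explicit processor/model calls above, the checkpoint should also work through the high-level `pipeline` API; this is a generic `transformers` usage pattern matching the `image-to-text` tag in this card's metadata, not a model-specific interface.

```python
from transformers import pipeline

# The task name matches the pipeline_tag declared in this card's metadata
captioner = pipeline("image-to-text", model="Mouwiya/BLIP_image_captioning")

# Accepts an image URL, a local file path, or a PIL.Image
result = captioner("URL_OF_THE_IMAGE")
print(result[0]["generated_text"])
```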
## Evaluation

The model was evaluated on a subset of the "image-in-Words400" dataset using the BLEU score. The evaluation results are as follows:

- **Average BLEU Score**: 0.35

This score indicates the model's ability to generate captions that closely match the reference descriptions in terms of overlapping n-grams.
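
The evaluation script itself is not included in this card; a comparable BLEU computation could be sketched with the Hugging Face `evaluate` library as follows, where `generated_captions` and `reference_captions` are placeholders for the model outputs and ground-truth captions of the held-out subset.

```python
import evaluate

bleu = evaluate.load("bleu")

# Placeholder data: one generated caption and its list of reference captions
generated_captions = ["a dog running on the beach"]
reference_captions = [["a dog runs along the beach"]]

results = bleu.compute(predictions=generated_captions, references=reference_captions)
print(results["bleu"])
```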
## Limitations

- **Dataset Size**: The model was fine-tuned on a relatively small subset of the dataset, which may limit its generalization capabilities.
- **Domain Specificity**: The model was trained on a specific dataset and may not perform as well on images from different domains.

## Contact

**Mouwiya S. A. Al-Qaisieh**

[email protected]