---
library_name: transformers
pipeline_tag: image-to-text
datasets:
- Mouwiya/image-in-Words400
---
# BLIP Image Captioning

## Model Description
BLIP_image_captioning is an image captioning model based on the BLIP (Bootstrapping Language-Image Pre-training) architecture. It was fine-tuned on the Mouwiya/image-in-Words400 dataset, which pairs images with descriptive captions, and combines visual and textual representations to generate accurate, contextually relevant captions for input images.

## Model Details
- **Model Architecture**: BLIP (Bootstrapping Language-Image Pre-training)
- **Base Model**: Salesforce/blip-image-captioning-base
- **Fine-tuning Dataset**: Mouwiya/image-in-Words400
- **Number of Parameters**: 109 million

## Training Data
The model was fine-tuned on a shuffled subset of the **Mouwiya/image-in-Words400** dataset. A total of 400 examples were used during fine-tuning to allow for faster iteration and development.
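
For reference, such a subset can be built with the `datasets` library roughly as follows; the `train` split name and the seed are assumptions, not details taken from the original training setup:

```python
from datasets import load_dataset

# Load the captioning dataset from the Hugging Face Hub
# (the "train" split name is an assumption).
dataset = load_dataset("Mouwiya/image-in-Words400", split="train")

# Shuffle and keep a 400-example subset for faster iteration.
subset = dataset.shuffle(seed=42).select(range(min(400, len(dataset))))
print(subset)
```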

## Training Procedure
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 3
- **Evaluation Metric**: BLEU Score
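
The original training script is not included in this repository. The sketch below shows one way a fine-tuning loop with the hyperparameters above could look, reusing the `subset` from the Training Data section; the `image` and `caption` column names are assumptions about the dataset schema:

```python
import torch
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def collate_fn(batch):
    # "image" and "caption" are assumed column names for the dataset.
    images = [example["image"].convert("RGB") for example in batch]
    captions = [example["caption"] for example in batch]
    inputs = processor(
        images=images, text=captions, padding=True, truncation=True, return_tensors="pt"
    )
    # For captioning, the text tokens double as the labels.
    inputs["labels"] = inputs["input_ids"].clone()
    return inputs

loader = DataLoader(subset, batch_size=16, shuffle=True, collate_fn=collate_fn)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # the model returns a loss when labels are provided
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch + 1}: last-batch loss {loss.item():.4f}")
```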

## Usage
To use this model for image captioning, you can load it using the Hugging Face transformers library and perform inference as shown below:
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load the processor and model
model_name = "Mouwiya/BLIP_image_captioning"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Example usage
image_url = "URL_OF_THE_IMAGE"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
```
## Evaluation
The model was evaluated on a subset of the "image-in-words400" dataset using the BLEU score. The evaluation results are as follows:

- **Average BLEU Score**: 0.35

This score indicates the model's ability to generate captions that closely match the reference descriptions in terms of overlapping n-grams.
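
The evaluation script is likewise not included; a minimal sketch of how such a BLEU score can be computed with the `evaluate` library is shown below. The evaluation split, the 32-example slice, and the `image`/`caption` column names are assumptions:

```python
import evaluate
import torch
from datasets import load_dataset
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Mouwiya/BLIP_image_captioning")
model = BlipForConditionalGeneration.from_pretrained("Mouwiya/BLIP_image_captioning").to(device)
model.eval()

bleu = evaluate.load("bleu")

# A small held-out slice; the split name and column names are assumptions.
eval_set = load_dataset("Mouwiya/image-in-Words400", split="train").select(range(32))

predictions, references = [], []
for example in eval_set:
    inputs = processor(images=example["image"].convert("RGB"), return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=50)
    predictions.append(processor.decode(output_ids[0], skip_special_tokens=True))
    references.append([example["caption"]])

print(bleu.compute(predictions=predictions, references=references))
```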

## Limitations
- **Dataset Size**: The model was fine-tuned on a relatively small subset of the dataset, which may limit its generalization capabilities.
- **Domain-Specific**: This model was trained on a specific dataset and may not perform as well on images from different domains.

## Contact
**Mouwiya S. A. Al-Qaisieh** 
[email protected]