---
library_name: transformers
pipeline_tag: image-to-text
datasets:
- Mouwiya/image-in-Words400
---

# BLIP Image Captioning

## Model Description

BLIP_image_captioning is an image-captioning model based on the BLIP (Bootstrapping Language-Image Pre-training) architecture. It was fine-tuned on the "image-in-Words400" dataset, which pairs images with descriptive captions, and combines visual and textual features to generate accurate, contextually relevant captions for input images.

## Model Details

- **Model Architecture**: BLIP (Bootstrapping Language-Image Pre-training)
- **Base Model**: Salesforce/blip-image-captioning-base
- **Fine-tuning Dataset**: Mouwiya/image-in-Words400
- **Number of Parameters**: 109 million

## Training Data

The model was fine-tuned on a shuffled subset of the **Mouwiya/image-in-Words400** dataset. A total of 400 examples were used during fine-tuning to allow for faster iteration and development.
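
The exact preprocessing script is not part of this card; a minimal sketch of how such a subset could be prepared with the `datasets` library is shown below. The split name and shuffle seed are illustrative assumptions, not the values used for the released model.

```python
from datasets import load_dataset

# Load the captioning dataset from the Hugging Face Hub
# (the "train" split name is an assumption about the dataset layout)
dataset = load_dataset("Mouwiya/image-in-Words400", split="train")

# Shuffle and keep 400 examples for faster iteration; the seed is illustrative
subset = dataset.shuffle(seed=42).select(range(400))
print(subset)
```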
## Training Procedure

The model was fine-tuned with the following settings; a minimal sketch of this setup appears after the list.

- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 16
- **Epochs**: 3
- **Evaluation Metric**: BLEU Score
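
For orientation only, a minimal PyTorch fine-tuning sketch using these settings might look like the following. The dataset column names (`image`, `caption`), the label construction, and the absence of a learning-rate schedule, label masking, or device handling are simplifying assumptions; this is not the exact training script behind the released checkpoint.

```python
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration

# Start from the base checkpoint listed under "Model Details"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# The 400-example subset described under "Training Data" (split name and seed are assumptions)
train_subset = (
    load_dataset("Mouwiya/image-in-Words400", split="train")
    .shuffle(seed=42)
    .select(range(400))
)

def collate_fn(batch):
    # "image" and "caption" are assumed column names for the dataset schema
    images = [example["image"] for example in batch]
    captions = [example["caption"] for example in batch]
    inputs = processor(images=images, text=captions, padding=True, return_tensors="pt")
    # Common BLIP captioning setup: the caption tokens also serve as the labels
    # (padding tokens are not masked out here, to keep the sketch short)
    inputs["labels"] = inputs["input_ids"].clone()
    return inputs

loader = DataLoader(train_subset, batch_size=16, shuffle=True, collate_fn=collate_fn)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)   # returns a captioning loss when labels are provided
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```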
## Usage

To use this model for image captioning, you can load it using the Hugging Face `transformers` library and perform inference as shown below:

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load the processor and model
model_name = "Mouwiya/BLIP_image_captioning"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

# Fetch an example image and prepare it for the model
image_url = "URL_OF_THE_IMAGE"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content)).convert("RGB")

# Generate and decode a caption
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
```
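
As an alternative to the explicit processor/model calls above, the checkpoint should also work through the high-level `pipeline` API; this is a generic `transformers` usage pattern matching the `image-to-text` tag in this card's metadata, not a model-specific interface.

```python
from transformers import pipeline

# The task name matches the pipeline_tag declared in this card's metadata
captioner = pipeline("image-to-text", model="Mouwiya/BLIP_image_captioning")

# Accepts an image URL, a local file path, or a PIL.Image
result = captioner("URL_OF_THE_IMAGE")
print(result[0]["generated_text"])
```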
## Evaluation

The model was evaluated on a subset of the "image-in-Words400" dataset using the BLEU score. The evaluation results are as follows:

- **Average BLEU Score**: 0.35

This score indicates the model's ability to generate captions that closely match the reference descriptions in terms of overlapping n-grams.
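
The evaluation script itself is not included in this card; a comparable BLEU computation could be sketched with the Hugging Face `evaluate` library as follows, where `generated_captions` and `reference_captions` are placeholders for the model outputs and ground-truth captions of the held-out subset.

```python
import evaluate

bleu = evaluate.load("bleu")

# Placeholder data: one generated caption and its list of reference captions
generated_captions = ["a dog running on the beach"]
reference_captions = [["a dog runs along the beach"]]

results = bleu.compute(predictions=generated_captions, references=reference_captions)
print(results["bleu"])
```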
## Limitations

- **Dataset Size**: The model was fine-tuned on a relatively small subset of the dataset, which may limit its generalization capabilities.
- **Domain Specificity**: The model was trained on a specific dataset and may not perform as well on images from different domains.

## Contact

**Mouwiya S. A. Al-Qaisieh**

[email protected]