---
title: "Document QA System"
emoji: "📄"
colorFrom: "blue"
colorTo: "indigo"
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
- sentence-transformers/all-mpnet-base-v2
tags:
- question-answering
- gradio
- LLM
- document-processing
---
# Document QA System
A document question-answering (RAG) system with a Gradio interface, packaged for deployment with Docker.
## Features
- **Document Indexing**: Efficiently processes and indexes documents for quick retrieval.
- **Interactive Interface**: Provides a user-friendly interface for querying documents.
- **Dockerization**: Easy to build and deploy using Docker.
## Technologies
- Data source
  - The [Few-NERD paper](https://arxiv.org/pdf/2105.07464), located in the `data` directory, is used as the data source for indexing.
- Chunking
  - Documents are chunked semantically using embeddings from [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- LLM
  - The system uses [Cohere Command R](https://cohere.com/command) to generate responses
- Retriever, Reranker
  - [Cohere Command R](https://cohere.com/command) is also used for retrieval and reranking
- UI
- The user interface is built with Gradio
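
The chunking step in the pipeline above can be illustrated in plain Python. The actual project relies on all-mpnet-base-v2 embeddings (a semantic splitter), but the underlying sliding-window idea looks roughly like this; the function name and the `chunk_size`/`overlap` values are illustrative, not taken from the project:

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows (illustrative sketch).

    Consecutive chunks share `overlap` words so a sentence is not cut off
    without context at a chunk boundary.
    """
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

A semantic chunker differs in that it places boundaries where the embedding similarity between adjacent sentences drops, rather than at fixed word counts.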
## Installation
### Prerequisites
1. **Docker**:
- [Install Docker](https://docs.docker.com/get-docker/)
2. **Set the paths to the data and index directories**:
- Update the variables in `utils/constant.py`.
3. **Set the API keys for [Cohere Command R](https://dashboard.cohere.com/api-keys) and [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)**:
   - Update `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` in the `configure_settings` function in `utils/settings.py`.
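
Hard-coding keys in `utils/settings.py` works, but reading them from environment variables keeps secrets out of the repository. A minimal sketch (the variable names `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` come from the project; the helper function itself is hypothetical):

```python
import os

def load_api_keys() -> dict[str, str]:
    """Read the required API keys from the environment, failing fast if any is missing."""
    keys = {}
    for name in ("CO_API_KEY", "LLAMA_CLOUD_API_KEY"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing environment variable: {name}")
        keys[name] = value
    return keys
```

With this in place, `configure_settings` can call `load_api_keys()` instead of embedding literals, and the keys can be supplied at runtime, e.g. `export CO_API_KEY=... LLAMA_CLOUD_API_KEY=...` or via `docker run -e`.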
### Using Docker
1. **Clone the Repository**:
```bash
git clone <repository-url>
cd <repository-folder>
```
2. **Build the Docker Image**:
```bash
docker build -t doc-qa-system .
```
3. **Run the Docker Container**:
```bash
docker run -p 7860:7860 doc-qa-system
```
4. **Access the Interface**:
Open your browser and go to `http://localhost:7860`.
### Using Python
1. **Clone the Repository**:
```bash
git clone <repository-url>
cd <repository-folder>
```
2. **Install Dependencies**:
```bash
pip install -r requirements.txt
```
3. **Index the data**:
```bash
python index.py
```
4. **Run the Application**:
```bash
python app.py
```
## Project structure
```bash
├── app.py              # Gradio application
├── main.py             # Main script for answering queries
├── utils/              # Utility functions and helpers
│   ├── constant.py     # Constant values used in the project
│   ├── index.py        # Handles document indexing
│   ├── retriever.py    # Retrieves and ranks documents
│   ├── settings.py     # Configuration settings
├── data/               # Directory containing documents to be indexed
├── index/              # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   ├── index_store.json
├── requirements.txt    # Python dependencies
├── Dockerfile          # Docker configuration
├── README.md           # Project documentation
```
## Example questions
- What is Few-NERD?
- What is the Few-NERD dataset used for?
- What are the NER types in the dataset?
- What role does "transfer learning" play in the proposed few-shot learning system?
- What metric does the paper use to evaluate the effectiveness of the few-shot model?