---
title: "Document QA System"
emoji: "📄"
colorFrom: "blue"
colorTo: "indigo"
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
- sentence-transformers/all-mpnet-base-v2
tags:
- question-answering
- gradio
- LLM
- document-processing
---
# Document QA System
A document question-answering (RAG) system with a Gradio interface, packaged for deployment with Docker.
## Features
- **Document Indexing**: Efficiently processes and indexes documents for quick retrieval.
- **Interactive Interface**: Provides a user-friendly interface for querying documents.
- **Dockerization**: Easy to build and deploy using Docker.
## Technologies
- Data source
  - The [Few-NERD paper](https://arxiv.org/pdf/2105.07464), located in the `data` directory, is used as the data source for indexing.
- Chunking
  - Documents are chunked semantically using embeddings from [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- LLM
  - The system uses [Cohere Command R](https://cohere.com/command) to generate responses
- Retriever, Reranker
  - [Cohere Command R](https://cohere.com/command) is also used for retrieval and reranking
- UI
- The user interface is built with Gradio
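
The chunking step in the pipeline above can be illustrated in plain Python. The actual project relies on all-mpnet-base-v2 embeddings (a semantic splitter), but the underlying sliding-window idea looks roughly like this; the function name and the `chunk_size`/`overlap` values are illustrative, not taken from the project:

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows (illustrative sketch).

    Consecutive chunks share `overlap` words so a sentence is not cut off
    without context at a chunk boundary.
    """
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

A semantic chunker differs in that it places boundaries where the embedding similarity between adjacent sentences drops, rather than at fixed word counts.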
## Installation
### Prerequisites
1. **Docker**:
- [Install Docker](https://docs.docker.com/get-docker/)
2. **Set the paths to the data and index directories**:
- Update the variables in `utils/constant.py`.
3. **Set the API keys for [Cohere Command R](https://dashboard.cohere.com/api-keys) and [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)**:
   - Update `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` in the `configure_settings` function in `utils/settings.py`.
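
Hard-coding keys in `utils/settings.py` works, but reading them from environment variables keeps secrets out of the repository. A minimal sketch (the variable names `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` come from the project; the helper function itself is hypothetical):

```python
import os

def load_api_keys() -> dict[str, str]:
    """Read the required API keys from the environment, failing fast if any is missing."""
    keys = {}
    for name in ("CO_API_KEY", "LLAMA_CLOUD_API_KEY"):
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing environment variable: {name}")
        keys[name] = value
    return keys
```

With this in place, `configure_settings` can call `load_api_keys()` instead of embedding literals, and the keys can be supplied at runtime, e.g. `export CO_API_KEY=... LLAMA_CLOUD_API_KEY=...` or via `docker run -e`.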
### Using Docker
1. **Clone the Repository**:
```bash
git clone <repository-url>
cd <repository-folder>
```
2. **Build the Docker Image**:
```bash
docker build -t doc-qa-system .
```
3. **Run the Docker Container**:
```bash
docker run -p 7860:7860 doc-qa-system
```
4. **Access the Interface**:
Open your browser and go to `http://localhost:7860`.
### Using Python
1. **Clone the Repository**:
```bash
git clone <repository-url>
cd <repository-folder>
```
2. **Install Dependencies**:
```bash
pip install -r requirements.txt
```
3. **Index the data**:
```bash
python index.py
```
4. **Run the Application**:
```bash
python app.py
```
## Project structure
```bash
├── app.py              # Gradio application
├── main.py             # Main script for answering queries
├── utils/              # Utility functions and helpers
│   ├── constant.py     # Constant values used in the project
│   ├── index.py        # Handles document indexing
│   ├── retriever.py    # Retrieves and ranks documents
│   ├── settings.py     # Configuration settings
├── data/               # Directory containing documents to be indexed
├── index/              # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   ├── index_store.json
├── requirements.txt    # Python dependencies
├── Dockerfile          # Docker configuration
├── README.md           # Project documentation
```
## Example questions
- What is Few-NERD?
- What is the Few-NERD dataset used for?
- What are the NER types in the dataset?
- What role does "transfer learning" play in the proposed few-shot learning system?
- What metric does the paper use to evaluate the effectiveness of the few-shot model?