---
title: "Document QA System"
emoji: "๐"
colorFrom: "blue"
colorTo: "indigo"
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
- sentence-transformers/all-mpnet-base-v2
tags:
- question-answering
- gradio
- LLM
- document-processing
---

# Document QA System

A document question-answering system that uses Gradio for the interface and Docker for deployment.

## Features

- **Document Indexing**: Processes and indexes documents for fast retrieval.
- **Interactive Interface**: Provides a Gradio web interface for querying documents.
- **Dockerization**: Builds and deploys with Docker.

## Technologies

- Data source
  - The [paper about the Few-NERD dataset](https://arxiv.org/pdf/2105.07464), located in the data directory, is used as the data source for indexing.
- Chunking
  - Document chunking is handled with the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) sentence-embedding model.
- LLM
  - The system uses [Cohere Command R](https://cohere.com/command) to generate responses.
- Retriever, Reranker
  - [Cohere Command R](https://cohere.com/command) is used for retrieval and reranking.
- UI
  - The user interface is built with Gradio.
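
The components above form a retrieve-then-generate pipeline: documents are split into chunks, the chunks closest to a query are retrieved, and the best candidates are passed to the LLM. The retrieval step can be sketched as follows. This is a toy illustration only: it assumes bag-of-words vectors in place of all-mpnet-base-v2 embeddings and cosine similarity as the score, and the names `embed`, `cosine`, and `retrieve` are hypothetical, not taken from this repository.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical chunks, written only for this illustration.
chunks = [
    "few-nerd is a few-shot named entity recognition dataset",
    "gradio builds web interfaces for machine learning demos",
    "docker packages applications into containers",
]
top = retrieve("what is the few-nerd dataset", chunks, top_k=1)
print(top)
```

In the real system a reranker would reorder the retrieved candidates before they reach the LLM; the sketch stops at the first retrieval pass.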

## Installation

### Prerequisites

1. **Docker**:
   - [Install Docker](https://docs.docker.com/get-docker/)
2. **Set the paths to the data directory and the index directory**:
   - Update the variables in `utils/constant.py`.
3. **Set the API keys for [Cohere Command R](https://dashboard.cohere.com/api-keys) and [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)**:
   - Update `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` in the `configure_settings` function in `utils/settings.py`.
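
If you prefer not to write the keys directly into a source file, one common alternative is to read them from environment variables. A minimal sketch, assuming the keys are exported under the same names, `CO_API_KEY` and `LLAMA_CLOUD_API_KEY`. The helper `load_api_keys` is hypothetical and is not part of this repository; `configure_settings` would need to be adapted to call something like it.

```python
import os

# Names of the keys mentioned in the prerequisites above.
REQUIRED_KEYS = ("CO_API_KEY", "LLAMA_CLOUD_API_KEY")


def load_api_keys(env=None) -> dict[str, str]:
    """Read the required API keys from a mapping (defaults to os.environ).

    Raises KeyError naming every missing key so setup errors surface early.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing API keys: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_KEYS}
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.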

### Using Docker

1. **Clone the Repository**:
   ```bash
   git clone <repository-url>
   cd <repository-folder>
   ```
2. **Build the Docker Image**:
   ```bash
   docker build -t doc-qa-system .
   ```
3. **Run the Docker Container**:
   ```bash
   docker run -p 7860:7860 doc-qa-system
   ```
4. **Access the Interface**:
   Open your browser and go to `http://localhost:7860`.
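
Before opening the browser, it can help to confirm that the container is accepting connections on port 7860. A minimal sketch using only the Python standard library. The helper name `wait_for_port` and the timeout values are assumptions chosen for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not ready yet (for example, connection refused); retry shortly.
            time.sleep(0.5)
    return False
```

Calling `wait_for_port("localhost", 7860)` returns `True` once the port accepts connections and `False` if nothing is listening when the timeout expires. It only checks that the port is open, not that the Gradio app has finished loading.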

### Using Python

1. **Clone the Repository**:
   ```bash
   git clone <repository-url>
   cd <repository-folder>
   ```
2. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
3. **Index the Data**:
   ```bash
   python index.py
   ```
4. **Run the Application**:
   ```bash
   python app.py
   ```

## Project structure

```bash
├── app.py                  # Gradio application
├── main.py                 # Main script for answering queries
├── utils/                  # Utility functions and helpers
│   ├── constant.py         # Constant values used in the project
│   ├── index.py            # Handles document indexing
│   ├── retriever.py        # Retrieves and ranks documents
│   └── settings.py         # Configuration settings
├── data/                   # Directory containing documents to be indexed
├── index/                  # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   └── index_store.json
├── requirements.txt        # Python dependencies
├── Dockerfile              # Docker configuration
└── README.md               # Project documentation
```
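
After indexing, the `index/` directory should contain the five JSON files listed above. A short sketch that reports which of them are missing, assuming the index lives in a directory named `index` relative to the working directory (the actual location is whatever `utils/constant.py` points to). The helper `missing_index_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names taken from the project structure listing.
EXPECTED_INDEX_FILES = (
    "default__vector_store.json",
    "docstore.json",
    "graph_store.json",
    "image__vector_store.json",
    "index_store.json",
)


def missing_index_files(index_dir: str = "index") -> list[str]:
    """Return the expected index files that are absent from index_dir."""
    root = Path(index_dir)
    return [name for name in EXPECTED_INDEX_FILES if not (root / name).is_file()]
```

An empty list means every expected file is present. A non-empty list suggests the indexing step has not been run or wrote its output elsewhere.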

## Example questions

- What is Few-NERD?
- What is the Few-NERD dataset used for?
- What are the NER types in the dataset?
- What role does "transfer learning" play in the proposed few-shot learning system?
- What metric does the paper use to evaluate the effectiveness of the few-shot model?