---
title: "Document QA System"
emoji: "📄"
colorFrom: "blue"
colorTo: "indigo"
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
  - sentence-transformers/all-mpnet-base-v2
tags:
  - question-answering
  - gradio
  - LLM
  - document-processing
---

# Document QA System

A document question-answering system that uses Gradio for the interface and Docker for deployment.

## Features

- **Document Indexing**: Efficiently processes and indexes documents for quick retrieval.
- **Interactive Interface**: Provides a user-friendly interface for querying documents.
- **Dockerization**: Easy to build and deploy using Docker.

## Technologies

- Data source
   - The [Few-NERD dataset paper](https://arxiv.org/pdf/2105.07464), located in the data directory, is used as the data source for indexing.
- Chunking
   - Document chunking uses the [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) sentence-embedding model.
- LLM
   - The system uses [Cohere Command R](https://cohere.com/command) to generate responses.
- Retriever, Reranker
   - [Cohere Command R](https://cohere.com/command) is also used for retrieval and reranking.
- UI
   - The user interface is built with Gradio.
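
The components above fit together as a retrieve, rerank, then generate flow. The sketch below is a toy, dependency-free illustration of that flow, not this project's code: the word-overlap scorers stand in for the embedding model and the Cohere reranker, and every function name in it (`retrieve`, `rerank`, `build_prompt`) is hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Stand-in retriever: rank chunks by bag-of-words cosine similarity."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:top_k]


def rerank(query: str, candidates: list[str], top_n: int = 1) -> list[str]:
    """Stand-in reranker: order candidates by the share of query words they contain."""
    q = set(tokenize(query))

    def coverage(chunk: str) -> float:
        return len(q & set(tokenize(chunk))) / len(q) if q else 0.0

    return sorted(candidates, key=coverage, reverse=True)[:top_n]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    chunks = [
        "Few-NERD is a few-shot named entity recognition dataset.",
        "Gradio builds web interfaces for machine learning demos.",
        "Docker packages applications into containers.",
    ]
    question = "What is Few-NERD?"
    best = rerank(question, retrieve(question, chunks, top_k=2))
    print(build_prompt(question, best))
```

In the real system the final prompt would go to the LLM; here it is only printed so the data flow stays visible.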

## Installation

### Prerequisites

1. **Docker**:

   - [Install Docker](https://docs.docker.com/get-docker/)

2. **Set the paths to the data and index directories**:

   - Update the variables in `utils/constant.py`.

3. **Set the API keys for [Cohere Command R](https://dashboard.cohere.com/api-keys) and [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)**:

   - Update `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` in the `configure_settings` function in `utils/settings.py`.
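
As an alternative to writing keys into source files, the values from steps 2 and 3 can be read from the environment. This is a minimal sketch, not the contents of `utils/constant.py` or `utils/settings.py`: `DATA_DIR`, `INDEX_DIR`, `REQUIRED_KEYS`, and `load_api_keys` are hypothetical names, and only the key names `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` come from this README.

```python
import os
from collections.abc import Mapping
from pathlib import Path
from typing import Optional

# Hypothetical path constants; the real variable names live in utils/constant.py.
DATA_DIR = Path("data")
INDEX_DIR = Path("index")

# Key names taken from this README.
REQUIRED_KEYS = ("CO_API_KEY", "LLAMA_CLOUD_API_KEY")


def load_api_keys(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the required API keys, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Failing early with the list of missing names makes a misconfigured container easier to diagnose than an authentication error at query time.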

### Using Docker

1. **Clone the Repository**:

   ```bash
   git clone <repository-url>
   cd <repository-folder>
   ```

2. **Build the Docker Image**:

   ```bash
   docker build -t doc-qa-system .
   ```

3. **Run the Docker Container**:

   ```bash
   docker run -p 7860:7860 doc-qa-system
   ```

4. **Access the Interface**:

   Open your browser and go to `http://localhost:7860`.
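
Once the container is running, a short scripted check can confirm that something is answering on the published port. This is a minimal sketch using only the standard library; `is_up` is a hypothetical helper name, and the default URL is the one given in step 4.

```python
import urllib.error
import urllib.request


def is_up(url: str = "http://localhost:7860", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `is_up()` with no arguments checks the default address; a `False` result usually means the container is still starting or the port mapping is missing.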

### Using Python

1. **Clone the Repository**:

   ```bash
   git clone <repository-url>
   cd <repository-folder>
   ```

2. **Install Dependencies**:
   
   ```bash
   pip install -r requirements.txt
   ```

3. **Index the Data**:

   ```bash
   python index.py
   ```

4. **Run the Application**:

   ```bash
   python app.py
   ```
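
For orientation, an entry point like `app.py` might wire a question-answering function into Gradio roughly as follows. This is a sketch under assumptions, not the repository's actual `app.py`: `answer_question`, `build_demo`, and `main` are hypothetical names, `answer_question` is a placeholder for the real pipeline, and the port matches the `7860` used in the Docker instructions.

```python
def answer_question(question: str) -> str:
    """Hypothetical stand-in for the real retrieval and generation pipeline."""
    question = question.strip()
    if not question:
        return "Please enter a question."
    return f"(placeholder answer for: {question})"


def build_demo():
    """Build a simple text-in, text-out Gradio interface."""
    import gradio as gr  # imported lazily so answer_question works without Gradio

    return gr.Interface(
        fn=answer_question,
        inputs=gr.Textbox(label="Question"),
        outputs=gr.Textbox(label="Answer"),
        title="Document QA System",
    )


def main() -> None:
    """Serve the interface on all network interfaces, port 7860."""
    build_demo().launch(server_name="0.0.0.0", server_port=7860)
```

Calling `main()` starts the server. Binding to `0.0.0.0` rather than the default loopback address is what lets a port mapping such as `-p 7860:7860` reach the app inside a container.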

## Project structure

```text
├── app.py                   # Gradio application
├── main.py                  # Main script for answering queries
├── utils/                   # Utility functions and helpers
│   ├── constant.py          # Constant values used in the project
│   ├── index.py             # Handles document indexing
│   ├── retriever.py         # Retrieves and ranks documents
│   └── settings.py          # Configuration settings
├── data/                    # Directory containing documents to be indexed
├── index/                   # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   └── index_store.json
├── requirements.txt         # Python dependencies
├── Dockerfile               # Docker configuration
└── README.md                # Project documentation
```

## Example questions

- What is Few-NERD?
- What is the Few-NERD dataset used for?
- What are the NER types in the dataset?
- What role does "transfer learning" play in the proposed few-shot learning system?
- What metric does the paper use to evaluate the effectiveness of the few-shot model?