---
title: "Document QA System"
emoji: "📄"
colorFrom: "blue"
colorTo: "indigo"
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
  - sentence-transformers/all-mpnet-base-v2
tags:
  - question-answering
  - gradio
  - LLM
  - document-processing
---

# Document QA System

A document question-answering system with a Gradio interface and Docker-based deployment.

## Features

- **Document Indexing**: Efficiently processes and indexes documents for quick retrieval.
- **Interactive Interface**: Provides a user-friendly interface for querying documents.
- **Dockerization**: Easy to build and deploy using Docker.

## Technologies

- Data source
   - The [Few-NERD dataset paper](https://arxiv.org/pdf/2105.07464), located in the data directory, is the data source for indexing.
- Chunking
   - Document chunking is handled with [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
- LLM
   - [Cohere Command R](https://cohere.com/command) generates the responses.
- Retriever, Reranker
   - [Cohere Command R](https://cohere.com/command) is used.
- UI
   - The user interface is built with Gradio.
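
To illustrate how the chunking and retrieval stages listed above fit together, here is a toy sketch in plain Python. It is not this project's implementation: the bag-of-words `embed` function is a stand-in for a real embedding model, and the names `embed`, `cosine`, `chunk`, and `retrieve` are hypothetical. Reranking and answer generation are left out.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text, size=8):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(query, chunks, top_k=2):
    """Return the `top_k` chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]
```

In a real pipeline the retrieved chunks would be passed to a reranker and then to the LLM as context for the answer.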

## Installation

### Prerequisites

1. **Docker**:

   - [Install Docker](https://docs.docker.com/get-docker/)

2. **Set the paths to the data and index directories**:

   - Update the variables in `utils/constant.py`.

3. **Set the API keys for [Cohere Command R](https://dashboard.cohere.com/api-keys) and [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)**:

   - Update `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` in the `configure_settings` function in `utils/settings.py`.
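
If you would rather not hardcode the keys in source, one option is to read them from environment variables. The sketch below assumes the keys are exported under the same names as above. `load_api_keys` is a hypothetical helper, not part of this repository, and how `configure_settings` would consume its result is left open.

```python
import os

# Names taken from the step above; both keys are required.
REQUIRED_KEYS = ("CO_API_KEY", "LLAMA_CLOUD_API_KEY")


def load_api_keys(env=None):
    """Return the required API keys from `env` (defaults to os.environ).

    Raises RuntimeError naming every key that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Failing early with the names of the missing keys makes a misconfigured container easier to diagnose than an authentication error at query time.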

### Using Docker

1. **Clone the Repository**:

   ```bash
   git clone <repository-url>
   cd <repository-folder>
   ```

2. **Build the Docker Image**:

   ```bash
   docker build -t doc-qa-system .
   ```

3. **Run the Docker Container**:

   ```bash
   docker run -p 7860:7860 doc-qa-system
   ```

4. **Access the Interface**:

   Open your browser and go to `http://localhost:7860`.

### Using Python

1. **Clone the Repository**:

   ```bash
   git clone <repository-url>
   cd <repository-folder>
   ```

2. **Install Dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

3. **Index the Data**:

   ```bash
   python index.py
   ```

4. **Run the Application**:

   ```bash
   python app.py
   ```

## Project structure
```bash
├── app.py                   # Gradio application
├── main.py                  # Main script for answering queries
├── utils/                   # Utility functions and helpers
│   ├── constant.py          # Constant values used in the project
│   ├── index.py             # Handles document indexing
│   ├── retriever.py         # Retrieves and ranks documents
│   ├── settings.py          # Configuration settings
├── data/                    # Directory containing documents to be indexed
├── index/                   # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   ├── index_store.json
├── requirements.txt         # Python dependencies
├── Dockerfile               # Docker configuration
├── README.md                # Project documentation
```

## Example questions

- What is Few-NERD?
- What is the Few-NERD dataset used for?
- What are the NER types in the dataset?
- What role does "transfer learning" play in the proposed few-shot learning system?
- What metric does the paper use to evaluate the effectiveness of the few-shot model?