metadata

title: Document QA System
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
  - sentence-transformers/all-mpnet-base-v2
tags:
  - question-answering
  - gradio
  - LLM
  - document-processing

Document QA System

A document question-answering system that uses Gradio for the interface and Docker for deployment.

Features

Document Indexing: Efficiently processes and indexes documents for quick retrieval.
Interactive Interface: Provides a user-friendly interface for querying documents.
Dockerization: Easy to build and deploy using Docker.

Technologies

Data source
- The paper on the Few-NERD dataset, located in the data directory, is used as the data source for indexing.
Chunking
- Document chunking uses the sentence-transformers/all-mpnet-base-v2 embedding model.
LLM
- The system uses Cohere Command R to generate responses.
Retriever, Reranker
- Cohere Command R is also used for retrieval and reranking.
UI
- The user interface is built with Gradio
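The components above form a chunk, embed, retrieve, generate pipeline. The following is a minimal, self-contained sketch of that flow and not the project's actual code: `chunk_text`, `embed`, `retrieve`, and `build_prompt` are hypothetical names, the bag-of-words vector is a toy stand-in for all-mpnet-base-v2 embeddings, and the reranking step and the call to Cohere Command R are omitted.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split text into chunks of at most max_words words (naive stand-in for the real chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; the real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM (the LLM call itself is omitted)."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


if __name__ == "__main__":
    chunks = [
        "Few-NERD is a few-shot named entity recognition dataset.",
        "Gradio provides the web interface.",
        "Docker is used for deployment.",
    ]
    question = "What is Few-NERD?"
    print(build_prompt(question, retrieve(question, chunks, top_k=1)))
```

In the real system, the retrieved chunks would pass through the reranker before being handed to the LLM.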

Installation

Prerequisites

Docker:
- Install Docker
Set the paths to the data and index directories:
- Update the variables in utils/constant.py.
Set the API keys for Cohere Command R and LlamaParse:
- Set CO_API_KEY and LLAMA_CLOUD_API_KEY in the configure_settings function in utils/settings.py.
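As an illustration of the configuration described above, here is a hedged sketch of one way to define the paths and load the two keys. It is an assumption, not the repository's code: `DATA_DIR`, `INDEX_DIR`, and `load_api_keys` are hypothetical names, and reading the keys from environment variables is an alternative to editing them directly into `configure_settings`.

```python
import os
from pathlib import Path

# Hypothetical constant names; the actual variables in utils/constant.py may differ.
DATA_DIR = Path("data")
INDEX_DIR = Path("index")

REQUIRED_KEYS = ("CO_API_KEY", "LLAMA_CLOUD_API_KEY")


def load_api_keys(env=None) -> dict[str, str]:
    """Read both API keys from a mapping (os.environ by default) and fail early if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

With this approach, the keys could also be passed to the container at run time using Docker's `-e` option, for example `docker run -e CO_API_KEY=... -e LLAMA_CLOUD_API_KEY=... -p 7860:7860 doc-qa-system`, provided the application is adapted to read them from the environment.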

Using Docker

Clone the Repository:

git clone <repository-url>
cd <repository-folder>

Build the Docker Image:
```
docker build -t doc-qa-system .
```
Run the Docker Container:
```
docker run -p 7860:7860 doc-qa-system
```
Access the Interface:

Open your browser and go to http://localhost:7860.

Using Python

Clone the Repository:

git clone <repository-url>
cd <repository-folder>

Install Dependencies:
```
pip install -r requirements.txt
```
Index the Data:
```
python index.py
```
Run the Application:
```
python app.py
```

Project structure

├── app.py                   # Gradio application
├── main.py                  # Main script for answering queries
├── utils/                   # Utility functions and helpers
│   ├── constant.py          # Constant values used in the project
│   ├── index.py             # Handles document indexing
│   ├── retriever.py         # Retrieves and ranks documents
│   └── settings.py          # Configuration settings
├── data/                    # Directory containing documents to be indexed
├── index/                   # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   └── index_store.json
├── requirements.txt         # Python dependencies
├── Dockerfile               # Docker configuration
└── README.md                # Project documentation
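Before launching the application, it can help to confirm that indexing produced the files listed under index/. The sketch below is illustrative and not part of the repository: `missing_index_files` is a hypothetical helper, the file names are taken from the listing above, and the default `index` path is an assumption about where the index directory lives relative to the working directory.

```python
from pathlib import Path

# File names taken from the project structure listing.
EXPECTED_INDEX_FILES = (
    "default__vector_store.json",
    "docstore.json",
    "graph_store.json",
    "image__vector_store.json",
    "index_store.json",
)


def missing_index_files(index_dir: str = "index") -> list[str]:
    """Return the expected index files that are not present in index_dir."""
    root = Path(index_dir)
    return [name for name in EXPECTED_INDEX_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_index_files()
    if missing:
        print("Index incomplete, missing:", ", ".join(missing))
    else:
        print("Index looks complete.")
```

If any files are reported missing, rerunning the indexing step should regenerate them.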

Example questions

What is Few-NERD?
What is the Few-NERD dataset used for?
What are the NER types in the dataset?
What role does "transfer learning" play in the proposed few-shot learning system?
What metric does the paper use to evaluate the effectiveness of the few-shot model?