---
title: "Paper-based RAG"
emoji: "📄"
colorFrom: "blue"
colorTo: "indigo"
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
  - sentence-transformers/all-mpnet-base-v2
tags:
  - question-answering
  - gradio
  - LLM
  - document-processing
---

# Document QA System

A document question-answering system that uses LlamaIndex for document indexing, retrieval, and response generation, and Gradio for the user interface.

## Technologies

- **Data source** - a [paper about BERT](https://arxiv.org/pdf/1810.04805), located in the `data` directory, is used as the default data source for indexing.
- **Chunking and embeddings** - document chunks are embedded with [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
- **LLM** - the system uses `gpt-4o-mini` to generate responses.
- **Retriever and reranker** - `gpt-4o-mini` is also used for retrieval and reranking.
- **UI** - the user interface is built with Gradio.

## Installation

### Prerequisites

1. **Docker**:
   - [Install Docker](https://docs.docker.com/get-docker/)
2. **API keys**:
   - [OpenAI](https://platform.openai.com/api-keys)
   - [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)

### Using Hugging Face Spaces

1. Follow the link to [paper-based-rag](https://huggingface.co/spaces/Gepe55o/paper_based_rag) on Spaces.
2. Upload your paper for indexing, or use the default [paper](https://arxiv.org/pdf/1810.04805) about BERT.

### Using Docker

1. **Build the Docker image**:
   ```bash
   docker build -t doc-qa-system .
   ```
2. **Run the Docker container**:
   ```bash
   docker run -p 7860:7860 doc-qa-system
   ```
3. **Access the interface**:
   - Open your browser and go to `http://localhost:7860`.

### Using Python

1. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
2. **Add a paper to the data directory**:
   - Add the paper you want to index to the `data` directory, or use the default [paper](https://arxiv.org/pdf/1810.04805) about BERT.
3. **Index the data**:
   ```bash
   python index.py
   ```
4. **Run the application**:
   ```bash
   python app.py
   ```

## Project structure

```bash
├── app.py              # Gradio application
├── main.py             # Main script for answering queries
├── utils/              # Utility functions and helpers
│   ├── constant.py     # Constant values used in the project
│   ├── index.py        # Handles document indexing
│   ├── retriever.py    # Retrieves and ranks documents
│   ├── settings.py     # Configuration settings
├── data/               # Directory containing documents to be indexed
├── index/              # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   ├── index_store.json
├── requirements.txt    # Python dependencies
├── Dockerfile          # Docker configuration
├── README.md           # Project documentation
```

## Example questions

- What is the pre-training procedure for BERT, and how does it differ from traditional supervised learning?
- Can you describe how BERT can be fine-tuned for tasks like question answering or sentiment analysis?
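## How it works (sketch)

The indexing step splits each document into chunks before embedding them with the mpnet model. The actual splitter lives in `utils/index.py` (its internals are not shown here), but a typical overlapping-window chunker can be sketched in plain Python; the function name and the chunk/overlap sizes below are illustrative, not the project's real defaults:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows.

    Each window holds `chunk_size` words and shares `overlap` words with
    the previous one, so sentences cut at a boundary still appear whole
    in at least one chunk. `overlap` must be smaller than `chunk_size`.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the end of the text
    return chunks
```

For example, `chunk_text("a b c d e", chunk_size=3, overlap=1)` returns `["a b c", "c d e"]`: the shared word `c` is the overlap between the two windows.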
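At query time, the retriever ranks the stored chunks by how similar their embeddings are to the question's embedding. The project delegates this to LlamaIndex and the mpnet embedding model; the plain-Python sketch below only illustrates the underlying cosine-similarity ranking, with toy vectors in place of real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

For a query vector `[1, 0]` and chunk vectors `[[0, 1], [1, 0], [1, 1]]`, `top_k` returns `[1, 2]`: the identical vector first, the 45° one second. In the real pipeline the retrieved chunks are then passed to `gpt-4o-mini` as context for the answer.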