---
title: "Paper-based RAG"
emoji: "📄"
colorFrom: "blue"
colorTo: "indigo"
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
python_version: 3.11.0
models:
  - sentence-transformers/all-mpnet-base-v2
tags:
  - question-answering
  - gradio
  - LLM
  - document-processing
---

# Document QA System

A document question-answering system that uses LlamaIndex for document indexing, retrieval, and response generation, and Gradio for the user interface.

## Technologies

- **Data source** - a [paper about BERT](https://arxiv.org/pdf/1810.04805), located in the `data` directory, is used as the default data source for indexing.
- **Chunking and embeddings** - document chunks are embedded with [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
- **LLM** - the system uses `gpt-4o-mini` to generate responses.
- **Retriever and reranker** - `gpt-4o-mini` is also used for retrieval and reranking.
- **UI** - the user interface is built with Gradio.

## Installation

### Prerequisites

1. **Docker**:
   - [Install Docker](https://docs.docker.com/get-docker/)
2. **API keys**:
   - [OpenAI](https://platform.openai.com/api-keys)
   - [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)

### Using Hugging Face Spaces

1. Follow the link to [paper-based-rag](https://huggingface.co/spaces/Gepe55o/paper_based_rag) on Spaces.
2. Upload your paper for indexing, or use the default [paper](https://arxiv.org/pdf/1810.04805) about BERT.

### Using Docker

1. **Build the Docker image**:
   ```bash
   docker build -t doc-qa-system .
   ```
2. **Run the Docker container**:
   ```bash
   docker run -p 7860:7860 doc-qa-system
   ```
3. **Access the interface**:
   - Open your browser and go to `http://localhost:7860`.

### Using Python

1. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
2. **Add a paper to the data directory**:
   - Add the paper you want to index to the `data` directory, or use the default [paper](https://arxiv.org/pdf/1810.04805) about BERT.
3. **Index the data**:
   ```bash
   python index.py
   ```
4. **Run the application**:
   ```bash
   python app.py
   ```

## Project structure

```bash
├── app.py              # Gradio application
├── main.py             # Main script for answering queries
├── utils/              # Utility functions and helpers
│   ├── constant.py     # Constant values used in the project
│   ├── index.py        # Handles document indexing
│   ├── retriever.py    # Retrieves and ranks documents
│   ├── settings.py     # Configuration settings
├── data/               # Directory containing documents to be indexed
├── index/              # Stores the generated index files
│   ├── default__vector_store.json
│   ├── docstore.json
│   ├── graph_store.json
│   ├── image__vector_store.json
│   ├── index_store.json
├── requirements.txt    # Python dependencies
├── Dockerfile          # Docker configuration
├── README.md           # Project documentation
```

## Example questions

- What is the pre-training procedure for BERT, and how does it differ from traditional supervised learning?
- Can you describe how BERT can be fine-tuned for tasks like question answering or sentiment analysis?
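## How it works (sketch)

The indexing step splits each document into chunks before embedding them with the mpnet model. The actual splitter lives in `utils/index.py` (its internals are not shown here), but a typical overlapping-window chunker can be sketched in plain Python; the function name and the chunk/overlap sizes below are illustrative, not the project's real defaults:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows.

    Each window holds `chunk_size` words and shares `overlap` words with
    the previous one, so sentences cut at a boundary still appear whole
    in at least one chunk. `overlap` must be smaller than `chunk_size`.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the end of the text
    return chunks
```

For example, `chunk_text("a b c d e", chunk_size=3, overlap=1)` returns `["a b c", "c d e"]`: the shared word `c` is the overlap between the two windows.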
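At query time, the retriever ranks the stored chunks by how similar their embeddings are to the question's embedding. The project delegates this to LlamaIndex and the mpnet embedding model; the plain-Python sketch below only illustrates the underlying cosine-similarity ranking, with toy vectors in place of real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

For a query vector `[1, 0]` and chunk vectors `[[0, 1], [1, 0], [1, 1]]`, `top_k` returns `[1, 2]`: the identical vector first, the 45° one second. In the real pipeline the retrieved chunks are then passed to `gpt-4o-mini` as context for the answer.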