--- title: "Document QA System" emoji: "📄" colorFrom: "blue" colorTo: "indigo" sdk: gradio sdk_version: 5.8.0 app_file: app.py python_version: 3.11.0 models: - sentence-transformers/all-mpnet-base-v2 tags: - question-answering - gradio - LLM - document-processing --- # Document QA System Document Question-Answering system that utilizes Gradio for the interface and Docker for deployment. ## Features - **Document Indexing**: Efficiently processes and indexes documents for quick retrieval. - **Interactive Interface**: Provides a user-friendly interface for querying documents. - **Dockerization**: Easy to build and deploy using Docker. ## Technologies - Data source - [Paper about Few-NERD dataset](https://arxiv.org/pdf/2105.07464) located in the data directory are used as the data source for indexing. - Chunking - Document chunking is handled by [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) - LLM - The system utilizes the [Cohere Command R](https://cohere.com/command) for generating responses - Retriever, Reranker - [Cohere Command R](https://cohere.com/command) is used - UI - The user interface is built with Gradio ## Installation ### Prerequisites 1. **Docker**: - [Install Docker](https://docs.docker.com/get-docker/) 2. **Set path to the data directory, index directory**: - Update the variables in `utils/constant.py`. 3. **Set the API key for [Cohere Command](https://dashboard.cohere.com/api-keys) R and [LLamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)**: - Update the `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` in `utils/settings.py` in function `configure_settings`. ### Using Docker 1. **Clone the Repository**: ```bash git clone cd ``` 2. **Build the Docker Image**: ```bash docker build -t doc-qa-system . ``` 3. **Run the Docker Container**: ```bash docker run -p 7860:7860 doc-qa-system ``` 4. **Access the Interface**: Open your browser and go to `http://localhost:7860`. ### Using Python 1. **Clone the Repository**: ```bash git clone cd ``` 2. **Install Dependencies**: ```bash pip install -r requirements.txt ``` 3. **Run indexing data**: ```bash python index.py ``` 4. **Run the Application**: ```bash python app.py ``` ## Project structure ```bash ├── app.py # Gradio application ├── main.py # Main script for answering queries ├── utils/ # Utility functions and helpers │ ├── constant.py # Constant values used in the project │ ├── index.py # Handles document indexing │ ├── retriever.py # Retrieves and ranks documents │ ├── settings.py # Configuration settings ├── data/ # Directory containing documents to be indexed ├── index/ # Stores the generated index files │ ├── default__vector_store.json │ ├── docstore.json │ ├── graph_store.json │ ├── image__vector_store.json │ ├── index_store.json ├── requirements.txt # Python dependencies ├── Dockerfile # Docker configuration ├── README.md # Project documentation ``` ## Example questions - What is Few-NERD? - What is the Few-NERD dataset used for? - What are NER types in dataset? - What role does "transfer learning" play in the proposed few-shot learning system? - What metric does the paper use to evaluate the effectiveness of the few-shot model?