Spaces: Sleeping

Юра Цепліцький committed · 693d949
Parent(s): 4cf0e20

Initial commit

Files changed:

- Dockerfile +13 -0
- README.md +132 -7
- __pycache__/constant.cpython-312.pyc +0 -0
- __pycache__/index.cpython-312.pyc +0 -0
- __pycache__/main.cpython-312.pyc +0 -0
- __pycache__/rag.cpython-312.pyc +0 -0
- __pycache__/retriever.cpython-312.pyc +0 -0
- __pycache__/settings.cpython-312.pyc +0 -0
- app.py +43 -0
- data/2105.07464v6.pdf +0 -0
- index/default__vector_store.json +0 -0
- index/docstore.json +0 -0
- index/graph_store.json +1 -0
- index/image__vector_store.json +1 -0
- index/index_store.json +1 -0
- main.py +55 -0
- requirements.txt +0 -0
- utils/__pycache__/constant.cpython-312.pyc +0 -0
- utils/__pycache__/index.cpython-312.pyc +0 -0
- utils/__pycache__/retriever.cpython-312.pyc +0 -0
- utils/__pycache__/settings.cpython-312.pyc +0 -0
- utils/constant.py +7 -0
- utils/index.py +53 -0
- utils/retriever.py +64 -0
- utils/settings.py +44 -0
Dockerfile
ADDED
@@ -0,0 +1,13 @@
```dockerfile
FROM python:3.11.0-slim

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 7860

CMD ["python", "app.py"]
```
README.md
CHANGED
````diff
@@ -1,13 +1,138 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: "Document QA System"
+emoji: "📄"
+colorFrom: "blue"
+colorTo: "indigo"
 sdk: gradio
 sdk_version: 5.8.0
 app_file: app.py
-
-
+python_version: 3.11.0
+models:
+  - sentence-transformers/all-mpnet-base-v2
+tags:
+  - question-answering
+  - gradio
+  - LLM
+  - document-processing
 ---
 
-
+# Document QA System
+
+A Document Question-Answering system that uses Gradio for the interface and Docker for deployment.
+
+## Features
+
+- **Document Indexing**: Efficiently processes and indexes documents for quick retrieval.
+- **Interactive Interface**: Provides a user-friendly interface for querying documents.
+- **Dockerization**: Easy to build and deploy using Docker.
+
+## Technologies
+
+- Data source
+  - The [paper about the Few-NERD dataset](https://arxiv.org/pdf/2105.07464), located in the data directory, is used as the data source for indexing.
+- Chunking
+  - Document chunking is handled by [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
+- LLM
+  - The system uses [Cohere Command R](https://cohere.com/command) for generating responses
+- Retriever, Reranker
+  - [Cohere Command R](https://cohere.com/command) is used
+- UI
+  - The user interface is built with Gradio
+
+## Installation
+
+### Prerequisites
+
+1. **Docker**:
+
+   - [Install Docker](https://docs.docker.com/get-docker/)
+
+2. **Set the paths to the data and index directories**:
+
+   - Update the variables in `utils/constant.py`.
+
+3. **Set the API keys for [Cohere Command R](https://dashboard.cohere.com/api-keys) and [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)**:
+
+   - Update `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` in the `configure_settings` function in `utils/settings.py`.
+
+### Using Docker
+
+1. **Clone the Repository**:
+
+   ```bash
+   git clone <repository-url>
+   cd <repository-folder>
+   ```
+
+2. **Build the Docker Image**:
+
+   ```bash
+   docker build -t doc-qa-system .
+   ```
+
+3. **Run the Docker Container**:
+
+   ```bash
+   docker run -p 7860:7860 doc-qa-system
+   ```
+
+4. **Access the Interface**:
+
+   Open your browser and go to `http://localhost:7860`.
+
+### Using Python
+
+1. **Clone the Repository**:
+
+   ```bash
+   git clone <repository-url>
+   cd <repository-folder>
+   ```
+
+2. **Install Dependencies**:
+
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. **Index the Data**:
+
+   ```bash
+   python index.py
+   ```
+
+4. **Run the Application**:
+
+   ```bash
+   python app.py
+   ```
+
+## Project structure
+
+```bash
+├── app.py               # Gradio application
+├── main.py              # Main script for answering queries
+├── utils/               # Utility functions and helpers
+│   ├── constant.py      # Constant values used in the project
+│   ├── index.py         # Handles document indexing
+│   ├── retriever.py     # Retrieves and ranks documents
+│   ├── settings.py      # Configuration settings
+├── data/                # Directory containing documents to be indexed
+├── index/               # Stores the generated index files
+│   ├── default__vector_store.json
+│   ├── docstore.json
+│   ├── graph_store.json
+│   ├── image__vector_store.json
+│   ├── index_store.json
+├── requirements.txt     # Python dependencies
+├── Dockerfile           # Docker configuration
+├── README.md            # Project documentation
+```
+
+## Example questions
+
+- What is Few-NERD?
+- What is the Few-NERD dataset used for?
+- What are the NER types in the dataset?
+- What role does "transfer learning" play in the proposed few-shot learning system?
+- What metric does the paper use to evaluate the effectiveness of the few-shot model?
````
__pycache__/constant.cpython-312.pyc
ADDED
Binary file (376 Bytes)

__pycache__/index.cpython-312.pyc
ADDED
Binary file (1.95 kB)

__pycache__/main.cpython-312.pyc
ADDED
Binary file (2.67 kB)

__pycache__/rag.cpython-312.pyc
ADDED
Binary file (608 Bytes)

__pycache__/retriever.cpython-312.pyc
ADDED
Binary file (3.48 kB)

__pycache__/settings.cpython-312.pyc
ADDED
Binary file (2.29 kB)
app.py
ADDED
@@ -0,0 +1,43 @@
```python
import gradio as gr
from main import (
    answer_query,
    set_keys,
    process_file
)

from pydantic import ConfigDict

model_config = ConfigDict(protected_namespaces=())

setting_keys = gr.Interface(
    fn=set_keys,
    inputs=[
        gr.Textbox(label="Enter your Cohere API key"),
        gr.Textbox(label="Enter your LLAMA_CLOUD_API_KEY"),
    ],
    outputs=gr.Textbox(label="Status")
)

uploading_files = gr.Interface(
    fn=process_file,
    inputs=gr.File(
        label="Upload a file",
        file_count="single",
        file_types=["text", ".pdf"],
    ),
    outputs=gr.Textbox(label="Status")
)

qa = gr.Interface(
    fn=answer_query,
    inputs=gr.Textbox(label="Enter your question"),
    outputs=gr.Textbox(label="Answer"),
    title="Document Q&A System"
)

demo = gr.TabbedInterface(
    interface_list=[setting_keys, uploading_files, qa],
    tab_names=["Settings", "Upload File", "Q&A System"]
)

if __name__ == "__main__":
    demo.launch()
```
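The app above pairs each tab name in `gr.TabbedInterface` with exactly one handler function. A minimal sketch of that tab-to-handler wiring without Gradio, using a plain dictionary dispatch; all names and return strings here are illustrative stand-ins, not part of the repo:

```python
# Illustrative stand-ins for the three handlers imported from main.py.
def set_keys(cohere_key, llama_key):
    return "Keys are set successfully"

def process_file(filename):
    return f"Indexed {filename}"

def answer_query(question):
    return f"Answer to: {question}"

# gr.TabbedInterface binds one callable per tab name,
# much like this plain dictionary dispatch.
handlers = {
    "Settings": set_keys,
    "Upload File": process_file,
    "Q&A System": answer_query,
}

def dispatch(tab, *args):
    # Look up the handler for the selected tab and invoke it.
    return handlers[tab](*args)
```

In the real app, Gradio renders one form per interface and routes each submission to its `fn`, so the dictionary lookup happens in the UI layer rather than in user code.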
data/2105.07464v6.pdf
ADDED
Binary file (844 kB)

index/default__vector_store.json
ADDED
The diff for this file is too large to render.

index/docstore.json
ADDED
The diff for this file is too large to render.
index/graph_store.json
ADDED
@@ -0,0 +1 @@
+{"graph_dict": {}}
index/image__vector_store.json
ADDED
@@ -0,0 +1 @@
+{"embedding_dict": {}, "text_id_to_ref_doc_id": {}, "metadata_dict": {}}
index/index_store.json
ADDED
@@ -0,0 +1 @@
{"index_store/data": {"8a968a2d-ad62-41b4-8e52-02e2e510beb6": {"__type__": "vector_store", "__data__": "{\"index_id\": \"8a968a2d-ad62-41b4-8e52-02e2e510beb6\", \"summary\": null, \"nodes_dict\": {\"3e9bf844-0a4e-4de1-8be3-8a00f47f9be1\": \"3e9bf844-0a4e-4de1-8be3-8a00f47f9be1\", \"d950eb15-82e3-4c1c-b8bb-d5a7249aadae\": \"d950eb15-82e3-4c1c-b8bb-d5a7249aadae\", \"89d6be11-da41-4dd7-899f-1340a92c4cd2\": \"89d6be11-da41-4dd7-899f-1340a92c4cd2\", \"3a410009-58ad-4f35-9627-bfaa50dd56d8\": \"3a410009-58ad-4f35-9627-bfaa50dd56d8\", \"08a4de1b-58e0-4975-a68c-99b215ddca75\": \"08a4de1b-58e0-4975-a68c-99b215ddca75\", \"703bb83a-4aea-4eb3-85a6-086d25555ccb\": \"703bb83a-4aea-4eb3-85a6-086d25555ccb\", \"f6c2c3db-ba3c-489e-9459-e6b4f579286b\": \"f6c2c3db-ba3c-489e-9459-e6b4f579286b\", \"547f541a-ed82-4d22-af00-51a95dc3f0e1\": \"547f541a-ed82-4d22-af00-51a95dc3f0e1\", \"5bd4a82d-022c-47e6-9bbb-8bdeef20f515\": \"5bd4a82d-022c-47e6-9bbb-8bdeef20f515\", \"de0a20d6-b6dc-4ff3-8b6e-f6ad19472b08\": \"de0a20d6-b6dc-4ff3-8b6e-f6ad19472b08\", \"39abd0c8-e1f5-4ee3-8da1-537353646ec6\": \"39abd0c8-e1f5-4ee3-8da1-537353646ec6\", \"ec59971c-cf54-40e2-9a55-c5de0cdbea76\": \"ec59971c-cf54-40e2-9a55-c5de0cdbea76\", \"ce1695d1-7872-48ae-8589-5b5ed5355234\": \"ce1695d1-7872-48ae-8589-5b5ed5355234\", \"5a2138f4-d397-4d63-9cac-d45d9fe4de7e\": \"5a2138f4-d397-4d63-9cac-d45d9fe4de7e\", \"a2435907-a143-49c8-b483-ee3e8a02ba74\": \"a2435907-a143-49c8-b483-ee3e8a02ba74\", \"b3793ecc-96fc-4f50-bc61-21be9868e23b\": \"b3793ecc-96fc-4f50-bc61-21be9868e23b\", \"c33b63d5-7341-40f1-9016-43201810afd5\": \"c33b63d5-7341-40f1-9016-43201810afd5\", \"ecacd21e-1829-48fa-95ab-5c90846e8dd3\": \"ecacd21e-1829-48fa-95ab-5c90846e8dd3\", \"95509f41-b5f0-4bc4-ba2c-886ad18a6046\": \"95509f41-b5f0-4bc4-ba2c-886ad18a6046\", \"b778bdc3-b7ac-4222-b5f9-8e068507f3a6\": \"b778bdc3-b7ac-4222-b5f9-8e068507f3a6\", \"810ba2d6-65c6-4378-91c4-4ba38f087746\": \"810ba2d6-65c6-4378-91c4-4ba38f087746\", 
\"27c32a2f-d0a1-4540-90a2-aed3847dc7e4\": \"27c32a2f-d0a1-4540-90a2-aed3847dc7e4\", \"c2ae573a-cfd8-4747-a7c2-ce1d55a0484b\": \"c2ae573a-cfd8-4747-a7c2-ce1d55a0484b\", \"9cc52dba-eaee-481f-b340-5c0a400c28e7\": \"9cc52dba-eaee-481f-b340-5c0a400c28e7\", \"f1116f47-ab33-4225-bb26-ddc62fe95589\": \"f1116f47-ab33-4225-bb26-ddc62fe95589\", \"e72dac24-34a6-4159-818b-d6f023d89f0c\": \"e72dac24-34a6-4159-818b-d6f023d89f0c\", \"00f9b9f2-a717-4ccb-a263-c9c92e3a0604\": \"00f9b9f2-a717-4ccb-a263-c9c92e3a0604\", \"0b352382-f3d6-4693-8571-1762bd92e288\": \"0b352382-f3d6-4693-8571-1762bd92e288\", \"812846d5-bd57-4218-8039-072d4826c457\": \"812846d5-bd57-4218-8039-072d4826c457\", \"c52e3f4a-332f-4829-9c57-c42ad62c4c61\": \"c52e3f4a-332f-4829-9c57-c42ad62c4c61\", \"e32886ff-2b1a-422c-b95b-e421bd43419f\": \"e32886ff-2b1a-422c-b95b-e421bd43419f\", \"fbb1da9d-8adb-456b-a269-3544ffe0f8c3\": \"fbb1da9d-8adb-456b-a269-3544ffe0f8c3\", \"5b74caa6-0e1a-4998-8fce-bc485614f693\": \"5b74caa6-0e1a-4998-8fce-bc485614f693\", \"ae5d7634-5d34-44d1-a4e7-8d200469f0db\": \"ae5d7634-5d34-44d1-a4e7-8d200469f0db\", \"51714cff-a266-4cf3-96f1-bbb555068ce9\": \"51714cff-a266-4cf3-96f1-bbb555068ce9\", \"22d3e563-46d5-4e6a-a7d5-84b175421878\": \"22d3e563-46d5-4e6a-a7d5-84b175421878\", \"04883a01-7aeb-46c7-ab74-6fa9337c61ee\": \"04883a01-7aeb-46c7-ab74-6fa9337c61ee\", \"b7178a9a-baa5-4df6-bf34-fe7e2076eb3f\": \"b7178a9a-baa5-4df6-bf34-fe7e2076eb3f\", \"5a13abdf-cef2-4d15-a4c6-2678fd859672\": \"5a13abdf-cef2-4d15-a4c6-2678fd859672\", \"310337fe-3f15-42a8-a1fd-8a9bfc87f6a4\": \"310337fe-3f15-42a8-a1fd-8a9bfc87f6a4\", \"91ed48ed-da65-4f77-98c0-99f800d0db39\": \"91ed48ed-da65-4f77-98c0-99f800d0db39\", \"5998e668-1c0b-4446-ba84-6386fe51b607\": \"5998e668-1c0b-4446-ba84-6386fe51b607\", \"06456051-5542-40dd-9ddd-87258d76aa23\": \"06456051-5542-40dd-9ddd-87258d76aa23\", \"492b4f97-d056-4cde-bbf6-d2fa2a5b21b0\": \"492b4f97-d056-4cde-bbf6-d2fa2a5b21b0\", \"fa1b2e06-8569-4c40-b557-50ab94a0728d\": 
\"fa1b2e06-8569-4c40-b557-50ab94a0728d\", \"f70535c7-2605-4c2f-b0fc-4e390501a1e4\": \"f70535c7-2605-4c2f-b0fc-4e390501a1e4\", \"76c294c4-2bf8-4452-9ad4-beb68c0848c3\": \"76c294c4-2bf8-4452-9ad4-beb68c0848c3\", \"c82a6593-cdd2-458f-915a-b0cbba22ba2a\": \"c82a6593-cdd2-458f-915a-b0cbba22ba2a\"}, \"doc_id_dict\": {}, \"embeddings_dict\": {}}"}}}
main.py
ADDED
@@ -0,0 +1,55 @@
```python
from utils.retriever import get_query_engine
from utils.index import create_index
from utils.constant import INDEX_PATH
import os
from pathlib import Path

def set_keys(co_api_key: str, llama_cloud_api_key: str) -> str:
    try:
        os.environ["CO_API_KEY"] = co_api_key
        os.environ["LLAMA_CLOUD_API_KEY"] = llama_cloud_api_key
        return "Keys are set successfully"
    except Exception as e:
        return str(e)

def process_file(file) -> str:
    file_path = os.path.join("uploaded_files", file.name)

    os.makedirs(os.path.dirname(file_path), exist_ok=True)

    if not os.path.exists(file_path):
        return f"File {file_path} does not exist after writing."

    try:
        filepath = Path(file_path)

        if not filepath.parent.exists():
            return f"Directory {filepath.parent} does not exist."

        create_index(filepath, INDEX_PATH)
        return "File indexed successfully"
    except Exception as e:
        return str(e)

def answer_query(query: str) -> str:
    query_engine = get_query_engine(semantic=True)
    # str() so the retrieved-node details below can be appended to the answer
    response = str(query_engine.query(query))

    nodes = query_engine.retriever.retrieve(query)

    for node in nodes:
        score = node.get_score()
        text = node.text

        response += f"\nNode: {node.node_id}\nScore: {score:0.3f}\nText: {text}\n"

    return response

if __name__ == "__main__":
    query = "What is Few-NERD?"
    response = answer_query(query)
    print(response)
```
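`set_keys` works by exporting both credentials as process environment variables, which the Cohere and LlamaParse clients then read implicitly. A self-contained sketch of that pattern, using the same variable names as `main.py` (the function name `set_env_keys` is illustrative):

```python
import os

def set_env_keys(co_api_key: str, llama_cloud_api_key: str) -> str:
    # Mirror of the pattern in main.py's set_keys(): export the keys so
    # downstream clients can pick them up from os.environ at call time.
    os.environ["CO_API_KEY"] = co_api_key
    os.environ["LLAMA_CLOUD_API_KEY"] = llama_cloud_api_key
    return "Keys are set successfully"
```

One consequence of this design: keys set through the Gradio "Settings" tab only affect the current server process and are lost on restart.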
requirements.txt
ADDED
Binary file (4.93 kB)

utils/__pycache__/constant.cpython-312.pyc
ADDED
Binary file (382 Bytes)

utils/__pycache__/index.cpython-312.pyc
ADDED
Binary file (1.96 kB)

utils/__pycache__/retriever.cpython-312.pyc
ADDED
Binary file (3.54 kB)

utils/__pycache__/settings.cpython-312.pyc
ADDED
Binary file (1.97 kB)
utils/constant.py
ADDED
@@ -0,0 +1,7 @@
```python
DOC_PATH = "./data"
INDEX_PATH = "./index"

TOP_K_RETRIEVAL = 10
TOP_N_RERANKER = 3

EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"
```
utils/index.py
ADDED
@@ -0,0 +1,53 @@
```python
from utils.settings import configure_settings
from utils.constant import *

from llama_parse import LlamaParse

from llama_index.core import Settings
from llama_index.core import VectorStoreIndex
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext, load_index_from_storage


def get_documents(path: str):
    print("Getting documents...")

    parser = LlamaParse()
    file_extractor = {".pdf": parser}

    documents = SimpleDirectoryReader(
        input_dir=path,
        file_extractor=file_extractor
    ).load_data()

    return documents

def create_index(doc_path: str, index_path: str):
    print("Indexing documents...")

    configure_settings()

    documents = get_documents(doc_path)
    nodes = Settings.node_parser.get_nodes_from_documents(documents)
    vector_index = VectorStoreIndex(nodes, show_progress=True)

    vector_index.storage_context.persist(persist_dir=index_path)

    return vector_index

def load_index(path: str):
    print("Loading index...")

    storage_context = StorageContext.from_defaults(persist_dir=path)
    index = load_index_from_storage(storage_context)

    return index

if __name__ == "__main__":
    create_index(DOC_PATH, INDEX_PATH)
```
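The key contract here is the persist/load round trip: whatever `create_index` writes to `index_path` via `storage_context.persist`, `load_index` must reconstruct from that same directory. A simplified sketch of that contract, using `json` as a stand-in for the llama-index storage files (the `persist`/`load` names are illustrative, not the library API):

```python
import json
import tempfile
from pathlib import Path

def persist(data: dict, persist_dir: str) -> None:
    # Like storage_context.persist(): write the index state into a directory.
    Path(persist_dir).mkdir(parents=True, exist_ok=True)
    (Path(persist_dir) / "docstore.json").write_text(json.dumps(data))

def load(persist_dir: str) -> dict:
    # Like load_index_from_storage(): rebuild the state from the same directory.
    return json.loads((Path(persist_dir) / "docstore.json").read_text())
```

This is why the repo ships the `index/` directory with `docstore.json`, `index_store.json`, and the vector store files: the Space can call `load_index` at startup without re-parsing and re-embedding the PDF.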
utils/retriever.py
ADDED
@@ -0,0 +1,64 @@
```python
from utils.settings import configure_settings
from utils.index import load_index
from utils.constant import INDEX_PATH, TOP_K_RETRIEVAL, TOP_N_RERANKER

from llama_index.core import PromptTemplate
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.query_engine import RetrieverQueryEngine
import Stemmer

class QueryEngineManager:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(QueryEngineManager, cls).__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        if not self._initialized:
            self._initialized = True
            self.index = None
            self.retriever = None
            self.reranker = None
            self.query_engine = None
            self._configure()

    def _configure(self):
        configure_settings()
        self.index = load_index(path=INDEX_PATH)
        self.nodes = list(self.index.docstore.docs.values())
        self.reranker = LLMRerank(top_n=TOP_K_RETRIEVAL, choice_batch_size=1)

    def get_engine(self, bm25: bool = False, semantic: bool = False):
        if bm25:
            self.retriever = BM25Retriever.from_defaults(
                nodes=self.nodes,
                stemmer=Stemmer.Stemmer("english"),
                similarity_top_k=TOP_K_RETRIEVAL,
                language="english"
            )
        elif semantic:
            self.retriever = self.index.as_retriever(similarity_top_k=TOP_K_RETRIEVAL)

        qa_template = PromptTemplate(
            """Given the following context and question, provide a detailed response.
            Context: {context_str}
            Question: {query_str}
            Let me explain this in detail:""",
            prompt_type="text_qa"
        )

        self.query_engine = RetrieverQueryEngine.from_args(
            retriever=self.retriever,
            text_qa_template=qa_template,
            # node_postprocessors=[self.reranker]
        )

        return self.query_engine

def get_query_engine(bm25: bool = False, semantic: bool = False):
    engine_manager = QueryEngineManager()
    return engine_manager.get_engine(bm25, semantic)
```
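`QueryEngineManager` uses a `__new__`-based singleton so that loading the index and configuring settings happen only once per process, no matter how often `get_query_engine` is called. The same pattern in isolation, independent of llama-index (the `Singleton` class and its `setup_runs` counter are illustrative):

```python
class Singleton:
    # Same shape as QueryEngineManager: __new__ returns one shared instance,
    # and __init__ is guarded so expensive setup runs exactly once.
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        if not self._initialized:
            self._initialized = True
            self.setup_runs = 1  # stands in for the one-time _configure() call
```

The `_initialized` guard matters because Python calls `__init__` on every instantiation, even when `__new__` returns an existing object; without the guard, each `get_query_engine` call would reload the index.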
utils/settings.py
ADDED
@@ -0,0 +1,44 @@
```python
from llama_index.core import Settings
from llama_index.llms.cohere import Cohere
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SemanticSplitterNodeParser

def load_llm():
    print("Loading LLM model...")

    system_prompt = '''
    You are an academic assistant specialized in synthesizing and analyzing information from scholarly papers provided by the user.
    Your role is to:
    - Base your answers solely on the content of these papers.
    - Ensure that your explanations are clear, concise, and accurately reflect the information and insights contained within the supplied documents.
    - Integrate information from the relevant papers seamlessly if a question pertains to multiple topics.
    - Do not include information from external sources not provided by the user.
    '''

    llm = Cohere(
        system_prompt=system_prompt,
    )

    return llm

def load_embed_model():
    print("Loading embedding model...")

    embed_model = HuggingFaceEmbedding(
        model_name="sentence-transformers/all-mpnet-base-v2",
    )

    return embed_model

def configure_settings():
    print("Configuring settings...")

    llm = load_llm()
    embed_model = load_embed_model()

    Settings.llm = llm
    Settings.embed_model = embed_model
    Settings.node_parser = SemanticSplitterNodeParser(
        embed_model=Settings.embed_model,
    )
```