Spaces: Sleeping

Юра Цепліцький committed · 693d949
Parent(s): 4cf0e20

Initial commit

Files changed:

- Dockerfile +13 -0
- README.md +132 -7
- __pycache__/constant.cpython-312.pyc +0 -0
- __pycache__/index.cpython-312.pyc +0 -0
- __pycache__/main.cpython-312.pyc +0 -0
- __pycache__/rag.cpython-312.pyc +0 -0
- __pycache__/retriever.cpython-312.pyc +0 -0
- __pycache__/settings.cpython-312.pyc +0 -0
- app.py +43 -0
- data/2105.07464v6.pdf +0 -0
- index/default__vector_store.json +0 -0
- index/docstore.json +0 -0
- index/graph_store.json +1 -0
- index/image__vector_store.json +1 -0
- index/index_store.json +1 -0
- main.py +55 -0
- requirements.txt +0 -0
- utils/__pycache__/constant.cpython-312.pyc +0 -0
- utils/__pycache__/index.cpython-312.pyc +0 -0
- utils/__pycache__/retriever.cpython-312.pyc +0 -0
- utils/__pycache__/settings.cpython-312.pyc +0 -0
- utils/constant.py +7 -0
- utils/index.py +53 -0
- utils/retriever.py +64 -0
- utils/settings.py +44 -0
Dockerfile
ADDED
@@ -0,0 +1,13 @@
```dockerfile
FROM python:3.11.0-slim

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 7860

CMD ["python", "app.py"]
```
README.md
CHANGED
````diff
@@ -1,13 +1,138 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: "Document QA System"
+emoji: "📄"
+colorFrom: "blue"
+colorTo: "indigo"
 sdk: gradio
 sdk_version: 5.8.0
 app_file: app.py
-
-
+python_version: 3.11.0
+models:
+  - sentence-transformers/all-mpnet-base-v2
+tags:
+  - question-answering
+  - gradio
+  - LLM
+  - document-processing
 ---
 
-
+# Document QA System
+
+A Document Question-Answering system that uses Gradio for the interface and Docker for deployment.
+
+## Features
+
+- **Document Indexing**: Efficiently processes and indexes documents for quick retrieval.
+- **Interactive Interface**: Provides a user-friendly interface for querying documents.
+- **Dockerization**: Easy to build and deploy using Docker.
+
+## Technologies
+
+- Data source
+  - The [paper about the Few-NERD dataset](https://arxiv.org/pdf/2105.07464), located in the data directory, is used as the data source for indexing.
+- Chunking
+  - Document chunking is handled by [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
+- LLM
+  - The system uses [Cohere Command R](https://cohere.com/command) for generating responses
+- Retriever, Reranker
+  - [Cohere Command R](https://cohere.com/command) is used
+- UI
+  - The user interface is built with Gradio
+
+## Installation
+
+### Prerequisites
+
+1. **Docker**:
+
+   - [Install Docker](https://docs.docker.com/get-docker/)
+
+2. **Set the paths to the data and index directories**:
+
+   - Update the variables in `utils/constant.py`.
+
+3. **Set the API keys for [Cohere Command R](https://dashboard.cohere.com/api-keys) and [LlamaParse](https://docs.cloud.llamaindex.ai/llamaparse/getting_started/get_an_api_key)**:
+
+   - Update `CO_API_KEY` and `LLAMA_CLOUD_API_KEY` in the `configure_settings` function in `utils/settings.py`.
+
+### Using Docker
+
+1. **Clone the Repository**:
+
+   ```bash
+   git clone <repository-url>
+   cd <repository-folder>
+   ```
+
+2. **Build the Docker Image**:
+
+   ```bash
+   docker build -t doc-qa-system .
+   ```
+
+3. **Run the Docker Container**:
+
+   ```bash
+   docker run -p 7860:7860 doc-qa-system
+   ```
+
+4. **Access the Interface**:
+
+   Open your browser and go to `http://localhost:7860`.
+
+### Using Python
+
+1. **Clone the Repository**:
+
+   ```bash
+   git clone <repository-url>
+   cd <repository-folder>
+   ```
+
+2. **Install Dependencies**:
+
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+3. **Index the Data**:
+
+   ```bash
+   python index.py
+   ```
+
+4. **Run the Application**:
+
+   ```bash
+   python app.py
+   ```
+
+## Project structure
+
+```bash
+├── app.py               # Gradio application
+├── main.py              # Main script for answering queries
+├── utils/               # Utility functions and helpers
+│   ├── constant.py      # Constant values used in the project
+│   ├── index.py         # Handles document indexing
+│   ├── retriever.py     # Retrieves and ranks documents
+│   ├── settings.py      # Configuration settings
+├── data/                # Directory containing documents to be indexed
+├── index/               # Stores the generated index files
+│   ├── default__vector_store.json
+│   ├── docstore.json
+│   ├── graph_store.json
+│   ├── image__vector_store.json
+│   ├── index_store.json
+├── requirements.txt     # Python dependencies
+├── Dockerfile           # Docker configuration
+├── README.md            # Project documentation
+```
+
+## Example questions
+
+- What is Few-NERD?
+- What is the Few-NERD dataset used for?
+- What are the NER types in the dataset?
+- What role does "transfer learning" play in the proposed few-shot learning system?
+- What metric does the paper use to evaluate the effectiveness of the few-shot model?
````
__pycache__/constant.cpython-312.pyc
ADDED
Binary file (376 Bytes)

__pycache__/index.cpython-312.pyc
ADDED
Binary file (1.95 kB)

__pycache__/main.cpython-312.pyc
ADDED
Binary file (2.67 kB)

__pycache__/rag.cpython-312.pyc
ADDED
Binary file (608 Bytes)

__pycache__/retriever.cpython-312.pyc
ADDED
Binary file (3.48 kB)

__pycache__/settings.cpython-312.pyc
ADDED
Binary file (2.29 kB)
app.py
ADDED
@@ -0,0 +1,43 @@
```python
import gradio as gr
from main import (
    answer_query,
    set_keys,
    process_file
)

from pydantic import ConfigDict

model_config = ConfigDict(protected_namespaces=())

setting_keys = gr.Interface(
    fn=set_keys,
    inputs=[
        gr.Textbox(label="Enter your Cohere API key"),
        gr.Textbox(label="Enter your LLAMA_CLOUD_API_KEY"),
    ],
    outputs=gr.Textbox(label="Status")
)

uploading_files = gr.Interface(
    fn=process_file,
    inputs=gr.File(
        label="Upload a file",
        file_count="single",
        file_types=["text", ".pdf"],
    ),
    outputs=gr.Textbox(label="Status")
)

qa = gr.Interface(
    fn=answer_query,
    inputs=gr.Textbox(label="Enter your question"),
    outputs=gr.Textbox(label="Answer"),
    title="Document Q&A System"
)

demo = gr.TabbedInterface(
    interface_list=[setting_keys, uploading_files, qa],
    tab_names=["Settings", "Upload File", "Q&A System"]
)

if __name__ == "__main__":
    demo.launch()
```
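The app above pairs each tab name in `gr.TabbedInterface` with exactly one handler function. A minimal sketch of that tab-to-handler wiring without Gradio, using a plain dictionary dispatch; all names and return strings here are illustrative stand-ins, not part of the repo:

```python
# Illustrative stand-ins for the three handlers imported from main.py.
def set_keys(cohere_key, llama_key):
    return "Keys are set successfully"

def process_file(filename):
    return f"Indexed {filename}"

def answer_query(question):
    return f"Answer to: {question}"

# gr.TabbedInterface binds one callable per tab name,
# much like this plain dictionary dispatch.
handlers = {
    "Settings": set_keys,
    "Upload File": process_file,
    "Q&A System": answer_query,
}

def dispatch(tab, *args):
    # Look up the handler for the selected tab and invoke it.
    return handlers[tab](*args)
```

In the real app, Gradio renders one form per interface and routes each submission to its `fn`, so the dictionary lookup happens in the UI layer rather than in user code.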
data/2105.07464v6.pdf
ADDED
Binary file (844 kB)

index/default__vector_store.json
ADDED
The diff for this file is too large to render.

index/docstore.json
ADDED
The diff for this file is too large to render.
index/graph_store.json
ADDED
@@ -0,0 +1 @@
+{"graph_dict": {}}
index/image__vector_store.json
ADDED
@@ -0,0 +1 @@
+{"embedding_dict": {}, "text_id_to_ref_doc_id": {}, "metadata_dict": {}}
index/index_store.json
ADDED
@@ -0,0 +1 @@
{"index_store/data": {"8a968a2d-ad62-41b4-8e52-02e2e510beb6": {"__type__": "vector_store", "__data__": "{\"index_id\": \"8a968a2d-ad62-41b4-8e52-02e2e510beb6\", \"summary\": null, \"nodes_dict\": {\"3e9bf844-0a4e-4de1-8be3-8a00f47f9be1\": \"3e9bf844-0a4e-4de1-8be3-8a00f47f9be1\", \"d950eb15-82e3-4c1c-b8bb-d5a7249aadae\": \"d950eb15-82e3-4c1c-b8bb-d5a7249aadae\", \"89d6be11-da41-4dd7-899f-1340a92c4cd2\": \"89d6be11-da41-4dd7-899f-1340a92c4cd2\", \"3a410009-58ad-4f35-9627-bfaa50dd56d8\": \"3a410009-58ad-4f35-9627-bfaa50dd56d8\", \"08a4de1b-58e0-4975-a68c-99b215ddca75\": \"08a4de1b-58e0-4975-a68c-99b215ddca75\", \"703bb83a-4aea-4eb3-85a6-086d25555ccb\": \"703bb83a-4aea-4eb3-85a6-086d25555ccb\", \"f6c2c3db-ba3c-489e-9459-e6b4f579286b\": \"f6c2c3db-ba3c-489e-9459-e6b4f579286b\", \"547f541a-ed82-4d22-af00-51a95dc3f0e1\": \"547f541a-ed82-4d22-af00-51a95dc3f0e1\", \"5bd4a82d-022c-47e6-9bbb-8bdeef20f515\": \"5bd4a82d-022c-47e6-9bbb-8bdeef20f515\", \"de0a20d6-b6dc-4ff3-8b6e-f6ad19472b08\": \"de0a20d6-b6dc-4ff3-8b6e-f6ad19472b08\", \"39abd0c8-e1f5-4ee3-8da1-537353646ec6\": \"39abd0c8-e1f5-4ee3-8da1-537353646ec6\", \"ec59971c-cf54-40e2-9a55-c5de0cdbea76\": \"ec59971c-cf54-40e2-9a55-c5de0cdbea76\", \"ce1695d1-7872-48ae-8589-5b5ed5355234\": \"ce1695d1-7872-48ae-8589-5b5ed5355234\", \"5a2138f4-d397-4d63-9cac-d45d9fe4de7e\": \"5a2138f4-d397-4d63-9cac-d45d9fe4de7e\", \"a2435907-a143-49c8-b483-ee3e8a02ba74\": \"a2435907-a143-49c8-b483-ee3e8a02ba74\", \"b3793ecc-96fc-4f50-bc61-21be9868e23b\": \"b3793ecc-96fc-4f50-bc61-21be9868e23b\", \"c33b63d5-7341-40f1-9016-43201810afd5\": \"c33b63d5-7341-40f1-9016-43201810afd5\", \"ecacd21e-1829-48fa-95ab-5c90846e8dd3\": \"ecacd21e-1829-48fa-95ab-5c90846e8dd3\", \"95509f41-b5f0-4bc4-ba2c-886ad18a6046\": \"95509f41-b5f0-4bc4-ba2c-886ad18a6046\", \"b778bdc3-b7ac-4222-b5f9-8e068507f3a6\": \"b778bdc3-b7ac-4222-b5f9-8e068507f3a6\", \"810ba2d6-65c6-4378-91c4-4ba38f087746\": \"810ba2d6-65c6-4378-91c4-4ba38f087746\", 
\"27c32a2f-d0a1-4540-90a2-aed3847dc7e4\": \"27c32a2f-d0a1-4540-90a2-aed3847dc7e4\", \"c2ae573a-cfd8-4747-a7c2-ce1d55a0484b\": \"c2ae573a-cfd8-4747-a7c2-ce1d55a0484b\", \"9cc52dba-eaee-481f-b340-5c0a400c28e7\": \"9cc52dba-eaee-481f-b340-5c0a400c28e7\", \"f1116f47-ab33-4225-bb26-ddc62fe95589\": \"f1116f47-ab33-4225-bb26-ddc62fe95589\", \"e72dac24-34a6-4159-818b-d6f023d89f0c\": \"e72dac24-34a6-4159-818b-d6f023d89f0c\", \"00f9b9f2-a717-4ccb-a263-c9c92e3a0604\": \"00f9b9f2-a717-4ccb-a263-c9c92e3a0604\", \"0b352382-f3d6-4693-8571-1762bd92e288\": \"0b352382-f3d6-4693-8571-1762bd92e288\", \"812846d5-bd57-4218-8039-072d4826c457\": \"812846d5-bd57-4218-8039-072d4826c457\", \"c52e3f4a-332f-4829-9c57-c42ad62c4c61\": \"c52e3f4a-332f-4829-9c57-c42ad62c4c61\", \"e32886ff-2b1a-422c-b95b-e421bd43419f\": \"e32886ff-2b1a-422c-b95b-e421bd43419f\", \"fbb1da9d-8adb-456b-a269-3544ffe0f8c3\": \"fbb1da9d-8adb-456b-a269-3544ffe0f8c3\", \"5b74caa6-0e1a-4998-8fce-bc485614f693\": \"5b74caa6-0e1a-4998-8fce-bc485614f693\", \"ae5d7634-5d34-44d1-a4e7-8d200469f0db\": \"ae5d7634-5d34-44d1-a4e7-8d200469f0db\", \"51714cff-a266-4cf3-96f1-bbb555068ce9\": \"51714cff-a266-4cf3-96f1-bbb555068ce9\", \"22d3e563-46d5-4e6a-a7d5-84b175421878\": \"22d3e563-46d5-4e6a-a7d5-84b175421878\", \"04883a01-7aeb-46c7-ab74-6fa9337c61ee\": \"04883a01-7aeb-46c7-ab74-6fa9337c61ee\", \"b7178a9a-baa5-4df6-bf34-fe7e2076eb3f\": \"b7178a9a-baa5-4df6-bf34-fe7e2076eb3f\", \"5a13abdf-cef2-4d15-a4c6-2678fd859672\": \"5a13abdf-cef2-4d15-a4c6-2678fd859672\", \"310337fe-3f15-42a8-a1fd-8a9bfc87f6a4\": \"310337fe-3f15-42a8-a1fd-8a9bfc87f6a4\", \"91ed48ed-da65-4f77-98c0-99f800d0db39\": \"91ed48ed-da65-4f77-98c0-99f800d0db39\", \"5998e668-1c0b-4446-ba84-6386fe51b607\": \"5998e668-1c0b-4446-ba84-6386fe51b607\", \"06456051-5542-40dd-9ddd-87258d76aa23\": \"06456051-5542-40dd-9ddd-87258d76aa23\", \"492b4f97-d056-4cde-bbf6-d2fa2a5b21b0\": \"492b4f97-d056-4cde-bbf6-d2fa2a5b21b0\", \"fa1b2e06-8569-4c40-b557-50ab94a0728d\": 
\"fa1b2e06-8569-4c40-b557-50ab94a0728d\", \"f70535c7-2605-4c2f-b0fc-4e390501a1e4\": \"f70535c7-2605-4c2f-b0fc-4e390501a1e4\", \"76c294c4-2bf8-4452-9ad4-beb68c0848c3\": \"76c294c4-2bf8-4452-9ad4-beb68c0848c3\", \"c82a6593-cdd2-458f-915a-b0cbba22ba2a\": \"c82a6593-cdd2-458f-915a-b0cbba22ba2a\"}, \"doc_id_dict\": {}, \"embeddings_dict\": {}}"}}}
main.py
ADDED
@@ -0,0 +1,55 @@
```python
from utils.retriever import get_query_engine
from utils.index import create_index
from utils.constant import INDEX_PATH
import os
from pathlib import Path

def set_keys(co_api_key: str, llama_cloud_api_key: str) -> str:
    try:
        os.environ["CO_API_KEY"] = co_api_key
        os.environ["LLAMA_CLOUD_API_KEY"] = llama_cloud_api_key
        return "Keys are set successfully"
    except Exception as e:
        return str(e)

def process_file(file) -> str:
    file_path = os.path.join("uploaded_files", file.name)

    os.makedirs(os.path.dirname(file_path), exist_ok=True)

    if not os.path.exists(file_path):
        return f"File {file_path} does not exist after writing."

    try:
        filepath = Path(file_path)

        if not filepath.parent.exists():
            return f"Directory {filepath.parent} does not exist."

        create_index(filepath, INDEX_PATH)
        return "File indexed successfully"
    except Exception as e:
        return str(e)

def answer_query(query: str) -> str:
    query_engine = get_query_engine(semantic=True)
    # str() so the retrieved-node details below can be appended to the answer
    response = str(query_engine.query(query))

    nodes = query_engine.retriever.retrieve(query)

    for node in nodes:
        score = node.get_score()
        text = node.text

        response += f"\nNode: {node.node_id}\nScore: {score:0.3f}\nText: {text}\n"

    return response

if __name__ == "__main__":
    query = "What is Few-NERD?"
    response = answer_query(query)
    print(response)
```
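`set_keys` works by exporting both credentials as process environment variables, which the Cohere and LlamaParse clients then read implicitly. A self-contained sketch of that pattern, using the same variable names as `main.py` (the function name `set_env_keys` is illustrative):

```python
import os

def set_env_keys(co_api_key: str, llama_cloud_api_key: str) -> str:
    # Mirror of the pattern in main.py's set_keys(): export the keys so
    # downstream clients can pick them up from os.environ at call time.
    os.environ["CO_API_KEY"] = co_api_key
    os.environ["LLAMA_CLOUD_API_KEY"] = llama_cloud_api_key
    return "Keys are set successfully"
```

One consequence of this design: keys set through the Gradio "Settings" tab only affect the current server process and are lost on restart.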
requirements.txt
ADDED
Binary file (4.93 kB)

utils/__pycache__/constant.cpython-312.pyc
ADDED
Binary file (382 Bytes)

utils/__pycache__/index.cpython-312.pyc
ADDED
Binary file (1.96 kB)

utils/__pycache__/retriever.cpython-312.pyc
ADDED
Binary file (3.54 kB)

utils/__pycache__/settings.cpython-312.pyc
ADDED
Binary file (1.97 kB)
utils/constant.py
ADDED
@@ -0,0 +1,7 @@
```python
DOC_PATH = "./data"
INDEX_PATH = "./index"

TOP_K_RETRIEVAL = 10
TOP_N_RERANKER = 3

EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"
```
utils/index.py
ADDED
@@ -0,0 +1,53 @@
```python
from utils.settings import configure_settings
from utils.constant import *

from llama_parse import LlamaParse

from llama_index.core import Settings
from llama_index.core import VectorStoreIndex
from llama_index.core import SimpleDirectoryReader
from llama_index.core import StorageContext, load_index_from_storage


def get_documents(path: str):
    print("Getting documents...")

    parser = LlamaParse()
    file_extractor = {".pdf": parser}

    documents = SimpleDirectoryReader(
        input_dir=path,
        file_extractor=file_extractor
    ).load_data()

    return documents

def create_index(doc_path: str, index_path: str):
    print("Indexing documents...")

    configure_settings()

    documents = get_documents(doc_path)
    nodes = Settings.node_parser.get_nodes_from_documents(documents)
    vector_index = VectorStoreIndex(nodes, show_progress=True)

    vector_index.storage_context.persist(persist_dir=index_path)

    return vector_index

def load_index(path: str):
    print("Loading index...")

    storage_context = StorageContext.from_defaults(persist_dir=path)
    index = load_index_from_storage(storage_context)

    return index

if __name__ == "__main__":
    create_index(DOC_PATH, INDEX_PATH)
```
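The key contract here is the persist/load round trip: whatever `create_index` writes to `index_path` via `storage_context.persist`, `load_index` must reconstruct from that same directory. A simplified sketch of that contract, using `json` as a stand-in for the llama-index storage files (the `persist`/`load` names are illustrative, not the library API):

```python
import json
import tempfile
from pathlib import Path

def persist(data: dict, persist_dir: str) -> None:
    # Like storage_context.persist(): write the index state into a directory.
    Path(persist_dir).mkdir(parents=True, exist_ok=True)
    (Path(persist_dir) / "docstore.json").write_text(json.dumps(data))

def load(persist_dir: str) -> dict:
    # Like load_index_from_storage(): rebuild the state from the same directory.
    return json.loads((Path(persist_dir) / "docstore.json").read_text())
```

This is why the repo ships the `index/` directory with `docstore.json`, `index_store.json`, and the vector store files: the Space can call `load_index` at startup without re-parsing and re-embedding the PDF.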
utils/retriever.py
ADDED
@@ -0,0 +1,64 @@
```python
from utils.settings import configure_settings
from utils.index import load_index
from utils.constant import INDEX_PATH, TOP_K_RETRIEVAL, TOP_N_RERANKER

from llama_index.core import PromptTemplate
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.query_engine import RetrieverQueryEngine
import Stemmer

class QueryEngineManager:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(QueryEngineManager, cls).__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        if not self._initialized:
            self._initialized = True
            self.index = None
            self.retriever = None
            self.reranker = None
            self.query_engine = None
            self._configure()

    def _configure(self):
        configure_settings()
        self.index = load_index(path=INDEX_PATH)
        self.nodes = list(self.index.docstore.docs.values())
        self.reranker = LLMRerank(top_n=TOP_K_RETRIEVAL, choice_batch_size=1)

    def get_engine(self, bm25: bool = False, semantic: bool = False):
        if bm25:
            self.retriever = BM25Retriever.from_defaults(
                nodes=self.nodes,
                stemmer=Stemmer.Stemmer("english"),
                similarity_top_k=TOP_K_RETRIEVAL,
                language="english"
            )
        elif semantic:
            self.retriever = self.index.as_retriever(similarity_top_k=TOP_K_RETRIEVAL)

        qa_template = PromptTemplate(
            """Given the following context and question, provide a detailed response.
            Context: {context_str}
            Question: {query_str}
            Let me explain this in detail:""",
            prompt_type="text_qa"
        )

        self.query_engine = RetrieverQueryEngine.from_args(
            retriever=self.retriever,
            text_qa_template=qa_template,
            # node_postprocessors=[self.reranker]
        )

        return self.query_engine

def get_query_engine(bm25: bool = False, semantic: bool = False):
    engine_manager = QueryEngineManager()
    return engine_manager.get_engine(bm25, semantic)
```
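`QueryEngineManager` uses a `__new__`-based singleton so that loading the index and configuring settings happen only once per process, no matter how often `get_query_engine` is called. The same pattern in isolation, independent of llama-index (the `Singleton` class and its `setup_runs` counter are illustrative):

```python
class Singleton:
    # Same shape as QueryEngineManager: __new__ returns one shared instance,
    # and __init__ is guarded so expensive setup runs exactly once.
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        if not self._initialized:
            self._initialized = True
            self.setup_runs = 1  # stands in for the one-time _configure() call
```

The `_initialized` guard matters because Python calls `__init__` on every instantiation, even when `__new__` returns an existing object; without the guard, each `get_query_engine` call would reload the index.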
utils/settings.py
ADDED
@@ -0,0 +1,44 @@
```python
from llama_index.core import Settings
from llama_index.llms.cohere import Cohere
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SemanticSplitterNodeParser

def load_llm():
    print("Loading LLM model...")

    system_prompt = '''
    You are an academic assistant specialized in synthesizing and analyzing information from scholarly papers provided by the user.
    Your role is to:
    - Base your answers solely on the content of these papers.
    - Ensure that your explanations are clear, concise, and accurately reflect the information and insights contained within the supplied documents.
    - Integrate information from the relevant papers seamlessly if a question pertains to multiple topics.
    - Do not include information from external sources not provided by the user.
    '''

    llm = Cohere(
        system_prompt=system_prompt,
    )

    return llm

def load_embed_model():
    print("Loading embedding model...")

    embed_model = HuggingFaceEmbedding(
        model_name="sentence-transformers/all-mpnet-base-v2",
    )

    return embed_model

def configure_settings():
    print("Configuring settings...")

    llm = load_llm()
    embed_model = load_embed_model()

    Settings.llm = llm
    Settings.embed_model = embed_model
    Settings.node_parser = SemanticSplitterNodeParser(
        embed_model=Settings.embed_model,
    )
```