RakeshUtekar commited on
Commit
65403b0
·
verified ·
1 Parent(s): a72afcb

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +87 -2
README.md CHANGED
@@ -6,8 +6,93 @@ colorTo: red
6
  sdk: streamlit
7
  sdk_version: 1.36.0
8
  app_file: app.py
9
- pinned: false
10
  license: mit
 
11
  ---
 
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6
  sdk: streamlit
7
  sdk_version: 1.36.0
8
  app_file: app.py
9
+ pinned: true
10
  license: mit
11
+ short_description: Upload a PDF and ask question about it
12
  ---
13
+ # RAG-based PDF Query System
14
 
15
+ This project implements a Retrieval-Augmented Generation (RAG) system that allows users to upload multiple PDF files, extract and preprocess the text, and then query the contents of those PDFs using OpenAI's GPT-3.5-turbo model. The system combines the strengths of information retrieval and text generation to provide accurate and context-aware responses to user queries.
16
+
17
+ ## Description
18
+
19
+ The RAG-based PDF Query System is designed to:
20
+ 1. **Extract Text from PDFs:** Utilize `pdfplumber` to accurately extract text from multiple PDF files.
21
+ 2. **Preprocess Text:** Clean and tokenize the extracted text for better processing.
22
+ 3. **Create a Knowledge Base:** Use TF-IDF vectorization to create a searchable knowledge base from the extracted text.
23
+ 4. **Retrieve Relevant Texts:** Retrieve the most relevant texts based on the user query using cosine similarity.
24
+ 5. **Generate Responses:** Use OpenAI's GPT-3.5-turbo model to generate responses based on the retrieved texts and user query.
25
+
26
+ ### Key Components and Technologies Used
27
+
28
+ - **Streamlit:** For building an interactive web application.
29
+ - **pdfplumber:** For extracting text from PDF files.
30
+ - **NLTK:** For text preprocessing tasks such as tokenization.
31
+ - **Scikit-learn:** For TF-IDF vectorization and text retrieval.
32
+ - **OpenAI GPT-3.5-turbo:** For generating context-aware responses to user queries.
33
+
34
+ ### Why This Project?
35
+
36
+ - **Combining Retrieval and Generation:** The project combines information retrieval with advanced text generation, providing users with accurate and context-aware responses.
37
+ - **Interactive Interface:** Streamlit offers an easy-to-use interface for uploading PDFs and querying their contents.
38
+ - **Advanced Text Extraction:** `pdfplumber` ensures accurate extraction of text from PDFs, even from complex layouts.
39
+ - **State-of-the-art Language Model:** OpenAI's GPT-3.5-turbo is one of the most advanced language models, ensuring high-quality responses.
40
+
41
+ ## How to Run
42
+
43
+ ### Prerequisites
44
+
45
+ - Python 3.7 or higher
46
+ - OpenAI API Key (you can get it from the [OpenAI website](https://beta.openai.com/signup/))
47
+
48
+ ### Installation
49
+
50
+ 1. **Clone the repository:**
51
+ ```bash
52
+ git clone https://github.com/your-username/rag-pdf-query-system.git
53
+ cd rag-pdf-query-system
54
+ ```
55
+
56
+ 2. **Create a virtual environment and activate it:**
57
+ ```bash
58
+ python -m venv env
59
+ source env/bin/activate # On Windows use `env\Scripts\activate`
60
+ ```
61
+
62
+ 3. **Install the required packages:**
63
+ ```bash
64
+ pip install -r requirements.txt
65
+ ```
66
+
67
+ 4. **Download NLTK data:**
68
+ ```python
69
+ import nltk
70
+ nltk.download('punkt')
71
+ ```
72
+
73
+ 5. **Create a `.env` file in the project root directory:**
74
+ ```text
75
+ OPENAI_API_KEY=your_openai_api_key_here
76
+ ```
77
+
78
+ ### Running the Application
79
+
80
+ 1. **Run the Streamlit application:**
81
+ ```bash
82
+ streamlit run app.py
83
+ ```
84
+
85
+ 2. **Use the Application:**
86
+ - Open the URL provided by Streamlit (usually `http://localhost:8501`) in your web browser.
87
+ - Upload one or more PDF files.
88
+ - Enter your query in the input box.
89
+ - View the generated response based on the contents of the uploaded PDFs.
90
+
91
+ ### Notes
92
+
93
+ - The progress bar in the Streamlit application provides real-time feedback during the PDF processing stages.
94
+ - Ensure you have a stable internet connection to interact with the OpenAI API for generating responses.
95
+
96
+ This project demonstrates the integration of various tools and libraries to create a powerful and interactive query system for PDF documents.
97
+
98
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference