Commit 61d0253
Parent(s): 1b4fddf

Adding code for transformer model

Files changed:
- README.md (+82, -1)
- app.py (+63, -0)
- checkpoint.pt (+3, -0)
- input.txt (+0, -0)
- trained_model_quantized.pt (+3, -0)
- training.log (+253, -0)
- transformer.py (+353, -0)
README.md
CHANGED
@@ -10,4 +10,85 @@ pinned: false
short_description: Transformer trained on Shakespeare play dataset
---

# Transformer Model Training

This project implements a transformer-based language model in PyTorch. The model learns from a text corpus and can be trained and fine-tuned for various natural language processing tasks.

## Table of Contents
- [Features](#features)
- [Requirements](#requirements)
- [Installation](#installation)
- [Usage](#usage)
- [Training](#training)
- [Actual Training](#actual-training)
- [Checkpointing](#checkpointing)
- [Model Compression](#model-compression)
- [License](#license)
- [Acknowledgments](#acknowledgments)

## Features
- Transformer architecture with causal self-attention and feedforward layers.
- Efficient data loading and batching.
- Checkpointing to resume training.
- Support for multiple devices (CPU, CUDA, MPS).
- Model compression for reduced file size.
- Streamlit application for text generation using the trained model.

## Requirements
- Python 3.6 or higher
- PyTorch 1.7 or higher
- tqdm
- tiktoken
- streamlit
- transformers

## Installation
1. Clone the repository:
   ```bash
   git clone https://github.com/yourusername/transformer-model-training.git
   cd transformer-model-training
   ```
2. Install the required packages:
   ```bash
   pip install -r requirements.txt
   ```

## Usage
1. Prepare your text data in a file named `input.txt`. The training script reads this file to load tokens.
2. Run the training script:
   ```bash
   python transformer.py
   ```
3. The script saves a checkpoint after each epoch in `checkpoint.pt` and the final model in `trained_model_quantized.pt`.
4. To generate text with the trained model, run the Streamlit application:
   ```bash
   streamlit run app.py
   ```
5. Enter your text and specify the length of additional text to generate in the Streamlit interface.

## Training
- The model is trained with a batch size of 4 and a learning rate of 3e-4 (AdamW).
- The training loop performs the forward pass, loss calculation, backpropagation, and optimizer step, as sketched below.
- The loss is monitored, and checkpoints are saved so that training can resume.
- The training process is logged in `training.log`, which records per-epoch statistics, including loss values and checkpointing information.
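A condensed view of one optimization step as it appears in `transformer.py` (names follow the script; note that `transformer.py` currently runs its whole training loop at import time, so treat this as a sketch rather than a drop-in snippet):

```python
import torch
from transformer import DataLoaderLite, GPT, GPTConfig  # defined in transformer.py

device = 'cuda' if torch.cuda.is_available() else 'cpu'
train_loader = DataLoaderLite(B=4, T=1024)                  # batch size 4, sequence length 1024
model = GPT(GPTConfig()).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x, y = train_loader.next_batch()        # next (inputs, targets) token chunks, each of shape (4, 1024)
x, y = x.to(device), y.to(device)
optimizer.zero_grad()
logits, loss = model(x, y)              # forward pass returns (logits, loss)
loss.backward()                         # backpropagation
optimizer.step()                        # parameter update
```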

## Actual Training
The model was trained for a total of **78 epochs**. The final loss at the end of training was approximately **0.904894**. The training log, `training.log` in the project directory, contains per-epoch statistics, including loss values and checkpointing information.

## Checkpointing
- The model state and the current epoch are saved together in a single checkpoint file (`checkpoint.pt`), in the format sketched below.
- To resume training from the last checkpoint, simply run the training script again; it automatically loads the latest checkpoint.
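The checkpoint is a plain dictionary holding the epoch number and the model weights; restoring it looks like this (mirroring `load_latest_checkpoint` in `transformer.py`):

```python
import torch
from transformer import GPT, GPTConfig

model = GPT(GPTConfig())

# checkpoint.pt is a plain dict: {'epoch': <int>, 'model_state_dict': <model weights>}
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
start_epoch = checkpoint['epoch']   # training resumes from this epoch
```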

## Model Compression
- The final model is saved with dynamic quantization and compression to reduce file size, and is written to `trained_model_quantized.pt` (see the sketch below).
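The compression step applies PyTorch dynamic quantization to the linear layers, as done in `save_model_with_quantization` in `transformer.py`. In this repository, the float checkpoint (`checkpoint.pt`) is roughly 548 MB, while `trained_model_quantized.pt` is roughly 332 MB.

```python
import torch
import torch.nn as nn
from transformer import GPT, GPTConfig

model = GPT(GPTConfig())   # in practice, the trained model
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model,               # the model to quantize
    {nn.Linear},         # quantize the linear layers
    dtype=torch.qint8    # 8-bit integer weights
)
torch.save(quantized_model.state_dict(), 'trained_model_quantized.pt',
           _use_new_zipfile_serialization=True)
```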

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments
- This project is inspired by the original GPT architecture and various resources available in the NLP community.
app.py
ADDED
@@ -0,0 +1,63 @@
import streamlit as st
import torch
import tiktoken
from transformer import GPT, GPTConfig  # Ensure you import your model class

# Load the trained model
@st.cache_resource
def load_model():
    config = GPTConfig()
    model = GPT(config)
    try:
        model.load_state_dict(torch.load('trained_model_quantized.pt'))
        model.eval()  # Set the model to evaluation mode
        st.success("Model loaded successfully!")
    except Exception as e:
        st.error(f"Error loading model: {e}")
    return model

# Load the tokenizer
def load_tokenizer():
    return tiktoken.get_encoding('gpt2')

# Generate text function
def generate_text(model, tokenizer, input_text, length, num_sequences):
    # Encode the input text once; every sequence starts from this prompt
    input_ids = tokenizer.encode(input_text)

    generated_sequences = []
    for _ in range(num_sequences):
        # Start each sequence from the original prompt (add batch dimension)
        input_tensor = torch.tensor(input_ids).unsqueeze(0)
        # Generate additional tokens
        with torch.no_grad():
            for _ in range(length):
                logits = model(input_tensor)[0]  # Model returns (logits, loss); take the logits
                next_token_logits = logits[:, -1, :]  # Get the last token's logits
                next_token_probs = torch.softmax(next_token_logits, dim=-1)
                next_token = torch.multinomial(next_token_probs, num_samples=1)  # Sample one token, shape (1, 1)
                input_tensor = torch.cat((input_tensor, next_token), dim=1)  # Append the sampled token

        # Decode the generated tokens
        generated_sequences.append(tokenizer.decode(input_tensor[0].tolist()))

    return generated_sequences

# Streamlit app layout
st.title("GPT Text Generator")
st.write("Enter your text and specify the length of additional text to generate.")

input_text = st.text_area("Input Text", "Once upon a time", max_chars=512)  # Limit to 512 characters
length = st.slider("Predict Additional Text of Length", 1, 50, 10)
num_sequences = st.slider("Number of Sequences to Generate", 1, 5, 1)

if st.button("Generate"):
    model = load_model()
    tokenizer = load_tokenizer()
    st.write("Generating text...")
    generated_texts = generate_text(model, tokenizer, input_text, length, num_sequences)
    st.write("Text generation complete.")

    st.write("Generated Texts:")
    for i, text in enumerate(generated_texts):
        st.subheader(f"Sequence {i + 1}")
        st.write(text)
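One caveat on `load_model` above: `trained_model_quantized.pt` holds the state dict of a dynamically quantized model, so loading it directly into a plain float `GPT` may fail with missing or unexpected keys (the `except` branch then reports the error). A sketch of one way to reload it, assuming the same `quantize_dynamic` configuration that was used when saving:

```python
import torch
import torch.nn as nn
from transformer import GPT, GPTConfig

def load_quantized_model(path='trained_model_quantized.pt'):
    # Rebuild the architecture, apply the same dynamic quantization as at save time,
    # then load the quantized state dict into the matching module structure.
    model = GPT(GPTConfig())
    model.eval()
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
    quantized.load_state_dict(torch.load(path, map_location='cpu'))
    return quantized
```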
checkpoint.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f9f02348249d0b8457a59bc3331ac807b879f7d32b35886d60c8ab15d18fa6bd
size 548146590
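Note that `checkpoint.pt` and `trained_model_quantized.pt` are tracked with Git LFS, so the repository stores only these small pointer files (spec version, object hash, size in bytes). To materialize the actual weights after cloning, run `git lfs install` followed by `git lfs pull`, assuming Git LFS is available locally.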
input.txt
ADDED
The diff for this file is too large to render.
See raw diff
trained_model_quantized.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8fec31d4b4fa71331f80d77d4066bb10a71d6118c0c757e341a143b630be08a6
size 331982620
training.log
ADDED
@@ -0,0 +1,253 @@
using device: cuda
loaded 338025 tokens
1 epoch = 82 batches
Number of model parameters: 124439808
Epoch 1/70: 100% 82/82 [01:38<00:00, 1.20s/it]
Epoch 1/70, Loss: 6.169636
Checkpoint saved to checkpoint.pt
Epoch 2/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 2/70, Loss: 5.720689
Checkpoint saved to checkpoint.pt
Epoch 3/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 3/70, Loss: 5.390238
Checkpoint saved to checkpoint.pt
Epoch 4/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 4/70, Loss: 5.164030
Checkpoint saved to checkpoint.pt
Epoch 5/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 5/70, Loss: 5.051653
Checkpoint saved to checkpoint.pt
Epoch 6/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 6/70, Loss: 4.947546
Checkpoint saved to checkpoint.pt
Epoch 7/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 7/70, Loss: 4.893464
Checkpoint saved to checkpoint.pt
Epoch 8/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 8/70, Loss: 4.785249
Checkpoint saved to checkpoint.pt
Epoch 9/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 9/70, Loss: 4.773346
Checkpoint saved to checkpoint.pt
Epoch 10/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 10/70, Loss: 4.669469
Checkpoint saved to checkpoint.pt
Epoch 11/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 11/70, Loss: 4.617172
Checkpoint saved to checkpoint.pt
Epoch 12/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 12/70, Loss: 4.594382
Checkpoint saved to checkpoint.pt
Epoch 13/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 13/70, Loss: 4.554847
Checkpoint saved to checkpoint.pt
Epoch 14/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 14/70, Loss: 4.506260
Checkpoint saved to checkpoint.pt
Epoch 15/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 15/70, Loss: 4.416086
Checkpoint saved to checkpoint.pt
Epoch 16/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 16/70, Loss: 4.370214
Checkpoint saved to checkpoint.pt
Epoch 17/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 17/70, Loss: 4.278370
Checkpoint saved to checkpoint.pt
Epoch 18/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 18/70, Loss: 4.304771
Checkpoint saved to checkpoint.pt
Epoch 19/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 19/70, Loss: 4.209321
Checkpoint saved to checkpoint.pt
Epoch 20/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 20/70, Loss: 4.175936
Checkpoint saved to checkpoint.pt
Epoch 21/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 21/70, Loss: 4.071361
Checkpoint saved to checkpoint.pt
Epoch 22/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 22/70, Loss: 4.071530
Checkpoint saved to checkpoint.pt
Epoch 23/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 23/70, Loss: 4.053171
Checkpoint saved to checkpoint.pt
Epoch 24/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 24/70, Loss: 3.923664
Checkpoint saved to checkpoint.pt
Epoch 25/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 25/70, Loss: 3.827437
Checkpoint saved to checkpoint.pt
Epoch 26/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 26/70, Loss: 3.767063
Checkpoint saved to checkpoint.pt
Epoch 27/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 27/70, Loss: 3.711340
Checkpoint saved to checkpoint.pt
Epoch 28/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 28/70, Loss: 3.622302
Checkpoint saved to checkpoint.pt
Epoch 29/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 29/70, Loss: 3.583114
Checkpoint saved to checkpoint.pt
Epoch 30/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 30/70, Loss: 3.517573
Checkpoint saved to checkpoint.pt
Epoch 31/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 31/70, Loss: 3.445611
Checkpoint saved to checkpoint.pt
Epoch 32/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 32/70, Loss: 3.410571
Checkpoint saved to checkpoint.pt
Epoch 33/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 33/70, Loss: 3.282128
Checkpoint saved to checkpoint.pt
Epoch 34/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 34/70, Loss: 3.307455
Checkpoint saved to checkpoint.pt
Epoch 35/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 35/70, Loss: 3.126928
Checkpoint saved to checkpoint.pt
Epoch 36/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 36/70, Loss: 3.057953
Checkpoint saved to checkpoint.pt
Epoch 37/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 37/70, Loss: 3.082567
Checkpoint saved to checkpoint.pt
Epoch 38/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 38/70, Loss: 3.066772
Checkpoint saved to checkpoint.pt
Epoch 39/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 39/70, Loss: 2.943954
Checkpoint saved to checkpoint.pt
Epoch 40/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 40/70, Loss: 2.874876
Checkpoint saved to checkpoint.pt
Epoch 41/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 41/70, Loss: 2.781206
Checkpoint saved to checkpoint.pt
Epoch 42/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 42/70, Loss: 2.729423
Checkpoint saved to checkpoint.pt
Epoch 43/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 43/70, Loss: 2.656427
Checkpoint saved to checkpoint.pt
Epoch 44/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 44/70, Loss: 2.641519
Checkpoint saved to checkpoint.pt
Epoch 45/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 45/70, Loss: 2.593380
Checkpoint saved to checkpoint.pt
Epoch 46/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 46/70, Loss: 2.504074
Checkpoint saved to checkpoint.pt
Epoch 47/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 47/70, Loss: 2.510426
Checkpoint saved to checkpoint.pt
Epoch 48/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 48/70, Loss: 2.465840
Checkpoint saved to checkpoint.pt
Epoch 49/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 49/70, Loss: 2.339541
Checkpoint saved to checkpoint.pt
Epoch 50/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 50/70, Loss: 2.288784
Checkpoint saved to checkpoint.pt
Epoch 51/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 51/70, Loss: 2.272939
Checkpoint saved to checkpoint.pt
Epoch 52/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 52/70, Loss: 2.150897
Checkpoint saved to checkpoint.pt
Epoch 53/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 53/70, Loss: 2.096288
Checkpoint saved to checkpoint.pt
Epoch 54/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 54/70, Loss: 2.057416
Checkpoint saved to checkpoint.pt
Epoch 55/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 55/70, Loss: 1.962530
Checkpoint saved to checkpoint.pt
Epoch 56/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 56/70, Loss: 1.930993
Checkpoint saved to checkpoint.pt
Epoch 57/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 57/70, Loss: 1.854412
Checkpoint saved to checkpoint.pt
Epoch 58/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 58/70, Loss: 1.818957
Checkpoint saved to checkpoint.pt
Epoch 59/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 59/70, Loss: 1.764919
Checkpoint saved to checkpoint.pt
Epoch 60/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 60/70, Loss: 1.741000
Checkpoint saved to checkpoint.pt
Epoch 61/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 61/70, Loss: 1.694582
Checkpoint saved to checkpoint.pt
Epoch 62/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 62/70, Loss: 1.751990
Checkpoint saved to checkpoint.pt
Epoch 63/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 63/70, Loss: 1.664971
Checkpoint saved to checkpoint.pt
Epoch 64/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 64/70, Loss: 1.557876
Checkpoint saved to checkpoint.pt
Epoch 65/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 65/70, Loss: 1.543549
Checkpoint saved to checkpoint.pt
Epoch 66/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 66/70, Loss: 1.436256
Checkpoint saved to checkpoint.pt
Epoch 67/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 67/70, Loss: 1.352293
Checkpoint saved to checkpoint.pt
Epoch 68/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 68/70, Loss: 1.361581
Checkpoint saved to checkpoint.pt
Epoch 69/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 69/70, Loss: 1.308131
Checkpoint saved to checkpoint.pt
Epoch 70/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 70/70, Loss: 1.287876
Checkpoint saved to checkpoint.pt
Total training time: 127 minutes and 37 seconds
Model saved to trained_model_quantized.pt with quantization and compression.
==================================================
Increased epoch to 78 to reach loss < 0.99999
==================================================
using device: cuda
loaded 338025 tokens
1 epoch = 82 batches
Number of model parameters: 124439808
Loading checkpoint from checkpoint.pt
/content/erav3-s12-transformer-model/erav3-s12-transformer-model/transformer.py:262: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_file)
Epoch 71/78: 100% 82/82 [01:36<00:00, 1.18s/it]
Epoch 71/78, Loss: 1.453567
Checkpoint saved to checkpoint.pt
Epoch 72/78: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 72/78, Loss: 1.162141
Checkpoint saved to checkpoint.pt
Epoch 73/78: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 73/78, Loss: 1.174683
Checkpoint saved to checkpoint.pt
Epoch 74/78: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 74/78, Loss: 1.089287
Checkpoint saved to checkpoint.pt
Epoch 75/78: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 75/78, Loss: 1.010704
Checkpoint saved to checkpoint.pt
Epoch 76/78: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 76/78, Loss: 0.979691
Checkpoint saved to checkpoint.pt
Epoch 77/78: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 77/78, Loss: 0.918769
Checkpoint saved to checkpoint.pt
Epoch 78/78: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 78/78, Loss: 0.904894
Checkpoint saved to checkpoint.pt
Total training time: 14 minutes and 37 seconds
Model saved to trained_model_quantized.pt with quantization and compression.
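A few consistency checks on the numbers reported in this log: 338,025 tokens with batch size 4 and sequence length 1024 give 82 full batches per epoch; the GPT-2-sized configuration gives the reported parameter count; and the final loss of about 0.905 corresponds to a per-token perplexity of roughly exp(0.905) ≈ 2.47.

```python
import math

# 338,025 tokens, batch size B=4, sequence length T=1024  ->  batches per epoch
tokens, B, T = 338_025, 4, 1024
print(tokens // (B * T))              # 82, matching "1 epoch = 82 batches"

# Parameter count for the GPT-2-sized config (vocab 50257, block 1024, 12 layers, n_embd 768)
V, S, L, d = 50257, 1024, 12, 768
per_block = (3*d*d + 3*d) + (d*d + d) + (4*d*d + 4*d) + (4*d*d + d) + 4*d
#            c_attn          attn.c_proj  mlp.c_fc        mlp.c_proj    two LayerNorms
total = V*d + S*d + L*per_block + 2*d  # wte + wpe + blocks + ln_f (lm_head is tied to wte)
print(total)                          # 124439808, matching the log

# Final training loss -> per-token perplexity
print(math.exp(0.904894))             # ~2.47
```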
transformer.py
ADDED
@@ -0,0 +1,353 @@
# Solving for residual std scaling issue
import os
import math
import time
from dataclasses import dataclass
import torch
import torch.nn as nn
from torch.nn import functional as F
from tqdm import tqdm  # Import tqdm for progress bar
import torch.quantization  # Import quantization module
import torch.nn.utils.prune as prune
import tiktoken


class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.c_proj.NANGPT_SCALE_INIT = 1
        # regularization
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # lower-triangular mask so each position only attends to earlier positions
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimensionality (n_embd)
        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        # nh is "number of heads", hs is "head size", and C (number of channels) = nh * hs
        # e.g. in GPT-2 (124M), n_head=12, hs=64, so nh*hs=C=768 channels in the Transformer
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)

        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)

        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side
        # output projection
        y = self.c_proj(y)
        return y


class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1  # note: _init_weights checks for 'NANGPT_SCALE_INIT', so this spelling is not picked up

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x


class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


@dataclass
class GPTConfig:
    block_size: int = 1024   # max sequence length
    vocab_size: int = 50257  # number of tokens: 50,000 BPE merges + 256 bytes tokens + 1 <|endoftext|> token
    n_layer: int = 12        # number of layers
    n_head: int = 12         # number of heads
    n_embd: int = 768        # embedding dimension


class GPT(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # weight sharing
        self.transformer.wte.weight = self.lm_head.weight

        # weight initialization
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if hasattr(module, 'NANGPT_SCALE_INIT'):
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def print_num_parameters(self):
        num_params = sum(p.numel() for p in self.parameters())
        print(f"Number of model parameters: {num_params}")

    def forward(self, idx, targets=None):
        # idx is of shape (B, T)
        B, T = idx.size()
        assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"
        # forward the token and position embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)  # shape (T)
        pos_emb = self.transformer.wpe(pos)  # position embeddings of shape (T, n_embd)
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape (B, T, n_embd)
        x = tok_emb + pos_emb
        # forward the blocks of the transformer
        for block in self.transformer.h:
            x = block(x)
        # forward the final layernorm and the classifier
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @classmethod
    def from_pretrained(cls, model_type):
        """Loads pretrained GPT-2 model weights from huggingface"""
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)

        # n_layer, n_head and n_embd are determined from model_type
        config_args = {
            'gpt2':        dict(n_layer=12, n_head=12, n_embd=768),   # 124M params
            'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
            'gpt2-large':  dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
            'gpt2-xl':     dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
        }[model_type]
        config_args['vocab_size'] = 50257  # always 50257 for GPT model checkpoints
        config_args['block_size'] = 1024   # always 1024 for GPT model checkpoints
        # create a from-scratch initialized minGPT model
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard this mask / buffer, not a param

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]  # ignore these, just a buffer
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]  # same, just the mask (buffer)
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
        # this means that we have to transpose these weights when we import them
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model


device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"using device: {device}")

# SEED
torch.manual_seed(1337)
if torch.cuda.is_available():
    torch.cuda.manual_seed(1337)


class DataLoaderLite:
    def __init__(self, B, T):
        self.B = B
        self.T = T

        # at init load tokens from disk and store them in memory
        with open('input.txt', 'r') as f:
            text = f.read()
        enc = tiktoken.get_encoding('gpt2')
        tokens = enc.encode(text)
        self.tokens = torch.tensor(tokens)
        print(f'loaded {len(self.tokens)} tokens')
        print(f'1 epoch = {len(self.tokens) // (B * T)} batches')

        # state
        self.current_position = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position: self.current_position + B * T + 1]
        x = (buf[:-1]).view(B, T)  # inputs
        y = (buf[1:]).view(B, T)   # targets
        # advance the position in the tensor
        self.current_position += B * T
        # if loading the next batch would be out of bounds, reset
        if self.current_position + (B * T + 1) > len(self.tokens):
            self.current_position = 0
        return x, y


# Initialize the data loader with batch size 4 and sequence length 1024
train_loader = DataLoaderLite(B=4, T=1024)

# Initialize the model
model = GPT(GPTConfig())
model.to(device)

# Print number of model parameters
model.print_num_parameters()

# Define the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Function to load the most recent checkpoint
def load_latest_checkpoint(model):
    # Find the checkpoint file
    checkpoint_file = 'checkpoint.pt'
    if not os.path.exists(checkpoint_file):
        return 0  # No checkpoint found, start from epoch 0

    print(f'Loading checkpoint from {checkpoint_file}')

    # Load the model state and epoch number
    checkpoint = torch.load(checkpoint_file)

    # Ensure the checkpoint contains the expected keys
    if 'model_state_dict' not in checkpoint or 'epoch' not in checkpoint:
        raise KeyError("Checkpoint does not contain required keys.")

    model.load_state_dict(checkpoint['model_state_dict'])

    # Return the epoch number
    return checkpoint['epoch']

# Load the latest checkpoint if available
start_epoch = load_latest_checkpoint(model)

# NEW CODE: Training loop until loss is less than 0.099999
loss = float('inf')  # Initialize loss to a large value
num_epochs = 78      # Set the number of epochs to 78

# Start time tracking
start_time = time.time()

for epoch in range(start_epoch, num_epochs):  # Start from the loaded epoch
    epoch_loss = 0.0  # Initialize epoch loss
    num_steps = 0     # Initialize step counter for the epoch
    last_loss = None  # Variable to store the last loss

    # Calculate total steps for the progress bar
    total_steps = len(train_loader.tokens) // (train_loader.B * train_loader.T)

    # Use tqdm to create a progress bar
    with tqdm(total=total_steps, desc=f'Epoch {epoch + 1}/{num_epochs}') as pbar:
        for step in range(total_steps):  # Iterate over the number of steps
            x, y = train_loader.next_batch()
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits, loss = model(x, y)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()  # Accumulate loss
            num_steps += 1             # Increment step counter
            last_loss = loss.item()    # Store the last loss
            pbar.update(1)             # Update progress bar

            # Check if the loss is below the threshold
            if last_loss < 0.099999:
                print(f'Loss below threshold: {last_loss:.6f}')  # Print loss before breaking
                break  # Exit the loop if the loss condition is met

    # Print the loss at the end of the epoch
    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {last_loss:.6f}')

    # Check if the loss condition was met to break out of the epoch loop
    if last_loss < 0.099999:
        print(f'Early stopping at epoch {epoch + 1} due to loss condition met.')
        break  # Exit the epoch loop if the loss condition is met

    # Checkpointing: Save the model and the current epoch after each epoch
    checkpoint_path = 'checkpoint.pt'  # Save to a single checkpoint file
    torch.save({
        'epoch': epoch + 1,                       # Save the current epoch number
        'model_state_dict': model.state_dict(),   # Save the model state
    }, checkpoint_path)
    print(f'Checkpoint saved to {checkpoint_path}')

# End time tracking
end_time = time.time()
training_duration = end_time - start_time

# Convert training duration to minutes and seconds
minutes = int(training_duration // 60)
seconds = int(training_duration % 60)

# Print the total training time in minute:second format
print(f'Total training time: {minutes} minutes and {seconds} seconds')

# After training your model, apply quantization and save it with compression
def save_model_with_quantization(model, file_path):
    # Switch model to evaluation mode
    model.eval()

    # Apply dynamic quantization
    quantized_model = torch.quantization.quantize_dynamic(
        model,             # the model to be quantized
        {nn.Linear},       # layers to quantize
        dtype=torch.qint8  # quantization type
    )

    # Save the quantized model with compression
    torch.save(quantized_model.state_dict(), file_path, _use_new_zipfile_serialization=True)
    print(f'Model saved to {file_path} with quantization and compression.')

# Call this function after training your model
save_model_with_quantization(model, 'trained_model_quantized.pt')
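For reference, the `GPT.from_pretrained` classmethod above is not exercised by the training run, but it can load the Hugging Face GPT-2 weights into this architecture. A minimal sketch of calling it and running a forward pass (keeping in mind, as noted earlier, that importing `transformer.py` as written also executes its top-level training code):

```python
import torch
import tiktoken
from transformer import GPT

model = GPT.from_pretrained('gpt2')   # downloads GPT-2 (124M) via transformers and copies the weights
model.eval()

enc = tiktoken.get_encoding('gpt2')
idx = torch.tensor(enc.encode("To be, or not to be")).unsqueeze(0)
with torch.no_grad():
    logits, _ = model(idx)            # forward returns (logits, loss); loss is None without targets
print(logits.shape)                   # (1, sequence_length, 50257)
```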