MilindChawre commited on
Commit
ac5a860
·
1 Parent(s): b7ca7fe

Adding changes in README

Browse files
Files changed (3) hide show
  1. README.md +22 -5
  2. train.py +1 -1
  3. training.log +57 -19
README.md CHANGED
@@ -23,6 +23,7 @@ This project implements a transformer-based language model using PyTorch. The mo
23
  - [Actual Training](#actual-training)
24
  - [Checkpointing](#checkpointing)
25
  - [Model Compression](#model-compression)
 
26
  - [License](#license)
27
  - [Acknowledgments](#acknowledgments)
28
 
@@ -48,17 +49,18 @@ This project implements a transformer-based language model using PyTorch. The mo
48
  git clone https://github.com/yourusername/transformer-model-training.git
49
  cd transformer-model-training
50
  ```
51
- 2. To train the model, run the training script:
 
52
  ```bash
53
- python train.py
54
  ```
55
 
56
  ## Usage
57
  1. Prepare your text data in a file named `input.txt`. The model will read this file to load tokens for training.
58
 
59
- 2. Run the training script:
60
  ```bash
61
- python transformer.py
62
  ```
63
 
64
  3. The model will save checkpoints after each epoch in `checkpoint.pt` and the final model in `trained_model_quantized.pt`.
@@ -75,9 +77,17 @@ This project implements a transformer-based language model using PyTorch. The mo
75
  - The training loop includes loss calculation, backpropagation, and optimizer steps.
76
  - The loss is monitored, and checkpoints are saved to allow for resuming training.
77
  - The training process is logged in `training.log`, which contains detailed statistics for each epoch, including loss values and checkpointing information.
 
78
 
79
  ## Actual Training
80
- The model was trained for a total of **78 epochs**. The final loss achieved at the end of training was approximately **0.904894**. The training log file contains detailed statistics for each epoch, including loss values and checkpointing information. You can find the log file named `training.log` in the project directory.
 
 
 
 
 
 
 
81
 
82
  ## Checkpointing
83
  - The model state and current epoch are saved in a single checkpoint file (`checkpoint.pt`).
@@ -86,6 +96,13 @@ The model was trained for a total of **78 epochs**. The final loss achieved at t
86
  ## Model Compression
87
  - The final model is saved with compression to reduce file size. The model file will be saved as `trained_model_quantized.pt`.
88
 
 
 
 
 
 
 
 
89
  ## License
90
  This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
91
 
 
23
  - [Actual Training](#actual-training)
24
  - [Checkpointing](#checkpointing)
25
  - [Model Compression](#model-compression)
26
+ - [Working Demo](#working-demo)
27
  - [License](#license)
28
  - [Acknowledgments](#acknowledgments)
29
 
 
49
  git clone https://github.com/yourusername/transformer-model-training.git
50
  cd transformer-model-training
51
  ```
52
+
53
+ 2. Install the required packages:
54
  ```bash
55
+ pip install -r requirements.txt
56
  ```
57
 
58
  ## Usage
59
  1. Prepare your text data in a file named `input.txt`. The model will read this file to load tokens for training.
60
 
61
+ 2. To train the model, run the training script:
62
  ```bash
63
+ python train.py
64
  ```
65
 
66
  3. The model will save checkpoints after each epoch in `checkpoint.pt` and the final model in `trained_model_quantized.pt`.
 
77
  - The training loop includes loss calculation, backpropagation, and optimizer steps.
78
  - The loss is monitored, and checkpoints are saved to allow for resuming training.
79
  - The training process is logged in `training.log`, which contains detailed statistics for each epoch, including loss values and checkpointing information.
80
+ - The training process supports early stopping if the loss falls below a specified threshold (0.099999), which can help prevent overfitting and reduce training time.
81
 
82
  ## Actual Training
83
+ The model was trained for a total of **91 epochs**. The training process involved the following steps:
84
+ - **Data Preparation**: The model reads and encodes text data from `input.txt`, loading a total of **338,025 tokens**.
85
+ - **Batch Processing**: Each epoch consists of **82 batches**, with each batch containing sequences of tokens for training.
86
+ - **Loss Monitoring**: The loss is calculated at each step, and the model's performance is tracked throughout the training process.
87
+ - **Checkpointing**: The model state and current epoch are saved in a single checkpoint file (`checkpoint.pt`) after each epoch, allowing for recovery in case of interruptions.
88
+ - **Final Model**: After training, the model is saved with quantization and compression as `trained_model_quantized.pt`, reducing the file size for easier deployment. The final loss achieved at the end of training was approximately 0.089421.
89
+
90
+ The training log file contains detailed statistics for each epoch, including loss values and checkpointing information. You can find the log file named `training.log` in the project directory.
91
 
92
  ## Checkpointing
93
  - The model state and current epoch are saved in a single checkpoint file (`checkpoint.pt`).
 
96
  ## Model Compression
97
  - The final model is saved with compression to reduce file size. The model file will be saved as `trained_model_quantized.pt`.
98
 
99
+ ## Working Demo
100
+ You can try out the working demo of the model on Hugging Face Spaces:
101
+
102
+ ![Hugging Face Spaces Demo](https://link-to-your-image.com/demo-image.png)
103
+
104
+ [Play with the Demo Here](https://huggingface.co/spaces/yourusername/your-demo)
105
+
106
  ## License
107
  This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
108
 
train.py CHANGED
@@ -26,7 +26,7 @@ def load_latest_checkpoint(model):
26
  start_epoch = load_latest_checkpoint(model)
27
 
28
  # Training loop
29
- num_epochs = 78
30
 
31
  # Start time tracking
32
  start_time = time.time()
 
26
  start_epoch = load_latest_checkpoint(model)
27
 
28
  # Training loop
29
+ num_epochs = 91
30
 
31
  # Start time tracking
32
  start_time = time.time()
training.log CHANGED
@@ -215,7 +215,7 @@ Checkpoint saved to checkpoint.pt
215
  Total training time: 127 minutes and 37 seconds
216
  Model saved to trained_model_quantized.pt with quantization and compression.
217
  ==================================================
218
- Increased epoch to 78 to reach loss < 0.99999
219
  ==================================================
220
  using device: cuda
221
  loaded 338025 tokens
@@ -224,30 +224,68 @@ Number of model parameters: 124439808
224
  Loading checkpoint from checkpoint.pt
225
  /content/erav3-s12-transformer-model/erav3-s12-transformer-model/transformer.py:262: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
226
  checkpoint = torch.load(checkpoint_file)
227
- Epoch 71/78: 100% 82/82 [01:36<00:00, 1.18s/it]
228
- Epoch 71/78, Loss: 1.453567
229
  Checkpoint saved to checkpoint.pt
230
- Epoch 72/78: 100% 82/82 [01:42<00:00, 1.25s/it]
231
- Epoch 72/78, Loss: 1.162141
232
  Checkpoint saved to checkpoint.pt
233
- Epoch 73/78: 100% 82/82 [01:42<00:00, 1.24s/it]
234
- Epoch 73/78, Loss: 1.174683
235
  Checkpoint saved to checkpoint.pt
236
- Epoch 74/78: 100% 82/82 [01:42<00:00, 1.25s/it]
237
- Epoch 74/78, Loss: 1.089287
238
  Checkpoint saved to checkpoint.pt
239
- Epoch 75/78: 100% 82/82 [01:42<00:00, 1.25s/it]
240
- Epoch 75/78, Loss: 1.010704
241
  Checkpoint saved to checkpoint.pt
242
- Epoch 76/78: 100% 82/82 [01:42<00:00, 1.24s/it]
243
- Epoch 76/78, Loss: 0.979691
244
  Checkpoint saved to checkpoint.pt
245
- Epoch 77/78: 100% 82/82 [01:41<00:00, 1.24s/it]
246
- Epoch 77/78, Loss: 0.918769
247
  Checkpoint saved to checkpoint.pt
248
- Epoch 78/78: 100% 82/82 [01:41<00:00, 1.24s/it]
249
- Epoch 78/78, Loss: 0.904894
250
  Checkpoint saved to checkpoint.pt
251
- Total training time: 14 minutes and 37 seconds
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
252
  Model saved to trained_model_quantized.pt with quantization and compression.
253
-
 
215
  Total training time: 127 minutes and 37 seconds
216
  Model saved to trained_model_quantized.pt with quantization and compression.
217
  ==================================================
218
+ Increased epoch to 91 to reach loss < 0.099999
219
  ==================================================
220
  using device: cuda
221
  loaded 338025 tokens
 
224
  Loading checkpoint from checkpoint.pt
225
  /content/erav3-s12-transformer-model/erav3-s12-transformer-model/transformer.py:262: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
226
  checkpoint = torch.load(checkpoint_file)
227
+ Epoch 71/91: 100% 82/82 [01:36<00:00, 1.18s/it]
228
+ Epoch 71/91, Loss: 1.453567
229
  Checkpoint saved to checkpoint.pt
230
+ Epoch 72/91: 100% 82/82 [01:42<00:00, 1.25s/it]
231
+ Epoch 72/91, Loss: 1.162141
232
  Checkpoint saved to checkpoint.pt
233
+ Epoch 73/91: 100% 82/82 [01:42<00:00, 1.24s/it]
234
+ Epoch 73/91, Loss: 1.174683
235
  Checkpoint saved to checkpoint.pt
236
+ Epoch 74/91: 100% 82/82 [01:42<00:00, 1.25s/it]
237
+ Epoch 74/91, Loss: 1.089287
238
  Checkpoint saved to checkpoint.pt
239
+ Epoch 75/91: 100% 82/82 [01:42<00:00, 1.25s/it]
240
+ Epoch 75/91, Loss: 1.010704
241
  Checkpoint saved to checkpoint.pt
242
+ Epoch 76/91: 100% 82/82 [01:42<00:00, 1.24s/it]
243
+ Epoch 76/91, Loss: 0.979691
244
  Checkpoint saved to checkpoint.pt
245
+ Epoch 77/91: 100% 82/82 [01:41<00:00, 1.24s/it]
246
+ Epoch 77/91, Loss: 0.918769
247
  Checkpoint saved to checkpoint.pt
248
+ Epoch 78/91: 100% 82/82 [01:41<00:00, 1.24s/it]
249
+ Epoch 78/91, Loss: 0.904894
250
  Checkpoint saved to checkpoint.pt
251
+ Epoch 79/91: 100% 82/82 [01:41<00:00, 1.24s/it]
252
+ Epoch 79/91, Loss: 0.851253
253
+ Checkpoint saved to checkpoint.pt
254
+ Epoch 80/91: 100% 82/82 [01:41<00:00, 1.24s/it]
255
+ Epoch 80/91, Loss: 0.810432
256
+ Checkpoint saved to checkpoint.pt
257
+ Epoch 81/91: 100% 82/82 [01:41<00:00, 1.24s/it]
258
+ Epoch 81/91, Loss: 0.730137
259
+ Checkpoint saved to checkpoint.pt
260
+ Epoch 82/91: 100% 82/82 [01:41<00:00, 1.24s/it]
261
+ Epoch 82/91, Loss: 0.677209
262
+ Checkpoint saved to checkpoint.pt
263
+ Epoch 83/91: 100% 82/82 [01:41<00:00, 1.24s/it]
264
+ Epoch 83/91, Loss: 0.618384
265
+ Checkpoint saved to checkpoint.pt
266
+ Epoch 84/91: 100% 82/82 [01:41<00:00, 1.24s/it]
267
+ Epoch 84/91, Loss: 0.570543
268
+ Checkpoint saved to checkpoint.pt
269
+ Epoch 85/91: 100% 82/82 [01:41<00:00, 1.24s/it]
270
+ Epoch 85/91, Loss: 0.516322
271
+ Checkpoint saved to checkpoint.pt
272
+ Epoch 86/91: 100% 82/82 [01:41<00:00, 1.24s/it]
273
+ Epoch 86/91, Loss: 0.432109
274
+ Checkpoint saved to checkpoint.pt
275
+ Epoch 87/91: 100% 82/82 [01:41<00:00, 1.24s/it]
276
+ Epoch 87/91, Loss: 0.320471
277
+ Checkpoint saved to checkpoint.pt
278
+ Epoch 88/91: 100% 82/82 [01:41<00:00, 1.24s/it]
279
+ Epoch 88/91, Loss: 0.271299
280
+ Checkpoint saved to checkpoint.pt
281
+ Epoch 89/91: 100% 82/82 [01:41<00:00, 1.24s/it]
282
+ Epoch 89/91, Loss: 0.218522
283
+ Checkpoint saved to checkpoint.pt
284
+ Epoch 90/91: 100% 82/82 [01:41<00:00, 1.24s/it]
285
+ Epoch 90/91, Loss: 0.121739
286
+ Checkpoint saved to checkpoint.pt
287
+ Epoch 91/91: 100% 82/82 [01:41<00:00, 1.24s/it]
288
+ Epoch 91/91, Loss: 0.089421
289
+ Checkpoint saved to checkpoint.pt
290
+ Total training time: 34 minutes and 28 seconds
291
  Model saved to trained_model_quantized.pt with quantization and compression.