using device: cuda
loaded 338025 tokens
1 epoch = 82 batches
Number of model parameters: 124439808
Epoch 1/70: 100% 82/82 [01:38<00:00, 1.20s/it]
Epoch 1/70, Loss: 6.169636
Checkpoint saved to checkpoint.pt
Epoch 2/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 2/70, Loss: 5.720689
Checkpoint saved to checkpoint.pt
Epoch 3/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 3/70, Loss: 5.390238
Checkpoint saved to checkpoint.pt
Epoch 4/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 4/70, Loss: 5.164030
Checkpoint saved to checkpoint.pt
Epoch 5/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 5/70, Loss: 5.051653
Checkpoint saved to checkpoint.pt
Epoch 6/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 6/70, Loss: 4.947546
Checkpoint saved to checkpoint.pt
Epoch 7/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 7/70, Loss: 4.893464
Checkpoint saved to checkpoint.pt
Epoch 8/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 8/70, Loss: 4.785249
Checkpoint saved to checkpoint.pt
Epoch 9/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 9/70, Loss: 4.773346
Checkpoint saved to checkpoint.pt
Epoch 10/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 10/70, Loss: 4.669469
Checkpoint saved to checkpoint.pt
Epoch 11/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 11/70, Loss: 4.617172
Checkpoint saved to checkpoint.pt
Epoch 12/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 12/70, Loss: 4.594382
Checkpoint saved to checkpoint.pt
Epoch 13/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 13/70, Loss: 4.554847
Checkpoint saved to checkpoint.pt
Epoch 14/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 14/70, Loss: 4.506260
Checkpoint saved to checkpoint.pt
Epoch 15/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 15/70, Loss: 4.416086
Checkpoint saved to checkpoint.pt
Epoch 16/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 16/70, Loss: 4.370214
Checkpoint saved to checkpoint.pt
Epoch 17/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 17/70, Loss: 4.278370
Checkpoint saved to checkpoint.pt
Epoch 18/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 18/70, Loss: 4.304771
Checkpoint saved to checkpoint.pt
Epoch 19/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 19/70, Loss: 4.209321
Checkpoint saved to checkpoint.pt
Epoch 20/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 20/70, Loss: 4.175936
Checkpoint saved to checkpoint.pt
Epoch 21/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 21/70, Loss: 4.071361
Checkpoint saved to checkpoint.pt
Epoch 22/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 22/70, Loss: 4.071530
Checkpoint saved to checkpoint.pt
Epoch 23/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 23/70, Loss: 4.053171
Checkpoint saved to checkpoint.pt
Epoch 24/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 24/70, Loss: 3.923664
Checkpoint saved to checkpoint.pt
Epoch 25/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 25/70, Loss: 3.827437
Checkpoint saved to checkpoint.pt
Epoch 26/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 26/70, Loss: 3.767063
Checkpoint saved to checkpoint.pt
Epoch 27/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 27/70, Loss: 3.711340
Checkpoint saved to checkpoint.pt
Epoch 28/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 28/70, Loss: 3.622302
Checkpoint saved to checkpoint.pt
Epoch 29/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 29/70, Loss: 3.583114
Checkpoint saved to checkpoint.pt
Epoch 30/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 30/70, Loss: 3.517573
Checkpoint saved to checkpoint.pt
Epoch 31/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 31/70, Loss: 3.445611
Checkpoint saved to checkpoint.pt
Epoch 32/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 32/70, Loss: 3.410571
Checkpoint saved to checkpoint.pt
Epoch 33/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 33/70, Loss: 3.282128
Checkpoint saved to checkpoint.pt
Epoch 34/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 34/70, Loss: 3.307455
Checkpoint saved to checkpoint.pt
Epoch 35/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 35/70, Loss: 3.126928
Checkpoint saved to checkpoint.pt
Epoch 36/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 36/70, Loss: 3.057953
Checkpoint saved to checkpoint.pt
Epoch 37/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 37/70, Loss: 3.082567
Checkpoint saved to checkpoint.pt
Epoch 38/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 38/70, Loss: 3.066772
Checkpoint saved to checkpoint.pt
Epoch 39/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 39/70, Loss: 2.943954
Checkpoint saved to checkpoint.pt
Epoch 40/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 40/70, Loss: 2.874876
Checkpoint saved to checkpoint.pt
Epoch 41/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 41/70, Loss: 2.781206
Checkpoint saved to checkpoint.pt
Epoch 42/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 42/70, Loss: 2.729423
Checkpoint saved to checkpoint.pt
Epoch 43/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 43/70, Loss: 2.656427
Checkpoint saved to checkpoint.pt
Epoch 44/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 44/70, Loss: 2.641519
Checkpoint saved to checkpoint.pt
Epoch 45/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 45/70, Loss: 2.593380
Checkpoint saved to checkpoint.pt
Epoch 46/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 46/70, Loss: 2.504074
Checkpoint saved to checkpoint.pt
Epoch 47/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 47/70, Loss: 2.510426
Checkpoint saved to checkpoint.pt
Epoch 48/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 48/70, Loss: 2.465840
Checkpoint saved to checkpoint.pt
Epoch 49/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 49/70, Loss: 2.339541
Checkpoint saved to checkpoint.pt
Epoch 50/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 50/70, Loss: 2.288784
Checkpoint saved to checkpoint.pt
Epoch 51/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 51/70, Loss: 2.272939
Checkpoint saved to checkpoint.pt
Epoch 52/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 52/70, Loss: 2.150897
Checkpoint saved to checkpoint.pt
Epoch 53/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 53/70, Loss: 2.096288
Checkpoint saved to checkpoint.pt
Epoch 54/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 54/70, Loss: 2.057416
Checkpoint saved to checkpoint.pt
Epoch 55/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 55/70, Loss: 1.962530
Checkpoint saved to checkpoint.pt
Epoch 56/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 56/70, Loss: 1.930993
Checkpoint saved to checkpoint.pt
Epoch 57/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 57/70, Loss: 1.854412
Checkpoint saved to checkpoint.pt
Epoch 58/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 58/70, Loss: 1.818957
Checkpoint saved to checkpoint.pt
Epoch 59/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 59/70, Loss: 1.764919
Checkpoint saved to checkpoint.pt
Epoch 60/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 60/70, Loss: 1.741000
Checkpoint saved to checkpoint.pt
Epoch 61/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 61/70, Loss: 1.694582
Checkpoint saved to checkpoint.pt
Epoch 62/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 62/70, Loss: 1.751990
Checkpoint saved to checkpoint.pt
Epoch 63/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 63/70, Loss: 1.664971
Checkpoint saved to checkpoint.pt
Epoch 64/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 64/70, Loss: 1.557876
Checkpoint saved to checkpoint.pt
Epoch 65/70: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 65/70, Loss: 1.543549
Checkpoint saved to checkpoint.pt
Epoch 66/70: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 66/70, Loss: 1.436256
Checkpoint saved to checkpoint.pt
Epoch 67/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 67/70, Loss: 1.352293
Checkpoint saved to checkpoint.pt
Epoch 68/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 68/70, Loss: 1.361581
Checkpoint saved to checkpoint.pt
Epoch 69/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 69/70, Loss: 1.308131
Checkpoint saved to checkpoint.pt
Epoch 70/70: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 70/70, Loss: 1.287876
Checkpoint saved to checkpoint.pt
Total training time: 127 minutes and 37 seconds
Model saved to trained_model_quantized.pt with quantization and compression.
==================================================
Increased epoch to 91 to reach loss < 0.099999
==================================================
using device: cuda
loaded 338025 tokens
1 epoch = 82 batches
Number of model parameters: 124439808
Loading checkpoint from checkpoint.pt
/content/erav3-s12-transformer-model/erav3-s12-transformer-model/transformer.py:262: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_file)
Epoch 71/91: 100% 82/82 [01:36<00:00, 1.18s/it]
Epoch 71/91, Loss: 1.453567
Checkpoint saved to checkpoint.pt
Epoch 72/91: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 72/91, Loss: 1.162141
Checkpoint saved to checkpoint.pt
Epoch 73/91: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 73/91, Loss: 1.174683
Checkpoint saved to checkpoint.pt
Epoch 74/91: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 74/91, Loss: 1.089287
Checkpoint saved to checkpoint.pt
Epoch 75/91: 100% 82/82 [01:42<00:00, 1.25s/it]
Epoch 75/91, Loss: 1.010704
Checkpoint saved to checkpoint.pt
Epoch 76/91: 100% 82/82 [01:42<00:00, 1.24s/it]
Epoch 76/91, Loss: 0.979691
Checkpoint saved to checkpoint.pt
Epoch 77/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 77/91, Loss: 0.918769
Checkpoint saved to checkpoint.pt
Epoch 78/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 78/91, Loss: 0.904894
Checkpoint saved to checkpoint.pt
Epoch 79/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 79/91, Loss: 0.851253
Checkpoint saved to checkpoint.pt
Epoch 80/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 80/91, Loss: 0.810432
Checkpoint saved to checkpoint.pt
Epoch 81/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 81/91, Loss: 0.730137
Checkpoint saved to checkpoint.pt
Epoch 82/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 82/91, Loss: 0.677209
Checkpoint saved to checkpoint.pt
Epoch 83/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 83/91, Loss: 0.618384
Checkpoint saved to checkpoint.pt
Epoch 84/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 84/91, Loss: 0.570543
Checkpoint saved to checkpoint.pt
Epoch 85/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 85/91, Loss: 0.516322
Checkpoint saved to checkpoint.pt
Epoch 86/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 86/91, Loss: 0.432109
Checkpoint saved to checkpoint.pt
Epoch 87/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 87/91, Loss: 0.320471
Checkpoint saved to checkpoint.pt
Epoch 88/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 88/91, Loss: 0.271299
Checkpoint saved to checkpoint.pt
Epoch 89/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 89/91, Loss: 0.218522
Checkpoint saved to checkpoint.pt
Epoch 90/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 90/91, Loss: 0.121739
Checkpoint saved to checkpoint.pt
Epoch 91/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 91/91, Loss: 0.089421
Checkpoint saved to checkpoint.pt
Total training time: 34 minutes and 28 seconds
Model saved to trained_model_quantized.pt with quantization and compression.
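
The log above can be checked against the stated target (loss < 0.099999) with a small script. The following is only a sketch using the Python standard library; the names `LOSS_RE`, `parse_losses`, and `first_epoch_below` are hypothetical helpers introduced here and are not part of the training code, and the sample text is a three-epoch excerpt of the log rather than the full file.

```python
import re

# Matches records such as "Epoch 91/91, Loss: 0.089421". The \s* tolerates a
# line break between "Loss:" and the number. Progress-bar lines
# ("Epoch 91/91: 100% ...") do not match because they lack ", Loss:".
LOSS_RE = re.compile(r"Epoch (\d+)/(\d+), Loss:\s*(\d+\.\d+)")


def parse_losses(log_text):
    """Return (epoch, loss) pairs in the order they appear in the log text."""
    return [(int(epoch), float(loss))
            for epoch, _total, loss in LOSS_RE.findall(log_text)]


def first_epoch_below(losses, target):
    """Return the first epoch whose loss is below target, or None if none is."""
    for epoch, loss in losses:
        if loss < target:
            return epoch
    return None


# Excerpt copied from the end of the second run in the log.
sample = """Epoch 89/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 89/91, Loss: 0.218522
Checkpoint saved to checkpoint.pt
Epoch 90/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 90/91, Loss: 0.121739
Checkpoint saved to checkpoint.pt
Epoch 91/91: 100% 82/82 [01:41<00:00, 1.24s/it]
Epoch 91/91, Loss: 0.089421
Checkpoint saved to checkpoint.pt
"""

losses = parse_losses(sample)
print(first_epoch_below(losses, 0.099999))  # prints 91
```

To run it on the whole log, read the file into a string and pass that to `parse_losses` in place of `sample`; the pattern searches the full text, so it does not depend on how the records are split across lines.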