thomwolf (HF staff) committed
Commit df5bb27 · Parent: bed3d00

adding audio

.gitattributes CHANGED
@@ -31,6 +31,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
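For context, the added `*.wav filter=lfs diff=lfs merge=lfs -text` line tells Git to hand matching files to the LFS filter, so the repository stores a small text pointer instead of the audio bytes; it is the kind of entry `git lfs track "*.wav"` appends. As a rough, illustrative sketch (not part of the commit), the pattern logic can be approximated in Python:

from fnmatch import fnmatch
from pathlib import PurePosixPath

def lfs_patterns(gitattributes_text: str) -> list[str]:
    """Collect the .gitattributes patterns whose attributes route files through Git LFS."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if len(parts) > 1 and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns

def is_lfs_tracked(path: str, patterns: list[str]) -> bool:
    # Simplification: a .gitattributes pattern without a slash matches the
    # basename in any directory; fnmatch is close enough for globs like *.wav.
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)

attributes = "*.wasm filter=lfs diff=lfs merge=lfs -text\n*.wav filter=lfs diff=lfs merge=lfs -text\n"
print(is_lfs_tracked("assets/audio/example.wav", lfs_patterns(attributes)))  # True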
assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:552f71aef82738f9b5c9f1d6be495e0f83cec0eabf485066628badb3283cb4b8
+size 48830444
assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b86787f64d918dea2417a22ce52e85b5c373658de67f6efe718a6635a45c71c4
+size 50145644
assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6462483d36777733f18386000fc8bb95ff3cf9fae38be7248d6370ef2f58aeb4
+size 44644844
assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3c13f529617991e627f8cb0c7d9ccb5cf8673f16e92476501fc73b9624db4d87
+size 33004844
assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7be1f772696a43160b79b14d4a80bef8ac48d4a91fc827dee98a17369e3d7919
+size 70838444
dist/assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:552f71aef82738f9b5c9f1d6be495e0f83cec0eabf485066628badb3283cb4b8
+size 48830444
dist/assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b86787f64d918dea2417a22ce52e85b5c373658de67f6efe718a6635a45c71c4
+size 50145644
dist/assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6462483d36777733f18386000fc8bb95ff3cf9fae38be7248d6370ef2f58aeb4
+size 44644844
dist/assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3c13f529617991e627f8cb0c7d9ccb5cf8673f16e92476501fc73b9624db4d87
+size 33004844
dist/assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7be1f772696a43160b79b14d4a80bef8ac48d4a91fc827dee98a17369e3d7919
+size 70838444
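Each of the WAV entries above adds only a Git LFS pointer: three text lines recording the spec version, the SHA-256 of the real payload, and its size in bytes, while the audio itself lives in LFS storage and is fetched at checkout. A minimal, hypothetical sketch (not part of the commit) for parsing such a pointer and verifying a downloaded copy against it; file paths are placeholders:

import hashlib
from pathlib import Path

def parse_lfs_pointer(pointer_text: str) -> dict:
    """Split the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in pointer_text.strip().splitlines())
    return {"oid": fields["oid"].removeprefix("sha256:"), "size": int(fields["size"])}

def matches_pointer(local_file: Path, pointer: dict) -> bool:
    """Check that a downloaded file has the size and SHA-256 the pointer records."""
    if local_file.stat().st_size != pointer["size"]:
        return False
    digest = hashlib.sha256()
    with local_file.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer["oid"]

# Pointer contents taken from the first entry above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:552f71aef82738f9b5c9f1d6be495e0f83cec0eabf485066628badb3283cb4b8\n"
    "size 48830444\n"
)
print(pointer["size"])  # 48830444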
dist/index.html CHANGED
@@ -277,7 +277,17 @@
 <!-- <p>Time to get started by quickly revisiting the basic training steps of an LLM!</p> -->

 <h2>First Steps: Training on one GPU</h2>
-
+
+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>If you fancy adding a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the first sections of this book as you're reading along.</p>
+  </div>
+</div>
+
 <p>Let’s start by quickly reviewing the very basics of model training before we start to scale to many GPUs. When a model is trained on a single GPU, the training typically consists of three steps: </p>

 <ol>
@@ -617,7 +627,18 @@
 <p>Now let’s get a larger workstation 🖥️ with a couple of GPUs and start investigating our first scaling technique called <em><strong>data parallelism</strong> which –as we'll see– is just a parallel version of gradient accumulation</em>.</p>

 <h2>Data Parallelism</h2>
-
+
+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+  </div>
+</div>
+
+
 <p>The idea behind data parallelism (DP) is to replicate the model on several GPUs (we call the replica's “model instances”) and run forward and backward passes on different micro batches of data in parallel for each GPU, hence the name Data Parallelism. You've probably already seen Data Parallelism in simple training examples but as you'll soon see we'll dive quite deeper in this section so stay tuned even if you know the general approach.</p>

 <p><img alt="image.png" src="/assets/images/dp_diagram.png" /></p>
@@ -947,6 +968,18 @@

 <h2>Tensor Parallelism</h2>

+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+  </div>
+</div>
+
+
+
 <p>So we have sharded the model’s parameters, gradients and optimizers states with ZeRO but we hit a limit once activation memory overtakes our memory budget. Welcome Tensor Parallelism (TP), a method which shards weights, gradients, and optimizers states as well as activations and without the need to gather them all prior to the computation. Seems like a dream! Let’s first have a look at how Tensor Parallel works with simple matrix multiplications.</p>

 <p>Tensor Parallelism leverages the mathematical properties of matrix multiplication <d-math>A \times B</d-math>. To understand how it works, let's examine two fundamental equations that make this parallelization possible:</p>
@@ -1380,7 +1413,18 @@
 <p>However, we still know that TP doesn't scale well across nodes, so what can we do if the model weights don't easily fit on 1 node? Here come another degree of parallelism, our forth one, called <strong>Pipeline Parallelism</strong>, to the rescue!</p>

 <h2>Pipeline Parallelism</h2>
-
+
+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+  </div>
+</div>
+
+
 <p>In the <a target="_self" href="#tensor-parallelism">Tensor Parallelism</a> section we saw that trying to scale Tensor parallelism past the number of GPUs per single node (typically 4 or 8) hit a lower bandwidth network called “inter-node connection” which can quite strongly impair our performances. We can see this clearly on e.g. the all-reduce operation when we benchmark it on our cluster across several nodes (each node has 8 GPUs):</p>

 <!-- <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe> -->
@@ -1979,7 +2023,17 @@
 <p>Time to turn the lights off and activate CUDA mode! </p>

 <h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
+
+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+  </div>
+</div>
+
 <p>Up to now our discussion has been focused on the high-level organization of our model operations. We’ve moved around computations on various accelerators, taking into account general memory constraints and high-level scheduling of the compute units.</p>

 <p>But this ignored all the optimizations we can do at a much lower level by carefully understanding how our model operations are scheduled and performed on each GPU.</p>
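The same audio-container block is inserted under five headings, in both dist/index.html and src/index.html, with only the source path and caption varying. Purely as an illustration (this helper is hypothetical and not part of the commit), the snippet could be generated rather than copy-pasted:

from textwrap import dedent

def audio_embed(wav_src: str, caption: str) -> str:
    """Render the audio-container block used under each section heading."""
    return dedent(f"""\
        <div class="audio-container">
          <audio controls>
            <source src="{wav_src}" type="audio/wav">
            Your browser does not support the audio element.
          </audio>
          <div class="figure-legend">
            <p>{caption}</p>
          </div>
        </div>""")

print(audio_embed(
    "../assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav",
    "To add a podcast feeling to your reading experience, feel free to listen to the "
    "NotebookLM hosts discussing the following sections of this book as you're reading along.",
))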
src/index.html CHANGED
@@ -277,7 +277,17 @@
 <!-- <p>Time to get started by quickly revisiting the basic training steps of an LLM!</p> -->

 <h2>First Steps: Training on one GPU</h2>
-
+
+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>If you fancy adding a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the first sections of this book as you're reading along.</p>
+  </div>
+</div>
+
 <p>Let’s start by quickly reviewing the very basics of model training before we start to scale to many GPUs. When a model is trained on a single GPU, the training typically consists of three steps: </p>

 <ol>
@@ -617,7 +627,18 @@
 <p>Now let’s get a larger workstation 🖥️ with a couple of GPUs and start investigating our first scaling technique called <em><strong>data parallelism</strong> which –as we'll see– is just a parallel version of gradient accumulation</em>.</p>

 <h2>Data Parallelism</h2>
-
+
+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+  </div>
+</div>
+
+
 <p>The idea behind data parallelism (DP) is to replicate the model on several GPUs (we call the replica's “model instances”) and run forward and backward passes on different micro batches of data in parallel for each GPU, hence the name Data Parallelism. You've probably already seen Data Parallelism in simple training examples but as you'll soon see we'll dive quite deeper in this section so stay tuned even if you know the general approach.</p>

 <p><img alt="image.png" src="/assets/images/dp_diagram.png" /></p>
@@ -947,6 +968,18 @@

 <h2>Tensor Parallelism</h2>

+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+  </div>
+</div>
+
+
+
 <p>So we have sharded the model’s parameters, gradients and optimizers states with ZeRO but we hit a limit once activation memory overtakes our memory budget. Welcome Tensor Parallelism (TP), a method which shards weights, gradients, and optimizers states as well as activations and without the need to gather them all prior to the computation. Seems like a dream! Let’s first have a look at how Tensor Parallel works with simple matrix multiplications.</p>

 <p>Tensor Parallelism leverages the mathematical properties of matrix multiplication <d-math>A \times B</d-math>. To understand how it works, let's examine two fundamental equations that make this parallelization possible:</p>
@@ -1380,7 +1413,18 @@
 <p>However, we still know that TP doesn't scale well across nodes, so what can we do if the model weights don't easily fit on 1 node? Here come another degree of parallelism, our forth one, called <strong>Pipeline Parallelism</strong>, to the rescue!</p>

 <h2>Pipeline Parallelism</h2>
-
+
+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+  </div>
+</div>
+
+
 <p>In the <a target="_self" href="#tensor-parallelism">Tensor Parallelism</a> section we saw that trying to scale Tensor parallelism past the number of GPUs per single node (typically 4 or 8) hit a lower bandwidth network called “inter-node connection” which can quite strongly impair our performances. We can see this clearly on e.g. the all-reduce operation when we benchmark it on our cluster across several nodes (each node has 8 GPUs):</p>

 <!-- <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe> -->
@@ -1979,7 +2023,17 @@
 <p>Time to turn the lights off and activate CUDA mode! </p>

 <h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
+
+<div class="audio-container">
+  <audio controls>
+    <source src="../assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav" type="audio/wav">
+    Your browser does not support the audio element.
+  </audio>
+  <div class="figure-legend">
+    <p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+  </div>
+</div>
+
 <p>Up to now our discussion has been focused on the high-level organization of our model operations. We’ve moved around computations on various accelerators, taking into account general memory constraints and high-level scheduling of the compute units.</p>

 <p>But this ignored all the optimizations we can do at a much lower level by carefully understanding how our model operations are scheduled and performed on each GPU.</p>