thom moorrrr (#65)
adding audio (df5bb27c50ff5deb2a9d59e3fd0b4087b29c408e)
- .gitattributes +1 -0
- assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav +3 -0
- assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav +3 -0
- assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav +3 -0
- assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav +3 -0
- assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav +3 -0
- dist/assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav +3 -0
- dist/assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav +3 -0
- dist/assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav +3 -0
- dist/assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav +3 -0
- dist/assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav +3 -0
- dist/index.html +58 -4
- src/index.html +58 -4
.gitattributes
CHANGED
@@ -31,6 +31,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
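Context for the pointer files below (a hedged aside, assuming Git LFS is installed and the commands are run from the repository root): the new *.wav rule is the line that the standard Git LFS tracking command appends to .gitattributes, which is why each added audio file is stored as a small LFS pointer (version, oid, size) rather than as raw WAV data.

git lfs track "*.wav"
git add .gitattributes assets/audio/*.wav dist/assets/audio/*.wav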
assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:552f71aef82738f9b5c9f1d6be495e0f83cec0eabf485066628badb3283cb4b8
+size 48830444
assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b86787f64d918dea2417a22ce52e85b5c373658de67f6efe718a6635a45c71c4
+size 50145644
assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6462483d36777733f18386000fc8bb95ff3cf9fae38be7248d6370ef2f58aeb4
+size 44644844
assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3c13f529617991e627f8cb0c7d9ccb5cf8673f16e92476501fc73b9624db4d87
+size 33004844
assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7be1f772696a43160b79b14d4a80bef8ac48d4a91fc827dee98a17369e3d7919
+size 70838444
dist/assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:552f71aef82738f9b5c9f1d6be495e0f83cec0eabf485066628badb3283cb4b8
+size 48830444
dist/assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b86787f64d918dea2417a22ce52e85b5c373658de67f6efe718a6635a45c71c4
+size 50145644
dist/assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6462483d36777733f18386000fc8bb95ff3cf9fae38be7248d6370ef2f58aeb4
+size 44644844
dist/assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3c13f529617991e627f8cb0c7d9ccb5cf8673f16e92476501fc73b9624db4d87
+size 33004844
dist/assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7be1f772696a43160b79b14d4a80bef8ac48d4a91fc827dee98a17369e3d7919
+size 70838444
dist/index.html
CHANGED
@@ -277,7 +277,17 @@
 <!-- <p>Time to get started by quickly revisiting the basic training steps of an LLM!</p> -->
 
 <h2>First Steps: Training on one GPU</h2>
-
+
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>If you fancy adding a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the first sections of this book as you're reading along.</p>
+</div>
+</div>
+
 <p>Let’s start by quickly reviewing the very basics of model training before we start to scale to many GPUs. When a model is trained on a single GPU, the training typically consists of three steps: </p>
 
 <ol>
@@ -617,7 +627,18 @@
 <p>Now let’s get a larger workstation 🖥️ with a couple of GPUs and start investigating our first scaling technique called <em><strong>data parallelism</strong> which –as we'll see– is just a parallel version of gradient accumulation</em>.</p>
 
 <h2>Data Parallelism</h2>
-
+
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+</div>
+</div>
+
+
 <p>The idea behind data parallelism (DP) is to replicate the model on several GPUs (we call the replica's “model instances”) and run forward and backward passes on different micro batches of data in parallel for each GPU, hence the name Data Parallelism. You've probably already seen Data Parallelism in simple training examples but as you'll soon see we'll dive quite deeper in this section so stay tuned even if you know the general approach.</p>
 
 <p><img alt="image.png" src="/assets/images/dp_diagram.png" /></p>
@@ -947,6 +968,18 @@
 
 <h2>Tensor Parallelism</h2>
 
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+</div>
+</div>
+
+
+
 <p>So we have sharded the model’s parameters, gradients and optimizers states with ZeRO but we hit a limit once activation memory overtakes our memory budget. Welcome Tensor Parallelism (TP), a method which shards weights, gradients, and optimizers states as well as activations and without the need to gather them all prior to the computation. Seems like a dream! Let’s first have a look at how Tensor Parallel works with simple matrix multiplications.</p>
 
 <p>Tensor Parallelism leverages the mathematical properties of matrix multiplication <d-math>A \times B</d-math>. To understand how it works, let's examine two fundamental equations that make this parallelization possible:</p>
@@ -1380,7 +1413,18 @@
 <p>However, we still know that TP doesn't scale well across nodes, so what can we do if the model weights don't easily fit on 1 node? Here come another degree of parallelism, our forth one, called <strong>Pipeline Parallelism</strong>, to the rescue!</p>
 
 <h2>Pipeline Parallelism</h2>
-
+
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+</div>
+</div>
+
+
 <p>In the <a target="_self" href="#tensor-parallelism">Tensor Parallelism</a> section we saw that trying to scale Tensor parallelism past the number of GPUs per single node (typically 4 or 8) hit a lower bandwidth network called “inter-node connection” which can quite strongly impair our performances. We can see this clearly on e.g. the all-reduce operation when we benchmark it on our cluster across several nodes (each node has 8 GPUs):</p>
 
 <!-- <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe> -->
@@ -1979,7 +2023,17 @@
 <p>Time to turn the lights off and activate CUDA mode! </p>
 
 <h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
+
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+</div>
+</div>
+
 <p>Up to now our discussion has been focused on the high-level organization of our model operations. We’ve moved around computations on various accelerators, taking into account general memory constraints and high-level scheduling of the compute units.</p>
 
 <p>But this ignored all the optimizations we can do at a much lower level by carefully understanding how our model operations are scheduled and performed on each GPU.</p>
src/index.html
CHANGED
@@ -277,7 +277,17 @@
 <!-- <p>Time to get started by quickly revisiting the basic training steps of an LLM!</p> -->
 
 <h2>First Steps: Training on one GPU</h2>
-
+
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/Single GPU Model Training_ Memory, Batch Size, and Optimization.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>If you fancy adding a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the first sections of this book as you're reading along.</p>
+</div>
+</div>
+
 <p>Let’s start by quickly reviewing the very basics of model training before we start to scale to many GPUs. When a model is trained on a single GPU, the training typically consists of three steps: </p>
 
 <ol>
@@ -617,7 +627,18 @@
 <p>Now let’s get a larger workstation 🖥️ with a couple of GPUs and start investigating our first scaling technique called <em><strong>data parallelism</strong> which –as we'll see– is just a parallel version of gradient accumulation</em>.</p>
 
 <h2>Data Parallelism</h2>
-
+
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/Data Parallelism and ZeRO_ Scaling Model Training.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+</div>
+</div>
+
+
 <p>The idea behind data parallelism (DP) is to replicate the model on several GPUs (we call the replica's “model instances”) and run forward and backward passes on different micro batches of data in parallel for each GPU, hence the name Data Parallelism. You've probably already seen Data Parallelism in simple training examples but as you'll soon see we'll dive quite deeper in this section so stay tuned even if you know the general approach.</p>
 
 <p><img alt="image.png" src="/assets/images/dp_diagram.png" /></p>
@@ -947,6 +968,18 @@
 
 <h2>Tensor Parallelism</h2>
 
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/Tensor, Sequence, and Context Parallelism for Distributed Training.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+</div>
+</div>
+
+
+
 <p>So we have sharded the model’s parameters, gradients and optimizers states with ZeRO but we hit a limit once activation memory overtakes our memory budget. Welcome Tensor Parallelism (TP), a method which shards weights, gradients, and optimizers states as well as activations and without the need to gather them all prior to the computation. Seems like a dream! Let’s first have a look at how Tensor Parallel works with simple matrix multiplications.</p>
 
 <p>Tensor Parallelism leverages the mathematical properties of matrix multiplication <d-math>A \times B</d-math>. To understand how it works, let's examine two fundamental equations that make this parallelization possible:</p>
@@ -1380,7 +1413,18 @@
 <p>However, we still know that TP doesn't scale well across nodes, so what can we do if the model weights don't easily fit on 1 node? Here come another degree of parallelism, our forth one, called <strong>Pipeline Parallelism</strong>, to the rescue!</p>
 
 <h2>Pipeline Parallelism</h2>
-
+
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/5D Parallelism_ Scaling Deep Learning Model Training Strategies.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+</div>
+</div>
+
+
 <p>In the <a target="_self" href="#tensor-parallelism">Tensor Parallelism</a> section we saw that trying to scale Tensor parallelism past the number of GPUs per single node (typically 4 or 8) hit a lower bandwidth network called “inter-node connection” which can quite strongly impair our performances. We can see this clearly on e.g. the all-reduce operation when we benchmark it on our cluster across several nodes (each node has 8 GPUs):</p>
 
 <!-- <iframe class="l-body-outset" id="plotFrame11" src="assets/data/benchmarks/pp_comm_bandwidth.html" width="90%" scrolling="no" frameborder="0"></iframe> -->
@@ -1979,7 +2023,17 @@
 <p>Time to turn the lights off and activate CUDA mode! </p>
 
 <h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
+
+<div class="audio-container">
+<audio controls>
+<source src="../assets/audio/GPU Deep Dive_ Optimizing Model Performance and Memory Efficiency.wav" type="audio/wav">
+Your browser does not support the audio element.
+</audio>
+<div class="figure-legend">
+<p>To add a podcast feeling to your reading experience, feel free to listen to the NotebookLM hosts discussing the following sections of this book as you're reading along.</p>
+</div>
+</div>
+
 <p>Up to now our discussion has been focused on the high-level organization of our model operations. We’ve moved around computations on various accelerators, taking into account general memory constraints and high-level scheduling of the compute units.</p>
 
 <p>But this ignored all the optimizations we can do at a much lower level by carefully understanding how our model operations are scheduled and performed on each GPU.</p>