Hi lingvo contributors,
Thanks for the prompt response to my previous ticket on the Docker version.
I want to run GPipe together with data parallelism (DP) on an 8-GPU server. I searched around and found that num_splits_per_client, at program.py:80, seems to determine the degree of DP for the TPU trainer:
self.data_parallelism=p.num_splits_per_client
I set it to 2, expecting two DP workers, each running 4 GPipe pipeline stages (PP). With 1 GPU per stage, that is 8 GPUs in total (2 DP x 4 PP). However, I observed that only the first 4 GPUs were active while the last 4 GPUs sat idle, and the throughput was similar to that of the 4-GPU non-DP baseline. I suspect that this parameter is not taking effect. A screenshot of the GPU trace is attached below.
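To make the intended layout concrete, here is a minimal sketch of the placement described above. It is plain Python for illustration only and does not use any lingvo API; the helper name `expected_device_layout` and the contiguous replica-to-GPU numbering are assumptions, not something lingvo is known to do.

```python
def expected_device_layout(num_dp_replicas, num_pipeline_stages):
    """Map each (replica, stage) pair to a GPU index, one GPU per stage.

    Hypothetical helper: assumes replicas occupy contiguous GPU ranges.
    """
    return {
        (replica, stage): replica * num_pipeline_stages + stage
        for replica in range(num_dp_replicas)
        for stage in range(num_pipeline_stages)
    }


# 2 DP replicas x 4 pipeline stages = 8 GPUs in total.
layout = expected_device_layout(num_dp_replicas=2, num_pipeline_stages=4)
for (replica, stage), gpu in sorted(layout.items()):
    print(f"replica {replica}, stage {stage} -> /gpu:{gpu}")
```

Under this assumed numbering, the GPUs that stayed idle in the trace (4 to 7) are exactly the ones the second DP replica would occupy.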
What is the correct way to configure DP in lingvo? If possible, could you please provide an example of using DP together with pipeline parallelism, perhaps in the run_distributed.py format from /docker?
Thank you!