Slow training with large batch size #72

favyen2 · 2024-10-04T16:26:59Z

The default PyTorch DataLoader has each worker load one whole batch. With a large batch size like batch_size=64, every worker loads 64 windows sequentially, so we have to wait for the first worker to finish an entire batch before training can start.
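To make the wait concrete, here is a rough back-of-the-envelope sketch. It assumes every window takes the same time to load and that workers do not contend for bandwidth; the 0.5 s per window and 8 workers are made-up numbers for illustration, not measurements.

```python
import math


def first_batch_wait_per_worker_batches(batch_size: int, seconds_per_item: float) -> float:
    """Wait when one worker loads the whole first batch sequentially."""
    return batch_size * seconds_per_item


def first_batch_wait_across_workers(
    batch_size: int, num_workers: int, seconds_per_item: float
) -> float:
    """Wait when the first batch's items are spread evenly over all workers."""
    items_per_worker = math.ceil(batch_size / num_workers)
    return items_per_worker * seconds_per_item


# Hypothetical numbers: 64 windows per batch, 8 workers, 0.5 s per window.
per_worker = first_batch_wait_per_worker_batches(64, 0.5)
across = first_batch_wait_across_workers(64, 8, 0.5)
print(per_worker, across)
```

This only models the delay before the first batch arrives. It ignores worker startup and any GCS throughput limits.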

Maybe it'd be better to form a batch by combining items across workers, given that loading from GCS can be slow?
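One possible way to do this, as a sketch rather than a tested fix: pass `batch_size=None` to the DataLoader so automatic batching is disabled and each worker returns single samples, then group samples into batches in the main process. The `rebatch` helper below is a hypothetical name, and the commented usage assumes a map-style `dataset` and a PyTorch version that exposes `default_collate` in `torch.utils.data`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def rebatch(samples: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of single samples into lists of up to batch_size items."""
    it = iter(samples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Assumed usage with PyTorch (not run here):
#
#   from torch.utils.data import DataLoader, default_collate
#
#   # batch_size=None disables automatic batching, so workers yield one
#   # sample at a time and consecutive samples come from different workers.
#   loader = DataLoader(dataset, batch_size=None, shuffle=True, num_workers=8)
#   for samples in rebatch(loader, 64):
#       batch = default_collate(samples)
#       ...  # training step
```

With this setup the number of samples in flight is roughly `num_workers * prefetch_factor`, so `prefetch_factor` may need to be raised for large batches. Sending samples one at a time also adds some inter-process overhead, so it would be worth benchmarking before adopting.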

Or maybe we should switch to Weka and find a good way to sync between GCS and Weka; that could solve the problem.
