
fix(*:skip) Fix querying dataset size functions #3794

Closed

Conversation

KarhouTam
Contributor

Issue

Description

I found that in many places where Flower collaborates with PyTorch, `len(self.trainloader)` and `len(self.testloader)` are used to query the sizes of a client's train set and test set.
However, in PyTorch, `len(dataloader)` returns the number of batches that `dataloader.dataset` can be split into, not the actual size of `dataloader.dataset`. This means `len(dataloader) != len(dataloader.dataset)`, and I think the latter is the more reasonable quantity in a federated learning scenario.

For example:

>>> from torch.utils.data import DataLoader, Subset
>>> from torchvision.datasets import MNIST
>>> mnist = MNIST(...)
>>> subset_idxs = list(range(64))
>>> mnist_subset = Subset(mnist, indices=subset_idxs)
>>> dataloader_mnist_whole = DataLoader(mnist, batch_size=32)
>>> dataloader_mnist_subset = DataLoader(mnist_subset, batch_size=32)
>>> len(dataloader_mnist_whole)
1875
>>> len(dataloader_mnist_subset)
2
>>> len(dataloader_mnist_whole.dataset)
60000
>>> len(dataloader_mnist_subset.dataset)
64

Related issues/PRs

Proposal

So I searched globally for `len(self.trainloader)` and `len(self.testloader)` and replaced them with `len(self.trainloader.dataset)` and `len(self.testloader.dataset)`.
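To make the distinction concrete without requiring PyTorch, here is a pure-Python sketch of what the two length queries return; `batch_size` and the dataset sizes are illustrative, and `len_dataloader` mirrors `len(DataLoader(...))` with the default `drop_last=False` (batches rounded up).

```python
import math

def len_dataloader(dataset_size: int, batch_size: int) -> int:
    # Mirrors len(DataLoader(dataset, batch_size, drop_last=False)):
    # the number of (possibly incomplete) batches, rounded up.
    return math.ceil(dataset_size / batch_size)

def len_dataset(dataset_size: int) -> int:
    # Mirrors len(dataloader.dataset): the true number of examples.
    return dataset_size

batch_size = 32
for size in (64, 65):
    print(size, len_dataloader(size, batch_size), len_dataset(size))
# 64 samples span 2 batches, while 65 samples span 3: a one-sample
# difference in data becomes a 50% difference in the reported "size".
```

This is why, when the reported length is used as an aggregation weight, the dataset length is the more faithful measure than the batch count.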

Checklist

  • Implement proposed change
  • Write tests
  • Update documentation
  • Make CI checks pass
  • Ping maintainers on Slack (channel #contributions)

@jafermarq
Contributor

Hi @KarhouTam, thanks for opening the PR. This is just an implementation choice, right (and not a bug)? The weighting factor used for aggregation is simply a function of the batch count rather than the total number of examples. In most cases this will give near-identical results (differing only because some batches may be incomplete) compared to communicating the total number of samples instead (assuming all clients use the same batch size).

We can follow this approach in newer examples. For the baselines, I don't think we'll make the change.
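The "near-identical" claim above can be checked with a quick numeric sketch. The client dataset sizes and per-client metric values below are hypothetical, and `weighted_avg` is a stand-in for a FedAvg-style weighted aggregation, not code from the PR.

```python
import math

def weighted_avg(values, weights):
    # FedAvg-style weighted mean of per-client values.
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

batch_size = 32
client_sizes = [5000, 3000, 2017]    # hypothetical client dataset sizes
client_metrics = [0.80, 0.70, 0.60]  # hypothetical per-client values

# Weight by true sample counts vs. by batch counts (rounded up).
by_samples = weighted_avg(client_metrics, client_sizes)
by_batches = weighted_avg(
    client_metrics, [math.ceil(n / batch_size) for n in client_sizes]
)

# The two aggregates differ only through incomplete final batches,
# so with a shared batch size they land very close together.
print(by_samples, by_batches)
```

With a shared batch size and reasonably large datasets the discrepancy stays small, which supports following the `.dataset` convention in new examples without retrofitting existing baselines.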

@KarhouTam
Contributor Author

I think this change carries no risk at all, and it would make the aggregation weight factor uniform across Flower.

But anyway, I respect your decision. 🫡

(This PR can be reopened anytime if you change your mind.)

@KarhouTam KarhouTam closed this Jul 14, 2024
@KarhouTam KarhouTam deleted the fix-querying-dataset-size-functions branch September 7, 2024 05:11