Changelist generation doesn't scale #18

Open
giorgiobasile opened this issue Oct 24, 2017 · 3 comments

@giorgiobasile
Member

giorgiobasile commented Oct 24, 2017

In the ChangeListExecutor class, the changelist_generator loads every resource from a previously generated resourcelist through the update_previous_state() method. That was reasonable for the filesystem-centric approach of rspub-core, but it does not scale in general (I'm working with ~70 million resources).
I guess the only reason for doing so is to be able to perform this check, which again is reasonable for file system resources. What should happen instead is that the resource generator itself lists the changes and labels them as C/U/D, without relying on py-resourcesync to work them out. You would therefore use a specific generator for "change" resources (or make a generator able to issue either resources or changes, depending on the strategy).
What I mean is something like:

# Sketch: assumes the generator yields (count, change) pairs and that each
# change already carries its own label in a `change` attribute.
resource_generator = self.resource_generator()
changes = {change for _, change in resource_generator(resource_metadata)}
created = [r for r in changes if r.change == "created"]
updated = [r for r in changes if r.change == "updated"]
deleted = [r for r in changes if r.change == "deleted"]
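
To make the idea concrete outside the executor class, here is a minimal standalone sketch. It is not py-resourcesync code: the `Change` record, `change_generator` and `group_changes` are hypothetical names, and the source is assumed to hand back rows that are already labelled as created, updated or deleted.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Change:
    """Hypothetical change record; not a py-resourcesync class."""
    uri: str
    change: str  # "created", "updated" or "deleted"


def change_generator(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[int, Change]]:
    """Yield (count, change) pairs from rows already labelled by the source."""
    for count, (uri, label) in enumerate(rows, start=1):
        yield count, Change(uri=uri, change=label)


def group_changes(pairs: Iterable[Tuple[int, Change]]) -> Dict[str, List[Change]]:
    """Bucket changes by label in a single pass over the generator."""
    grouped: Dict[str, List[Change]] = defaultdict(list)
    for _, change in pairs:
        grouped[change.change].append(change)
    return grouped


# Assumed input: (URI, label) rows, e.g. read from a database change log.
rows = [
    ("http://example.org/a", "created"),
    ("http://example.org/b", "updated"),
    ("http://example.org/c", "deleted"),
    ("http://example.org/d", "created"),
]
grouped = group_changes(change_generator(rows))
created = grouped["created"]
updated = grouped["updated"]
deleted = grouped["deleted"]
```

Only the changes are held in memory here, never the full set of resources, so the cost follows the number of changes rather than the size of the collection.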

What do you think? Does it sound reasonable?

@giorgiobasile changed the title from "Changelist generation for large set of resources" to "Changelist generation doesn't scale" on Oct 24, 2017
@hariharshankar
Contributor

I left this part of the code as it was in rspub-core, so I will have to look into this in detail to understand what is happening. In the meantime, if you would like to submit a PR, please do so :-).

@giorgiobasile
Member Author

giorgiobasile commented Oct 25, 2017

I'm trying to overcome this limitation with the specific use case provided by CORE in mind. This means that what I'm working on is not really general, although it may be a good starting point. As soon as I have more time I will definitely work on a PR.
Btw, the general approach used by rspub-core for changes was:

  • I parse the old resourcelist
  • I apply the changes that are already recorded in previous changelists, if any
  • I see the differences and write them
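
The three steps above can be sketched roughly as follows. This is a simplified illustration, not the actual rspub-core code: resource state is reduced to a plain dict of URI to content hash, and `apply_changelists` and `diff_states` are hypothetical helper names.

```python
from typing import Dict, Iterable, List, Tuple

State = Dict[str, str]         # URI -> content hash (simplified, hypothetical)
Change = Tuple[str, str, str]  # (URI, content hash, "created" | "updated" | "deleted")


def apply_changelists(resourcelist: State, changelists: Iterable[List[Change]]) -> State:
    """Steps 1 and 2: start from the old resourcelist and replay recorded changes."""
    state = dict(resourcelist)
    for changelist in changelists:
        for uri, digest, label in changelist:
            if label == "deleted":
                state.pop(uri, None)
            else:
                state[uri] = digest
    return state


def diff_states(previous: State, current: State) -> List[Change]:
    """Step 3: compare the rebuilt previous state with the current one."""
    changes: List[Change] = []
    for uri, digest in current.items():
        if uri not in previous:
            changes.append((uri, digest, "created"))
        elif previous[uri] != digest:
            changes.append((uri, digest, "updated"))
    for uri, digest in previous.items():
        if uri not in current:
            changes.append((uri, digest, "deleted"))
    return changes


old_resourcelist = {"a": "h1", "b": "h2", "c": "h3"}
earlier_changelists = [[("b", "h2x", "updated"), ("d", "h4", "created")]]
current = {"a": "h1", "b": "h2y", "d": "h4", "e": "h5"}

previous = apply_changelists(old_resourcelist, earlier_changelists)
new_changes = diff_states(previous, current)
```

Both `previous` and `current` hold one entry per resource, so memory and comparison time grow with the size of the whole collection, not with the number of changes.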

As you can imagine, with 70 million resources this is expensive in time, CPU and memory. I discussed this with Henk a while back, and he confirmed that it was something we need to work on.

@hariharshankar
Contributor

Okay, thanks!
