Changelist generation doesn't scale #18

giorgiobasile · 2017-10-24T11:45:28Z

In the ChangeListExecutor class, the changelist_generator collects all the resources from a previously generated resourcelist using the update_previous_state() method. This was reasonable for the rspub-core filesystem-centric approach, but generally speaking this just doesn't scale (I'm working with ~70 million resources).
I guess the only reason for doing so is to be able to perform this check, which again is reasonable for file system resources. What should happen instead is that the resource generator lists the changes itself and labels them as C/U/D (created/updated/deleted), without relying on py-resourcesync to work them out. That calls for a specific generator for "change" resources (or a generator that can issue either resources or changes, depending on the strategy).
What I mean is something like:

resource_generator = self.resource_generator()
changes = {change for count, change in resource_generator(resource_metadata)}
created = [r for r in changes if r.change == "created"]
updated = [r for r in changes if r.change == "updated"]
deleted = [r for r in changes if r.change == "deleted"]
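
To make the idea in the snippet above concrete, here is a minimal, self-contained sketch. It assumes the generator yields (count, change) pairs and that each change carries a `change` attribute. The `Change` class, `change_generator`, `group_changes` and the example feed are hypothetical names for illustration, not part of py-resourcesync or rspub-core.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Change:
    """Hypothetical change record; not a py-resourcesync class."""
    uri: str
    change: str  # "created", "updated" or "deleted"


def change_generator(source: Iterable[Tuple[str, str]]) -> Iterator[Tuple[int, Change]]:
    """Yield (count, Change) pairs from a source that already knows its changes."""
    for count, (uri, kind) in enumerate(source, start=1):
        yield count, Change(uri=uri, change=kind)


def group_changes(pairs: Iterable[Tuple[int, Change]]) -> Dict[str, List[Change]]:
    """Group changes by kind in a single pass, without reading any resourcelist."""
    grouped: Dict[str, List[Change]] = defaultdict(list)
    for _count, change in pairs:
        grouped[change.change].append(change)
    return grouped


# Hypothetical change feed, e.g. rows from a "modified since" query.
feed = [
    ("http://example.org/a", "created"),
    ("http://example.org/b", "updated"),
    ("http://example.org/c", "deleted"),
    ("http://example.org/d", "updated"),
]
grouped = group_changes(change_generator(feed))
created = grouped["created"]
updated = grouped["updated"]
deleted = grouped["deleted"]
```

With this shape, memory use depends on the number of changes rather than on the total number of resources, which is the point of the proposal.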

What do you think? Does it sound reasonable?


hariharshankar · 2017-10-25T15:38:39Z

I left this part of the code as it was in rspub-core, so I will have to look into this in detail to understand what is happening. In the meantime, if you would like to submit a PR, please do so :-).

giorgiobasile · 2017-10-25T15:44:19Z

I'm trying to overcome this limitation keeping in mind the specific use case provided by CORE. This means that I'm working on something not really general, although it may be a good starting point. As soon as I have more time I will definitely work on a PR.
Btw, the general approach used by rspub-core for changes was:

1. I parse the old resourcelist
2. I apply the changes that are already recorded in previous changelists, if any
3. I see the differences and write them

As you can imagine, with 70 million resources this is time-, CPU- and memory-consuming. I discussed this with Henk a while back, and he confirmed that this is something we need to work on.
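
The three steps described above can be sketched roughly as follows. This is not the actual rspub-core code. It is a simplified illustration that assumes the state is a plain dict mapping URI to content hash, and `apply_changes` and `diff_states` are hypothetical helper names. It shows why the whole previous state has to sit in memory before any difference can be written.

```python
from typing import Dict, Iterable, List, Tuple

State = Dict[str, str]  # uri -> content hash (assumed representation)


def apply_changes(state: State, changes: Iterable[Tuple[str, str, str]]) -> State:
    """Step 2: replay (kind, uri, md5) entries from earlier changelists."""
    updated = dict(state)
    for kind, uri, md5 in changes:
        if kind == "deleted":
            updated.pop(uri, None)
        else:  # "created" or "updated"
            updated[uri] = md5
    return updated


def diff_states(previous: State, current: State) -> List[Tuple[str, str]]:
    """Step 3: compare the reconstructed previous state with the current one."""
    changes: List[Tuple[str, str]] = []
    for uri, md5 in current.items():
        if uri not in previous:
            changes.append(("created", uri))
        elif previous[uri] != md5:
            changes.append(("updated", uri))
    for uri in previous:
        if uri not in current:
            changes.append(("deleted", uri))
    return changes


# Step 1: the parsed old resourcelist (tiny stand-in for millions of entries).
old_resourcelist = {"a": "h1", "b": "h2", "c": "h3"}
earlier_changelist = [("updated", "b", "h2b"), ("deleted", "c", "")]
previous = apply_changes(old_resourcelist, earlier_changelist)
current = {"a": "h1", "b": "h2c", "d": "h4"}
result = diff_states(previous, current)
```

Both `previous` and `current` hold one entry per resource, so time and memory grow with the size of the full collection even when only a handful of resources changed.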

hariharshankar · 2017-10-25T15:46:05Z

Okay, thanks!

giorgiobasile added the enhancement label Oct 24, 2017

giorgiobasile changed the title ~~Changelist generation for large set of resources~~ Changelist generation doesn't scale Oct 24, 2017

giorgiobasile mentioned this issue Oct 26, 2017

Changes directly from Generator #19

Merged
