
Make generate_path requests a redis queue #92

Open
Elucidation opened this issue Jun 19, 2023 · 10 comments
Assignees
Labels
enhancement New feature or request

Comments

@Elucidation
Owner

Currently, as the number of robots increases, the number of generate_path requests also increases in a roughly linear 1:1 relationship.

Running on a dev machine with ~220 robots in a larger warehouse, generate_path still has a very stable processing time of < 30 ms, with an average of 2.88 ms.

[image: generate_path processing time plot]

From this, with the goal of making pathfinding a scalable process, we want the robot manager to parcel out generate_path requests to a queue, and then have N worker nodes pull from the queue, compute the paths, and return them.

Analysis

One question is whether there's a way to identify generate_path requests that will likely take < 1 ms or so, in which case it might be worth keeping those local instead of pushing them onto a queue.

Looking at the logs and doing some pair-plots and a correlation matrix:

[image: pair-plots and correlation matrix of generate_path log metrics]

Manhattan distance (something easily computable in O(1) time before generate_path from the start and end locations) doesn't seem to correlate much with duration (0.37).
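For reference, a minimal sketch of that O(1) proxy; the (row, col) position tuples and the threshold are assumptions for illustration, not existing code:

def manhattan_distance(start, end):
    """O(1) proxy for path cost, computed from (row, col) grid positions."""
    return abs(start[0] - end[0]) + abs(start[1] - end[1])

# Hypothetical pre-filter: keep "probably cheap" requests local and push
# the rest onto the queue. The weak 0.37 correlation above suggests this
# proxy can't be relied on by itself.
def probably_cheap(start, end, threshold=10):
    return manhattan_distance(start, end) <= threshold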

@Elucidation
Owner Author

Elucidation commented Jun 19, 2023

There is a bit of a correlation in that for very small manhattan distances, duration will generally be on the lower end.

[image: duration vs manhattan distance]

Splitting the groups by manhattan distance <= 10 gives us some ability to identify generate_path calls that will be shorter rather than longer.
[image: duration distributions split by manhattan distance <= 10]

Even with that, we still see durations up to 10 ms (plus the one outlier), which all suggests it might be worth putting everything on a network queue instead of keeping some requests local.

@Elucidation Elucidation self-assigned this Jun 19, 2023
@Elucidation Elucidation added the enhancement New feature or request label Jun 19, 2023
@Elucidation
Owner Author

Thinking about ways to do this.

Right now, for each job with a robot that needs to path somewhere (sometimes home too; in the worst case, timing errors on reset can currently happen if all robots try to go home at once and it takes too long, possibly with collisions), the robot allocator calls generate_path, finds a path or not, and the job continues.
To make this async, we likely need to add more states to the job, like a waiting-for-path state, either for every case or some catch-all that then continues.

graph LR
  A[WAITING_TO_START] --> B
  B[PICKING_ITEM] --> C
  C[ITEM_PICKED] --> D
  D[GOING_TO_STATION] --> E
  E[ITEM_DROPPED] --> F
  F[RETURNING_HOME] --> G
  G[COMPLETE]
  I[ERROR]

Approach 1

Add in-between states to the job states:

WAITING_TO_START = 1
WAITING_FOR_PATH_TO_ITEM_ZONE = 2
PICKING_ITEM = 3
ITEM_PICKED = 4
WAITING_FOR_PATH_TO_STATION = 5
GOING_TO_STATION = 6
ITEM_DROPPED = 7
WAITING_FOR_PATH_TO_HOME = 8
RETURNING_HOME = 9
COMPLETE = 10

Then, every time we need a generate_path (for example, to go to the item zone), we have a start and end position, plus a set of dynamic and static obstacles.
Push this information as a jsonified dict onto a paths_needed queue, and update the job to WAITING_FOR_PATH_TO_ITEM_ZONE.
Also add the world timestamp to these queries.
Do this for all jobs; the queue gets filled.

Then, poll the paths_found queue (a set? or a priority queue by request time) and pop off as many as we can fit in a time step.
Make sure paths_found results have the same timestamp as the current one; remove any that are old and resend those queries?
With those found paths, update the corresponding jobs/robots, and do this for all paths found in the queue.

Clear the paths_found and paths_needed queues for the next time step (probably do this at the start too).

There are quite a few redis calls here.
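A rough sketch of what this could look like on the robot allocator side, assuming a redis-py connection redis_con; the queue key names, the Job object, and its set_path method are illustrative assumptions, not existing code:

import json

PATHS_NEEDED = 'paths_needed'
PATHS_FOUND = 'paths_found'

def request_path(redis_con, job, start, end, dynamic_obstacles, static_obstacles, world_t):
    """Push a jsonified path request and move the job into a waiting state."""
    msg = {
        'job_id': job.job_id,
        'start': start,
        'end': end,
        'dynamic_obstacles': list(dynamic_obstacles),
        'static_obstacles': list(static_obstacles),
        'world_t': world_t,  # world timestamp of the query
    }
    redis_con.rpush(PATHS_NEEDED, json.dumps(msg))
    job.state = 'WAITING_FOR_PATH_TO_ITEM_ZONE'

def collect_found_paths(redis_con, jobs_by_id, world_t):
    """Pop whatever results are available this time step and apply them."""
    while True:
        raw = redis_con.lpop(PATHS_FOUND)
        if raw is None:
            break
        result = json.loads(raw)
        if result['world_t'] != world_t:
            continue  # stale result; the request would need to be re-sent
        jobs_by_id[result['job_id']].set_path(result['path'])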

@Elucidation
Owner Author

Elucidation commented Jun 21, 2023

Thinking about message size: if it weren't for the set of dynamic and static obstacles, this would be a constant/tiny size.
Thinking further about timing, technically the dynamic obstacles are the only thing that is time-sensitive (by definition).

One option is that the worker nodes generate the dynamic/static obstacles when they pick up a task, producing a path for a certain world timestamp. However, this would mean each worker node is querying all robot states, which could get very slow and isn't sync-safe.

We could have each worker node listen for the world update states and keep latest robot states/paths too.

Or have the robot manager send it all. This can become very large: N robots * max path length dynamic obstacles in the worst case. For example, a 200x200 grid with 500 robots is on the order of millions of dynamic obstacles, where each is a tuple of two values; JSON-stringifying this would be a string tens of millions of characters long, i.e. tens to hundreds of MB. This is obviously the worst case and realistically it would be much smaller, but megabytes is plausible.


So, since dynamic obstacles shouldn't go in the queue message, we want the worker nodes to generate them themselves, and instead of querying the DB for all the robot paths, each worker should listen to the world update stream and keep track of it.
This makes sense in that the robot allocator doesn't update any paths until after it's finished, so worker nodes can treat a world state update as an atomic event where all paths are synced at that moment.

A bonus is that if a worker node gets to a task at a later time step, it is still able to process and respond to it; it should note which time step the new path is for. It should also check whether it finished with time to spare, and if not (mark it as an error / give up? / put it back on the queue?).

Now the path-needed queue message is just the time requested plus the start and end positions, i.e. a constant-size message.
Then the path-found queue message will contain the full path (as a string) or empty, plus the request and start timestamps.

The robot manager listens for path-found messages; if the timestamp is old it throws the message out (logging an error) and continues, otherwise it sets the path and updates the job state.
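A sketch of what these slimmed-down messages might look like; the field names are placeholders, not an existing schema:

import json

# paths_needed entry: constant size, no obstacle data
request_msg = json.dumps({
    'job_id': 'job-123',
    'request_t': 42,        # world timestamp when the path was requested
    'start': [5, 7],
    'end': [180, 33],
})

# paths_found entry: the full path (or empty if none was found),
# plus the request timestamp and the timestamp the path was generated for
response_msg = json.dumps({
    'job_id': 'job-123',
    'request_t': 42,
    'path_t': 43,
    'path': [[5, 7], [5, 8], [6, 8]],  # empty list if no path was found
})

# Robot manager side: discard stale results rather than applying them
def is_stale(result, current_world_t):
    return result['path_t'] < current_world_t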

@Elucidation
Owner Author

Elucidation commented Jun 21, 2023

One bonus here: once we have worker nodes processing these generate_path queries, there's no reason they need to stay in Python; the worker nodes could be ported to C or Go for speed improvements.

@Elucidation
Owner Author

Elucidation commented Jun 21, 2023

A quick look at Python redis read/write gives us:

import json
import redis

redis_con = redis.Redis()  # connection details depend on the deployment

def test_write():
    redis_con.set('test2', 'bloop') # seems to take around 1.7 - 2.5 ms

def test_read():
    data = redis_con.get('test2') # ~ 1.6 - 2.1 ms

Then for bigger hash writes (59kb 100k ints)

def test_write_big():
    # The json string for 100k ints is 58,890 characters, ie ~59kb msg
    data = {'vals': json.dumps(list(range(100000)))} # around 30 - 88 ms !
    redis_con.hset('test3', mapping=data)

@timeit
def test_read_big():
    data = redis_con.hgetall('test3') # 10 - 15 ms

As the message size increases, the latency for writing seems to increase from ~2 ms to ~90 ms for a 59kb message.
Reading the big message takes around 10 - 15 ms.


From this, we shouldn't have the robot manager writing obstacles (which can be thousands of positions, i.e. several thousand ints) onto a stream; instead, have the worker nodes listen to the existing world sim stream and calculate the obstacles independently.

Reading is a bit faster.

queues

@timeit
def test_write_queue():
    for i in range(100): # 100 pushes takes ~200ms, then ~120ms on later ones
        redis_con.rpush('test4', str(i)) # each is ~2ms
    
test_write_queue()

@timeit
def test_read_queue():
    for i in range(100): # 100 reads takes ~200ms, then ~120ms on later ones
        data = redis_con.lpop('test4') # each is around 2ms

So from this we're seeing that more than 50% of generate_path calls on the current system take 2.88 ms or less, while a single redis round trip already costs more than that.
However, because of parallelization, we can put a rough upper bound on redis reads/writes: if the reads and writes are batched, they should comfortably fit under a threshold of, say, 100-400 ms for all generate_path requests.
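For that batching, redis-py pipelines are one way to amortize the per-call round trip; a minimal sketch (the key name is just for illustration):

import redis

redis_con = redis.Redis()

def push_requests_batched(messages, queue_key='paths_needed'):
    """Send N RPUSH commands in a single round trip instead of N round trips."""
    pipe = redis_con.pipeline()
    for msg in messages:
        pipe.rpush(queue_key, msg)
    pipe.execute()  # one network round trip for the whole batch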

@Elucidation
Owner Author

Elucidation commented Jun 21, 2023

Note: other studies show redis and zeromq both have latencies around 150 us, so virtualization and the extra network layer are likely why I'm seeing ~2 ms here (i.e. ~10x slower). One option is using the host network (--net host) for the docker containers, though on Windows we lose access to the ports.

@Elucidation
Owner Author

Okay, running on the host network greatly speeds everything up:

'test_write' End. Took 0.437 ms
'test_read' End. Took 0.341 ms
'test_write_big' End. Took 1.593 ms
'test_read_big' End. Took 1.473 ms
'test_write_queue' End. Took 18.192 ms # 100 calls
'test_read_queue' End. Took 23.837 ms # 100 calls

Even the 59kb read/write only took < 2 ms, and the same for the queue (per call).

Next change is to remove the subnet and use host network!

@Elucidation
Owner Author

Started using python -m cProfile to find bottlenecks

[image: cProfile output]
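For reference, one way to capture a profile like this and inspect the hottest entries (the output filename and script name are placeholders):

# python -m cProfile -o allocator.prof robot_allocator.py
import pstats

stats = pstats.Stats('allocator.prof')
stats.sort_stats('cumulative').print_stats(20)  # top 20 entries by cumulative time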

@Elucidation
Owner Author

Tag on #116, where the main robot allocator can attempt to generate paths for up to, say, 2 ms each, and if that fails, push the request onto a queue. Alternatively, just fully offload everything onto the queue and let the worker nodes attempt it instead.
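A sketch of that tag-on idea: try locally under a small time budget and fall back to the queue. The budget value, the deadline argument, and enqueue_path_request are assumptions, not existing code:

LOCAL_BUDGET_S = 0.002  # ~2 ms budget for an in-process attempt

def path_or_enqueue(job, start, end, world_t):
    """Try generate_path locally under a small budget; fall back to the queue.

    A real version would need a deadline or iteration limit inside
    generate_path itself, since a plain call can't be interrupted mid-search.
    """
    path = generate_path(start, end, deadline_s=LOCAL_BUDGET_S)  # hypothetical deadline arg
    if path:
        return path
    enqueue_path_request(job, start, end, world_t)  # hypothetical queue push (see sketch above)
    return None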

@Elucidation
Owner Author

Blocker: once a path has been generated, other paths will often need to know about that path in order to avoid it, which breaks the parallelization of this scheme.
Either we have only one worker node that keeps track of everything, or we have some way to split up paths that cannot collide, or we drop paths that do collide based on some ordering.
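One way to picture the drop-by-ordering option: accept results in a fixed order and reject any that collide with a path already accepted this time step. A rough sketch; paths_collide and the message fields are hypothetical:

def resolve_paths(results):
    """Accept paths in request order; reject any that collide with an accepted one."""
    accepted, rejected = [], []
    for result in sorted(results, key=lambda r: r['request_t']):
        if any(paths_collide(result['path'], other['path']) for other in accepted):
            rejected.append(result)  # re-queue or re-plan next time step
        else:
            accepted.append(result)
    return accepted, rejected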
