Limit number of sequences #220

yelite · 2024-02-23T21:02:56Z

This PR:

- Adds `max_num_seq` and `max_num_seq_per_request` to limit the number of sequences in the batch and in each request.
- Uses `max_num_seq` to guide memory profiling so that it is less conservative, leaving more room for the KV cache.
- Exposes `gpu_memory_utilization` in the engine config.
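
For illustration, the two limits could interact roughly as in the sketch below. This is not the code from this PR: `EngineConfig`, `Request`, `validate_request`, and `grow_batch` are hypothetical names, and the default values are placeholders. Only the three option names come from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EngineConfig:
    # Hypothetical config object; field names mirror the options in this PR,
    # default values are placeholders chosen for the example.
    max_num_seq: int = 256
    max_num_seq_per_request: int = 4
    gpu_memory_utilization: float = 0.9


@dataclass
class Request:
    # Hypothetical request record: an id and the number of sequences it asks for.
    request_id: str
    num_sequences: int


def validate_request(req: Request, config: EngineConfig) -> None:
    """Reject a request that asks for more sequences than the per-request limit."""
    if req.num_sequences > config.max_num_seq_per_request:
        raise ValueError(
            f"request {req.request_id} asks for {req.num_sequences} sequences, "
            f"limit is {config.max_num_seq_per_request}"
        )


def grow_batch(
    running: List[Request], waiting: List[Request], config: EngineConfig
) -> List[Request]:
    """Admit waiting requests in FIFO order while the batch stays within max_num_seq."""
    num_seq = sum(r.num_sequences for r in running)
    admitted: List[Request] = []
    for req in waiting:
        if num_seq + req.num_sequences > config.max_num_seq:
            break  # stop at the first request that does not fit, preserving order
        admitted.append(req)
        num_seq += req.num_sequences
    return admitted
```

With `max_num_seq=4`, a running request holding 2 sequences, and waiting requests of 1, 2, and 1 sequences, this sketch admits only the first waiting request, because admitting the second would bring the batch to 5 sequences.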

@sunggg @masahi

masahi · 2024-02-23T23:08:07Z

This should be sent to https://github.com/octoml/mlc-serve/ as well.

yelite added 11 commits February 21, 2024 16:16

Add max_num_seq

29cac68

add to engine

549861a

Limit max_num_seq when grows the batch

4cf54ea

Add max num seq to args

a96e915

Add max_num_seq_per_sequence

1fd2b2e

Expose gpu memory utilization to engine config

634a869

Fix dataclass

48af6d5

Remove the default value of gpu_memory_utilization

38ba3e3

Merge branch 'expose-gpu-util' into fix-memory-profiling

320a9ba

Fix params

0b574aa

Apply to torch model

cd4f1c8

masahi approved these changes Feb 23, 2024

masahi merged commit 8ee6aaa into octoml:batch-serving Feb 23, 2024
1 check passed

Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Feb 27, 2024

Update prep_deps to include cargo installation script (octoml#220)

1f53191
