Fix after #214 #215

Merged
@sunggg merged 6 commits into octoml:batch-serving from vc/fix#214 on Feb 16, 2024
Conversation

@vvchernov commented Feb 16, 2024

After merging #214, a strong performance reduction was observed. This PR fixes it: the throughput benchmark shows no performance reduction after the fix.

Results of the throughput benchmark (Mistral-7b, use input 10):
5e36d1b (last but one commit): Engine Throughput: 6.48 requests/s, 2516.60 tokens/s
7495bd0 (last commit): Engine Throughput: 0.14 requests/s, 53.75 tokens/s
with fix: Engine Throughput: 5.62 requests/s, 2181.77 tokens/s

Results of the throughput benchmark (Mistral-7b, use input 1000):
5e36d1b (last but one commit): Engine Throughput: 34.39 requests/s, 13157.05 tokens/s
7495bd0 (last commit): takes too long to measure
with fix: Engine Throughput: 34.48 requests/s, 13193.23 tokens/s

@sunggg (Member) left a comment:

Appreciate you spotting this important issue and the quick fix!

It would also be nice to record the performance difference before/after for future reference. I think this is an important data point.

vocab_size = request.sampling_params.vocab_size
bin_counts = torch.zeros((vocab_size + 1,),
                         dtype=torch.long,
                         device=tokens.device)
Member:

Can we clarify where this computation runs? As a follow-up, I want to clarify where each operation runs and further async/parallelize where we can.

Author:

I do not have a good design right now, but I plan to think about it. This is only a quick fix. I do not think this is the only place that should be fixed; we need to think about the sampler design (see also the comment below).

Member:

Oh, I was not suggesting we should do it now, haha. I just wanted to leave a comment noting where this computation runs. device=tokens.device is CPU, I assume?

Author:

I think yes; it is the device of prompt_token_ids, which I believe are on the CPU. A comment has been added.
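
For context on the device question, here is a minimal, hypothetical sketch (not the code in this PR) of how a prompt mask could be derived from token bin counts; since prompt_token_ids live on the CPU, every tensor created with device=tokens.device also lands on the CPU:

```python
import torch

vocab_size = 32000
prompt_token_ids = [1, 5, 5, 42]                             # token ids from the prompt
tokens = torch.tensor(prompt_token_ids, dtype=torch.long)    # CPU tensor by default

bin_counts = torch.zeros((vocab_size + 1,),
                         dtype=torch.long,
                         device=tokens.device)               # same device as tokens, i.e. CPU
bin_counts.scatter_add_(0, tokens, torch.ones_like(tokens))  # count occurrences of each token id
mask_prompt = bin_counts[:vocab_size] > 0                    # 1-D mask of tokens seen in the prompt

print(bin_counts.device)                 # cpu
print(mask_prompt.nonzero().flatten())   # tensor([ 1,  5, 42])
```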

serve/mlc_serve/model/model_common.py (resolved conversation)
@vvchernov (Author):

> It would also be nice to record the performance difference before/after for future reference. I think this is an important data point.

@elvin-n could you show the results from before the big sampler refactor as a reference data point? I have added the latest ones.

@vvchernov (Author):

> It would also be nice to record the performance difference before/after for future reference. I think this is an important data point.

I added it to the PR description. I should note that our target is 19000-20000 tokens per second for Mistral-7b.

@sunggg (Member) left a comment:

Thank you guys for spotting this issue and proactively addressing it!

@sunggg merged commit 5588d17 into octoml:batch-serving on Feb 16, 2024
1 check passed
@vvchernov deleted the vc/fix#214 branch on February 17, 2024 at 13:09
@vvchernov (Author):

After this fix I decided to double-check why the performance reduction was resolved. Of course, I had some ideas that I relied on while fixing it, but I also had doubts that were eventually justified. Using timeit, I tested different scenarios of data processing for the repetition_penalty parameter and compared them for Mistral-7b on 1xH100.
Before sharing the test results, it should be noted that a request's data is processed in two stages: 1. the data from a new request is preprocessed asynchronously before it is added to the queue; 2. the data flows through the topology and is processed during sampling. repetition_penalty accounts not only for output tokens (newly generated) but also for input ones (from the prompt). Because of this, we prepare mask_prompt once per request in the first stage and use it when the sampler state is initialized for a request batch. As a result we need to construct a 2D torch tensor, which can be done in different ways (see the sketch below). Test scripts can be found here.
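
As an illustration of the two-stage flow just described, here is a minimal sketch; the helper name prepare_prompt_mask and the bool-mask representation are assumptions for illustration, not the exact code in the repository:

```python
import torch

VOCAB_SIZE = 32000  # Mistral-7b vocabulary size used in the tests below

def prepare_prompt_mask(prompt_token_ids):
    """Stage 1 (async request preprocessing): 1-D bool mask of tokens seen in the prompt."""
    mask = torch.zeros(VOCAB_SIZE, dtype=torch.bool)
    mask[torch.tensor(prompt_token_ids, dtype=torch.long)] = True
    return mask

# Stage 1: run once per request, before it is added to the queue.
batch_prompts = [[1, 5, 42], [7, 7, 9]]
masks_1d = [prepare_prompt_mask(ids) for ids in batch_prompts]

# Stage 2: when the sampler state is initialized for the batch, the precomputed
# 1-D masks are only stacked into a 2-D tensor (cheap compared to rebuilding them).
mask_prompt = torch.stack(masks_1d)
print(mask_prompt.shape)  # torch.Size([2, 32000])
```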
Results:

  • First step: prepare (initialize) the prompt mask, a 1-D array of some length. In practice the length is vocab_size (32000).
    • length = 10: list - 0.13 us; torch tensor from list - 4.4 us; list via numpy (Song's solution) - 2.3 us; torch tensor from prompt - 19.3 us (fix)
    • length = 32000: list - 59 us; torch tensor from list - 4.1 ms; list via numpy (Song's solution) - 330 us; torch tensor from prompt - 47 us (fix)
  • Second step: join the data prepared in the first step (batch size = 10):
    • length = 10: init torch tensor from list of lists - 18 us (Song's solution); torch stack - 3 us (fix)
    • length = 1000: init torch tensor from list of lists - 1.3 ms (Song's solution); torch stack - 4 us (fix)
    • length = 32000: init torch tensor from list of lists - 40 ms (Song's solution); torch stack - 300 us (fix)
  • Hypothetical case where the torch tensor is initialized from list(s) prepared in different ways (batch size = 10):
    • length = 10: init torch tensor from list of lists - 20 us; torch stack - 40 us
    • length = 100: init torch tensor from list of lists - 137 us; torch stack - 165 us
    • length = 1000: init torch tensor from list of lists - 1.3 ms; torch stack - 1.3 ms
    • length = 32000: init torch tensor from list of lists - 42 ms; torch stack - 42 ms

In all cases I tested processing with a batch (default batch size = 10). My measurements showed that the time depends linearly on batch size, so the results in point 1 are averaged per request.
Conclusions: 1. The performance reduction was resolved mainly by moving some calculations to the async stage. The calculations during sampling were reduced, but they are still large compared to the typical time of 1 us; I think the final solution should be async calculations or something similar (an actor system). 2. The processing time of the different approaches depends strongly on the data size, so the fix used for repetition_penalty does not look like the best one for logits_bias.
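
For reference, a rough timeit sketch of the second-step comparison above (building the 2-D tensor from a list of lists vs. torch.stack over precomputed 1-D tensors); the absolute numbers depend on hardware, and this is only an approximation of the linked test scripts:

```python
import timeit
import torch

batch_size, length = 10, 32000

list_of_lists = [[0] * length for _ in range(batch_size)]
tensors_1d = [torch.zeros(length, dtype=torch.long) for _ in range(batch_size)]

t_list = timeit.timeit(lambda: torch.tensor(list_of_lists, dtype=torch.long), number=100) / 100
t_stack = timeit.timeit(lambda: torch.stack(tensors_1d), number=100) / 100

print(f"torch.tensor(list of lists):          {t_list * 1e6:8.1f} us")
print(f"torch.stack(precomputed 1-D tensors): {t_stack * 1e6:8.1f} us")
```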

@masahi mentioned this pull request on Feb 21, 2024