Fix after #214 #215

Merged
@sunggg merged 6 commits into octoml:batch-serving from vc/fix#214 on Feb 16, 2024
Conversation

@vvchernov commented Feb 16, 2024

After merging #214, a strong performance reduction was observed. This PR fixes it: the throughput benchmark shows no performance reduction after the fix.

Results of the throughput benchmark (Mistral-7b, use input 10):
5e36d1b (last but one commit): Engine Throughput: 6.48 requests/s, 2516.60 tokens/s
7495bd0 (last commit): Engine Throughput: 0.14 requests/s, 53.75 tokens/s
with fix: Engine Throughput: 5.62 requests/s, 2181.77 tokens/s

Results of the throughput benchmark (Mistral-7b, use input 1000):
5e36d1b (last but one commit): Engine Throughput: 34.39 requests/s, 13157.05 tokens/s
7495bd0 (last commit): takes too long to measure
with fix: Engine Throughput: 34.48 requests/s, 13193.23 tokens/s

@sunggg (Member) left a comment:

Appreciate you spotting this important issue and the quick fix!

It would also be nice to record the performance difference before/after for future reference. I think this is an important data point.

vocab_size = request.sampling_params.vocab_size
bin_counts = torch.zeros((vocab_size + 1,),
                         dtype=torch.long,
                         device=tokens.device)
Member:

Can we clarify where this computation runs? As a follow-up, I want to clarify where each operation runs and further async/parallelize where we can.

Author:

I do not have a good design right now, but I plan to think about it. This is only a quick fix. I do not think this is the only place that should be fixed; we need to think about the sampler design (see also the comment below).

Member:

Oh, I was not suggesting we should do it now, haha. I just wanted to leave a comment noting where this computation runs. device=tokens.device is CPU, I assume?

Author:

I think yes; it is the device of prompt_token_ids, which I believe are on the CPU. A comment has been added.
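
For context on the device question, here is a minimal, hypothetical sketch (not the code in this PR) of how a prompt mask could be derived from token bin counts; since prompt_token_ids live on the CPU, every tensor created with device=tokens.device also lands on the CPU:

```python
import torch

vocab_size = 32000
prompt_token_ids = [1, 5, 5, 42]                             # token ids from the prompt
tokens = torch.tensor(prompt_token_ids, dtype=torch.long)    # CPU tensor by default

bin_counts = torch.zeros((vocab_size + 1,),
                         dtype=torch.long,
                         device=tokens.device)               # same device as tokens, i.e. CPU
bin_counts.scatter_add_(0, tokens, torch.ones_like(tokens))  # count occurrences of each token id
mask_prompt = bin_counts[:vocab_size] > 0                    # 1-D mask of tokens seen in the prompt

print(bin_counts.device)                 # cpu
print(mask_prompt.nonzero().flatten())   # tensor([ 1,  5, 42])
```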

serve/mlc_serve/model/model_common.py (resolved conversation)
@vvchernov (Author):

> It would also be nice to record the performance difference before/after for future reference. I think this is an important data point.

@elvin-n could you show the results from before the big sampler refactor as a reference data point? I have added the latest ones.

@vvchernov (Author):

> It would also be nice to record the performance difference before/after for future reference. I think this is an important data point.

I added it to the PR description. I should note that our target is 19000-20000 tokens per second for Mistral-7b.

@sunggg (Member) left a comment:

Thank you guys for spotting this issue and proactively addressing it!

@sunggg merged commit 5588d17 into octoml:batch-serving on Feb 16, 2024
1 check passed
@vvchernov deleted the vc/fix#214 branch on February 17, 2024 at 13:09
@vvchernov (Author):

After this fix I decided to double-check why the performance reduction was resolved. Of course, I had some ideas that I relied on while fixing it, but I also had doubts that were eventually justified. Using timeit, I tested different scenarios of data processing for the repetition_penalty parameter and compared them for Mistral-7b on 1xH100.
Before sharing the test results, it should be noted that a request's data is processed in two stages: 1. the data from a new request is preprocessed asynchronously before it is added to the queue; 2. the data flows through the topology and is processed during sampling. repetition_penalty accounts not only for output tokens (newly generated) but also for input ones (from the prompt). Because of this, we prepare mask_prompt once per request in the first stage and use it when the sampler state is initialized for a request batch. As a result we need to construct a 2D torch tensor, which can be done in different ways (see the sketch below). Test scripts can be found here.
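
As an illustration of the two-stage flow just described, here is a minimal sketch; the helper name prepare_prompt_mask and the bool-mask representation are assumptions for illustration, not the exact code in the repository:

```python
import torch

VOCAB_SIZE = 32000  # Mistral-7b vocabulary size used in the tests below

def prepare_prompt_mask(prompt_token_ids):
    """Stage 1 (async request preprocessing): 1-D bool mask of tokens seen in the prompt."""
    mask = torch.zeros(VOCAB_SIZE, dtype=torch.bool)
    mask[torch.tensor(prompt_token_ids, dtype=torch.long)] = True
    return mask

# Stage 1: run once per request, before it is added to the queue.
batch_prompts = [[1, 5, 42], [7, 7, 9]]
masks_1d = [prepare_prompt_mask(ids) for ids in batch_prompts]

# Stage 2: when the sampler state is initialized for the batch, the precomputed
# 1-D masks are only stacked into a 2-D tensor (cheap compared to rebuilding them).
mask_prompt = torch.stack(masks_1d)
print(mask_prompt.shape)  # torch.Size([2, 32000])
```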
Results:

  • First step: prepare (initialize) the prompt mask, a 1-D array of some length. In practice the length is vocab_size (32000).
    • length = 10: list - 0.13 us; torch tensor from list - 4.4 us; list via numpy (Song's solution) - 2.3 us; torch tensor from prompt - 19.3 us (fix)
    • length = 32000: list - 59 us; torch tensor from list - 4.1 ms; list via numpy (Song's solution) - 330 us; torch tensor from prompt - 47 us (fix)
  • Second step: join the data prepared in the first step (batch size = 10):
    • length = 10: init torch tensor from list of lists - 18 us (Song's solution); torch stack - 3 us (fix)
    • length = 1000: init torch tensor from list of lists - 1.3 ms (Song's solution); torch stack - 4 us (fix)
    • length = 32000: init torch tensor from list of lists - 40 ms (Song's solution); torch stack - 300 us (fix)
  • Hypothetical case where the torch tensor is initialized from list(s) prepared in different ways (batch size = 10):
    • length = 10: init torch tensor from list of lists - 20 us; torch stack - 40 us
    • length = 100: init torch tensor from list of lists - 137 us; torch stack - 165 us
    • length = 1000: init torch tensor from list of lists - 1.3 ms; torch stack - 1.3 ms
    • length = 32000: init torch tensor from list of lists - 42 ms; torch stack - 42 ms

In all cases I tested processing with a batch (default batch size = 10). My measurements showed that the time depends linearly on batch size, so the results in point 1 are averaged per request.
Conclusions: 1. The performance reduction was resolved mainly by moving some calculations to the async stage. The calculations during sampling were reduced, but they are still large compared to the typical time of 1 us; I think the final solution should be async calculations or something similar (an actor system). 2. The processing time of the different approaches depends strongly on the data size, so the fix used for repetition_penalty does not look like the best one for logits_bias.
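
For reference, a rough timeit sketch of the second-step comparison above (building the 2-D tensor from a list of lists vs. torch.stack over precomputed 1-D tensors); the absolute numbers depend on hardware, and this is only an approximation of the linked test scripts:

```python
import timeit
import torch

batch_size, length = 10, 32000

list_of_lists = [[0] * length for _ in range(batch_size)]
tensors_1d = [torch.zeros(length, dtype=torch.long) for _ in range(batch_size)]

t_list = timeit.timeit(lambda: torch.tensor(list_of_lists, dtype=torch.long), number=100) / 100
t_stack = timeit.timeit(lambda: torch.stack(tensors_1d), number=100) / 100

print(f"torch.tensor(list of lists):          {t_list * 1e6:8.1f} us")
print(f"torch.stack(precomputed 1-D tensors): {t_stack * 1e6:8.1f} us")
```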

@masahi mentioned this pull request on Feb 21, 2024