[Streamer] Fix UTF-8 handling in streamer #2978

Merged on Oct 14, 2024 (1 commit)

MasterJH5574
Member

This PR fixes a bug in the streamer's handling of UTF-8 characters. Prior to this PR, the streamer assumed that a replacement character (`�`) always corresponds to an entire token. However, with the Qwen2 model tokenizer, some tokens decode to ` �` when decoded on their own, which breaks that assumption and causes the streamer to produce incorrect output.

This PR fixes the issue with a safer behavior that does not rely on this assumption.
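
To illustrate the failure mode, here is a minimal Python sketch. It is not the code changed in this PR. The token byte strings, `naive_stream`, and `safe_stream` are hypothetical names made up for the example, and the token split only imitates the ` �` case described above. The sketch contrasts decoding each token on its own with buffering bytes across tokens so that only complete characters are emitted.

```python
import codecs

REPLACEMENT = "\ufffd"  # U+FFFD, shown as the replacement character


def naive_stream(token_bytes):
    """Decode every token on its own; partial characters turn into U+FFFD."""
    return [t.decode("utf-8", errors="replace") for t in token_bytes]


def safe_stream(token_bytes):
    """Buffer bytes across tokens and emit only complete characters."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(t) for t in token_bytes]
    # Flush whatever is still buffered once the stream ends.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


# Hypothetical split (an assumption for this example): the second token holds
# a space plus the first two bytes of the three-byte character U+4F60, and the
# third token holds the final byte.
tokens = [b"Hi", b" \xe4\xbd", b"\xa0"]

naive = naive_stream(tokens)
safe = safe_stream(tokens)

print(naive)  # the second piece mixes a space with a replacement character
print(safe)   # the second piece is only the space; the character arrives later
```

In the naive version the second piece is not a bare replacement character, so a check that equates "contains `�`" with "the whole token is incomplete" either drops the leading space or emits a broken character. Buffering at the byte level avoids having to make that decision per token.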
MasterJH5574 merged commit fead3e5 into mlc-ai:main on Oct 14, 2024
1 of 2 checks passed