Support for Non-Latin Characters #90

Open
k3ntar0 opened this issue Nov 16, 2023 · 1 comment

Comments

@k3ntar0
k3ntar0 commented Nov 16, 2023

This project is wonderful!
How can we make it compatible with characters used in non-Latin scripts, for example, Japanese characters?
Are tessdata files available?

@robertknight
Owner

This library loads the same model data as the C++ Tesseract, so you can use the file for your language from https://github.com/tesseract-ocr/tessdata_best.
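As a rough illustration of picking the file for a language, here is a minimal sketch that builds a download URL for a model in tessdata_best. It assumes files are named `<lang>.traineddata` and are served from the repository's `main` branch via raw.githubusercontent.com; `traineddataUrl` is a hypothetical helper, not part of this library.

```typescript
// Assumption: tessdata_best stores one "<lang>.traineddata" file per language
// on its "main" branch, reachable through raw.githubusercontent.com.
const TESSDATA_BEST_BASE =
  "https://raw.githubusercontent.com/tesseract-ocr/tessdata_best/main";

// Hypothetical helper: map a Tesseract language code (e.g. "jpn") to a URL.
function traineddataUrl(lang: string): string {
  // Accept only simple codes such as "jpn", "jpn_vert" or "chi_sim".
  if (!/^[A-Za-z0-9_]+$/.test(lang)) {
    throw new Error("Unexpected language code: " + lang);
  }
  return TESSDATA_BEST_BASE + "/" + lang + ".traineddata";
}

// URL of the Japanese model; pass it (or the bytes fetched from it) to
// whatever model-loading call the library exposes.
const japaneseModelUrl = traineddataUrl("jpn");
```

The resulting URL, or the file downloaded from it, would then be handed to the library in place of the English model.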

> How can we make it compatible with characters used in non-Latin scripts, for example, Japanese characters?

The code in this project is, in theory, script-independent: it is mostly concerned with getting data into Tesseract as pixels and out as bounding boxes and Unicode text. If you load the right model, non-Latin languages may already work. However, I have not tested this myself, and some extra work may be required. This is an area where I could use some help from interested users of the library.
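For anyone who wants to help test this, one quick sanity check is to measure how much of the recognized text actually falls in the expected script. The sketch below is only an assumption about what such a check could look like: `japaneseCharFraction` is a hypothetical helper, and it covers just the Hiragana, Katakana and main CJK Unified Ideographs blocks.

```typescript
// Hypothetical helper: fraction of non-whitespace characters in `text` that
// fall in the Hiragana, Katakana, or CJK Unified Ideographs blocks.
function japaneseCharFraction(text: string): number {
  let japanese = 0;
  let counted = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    // Skip ASCII whitespace and control characters.
    if (code <= 0x20) {
      continue;
    }
    counted++;
    const isHiragana = code >= 0x3040 && code <= 0x309f;
    const isKatakana = code >= 0x30a0 && code <= 0x30ff;
    const isKanji = code >= 0x4e00 && code <= 0x9fff;
    if (isHiragana || isKatakana || isKanji) {
      japanese++;
    }
  }
  // Empty (or whitespace-only) input counts as containing no Japanese text.
  return counted === 0 ? 0 : japanese / counted;
}
```

A value near zero for an image of Japanese text would suggest the wrong model was loaded, or that the text was lost somewhere on the way out of Tesseract.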
