bug/Names of interface elements in text output after partition #3725

Open
SlawaLoev-KSO opened this issue Oct 16, 2024 · 3 comments
Labels
bug: Something isn't working

Comments


SlawaLoev-KSO commented Oct 16, 2024

Describe the bug
When I partition a .doc file with partition("/path/file.doc"), the text output also contains what appear to be names of interface elements (from LibreOffice?), such as:

  • Luisteren
  • Fonetisch lezen
  • Woordenboek - Gedetailleerd woordenboek weergeven

Note: The document text is in English, as is my system, but these elements are for some reason in Dutch (I am indeed in the Netherlands).

To Reproduce

from unstructured.partition.auto import partition

elements = partition("/path/file.doc")

print("\n".join([str(el) for el in elements[:100]]))

Expected behavior
I would not expect interface elements to be part of the processed output.
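
To help narrow down where such strings appear in the output, here is a minimal sketch that searches the stringified elements for the suspect strings and reports each hit with its neighbours. `find_suspects` and `SUSPECTS` are hypothetical names, not part of unstructured, and the sample list is stand-in data rather than real output from the document.

```python
# Hypothetical helper, not part of unstructured: locate the suspect strings
# among the stringified elements and show their neighbours for context.
SUSPECTS = ("Luisteren", "Fonetisch lezen", "Woordenboek")


def find_suspects(texts, suspects=SUSPECTS, context=1):
    """Return (index, text, neighbours) for each text containing a suspect string."""
    hits = []
    for i, text in enumerate(texts):
        if any(s in text for s in suspects):
            lo = max(0, i - context)
            # Texts immediately before and after the hit, excluding the hit itself.
            neighbours = texts[lo:i] + texts[i + 1:i + 1 + context]
            hits.append((i, text, neighbours))
    return hits


# Stand-in data (an assumption); in practice use: texts = [str(el) for el in elements]
sample = ["Some English paragraph.", "Luisteren", "Fonetisch lezen", "Another paragraph."]
for index, text, neighbours in find_suspects(sample):
    print(index, repr(text), neighbours)
```

Knowing which elements surround the unexpected strings may show whether they cluster at one place in the document, which could make it easier to describe what they are.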

SlawaLoev-KSO added the bug label Oct 16, 2024
Collaborator

scanny commented Oct 16, 2024

@SlawaLoev-KSO Can you be more specific about what you mean by "interface elements"? For example, do you mean menu-bar options, or something like form-field labels?

@SlawaLoev-KSO
Author

To be honest, I don't know exactly what they are, just odd words (which sound like they could be interface elements) in the text output that are not in the .doc itself. I only guessed at what they might be.

Collaborator

scanny commented Oct 17, 2024

Can you share an example document?

2 participants