Describe the bug
I am using the API with the by-title chunking strategy. When I compare the PDF with the parsed data, I find that the chunked excerpts don't seem to include their title. If I parse with no chunking, I see the title is identified correctly, as is the document hierarchy. I would have expected the title to be part of the chunk, as it carries a lot of semantic weight.
To Reproduce
Here is my pipeline. I attach a PDF that it processes. Search for "What we found" in the PDF to see the title of a section; it is this title that ends up in its own CompositeElement.
Here is the file I am testing with ...
oversightgov__faa_quickly_awarded_cares_act_funds_but_can_enhanc__2c7d00c9-def8-4409-9359-1a626bbf69b5.pdf
Expected behavior
I wouldn't expect titles to be their own chunks; instead, I would expect each title to be part of the main text for its section.
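To make the expectation concrete, here is a minimal sketch in plain Python. It is not the library's implementation and does not call the unstructured API; `chunk_with_titles` and the `(category, text)` tuples are hypothetical stand-ins for parsed elements, and the body paragraphs are placeholder text rather than content from the PDF.

```python
from typing import List, Tuple

# Hypothetical stand-in for parsed elements: (category, text) pairs.
Element = Tuple[str, str]


def chunk_with_titles(elements: List[Element]) -> List[str]:
    """Start a new chunk at each Title and keep the title text
    at the top of the chunk it introduces."""
    chunks: List[List[str]] = []
    for category, text in elements:
        # A Title (or the very first element) opens a new chunk.
        if category == "Title" or not chunks:
            chunks.append([])
        chunks[-1].append(text)
    return ["\n\n".join(parts) for parts in chunks]


# Placeholder input: only the "What we found" title comes from the PDF.
elements = [
    ("Title", "What we found"),
    ("NarrativeText", "First paragraph of the section."),
    ("NarrativeText", "Second paragraph of the section."),
    ("Title", "Next section"),
    ("NarrativeText", "Body of the next section."),
]

chunks = chunk_with_titles(elements)
for chunk in chunks:
    print(repr(chunk))
```

With this grouping, the chunk for the "What we found" section begins with the title itself, which is the behavior I expected from chunking by title.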
Screenshots
Environment Info
I am using the unstructured docker image as found here:
https://github.com/Unstructured-IO/unstructured/tree/main?tab=readme-ov-file#run-the-library-in-a-container
I exec in. Had to also install ...
Here are my versions ...
Python 3.11.10
unstructured 0.15.13
unstructured-client 0.25.9
unstructured-inference 0.7.36
unstructured-ingest 0.0.21
unstructured.paddleocr 2.8.1.0
unstructured.pytesseract 0.3.13
Additional context