bug/Titles not included in chunks by-title #3688

Open
dividor opened this issue Oct 2, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug
I am using the API with the by-title chunking strategy. When I compare the PDF with the parsed data, I find that the chunked excerpts don't seem to include their title. If I parse with no chunking, I see the title is identified correctly, as is the document hierarchy. I would have expected the title to be part of the chunk, as it carries a lot of semantic weight.

To Reproduce

Here is my pipeline. I attach a PDF it processes; search for "What we found" in the PDF to see the title of a section. It is this title which occurs in its own CompositeElement.


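# Imports are assumed from the unstructured-ingest v2 module layout; adjust for your version.
import os

from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalConnectionConfig,
    LocalDownloaderConfig,
    LocalIndexerConfig,
    LocalUploaderConfig,
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

MAX_CHARACTERS = 1500
CHUNK_OVERLAP = 200
COMBINE_TEXT_UNDER_N_CHARS = 50

# input_dir, OUTPUT_DIR and METHOD are set elsewhere and not shown here.
Pipeline.from_configs(
    context=ProcessorConfig(),
    indexer_config=LocalIndexerConfig(input_path=input_dir),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        strategy="hi_res",
        additional_partition_args={
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15,
            "reprocess": True,
            "extract_image_block_types": ["Image"],
        },
        reprocess=True,
    ),
    # https://docs.unstructured.io/api-reference/ingest/ingest-configuration/chunking-configuration
    chunker_config=ChunkerConfig(
        chunking_strategy="by_title",
        max_characters=MAX_CHARACTERS,
        chunk_overlap=CHUNK_OVERLAP,
        combine_text_under_n_chars=COMBINE_TEXT_UNDER_N_CHARS,
    ),
    # embedder_config=EmbedderConfig(embedding_provider="langchain-huggingface"),
    uploader_config=LocalUploaderConfig(output_dir=f"{OUTPUT_DIR}/{METHOD}"),
).run()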
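
To check the output for the symptom, a small script along these lines could list chunks that look like bare titles. This is only a sketch: it assumes each output file is a JSON array of element dicts with `type` and `text` keys, and the helper names (`find_title_only_chunks`, `load_elements`) and the 80-character threshold are hypothetical, not part of unstructured.

```python
import json
import sys
from pathlib import Path


def find_title_only_chunks(elements, max_title_chars=80):
    """Return (index, text) pairs for chunks that look like a lone section title."""
    hits = []
    for i, el in enumerate(elements):
        text = (el.get("text") or "").strip()
        # Heuristic (an assumption): a short, single-line CompositeElement
        # is most likely a title that was emitted as its own chunk.
        if (
            el.get("type") == "CompositeElement"
            and text
            and len(text) <= max_title_chars
            and "\n" not in text
        ):
            hits.append((i, text))
    return hits


def load_elements(path):
    """Read one pipeline output file (assumed to be a JSON array of dicts)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Usage: python check_titles.py <output.json> [<output.json> ...]
    for path in sys.argv[1:]:
        for i, text in find_title_only_chunks(load_elements(path)):
            print(f"{path} chunk {i}: {text!r}")
```

Run against the files in the uploader's output directory, a chunk containing only "What we found" should show up in this listing if the title was split off.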

Here is the file I am testing with ...

oversightgov__faa_quickly_awarded_cares_act_funds_but_can_enhanc__2c7d00c9-def8-4409-9359-1a626bbf69b5.pdf

Expected behavior
I wouldn't expect titles to be their own chunks; instead, I would expect them to be part of the main text for that section.
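
As a stopgap, the expected result can be approximated by post-processing the chunk list so that a bare title is prepended to the chunk that follows it. The sketch below is a hypothetical workaround, not the library's behavior: `merge_title_only_chunks` and the 80-character threshold are made up for illustration, and it assumes each chunk is a dict with a `text` key.

```python
def merge_title_only_chunks(chunks, max_title_chars=80):
    """Prepend short single-line chunks (assumed to be titles) to the next chunk."""
    merged = []
    pending = []  # title-like chunks waiting for a body chunk
    for chunk in chunks:
        text = (chunk.get("text") or "").strip()
        # Same heuristic as above: short and single-line means "probably a title".
        if text and len(text) <= max_title_chars and "\n" not in text:
            pending.append(chunk)
            continue
        if pending:
            titles = [p["text"].strip() for p in pending]
            # Copy the body chunk so the input list is not mutated.
            chunk = {**chunk, "text": "\n\n".join(titles + [text])}
            pending = []
        merged.append(chunk)
    # Titles with no following body are kept as they were.
    merged.extend(pending)
    return merged
```

Note that a merged chunk can end up longer than `max_characters`, so this only makes sense if the downstream consumer tolerates slightly oversized chunks.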

Screenshots

Environment Info
I am using the unstructured docker image as found here:

https://github.com/Unstructured-IO/unstructured/tree/main?tab=readme-ov-file#run-the-library-in-a-container

I exec into the container. I also had to install ...

pip install unstructured-ingest
pip install unstructured

Here are my versions ...

Python 3.11.10

unstructured 0.15.13
unstructured-client 0.25.9
unstructured-inference 0.7.36
unstructured-ingest 0.0.21
unstructured.paddleocr 2.8.1.0
unstructured.pytesseract 0.3.13

Additional context

dividor added the bug label Oct 2, 2024