bug/OCRAgentGoogleVision takes 1 positional argument but 2 were given #3659

pprados · 2024-09-24T09:33:29Z

Describe the bug
Try to parse a pdf with OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision.

To Reproduce
Provide a code snippet that reproduces the issue.

import os

os.environ[
    "OCR_AGENT"] = "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision"

from unstructured.partition.pdf import partition_pdf

partition_pdf("fake-memo.pdf",
              strategy="hi_res",
              )

Expected behavior
No error

Environment Info
OS version: Linux-6.8.0-45-generic-x86_64-with-glibc2.39
Python version: 3.11.4
unstructured version: 0.15.14.dev1
unstructured-inference version: 0.7.36
pytesseract is not installed
Torch version: 2.4.1
Detectron2 is not installed
PaddleOCR version: None
Libmagic version: file-5.45
magic file from /etc/magic:/usr/share/misc/magic

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

DavidBlore · 2024-09-24T19:02:28Z

I’m experiencing the same issue. The issue arises in the OCRAgent's get_instance method as it expects all OCRAgents to have a language parameter in their constructor. See below for clarity:

@staticmethod
@functools.lru_cache(maxsize=None)
def get_instance(ocr_agent_module: str, language: str) -> "OCRAgent":
    module_name, class_name = ocr_agent_module.rsplit(".", 1)
    if module_name not in OCR_AGENT_MODULES_WHITELIST:
        raise ValueError(
            f"Environment variable OCR_AGENT module name {module_name} must be set to a "
            f"whitelisted module part of {OCR_AGENT_MODULES_WHITELIST}."
        )

    try:
        module = importlib.import_module(module_name)
        loaded_class = getattr(module, class_name)
        return loaded_class(language) # <--- This is where the issue occurs
    except (ImportError, AttributeError) as e:
        logger.error(f"Failed to get OCRAgent instance: {e}")
        raise RuntimeError(
            "Could not get the OCRAgent instance. Please check the OCR package and the "
            "OCR_AGENT environment variable."
        )

However, the OCRAgentGoogleVision class's constructor does not take in a language parameter in its constructor. Thus, the exception of OCRAgentGoogleVision takes 1 positional argument but 2 were given is thrown.

I'm willing to submit a PR to address this but want to know what the desired approach to solving this would be. Some possible options are:

Standardize Constructor Parameters: Ensure all OCRAgent classes have a language parameter by defining it in the OCRAgent abstract base class (ABC).
- make this optional
- don't make this optional
Modify OCRAgentGoogleVision: Add a language parameter to its constructor.
Conditional Instantiation: Modify ocr_interface.py to check if language is a parameter in the constructor. If it is, call loaded_class(language), otherwise call loaded_class().

christinestraub · 2024-09-30T19:07:14Z

Hi @DavidBlore, Thank you for your willingness to submit a PR to address this issue. After considering the options you've presented, I believe the most suitable approach would be:
Modify OCRAgentGoogleVision: Add a language parameter to its constructor.
This approach offers several advantages:

Consistency: It aligns OCRAgentGoogleVision with other OCR agents like OCRAgentTesseract and OCRAgentPaddle, which already have language parameters.
Flexibility: By adding a language parameter with a default value, we maintain backwards compatibility while allowing for language specification when needed.
Simplicity: This change is straightforward to implement and doesn't require modifying the abstract base class or the ocr_interface.py file.

Implementation suggestion:

class OCRAgentGoogleVision(OCRAgent):
    def __init__(self, language='en'):
        super().__init__()
        self.language = language
        # ... rest of the constructor

This change allows users to specify a language if needed, but defaults to English ('en') if not provided, similar to other OCR agents.

Next steps:

Implement this change in the OCRAgentGoogleVision class.
Update any relevant documentation or type hints.
Add tests to ensure the new parameter works as expected.

This PR addresses issue #3659 by adding an optional `language` parameter to the `OCRAgentGoogleVision` class constructor. This parameter serves as a "language hint" for the `document_text_detection` method in the `ImageAnnotatorClient`. For more information on language hints, refer to the [Google Cloud Vision documentation](https://cloud.google.com/vision/docs/languages). **Default Behavior**: The language parameter defaults to None, allowing Google Cloud Vision to auto-detect the language, as recommended in their documentation. **Purpose**: This change is necessary because the `OCRAgent`'s `get_instance` method expects all `OCRAgent`s to include a language parameter in their constructors. **Context on Issue:** When trying to parse a PDF with `OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`, an error occurs in the `get_instance` method. The method expects a `language` parameter, which the current `OCRAgentGoogleVision` constructor does not support, leading to a positional argument error. --------- Co-authored-by: Christine Straub <[email protected]>

pprados added the bug Something isn't working label Sep 24, 2024

DavidBlore mentioned this issue Oct 4, 2024

fix: add language to OCRAgentGoogleVision constructor #3696

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug/OCRAgentGoogleVision takes 1 positional argument but 2 were given #3659

bug/OCRAgentGoogleVision takes 1 positional argument but 2 were given #3659

pprados commented Sep 24, 2024

DavidBlore commented Sep 24, 2024

christinestraub commented Sep 30, 2024 •

edited

Loading

bug/OCRAgentGoogleVision takes 1 positional argument but 2 were given #3659

bug/OCRAgentGoogleVision takes 1 positional argument but 2 were given #3659

Comments

pprados commented Sep 24, 2024

DavidBlore commented Sep 24, 2024

christinestraub commented Sep 30, 2024 • edited Loading

christinestraub commented Sep 30, 2024 •

edited

Loading