Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Updating Pytorch-Launcher component to work with pipelines v2 #11273

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Fiona-Waters
Copy link
Contributor

@Fiona-Waters Fiona-Waters commented Oct 7, 2024

Description of your changes:
This PR will resolve kubeflow/training-operator#2068
I have updated the pytorch launcher component to use v2 constructs.
I have also updated the pytorch launcher component to use kubeflow training-operator TrainingClient.

Checklist:

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign neuromage for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

Hi @Fiona-Waters. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off by: Fiona-Waters [email protected]

Signed-off-by: Fiona Waters <[email protected]>
Signed-off-by: Fiona-Waters [email protected]

Signed-off-by: Fiona Waters <[email protected]>
@Fiona-Waters
Copy link
Contributor Author

@terrytangyuan maybe you could help with the training-operator related error I am getting here. Any advice would be really appreciated. Thank you.

@terrytangyuan
Copy link
Member

Can you print out job object? It's supposed to be an Job/CRD object with fields like kind.

@Fiona-Waters
Copy link
Contributor Author

Can you print out job object? It's supposed to be an Job/CRD object with fields like kind.

Sure. Running it quickly locally it looks like this:

job {'api_version': 'kubeflow.org/v1',
 'kind': 'PyTorchJob',
 'metadata': {'annotations': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'labels': None,
              'managed_fields': None,
              'name': 'pytorchjob',
              'namespace': 'kubeflow',
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': {'elastic_policy': None,
          'nproc_per_node': None,
          'pytorch_replica_specs': {'Master': {}, 'Worker': {}},
          'run_policy': {'active_deadline_seconds': None,
                         'backoff_limit': None,
                         'clean_pod_policy': 'Running',
                         'scheduling_policy': None,
                         'suspend': None,
                         'ttl_seconds_after_finished': None}},
 'status': None}

I can try to log it out when running on KinD later today.

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.openapi_types = self.swagger_types
# self.openapi_types = self.swagger_types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved to importing KubeflowOrgV1PyTorchJob from kubeflow.training and this does not have a swagger_types attribute. With the previous from kubeflow.pytorchjob import V1PyTorchJob there was an error around using floats and ints within the numpy config file.

AttributeError: module 'numpy' has no attribute 'float'.
`np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed these functions altogether as they are no longer required.

@Fiona-Waters
Copy link
Contributor Author

Can you print out job object? It's supposed to be an Job/CRD object with fields like kind.

Sure. Running it quickly locally it looks like this:

job {'api_version': 'kubeflow.org/v1',
 'kind': 'PyTorchJob',
 'metadata': {'annotations': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'labels': None,
              'managed_fields': None,
              'name': 'pytorchjob',
              'namespace': 'kubeflow',
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': {'elastic_policy': None,
          'nproc_per_node': None,
          'pytorch_replica_specs': {'Master': {}, 'Worker': {}},
          'run_policy': {'active_deadline_seconds': None,
                         'backoff_limit': None,
                         'clean_pod_policy': 'Running',
                         'scheduling_policy': None,
                         'suspend': None,
                         'ttl_seconds_after_finished': None}},
 'status': None}

I can try to log it out when running on KinD later today.

This is what the serialized job looks like. Kind is present.

{'apiVersion': 'kubeflow.org/v1', 'kind': 'PyTorchJob', 'metadata': {'name': 'pytorchjob', 'namespace': 'kubeflow'}, 'spec': {'pytorchReplicaSpecs': {'Master': {}, 'Worker': {}}, 'runPolicy': {'cleanPodPolicy': 'Running'}}}.

… operator

Signed-off-by: Fiona-Waters [email protected]

Signed-off-by: Fiona Waters <[email protected]>
@Fiona-Waters Fiona-Waters marked this pull request as ready for review October 17, 2024 22:29
@Fiona-Waters Fiona-Waters changed the title [WIP] chore: Updating Pytorch-Launcher component to work with pipelines v2 chore: Updating Pytorch-Launcher component to work with pipelines v2 Oct 17, 2024
@Fiona-Waters
Copy link
Contributor Author

@HumairAK would really appreciate if you could review this if/when you have time, please.

@@ -1,4 +1,6 @@
pyyaml
kubernetes
kubeflow-pytorchjob
kubeflow.training
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there minimal required version?

version=args.version,
client=api_client
)
#config.load_kube_config()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be needed?


# container component description setting inputs and implementation
@dsl.container_component
def pytorch_job_launcher(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd imagine code like would be inside the /src directory instead of sample.py file but I am not familiar enough with KFP codebase to support that.

Comment on lines -31 to -42
# Patch PyTorchJob APIs to align with k8s usage
class V1PyTorchJob(V1PyTorchJob_original):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.openapi_types = self.swagger_types


class V1PyTorchJobSpec(V1PyTorchJobSpec_original):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.openapi_types = self.swagger_types

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these no longer needed? Will this change break existing usage?

@terrytangyuan
Copy link
Member

Thank you for your work on this! I left some comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update pytorch launcher component in Kubeflow Pipelines repository
2 participants