chore: Updating Pytorch-Launcher component to work with pipelines v2 #11273

Fiona-Waters · 2024-10-07T09:23:36Z

Description of your changes:
This PR will resolve kubeflow/training-operator#2068
I have updated the pytorch launcher component to use v2 constructs.
I have also updated the pytorch launcher component to use kubeflow training-operator TrainingClient.

Checklist:

You have signed off your commits
The title for your pull request (PR) should follow our title convention. Learn more about the pull request title convention used in this repository.

google-oss-prow · 2024-10-07T09:23:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign neuromage for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

components/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow · 2024-10-07T09:23:49Z

Hi @Fiona-Waters. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Fiona-Waters · 2024-10-07T09:28:08Z

@terrytangyuan maybe you could help with the training-operator related error I am getting here. Any advice would be really appreciated. Thank you.

terrytangyuan · 2024-10-08T14:58:41Z

Can you print out the job object? It's supposed to be a Job/CRD object with fields like kind.

Fiona-Waters · 2024-10-08T15:18:09Z

Can you print out the job object? It's supposed to be a Job/CRD object with fields like kind.

Sure. Running it quickly locally it looks like this:

job {'api_version': 'kubeflow.org/v1',
 'kind': 'PyTorchJob',
 'metadata': {'annotations': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'labels': None,
              'managed_fields': None,
              'name': 'pytorchjob',
              'namespace': 'kubeflow',
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': {'elastic_policy': None,
          'nproc_per_node': None,
          'pytorch_replica_specs': {'Master': {}, 'Worker': {}},
          'run_policy': {'active_deadline_seconds': None,
                         'backoff_limit': None,
                         'clean_pod_policy': 'Running',
                         'scheduling_policy': None,
                         'suspend': None,
                         'ttl_seconds_after_finished': None}},
 'status': None}

I can try to log it out when running on KinD later today.

terrytangyuan · 2024-10-09T00:25:05Z

components/kubeflow/pytorch-launcher/src/launch_pytorchjob.py

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
-        self.openapi_types = self.swagger_types
+        # self.openapi_types = self.swagger_types


Why is this removed?

I moved to importing KubeflowOrgV1PyTorchJob from kubeflow.training, which does not have a swagger_types attribute. The previous import, from kubeflow.pytorchjob import V1PyTorchJob, failed with a numpy error about the deprecated float and int aliases:

AttributeError: module 'numpy' has no attribute 'float'. `np.float` was a deprecated alias for the builtin `float`. To avoid this error in existing code, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.

I've removed these functions altogether as they are no longer required.


Fiona-Waters · 2024-10-16T21:11:31Z


This is what the serialized job looks like. Kind is present.

{'apiVersion': 'kubeflow.org/v1', 'kind': 'PyTorchJob', 'metadata': {'name': 'pytorchjob', 'namespace': 'kubeflow'}, 'spec': {'pytorchReplicaSpecs': {'Master': {}, 'Worker': {}}, 'runPolicy': {'cleanPodPolicy': 'Running'}}}.
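To make the relationship between the two printouts concrete, here is a minimal sketch of how the snake_case model dict maps to the serialized form: None-valued fields are dropped and attribute names are renamed to their camelCase JSON keys. The `serialize` helper and `ATTRIBUTE_MAP` are hypothetical names written for illustration; this is not the client library's actual serializer, and the job dict is trimmed to a few fields.

```python
from typing import Any

# Hypothetical snake_case -> camelCase mapping, covering only the keys
# that appear in the printed job. The generated models carry their own map.
ATTRIBUTE_MAP = {
    "api_version": "apiVersion",
    "pytorch_replica_specs": "pytorchReplicaSpecs",
    "run_policy": "runPolicy",
    "clean_pod_policy": "cleanPodPolicy",
}


def serialize(value: Any) -> Any:
    """Recursively drop None-valued fields and rename known keys."""
    if isinstance(value, dict):
        return {
            ATTRIBUTE_MAP.get(key, key): serialize(item)
            for key, item in value.items()
            if item is not None
        }
    if isinstance(value, list):
        return [serialize(item) for item in value]
    return value


# Trimmed version of the job object printed earlier in the thread.
job = {
    "api_version": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"annotations": None, "name": "pytorchjob", "namespace": "kubeflow"},
    "spec": {
        "elastic_policy": None,
        "pytorch_replica_specs": {"Master": {}, "Worker": {}},
        "run_policy": {"backoff_limit": None, "clean_pod_policy": "Running"},
    },
    "status": None,
}

serialized = serialize(job)
print(serialized)
```

Under these assumptions the top-level kind field survives serialization unchanged, which matches the output above.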


Fiona-Waters · 2024-10-17T22:37:38Z

@HumairAK would really appreciate if you could review this if/when you have time, please.

terrytangyuan · 2024-10-18T14:32:19Z

components/kubeflow/pytorch-launcher/requirements.txt

@@ -1,4 +1,6 @@
 pyyaml
 kubernetes
 kubeflow-pytorchjob
+kubeflow.training


Is there a minimum required version?

terrytangyuan · 2024-10-18T14:33:07Z

components/kubeflow/pytorch-launcher/src/launch_pytorchjob.py

-        version=args.version,
-        client=api_client
-    )
+    #config.load_kube_config()


This might be needed?

terrytangyuan · 2024-10-18T14:35:39Z

components/kubeflow/pytorch-launcher/sample.py

+
+# container component description setting inputs and implementation
+@dsl.container_component
+def pytorch_job_launcher(


I'd imagine code like this would live inside the /src directory rather than in the sample.py file, but I am not familiar enough with the KFP codebase to back that up.

terrytangyuan · 2024-10-18T14:36:15Z

components/kubeflow/pytorch-launcher/src/launch_pytorchjob.py

-# Patch PyTorchJob APIs to align with k8s usage
-class V1PyTorchJob(V1PyTorchJob_original):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.openapi_types = self.swagger_types
-
-
-class V1PyTorchJobSpec(V1PyTorchJobSpec_original):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.openapi_types = self.swagger_types
-


Are these no longer needed? Will this change break existing usage?
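For context on what the removed patch did, here is a minimal sketch using stand-in classes rather than the real kubeflow or kubernetes models. It assumes the serializer looks fields up through an openapi_types attribute, which is what the alias in the removed code suggests; `LegacyModel`, `PatchedModel`, `field_names` and `can_serialize` are hypothetical names for illustration only.

```python
class LegacyModel:
    """Stand-in for a generated model that only exposes swagger_types."""

    swagger_types = {"kind": "str", "api_version": "str"}

    def __init__(self, kind: str, api_version: str) -> None:
        self.kind = kind
        self.api_version = api_version


class PatchedModel(LegacyModel):
    """Applies the same alias as the removed V1PyTorchJob subclass."""

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.openapi_types = self.swagger_types


def field_names(model) -> list:
    """Hypothetical serializer step that reads fields via openapi_types."""
    return sorted(model.openapi_types)


def can_serialize(model) -> bool:
    """Report whether the hypothetical serializer can read the model."""
    try:
        field_names(model)
        return True
    except AttributeError:
        return False


legacy = LegacyModel("PyTorchJob", "kubeflow.org/v1")
patched = PatchedModel("PyTorchJob", "kubeflow.org/v1")
print(can_serialize(legacy), can_serialize(patched))
```

If the model class in use already exposes the attribute the serializer expects, the alias subclass adds nothing, which would be consistent with dropping it.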

terrytangyuan · 2024-10-18T14:36:44Z

Thank you for your work on this! I left some comments.

google-oss-prow bot added the do-not-merge/work-in-progress label Oct 7, 2024

google-oss-prow bot requested review from animeshsingh and IronPan October 7, 2024 09:23

google-oss-prow bot added size/L needs-ok-to-test labels Oct 7, 2024

Fiona-Waters added 2 commits October 7, 2024 10:26

Update pytorch launcher for pipelines v2

2567001

Signed-off by: Fiona-Waters [email protected] Signed-off-by: Fiona Waters <[email protected]>

Use training operator TrainingClient

d3f3c41

Signed-off-by: Fiona-Waters [email protected] Signed-off-by: Fiona Waters <[email protected]>

Fiona-Waters force-pushed the pytorch branch from d635a34 to 9ce1e27 October 7, 2024 09:26

terrytangyuan reviewed Oct 9, 2024


Further refactoring of pytorch launcher for pipelines v2 and training…

1722934

… operator Signed-off-by: Fiona-Waters [email protected] Signed-off-by: Fiona Waters <[email protected]>

Fiona-Waters force-pushed the pytorch branch from 9ce1e27 to 1722934 October 17, 2024 22:26

Fiona-Waters marked this pull request as ready for review October 17, 2024 22:29

Fiona-Waters changed the title from "[WIP] chore: Updating Pytorch-Launcher component to work with pipelines v2" to "chore: Updating Pytorch-Launcher component to work with pipelines v2" Oct 17, 2024

google-oss-prow bot removed the do-not-merge/work-in-progress label Oct 17, 2024

terrytangyuan reviewed Oct 18, 2024
