{{pod.name}} value different when using retryStrategy within templateDefaults #13691

Open
paulfouquet opened this issue Oct 2, 2024 · 4 comments
Labels: area/controller (Controller issues, panics), area/templating (Templating with `{{...}}`), P3 (Low priority), type/bug

Comments

paulfouquet commented Oct 2, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Using Argo Workflows v3.5.5 on an AWS EKS cluster running Kubernetes 1.28, we had a task that used {{pod.name}} to generate a path to our S3 artifact bucket. The value returned by {{pod.name}} was composed of [WORKFLOW NAME]-[TASK NAME]-[POD ID] (the same value as the directory created under our S3 artifact bucket for the pod running our task).
We upgraded our cluster from EKS 1.28 to 1.29. Since this upgrade, {{pod.name}} returns a string composed of [WORKFLOW NAME]-[TASK NAME]-[TASK ID] (TASK ID instead of POD ID), but the directory created under our S3 artifact bucket is still composed from the generated name [WORKFLOW NAME]-[TASK NAME]-[POD ID], as it was prior to our upgrade. Note that we did not upgrade Argo Workflows and did not delete the cluster prior to the upgrade.
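
For illustration only, a hypothetical sketch of how a template could use {{pod.name}} in an S3 artifact key (this is not our actual template; it assumes a default artifact repository is configured so a key-only artifact can be used):

    - name: produce-output
      container:
        image: 'alpine:3.20'
        command: [sh, -c, 'echo done > /tmp/result']
      outputs:
        artifacts:
          - name: result
            path: /tmp/result
            s3:
              # key-only artifact: bucket/endpoint come from the configured artifact repository;
              # if {{pod.name}} resolves to a different ID, the artifact lands under a different prefix
              key: 'outputs/{{pod.name}}/result.tgz'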

We first suspected that we might have been in a weird state where Argo Workflows was using POD_NAMES=v1, left over from a previous Argo Workflows version on the cluster, and that the cluster upgrade had "restarted" the Argo Workflows server / controller with the new default value POD_NAMES=v2. But our tests using POD_NAMES=v1 show that {{pod.name}} now returns a string composed of [TASK NAME]-[TASK ID].
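
For reference, a minimal sketch of how the pod-naming version can be pinned while testing; this assumes the controller runs as deployment/workflow-controller in the argo namespace:

kubectl set env deployment/workflow-controller -n argo POD_NAMES=v1
kubectl rollout status deployment/workflow-controller -n argo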

Now the weird thing is that we deployed a brand new cluster, using EKS 1.28 and Argo Workflows v3.5.5, and we also encountered the issue: {{pod.name}} returns [WORKFLOW NAME]-[TASK NAME]-[TASK ID] while the S3 artifact bucket folder is [WORKFLOW NAME]-[TASK NAME]-[POD ID].

The same behaviour is also observed using EKS 1.30 and Argo Workflows v3.5.11.

This issue would be hard to reproduce. I am linking our public repository, which uses Argo Workflows, with the IaC for our cluster: https://github.com/linz/topo-workflows/tree/master/infra; the task that uses {{pod.name}} is here.

The issue is reproduced using the minimal workflow below:

> argo logs test-pod-name-plf6v test-pod-name-plf6v-get-pod-name-1010547582 -n argo
test-pod-name-plf6v-get-pod-name-1010547582: pod.name=test-pod-name-plf6v-get-pod-name-3577080563

The value returned by {{pod.name}} is not the one we expected. It seems to return the ID of the retry-type node rather than that of the pod node. For example: pod.name=test-pod-name-8zrgl-get-pod-name-2484369038, whereas we expected test-pod-name-8zrgl-get-pod-name-121682877.
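
To see where the two IDs come from, one approach (a sketch; the workflow name is the example above and jq is assumed to be available) is to compare the node IDs recorded in the workflow status with the actual pod name:

argo get test-pod-name-8zrgl -n argo -o json | jq -r '.status.nodes[] | "\(.id)\t\(.type)\t\(.displayName)"'
kubectl get pods -n argo -l workflows.argoproj.io/workflow=test-pod-name-8zrgl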

Thank you for your help.

Version(s)

v3.5.5
v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-pod-name-
spec:
  templateDefaults:
    retryStrategy:
      limit: '2'
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: get-pod-name
            template: get-pod-name
    - name: get-pod-name
      script:
        image: 'ghcr.io/linz/argo-tasks:latest'
        command: [node]
        source: |
          console.log('pod.name={{pod.name}}');
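
For completeness, a sketch of how this workflow can be submitted and the value checked (the file name is illustrative, and the argo CLI is assumed to be pointed at the argo namespace):

argo submit -n argo --watch test-pod-name.yaml
argo logs -n argo @latest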

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep test-pod-name-plf6v
time="2024-10-13T20:04:48.996Z" level=info msg="Processing workflow" Phase= ResourceVersion=959 namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="Retry node test-pod-name-plf6v initialized Running" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=info msg="DAG node test-pod-name-plf6v-4260280029 initialized Running" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.000Z" level=warning msg="was unable to obtain the node for test-pod-name-plf6v-3577080563, taskName get-pod-name"
time="2024-10-13T20:04:49.001Z" level=warning msg="was unable to obtain the node for test-pod-name-plf6v-3577080563, taskName get-pod-name"
time="2024-10-13T20:04:49.001Z" level=info msg="All of node test-pod-name-plf6v(0).get-pod-name dependencies [] completed" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.001Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.001Z" level=info msg="Retry node test-pod-name-plf6v-3577080563 initialized Running" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.001Z" level=info msg="Pod node test-pod-name-plf6v-1010547582 initialized Pending" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.015Z" level=info msg="Created pod: test-pod-name-plf6v(0).get-pod-name(0) (test-pod-name-plf6v-get-pod-name-1010547582)" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.015Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.015Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:49.026Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=964 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.017Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=964 namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.017Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.018Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=test-pod-name-plf6v-1010547582 old.message= old.phase=Pending old.progress=0/1 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.018Z" level=info msg="node test-pod-name-plf6v-3577080563 phase Running -> Succeeded" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.018Z" level=info msg="node test-pod-name-plf6v-3577080563 finished: 2024-10-13 20:04:59.018963346 +0000 UTC" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="Outbound nodes of test-pod-name-plf6v-4260280029 set to [test-pod-name-plf6v-1010547582]" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="node test-pod-name-plf6v-4260280029 phase Running -> Succeeded" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="node test-pod-name-plf6v-4260280029 finished: 2024-10-13 20:04:59.019062436 +0000 UTC" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="node test-pod-name-plf6v phase Running -> Succeeded" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="node test-pod-name-plf6v finished: 2024-10-13 20:04:59.019267226 +0000 UTC" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg=reconcileAgentPod namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.019Z" level=info msg="Marking workflow completed" namespace=argo workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.046Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=1002 workflow=test-pod-name-plf6v
time="2024-10-13T20:04:59.062Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/test-pod-name-plf6v-get-pod-name-1010547582/labelPodCompleted

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=test-pod-name-plf6v
time="2024-10-13T20:04:51.074Z" level=info msg="Starting Workflow Executor" version=v3.5.11
time="2024-10-13T20:04:51.078Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-10-13T20:04:51.078Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=test-pod-name-plf6v-get-pod-name-1010547582 templateName=get-pod-name version="&Version{Version:v3.5.11,BuildDate:2024-09-20T14:09:00Z,GitCommit:25bbb71cced32b671f9ad35f0ffd1f0ddb8226ee,GitTag:v3.5.11,GitTreeState:clean,GoVersion:go1.21.13,Compiler:gc,Platform:linux/amd64,}"
time="2024-10-13T20:04:51.085Z" level=info msg="Starting deadline monitor"
time="2024-10-13T20:04:53.087Z" level=info msg="Main container completed" error="<nil>"
time="2024-10-13T20:04:53.087Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-10-13T20:04:53.087Z" level=info msg="No output parameters"
time="2024-10-13T20:04:53.087Z" level=info msg="No output artifacts"
time="2024-10-13T20:04:53.097Z" level=info msg="Alloc=7908 TotalAlloc=13631 Sys=25189 NumGC=4 Goroutines=8"
time="2024-10-13T20:04:53.102Z" level=info msg="Deadline monitor stopped"
shuangkun added the area/controller (Controller issues, panics) label Oct 2, 2024
agilgur5 changed the title to "{{pod.name}} inconsistency" Oct 2, 2024
agilgur5 added the area/templating (Templating with `{{...}}`) and P3 (Low priority) labels Oct 5, 2024
chengjoey (Contributor) commented:
cannot reproduce in minikube/kind

paulfouquet (Author) commented:
Thanks @chengjoey for looking at it.

Do you mean that you can't reproduce the initial behaviour we had:

  • {{pod.name}} was composed by: [WORKFLOW NAME]-[TASK NAME]-[POD ID]

And what you've got is:

  • {{pod.name}} returning [WORKFLOW NAME]-[TASK NAME]-[TASK ID] and the S3 artifact bucket folder being [WORKFLOW NAME]-[TASK NAME]-[POD ID]

I would like to make sure whether what we currently see is the expected behaviour.

paulfouquet commented Oct 9, 2024

I confirm that I CANNOT reproduce this bug using kind with K8s 1.30 and Argo Workflows v3.5.5/v3.5.11. The issue might lie in our configuration or on the EKS side. We will investigate further on AWS EKS and update this ticket with what we find.

paulfouquet changed the title from "{{pod.name}} inconsistency" to "{{pod.name}} value different when using retryStrategy withing templateDefaults" Oct 13, 2024
paulfouquet (Author) commented:
The issue happens when a retryStrategy is set in the templateDefaults. I've updated the description.

paulfouquet changed the title from "{{pod.name}} value different when using retryStrategy withing templateDefaults" to "{{pod.name}} value different when using retryStrategy within templateDefaults" Oct 14, 2024