
loki ruler v3.x support for defining alertrules as configmaps #14432

Open

zbialik opened this issue Oct 9, 2024 · 2 comments

zbialik commented Oct 9, 2024

Describe the bug

I am trying to upgrade Loki from v2.9 to v3.2 and have already done so for the majority of my k8s clusters. I have migrated from the loki-distributed helm chart to the latest loki helm chart, which now supports the microservices deployment mode.

Migrating to the new helm chart and upgrading to v3.2 changed loki-ruler from a Deployment to a StatefulSet whose spec is essentially identical to the previous (v2.9) Deployment.

Upon deploying the upgraded loki-ruler, the Pod immediately errors and crash-loops with the log below:

level=info ts=2024-10-08T23:47:43.72711083Z caller=main.go:126 msg="Starting Loki" version="(version=k218-659f542, branch=k218, revision=659f5421)"
level=info ts=2024-10-08T23:47:43.727143114Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
level=info ts=2024-10-08T23:47:43.727727024Z caller=server.go:351 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
level=info ts=2024-10-08T23:47:43.729017391Z caller=memberlist_client.go:439 msg="Using memberlist cluster label and node name" cluster_label= node=loki-ruler-0-5d5f70f4
level=info ts=2024-10-08T23:47:43.729987672Z caller=memberlist_client.go:549 msg="memberlist fast-join starting" nodes_found=1 to_join=4
level=info ts=2024-10-08T23:47:43.832828323Z caller=mapper.go:47 msg="cleaning up mapped rules directory" path=/tmp/loki/scratch
level=info ts=2024-10-08T23:47:43.833192913Z caller=memberlist_client.go:569 msg="memberlist fast-join finished" joined_nodes=6 elapsed_time=103.207037ms
level=info ts=2024-10-08T23:47:43.833229035Z caller=memberlist_client.go:581 phase=startup msg="joining memberlist cluster" join_members=loki-memberlist
level=info ts=2024-10-08T23:47:43.835066232Z caller=module_service.go:82 msg=starting module=server
level=info ts=2024-10-08T23:47:43.835140205Z caller=module_service.go:82 msg=starting module=runtime-config
level=info ts=2024-10-08T23:47:43.835155406Z caller=module_service.go:82 msg=starting module=analytics
level=info ts=2024-10-08T23:47:43.835315964Z caller=module_service.go:82 msg=starting module=memberlist-kv
level=info ts=2024-10-08T23:47:43.835330823Z caller=module_service.go:82 msg=starting module=ring
level=info ts=2024-10-08T23:47:43.835962324Z caller=module_service.go:82 msg=starting module=ingester-querier
level=info ts=2024-10-08T23:47:43.83598184Z caller=module_service.go:82 msg=starting module=store
level=info ts=2024-10-08T23:47:43.835999889Z caller=module_service.go:82 msg=starting module=ruler
level=info ts=2024-10-08T23:47:43.836009883Z caller=ruler.go:533 msg="ruler up and running"
level=info ts=2024-10-08T23:47:43.836037657Z caller=loki.go:544 msg="Loki started" startup_time=199.57001ms
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x300f48]

goroutine 466 [running]:
net/url.(*URL).String(0x0)
        /usr/local/go/src/net/url/url.go:817 +0x28
github.com/grafana/loki/v3/pkg/ruler.MultiTenantRuleManager.func1({0x3c78648, 0x4001506a50}, {0x4000dc5c7c, 0x4}, 0x4000a737a0, {0x3c3b460, 0x40016613b0}, {0x3c6a290, 0x40015077c0})
        /src/loki/pkg/ruler/compat.go:160 +0x2cc
github.com/grafana/loki/v3/pkg/ruler/base.(*DefaultMultiTenantManager).newManager(0x4000ebfc08, {0x3c78648, 0x4001506a50}, {0x4000dc5c7c, 0x4})
        /src/loki/pkg/ruler/base/manager.go:185 +0x144
github.com/grafana/loki/v3/pkg/ruler/base.(*DefaultMultiTenantManager).syncRulesToManager(0x4000ebfc08, {0x3c78648, 0x4001506a50}, {0x4000dc5c7c, 0x4}, {0x4000b65478?, 0x4003f21a48?, 0x4000b71c38?})
        /src/loki/pkg/ruler/base/manager.go:149 +0x318
github.com/grafana/loki/v3/pkg/ruler/base.(*DefaultMultiTenantManager).SyncRuleGroups(0x4000ebfc08, {0x3c78648, 0x4001506a50}, 0x40011c19e0)
        /src/loki/pkg/ruler/base/manager.go:110 +0x128
github.com/grafana/loki/v3/pkg/ruler.(*MultiTenantManager).SyncRuleGroups(0x4000ec0808?, {0x3c78648?, 0x4001506a50?}, 0x2?)
        /src/loki/pkg/ruler/compat.go:107 +0x30
github.com/grafana/loki/v3/pkg/ruler/base.(*Ruler).syncRules(0x4000ec0808, {0x3c78648, 0x4001506a50}, {0x3175699, 0x7})
        /src/loki/pkg/ruler/base/ruler.go:587 +0x3d8
github.com/grafana/loki/v3/pkg/ruler/base.(*Ruler).run(0x4000ec0808, {0x3c78648, 0x4001506a50})
        /src/loki/pkg/ruler/base/ruler.go:548 +0x254
github.com/grafana/dskit/services.(*BasicService).main(0x4000d31b80)
        /src/loki/vendor/github.com/grafana/dskit/services/basic_service.go:193 +0x17c
created by github.com/grafana/dskit/services.(*BasicService).StartAsync.func1 in goroutine 433
        /src/loki/vendor/github.com/grafana/dskit/services/basic_service.go:122 +0xfc

I want to emphasize that all the other Loki components were a breeze to upgrade; it is only loki-ruler that I am seeing issues with.

Previously (v2.9), loki-ruler ran as a Deployment with the below spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: ruler
    app.kubernetes.io/instance: loki
    app.kubernetes.io/name: loki-distributed
  name: loki-ruler
  namespace: logging
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/component: ruler
      app.kubernetes.io/instance: loki
      app.kubernetes.io/name: loki-distributed
  template:
    metadata:
      labels:
        app.kubernetes.io/component: ruler
        app.kubernetes.io/instance: loki
        app.kubernetes.io/name: loki-distributed
    spec:
      containers:
      - args:
        - -config.file=/etc/loki/config/config.yaml
        - -target=ruler
        image: grafana/loki:2.9.10
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        volumeMounts:
        - mountPath: /etc/loki/config
          name: config
        - mountPath: /var/loki-distributed-runtime
          name: runtime-config
        - mountPath: /var/loki
          name: data
        - mountPath: /tmp/loki
          name: tmp
        - mountPath: /etc/loki/rules/
          name: rules
      securityContext:
        fsGroup: 10001
        runAsGroup: 10001
        runAsNonRoot: true
        runAsUser: 10001
      volumes:
      - name: config
        configMap:
          name: loki
          items:
            - key: "config.yaml"
              path: "config.yaml"
      - configMap:
          defaultMode: 420
          name: loki-runtime
        name: runtime-config
      - emptyDir: {}
        name: tmp
      - emptyDir: {}
        name: data
      - name: rules
        projected:
          defaultMode: 420
          sources:
          - configMap:
              items:
              - key: my-alerts.yaml
                path: fake/my-alerts.yaml
              name: loki-rule-my-alerts
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-rule-my-alerts
  namespace: logging
data:
  my-alerts.yaml: |
    groups:
      - name: should_fire
        rules:
          - alert: HighPercentageError
            expr: |
              sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
                /
              sum(rate({app="foo", env="production"}[5m])) by (job)
                > 0.05
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: High error rate

The Loki config.yaml I am using has the following in its ruler: section:

...
ruler:
  alertmanager_url: http://alertmanager-main.monitoring.svc.cluster.local.:9093
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  storage:
    local:
      directory: /etc/loki/rules
    type: local
  enable_api: true
  enable_alertmanager_v2: true
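
For context, with type: local the ruler reads rule files from per-tenant subdirectories under ruler.storage.local.directory, which is why the projected ConfigMap above places the file under a fake/ path (fake is the tenant ID when auth_enabled is false). A sketch of the on-disk layout the ruler expects, using the paths from this config:

/etc/loki/rules/            # ruler.storage.local.directory
└── fake/                   # tenant ID ("fake" when auth_enabled: false)
    └── my-alerts.yaml      # rule group file mounted from the ConfigMap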

I'm not exactly sure what has changed since Loki v2.9, but I've not been able to upgrade loki-ruler to v3.2 due to this issue.

Important Notes

  1. If I remove the ConfigMap volume/volumeMount, things will work fine, but that obviously doesn't meet our needs for maintaining our logging alert rules as ConfigMap objects.
  2. If you try deploying the loki helm chart with deploymentMode=Distributed and defining ruler.directories for alerts, the deployment silently fails: the generated loki config.yaml defaults to the S3 storage backend for the ruler when, at that point, it should configure the ruler to use local storage (see the values sketch after this list).
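
A workaround for note 2 might be to force local ruler storage explicitly through the chart values. This is a minimal, untested sketch: deploymentMode, ruler.directories, and loki.structuredConfig are values exposed by the loki helm chart, but other ruler values are omitted, the tenant key (fake) assumes multi-tenancy is disabled, and the exact mount path of the generated rule ConfigMaps should be verified against the rendered manifests:

deploymentMode: Distributed

ruler:
  directories:
    fake:                              # tenant ID ("fake" when auth_enabled: false)
      my-alerts.yaml: |
        groups:
          - name: should_fire
            rules:
              - alert: HighPercentageError
                expr: |
                  sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
                    /
                  sum(rate({app="foo", env="production"}[5m])) by (job)
                    > 0.05
                for: 10m
                labels:
                  severity: warning

loki:
  structuredConfig:
    ruler:
      storage:
        type: local
        local:
          directory: /etc/loki/rules   # must match where the chart mounts the rule files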

Expected behavior

Loki Ruler should run fine. It feels like the helm chart either needs an initContainer to copy the alert rule files to where the ruler expects to find them, or needs to handle that synchronization better; a rough initContainer sketch is included below. There are a few related issues that are still open and relevant to this today.
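
For illustration, a rough sketch of such an initContainer, to be added to the ruler StatefulSet: the container name, image, and the rules-writable emptyDir volume are hypothetical, and the main container would mount rules-writable at /etc/loki/rules in place of the projected ConfigMap (cp -rL dereferences the symlinks that ConfigMap mounts create):

initContainers:
  - name: copy-rules                      # hypothetical name
    image: busybox:1.36                   # any image with a shell and cp
    command:
      - sh
      - -c
      - cp -rL /rules-src/. /rules-dst/   # copy ConfigMap rules into a writable dir
    volumeMounts:
      - name: rules                       # the projected ConfigMap volume from the spec above
        mountPath: /rules-src
        readOnly: true
      - name: rules-writable              # hypothetical emptyDir owned by the ruler pod
        mountPath: /rules-dst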

Environment:

  • Infrastructure: Kubernetes v1.30.4
  • Deployment tool: helm template > plain manifests

Screenshots, Promtail config, or terminal output
N/A

@zbialik zbialik changed the title loki ruler v3.2 does not support definition of alertrules with configmap loki ruler v3.2 support for alertrules as configmaps Oct 9, 2024
@zbialik zbialik changed the title loki ruler v3.2 support for alertrules as configmaps loki ruler v3.x support for defining alertrules as configmaps Oct 9, 2024
zbialik commented Oct 9, 2024

Update: I've tried versions 3.0.0 and 3.1.1, and both produce similar results.

zbialik commented Oct 9, 2024

I'm forced to keep loki-ruler on a downgraded Loki version (2.9) until this is resolved, because we require all of our alert rules to be defined in code as ConfigMaps (GitOps) rather than through the Grafana UI.
