Can Katalyst add scheduling support for LLC and memory bandwidth resources? #102

yanxiaoqi932 · 2023-06-12T09:13:20Z

What would you like to be added?

Besides CPU cores, last-level cache (LLC) and memory bandwidth (MB) can also become bottleneck resources. Could you add scheduling support for LLC and MB?

Why is this needed?

LLC and memory bandwidth are sometimes the bottleneck resources.

caohe · 2023-06-20T07:26:31Z

Yes, isolation of LLC and memory bandwidth is also on the roadmap. However, they may not be treated as resources allocated by the scheduler; instead, resource contention would be avoided through interference detection and suppression.
If you are interested in this feature, you are welcome to join the discussion and the implementation.
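One way to read "interference detection and suppression" is a loop that compares each online pod's current cycles per instruction (CPI) with a baseline measured without co-located load and, when the slowdown crosses a threshold, picks the offline pod using the most memory bandwidth as the suppression target. This is a minimal sketch under those assumptions: `PodSample`, the 1.2 slowdown threshold, and CPI as the interference signal are all hypothetical choices, not Katalyst's design.

```python
from dataclasses import dataclass


# Hypothetical per-pod sample; field names are illustrative, not a Katalyst API.
@dataclass
class PodSample:
    name: str
    online: bool
    cpi: float           # cycles per instruction observed now
    baseline_cpi: float  # CPI measured when the pod ran without co-located load
    mem_bw_mbps: float   # memory bandwidth attributed to the pod


def detect_interference(samples, slowdown_threshold=1.2):
    """Return names of online pods whose CPI exceeds threshold x baseline."""
    return [
        s.name
        for s in samples
        if s.online and s.baseline_cpi > 0
        and s.cpi / s.baseline_cpi > slowdown_threshold
    ]


def pick_suppression_target(samples):
    """Choose the offline pod consuming the most memory bandwidth, or None."""
    offline = [s for s in samples if not s.online]
    if not offline:
        return None
    return max(offline, key=lambda s: s.mem_bw_mbps).name


def suppression_plan(samples, slowdown_threshold=1.2):
    """Return None if no online pod is affected, else victims and a target."""
    victims = detect_interference(samples, slowdown_threshold)
    if not victims:
        return None
    return {"victims": victims, "suppress": pick_suppression_target(samples)}


samples = [
    PodSample("web", True, cpi=1.5, baseline_cpi=1.0, mem_bw_mbps=2000),
    PodSample("batch-a", False, cpi=2.0, baseline_cpi=2.0, mem_bw_mbps=8000),
    PodSample("batch-b", False, cpi=1.0, baseline_cpi=1.0, mem_bw_mbps=3000),
]
print(suppression_plan(samples))  # {'victims': ['web'], 'suppress': 'batch-a'}
```

What "suppress" means is left open here: it could be lowering the target's MBA percentage, shrinking its LLC ways, or evicting it as a last resort.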

Rouzip · 2023-10-27T03:34:53Z

Hello, how do you determine that LLC and memory bandwidth are the bottleneck resources?

yanxiaoqi932 · 2023-10-27T16:33:54Z

I found that when we co-locate online and offline memory-intensive pods, LLC and memory bandwidth are likely to become bottlenecks for the online pods; if we assign more LLC and memory bandwidth to the online pods, they perform better. We can use Intel's CAT (Cache Allocation Technology) and MBA (Memory Bandwidth Allocation) to restrict offline memory-intensive pods, and evict them when necessary.
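On Linux, CAT and MBA are exposed through the resctrl filesystem (usually mounted at `/sys/fs/resctrl`). Each resource group directory has a `schemata` file with lines such as `L3:<cache_id>=<cbm>;...` (a cache bit mask per L3 domain, typically one per socket) and `MB:<cache_id>=<percent>;...` (a bandwidth throttle). The helpers below are a hypothetical sketch that only builds those lines. The 12-way cache and the 10% granularity are assumptions; real values come from `info/L3/cbm_mask`, `info/MB/bandwidth_gran` and `info/MB/min_bandwidth`.

```python
# Hypothetical helpers that build resctrl "schemata" lines for Intel CAT and MBA.
# They only produce strings; writing them to /sys/fs/resctrl/<group>/schemata
# (as root, on RDT-capable hardware) is left to the caller.


def contiguous_cbm(num_ways: int, offset: int = 0) -> str:
    """Return a contiguous cache bit mask (CBM) in hex, e.g. 4 ways -> 'f'.

    Intel CAT requires the set bits of a CBM to be contiguous.
    """
    if num_ways <= 0 or offset < 0:
        raise ValueError("num_ways must be positive and offset non-negative")
    mask = ((1 << num_ways) - 1) << offset
    return format(mask, "x")


def l3_schemata(cbm_by_cache_id: dict[int, str]) -> str:
    """Build an 'L3:<id>=<cbm>;...' line, one entry per L3 cache domain."""
    parts = ";".join(f"{cid}={cbm}" for cid, cbm in sorted(cbm_by_cache_id.items()))
    return f"L3:{parts}"


def mb_schemata(pct_by_cache_id: dict[int, int], granularity: int = 10,
                min_pct: int = 10) -> str:
    """Build an 'MB:<id>=<percent>;...' line, rounding down to the granularity."""
    parts = []
    for cid, pct in sorted(pct_by_cache_id.items()):
        value = (pct // granularity) * granularity
        value = max(min_pct, min(100, value))
        parts.append(f"{cid}={value}")
    return "MB:" + ";".join(parts)


# Example (assumed 12-way L3): the online group gets the upper 8 ways on both
# sockets, the offline group gets the lower 4 ways (no overlap) and is throttled
# to about 30% of memory bandwidth on socket 0.
online = l3_schemata({0: contiguous_cbm(8, 4), 1: contiguous_cbm(8, 4)})
offline = l3_schemata({0: contiguous_cbm(4), 1: contiguous_cbm(4)})
offline_mb = mb_schemata({0: 35, 1: 100})
print(online)      # L3:0=ff0;1=ff0
print(offline)     # L3:0=f;1=f
print(offline_mb)  # MB:0=30;1=100
```

Keeping the online and offline masks disjoint gives strict isolation; overlapping masks trade isolation for utilization, which is the balance discussed later in this thread.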

Rouzip · 2023-11-02T01:44:44Z

I am also considering similar problems, but several issues need to be thought through and solved:
1. Offline workloads need to be profiled in advance to judge whether LLC and memory bandwidth contention would strongly affect them. If an offline workload is complex, it is difficult to determine from monitoring alone whether it is memory-intensive.
2. Since a single node has limited LLC and memory bandwidth, allocating too much to the online workloads wastes resources. If the goal is only eviction, could you use pod affinity to solve the problem instead?

I hope to have a more in-depth discussion with you!
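For the first issue, one common way to profile in advance is to separate two properties: how much a workload stresses the memory hierarchy (LLC misses per kilo-instruction, MPKI) and how much it suffers when run next to a cache/bandwidth-hungry co-runner. The sketch below labels a profiling run on both axes. `ProfileRun`, the label names, and both thresholds are hypothetical and would need calibrating per cluster.

```python
from dataclasses import dataclass


# Hypothetical result of one offline profiling run.
@dataclass
class ProfileRun:
    instructions: int
    llc_misses: int
    throughput_isolated: float   # e.g. records/s when run alone
    throughput_contended: float  # same metric next to a cache/bandwidth stressor


def mpki(run: ProfileRun) -> float:
    """LLC misses per thousand instructions."""
    return run.llc_misses * 1000 / run.instructions


def sensitivity(run: ProfileRun) -> float:
    """Fractional throughput loss under contention (0.0 means unaffected)."""
    return 1.0 - run.throughput_contended / run.throughput_isolated


def classify(run: ProfileRun, mpki_threshold=10.0, sensitivity_threshold=0.15):
    """Return labels: 'memory-intensive' (aggressor) and/or 'llc-mb-sensitive' (victim)."""
    labels = []
    if mpki(run) >= mpki_threshold:
        labels.append("memory-intensive")
    if sensitivity(run) >= sensitivity_threshold:
        labels.append("llc-mb-sensitive")
    return labels


heavy = ProfileRun(2_000_000, 30_000, throughput_isolated=100.0, throughput_contended=80.0)
light = ProfileRun(1_000_000, 1_000, throughput_isolated=100.0, throughput_contended=98.0)
print(classify(heavy))  # ['memory-intensive', 'llc-mb-sensitive']
print(classify(light))  # []
```

Keeping "aggressor" and "victim" as separate labels matters because a workload can be memory-intensive without being sensitive, and only sensitive workloads need protection.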

yanxiaoqi932 · 2023-11-03T11:17:38Z

The first issue is indeed a complex problem: we have to profile in advance to obtain each pod's sensitivity to LLC and MB, and label the pod accordingly. For the second issue, I think pod affinity is a feasible solution. As we know, pods deployed on the same socket share LLC and MB. For example, if we want to deploy an LLC-sensitive pod, we should avoid placing it on a socket that already hosts an LLC-sensitive pod. This avoids LLC contention without requiring an eviction policy. It is worth mentioning that Katalyst will support this capability soon: #220
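Standard Kubernetes pod (anti-)affinity works on topology domains defined by node labels, so on its own it cannot tell sockets inside a node apart; the per-socket rule needs a socket-aware scheduler. The sketch below models that rule as a hypothetical filter-and-score step: an LLC-sensitive pod may only land on a socket with no other LLC-sensitive pod, and among the remaining sockets the tightest fit wins. `Socket`, `pick_socket` and `place` are illustrative names, not Katalyst's implementation of #220.

```python
from dataclasses import dataclass, field


# Hypothetical view of one CPU socket on one node.
@dataclass
class Socket:
    node: str
    socket_id: int
    free_cpus: int
    sensitive_pods: list = field(default_factory=list)


def pick_socket(sockets, cpus_needed, llc_sensitive):
    """Filter sockets that fit; LLC-sensitive pods avoid sockets with another one."""
    candidates = [s for s in sockets if s.free_cpus >= cpus_needed]
    if llc_sensitive:
        candidates = [s for s in candidates if not s.sensitive_pods]
    if not candidates:
        return None
    # Tightest fit first (simple bin-packing), ties broken deterministically.
    return min(candidates, key=lambda s: (s.free_cpus, s.node, s.socket_id))


def place(sockets, pod, cpus_needed, llc_sensitive):
    """Reserve CPUs on the chosen socket and return (node, socket_id), or None."""
    s = pick_socket(sockets, cpus_needed, llc_sensitive)
    if s is None:
        return None
    s.free_cpus -= cpus_needed
    if llc_sensitive:
        s.sensitive_pods.append(pod)
    return (s.node, s.socket_id)


sockets = [Socket("n1", 0, 16, ["db"]), Socket("n1", 1, 8), Socket("n2", 0, 32)]
print(place(sockets, "cache-svc", 4, True))  # ('n1', 1)
print(place(sockets, "api", 4, True))        # ('n2', 0)
print(place(sockets, "batch", 4, False))     # ('n1', 1)
print(place(sockets, "web", 2, True))        # None
```

The last call returns `None` because every socket with room already hosts an LLC-sensitive pod; a real scheduler would leave the pod pending rather than break the rule.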

Rouzip · 2023-11-07T02:40:11Z

Good job!
In our previous experiments, we had a similar idea. However, implementing an RDT (Intel Resource Director Technology) usage strategy (isolating workloads, limiting low-priority workloads, or adjusting dynamically based on monitoring data) and determining the profiles of other workloads proved challenging in practice. It is difficult to strike a balance between letting other workloads use LLC and MB reasonably and ensuring that performance does not drop significantly. The idea of placing only one LLC-sensitive workload per socket is intriguing. As mentioned before, we struggled to find a reasonable RDT allocation strategy that would be robust across the cluster. Looking forward to your ideas!
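The "dynamically adjusting based on monitoring data" option can be sketched as a feedback controller on the offline group's MBA percentage: halve it when the online workload's p99 latency violates its SLO, raise it by a fixed step when there is clear headroom, and hold it in between. This additive-increase/multiplicative-decrease scheme, the 0.8 headroom band, and the 10% granularity are assumptions chosen for illustration, not a strategy described in this thread.

```python
# Hypothetical AIMD controller for the offline group's MBA throttle (percent).
def next_offline_mb(current_pct, online_p99_ms, slo_ms, *, relax_below=0.8,
                    step_up=10, backoff=0.5, min_pct=10, max_pct=100,
                    granularity=10):
    if online_p99_ms > slo_ms:
        target = current_pct * backoff   # SLO violated: back off multiplicatively
    elif online_p99_ms < slo_ms * relax_below:
        target = current_pct + step_up   # clear headroom: give bandwidth back slowly
    else:
        target = current_pct             # inside the hysteresis band: hold
    # Round down to the MBA granularity and clamp to the allowed range.
    target = int(target // granularity) * granularity
    return max(min_pct, min(max_pct, target))


pct = 100
trace = []
for p99 in [120, 110, 70, 70, 90]:  # online p99 latency samples, SLO = 100 ms
    pct = next_offline_mb(pct, p99, slo_ms=100)
    trace.append(pct)
print(trace)  # [50, 20, 30, 40, 40]
```

The hold band between 80% and 100% of the SLO keeps the throttle from oscillating every interval; the asymmetric steps protect the online workload quickly while returning bandwidth to offline work gradually.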

caohe added the "enhancement" (New feature or request) label on Jun 20, 2023.
