Can katalyst add the scheduling function for LLC and memory bandwidth resources? #102

Open
yanxiaoqi932 opened this issue Jun 12, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@yanxiaoqi932

What would you like to be added?

Besides CPU cores, LLC (last-level cache) and memory bandwidth (MB) are sometimes the bottleneck resources. Could katalyst add scheduling support for LLC and MB?

Why is this needed?

LLC and memory bandwidth are sometimes the bottleneck resources.

@caohe caohe added the enhancement New feature or request label Jun 20, 2023
@caohe
Member

caohe commented Jun 20, 2023

Yes, isolation of LLC and memory bandwidth is also on the roadmap. However, they may not be treated as resources allocated by the scheduler; instead, resource contention would be avoided through interference detection and suppression.
If you are interested in this feature, you are welcome to join the discussion and the implementation.

@Rouzip

Rouzip commented Oct 27, 2023

Hello, how do you determine that LLC and memory bandwidth are the resource bottlenecks?

@yanxiaoqi932
Author

yanxiaoqi932 commented Oct 27, 2023

I found that when online and offline memory-intensive pods are co-located, LLC and memory bandwidth are likely to become bottlenecks for the online pods. If we assign more LLC and memory bandwidth to the online pods, they perform better. We can use Intel's CAT and MBA technologies to throttle, or sometimes evict, offline memory-intensive pods.
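
To make the CAT/MBA idea concrete, here is a minimal sketch of how a limit for offline pods could be expressed as text in the Linux resctrl `schemata` format (`L3:<cache id>=<hex bitmask>` and `MB:<cache id>=<percent>`). The helper names, the cache ids, and the numbers (2 ways, 20%) are assumptions for illustration only. The sketch just builds the string; it does not write to `/sys/fs/resctrl`, and the valid mask width and bandwidth granularity depend on the hardware.

```python
from typing import Dict


def contiguous_mask(num_ways: int, offset: int = 0) -> int:
    """Return a bitmask with `num_ways` contiguous bits set, starting at bit `offset`.

    CAT capacity bitmasks are contiguous runs of set bits, one bit per cache way.
    """
    if num_ways < 1:
        raise ValueError("a cache bitmask needs at least one way")
    return ((1 << num_ways) - 1) << offset


def build_schemata(l3_ways: Dict[int, int], mb_percent: Dict[int, int]) -> str:
    """Build schemata text for one control group (hypothetical helper).

    l3_ways:    cache id -> number of low-order L3 ways granted to the group
    mb_percent: cache id -> memory bandwidth limit in percent
    """
    l3 = ";".join(
        f"{cache_id}={contiguous_mask(ways):x}"
        for cache_id, ways in sorted(l3_ways.items())
    )
    mb = ";".join(
        f"{cache_id}={percent}" for cache_id, percent in sorted(mb_percent.items())
    )
    return f"L3:{l3}\nMB:{mb}\n"


# Example: confine offline pods to 2 L3 ways and 20% bandwidth on two sockets.
offline_schemata = build_schemata({0: 2, 1: 2}, {0: 20, 1: 20})
print(offline_schemata, end="")
```

Online pods would get a separate control group with a wider mask and a higher bandwidth limit; how wide is exactly the policy question discussed in this thread.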

@Rouzip

Rouzip commented Nov 2, 2023

I am also considering similar problems, but several points need to be thought through and solved:

1. For offline workloads, profiling in advance is necessary to judge whether LLC and memory bandwidth have a large impact on them. If the offline workload is complex, it is difficult to determine from monitoring alone whether it is memory-intensive.
2. Since the LLC and memory bandwidth of a single node are limited, allocating too much of them to the online workload wastes resources.

If the goal is only eviction, could pod affinity solve the problem? I hope to have more in-depth communication with you!

@yanxiaoqi932
Author

The first issue is indeed a complex problem: we have to profile in advance to measure each pod's sensitivity to LLC and MB, and label the pod accordingly. For the second issue, I think it is feasible to use pod affinity. Pods deployed on the same socket share LLC and MB, so if we want to deploy an LLC-sensitive pod, we should avoid placing it on a socket that already hosts an LLC-sensitive pod. This avoids LLC contention without the need to design an eviction policy. It is worth mentioning that katalyst will provide this capability soon: #220
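
The placement rule described above can be sketched roughly as follows. This is only an illustration of the "at most one LLC-sensitive pod per socket" idea, not katalyst's implementation; the `Pod` and `Socket` types, the `llc_sensitive` label, and the least-loaded tie-break are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pod:
    name: str
    llc_sensitive: bool = False  # hypothetical label produced by offline profiling


@dataclass
class Socket:
    socket_id: int
    pods: List[Pod] = field(default_factory=list)

    def has_llc_sensitive_pod(self) -> bool:
        return any(p.llc_sensitive for p in self.pods)


def pick_socket(pod: Pod, sockets: List[Socket]) -> Optional[Socket]:
    """Choose a socket for `pod`, keeping at most one LLC-sensitive pod per socket."""
    candidates = sockets
    if pod.llc_sensitive:
        # Anti-affinity: skip sockets that already host an LLC-sensitive pod.
        candidates = [s for s in sockets if not s.has_llc_sensitive_pod()]
    if not candidates:
        return None  # no socket satisfies the rule; the caller decides what to do
    # Assumed tie-break: fewest pods first, then lowest socket id.
    return min(candidates, key=lambda s: (len(s.pods), s.socket_id))


def place(pod: Pod, sockets: List[Socket]) -> Optional[int]:
    """Place `pod` and return the chosen socket id, or None if it cannot be placed."""
    target = pick_socket(pod, sockets)
    if target is None:
        return None
    target.pods.append(pod)
    return target.socket_id


sockets = [Socket(0), Socket(1)]
for p in [Pod("online-a", True), Pod("online-b", True), Pod("online-c", True), Pod("batch-d")]:
    print(p.name, "->", place(p, sockets))
```

With two sockets, the third LLC-sensitive pod cannot be placed, which shows the trade-off: the rule removes the need for an eviction policy but caps the number of sensitive pods per node at the socket count.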

@Rouzip

Rouzip commented Nov 7, 2023

Good job!
In our previous experiments, we had a similar idea. However, implementing an RDT usage strategy (isolating workloads, limiting low-priority workloads, or dynamically adjusting based on monitoring data) and determining the profiles of other workloads proved challenging in practice. It is difficult to let other workloads use a reasonable share of LLC and MB while ensuring that performance does not drop significantly. The idea of allocating only one LLC-sensitive workload per socket is intriguing. As mentioned, we struggled to find a reasonable RDT allocation strategy that would be robust across the cluster. Looking forward to your ideas!


3 participants