Yes, use_tensor_cores=True invokes the prefill kernel, and it's reasonable that you get nearly the same performance, because decode operations are IO-bound.
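To see why decode is IO-bound at these shapes, here is a rough back-of-the-envelope roofline check. This is only a sketch: the kv_len of 4096, the fp16 KV cache, and the H100 SXM peak figures (~989 TFLOP/s fp16 tensor, ~3.35 TB/s HBM) are assumptions, not measurements from this benchmark.

```python
# Rough roofline check for why single-token decode attention is memory-bound.
# Assumed (not measured): kv_len = 4096, fp16 KV cache, and published H100 SXM
# peaks of ~989 TFLOP/s (fp16 tensor) and ~3.35 TB/s HBM bandwidth.
num_heads, head_dim, kv_len, bytes_per_elem = 40, 128, 4096, 2

flops = 2 * 2 * num_heads * kv_len * head_dim                      # q@K^T plus p@V
bytes_moved = 2 * kv_len * num_heads * head_dim * bytes_per_elem   # stream K and V once

intensity = flops / bytes_moved          # FLOPs per byte of HBM traffic
ridge = 989e12 / 3.35e12                 # H100 roofline ridge point (FLOP/B)
print(f"arithmetic intensity ~ {intensity:.1f} FLOP/B, ridge ~ {ridge:.0f} FLOP/B")
# intensity (~1 FLOP/B) is far below the ridge (~295 FLOP/B), so the kernel is
# bandwidth-limited and tensor cores cannot speed it up.
```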
That's interesting! I don't manually use tensor cores in flashinfer's decode kernels, but I wasn't sure whether nvcc could apply some clever optimization that turns some of the reductions into tensor-core instructions, so I checked the SASS of the decode kernels (you can inspect SASS with cuobjdump -sass *.o). I can confirm there are NO HMMA instructions in the decode kernels' generated SASS (whereas the prefill kernels' SASS contains many of them). Perhaps ncu counts some other operations in this pipe.
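In case it helps to reproduce the check, here is a minimal sketch of the same inspection. The object file names are placeholders; point them at whichever compiled flashinfer decode/prefill objects you want to inspect.

```python
import subprocess

def count_hmma(obj_path: str) -> int:
    """Dump the SASS of a compiled CUDA object with cuobjdump and count
    HMMA (half-precision tensor-core MMA) instructions."""
    sass = subprocess.run(
        ["cuobjdump", "-sass", obj_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return sum(line.count("HMMA") for line in sass.splitlines())

# Placeholder file names -- substitute the actual decode/prefill objects.
print("decode  HMMA count:", count_hmma("batch_decode.o"))   # expect 0
print("prefill HMMA count:", count_hmma("batch_prefill.o"))  # expect > 0
```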
Hi, I'm benchmarking flashinfer on an H100, running attention for the decode stage.
I use q_head = kv_head = 40, which is the standard attention configuration for Llama 13B.
I tried use_tensor_cores = True and False, and I get nearly the same performance.
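For reference, the benchmark is essentially the following simplified sketch using single-token decode. The head_dim of 128, the kv_len of 4096, and the single_decode_with_kv_cache call with its use_tensor_cores argument are assumptions based on flashinfer's Python API and may differ in your version.

```python
import torch
import flashinfer

# Shapes from the setup above: 40 query heads = 40 KV heads (Llama-13B-style);
# head_dim 128 and kv_len 4096 are assumed for illustration.
num_heads, head_dim, kv_len = 40, 128, 4096
q = torch.randn(num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")

def bench(use_tensor_cores: bool, iters: int = 100) -> float:
    # Warm up once, then time `iters` decode calls with CUDA events.
    flashinfer.single_decode_with_kv_cache(q, k, v, use_tensor_cores=use_tensor_cores)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        flashinfer.single_decode_with_kv_cache(q, k, v, use_tensor_cores=use_tensor_cores)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per call

for flag in (False, True):
    print(f"use_tensor_cores={flag}: {bench(flag):.3f} ms")
```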
My questions are:
1. Is this result reliable? Does use_tensor_cores=True invoke the prefill kernel?
2. I profiled tensor core usage for both kernels and found that they both use tensor cores. Why is that?