Question about use_tensor_cores = True or False #520

Open
sleepwalker2017 opened this issue Oct 10, 2024 · 1 comment

Hi, I'm benchmarking flashinfer on H100, running attention for the decoding stage.

I use q_head = kv_head = 40, which is the standard multi-head attention configuration for Llama 13B.

I tried both use_tensor_cores = True and False and get nearly the same performance.
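Roughly, the setup looks like the sketch below (a minimal sketch: the sizes are placeholders, and the wrapper/argument names follow flashinfer's Python docs, so they may need small adjustments for a specific version; older releases call plan()/run() begin_forward()/forward()). The kernel timing summaries for the two runs follow it.

```python
import torch
import flashinfer

# Placeholder sizes; only use_tensor_cores is the knob under test.
num_qo_heads = num_kv_heads = 40      # Llama-13B style MHA (q_head == kv_head)
head_dim = 128
page_size = 16
batch_size = 64
seq_len = 1024                        # tokens already in the paged KV cache
pages_per_seq = seq_len // page_size

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")

def bench_decode(use_tensor_cores: bool) -> float:
    wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
        workspace, "NHD", use_tensor_cores=use_tensor_cores
    )
    # Page table: each request owns pages_per_seq full pages.
    kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
    kv_indices = torch.arange(batch_size * pages_per_seq, dtype=torch.int32, device="cuda")
    kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")
    wrapper.plan(
        kv_indptr, kv_indices, kv_last_page_len,
        num_qo_heads, num_kv_heads, head_dim, page_size,
        pos_encoding_mode="NONE", data_type=torch.float16,
    )

    q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
    # NHD paged layout: (num_pages, 2, page_size, num_kv_heads, head_dim)
    kv_cache = torch.randn(
        batch_size * pages_per_seq, 2, page_size, num_kv_heads, head_dim,
        dtype=torch.float16, device="cuda",
    )

    wrapper.run(q, kv_cache)              # warm-up
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        wrapper.run(q, kv_cache)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 10   # ms per decode call

for flag in (False, True):
    print(f"use_tensor_cores={flag}: {bench_decode(flag):.3f} ms")
```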

use_tensor_cores = False (decode kernel):
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49025722         40  1225643.1  1222881.5   1219938   1231234       4342.4  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           374625          1   374625.0   374625.0    374625    374625          0.0  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosE…

use_tensor_cores = True (prefill kernel):
 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     99.2         49189726         40  1229743.2  1229681.5   1228833   1231042        495.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.8           379936          1   379936.0   379936.0    379936    379936          0.0  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::Mas…
      0.0             1600          1     1600.0     1600.0      1600      1600          0.0  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…

My questions are:

  1. Is this result reliable? Does use_tensor_cores=True invoke the prefill kernel?

  2. I measured tensor core usage for both kernels with ncu, but they both appear to use tensor cores. Why is that? The ncu output is below:

  void flashinfer::BatchDecodeWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::PosEncodingMode)0, (unsigned int)2, (unsigned int)4, (unsigned int)8, (unsigned int)16, (unsigned int)1, (unsigned int)8, (flashinfer::PageStorage)0, __half, __half, __half, int>(T10 *, T13 *, flashinfer::paged_kv_t<T9, T11, T13>, flashinfer::kv_partition_info_t<T13>, T12 *, float *, bool *, bool, int, float, float, float, float) (74, 40, 1)x(16, 1, 8), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         0.11
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst       166688
    ---------------------------------------------------------------------- ----------- ------------
  void flashinfer::BatchPrefillWithPagedKVCacheKernel<(flashinfer::LogitsPostHook)0, (flashinfer::MaskMode)0, (flashinfer::PosEncodingMode)0, (unsigned int)1, (unsigned int)8, (unsigned int)2, (unsigned int)1, (unsigned int)4, (flashinfer::PageStorage)0, __half, __half, float, __half, int>(T14 *, T14 *, T14 *, T10 *, flashinfer::paged_kv_t<T9, T11, T14>, T14 *, unsigned char *, T14 *, T14 *, T14 *, T13 *, float *, bool *, T14 *, bool, flashinfer::uint_fastdiv, int, float, float, float, float) (74, 1, 40)x(32, 1, 4), Context 1, Stream 7, Device 0, CC 9.0
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- ----------- ------------
    Metric Name                                                            Metric Unit Metric Value
    ---------------------------------------------------------------------- ----------- ------------
    sm__inst_executed_pipe_tensor_op.sum                                                    (!) n/a
    sm__inst_executed_pipe_tensor_op_dmma.sum                                     inst            0
    sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active           %         3.60
    sm__inst_executed_pipe_tensor_op_hmma.sum                                     inst      4899840
    ---------------------------------------------------------------------- ----------- ------------
yzh119 (Collaborator) commented Oct 10, 2024

Hi @sleepwalker2017, thanks for doing the benchmark!

  1. Yes, use_tensor_cores=True invokes the prefill kernel, and it's reasonable that you get nearly the same performance, because decode operations are IO-bound (see the rough arithmetic-intensity sketch below).
  2. That's interesting! I don't manually use tensor cores in flashinfer's decode kernels, but I wasn't sure whether nvcc might do some clever optimization that turns certain reductions into tensor-core instructions. So I checked the SASS of the decode kernels (you can dump SASS with cuobjdump -sass *.o), and I can confirm there are NO HMMA instructions in the decode kernels' generated SASS (whereas the prefill kernels' SASS contains many of them). Perhaps ncu counts some other operations in this pipe. A quick way to script this check is sketched at the end of this comment.
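
To make the IO-bound point in (1) concrete, here is a rough back-of-the-envelope sketch. The head_dim, sequence length, and H100 peak numbers are assumptions for illustration, not measurements:

```python
# Rough arithmetic-intensity estimate for single-token (decode) attention.
num_heads = num_kv_heads = 40   # Llama-13B style MHA
head_dim = 128                  # assumed
seq_len = 1024                  # tokens already in the KV cache (assumed)
bytes_per_elem = 2              # fp16

# FLOPs per decode step: q@K^T and p@V each cost ~2*head_dim FLOPs per
# (head, key) pair, so roughly 4 * heads * head_dim * seq_len in total.
flops = 4 * num_heads * head_dim * seq_len

# Bytes per decode step: the whole K and V cache must be read once;
# q and the output are negligible by comparison.
bytes_moved = 2 * num_kv_heads * head_dim * seq_len * bytes_per_elem

intensity = flops / bytes_moved   # == num_heads / num_kv_heads == 1 FLOP/byte here

# Approximate H100 SXM peaks (round numbers): ~990 TFLOPS dense fp16 tensor core,
# ~3.35 TB/s HBM3 bandwidth -> roofline ridge point near 300 FLOP/byte.
ridge = 990e12 / 3.35e12
print(f"intensity ~ {intensity:.1f} FLOP/byte, ridge ~ {ridge:.0f} FLOP/byte")
# At ~1 FLOP/byte the kernel sits deep in the bandwidth-bound region, so the
# extra math throughput of tensor cores cannot change the decode runtime much.
```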

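For question (2), if you want to script the SASS check yourself, a small sketch along these lines should work (the object file paths are placeholders; point them at whatever flashinfer decode/prefill objects or cubins you built):

```python
# Count HMMA (half-precision tensor-core MMA) instructions in a kernel's SASS
# via cuobjdump, as suggested above. The .o paths below are hypothetical.
import subprocess

def count_hmma(obj_path: str) -> int:
    sass = subprocess.run(
        ["cuobjdump", "-sass", obj_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum("HMMA" in line for line in sass.splitlines())

print(count_hmma("batch_decode.o"))   # expected: 0 for the decode kernels
print(count_hmma("batch_prefill.o"))  # expected: many for the prefill kernels
```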