Performance regression in `Vector<T>.Vector<T>(T)` on x86/x64 #108929

ap5d · 2024-10-16T14:45:16Z

Description

When using NET 9-RC2, Vector<T> constructor that broadcasts a scalar to all elements of a vector is not optimized to a broadcasting instruction on x86/x64. .NET 8 compiler makes this optimization.

Reproduction Steps

The regression can be reproduced by compiling the following function:

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector<int> ScalarToVector(int scalar) => new(scalar);

Expected behavior

I would expect the compiler to use only few instructions for broadcasting the scalar to all elements. This what .NET 8 compiler produces:

       vzeroupper 
       vpbroadcastd  ymm0, esi
       vmovups  ymmword ptr [rdi], ymm0
       mov      rax, rdi
       vzeroupper 
       ret

So a single vpbroadcastd does the job when AVX2 is enabled.

Actual behavior

Using .NET 9-rc2, the following machine code is generated:

       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+0x30]
       vxorps   ymm0, ymm0, ymm0
       vmovups  ymmword ptr [rbp-0x30], ymm0
       mov      dword ptr [rbp-0x30], esi
       mov      dword ptr [rbp-0x2C], esi
       mov      dword ptr [rbp-0x28], esi
       mov      dword ptr [rbp-0x24], esi
       mov      dword ptr [rbp-0x20], esi
       mov      dword ptr [rbp-0x1C], esi
       mov      dword ptr [rbp-0x18], esi
       mov      dword ptr [rbp-0x14], esi
       vmovups  ymm0, ymmword ptr [rbp-0x30]
       vmovups  ymmword ptr [rdi], ymm0
       mov      rax, rdi
       vzeroupper 
       add      rsp, 48
       pop      rbp
       ret

As you can see, the compiler fills elements individually to an array on stack, which is much slower.

Regression?

No response

Known Workarounds

Use .NET 8 or select Vector128/256/512.Create method based on Vector<T> length:

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector<int> ScalarToVector(int scalar)
{
    if (Vector<int>.Count == 16)
    {
        return Vector512.Create(scalar).AsVector();
    }
    else if (Vector<int>.Count == 8)
    {
        return Vector256.Create(scalar).AsVector();
    }
    else if (Vector<int>.Count == 4)
    {
        return Vector128.Create(scalar).AsVector();
    }
    else
    {
        return new(scalar);
    }
}

This workaround results in the following machine code with .NET 9-RC2:

       vpbroadcastd ymm0, esi
       vmovups  ymmword ptr [rdi], ymm0
       mov      rax, rdi
       vzeroupper 
       ret

Configuration

No response

Other information

No response

The text was updated successfully, but these errors were encountered:

xtqqczze · 2024-10-16T18:57:56Z

on x86.

@ap5d Your example uses x64 machine code. Could you update the issue title to reflect this?

tannergooding · 2024-10-16T19:09:47Z

As a workaround, you can use Vector.Create(value) which will likely produce better codegen as well.

This is an easy thing to fix, but potentially not worth backporting without additional user reports.

ap5d · 2024-10-16T19:59:43Z

Vector.Create(int value) produces almost the same machine code:

       push     rbp
       sub      rsp, 48
       lea      rbp, [rsp+0x30]
       mov      dword ptr [rbp-0x30], esi
       mov      dword ptr [rbp-0x2C], esi
       mov      dword ptr [rbp-0x28], esi
       mov      dword ptr [rbp-0x24], esi
       mov      dword ptr [rbp-0x20], esi
       mov      dword ptr [rbp-0x1C], esi
       mov      dword ptr [rbp-0x18], esi
       mov      dword ptr [rbp-0x14], esi
       vmovups  ymm0, ymmword ptr [rbp-0x30]
       vmovups  ymmword ptr [rdi], ymm0
       mov      rax, rdi
       vzeroupper 
       add      rsp, 48
       pop      rbp
       ret

Unnecessary zeroing of memory is elided here.

tannergooding · 2024-10-16T20:59:13Z

Put up #108945 to resolve the issue. The entry was missing a flag and in the incorrect place to be handled.

It was broken by accident as part of a larger refactoring that was simplifying the simdashwintrinsic handling. We plan on fully removing this special handling in .NET 10 in favor of a simpler approach that directly maps them instead.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 16, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Oct 16, 2024

ap5d changed the title ~~Performance regression in Vector<T>.Vector<T>(T) on x86~~ Performance regression in Vector<T>.Vector<T>(T) on x86/x64 Oct 16, 2024

tannergooding linked a pull request Oct 16, 2024 that will close this issue

Ensure Vector.Create is properly recognized as intrinsic #108945

Open

dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Oct 16, 2024

JulieLeeMSFT assigned tannergooding Oct 17, 2024

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Oct 17, 2024

JulieLeeMSFT added this to the 10.0.0 milestone Oct 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression in `Vector<T>.Vector<T>(T)` on x86/x64 #108929

Performance regression in `Vector<T>.Vector<T>(T)` on x86/x64 #108929

ap5d commented Oct 16, 2024 •

edited

Loading

xtqqczze commented Oct 16, 2024

tannergooding commented Oct 16, 2024

ap5d commented Oct 16, 2024 •

edited

Loading

tannergooding commented Oct 16, 2024

Performance regression in Vector<T>.Vector<T>(T) on x86/x64 #108929

Performance regression in Vector<T>.Vector<T>(T) on x86/x64 #108929

Comments

ap5d commented Oct 16, 2024 • edited Loading

Description

Reproduction Steps

Expected behavior

Actual behavior

Regression?

Known Workarounds

Configuration

Other information

xtqqczze commented Oct 16, 2024

tannergooding commented Oct 16, 2024

ap5d commented Oct 16, 2024 • edited Loading

tannergooding commented Oct 16, 2024

Performance regression in `Vector<T>.Vector<T>(T)` on x86/x64 #108929

Performance regression in `Vector<T>.Vector<T>(T)` on x86/x64 #108929

ap5d commented Oct 16, 2024 •

edited

Loading

ap5d commented Oct 16, 2024 •

edited

Loading