
Clean up in importervectorization.cpp #108818

Merged · 6 commits · Oct 18, 2024

Conversation

EgorBo (Member) commented on Oct 13, 2024

A small clean up in importervectorization.cpp:

  • I've removed some slightly obscure logic and the separation between SIMD and scalar paths. It's now a single function with a single loop.
  • The logic is no longer hard-limited to 1 or 2 loads per expansion. I plan to allow arm64 to use more loads to be on par with x64.
  • I've removed the 32-bit-specific limitations (hence the additional improvements on x86 in the diffs).
  • We now try to minimize the load size for the remainder, e.g. when our string data is 6 bytes we do the comparison as 4+2 instead of 4+4 (with overlap), so we're less likely to cross a cache-line boundary (clang does the same). This change is responsible for the SPMI diffs; see the sketch after this list.
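
Below is a minimal standalone sketch of that chunk-splitting strategy (illustration only, not the JIT code; the helper names and the assumption of 1/2/4/8-byte GPR loads are mine):

#include <cstdio>

// Illustration only (not the JIT's code): how a byte length is split into
// GPR-sized loads, assuming loads of 1/2/4/8 bytes. The tail load is shrunk
// to the smallest size that covers the remainder, and it overlaps the
// previous load only when the remainder isn't itself a valid load size.
static unsigned roundUpLoadSize(unsigned bytes)
{
    const unsigned sizes[] = {1, 2, 4, 8};
    for (unsigned size : sizes)
    {
        if (bytes <= size)
        {
            return size;
        }
    }
    return 8;
}

static void printChunks(unsigned byteLen)
{
    // Start with the largest load that still fits into byteLen
    unsigned loadSize = 8;
    while (loadSize > byteLen)
    {
        loadSize /= 2;
    }

    printf("%2u bytes:", byteLen);
    unsigned remaining = byteLen;
    while (remaining > 0)
    {
        if (remaining < loadSize)
        {
            // Shrink the load for the tail chunk (possibly overlapping the
            // data we've already read)
            loadSize  = roundUpLoadSize(remaining);
            remaining = loadSize;
        }
        printf("  load %u @ offset %u", loadSize, byteLen - remaining);
        remaining -= loadSize;
    }
    printf("\n");
}

int main()
{
    printChunks(6);  //  6 bytes: load 4 @ 0, load 2 @ 4  (the 4+2 case from the description)
    printChunks(10); // 10 bytes: load 8 @ 0, load 2 @ 8
    printChunks(14); // 14 bytes: load 8 @ 0, load 8 @ 6  (overlap is unavoidable here)
    return 0;
}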

Diffs

The dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Oct 13, 2024
EgorBo marked this pull request as ready for review on October 14, 2024 13:59
EgorBo (Member, Author) commented on Oct 14, 2024

@jakobbotsch @AndyAyersMS @dotnet/jit-contrib PTAL

The changes are mainly in one function, so I suggest reviewing it as a whole, e.g. here:

GenTree* Compiler::impExpandHalfConstEquals(
    GenTreeLclVarCommon* data, WCHAR* cns, int charLen, int dataOffset, StringComparison cmpMode)
{
    static_assert_no_msg(sizeof(WCHAR) == 2);
    assert((charLen > 0) && (charLen <= MaxPossibleUnrollSize));

    // A gtNewOperNode which can handle SIMD operands (used for bitwise operations):
    auto bitwiseOp = [&](genTreeOps oper, var_types type, GenTree* op1, GenTree* op2) -> GenTree* {
#ifdef FEATURE_HW_INTRINSICS
        if (varTypeIsSIMD(type))
        {
            return gtNewSimdBinOpNode(oper, type, op1, op2, CORINFO_TYPE_NATIVEUINT, genTypeSize(type));
        }
        if (varTypeIsSIMD(op1))
        {
            // E.g. a comparison of SIMD ops returning TYP_INT
            assert(varTypeIsSIMD(op2));
            return gtNewSimdCmpOpAllNode(oper, type, op1, op2, CORINFO_TYPE_NATIVEUINT, genTypeSize(op1));
        }
#endif
        return gtNewOperNode(oper, type, op1, op2);
    };

    // Convert charLen to byteLen. It never overflows because charLen is a small value.
    unsigned byteLen = (unsigned)charLen * 2;

    // Find the largest possible type to read data
    var_types readType = roundDownMaxType(byteLen, true);

    GenTree* result           = nullptr;
    unsigned byteLenRemaining = byteLen;
    while (byteLenRemaining > 0)
    {
        // There is remaining data to process and it's smaller than the
        // previously processed chunk.
        if (byteLenRemaining < genTypeSize(readType))
        {
            if (varTypeIsIntegral(readType))
            {
                // Use a smaller GPR load for the remaining data; we're going to zero-extend it
                // since the previous GPR load was larger. Hence, for e.g. 6 bytes we're going to do
                // "(IND<INT> ^ cns1) | (UINT)(IND<USHORT> ^ cns2)"
                readType = roundUpGPRType(byteLenRemaining);
            }
            else
            {
                // TODO-CQ: We should probably do the same for SIMD, e.g. 34 bytes -> SIMD32 and SIMD16,
                // while currently we do SIMD32 and SIMD32. This involves a bit more complex upcasting logic.
            }

            // Overlap with the previously processed data
            byteLenRemaining = genTypeSize(readType);
            assert(byteLenRemaining <= byteLen);
        }

        ssize_t byteOffset = ((ssize_t)byteLen - (ssize_t)byteLenRemaining);

        // Total offset includes dataOffset (e.g. 12 for String)
        ssize_t totalOffset = byteOffset + (ssize_t)dataOffset;

        // Clone data and add the offset if necessary.
        GenTree* absOffset  = gtNewIconNode(totalOffset, TYP_I_IMPL);
        GenTree* currData   = gtNewOperNode(GT_ADD, TYP_BYREF, gtCloneExpr(data), absOffset);
        GenTree* loadedData = gtNewIndir(readType, currData, GTF_IND_UNALIGNED | GTF_IND_ALLOW_NON_ATOMIC);

        // For OrdinalIgnoreCase mode we need to convert both data and cns to lower case
        if (cmpMode == OrdinalIgnoreCase)
        {
            WCHAR mask[MaxPossibleUnrollSize] = {};
            int   maskSize                    = (int)genTypeSize(readType) / 2;
            if (!ConvertToLowerCase(cns + (byteOffset / 2), reinterpret_cast<WCHAR*>(&mask), maskSize))
            {
                // value contains non-ASCII chars, we can't proceed further
                return nullptr;
            }

            // 0x20 mask for the current chunk to convert it to lower case
            GenTree* toLowerMask = gtNewGenericCon(readType, (uint8_t*)mask);

            // loadedData is now "loadedData | toLowerMask"
            loadedData = bitwiseOp(GT_OR, genActualType(readType), loadedData, toLowerMask);
        }
        else
        {
            assert(cmpMode == Ordinal);
        }

        GenTree* srcCns = gtNewGenericCon(readType, (uint8_t*)cns + byteOffset);

        // A small optimization: prefer X == Y over X ^ Y == 0 since
        // just one comparison is needed, and we can do it with a single load.
        if ((genTypeSize(readType) == byteLen) && varTypeIsIntegral(readType))
        {
            // TODO-CQ: Figure out why it's a size regression for SIMD
            return bitwiseOp(GT_EQ, TYP_INT, loadedData, srcCns);
        }

        // loadedData ^ srcCns
        GenTree* xorNode = bitwiseOp(GT_XOR, genActualType(readType), loadedData, srcCns);

        // Merge with the previous result via OR
        if (result == nullptr)
        {
            // It's the first check
            result = xorNode;
        }
        else
        {
            if (!result->TypeIs(readType))
            {
                assert(varTypeIsIntegral(result) && varTypeIsIntegral(readType));
                xorNode = gtNewCastNode(result->TypeGet(), xorNode, true, result->TypeGet());
            }

            // Merge with the previous result via OR
            result = bitwiseOp(GT_OR, genActualType(result->TypeGet()), result, xorNode);
        }

        // Move to the next chunk.
        byteLenRemaining -= genTypeSize(readType);
    }

    // Compare the result against zero, e.g. (chunk1 ^ cns1) | (chunk2 ^ cns2) == 0
    return bitwiseOp(GT_EQ, TYP_INT, result, gtNewZeroConNode(result->TypeGet()));
}

This simplifies the code and brings some additional benefits (see the description).
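
For reference, here's a rough sketch of the check this expansion builds for a 3-char (6-byte) Ordinal comparison, i.e. the "(IND<INT> ^ cns1) | (UINT)(IND<USHORT> ^ cns2)" shape from the comment in the function; the helper name, the memcpy stand-ins for the unaligned GT_IND loads, and the 12-byte String data offset (from the "e.g. 12 for String" comment) are assumptions for illustration:

#include <cstdint>
#include <cstring>

// Illustration only: roughly what the unrolled comparison computes for a
// 3-char constant string, with cns1/cns2 being the first 4 and last 2 bytes
// of the constant's UTF-16 data.
static bool Equals3Chars(const uint8_t* str, uint32_t cns1, uint16_t cns2)
{
    uint32_t chunk1;
    uint16_t chunk2;
    std::memcpy(&chunk1, str + 12, sizeof(chunk1));     // chars [0..1], data starts at offset 12
    std::memcpy(&chunk2, str + 12 + 4, sizeof(chunk2)); // char  [2], the 4+2 split, no overlap
    // (chunk1 ^ cns1) | (chunk2 ^ cns2) == 0  <=>  both chunks match the constant
    return ((chunk1 ^ cns1) | (uint32_t)(chunk2 ^ cns2)) == 0;
}

For OrdinalIgnoreCase, the expansion additionally ORs each loaded chunk with the 0x20-based lower-case mask before the XOR (see the ConvertToLowerCase handling in the function).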

jakobbotsch (Member) left a comment


LGTM, nice clean up.

Btw I see that arm64 still uses bitwise ops instead of conditional compares. The conditional compares would likely be more dense.

EgorBo (Member, Author) commented on Oct 17, 2024

Btw I see that arm64 still uses bitwise ops instead of conditional compares. The conditional compares would likely be more dense.

I presume it's better if some other phase (morph/lower) converts it to whatever shape is better, so it can also handle user inputs with bitwise ops.

EgorBo merged commit 8734660 into dotnet:main on Oct 18, 2024
108 checks passed
EgorBo deleted the cleanup-str-unroll branch on October 18, 2024 09:28