
Clean up in importervectorization.cpp #108818

Merged · 6 commits · Oct 18, 2024

Conversation

EgorBo (Member) commented on Oct 13, 2024

A small clean up in importervectorization.cpp:

  • I've removed some slightly obscure logic and the separation between SIMD and scalar paths. It's now a single function with a single loop.
  • The logic is no longer hard-limited to 1 or 2 loads per expansion. I plan to allow arm64 to use more loads to be on par with x64.
  • I've removed the 32-bit-specific limitations (hence the additional improvements on x86 in the diffs).
  • We now try to minimize the load size for the remainder, e.g. when our string data is 6 bytes we do the comparison as 4+2 instead of 4+4 (with overlap), so we're less likely to cross a cache-line boundary (clang does the same). This change is responsible for the SPMI diffs; see the sketch after this list.
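
Below is a minimal standalone sketch of that chunk-splitting strategy (illustration only, not the JIT code; the helper names and the assumption of 1/2/4/8-byte GPR loads are mine):

#include <cstdio>

// Illustration only (not the JIT's code): how a byte length is split into
// GPR-sized loads, assuming loads of 1/2/4/8 bytes. The tail load is shrunk
// to the smallest size that covers the remainder, and it overlaps the
// previous load only when the remainder isn't itself a valid load size.
static unsigned roundUpLoadSize(unsigned bytes)
{
    const unsigned sizes[] = {1, 2, 4, 8};
    for (unsigned size : sizes)
    {
        if (bytes <= size)
        {
            return size;
        }
    }
    return 8;
}

static void printChunks(unsigned byteLen)
{
    // Start with the largest load that still fits into byteLen
    unsigned loadSize = 8;
    while (loadSize > byteLen)
    {
        loadSize /= 2;
    }

    printf("%2u bytes:", byteLen);
    unsigned remaining = byteLen;
    while (remaining > 0)
    {
        if (remaining < loadSize)
        {
            // Shrink the load for the tail chunk (possibly overlapping the
            // data we've already read)
            loadSize  = roundUpLoadSize(remaining);
            remaining = loadSize;
        }
        printf("  load %u @ offset %u", loadSize, byteLen - remaining);
        remaining -= loadSize;
    }
    printf("\n");
}

int main()
{
    printChunks(6);  //  6 bytes: load 4 @ 0, load 2 @ 4  (the 4+2 case from the description)
    printChunks(10); // 10 bytes: load 8 @ 0, load 2 @ 8
    printChunks(14); // 14 bytes: load 8 @ 0, load 8 @ 6  (overlap is unavoidable here)
    return 0;
}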

Diffs

The dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Oct 13, 2024
EgorBo marked this pull request as ready for review on October 14, 2024 13:59
EgorBo (Member, Author) commented on Oct 14, 2024

@jakobbotsch @AndyAyersMS @dotnet/jit-contrib PTAL

The changes are mainly in one function, so I suggest reviewing it as a whole, e.g. here:

GenTree* Compiler::impExpandHalfConstEquals(
    GenTreeLclVarCommon* data, WCHAR* cns, int charLen, int dataOffset, StringComparison cmpMode)
{
    static_assert_no_msg(sizeof(WCHAR) == 2);
    assert((charLen > 0) && (charLen <= MaxPossibleUnrollSize));

    // A gtNewOperNode which can handle SIMD operands (used for bitwise operations):
    auto bitwiseOp = [&](genTreeOps oper, var_types type, GenTree* op1, GenTree* op2) -> GenTree* {
#ifdef FEATURE_HW_INTRINSICS
        if (varTypeIsSIMD(type))
        {
            return gtNewSimdBinOpNode(oper, type, op1, op2, CORINFO_TYPE_NATIVEUINT, genTypeSize(type));
        }
        if (varTypeIsSIMD(op1))
        {
            // E.g. a comparison of SIMD ops returning TYP_INT
            assert(varTypeIsSIMD(op2));
            return gtNewSimdCmpOpAllNode(oper, type, op1, op2, CORINFO_TYPE_NATIVEUINT, genTypeSize(op1));
        }
#endif
        return gtNewOperNode(oper, type, op1, op2);
    };

    // Convert charLen to byteLen. It never overflows because charLen is a small value.
    unsigned byteLen = (unsigned)charLen * 2;

    // Find the largest possible type to read data
    var_types readType = roundDownMaxType(byteLen, true);

    GenTree* result           = nullptr;
    unsigned byteLenRemaining = byteLen;
    while (byteLenRemaining > 0)
    {
        // There is remaining data to process and it's smaller than the
        // previously processed chunk.
        if (byteLenRemaining < genTypeSize(readType))
        {
            if (varTypeIsIntegral(readType))
            {
                // Use a smaller GPR load for the remaining data; we're going to zero-extend it
                // since the previous GPR load was larger. Hence, for e.g. 6 bytes we're going to do
                // "(IND<INT> ^ cns1) | (UINT)(IND<USHORT> ^ cns2)"
                readType = roundUpGPRType(byteLenRemaining);
            }
            else
            {
                // TODO-CQ: We should probably do the same for SIMD, e.g. 34 bytes -> SIMD32 and SIMD16,
                // while currently we do SIMD32 and SIMD32. This involves a bit more complex upcasting logic.
            }

            // Overlap with the previously processed data
            byteLenRemaining = genTypeSize(readType);
            assert(byteLenRemaining <= byteLen);
        }

        ssize_t byteOffset = ((ssize_t)byteLen - (ssize_t)byteLenRemaining);

        // Total offset includes dataOffset (e.g. 12 for String)
        ssize_t totalOffset = byteOffset + (ssize_t)dataOffset;

        // Clone data and add the offset if necessary.
        GenTree* absOffset  = gtNewIconNode(totalOffset, TYP_I_IMPL);
        GenTree* currData   = gtNewOperNode(GT_ADD, TYP_BYREF, gtCloneExpr(data), absOffset);
        GenTree* loadedData = gtNewIndir(readType, currData, GTF_IND_UNALIGNED | GTF_IND_ALLOW_NON_ATOMIC);

        // For OrdinalIgnoreCase mode we need to convert both data and cns to lower case
        if (cmpMode == OrdinalIgnoreCase)
        {
            WCHAR mask[MaxPossibleUnrollSize] = {};
            int   maskSize                    = (int)genTypeSize(readType) / 2;
            if (!ConvertToLowerCase(cns + (byteOffset / 2), reinterpret_cast<WCHAR*>(&mask), maskSize))
            {
                // value contains non-ASCII chars, we can't proceed further
                return nullptr;
            }

            // 0x20 mask for the current chunk to convert it to lower case
            GenTree* toLowerMask = gtNewGenericCon(readType, (uint8_t*)mask);

            // loadedData is now "loadedData | toLowerMask"
            loadedData = bitwiseOp(GT_OR, genActualType(readType), loadedData, toLowerMask);
        }
        else
        {
            assert(cmpMode == Ordinal);
        }

        GenTree* srcCns = gtNewGenericCon(readType, (uint8_t*)cns + byteOffset);

        // A small optimization: prefer X == Y over X ^ Y == 0 since
        // just one comparison is needed, and we can do it with a single load.
        if ((genTypeSize(readType) == byteLen) && varTypeIsIntegral(readType))
        {
            // TODO-CQ: Figure out why it's a size regression for SIMD
            return bitwiseOp(GT_EQ, TYP_INT, loadedData, srcCns);
        }

        // loadedData ^ srcCns
        GenTree* xorNode = bitwiseOp(GT_XOR, genActualType(readType), loadedData, srcCns);

        // Merge with the previous result via OR
        if (result == nullptr)
        {
            // It's the first check
            result = xorNode;
        }
        else
        {
            if (!result->TypeIs(readType))
            {
                assert(varTypeIsIntegral(result) && varTypeIsIntegral(readType));
                xorNode = gtNewCastNode(result->TypeGet(), xorNode, true, result->TypeGet());
            }

            // Merge with the previous result via OR
            result = bitwiseOp(GT_OR, genActualType(result->TypeGet()), result, xorNode);
        }

        // Move to the next chunk.
        byteLenRemaining -= genTypeSize(readType);
    }

    // Compare the result against zero, e.g. (chunk1 ^ cns1) | (chunk2 ^ cns2) == 0
    return bitwiseOp(GT_EQ, TYP_INT, result, gtNewZeroConNode(result->TypeGet()));
}

This simplifies the code and brings some additional benefits (see the description).
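
For reference, here's a rough sketch of the check this expansion builds for a 3-char (6-byte) Ordinal comparison, i.e. the "(IND<INT> ^ cns1) | (UINT)(IND<USHORT> ^ cns2)" shape from the comment in the function; the helper name, the memcpy stand-ins for the unaligned GT_IND loads, and the 12-byte String data offset (from the "e.g. 12 for String" comment) are assumptions for illustration:

#include <cstdint>
#include <cstring>

// Illustration only: roughly what the unrolled comparison computes for a
// 3-char constant string, with cns1/cns2 being the first 4 and last 2 bytes
// of the constant's UTF-16 data.
static bool Equals3Chars(const uint8_t* str, uint32_t cns1, uint16_t cns2)
{
    uint32_t chunk1;
    uint16_t chunk2;
    std::memcpy(&chunk1, str + 12, sizeof(chunk1));     // chars [0..1], data starts at offset 12
    std::memcpy(&chunk2, str + 12 + 4, sizeof(chunk2)); // char  [2], the 4+2 split, no overlap
    // (chunk1 ^ cns1) | (chunk2 ^ cns2) == 0  <=>  both chunks match the constant
    return ((chunk1 ^ cns1) | (uint32_t)(chunk2 ^ cns2)) == 0;
}

For OrdinalIgnoreCase, the expansion additionally ORs each loaded chunk with the 0x20-based lower-case mask before the XOR (see the ConvertToLowerCase handling in the function).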

jakobbotsch (Member) left a comment


LGTM, nice clean up.

Btw I see that arm64 still uses bitwise ops instead of conditional compares. The conditional compares would likely be more dense.

EgorBo (Member, Author) commented on Oct 17, 2024

Btw I see that arm64 still uses bitwise ops instead of conditional compares. The conditional compares would likely be more dense.

I presume it's better if some other phase (morph/lower) converts it to whatever shape is better, so it can also handle user inputs with bitwise ops.

EgorBo merged commit 8734660 into dotnet:main on Oct 18, 2024
108 checks passed
EgorBo deleted the cleanup-str-unroll branch on October 18, 2024 09:28