Iain Sandoe
2018-10-30 11:28:12 UTC
Hi,
For a processor that supports SSE, but not AVX.
the following code:
typedef int __attribute__((mode(QI))) qi;
typedef qi __attribute__((vector_size (32))) v32qi;
v32qi foo (int x)
{
v32qi y = {'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f',
'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
return y;
}
produces a warning " warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]”.
so - the question is what is the resultant ABI in the changed case (since _m256 is supported for such processors)
====
Looking at the psABI v1.0
* pp24 Returning of Values
The returning of values is done according to the following algorithm:
• Classify the return type with the classification algorithm.
…
• If the class is SSE, the next available vector register of the sequence %xmm0, %xmm1 is used.
• If the class is SSEUP, the eight byte is returned in the next available eightbyte chunk of the last used vector register.
...
* classification algorithm : pp20
• Arguments of type __m256 are split into four eightbyte chunks. The least significant one belongs to class SSE and all the others to class SSEUP.
• Arguments of type __m512 are split into eight eightbyte chunks. The least significant one belongs to class SSE and all the others to class SSEUP.
* footnote on pp21
12 The post merger clean up described later ensures that, for the processors that do not support the __m256 type, if the size of an object is larger than two eightbytes and the first eightbyte is not SSE or any other eightbyte is not SSEUP, it still has class MEMORY.
This in turn ensures that for processors that do support the __m256 type, if the size of an object is four eightbytes and the first eightbyte is SSE and all other eightbytes are SSEUP, it can be passed in a register. This also applies to the __m512 type. That is for processors that support the __m512 type, if the size of an object is eight eightbytes and the first eightbyte is SSE and all other eightbytes are SSEUP, it can be passed in a register, otherwise, it will be passed in memory.
---
However : the case where the processor does *not* support __m256 but the first eightbyte *is* SSE and the following eighbytes *are* SSEUP is not clarified.
The intent for SSE seems clear - use a reg
The intent for following SSEUP is less clear.
Nevertheless, it seems to imply that the intent for processors with SSE that the __m256 (and __m512) returns should be passed in xmm0:1(:3, maybe).
figure 3.4 pp23 does not clarify xmm* use for vector return at all - only mentioning floating point.
===== status
In any event, GCC passes the vec32 return in memory,
LLVM conversely passes it in xmm0:1 (at least for the versions I’ve tried).
which leads to an ABI discrepancy when GCC is used to build code on systems based on LLVM.
Please could the X86 maintainers clarify the intent (and maybe consider enhancing the footnote classification notes to make things clearer)?
- and then we can figure out how to deal with the systems that are already implemented - and how to move forward.
(as an aside, in any event, it seems inefficient to pass through memory when at least xmm0:1 are already set aside for return value use).
thanks
Iain
For a processor that supports SSE, but not AVX.
the following code:
typedef int __attribute__((mode(QI))) qi;
typedef qi __attribute__((vector_size (32))) v32qi;
v32qi foo (int x)
{
v32qi y = {'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f',
'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
return y;
}
produces a warning " warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]”.
so - the question is what is the resultant ABI in the changed case (since _m256 is supported for such processors)
====
Looking at the psABI v1.0
* pp24 Returning of Values
The returning of values is done according to the following algorithm:
• Classify the return type with the classification algorithm.
…
• If the class is SSE, the next available vector register of the sequence %xmm0, %xmm1 is used.
• If the class is SSEUP, the eight byte is returned in the next available eightbyte chunk of the last used vector register.
...
* classification algorithm : pp20
• Arguments of type __m256 are split into four eightbyte chunks. The least significant one belongs to class SSE and all the others to class SSEUP.
• Arguments of type __m512 are split into eight eightbyte chunks. The least significant one belongs to class SSE and all the others to class SSEUP.
* footnote on pp21
12 The post merger clean up described later ensures that, for the processors that do not support the __m256 type, if the size of an object is larger than two eightbytes and the first eightbyte is not SSE or any other eightbyte is not SSEUP, it still has class MEMORY.
This in turn ensures that for processors that do support the __m256 type, if the size of an object is four eightbytes and the first eightbyte is SSE and all other eightbytes are SSEUP, it can be passed in a register. This also applies to the __m512 type. That is for processors that support the __m512 type, if the size of an object is eight eightbytes and the first eightbyte is SSE and all other eightbytes are SSEUP, it can be passed in a register, otherwise, it will be passed in memory.
---
However : the case where the processor does *not* support __m256 but the first eightbyte *is* SSE and the following eighbytes *are* SSEUP is not clarified.
The intent for SSE seems clear - use a reg
The intent for following SSEUP is less clear.
Nevertheless, it seems to imply that the intent for processors with SSE that the __m256 (and __m512) returns should be passed in xmm0:1(:3, maybe).
figure 3.4 pp23 does not clarify xmm* use for vector return at all - only mentioning floating point.
===== status
In any event, GCC passes the vec32 return in memory,
LLVM conversely passes it in xmm0:1 (at least for the versions I’ve tried).
which leads to an ABI discrepancy when GCC is used to build code on systems based on LLVM.
Please could the X86 maintainers clarify the intent (and maybe consider enhancing the footnote classification notes to make things clearer)?
- and then we can figure out how to deal with the systems that are already implemented - and how to move forward.
(as an aside, in any event, it seems inefficient to pass through memory when at least xmm0:1 are already set aside for return value use).
thanks
Iain