From: Ard Biesheuvel
Date: Thu, 21 Nov 2019 17:13:41 +0000 (+0100)
Subject: chacha/asm/chacha-armv8.pl: preserve FP registers d8 and d9 correctly
X-Git-Tag: openssl-3.0.0-alpha1~901
X-Git-Url: https://git.librecmc.org/?a=commitdiff_plain;h=07a470729c4ace678fba6aeeeaf506436aa856e2;p=oweals%2Fopenssl.git

chacha/asm/chacha-armv8.pl: preserve FP registers d8 and d9 correctly

Depending on the size of the input, we may take different paths through
the accelerated arm64 ChaCha20 routines, each of which uses a different
subset of the FP registers, some of which need to be preserved and
restored, as required by the AArch64 calling convention (AAPCS64).

In some cases (e.g., when the input size is 640 bytes), we call the
512 byte NEON path followed directly by the scalar path. In this case,
we preserve and restore d8 and d9, only to clobber them again
immediately before handing over to the scalar path, which does not
touch the FP registers at all and hence does not restore them either.

Fix this by moving the restoration of d8 and d9 to a later stage in the
512 byte routine, either before calling the scalar path, or when
exiting the function.

Fixes #10470

CLA: trivial
Reviewed-by: Paul Dale
Reviewed-by: Matt Caswell
(Merged from https://github.com/openssl/openssl/pull/10497)
---

diff --git a/crypto/chacha/asm/chacha-armv8.pl b/crypto/chacha/asm/chacha-armv8.pl
index aed873d57e..7868389f71 100755
--- a/crypto/chacha/asm/chacha-armv8.pl
+++ b/crypto/chacha/asm/chacha-armv8.pl
@@ -1232,8 +1232,7 @@ $code.=<<___;
 	adds	$len,$len,#512
 	ushr	$ONE,$ONE,#1			// 4 -> 2
 
-	ldp	d8,d9,[sp,#128+0]		// meet ABI requirements
-	ldp	d10,d11,[sp,#128+16]
+	ldp	d10,d11,[sp,#128+16]		// meet ABI requirements
 	ldp	d12,d13,[sp,#128+32]
 	ldp	d14,d15,[sp,#128+48]
 
@@ -1250,6 +1249,7 @@ $code.=<<___;
 	ld1	{$CTR,$ROT24},[$key]
 	b.hs	.Loop_outer_neon
 
+	ldp	d8,d9,[sp,#0]			// meet ABI requirements
 	eor	@K[1],@K[1],@K[1]
 	eor	@K[2],@K[2],@K[2]
 	eor	@K[3],@K[3],@K[3]
@@ -1259,6 +1259,7 @@ $code.=<<___;
 	b	.Loop_outer
 
 .Ldone_512_neon:
+	ldp	d8,d9,[sp,#128+0]		// meet ABI requirements
 	ldp	x19,x20,[x29,#16]
 	add	sp,sp,#128+64
 	ldp	x21,x22,[x29,#32]
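
The ordering problem described in the commit message can be illustrated
with a toy model. The sketch below is not the real routine: the function
name, the string placeholders for register contents, and the assumption
that the tail setup is what overwrites d8 and d9 are all hypothetical and
chosen only for illustration. It covers just the two exits the commit
message names (handover to the scalar path, or function exit) and leaves
out the branch back into the NEON loop.

```python
CALLER_VALUES = {"d8": "caller-d8", "d9": "caller-d9"}


def d8_d9_preserved(length, fixed):
    """Toy model: does the caller get d8/d9 back after the 512-byte path?

    `fixed=False` mimics the old ordering (restore right after the NEON
    body); `fixed=True` mimics the ordering in this patch (restore just
    before the scalar path, or at function exit).
    """
    remaining = length - 512
    if not 0 <= remaining < 192:
        # Longer tails branch back to the NEON loop, which is not modelled.
        raise ValueError("model only covers 512 <= length < 704")

    regs = dict(CALLER_VALUES)
    saved = dict(regs)                     # prologue: spill d8/d9 to the stack
    regs.update(d8="neon-scratch", d9="neon-scratch")  # 512-byte NEON body

    if not fixed:
        regs.update(saved)                 # old: early restore

    if remaining == 0:
        if fixed:
            regs.update(saved)             # new: restore at function exit
        return regs == CALLER_VALUES

    # Assumption: setting up for the tail overwrites d8/d9 once more.
    regs.update(d8="tail-setup", d9="tail-setup")

    if fixed:
        regs.update(saved)                 # new: restore before scalar path

    # The scalar path neither touches nor restores FP registers.
    return regs == CALLER_VALUES


if __name__ == "__main__":
    for length in (512, 640):
        for fixed in (False, True):
            print(length, "fixed" if fixed else "old",
                  d8_d9_preserved(length, fixed))
```

In this model, only the 640-byte input with the old ordering loses the
caller's values, which matches the case singled out in the commit message.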