By allowing all limbs to grow up to 52 bits between operations, which
all our code already allowed, we can make the carry propagation more
parallelizable. This seems to help the compiler more than the handwritten asm.
name old time/op new time/op delta
Add-8 7.77ns ±19% 6.43ns ± 1% -17.16% (p=0.000 n=10+8)
Mul-8 26.3ns ± 0% 24.6ns ± 1% -6.32% (p=0.000 n=9+10)
Mul32-8 5.86ns ± 1% 5.87ns ± 1% ~ (p=0.171 n=10+10)
WideMultCall-8 2.54ns ± 0% 2.54ns ± 0% ~ (p=0.965 n=9+8)
BasepointMul-8 18.6µs ± 1% 18.7µs ± 1% ~ (p=0.095 n=9+10)
ScalarMul-8 65.6µs ± 3% 63.9µs ± 1% -2.63% (p=0.000 n=10+9)
VartimeDoubleBaseMul-8 61.1µs ± 1% 60.7µs ± 2% -0.73% (p=0.017 n=10+9)
MultiscalarMulSize8-8 224µs ± 1% 224µs ± 1% ~ (p=0.182 n=10+9)
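The relaxed-limb idea can be sketched roughly like this (the type name,
radix-2^51 layout, and exact bounds here are assumptions for illustration,
not the actual code): because every limb stays below 52 bits, each carry is
at most one bit, so all five carries can be computed independently up front
instead of rippling sequentially from limb to limb.

```go
package main

import "fmt"

// fieldElement is a sketch of a GF(2^255-19) element in radix 2^51:
// five uint64 limbs, each allowed to grow to 52 bits between operations.
type fieldElement [5]uint64

const maskLow51Bits uint64 = (1 << 51) - 1

// carryPropagate reduces each limb back below 52 bits. All five carries
// are computed first, independently of each other, which exposes more
// instruction-level parallelism than a sequential ripple carry.
func (v *fieldElement) carryPropagate() *fieldElement {
	c0 := v[0] >> 51
	c1 := v[1] >> 51
	c2 := v[2] >> 51
	c3 := v[3] >> 51
	c4 := v[4] >> 51

	// The top carry wraps around multiplied by 19, since
	// 2^255 ≡ 19 (mod p) for p = 2^255 - 19.
	v[0] = v[0]&maskLow51Bits + c4*19
	v[1] = v[1]&maskLow51Bits + c0
	v[2] = v[2]&maskLow51Bits + c1
	v[3] = v[3]&maskLow51Bits + c2
	v[4] = v[4]&maskLow51Bits + c3
	return v
}

func main() {
	v := fieldElement{1 << 52, 1 << 52, 0, 0, 1 << 52}
	v.carryPropagate()
	fmt.Println(v) // prints [38 2 2 0 0]
}
```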
A lot of the time of the higher-order operations is spent in there, and
the compiler output is not great.
It's not at all clear why Add doesn't benefit from the faster assembly,
but it might be small enough that it's better off inlined.
name old time/op new time/op delta
Add-8 7.77ns ±19% 7.77ns ±19% ~ (p=0.739 n=10+10)
Mul-8 26.0ns ± 1% 26.3ns ± 0% +0.94% (p=0.000 n=9+9)
Mul32-8 5.88ns ± 1% 5.86ns ± 1% ~ (p=0.085 n=10+10)
WideMultCall-8 2.54ns ± 1% 2.54ns ± 0% ~ (p=0.130 n=8+9)
BasepointMul-8 20.0µs ± 1% 18.6µs ± 1% -6.90% (p=0.000 n=8+9)
ScalarMul-8 71.7µs ± 0% 65.6µs ± 3% -8.55% (p=0.000 n=10+10)
VartimeDoubleBaseMul-8 68.6µs ± 1% 61.1µs ± 1% -10.86% (p=0.000 n=10+10)
MultiscalarMulSize8-8 248µs ± 2% 224µs ± 1% -9.99% (p=0.000 n=10+10)
The implementation is a bit of a hack; we could probably save some
operations by not doing the two projP2.FromP1xP1 conversions, but it's
unclear whether the performance matters to anyone.
For hdevalence/ed25519consensus#5
We'll need these for ristretto255, but we might want to expose them in a
separate package. Note how FieldElement was only exported for the
benefit of ExtendedCoords. For now, unexport FieldElement and delete
ExtendedCoords (since a proper FromExtendedCoords implementation would
check the curve equations anyway).
Since the sizes are fixed, we can use outlining to make Bytes almost as
efficient as FillBytes: a careful caller can avoid the allocation, and
copying 32 bytes is unlikely to show up in profiles.
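The outlining trick can be sketched like this (the element type and its
32-byte "encoding" below are placeholders, not the real serialization):
the exported Bytes is kept tiny so it inlines into the caller; the array
then lives in the caller's frame, and when escape analysis proves the
slice doesn't escape, the heap allocation disappears entirely.

```go
package main

import "fmt"

// element stands in for the real field element or point type.
type element struct{ v [32]byte }

// Bytes returns the encoding as a byte slice. This wrapper is small
// enough to inline at the call site, so the backing array belongs to
// the caller and can stay on its stack if the result doesn't escape.
func (e *element) Bytes() []byte {
	var out [32]byte
	return e.fillBytes(out[:])
}

// fillBytes does the actual work; because it is a separate (outlined)
// function, only the cheap wrapper gets duplicated at call sites.
func (e *element) fillBytes(buf []byte) []byte {
	copy(buf, e.v[:])
	return buf
}

func main() {
	e := element{}
	e.v[0] = 42
	b := e.Bytes()
	fmt.Println(len(b), b[0]) // prints: 32 42
}
```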
I could not decide whether they should be called SetIdentity/SetGenerator,
so instead I removed them. It turns out we only needed them in one place,
where Set(NewIdentityPoint()) inlines well enough that it should perform
the same.
Most of the Identity calls were redundant, as the value was overwritten
before its next use.
Also, replaced Bytes (which appended, unlike big.Int.Bytes) with
FillBytes. ristretto255 has Encode/Decode instead of
FillBytes/FromCanonicalBytes in order to match Element, which is not
relevant here.
This pure Go implementation of Mul32 is more than twice as fast as the
assembly Mul implementation, and four times faster than the pure Go Mul.
Mul32 7.91ns ± 1%
Mul 18.6ns ± 1%
Mul [purego] 33.4ns ± 0%
Before Go 1.13, when we can't use math/bits because its fallbacks might
not be constant time, Mul32 is a little slower, but not nearly as much
as the pure Go Mul.
Mul32 9.74ns ± 0%
Mul [purego] 75.4ns ± 1%
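The core step of such a small-scalar multiplication can be sketched with
math/bits (the radix-2^51 limb layout and the mul51 helper are assumptions
for illustration): bits.Mul64 compiles to a single widening multiply on
amd64, which is why a pure Go Mul32 can beat even the assembly full Mul
for this narrower operation.

```go
package main

import (
	"fmt"
	"math/bits"
)

const maskLow51Bits uint64 = (1 << 51) - 1

// mul51 multiplies one radix-2^51 limb by a 32-bit value, returning the
// low 51 bits of the product and the carry into the next limb.
// bits.Mul64 is a compiler intrinsic, so this is a single MUL plus a
// couple of shifts, with no function call overhead.
func mul51(a uint64, b uint32) (lo uint64, carry uint64) {
	mh, ml := bits.Mul64(a, uint64(b))
	lo = ml & maskLow51Bits
	carry = (mh << 13) | (ml >> 51)
	return
}

func main() {
	// (2^51 - 1) * 4 = 2^53 - 4, which splits into lo + carry*2^51.
	lo, carry := mul51(maskLow51Bits, 4)
	fmt.Println(lo, carry) // prints: 2251799813685244 3
}
```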
The names of the ScalarMults were picked to match elliptic.Curve.
The Scalar type is re-exposed as an opaque type, with an API that
matches the Element one.