lists.openwall.net  lists / announce owlusers owldev johnusers johndev passwdqcusers yescrypt popa3dusers / osssecurity kernelhardening musl sabotage tlsify passwords / cryptdev xvendor / Bugtraq FullDisclosure linuxkernel linuxnetdev linuxext4 linuxhardening PHC  
Open Source and information security mailing list archives
 

Date: Tue, 25 Mar 2014 15:16:49 0400 From: Bill Cox <waywardgeek@...il.com> To: discussions@...swordhashing.net Subject: Re: [PHC] TwoCats multiplication chain On Tue, Mar 25, 2014 at 1:01 PM, Solar Designer <solar@...nwall.com> wrote: >> There is a multiplication poisoning issue, but with the 4 variables, I >> have not seen it occur after a billion iterations, so the percentage >> of blocks that would have poisoned chains is tiny. The poisoning >> would occur if a and c both become 0 or 1 at the same time. >> Otherwise, if only a or c is 0 or 1, changes in b and d should kick >> them back into higher values. > > Is your use of four variables intended specifically to reduce the > frequency of such poisoning, or were there other reasons as well? It's mainly for the poisoning issue. With just two variables, one of them would become 1 pretty often. It also gave me a chance to throw addition into the mix. >> Will this do? > > Probably. > > Here's another thought I had, but didn't mention yet: what if the > attacker makes their multiplier circuit such that it's not constant > time, but rather that it may complete a cycle earlier in case the final > carry chain is interrupted (has 0's) in places allowing for such > speedup? If sums of partial products for some of the output bits are > zero, then higher output bits don't depend on sums of partial products > for any bits lower than that. This means that e.g. bit 63, which > potentially depends on everything due to this final carry, may in fact > be computed very quickly most of the time. I'm not following this logic. About half of the final output bits (the upper 32  I discard the lower 32) will be zero, and the higher output bits still depend on all the inputs. In a carrysave multiplier, the carries propagate in all the partial products, and can influence the next column in any of them. There are a ton of ways to speed up a multiplier. I know about Booth encoding because I used it once, but I don't know what's used today in an Intel processor. Mathematically, there are tons of opportunities to reduce far below the 2X savings we get with Booth encoding, but having an efficient layout is important to reduce delays, so most of the opportunities are not used, IIRC. > Higher 32 bits are probably in fact "harder" to speed up in hardware > than the lower 32, as you wrote, but once someone went for the effort of > this variabletime multiplier circuit, maybe it'd actually run faster > than it could otherwise? I'm concerned there is still an opportunity to speed up a chain of multiplications and XOR by up to a 2Xish factor, but not knowing much about modern multiplier design, I'm not sure. For example, if we built a carrysave multiplier (which no one does), we could pipeline along the diagonal, rather than between multipliers. Normally, the critical path in a carrysave multiplier is 64 carrysave units long. If we pipeline from the upper left to lower right through the middle of the multiplier, then the distance between flops seems to be 32 rather than 64. However, other optimization techniques work against this speedup. For example, Booth encoding reduces it to roughly a 16x33 array of carry save adder cells, saving almost 2X in area, and the adders are often carrylookahead or similar. In this case, there's not much to be gained by pipelining within the multiplier. > I've been thinking that maybe the middle 32 bits (16 to 47) are actually > the slowest to compute, even in a variabletime optimized circuit, but I > could be wrong. This is more of a gut feeling than careful analysis. > And on 32bit archs this doubleregister shift may be costly. On 32bit > x86, right now you simply pick EDX instead of EAX, but if going for bits > 16 to 47 you'd have a SHRD in there, wasting a cycle and thereby > probably killing any advantage this would provide. On ARM, I think it'd > be even worse: LSR and ORwithshift, meaning two cycles wasted. (But > I'm not familiar with ARM. I merely looked these up in arch reference. > Maybe there's a quicker way I overlooked.) It's on 64bit and on SIMD > with 64bit vector elements that we incur a similar shift anyway, and > can adjust the shift count arbitrarily (be it 32 or 16). I'm pretty confident the upper bits are the slowest to compute. However, my multiplication function has asymmetrical timing on the inputs. One input is valid long before the other, and an attacker could design a multiplier optimized to be faster for one input than the other. However, I'm not sure how to do that once past the first partial product, because all the outputs toggle based on the late arriving input. > Another consideration in deciding between >> 32 vs. >> 16 vs. >> 20 (or > 21?) is that shifts by up to 20 (or 21?) only use multiplication result > bits computable in doubleprecision floatingpoint, making this more > JavaScriptfriendly and maybe friendlier to some GPUs (such as highend > NVIDIAs). This has both pros and cons. Since I do currently have an > arbitrary shift in a 64bit value anyway (unlike you), this change is > within consideration for me (whereas it'd cost 1+ cycle of overhead for > you, so probably not worth it given that it has both pros and cons). > Maybe this shift count may be a parameter (choice between 32 and 20?) > > I think you shouldn't make further changes to this now unless a major > problem is found. I'm just rambling. I'd be interested in your > comments, though, as this might affect my work on escrypt. I agree. It's time to finalize the algorithms. Bill
Powered by blists  more mailing lists