
Simpler is Faster?

27/03/08 | by Sean O’Neil [mail] | Categories: News

I was just told by AkKort that his unrolled EnRUPT-128-128 in MSVC 2005 is 40% faster than AES-128, a simple loop is 5% faster and a size-optimized 32-bit Intel Assembly implementation occupies 66 bytes and is only 15% slower than AES-128.

I don’t know yet what processor it was measured on or what the actual numbers of clock cycles are, but it sounds exciting! I can’t wait to see the speeds of all kinds of implementations of EnRUPT vs. AES vs. RC4 on all kinds of processors and hardware platforms…

PS: The stream RUPT/aeRUPT/mcRUPT modes are supposed to be about 3 times faster than the block enRUPT/mdRUPT modes…


15 comments

Comment from: akkort [Member] Email
***--
Source code:
http://81.30.182.45/aesenrupt.cpp

I made some optimisations and got the following results with MSVC 2005 on a P4 2.8:

AES 61794.534070 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
AESu 82595.524923 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
enRUPT 62224.259620 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTu 96490.099209 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTa 63640.458985 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
Speed compare:
1.000 1.337 1.007 1.561 1.030
0.748 1.000 0.753 1.168 0.771
0.993 1.327 1.000 1.551 1.023
0.640 0.856 0.645 1.000 0.660
0.971 1.298 0.978 1.516 1.000

So the unrolled enRUPT is about 50% faster than the non-unrolled version and 17% faster than unrolled AES.
The assembly version does well too: it is even a bit faster than the non-unrolled C code, which is quite good for a size of about 60 bytes.
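
The "Speed compare" block appears to be a matrix of pairwise throughput ratios: the entry in row i, column j matches the kbps figure of implementation j divided by that of implementation i, in the order the implementations are listed. This reading is inferred from the posted numbers, not taken from aesenrupt.cpp, and the function name below is hypothetical. A minimal sketch in C:

```c
#include <stddef.h>

/* Fill ratio[i*n + j] with kbps[j] / kbps[i], i.e. how many times faster
 * implementation j is than implementation i.
 * Hypothetical helper: the layout is inferred from the posted output,
 * not copied from aesenrupt.cpp. */
static void speed_compare(const double *kbps, size_t n, double *ratio)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            ratio[i * n + j] = kbps[j] / kbps[i];
}
```

With the MSVC 2005 figures, row 3, column 4 is 96490.099209 / 62224.259620, which rounds to the 1.551 shown in the table (unrolled enRUPT against the simple loop).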

But then I installed the Intel C++ Compiler and got the following:
AES 122797.555352 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
AESu 122685.308958 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
enRUPT 64096.336199 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTu 95460.688478 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTa 61794.534070 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
Speed compare:
1.000 0.999 0.522 0.777 0.503
1.001 1.000 0.522 0.778 0.504
1.916 1.914 1.000 1.489 0.964
1.286 1.285 0.671 1.000 0.647
1.987 1.985 1.037 1.545 1.000

This means that Intel's optimisation gives AES a speed boost of about 100%, while enRUPT shows no difference. So the previous results reflect poor optimisation by VC, which cannot optimise the many indexed table accesses in AES. The Intel compiler also unrolled the non-unrolled AES by itself, but did not do the same with enRUPT.

Summary: the best-optimised AES-128 is about 30% faster than the best-optimised enRUPT-128.
27/03/08 @ 13:36
Comment from: Sean O’Neil [Member] Email · http://cryptolib.com/
*****
Thank you, AkKort! Great job. Care to share your source code with the world?
30/03/08 @ 17:34
Comment from: akkort [Member] Email
Take source code here: http://81.30.182.45/aesenrupt.cpp You can use it freely :)
30/03/08 @ 20:57
Comment from: Joes [Visitor]
I was able to speed up the unrolled implementation by 10%:

AES 116004.950735 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
AESu 120916.872072 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
enRUPT 72160.068817 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTu 98689.505882 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTfa 108678.322267 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
Speed compare:
1.000 1.042 0.622 0.851 0.937
0.959 1.000 0.597 0.816 0.899
1.608 1.676 1.000 1.368 1.506
1.175 1.225 0.731 1.000 1.101
1.067 1.113 0.664 0.908 1.000

(enRUPTfa)
I will clean up the code and probably port it to masm32 (at the moment it is based on inline assembly).
I am thinking about adding a few tricks. At the very least, I am sure I can make enRUPT with a 128-bit key comparable with AES in this test case by preloading parts of the key in registers.
31/03/08 @ 23:25
Comment from: Joes [Visitor]
Also, regarding the previously posted tests, why are the test methods __forceinline?
31/03/08 @ 23:42
Comment from: Sean O’Neil [Member] Email · http://cryptolib.com/
Joes, you could probably speed it up even more by patching the key words directly into the code...
01/04/08 @ 14:59
Comment from: Sean O’Neil [Member] Email · http://cryptolib.com/
The following enRUPT-128-128 function takes 260 clocks per block on Core 2 Duo (16.25 clock cycles per byte), the same as the fastest known AES implementation optimized for speed by the Intel compiler:
#pragma warning (disable:4731)

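/* One EnRUPT round: b ^= rotr(2*a ^ c ^ key[r%4] ^ r, 8)*9 ^ key[r%4].
   edi points to the key; esi and ebp are scratch registers. */
#define enRUPTa(a,b,c,r) \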
	__asm { mov	esi,c }\
	__asm { xor	esi,r }\
	__asm { lea	ebp,[a*2] }\
	__asm { xor	esi,[edi+(r%4)*4] }\
	__asm { xor	b,[edi+(r%4)*4] }\
	__asm { xor	esi,ebp }\
	__asm { ror	esi,8 }\
	__asm { lea	esi,[esi*8+esi] }\
	__asm { xor	b,esi }

void enRUPT (u32 x[4], u32 key[4])
{
	__asm { mov	esi,[x] }
	__asm { mov	edi,[key] }
	__asm { push	ebp }
	__asm { mov	eax,[esi   ] }
	__asm { mov	ebx,[esi+ 4] }
	__asm { mov	ecx,[esi+ 8] }
	__asm { mov	edx,[esi+12] }
	enRUPTa(eax,ebx,ecx, 1); enRUPTa(ebx,ecx,edx, 2); enRUPTa(ecx,edx,eax, 3); enRUPTa(edx,eax,ebx, 4);
	enRUPTa(eax,ebx,ecx, 5); enRUPTa(ebx,ecx,edx, 6); enRUPTa(ecx,edx,eax, 7); enRUPTa(edx,eax,ebx, 8);
	enRUPTa(eax,ebx,ecx, 9); enRUPTa(ebx,ecx,edx,10); enRUPTa(ecx,edx,eax,11); enRUPTa(edx,eax,ebx,12);
	enRUPTa(eax,ebx,ecx,13); enRUPTa(ebx,ecx,edx,14); enRUPTa(ecx,edx,eax,15); enRUPTa(edx,eax,ebx,16);
	enRUPTa(eax,ebx,ecx,17); enRUPTa(ebx,ecx,edx,18); enRUPTa(ecx,edx,eax,19); enRUPTa(edx,eax,ebx,20);
	enRUPTa(eax,ebx,ecx,21); enRUPTa(ebx,ecx,edx,22); enRUPTa(ecx,edx,eax,23); enRUPTa(edx,eax,ebx,24);
	enRUPTa(eax,ebx,ecx,25); enRUPTa(ebx,ecx,edx,26); enRUPTa(ecx,edx,eax,27); enRUPTa(edx,eax,ebx,28);
	enRUPTa(eax,ebx,ecx,29); enRUPTa(ebx,ecx,edx,30); enRUPTa(ecx,edx,eax,31); enRUPTa(edx,eax,ebx,32);
	enRUPTa(eax,ebx,ecx,33); enRUPTa(ebx,ecx,edx,34); enRUPTa(ecx,edx,eax,35); enRUPTa(edx,eax,ebx,36);
	enRUPTa(eax,ebx,ecx,37); enRUPTa(ebx,ecx,edx,38); enRUPTa(ecx,edx,eax,39); enRUPTa(edx,eax,ebx,40);
	enRUPTa(eax,ebx,ecx,41); enRUPTa(ebx,ecx,edx,42); enRUPTa(ecx,edx,eax,43); enRUPTa(edx,eax,ebx,44);
	enRUPTa(eax,ebx,ecx,45); enRUPTa(ebx,ecx,edx,46); enRUPTa(ecx,edx,eax,47); enRUPTa(edx,eax,ebx,48);
	__asm { pop	ebp }
	__asm { mov	esi,[x] }
	__asm { mov	[esi   ],eax }
	__asm { mov	[esi+ 4],ebx }
	__asm { mov	[esi+ 8],ecx }
	__asm { mov	[esi+12],edx }
}
It is 50% faster than the fastest C implementation of enRUPT-128-128 I have.
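
For readers without a 32-bit MSVC toolchain, the round computed by the enRUPTa macro above can be sketched in portable C. This is a transcription of the assembly as posted (48 rounds, four 32-bit words of block and key). It has not been checked against a reference implementation or test vectors, and the names rotr32, enrupt_round and enrupt_128_128_c are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32 - n));
}

/* One round, as read off the enRUPTa macro:
 * a = x[(r-1)%4], b = x[r%4], c = x[(r+1)%4]. */
static void enrupt_round(uint32_t x[4], const uint32_t key[4], uint32_t r)
{
    uint32_t k = key[r % 4];
    uint32_t t = (2 * x[(r - 1) % 4]) ^ x[(r + 1) % 4] ^ k ^ r;
    x[r % 4] ^= (rotr32(t, 8) * 9) ^ k;   /* lea esi,[esi*8+esi] is a multiply by 9 */
}

/* 48 rounds numbered 1..48, matching the unrolled macro calls. */
static void enrupt_128_128_c(uint32_t x[4], const uint32_t key[4])
{
    for (uint32_t r = 1; r <= 48; r++)
        enrupt_round(x, key, r);
}
```

Each round XORs one word with a function of the other words and the key, so applying the same rounds in reverse order (48 down to 1) undoes the transformation.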
01/04/08 @ 17:47
Comment from: akkort [Member] Email
First, you have a bug in the code. It must be:
__asm { xor esi,[edi+((r+1)%4)*4] }\
__asm { xor b,[edi+((r+1)%4)*4] }\

/* Fixed, thank you. – Sean O’Neil */

Second, on my P4 your code is about 17% slower than AES (AES is about 20% faster):

AES 126382.041431 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
AESu 126382.041431 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
enRUPT 65568.015633 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTu 98689.505882 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTa 104694.015601 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
Speed compare:
1.000 1.000 0.519 0.781 0.828
1.000 1.000 0.519 0.781 0.828
1.927 1.927 1.000 1.505 1.597
1.281 1.281 0.664 1.000 1.061
1.207 1.207 0.626 0.943 1.000
02/04/08 @ 07:11
Comment from: Sean O’Neil [Member] Email · http://cryptolib.com/
AkKort, here is a speed comparison of your code and mine (slightly rewritten by somebody else), run on their Core 2 Duo:

AES 114520.245734 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
AESu 121025.904418 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
enRUPT2 143241.972252 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTu 98762.125092 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTfa 111569.183707 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
Speed compare:
1.000 1.057 1.251 0.862 0.974
0.946 1.000 1.184 0.816 0.922
0.799 0.845 1.000 0.689 0.779
1.160 1.225 1.450 1.000 1.130
1.026 1.085 1.284 0.885 1.000

Looks like the speed varies quite significantly depending on the processor type and memory speed.
02/04/08 @ 08:26
Comment from: akkort [Member] Email
Yes, the results will depend on the processor type and memory speed. It is also possible that we use different compiler optimisation options. Can you send me a compiled exe of your code? akkort[at]hotmail[dot]ru

02/04/08 @ 08:49
Comment from: joes [Member] Email
akkort: That was me :-)
Source code: http://pastebin.com/f6e6d9ccb

enRUPT2: Optimized x-128 (x is key length, 128 is block length) assembly implementation.
enRUPTfa: C unrolled implementation with additional table with precomputed key[p % kw] ^ p
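
The precomputed-table idea behind enRUPTfa can be sketched as follows. This is not the pastebin code: it assumes the round structure read off the enRUPTa assembly macro earlier in this thread, with a 128-bit key and 48 rounds, and the names enrupt_ctx, enrupt_setup and enrupt_encrypt_table are hypothetical. It is written as a loop for brevity; an unrolled version would index the table with constants.

```c
#include <stdint.h>

#define ENRUPT_ROUNDS 48  /* round count of the unrolled assembly in this thread */

/* Hypothetical per-key context: kr[r] = key[r % 4] ^ r for r = 1..48. */
typedef struct {
    uint32_t key[4];
    uint32_t kr[ENRUPT_ROUNDS + 1];   /* kr[0] is unused */
} enrupt_ctx;

/* Done once per key, so the per-block loop saves one XOR per round. */
static void enrupt_setup(enrupt_ctx *ctx, const uint32_t key[4])
{
    for (int i = 0; i < 4; i++)
        ctx->key[i] = key[i];
    ctx->kr[0] = 0;
    for (uint32_t r = 1; r <= ENRUPT_ROUNDS; r++)
        ctx->kr[r] = key[r % 4] ^ r;
}

static void enrupt_encrypt_table(const enrupt_ctx *ctx, uint32_t x[4])
{
    for (uint32_t r = 1; r <= ENRUPT_ROUNDS; r++) {
        uint32_t t = (2 * x[(r - 1) % 4]) ^ x[(r + 1) % 4] ^ ctx->kr[r];
        t = (t >> 8) | (t << 24);                /* rotate right by 8 */
        x[r % 4] ^= (t * 9) ^ ctx->key[r % 4];   /* plain key word is still needed here */
    }
}
```

The table trades 49 words of memory per key for one fewer XOR in each round, which may or may not pay off depending on the compiler and cache behaviour.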

Compiler used: MSVC 2005, CPU: C2D @ 2.0 GHz

Results:
AES 117631.663453 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
AESu 122685.308958 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
enRUPT2 150806.435955 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTu 99864.380952 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTfa 111569.183707 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
Speed compare:
1.000 1.043 1.282 0.849 0.948
0.959 1.000 1.229 0.814 0.909
0.780 0.814 1.000 0.662 0.740
1.178 1.229 1.510 1.000 1.117
1.054 1.100 1.352 0.895 1.000
02/04/08 @ 09:37
Comment from: akkort [Member] Email
joes, install the Intel C compiler (a free download from intel.com), switch on all optimisations and compare with your results. You will get the same speed difference as I did. It is just poor MSVC optimisation.
02/04/08 @ 10:21
Comment from: joes [Member] Email
Here are my Intel 10.1 results:

AES 156248.810244 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
AESu 156248.810244 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
enRUPT2 148143.187638 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTu 99938.740134 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTfa 65568.015633 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
Speed compare:
1.000 1.000 0.948 0.640 0.420
1.000 1.000 0.948 0.640 0.420
1.055 1.055 1.000 0.675 0.443
1.563 1.563 1.482 1.000 0.656
2.383 2.383 2.259 1.524 1.000

Btw, if you remove __forceinline, results are different:
AES 153391.689143 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
AESu 153391.689143 kbps [4e5bfeae 880aebe9 73534999 86e318bb]
enRUPT2 145572.373102 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTu 98762.125092 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
enRUPTfa 110195.178982 kbps [fcb74c06 ae5210d6 2f09b883 c700fd99]
Speed compare:
1.000 1.000 0.949 0.644 0.718
1.000 1.000 0.949 0.644 0.718
1.054 1.054 1.000 0.678 0.757
1.553 1.553 1.474 1.000 1.116
1.392 1.392 1.321 0.896 1.000
02/04/08 @ 10:30
Comment from: gwyllion [Visitor]
> The following enRUPT-128-128 function takes 260 clocks per block on Core 2 Duo (16.25 clock cycles per byte), the same as the fastest known AES implementation optimized for speed by the Intel compiler.

A bitslice implementation of AES takes 9.2 cycles/byte on Core 2. See CHES 2007 paper of Matsui and Nakajima.

The estream AES implementation by Hongjun Wu takes 12.6 cycles/byte on Core 2.
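
The cycles-per-byte figures in this exchange follow from simple arithmetic: cycles per block divided by the block size in bytes (16 for a 128-bit block). A small sketch in C, with hypothetical function names, that also converts a cycles-per-byte figure into throughput under the assumption that the full clock rate is available to the cipher:

```c
/* Cycles per byte from cycles per block and block size in bytes. */
static double cycles_per_byte(double cycles_per_block, double block_bytes)
{
    return cycles_per_block / block_bytes;
}

/* Idealised throughput in bytes per second at a given clock rate.
 * Assumes every cycle is spent in the cipher, which real benchmarks will not reach. */
static double bytes_per_second(double clock_hz, double cpb)
{
    return clock_hz / cpb;
}
```

For example, cycles_per_byte(260, 16) gives the 16.25 quoted above, and at an assumed 2.0 GHz clock that corresponds to roughly 123 million bytes per second.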
02/04/08 @ 11:32
Comment from: Sean O’Neil [Member] Email · http://cryptolib.com/
gwyllion, bitslice is only useful for large volumes of data encrypted in CTR mode and for brute-force searches for the key or the password used as the key. In that mode, EnRUPT is much faster even simply parallelised 4 times with SSE. Now 12.6 CPB on C2D sounds very interesting! I haven't seen it before.
02/04/08 @ 12:43
