RLE64 (English)
Published by on Monday, November 4, 2013
Introduction
RLE64 is a simple custom implementation of the Run Length Encoding compression and decompression algorithm operating with 64 bit entities.
This allows RLE64 to be quite efficient in terms of speed, even if not so much in terms of compression ratio. At this point, its ratios are considerably worse than a byte entity RLE algorithm, which at the same time, are lower than other datacompression techniques.
So what is the point on RLE64? Quite simple, to be the fastest compression program ever created!
Description
In practical terms, RLE64 is only convinient when dealing with highly redundant low bit depth contents. Typically 1, 2, 4 or 8 bits per pixel uncompressed images such as BMP, TGA; or 1, 2, 4 or 8 bits uncompressed waves such as PCM. In this niche scenario, RLE64 performs with a decent compression ratio, and a blazing compression speed, that if afterwards, is combined with a regular and more powerfull compressor would improve its ratio by as much as 30%.
Inside other data scenarios, RLE64 will not compress the content well, or nothing at all, but usually will not increase its size neither. In a favorable situation, even if compression speed will usually lead the classification, decompression speed will be competitive, but not always the fastest one. This is because of the inherent inneficiencies in RLE, that cause its compression ratio to be not as high as other algorithms, meaning that the decompressor routine, is required to read more compressed data from memory than other tighter programs.
Comparing it with other tools, its current implementation is single threaded, so only one CPU or core will be used in the compression and decompression processes, keeping that clear, is another factor to make it not suitable for real world usage.
Althought it will not be hard to convert it to a 32 bit, 16 bit or 8 bit RLE tool, in its current 64 bit form, it is limited to be run in 64 bit CPU with 64 bit OS. Fortunatelly, original ANSI C source, is portable enought to easily transport it to other systems.
I discovered RLE myself when was a kid, before knowing it was already invented by someone else, and have implemented several variations of it during my life, starting from ZX Spectrum Basic to 16 bit DOS in C. Lastly, it was included as a part of Blink, so it was almost 15 years without implementing it, and being a 64 bit lover, and RLE enthusiast, that seemed a good idea. Not to mention that its simplicity makes it possible to be implemented, tested and tuned in a few hours. But as said, it has been quite consistent and what is more important, stable.
Of course, there is some room for improvement, I have not payed so much attention to the assembler listing generated by the compiler, but I am sure it can be at least, a bit faster. Compiling it with a more proficient compiler like Intel C++, would give a bit of extra spedup too.
Benchmarks
Thanks to the work at encode.ru, it has been quite easy to compare them to other lighting fast compressors. The comparison consist on compressing a huge B&W image of more than 100 Mpix (13200 x 10200) in BMP24 format, with a weight of almost 400 Mb, that is, a favorable scenario for RLE compression, and in particular for RLE64 algorithm.
As a final remark, and to be fair, the benchmark is used as is, this means that have no furhter optimizations, nor compiler switches, and that is run in x86 mode, while RLE64, was tested as expected in a x64 enviroment. Anyway, being able to compress such a big amount of information, in just less than 100 ms, involving a troughtput of more than 4 gigabytes per second, makes its performance impressive. A size reduction of almost 95% amazes me too, no matter if it is not reflecting common usage situations.
memcpy = 195 ms (2022 MB/s), 403920054->403920054
rle64 3.00 64 = 94 ms (4297 MB/s), 403920054->21910710, 171 ms (2362 MB/s)
rle64 3.00 32 = 178 ms (2269 MB/s), 403920054->15175478, 180 ms (2244 MB/s)
rle64 3.00 16 = 343 ms (1177 MB/s), 403920054->9265512, 195 ms (2071 MB/s)
rle64 3.00 8 = 541 ms (746 MB/s), 403920054->8832643, 170 ms (2376 MB/s)
rle32 3.00 64 = 248 ms (1628 MB/s), 403920054->21910710, 173 ms (2334 MB/s)
rle32 3.00 32 = 263 ms (1535 MB/s), 403920054->15175478, 178 ms (2269 MB/s)
rle32 3.00 16 = 468 ms (863 MB/s), 403920054->9265512, 200 ms (2019 MB/s)
rle32 3.00 16 = 553 ms (730 MB/s), 403920054->8832643, 195 ms (2071 MB/s)
fastlz 0.1 -1 = 407 ms (969 MB/s), 403920054->9223205, 431 ms (915 MB/s)
fastlz 0.1 -2 = 423 ms (932 MB/s), 403920054->6541701, 314 ms (1256 MB/s)
lz4 rev 9 = 484 ms (814 MB/s), 403920054->7004416, 239 ms (1650 MB/s)
lz4 rev 11 = 183 ms (2155 MB/s), 403920054->7004070, 244 ms (1616 MB/s)
snappy 1.0.3 = 203 ms (1943 MB/s), 403920054->22009731, 236 ms (1671 MB/s)
lzf 3.6 vf = 590 ms (668 MB/s), 403920054->9212559, 789 ms (499 MB/s)
lzf 3.6 uf = 585 ms (674 MB/s), 403920054->10136275, 800 ms (493 MB/s)
lzjb 2010 = 575 ms (686 MB/s), 403920054->16733472, 571 ms (690 MB/s)
lzmat 1.1 = 10328 ms (38 MB/s), 403920054->1584022, 548 ms (719 MB/s)
lzo 2.05 1b_1 = 544 ms (725 MB/s), 403920054->6331780, 276 ms (1429 MB/s)
lzo 2.05 1b_9 = 1456 ms (270 MB/s), 403920054->5886078, 633 ms (623 MB/s)
lzo 2.05 1b_99 = 1542 ms (255 MB/s), 403920054->5136012, 537 ms (734 MB/s)
lzo 2.05 1c_1 = 544 ms (725 MB/s), 403920054->6354790, 356 ms (1108 MB/s)
lzo 2.05 1c_9 = 1342 ms (293 MB/s), 403920054->5916541, 706 ms (558 MB/s)
lzo 2.05 1c_99 = 1541 ms (255 MB/s), 403920054->5191165, 624 ms (632 MB/s)
lzo 2.05 1f_1 = 544 ms (725 MB/s), 403920054->6390895, 353 ms (1117 MB/s)
lzo 2.05 1x_1 = 218 ms (1809 MB/s), 403920054->6608866, 336 ms (1173 MB/s)
lzo 2.05 1y_1 = 222 ms (1776 MB/s), 403920054->6218813, 361 ms (1092 MB/s)
lzrw1 = 645 ms (611 MB/s), 403920054->56308292, 445 ms (886 MB/s)
lzrw1-a = 612 ms (644 MB/s), 403920054->50504224, 394 ms (1001 MB/s)
lzrw2 = 567 ms (695 MB/s), 403920054->50483846, 507 ms (778 MB/s)
lzrw3 = 546 ms (722 MB/s), 403920054->50483854, 533 ms (740 MB/s)
lzrw3-a = 5514 ms (71 MB/s), 403920054->48497544, 421 ms (936 MB/s)
snappy 1.0.3 = 202 ms (1952 MB/s), 403920054->22009731, 236 ms (1671 MB/s)
tornado 0.666 16k/1 = 260 ms (1517 MB/s), 403920054->5175793, 447 ms (882 MB/s)
tornado 128k/2m = 292 ms (1350 MB/s), 403920054->5175424, 472 ms (835 MB/s)
tornado 128k/8m = 298 ms (1323 MB/s), 403920054->5176366, 492 ms (801 MB/s)
tornado 4m/8m = 486 ms (811 MB/s), 403920054->5176552, 493 ms (800 MB/s)
tornado b128k/8m = 316 ms (1248 MB/s), 403920054->4146275, 516 ms (764 MB/s)
tornado b4m/8m = 509 ms (774 MB/s), 403920054->4146502, 515 ms (765 MB/s)
tornado b4m/32m = 413 ms (955 MB/s), 403920054->4146517, 522 ms (755 MB/s)
quicklz 1.5.0 -3 = 6732 ms (58 MB/s), 403920054->11198759, 294 ms (1341 MB/s)
quicklz 1.5.0 -2 = 1668 ms (236 MB/s), 403920054->8423314, 166 ms (2376 MB/s)
quicklz 1.5.0 -1 = 563 ms (700 MB/s), 403920054->11884171, 200 ms (1972 MB/s)
quicklz 1.5.1 b5 -1 = 567 ms (695 MB/s), 403920054->11884171, 203 ms (1943 MB/s)
zlib 1.2.5 -1 = 3169 ms (124 MB/s), 403920054->4731215, 595 ms (662 MB/s)
zlib 1.2.5 -6 = 6057 ms (65 MB/s), 403920054->2352284, 1066 ms (370 MB/s)
Updates
After some investigation, I discovered the implementation speed to be limited to memory bandwidth, which in my case was about 5 GB/s. So, I move to a more generic approach supporting RLE8, RLE16, RLE32 and RLE64, the formers having slower speed and better compression ratio than the latters.
Also I trying Intel C++ 12.0, which gave a marginally speed increase or speed decrease depending on the situations, so final binaries, are compiled still with Visual C++ 2010 SP1. Additionally I have built regular 32 bit executables too, to make easy users not owning a 64 bit box test the programs.
A bit of extra speed boost, is gathered thanks to PGO (Profile Guided Optimizations).
Sponsors
- Some sample files to be used as test and benchmark (1.15 Mb in 7z format).
Links
- Official RLE64's website: nikkhokkho.sourceforge.net/static.php?page=RLE64.
- Official author's website: www.javiergutierrezchamorro.com.