
Thread: Compression Competition -- $15,000 USD

  1. #1
    Member (California, USA)

    Compression Competition -- $15,000 USD

    Forgive me if this is off-topic for this forum, but I know some folks here like to engage in (and win) compression competitions.

    This one is, from what I can tell, not that well specified (for example, they don't appear to charge for the size of the compressor). But in case there is any interest...

    http://www.sequencesqueeze.org/

  2. #2
    Administrator Shelwien

  3. #3
    Member Alexander Rhatushnyak
    Is this the only file you need to compress? Or is it just a sample file?

    Judging

    Upon submission, entries will join a queue for automated judging.

    The total size of /contest-data/compressed will be used to calculate compression ratio. Timing and memory usage will be calculated using /usr/bin/time with the %e and %K options. Header, quality and sequence mismatches are simple counts of the number of lines in the original data which did not appear exactly the same in the decompressed data.

    Results will be emailed to you upon completion of automated judging. The top 5 entries will also appear on the frontpage of this website in the leaderboard section.

    The choice of overall winner of the US$15,000 prize will be a combination of performance in the leaderboard with the professional opinions of the judging panel, as described in the competition terms and conditions. First place in the leaderboard in any single category does not necessarily mean overall winner.
    Okay, let's have a look at Terms & conditions...

    7. The categories of judging and the process used to score and compare entries in each category are: speed of compression, speed of decompression, compression ratio, and round-trip accuracy. In addition judges will assess the novelty of concept, code style, scalability, and their opinion of the future potential of each entry in making their final decision.
    8. Judges decisions are final. Judging will take place within 1 month of the Closing Date.
    In short, the judges may choose any entry, even one that looks inefficient when judged only on compression speed, decompression speed, and compression ratio.


  4. #4
    Administrator Shelwien
    There's another 6 GB file there, but it seems it can't be downloaded over HTTP.
    http://s3.amazonaws.com/sequencesque...stoiaalliance/

    Also http://en.wikipedia.org/wiki/FASTQ_format

  5. #5
    Member (Porto Alegre, Brazil)
    Has anyone successfully gotten the sample data? If so, could you provide it via HTTP/FTP/torrent?

  6. #6
    The Founder encode
    Yep, the judges will receive lots of free source code... Afterwards they can submit their own software to themselves, and it will win the prize! Brilliant!

  7. #7
    Expert Matt Mahoney
    Anyone working on this? I ran a quick (well, several hours) test on the 80 MB fastq file that Shelwien kindly linked to (renamed to test0).

    Code:
    http://s3.amazonaws.com/sequencesqueeze-dev.pistoiaalliance/SRR062634.filt.fastq
    
    15,165,514 test0.paq8px
    16,115,901 test0-6.lpaq9m
    16,235,639 test0-m4.zpaq
    16,375,589 test0.paq9a
    16,393,695 test0.pmm
    16,405,427 test0-cc.nz
    16,462,281 test0.nz
    16,476,068 test0.ash
    16,554,743 test0-6.lpaq1
    16,826,593 test0-cfm100.bbb
    16,941,205 test0-m3.zpaq
    17,183,336 test0.bcm
    17,201,598 test0-d6-n16M-f16M.ctw
    17,238,309 test0-p.bsc
    17,238,344 test0.bsc
    17,252,864 test0-m256-o8.pmd
    17,471,255 test0-m2.zpaq
    17,555,051 test0-256.hook
    17,602,755 test0-m256-o16-r1.pmd
    17,714,291 test0-p-m6.bsc
    17,727,126 test0-m256-o4.pmd
    17,762,644 test0.pmd
    17,827,111 test0-m256-o16.pmd
    17,921,307 test0-p-m3.bsc
    19,133,776 test0-m1.zpaq
    19,531,518 test0-m256-o2.pmd
    19,815,265 test0-mx=9.7z
    20,279,408 test0.bz2
    20,373,589 test0.comprox_ba
    20,426,942 test0.7z
    21,272,446 test0-lzx21.cab
    21,773,012 test0.sr2
    23,202,980 test0.kzip
    23,835,069 TEST0.Z
    24,059,782 test0-12.tor
    24,468,825 test0.comprox_sa
    24,559,540 test0-9.zip
    25,145,150 test0.slug
    25,224,908 test0.gz
    25,225,134 test0.zip
    25,899,315 test0.fpaq0f2
    26,179,986 test0-5.tor
    27,098,436 test0.etincelle
    29,442,764 test0-9.lzo
    32,887,930 test0.lzrw5
    33,006,771 test0.lzrw3-a
    37,424,223 test0.fpaq0p
    39,742,983 test0.lzo
    40,086,254 test0.stz
    40,261,409 test0.lzrw3
    40,931,044 test0.fpaq0
    41,305,492 test0.lzrw2
    41,412,785 test0-1.tor
    42,614,128 test0.lzrw1-1
    43,257,616 test0.lzrw1
    52,827,193 test0.ppp
    53,268,197 test0.flzp
    80,755,971 test0
    The leaderboard at http://www.sequencesqueeze.org/ shows James Bonfield, who I recognized as the author of the SRF format for DNA sequence data that I wrote a compressor for at Ocarina. So I searched sourceforge.net for "fastq" and found 2 compressors out of 11 programs, KungFq by Elena Grassi (last updated Aug. 3 2011) and fqzcomp by James Bonfield (last updated Nov. 11, 2011).

    KungFq is a Windows program that is supposed to do some input transform and compress with either zlib or lzma. However, it crashed on this file (Java bad number format exception). fqzcomp also does some transform, I guess creating separate streams, and compresses with zlib. It came with source only, apparently for Linux, and must be linked with zlib. So I installed zlib (zlib.net, ./configure, make, sudo make install), changed icc to g++ in the Makefile and compiled. There are 2 programs that read stdin to stdout: the compressor, fastq2fqz, and the decompressor, fqz2fastq. The result is 18,825,692, which is beaten by about half of the programs above.

    Of course the contest considers speed by some unspecified formula and some subjective judgement of software quality. fqzcomp compresses in about 3 seconds and decompresses in 1, compared to about 3 hours for paq8px_v69.

    But given that there are not a whole lot of people who know how to write data compression software, it might be something to look at.

  8. #8
    Member Alexander Rhatushnyak
    Quote Originally Posted by Matt Mahoney:
    Anyone working on this? I ran a quick (well, several hours) test on the 80 MB fastq file that Shelwien kindly linked to (renamed to test0).

    Code:
    http://s3.amazonaws.com/sequencesqueeze-dev.pistoiaalliance/SRR062634.filt.fastq
    
    15,165,514 test0.paq8px
    16,115,901 test0-6.lpaq9m
    16,235,639 test0-m4.zpaq
    16,375,589 test0.paq9a
    16,393,695 test0.pmm
    Surprisingly lpaq9m is again compressing better than zpaq and paq9 (just like with prime numbers in various formats: http://encode.su/threads/1367-Synthe...ll=1#post26504 ), though it was written for large English text files like enwik8 and enwik9, and it's not expected to compress well without DRT preprocessing.

    Also I guess lpaq was N times faster than zpaq -m4 and NN times faster than paq8px.
    When you run tests automatically it's easy (imho) to calculate processing times, either by logging start/finish times explicitly or from output-file timestamps.
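
    For example, a minimal C sketch of the explicit start/finish logging (the command line below is just a placeholder for whatever compressor is being timed; assumes a POSIX system):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    static double now(void) {              // wall-clock seconds
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec*1e-9;
    }
    
    int main(void) {
      double t0 = now();
      int rc = system("./compressor test0 test0.cmp");  // placeholder command
      double t1 = now();
      printf("exit=%d  elapsed=%.2f s\n", rc, t1-t0);
      return 0;
    }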


    ADDED two hours later:

    It compresses in about 3 seconds and decompresses in 1
    The current leaderboard table at http://www.sequencesqueeze.org shows Compress.time=188.38 and Decompress.time=100.91, and it's not clear whether these are seconds or milliseconds (or are these results on a 100 times bigger file?). Also, it's not clear from the memory usage numbers 36512 and 16016 whether these are kilobytes, megabytes, or data-specific records.
    Is this all explained somewhere on the site?

    Anyone working on this?
    I am not working on this, mainly because it's like playing a lottery, since the rules sound like "we may choose any entry as a winner, regardless of performance, but nonetheless you are welcome to participate because the first 40 entries received will each receive a US$20 Amazon Web Services voucher. (Only one voucher per person.)"
    Does anyone here need an AWS voucher? Why?

  9. #9
    Expert Matt Mahoney
    I did some experiments in which I split the fastq file into 4 separate files, each containing every fourth line. This improved compression in every case I tested.

    Code:
    // Split input into 4 files each containing every 4'th line
    int c, i=1;
    while ((c=getc(f[0]))!=EOF) {
      putc(c, f[i]);
      if (c=='\n' && ++i>4) i=1;
    }
    
    17,751,387 test1    @SRR062634.321 HWI-EAS110_103327062:6:1:1446:951/2
    31,193,446 test2    TGATCATTTGATTAATACTGACATGTAGACAAGAAGAAAAGTATGTTTCATGCTATTTTGAGTAACTTCCATTTAGAAGCCTACTCCTGAGCACAACATT
       617,692 test3    +
    31,193,446 test4    B5=BD5DAD?:CBDD-DDDDDCDDB+-B:;?A?CCE?;D3A?B?DB??;DDDEEABD+>DAC?A-CD-=D?C5A@::AC-?AB?=:>CA@##########
    
    15,978,394 test?.pmm        1034631, 7127330,  41,  7815392 (-o6-m25, -o16-m556, -o4-m1, -o8-m150)
    15,979,788 test?.lpaq9m      890784, 7271981, 238,  7816785
    16,052,738 test-m4.zpaq      845270, 7387186, 281,  7820001
    16,061,399 test-m4-bs.zpaq   845229, 7390672,  43,  7825455
    16,133,261 test?-nm-cc.nz    883891, 7383307,  88,  7864975
    16,160,211 test-nm-cc.nz
    16,205,798 test?-nm.nz       946801, 7278680,  71,  7980246
    16,294,717 test-nm.nz
    16,583,898 test-m3.zpaq      968929, 7651104, 154,  7963711
    17,141,095 test?.bsc        1924989, 7313659, 128,  7902319
    17,214,532 test-m2-b64.zpaq 1983109, 7270008, 233,  7961182
    17,285,304 test-m2.zpaq     1951151, 7351234, 233,  7982686
    18,356,068 test.7z
    18,356,538 test?.7z         1452977, 7624224, 269,  9279068
    18,909,367 test-m1-b64.zpaq 2291636, 7875915, 283,  8741533
    18,980,351 test-m1.zpaq     2260792, 7957153, 283,  8762123
    21,546,591 test?-9.gz       2299399, 8857636, 643, 10388913
    22,470,515 test?.gz         2492488  9228041, 643, 10749343
    The code snippet shows how the splitting was done; the merge would be a similar procedure (a sketch follows below). The result is 4 files, test1...test4. I showed the first line of each file. test1 has a description of the read. It contains sorted data describing its location in the sequencing machine. test2 contains the DNA read. Each read is the same length, in this case 100 base pairs. test3 contains only a "+". test4 contains the phred (confidence) scores corresponding to the values in test2. Higher ASCII values represent higher confidence in the read values.
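
    Here is a minimal sketch of that merge (a mirror of the split loop; the output filename is just a placeholder, and it assumes every record in the original had exactly 4 lines):

    Code:
    #include <stdio.h>
    
    int main(void) {
      FILE* f[5];
      int c, i;
      f[0]=fopen("test0.merged", "wb");  // reconstructed output (name is a placeholder)
      f[1]=fopen("test1", "rb");         // headers
      f[2]=fopen("test2", "rb");         // DNA reads
      f[3]=fopen("test3", "rb");         // "+" lines
      f[4]=fopen("test4", "rb");         // phred scores
      for (i=0; i<5; ++i) if (!f[i]) return 1;
      i=1;
      while ((c=getc(f[i]))!=EOF) {      // copy one line from f[1]..f[4] in rotation
        putc(c, f[0]);
        if (c=='\n' && ++i>4) i=1;
      }
      for (i=0; i<5; ++i) fclose(f[i]);
      return 0;
    }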

    Of the programs I tested, I got the best result with ppmonstr by tuning the model order for each file. I tried orders 4, 6, 8, 12, 16, and 32. I showed the memory in MB needed for each one.

    Nanozip, zpaq, and 7zip produce archives, but in each case I got better results by compressing the files separately. I showed both ways. For zpaq it makes no difference unless the -bs (solid archive) option is used.

    test1 contains sorted data. All of the BWT methods (zpaq -m1, -m2, nz, bsc) do poorly on this file compared to CM (zpaq -m3, -m4, lpaq9m, nz -cc) and PPM (ppmonstr). test2 does better with BWT than CM.

    gzip -9 does not do as well as fastq2fqz (18,825,692) even though both use deflate.

  10. #10
    Expert Matt Mahoney
    > Does anyone here need an AWS voucher? Why?

    I guess because you need an Amazon EC2 account to access the test data. http://aws.amazon.com/ec2/pricing/

    > Also I guess lpaq was N times faster than zpaq -m4 and NN times faster than paq8px.

    Well, yes, but zpaq is constrained by archive compatibility between versions. And it's not that bad actually :D Tests on a 2.0 GHz T3200, Win32:

    Code:
    7,256,807 test2-8.lpaq9m,    77 sec, 774 MB
    7,270,008 test2-m2-b32.zpaq, 23 sec, 160 MB
    7,271,981 test2-6.lpaq9m,    68 sec, 198 MB
    7,351,234 test2-m2-b16.zpaq, 13 sec, 160 MB (2 threads)

  11. #11
    Programmer schnaader
    Using the split code from above and ZPAQ for compression, test1 and test4 can be improved. For test1, I removed the static parts and coded the 5 resulting numbers differently (delta for the first three, last one decreased by 1):

    Code:
    Original (lines 1499-1503):
    @SRR062634.228723 HWI-EAS110_103327062:6:1:19776:5633/1
    @SRR062634.228724 HWI-EAS110_103327062:6:1:19776:6695/1
    @SRR062634.229115 HWI-EAS110_103327062:6:2:1427:1029/2
    @SRR062634.229561 HWI-EAS110_103327062:6:2:1547:987/2
    @SRR062634.229974 HWI-EAS110_103327062:6:2:1628:922/2
    
    
    New:
    1 0 0 5633 0
    1 0 0 6695 0
    391 1 -18349 1029 1
    446 0 120 987 1
    413 0 81 922 1
    This helps a lot. It's also possible to write the output in binary, as unsigned bytes and signed 16-bit words, so every line becomes 8 bytes (word, byte, word, word, byte), but this improves the result only a little (at least for ZPAQ).
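
    In code, the transform amounts to roughly this (the constant strings and the lane number are assumed fixed for this particular file; the variable names are just labels):

    Code:
    #include <stdio.h>
    
    int main(void) {
      long id, lane, tile, x, y, rd;
      long pid=0, ptile=0, px=0;          // previous values for the deltas
      char buf[256];
      while (fgets(buf, sizeof(buf), stdin)) {
        if (sscanf(buf, "@SRR062634.%ld HWI-EAS110_103327062:%ld:%ld:%ld:%ld/%ld",
                   &id, &lane, &tile, &x, &y, &rd) != 6)
          return 1;                       // unexpected header format
        printf("%ld %ld %ld %ld %ld\n",   // delta, delta, delta, raw, minus 1
               id-pid, tile-ptile, x-px, y, rd-1);
        pid=id; ptile=tile; px=x;
      }
      return 0;
    }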

    For test4, I just removed the hash symbols ('#') that pad out the ends of lines. Note that this has to be done with caution, because sometimes there are still characters between hash symbols (e.g. line 41), so only the trailing run of '#' can be dropped. I also tried delta coding, as the ASCII characters are in the range '!' to 'H', but this worsens compression. (A small sketch of the trimming follows the example below.)

    Code:
    Original (line 41, lines 143-147):
    ?D=A=D=D@>DBBDDDD?D?DBD:DCDD-D;DBDBDD5D-CB:@5B-;::??@BBBB-=B#####################!##################
    GGDEGGGGGGGFGGFFFBFGBFEBFF?FADEEEFDGFEBGBCCC;EEBED@A?BCAACED=BACC5>>>>,:AB;53>8=-,77=<=1<A?=?#######
    EEAEEAEEEEEEEEEECEEEE=BEEEEEBBEEEEEBCEE?-BBBB@5B:E@@::0*>?;;>==2-==85=?=??94;???@?=@5''83):?60:@@??<
    ####################################################################################################
    FB5AFEAEBE?DEEB-?A:AA-A=-7=;C6=C;?.>C=:B:A:?;4;>>4A?################################################
    GGFGGGGFGGGGGGGGGGGGGGGGGGGGGGBGGGGGG?GGEEGBEFEDAFAEFDDFEDFEDFF?FGGFEBBB-BB@?:;@=?:>=?<B<A83==?AA1A?
    
    New:
    ?D=A=D=D@>DBBDDDD?D?DBD:DCDD-D;DBDBDD5D-CB:@5B-;::??@BBBB-=B#####################!
    GGDEGGGGGGGFGGFFFBFGBFEBFF?FADEEEFDGFEBGBCCC;EEBED@A?BCAACED=BACC5>>>>,:AB;53>8=-,77=<=1<A?=?
    EEAEEAEEEEEEEEEECEEEE=BEEEEEBBEEEEEBCEE?-BBBB@5B:E@@::0*>?;;>==2-==85=?=??94;???@?=@5''83):?60:@@??<
    
    FB5AFEAEBE?DEEB-?A:AA-A=-7=;C6=C;?.>C=:B:A:?;4;>>4A?
    GGFGGGGFGGGGGGGGGGGGGGGGGGGGGGBGGGGGG?GGEEGBEFEDAFAEFDDFEDFEDFF?FGGFEBBB-BB@?:;@=?:>=?<B<A83==?AA1A?
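
    A minimal sketch of the trimming (only the trailing run of '#' on each quality line is dropped; a decoder would pad lines back to their original length of 100):

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main(void) {
      char line[4096];
      while (fgets(line, sizeof(line), stdin)) {
        size_t n = strlen(line);
        if (n && line[n-1]=='\n') line[--n] = 0;   // drop the newline
        while (n && line[n-1]=='#') line[--n] = 0; // strip the trailing '#' run
        puts(line);                                // puts() re-adds the newline
      }
      return 0;
    }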
    Results:

    Code:
      17.751.387 test1
       4.338.497 test1_compact
       2.470.760 test1_compact_binary
         845.257 test1.zpq_m4.zpaq
         638.257 test1_compact.zpq_m4.zpaq
         633.721 test1_compact_binary.zpq_m4.zpaq
    
      31.193.446 test4
      29.763.802 test4_no_hashes
       7.819.988 test4.zpq_m4.zpaq
       7.756.029 test4_no_hashes.zpq_m4.zpaq

  12. #12
    Member Alexander Rhatushnyak
    Matt,
    you tried really hard with ppmonstr for a gain of only 0.0087%,
    but by picking the best zpaq options per file you gain 0.27%!

    Code:
    15,935,560 test?.zpaq        845270, 7270008, 281,  7820001 (-m4, -m2-b32, -m4, -m4)
    15,978,394 test?.pmm        1034631, 7127330,  41,  7815392 (-o6-m25, -o16-m556, -o4-m1, -o8-m150)
    15,979,788 test?.lpaq9m      890784, 7271981, 238,  7816785
    16,052,738 test-m4.zpaq      845270, 7387186, 281,  7820001


  13. #13
    Expert Matt Mahoney
    I looked at the DNA and phred data (test2 and test4) to see if there were correlations, in spite of non-solid archives always doing better. First, counting byte values:

    Code:
    Count      DNA    Phred
     10     308846   308846
     33 !        0    50037
     35 #        0  1480648
     36 $        0      451
     37 %        0     2138
     38 &        0     4044
     39 '        0     6715
     40 (        0    13140
     41 )        0    17294
     42 *        0    14716
     43 +        0    16470
     44 ,        0    39925
     45 -        0    75172
     46 .        0    12332
     47 /        0    12180
     48 0        0    20836
     49 1        0    16051
     50 2        0    26568
     51 3        0    34084
     52 4        0    44554
     53 5        0   194873
     54 6        0    45241
     55 7        0    52031
     56 8        0    49703
     57 9        0    59774
     58 :        0   346165
     59 ;        0    85035
     60 <        0   103149
     61 =        0   467880
     62 >        0   173698
     63 ?        0   895855
     64 @        0   541613
     65 A  9211089   962739
     66 B        0  1454517
     67 C  6221322  1251956
     68 D        0  1900604
     69 E        0  5344916
     70 F        0  3118306
     71 G  6317993 11949190
     78 N    50037        0
     84 T  9084159        0
    We see that the only characters in test2 are A, C, G, T, N (meaning unknown base) and linefeed. In test4, phred values are always 33-71. Both files are exactly the same size because the line lengths are always the same (100 + linefeed). Looking at pairwise frequencies, we see that N always has a phred score of 33, and A,C,G,T always 35-71. So it is possible to unambiguously pack the base and phred score into 1 byte, which I do later.

    Code:
    Phred      A        C        G        T        N
     33        0        0        0        0    50037
     34        0        0        0        0        0
     35   347834   378888   414837   339089        0
     36      150       78      116      107        0
     37      730      386      394      628        0
     38     1386      755      731     1172        0
     39     2394     1202     1078     2041        0
     40     4654     2597     2141     3748        0
     41     6622     2897     2407     5368        0
     42     5443     3038     2326     3909        0
     43     6194     3157     2665     4454        0
     44    15074     6795     5393    12663        0
     45    28846    14070    14643    17613        0
     46     3827     2427     2491     3587        0
     47     4183     1995     1736     4266        0
     48     7199     3284     2834     7519        0
     49     5298     2682     2479     5592        0
     50     9469     4334     3778     8987        0
     51    11509     5330     5141    12104        0
     52    15118     7199     6947    15290        0
     53    71902    32960    41091    48920        0
     54    13745     7787     8074    15635        0
     55    16677     8448     7843    19063        0
     56    15380     8047     8192    18084        0
     57    18330     9856     9329    22259        0
     58   119684    59690    75281    91510        0
     59    24828    15542    17458    27207        0
     60    28822    16833    16021    41473        0
     61   159531    81925   106839   119585        0
     62    43870    33640    40476    55712        0
     63   282074   167577   202731   243473        0
     64   164264   103586   115089   158674        0
     65   289289   180581   214231   278638        0
     66   454460   271621   323959   404477        0
     67   344922   253329   276735   376970        0
     68   599242   363961   448486   488915        0
     69  1658414  1005609   998889  1682004        0
     70   958743   617623   732599   809341        0
     71  3470982  2541593  2202533  3734082        0
    But first I look at the distribution of bases and average phred values by read location. We see that reliability drops as we progress to the end of the read. There is also some bias near the start, probably due to the chemistry of attaching DNA strands to the sequencing machine. (It helps to understand a little about how the machines work. http://www.illumina.com/technology/s...echnology.ilmn )

    Code:
    Pos        A        C        G        T        N  Phred
      0    95617    64361    40710   108158        0  70.16
      1   108985    51943    72545    75373        0  70.02
      2    96360    57343    69604    85539        0  69.98
      3    92923    64361    64117    87445        0  69.97
      4    94660    60057    60673    93456        0  69.98
      5    94775    62953    63212    87906        0  70.00
      6    92453    60365    64227    91801        0  69.96
      7    95892    60513    61364    91077        0  69.92
      8    94593    60524    60844    92885        0  69.91
      9    94766    60955    61648    91477        0  69.89
     10    94617    60316    60935    92978        0  69.90
     11    94452    60539    61581    92274        0  69.86
     12    95178    59940    60190    93538        0  69.85
     13    94543    60446    60608    93249        0  69.80
     14    94666    59989    60635    93556        0  69.80
     15    93534    60324    61720    93268        0  69.78
     16    93717    60376    61218    93535        0  69.72
     17    93499    60992    61225    93130        0  69.70
     18    93515    61161    61151    93019        0  69.69
     19    92598    61351    61766    93131        0  69.67
     20    93152    61278    61527    92889        0  69.56
     21    93143    61225    62000    92478        0  69.63
     22    92849    61546    61474    92977        0  69.54
     23    92661    61768    61980    92437        0  69.52
     24    92809    61778    62070    92189        0  69.38
     25    92502    61611    62347    92386        0  69.42
     26    92432    62029    62648    91737        0  69.39
     27    92826    62089    62252    91679        0  69.34
     28    92736    61765    62453    91892        0  69.26
     29    92462    61591    62708    92085        0  69.26
     30    91850    62291    62821    91884        0  69.18
     31    92088    62320    62700    91738        0  69.21
     32    93030    61746    62788    91282        0  69.18
     33    92446    61744    62907    91749        0  69.10
     34    92702    62216    62296    91632        0  69.04
     35    91803    62383    62855    91804        1  68.96
     36    92667    61646    62903    91341      289  68.87
     37    92197    62641    62872    91136        0  68.83
     38    92178    62057    62837    91154      620  68.75
     39    91883    62043    62722    91622      576  68.69
     40    92337    61811    63307    91391        0  68.61
     41    91767    62239    63172    91048      620  68.57
     42    91871    62200    63084    91638       53  68.44
     43    91868    61981    62924    91688      385  68.33
     44    91546    62768    63108    91005      419  68.33
     45    91701    62578    63091    90855      621  68.25
     46    91816    62275    63588    91096       71  68.19
     47    91525    62607    63095    91383      236  68.08
     48    91080    62933    63213    91594       26  68.03
     49    91514    62853    63341    91087       51  67.92
     50    91601    62538    63314    90800      593  67.81
     51    91085    62223    63782    90949      807  67.61
     52    91394    62302    63592    90938      620  67.57
     53    91299    62200    63677    91065      605  67.53
     54    91432    62496    63535    90761      622  67.37
     55    91458    62571    63791    90401      625  67.34
     56    91847    62282    63397    90597      723  67.26
     57    92137    62021    63202    91192      294  67.15
     58    91411    62605    63432    90904      494  66.96
     59    91268    63196    63529    90559      294  66.77
     60    91598    62999    63510    90368      371  66.77
     61    91073    62880    63647    90533      713  66.62
     62    91328    62393    63387    91009      729  66.41
     63    91542    62732    63781    90418      373  66.44
     64    91635    62653    63484    90784      290  66.23
     65    91372    63018    64039    90277      140  66.05
     66    91258    63377    63458    90129      624  65.89
     67    90667    62831    63823    90870      655  65.83
     68    91640    62987    63513    90278      428  65.66
     69    90783    62632    63972    90293     1166  65.55
     70    91091    62924    64306    89521     1004  65.22
     71    90956    63068    64238    89778      806  65.10
     72    90584    62586    64216    90931      529  64.82
     73    91272    62968    64399    89490      717  64.96
     74    90544    63070    64163    90620      449  64.77
     75    90839    63188    64298    89885      636  64.73
     76    91013    63121    63479    90301      932  64.45
     77    91346    63165    64210    89829      296  64.34
     78    91477    62714    64529    89936      190  64.12
     79    90623    63122    64859    89941      301  63.84
     80    91321    63211    64084    90099      131  63.53
     81    90609    63063    63896    89803     1475  63.52
     82    91461    62719    64306    88953     1407  63.25
     83    91013    62298    64231    89127     2177  63.19
     84    90259    62896    64105    89455     2131  62.93
     85    90830    63197    64079    89832      908  62.47
     86    90668    63068    64777    89976      357  62.36
     87    89978    63200    64812    89587     1269  62.11
     88    89904    63402    64443    89401     1696  61.76
     89    89654    63119    64467    88903     2703  61.66
     90    89889    63477    64829    89520     1131  61.16
     91    89244    63907    64882    89466     1347  61.04
     92    90096    63887    64353    89136     1374  60.92
     93    90123    63699    65209    88684     1131  60.69
     94    90551    63361    64956    89134      844  60.28
     95    89328    63860    65164    88496     1998  59.80
     96    90295    63725    64476    88534     1816  59.40
     97    90035    63638    64782    88409     1982  59.06
     98    89806    64095    65008    88343     1594  59.11
     99    89668    63817    65516    88273     1572  58.35
    And then I look at 1-4 gram statistics, just ignoring N and linefeeds. These statistics depend on the biology of whatever is being sequenced. I have no idea what species this is, but it probably matters because higher organisms like humans have more non-coding and repetitive DNA than bacteria due to evolutionary pressure, and coding DNA is constrained by the genetic code, where each 3 bases codes one amino acid. Anyhow, we see some anomalies, such as the sequence CG being rare, and repeats of AAAA or its complement TTTT being common. DNA is double stranded, so we would expect that any sequence should have a similar frequency to its complement that you get by exchanging A with T and C with G and reversing the order like AATC -> GATT. I did find this was the case with e.coli from the Canterbury large corpus.

    Code:
    Seq        A        C        G        T
         9211089  6221322  6317993  9084159
      A  3104744  1538719  2125999  2441628
      C  2210457  1584409   343709  2082747
      G  1881862  1254219  1645800  1536111
      T  2014026  1843975  2202485  3023673
     AA  1220560   459466   623681   801038
     AC   618358   347246    83116   489999
     AG   693851   409154   535294   487699
     AT   641668   424398   584625   790937
     CA   588913   448528   592704   580312
     CC   562692   413772    91453   516492
     CG    83627    73970    97902    88210
     CT   394057   497408   586302   604980
     GA   661414   284568   510606   425274
     GC   430993   337940    78192   407094
     GG   505265   349318   443196   348021
     GT   350408   285033   449240   451430
     TA   633857   346157   399008   635004
     TC   598414   485451    90948   669162
     TG   599119   421777   569408   612181
     TT   627893   637136   582318  1176326
    AAA   497743   179754   232139   310925
    AAC   188501   100211    23699   147055
    AAG   217211   111824   146064   148581
    AAT   230809   135179   205722   229328
    ACA   172731   123237   158569   163821
    ACC   124946    88677    16423   117200
    ACG    19890    17284    22760    23182
    ACT   100188   109784   130567   149460
    AGA   247120   112545   179136   155050
    AGC   145019   112443    23261   128431
    AGG   169053   127437   124270   114534
    AGT   113279    90235   140013   144172
    ATA   206681   103064   113483   218440
    ATC   148313   101909    24931   149245
    ATG   163042   103320   155304   162959
    ATT   179427   174863   132189   304458
    CAA   229452    97946   128766   132749
    CAC   171381   110545    28295   138307
    CAG   174610   129192   158202   130700
    CAT   126389   111142   138674   204107
    CCA   135977   118682   156666   151367
    CCC   151839   113532    32830   115571
    CCG    19701    22647    26574    22531
    CCT    86289   134293   154423   141487
    CGA    26513    11282    25339    20493
    CGC    18418    27565     7917    20070
    CGG    25110    21427    34274    17091
    CGT    15469    18715    30483    23543
    CTA   117358    80095    84569   112035
    CTC   161205   139963    26746   169494
    CTG   148478   124874   155378   157572
    CTT   111215   140850   128284   224631
    GAA   241420    92003   150061   177930
    GAC   113764    64302    16469    90033
    GAG   169110    90802   141788   108906
    GAT   100845    78433   113914   132082
    GCA   116176    81269   129658   103890
    GCC   113071    81442    22084   121343
    GCG    16299    17664    23625    20604
    GCT    78527    87584   131151   109832
    GGA   188979    68100   146155   102031
    GGC   117394    85950    29207   116767
    GGG   135686    85690   131922    89898
    GGT    72901    65356   110879    98885
    GTA   108877    55570    83270   102691
    GTC    92650    66827    14228   111328
    GTG   123609    79916   122271   123444
    GTT    89653    89316    98006   174455
    TAA   251945    89763   112715   179434
    TAC   144712    72188    14653   114604
    TAG   132920    77336    89240    99512
    TAT   183625    99644   126315   225420
    TCA   164029   125340   147811   161234
    TCC   172836   130121    20116   162378
    TCG    27737    16375    24943    21893
    TCT   129053   165747   170161   204201
    TGA   198802    92641   159976   147700
    TGC   150162   111982    17807   141826
    TGG   175416   114764   152730   126498
    TGT   148759   110727   167865   184830
    TTA   200941   107428   117686   201838
    TTC   196246   176752    25043   239095
    TTG   163990   113667   136455   168206
    TTT   247598   232107   223839   472782
    So I merged test2 and test4 into test5 by encoding A,C,G,T -> 0,1,2,3 in the high 2 bits and the phred score in the low 6 bits. N and linefeed are coded as 0 in the high bits. This can be decoded unambiguously because N always has phred=33 and the others never do, and because linefeeds always coincide. Here are the results:

    Code:
    Test5: merge test2 and test4: "ACGTN\n" -> {0,1,2,3,0,0}*64 + phred.
    
    test 1+3+5  test 1..4   test5=2+4
    15,904,313  16,052,738  15,058,762 test5-m4.zpaq
    16,093,434  16,015,074  15,117,928 test5-o16-m454.pmm
    16,027,218  15,961,885  15,137,673 test5-8.lpaq9m
    16,034,507  16,133,261  15,150,528 test5-nm-cc.nz
    16,054,775  15,979,788  15,163,753 test5-6.lpaq9m
    16,211,459  16,583,898  15,242,376 test5-m3.zpaq
    17,287,977  17,141,095  15,362,860 test5.bsc
    17,382,080  17,214,532  15,398,738 test5-m2-b32.zpaq
    17,427,450  17,285,304  15,476,066 test5-m2.zpaq
    18,227,135  18,909,367  15,935,216 test5-m1-b32.zpaq
    18,267,532  18,980,351  16,006,457 test5-m1.zpaq
    18,544,500  18,356,538  17,091,254 test5.7z
    18,396,169  16,205,798  17,449,297 test5-nm.nz
    21,166,767  21,546,591  18,866,725 test5-9.gz
    21,590,328  22,470,515  19,097,197 test5.gz
    49,562,525  80,755,971  31,193,446 test5
    The first column is the total of the 3 files test1,3,5 compressed separately and added together. The second column is for test1,2,3,4 mostly from the previous data. The third is test5 by itself. For ppmonstr I used -o16 for all files because it gave the best compression on test5. I also added lpaq9m -8, which scores best on the 4 separate files.

    As you can see, the results are mixed. In some cases, merging the DNA and phred scores helps and sometimes it hurts. It helps with zpaq -m1, -m3, -m4, nanozip -cc, and gzip. It makes compression worse for zpaq -m2, nanozip (default BWT), lpaq9m, ppmonstr, bsc, and 7zip.

    I can see 2 ways it might help. One is that it makes the input smaller, which would help with small memory compressors like gzip. The other is that it makes the weak correlations between bases and phred scores visible. The problem is that it hides the stronger correlations among bases (independent of phred scores) and among phred scores (independent of bases).

  14. #14
    Administrator Shelwien
    How about simply deleting Ns and LFs from the ACTG file based on phred values?

  15. #15
    Member Alexander Rhatushnyak
    Quote Originally Posted by Matt Mahoney
    you need an Amazon EC2 account to access the test data. http://aws.amazon.com/ec2/pricing/
    Hm... If someone with EC2 access grabs the 6 GB file and puts it online, making it available to everyone, would that be legal or not?

    Quote Originally Posted by Matt Mahoney:
    > Also I guess lpaq was N times faster than zpaq -m4 and NN times faster than paq8px.
    Well, yes, but zpaq is constrained by archive compatibility between versions. And it's not that bad actually :D Tests on a 2.0 GHz T3200, Win32:
    Well, I meant the top three lines in your first table, with zpaq -m4 in the third line. Here they are with timing, plus two more lines, lpaq9m 9 and paq8px_v69 -8:
    Code:
    http://s3.amazonaws.com/sequencesqueeze-dev.pistoiaalliance/SRR062634.filt.fastq
    
    15,035,056 test0-8.paq8px  // 5611 seconds
    15,165,514 test0.paq8px    // with -5 ?  version px_v69 ?
    16,046,060 test0-9.lpaq9m  //  89.3 seconds
    16,115,901 test0-6.lpaq9m  //  83.2 seconds
    16,235,639 test0-m4.zpaq   // 230 seconds
    lpaq, zpaq: time is the minimum of three runs, with nul as the archive name when measuring time.
    paq8px_v69: only one run, and the archive name is not nul.
    Tested on my desktop, Intel Core i7 930, 2.8 GHz. zpaq -m4 is much faster than I expected.

    Quote Originally Posted by Matt Mahoney
    The first column is the total of the 3 files test1,3,5 compressed separately and added together.
    Can you please share either tests 1,3,5 or the code that builds them?


  16. #16
    Member m^2
    Quote Originally Posted by Alexander Rhatushnyak:
    Hm... If someone having EC2 grabs the 6G file and puts it online, makes it available to everyone, will it be legal or not?
    Depends. Under TRIPS, databases are copyrighted, so in countries that signed it (most did) there is a significant barrier. Still it might be fair use or something.

    Though the copyrighted part is the collection, not the data itself. One could select a subset of it, or rearrange it (necessary unless it's arranged in some obvious way), and there would be no copyright held by the original creators.
    Also, if both the selection and the arrangement are obvious already, there might be no copyright at all. So overall it's a tough question.


    Asking for permission could work too. ;)

  17. #17
    Expert Matt Mahoney
    I made test5 like this, where f2, f4, f5 are test2, test4, test5. The upper 2 bits aren't really A,C,G,T because the phred value (c4) might be more than 63. But it is still unambiguous. If I was going to use this for an actual transform I would offset the phred values to 1..40 and use 0 for LF or something like that.

    Code:
      int c2, c4;
      const char dna[]="ACGTN";
      while ((c2=getc(f2))!=EOF && (c4=getc(f4))!=EOF) {
        const char* p=strchr(dna, c2);
        if (p && p-dna<4) {
          assert(c4>=35 && c4<=71);
          putc((p-dna)*64+c4, f5);
        }
        else {
          assert((c2==10 && c4==10) || (c2=='N' && c4==33));
          putc(c4, f5);
        }
      }

  18. #18
    Member (Cambridge, UK)
    Quote Originally Posted by Matt Mahoney:
    The leaderboard at http://www.sequencesqueeze.org/ shows James Bonfield, who I recognized as the author of the SRF format for DNA sequence data that I wrote a compressor for at Ocarina. So I searched sourceforge.net for "fastq" and found 2 compressors out of 11 programs, KungFq by Elena Grassi (last updated Aug. 3 2011) and fqzcomp by James Bonfield (last updated Nov. 11, 2011).

    KungFq is a Windows program that is supposed to do some input transform and compress with either zlib or lzma. However, it crashed on this file (Java bad number format exception). fqzcomp also does some transform, I guess create separate streams, and compresses with zlib. It came with source only, apparently for Linux, and must be linked with zlib. So I installed zlib (zlib.net, ./configure, make, sudo make install), changed icc to g++ in Makefile and compiled. There are 2 programs that read stdin to stdout: the compressor, fastq2fqz, and the decompresser, fqz2fast2. The result is 18,825,692, which is beaten by about half of the programs above.
    OK, it's time to stop being a lurker and register on the forums I guess! Well remembered about SRF too, Matt. (A format that was *reasonably* well compressed in places and completely scuppered with uncompressed data elsewhere - a work in progress which became obsolete!)

    Anyway the fqz coder is pretty noddy. It was something I'd already written a year ago, with minimal interest from the community at the time. I had to tweak it slightly for this, as their fastq format differs in the addition of extra columns in the name, but other than that it's pretty much my old code - hence being able to submit a new baseline (it won't last for long) very quickly.

    It does as you expected, separating by data type and applying some basic transformations. Specifically it does string delta on the names. So @SRR1234.431 followed by @SRR1234.488 becomes <10> bytes the same as before plus "88". Bizarrely this is so much more efficient than LZ encoding, maybe because knowledge of the newline placement means no distance is required. Better still would be to tokenise into alpha and numeric fields and compress each independently, via tokens for string delta or "same as before", and numeric deltas as required. The DSRC fastq compressor does this.
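
    A rough sketch of that kind of name delta (not the actual fqzcomp code): emit the length of the common prefix with the previous name, then the differing suffix.

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main(void) {
      char prev[1024] = "", cur[1024];
      while (fgets(cur, sizeof(cur), stdin)) {
        cur[strcspn(cur, "\n")] = 0;
        size_t i = 0;
        while (cur[i] && cur[i]==prev[i]) i++;   // common prefix length
        printf("<%zu>%s\n", i, cur+i);           // e.g. @SRR1234.488 -> <10>88
        strcpy(prev, cur);
      }
      return 0;
    }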

    The sequence data gets packed into codons - that is, triplets. I simply treat each triplet as a 3-digit base-5 number (ACGTN), i.e. map[seq[i]]*25+map[seq[i+1]]*5+map[seq[i+2]]. There's some hand-wavy gut feeling that triplets make sense biologically because the triplet is the unit of conversion from DNA to protein via tRNAs, but realistically that's nonsense - the benefit comes from alphabet size. Huffman encoding is just rubbish with small alphabets, so going from 5 symbols to 125 massively reduces the Huffman overhead. It has the side effect of making the sequence data 1/3rd the size and dramatically speeding up zlib en/decoding too.
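
    As a sketch, the triplet packing amounts to something like this (the ACGTN -> 0..4 mapping order is an assumption, and handling of the 1-2 leftover bases at the end of each read is omitted):

    Code:
    #include <stdio.h>
    #include <string.h>
    
    static int map5(int c) {                  // A,C,G,T,N -> 0..4 (assumed order)
      const char* dna = "ACGTN";
      const char* p = strchr(dna, c);
      return p ? (int)(p-dna) : 4;            // anything unexpected treated as N
    }
    
    int main(void) {
      char seq[4096];
      while (fgets(seq, sizeof(seq), stdin)) {
        size_t n = strcspn(seq, "\n");
        for (size_t i=0; i+2<n; i+=3)         // one output byte (0..124) per codon
          putchar(map5(seq[i])*25 + map5(seq[i+1])*5 + map5(seq[i+2]));
      }
      return 0;
    }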

    The quality data is hard to compress with zlib, so I left it as is. I've had good luck compressing it with a low-order range coder, using the previous symbols and the distance since the last newline as context. Shelwien, what license is your example code in coders6c2 under? The range coder can be trivially swapped out for the ppmd one, which looks almost identical anyway, and the model in coders6c2 looks to be ppmd-derived too (from the same author), but there is no explicit statement of public-domain status or BSD licensing in your coders6c2, so I'm reluctant to use it in any public code I release without first checking.

    Anyway, back to fqzcomp - it's actually significantly faster than the sequencesqueeze chart implies, too fast in fact. Their baseline looks to be just tar+gzip, and I know I beat it even on decompression by a good margin. However the EC2 filesystem is likely pretty slow, and probably the dominant factor for light-weight compression tools.

    I think there's a good compromise of simple modelling and coding that will give high, but not the best, compression without suffering PAQ-level speed. IIRC I got to 16.3 MB or so while still taking only a few seconds.

    James

    PS. In answer to a few other questions. I'd assume the full data set is available from NCBI's SRA (http://trace.ncbi.nlm.nih.gov/Traces...ew=run_browser) but I haven't looked. I don't know what filtering they did.

    As for the time and memory figures - I've no idea what units they're in. My guess is hundredths of a second and KB, but it's really not very clear. They don't publish the size of their test set either.

  19. #19
    Member (Cambridge, UK)
    Quote Originally Posted by Shelwien:
    How about simply deleting Ns and LFs from the ACTG file based on phred values?
    Some bioinformatics tools do this already (e.g. the MAQ aligner): 2 bits of base type (ACGT) and 6 bits of quality, with quality 0 meaning the base is N. Technically it can be lossy, as the Illumina sequencing instrument software that produces this data sometimes has a habit of assigning different quality values (the "phred" scores) even when the base is N (aNy, or unknown), which is plain nonsense. Such lossiness is probably a feature rather than a problem.

  20. #20
    Administrator Shelwien
    1. All aridemo sources, including the codersXXX updates (there are 6a, 6b, 6c, 6c2), are public domain.
    But the bitwise coders would likely be better for this case, because they allow for more parameter optimization.

    2. How about testing m1? https://sites.google.com/site/toffer86/m1-project
    Especially after optimizing some profiles for data types with it?

    3. The coder can assume that the phred score for N is fixed, but still encode a match flag for it.
    When a flag bit is always the same and a good rangecoder is used, there's nearly no redundancy,
    so the results would be valid even if the initial implementation is lossy and doesn't encode any such flags.
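
    To put a rough number on point 3: the cost of coding an always-zero flag with a simple adaptive bit model (12-bit probability, shift-by-5 update, roughly the style of the bitwise coders discussed here) can be estimated like this:

    Code:
    #include <stdio.h>
    #include <math.h>
    
    int main(void) {
      int p = 2048;                  // P(bit=1), scaled to 0..4096
      double bits = 0;
      long n = 1000000;              // a million constant flags
      for (long i=0; i<n; i++) {
        bits += -log2((4096.0-p)/4096.0);  // cost of coding a 0 at probability 1-p
        p -= p>>5;                         // adapt towards 0
      }
      printf("%ld constant flags cost about %.0f bits total\n", n, bits);
      return 0;
    }
    With this update rule the probability bottoms out near 31/4096, i.e. roughly 0.01 bit per flag once the model has adapted.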

  21. #21
    Expert Matt Mahoney
    The bitwise arithmetic encoder (and everything else) in libzpaq is public domain. Actually it's the MIT license with the requirement to include the copyright notice removed; that was the only restriction, so it's really the same thing.

    Anyway I was looking at the DNA bases (test2) and wondering if it would be worthwhile to do a transform where some of the reads are reverse-complemented (like AAAC -> GTTT) to increase redundancy, in case both strands of the sequence make it into the machine. I'm not sure if the test file is a snippet of a much larger file, in which case the technique wouldn't work. The same read needs to appear several times if you want to reassemble the original sequence.
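
    For reference, a minimal reverse-complement sketch (exchange A with T and C with G, then reverse the order; N maps to itself, and the read is assumed to contain only ACGTN):

    Code:
    #include <stdio.h>
    #include <string.h>
    
    static void revcomp(char* s, size_t n) {
      const char *from = "ACGTN", *to = "TGCAN";
      for (size_t i=0; i<(n+1)/2; i++) {
        char a = s[i], b = s[n-1-i];
        s[i]     = to[strchr(from, b) - from];
        s[n-1-i] = to[strchr(from, a) - from];
      }
    }
    
    int main(void) {
      char read[] = "AAAC";
      revcomp(read, strlen(read));
      puts(read);                    // prints GTTT
      return 0;
    }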

    Following is a list of all sequences not containing N up to length 31 that appear at least 100 times. A sequence is shorter than 31 only if bounded at both ends either by N or by the start or end of the read. Thus, a read of length 100 with no N would account for 70 sequences.

    Code:
    1535 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    1180 ATCGGAAGAGCGTCGTGTAGGGAAAGAGTG
    1105 ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    1059 CGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
    1016 CGGAAGAGCGTCGTGTAGGGAAAGAGTGTAG
     969 GGAAGAGCGTCGTGTAGGGAAAGAGTGTAGA
     930 GAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT
     893 AAGAGCGTCGTGTAGGGAAAGAGTGTAGATC
     846 AGCGTCGTGTAGGGAAAGAGTGTAGATCT
     813 AGCGTCGTGTAGGGAAAGAGTGTAGATCTC
     772 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
     768 AGCGTCGTGTAGGGAAAGAGTGTAGATCTCG
     727 CGTCGTGTAGGGAAAGAGTGTAGATCTCGG
     664 CGTCGTGTAGGGAAAGAGTGTAGATCTCGGT
     659 TCCATTCCATTCCATTCCATTCCATTCCATT
     644 GGAATGGAATGGAATGGAATGGAATGGAA
     641 GTCGTGTAGGGAAAGAGTGTAGATCTCGGTG
     628 TCCATTCCATTCCATTCCATTCCATTCCAT
     627 CCATTCCATTCCATTCCATTCCATTCCATTC
     624 GGAATGGAATGGAATGGAATGGAATGGAAT
     618 CATTCCATTCCATTCCATTCCATTCCATTCC
     617 TCGTGTAGGGAAAGAGTGTAGATCTCGGTGG
     599 AATGGAATGGAATGGAATGGAATGGAATGGA
     585 GAATGGAATGGAATGGAATGGAATGGAATGG
     565 CGTGTAGGGAAAGAGTGTAGATCTCGGTGGT
     556 ATTCCATTCCATTCCATTCCATTCCATTCCA
     538 GGAATGGAATGGAATGGAATGGAATGGAATG
     522 GTGTAGGGAAAGAGTGTAGATCTCGGTGGTC
     493 TGTAGGGAAAGAGTGTAGATCTCGGTGGTCG
     469 GTAGGGAAAGAGTGTAGATCTCGGTGGTCGC
     447 ACACACACACACACACACACACACACACAC
     439 TAGGGAAAGAGTGTAGATCTCGGTGGTCGCC
     423 ACACACACACACACACACACACACACACACA
     418 AGGGAAAGAGTGTAGATCTCGGTGGTCGCCG
     415 ATCGGAAGAGCGTCGTGTAGGGAAAGAGT
     366 GGAAAGAGTGTAGATCTCGGTGGTCGCCGT
     352 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
     338 GGAAAGAGTGTAGATCTCGGTGGTCGCCGTA
     334 TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
     330 GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
     319 GAAAGAGTGTAGATCTCGGTGGTCGCCGTAT
     299 AAAAAAAAAAAAAAAA
     297 AAAGAGTGTAGATCTCGGTGGTCGCCGTATC
     272 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
     271 AGTGTAGATCTCGGTGGTCGCCGTATCA
     265 AAAAAAAAAAAAAAAAA
     263 GGCTCACGCCTGTAATCCCAGCACTTTGGGA
     258 TGGCTCACGCCTGTAATCCCAGCACTTTGGG
     257 AGTGTAGATCTCGGTGGTCGCCGTATCAT
     256 CTCACGCCTGTAATCCCAGCACTTTGGGAGG
     255 GCTCACGCCTGTAATCCCAGCACTTTGGGAG
     250 TCACGCCTGTAATCCCAGCACTTTGGGAGGC
     244 AGTGTAGATCTCGGTGGTCGCCGTATCATT
     232 AAAAAAAAAAAAAAAAAA
     227 CCAGCTACTCGGGAGGCTGAGGCAGGAGAAT
     226 CCCAGCTACTCGGGAGGCTGAGGCAGGAGAA
     219 AGTGTAGATCTCGGTGGTCGCCGTATCATTA
     207 TGTAGATCTCGGTGGTCGCCGTATCATTAA
     206 CCCAAAGTGCTGGGATTACAGGCGTGAGCCA
     204 TCCCAAAGTGCTGGGATTACAGGCGTGAGCC
     201 AAAAAAAAAAAAAAAAAAA
     200 CTCCCAAAGTGCTGGGATTACAGGCGTGAGC
     200 CCAAAGTGCTGGGATTACAGGCGTGAGCCAC
     199 CCCAGCTACTCAGGAGGCTGAGGCAGGAGAA
     199 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
     195 CCTCCCAAAGTGCTGGGATTACAGGCGTGAG
     195 AAAGAAAGAAAGAAAGAAAGAAAGAAAGAAA
     194 CCAGCTACTCAGGAGGCTGAGGCAGGAGAAT
     194 AAAGAAAGAAAGAAAGAAAGAAAGAAAGAA
     191 AAAGAAAGAAAGAAAGAAAGAAAGAAAGA
     190 GTGGCTCACGCCTGTAATCCCAGCACTTTGG
     190 AAAGAAAGAAAGAAAGAAAGAAAGAAAG
     186 CACGCCTGTAATCCCAGCACTTTGGGAGGCC
     185 TGTAGATCTCGGTGGTCGCCGTATCATTAAA
     183 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     180 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
     171 CCCAGCTACTTGGGAGGCTGAGGCAGGAGAA
     169 AGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
     167 AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
     166 CCAGCTACTTGGGAGGCTGAGGCAGGAGAAT
     166 AAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     165 TGGCTCATGCCTGTAATCCCAGCACTTTGGG
     165 GGCTCATGCCTGTAATCCCAGCACTTTGGGA
     163 AAAAAAAAAAAAAAAAAAAA
     162 CCTGTAATCCCAGCACTTTGGGAGGCCGAGG
     158 TCATGCCTGTAATCCCAGCACTTTGGGAGGC
     158 GCTCATGCCTGTAATCCCAGCACTTTGGGAG
     157 GTAGATCTCGGTGGTCGCCGTATCATTAAAA
     156 TTTTGTATTTTTAGTAGAGACGGGGTTTCAC
     156 CTCATGCCTGTAATCCCAGCACTTTGGGAGG
     152 CTCCCAAAGTGCTGGGATTACAGGCATGAGC
     151 TCCCAAAGTGCTGGGATTACAGGCATGAGCC
     149 CCTCCCAAAGTGCTGGGATTACAGGCATGAG
     149 CCCAAAGTGCTGGGATTACAGGCATGAGCCA
     149 CAAAGTGCTGGGATTACAGGCGTGAGCCACC
     146 TTTGTATTTTTAGTAGAGACGGGGTTTCACC
     146 CTGTAATCCCAGCACTTTGGGAGGCCGAGGC
     145 CTGTAATCCCAGCTACTCAGGAGGCTGAGGC
     145 CCAAAGTGCTGGGATTACAGGCATGAGCCAC
     144 CTCGGCCTCCCAAAGTGCTGGGATTACAGGC
     143 GCCTCCCAAAGTGCTGGGATTACAGGCGTGA
     142 TGTAATCCCAGCTACTCAGGAGGCTGAGGCA
     142 GGCTCACACCTGTAATCCCAGCACTTTGGGA
     141 GTAATCCCAGCTACTCAGGAGGCTGAGGCAG
     141 AAAAAAAAAAAAAAAAAAAAA
     139 CTCACACCTGTAATCCCAGCACTTTGGGAGG
     138 TGGCTCACACCTGTAATCCCAGCACTTTGGG
     138 GCTCACACCTGTAATCCCAGCACTTTGGGAG
     137 TAGATCTCGGTGGTCGCCGTATCATTAAAAA
     136 TGAAACCCCGTCTCTACTAAAAATACAAAAA
     136 TCACACCTGTAATCCCAGCACTTTGGGAGGC
     130 GTGAAACCCCGTCTCTACTAAAAATACAAAA
     130 CTCAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     130 CCCAGCTACTCAGGAGGCTGAGGCAGGAG
     130 CAGCTACTCAGGAGGCTGAGGCAGGAGAATC
     129 TAATCCCAGCTACTCAGGAGGCTGAGGCAGG
     129 AATCCCAGCTACTCAGGAGGCTGAGGCAGGA
     128 TCAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     128 CCCAGCTACTCAGGAGGCTGAGGCAGGAGA
     127 GTAGTCCCAGCTACTCGGGAGGCTGAGGCAG
     127 CTGTAGTCCCAGCTACTCGGGAGGCTGAGGC
     125 TGTAGTCCCAGCTACTCGGGAGGCTGAGGCA
     125 AAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     123 TGTAATCCCAGCTACTTGGGAGGCTGAGGCA
     122 GCCTGTAATCCCAGCACTTTGGGAGGCCGA
     120 CTGTAATCCCAGCTACTTGGGAGGCTGAGGC
     119 GTTAGCCAGGATGGTCTCGATCTCCTGACCT
     119 GCCTGTAATCCCAGCACTTTGGGAGGCCGAG
     118 CCTCGGCCTCCCAAAGTGCTGGGATTACAGG
     118 AGTCCCAGCTACTCGGGAGGCTGAGGCAGGA
     117 TCCCAGCTACTCGGGAGGCTGAGGCAGGAGA
     117 TCCCAGCTACTCGGGAGGCTGAGGCAGGAG
     117 TAGTCCCAGCTACTCGGGAGGCTGAGGCAGG
     117 TAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     117 TAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     117 AAAAAAAAAAAAAAAAAAAAAA
     116 TTTTGTATTTTTAGTAGAGATGGGGTTTCAC
     116 TGAAACCCCATCTCTACTAAAAATACAAAAA
     116 GTAATCCCAGCTACTTGGGAGGCTGAGGCAG
     116 CTGTAATCCCAGCACTTTGGGAGGCCAAGGC
     116 CCTGTAATCCCAGCACTTTGGGAGGCCAAGG
     116 CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     114 GGTGGCTCACGCCTGTAATCCCAGCACTTTG
     114 ACGCCTGTAATCCCAGCACTTTGGGAGGCCG
     113 CCTCCCAAAGTGCTGGGATTACAGGTGTGAG
     113 AGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     113 AATCCCAGCTACTCGGGAGGCTGAGGCAGGA
     112 CTGTAATCCCAGCTACTCGGGAGGCTGAGGC
     112 CTCAGCCTCCCAAAGTGCTGGGATTACAGGC
     112 AGATCTCGGTGGTCGCCGTATCATTAAAAAA
     111 TTTCTTTCTTTCTTTCTTTCTTTCTTTCTTT
     111 CTCCCAAAGTGCTGGGATTACAGGTGTGAGC
     111 CATGCCTGTAATCCCAGCACTTTGGGAGGCC
     110 TTTGTATTTTTAGTAGAGATGGGGTTTCACC
     110 GTGGCTCATGCCTGTAATCCCAGCACTTTGG
     110 GTGAAACCCCATCTCTACTAAAAATACAAAA
     110 ATATATATATATATATATATATATATATATA
     109 TGTAATCCCAGCTACTCGGGAGGCTGAGGCA
     109 CTTTCTTTCTTTCTTTCTTTCTTTCTTTCTT
     109 ATATATATATATATATATATATATATATAT
     108 TTCTTTCTTTCTTTCTTTCTTTCTTTCTTTC
     108 TCCCAAAGTGCTGGGATTACAGGTGTGAGCC
     108 TAATCCCAGCTACTTGGGAGGCTGAGGCAGG
     108 CCTGTAATCCCAGCACTTTGGGAGGCTGAGG
     108 CCCAGCTACTCGGGAGGCTGAGGCAGGAGA
     107 TCTCCTGCCTCAGCCTCCCGAGTAGCTGGG
     107 GTAATCCCAGCTACTCGGGAGGCTGAGGCAG
     107 CCCAGCTACTCGGGAGGCTGAGGCAGGAG
     107 AATGGAATGGAATGGAATGCAATGGAATGGA
     106 TCTTTCTTTCTTTCTTTCTTTCTTTCTTTCT
     106 GAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     105 CTGTAATCCCAGCACTTTGGGAGGCTGAGGC
     105 CCCAGCTACTTGGGAGGCTGAGGCAGGAGA
     105 CCCAAAGTGCTGGGATTACAGGTGTGAGCCA
     105 CAGCTACTCGGGAGGCTGAGGCAGGAGAATC
     105 ATTCCATTCCATTCCATTCCATTCCATTCCA
     104 GGAATGGAATGGAATGCAATGGAATGGAA
     104 GCCTCCCAAAGTGCTGGGATTACAGGCATGA
     103 TAATCCCAGCTACTCGGGAGGCTGAGGCAGG
     103 GAAGAAATCCCGTTTCCAACGAAGGCCTCAA
     102 CCTCAGCCTCCCAAAGTGCTGGGATTACAGG
     102 CCCAGCTACTTGGGAGGCTGAGGCAGGAG
     102 AATCCCAGCTACTTGGGAGGCTGAGGCAGGA
     101 TCTCCTGCCTCAGCCTCCCGAGTAGCTGGGA
     101 GATTCCATTCCATTCCATTCCATTCCATTCC
     101 CCCATCTCTACTAAAAATACAAAAATTA
     101 AAAAAAAAAAAAAAAAAAAAAAA
     100 TCACAGAGTTTAACCTTTCTTTTCATAGA
     100 CAAAGTGCTGGGATTACAGGCATGAGCCACC
     100 ATCTCGGTGGTCGCCGTATCATTAAAAAAA
    Following is the number of sequences not containing N that appear between 1 and 99 times. The first column is for sequences of length 1 to 30. The second column is length 31. For example, there are 26 sequences of length 31 that appear 99 times.

    Code:
    Count  1..30   31
     1 13903368 11898146
     2   170686   245876
     3    39081    66966
     4    20465    32782
     5    13318    19941
     6     9173    13402
     7     6900     9899
     8     5284     7140
     9     3895     5513
    10     3353     4324
    11     2501     3635
    12     2044     2992
    13     1813     2555
    14     1401     2006
    15     1126     1727
    16     1080     1526
    17      906     1301
    18      695     1199
    19      611     1043
    20      565      961
    21      515      827
    22      459      765
    23      429      650
    24      318      647
    25      270      605
    26      229      580
    27      251      477
    28      234      449
    29      229      443
    30      220      394
    31      178      388
    32      149      355
    33      152      314
    34      147      304
    35      141      283
    36      128      294
    37      128      252
    38       79      254
    39      101      231
    40       91      225
    41       81      192
    42       66      187
    43       50      175
    44       51      161
    45       64      142
    46       56      147
    47       51      119
    48       47      134
    49       35      101
    50       47      110
    51       33       97
    52       29       99
    53       25      100
    54       29       86
    55       25       82
    56       25       87
    57       31       73
    58       23       71
    59       30       82
    60       28       60
    61       21       74
    62       14       71
    63        9       65
    64       24       62
    65       16       55
    66       18       62
    67       20       63
    68        6       47
    69       13       50
    70       12       50
    71        5       59
    72       15       48
    73        7       46
    74        8       36
    75       12       53
    76        9       47
    77       13       48
    78       10       38
    79       10       47
    80       10       48
    81        5       45
    82        3       50
    83        5       33
    84        3       35
    85        6       44
    86        7       28
    87        8       35
    88        6       42
    89        5       30
    90       10       42
    91        6       39
    92        6       28
    93        3       36
    94        4       30
    95        7       32
    96        3       32
    97        6       31
    98        2       24
    99        3       26

  22. #22
    Expert Matt Mahoney
    I tried an experiment where I complemented DNA reads if doing so improved the number of partial matches to other reads elsewhere in the file. I found plenty of cases of such complementary matches over long strings. Unfortunately, the results were poor. Complementing a read saved on average 0.5 bits with zpaq -m2, ppmonstr, and bsc (and did worse with lpaq9m), but it takes at least 1 bit to encode the direction of the read so it can be decompressed. The result is that the files end up larger.

    The first set below shows the original data. In the second, I reversed reads where there were more matches going the other way but did not encode the direction, so that the effect of reversal alone could be measured. In the third set, I encoded a reversed read by changing the trailing newline to a space. (I tried other encodings with similar effects.)

    Code:
    31,193,446 test2 (308846 DNA reads of length 100, newline terminated)
     7,256,807 test2-8.lpaq9m
     7,270,003 test2-m2-b32.zpaq
     7,128,330 test2-o16-m556.pmm
     7,313,659 test2.bsc
    60,162,245 bytes
    
    31,193,446 test6 lossy, original=261019 backward=47326
     7,283,664 test6-8.lpaq9m
     7,268,396 test6-m2-b32.zpaq
     7,125,006 test6-o16-m556.pmm
     7,311,914 test6.bsc
    60,182,426 bytes
    
    31,193,446 test6 read direction encoded by changing '\n' to ' '
     7,288,219 test6-8.lpaq9m
     7,293,044 test6-m2-b32.zpaq
     7,173,666 test6-o16-m556.pmm
     7,340,591 test6.bsc
    60,288,966 bytes
    I am including code below so you can see exactly what I did, although the code is useless. I reversed a read by exchanging A with T and C with G and reversing the direction. N is unchanged. I made 10 passes over the data, looking at every 3, 5, 7..., 21 reads modulo the data size and comparing to previous reads. (This works because all of these step sizes are relatively prime to the number of reads, which has prime factors 2*154423). I found that more than 6-8 passes had no effect.

    To make the search and comparison reasonably fast, I used Rabin-style rolling-hash parsing (as used for deduplication) to segment the DNA strings on position-independent but content-dependent boundaries. The hashes index a bitmap with no collision detection. If a bit is set, it means the same hash was seen before. I compute the hash in both directions, counting the number of segments and the number of matches each way. If the fraction of matches in the reverse direction is greater than in the forward direction, and the number of reverse matches is greater by at least 2, then I complement the string. I experimented with different thresholds for reversal, but consistently failed to see any improvement. The more strings I complemented, the worse it got.

    The hash function is h = (h+c)*10, where c is mapped "ACNGT" -> 0..4 (or 4..0 when testing in the other direction). When the high 4 bits (of 32) are 1111, then I mark a boundary and take the low 28 bits as the hash index into a bit vector of size 2^28, and set h=0. The hash is such that the low 28 bits depend on the last 28 bases, that the minimum length is 10, and that a boundary occurs with probability 1/16 after that. Thus, segments have an average length of about 18. This is long enough to make chance collisions unlikely. I threw away the first hash, which is of a partial segment, and compared the rest so that they would have content-dependent boundaries on both ends.

    I found experimentally that about 1/3 of the reads were better matched in the forward direction, 1/3 reversed, and 1/3 made no difference. Other hash functions gave fewer reversed matches. I tried changing the multiplier from 10 to 14, which reduced the minimum length, or testing the high 3 bits instead of 4, which gave a shorter average length. Note that if the file were much larger, I would need a 64 bit hash and a larger bit vector.

    To mark a complemented string I XOR the trailing newline with ' '^'\n' which changes it to a space, or changes it back if it gets un-reversed in a later pass. For the second test above, I commented out this line.

    Code:
    // t2.cpp - transform test2 -> test6
    // Replace reads with their complements if it improves compression.
    // Encode reversed reads by changing the trailing '\n' to ' '.
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <vector>
    #include <assert.h>
    using namespace std;
    
    int main() {
      FILE* in=fopen("test2", "rb");
      if (!in) return perror("test2"), 1;
      vector<char> v;  // read input into v
      for (int c; (c=getc(in))!=EOF;) v.push_back(c);
      fclose(in);
      assert(v.size()==31193446);  // prime factors are 101*2*154423
      const char dna[]="ACNGT";  // translation table
      for (int pass=0; pass<10; ++pass) {
        int revcount[4]={0};  // output stats
        vector<bool> s(1<<28);  // hash index
        for (int ii=0; ii<int(v.size()); ii+=101) {
          int i=ii*(pass*2+3)%v.size();
          assert(v[i+100]==10 || v[i+100]==' ');
          int fh=0, fm=0, rh=0, rm=0;  // forward/reverse hashes/matches
          uint32_t h=0;
          for (int j=99; j>=0; --j) {  // hash in reverse -> rm/rh
            const char* p=strchr(dna, v[i+j]);
            assert(p);
            h=(h+4-(p-dna))*10;
            if ((h>>28)==15) {
              h&=(1<<28)-1;
              if (++rh>1) {
                rm+=s[h];
              }
              h=0;
            }
          }
          h=0;
          for (int j=0; j<100; ++j) {  // hash forward -> fm/fh
            const char* p=strchr(dna, v[i+j]);
            assert(p);
            h=(h+(p-dna))*10;
            if ((h>>28)==15) {
              h&=(1<<28)-1;
              if (++fh>1) {
                fm+=s[h];
                s[h]=true;
              }
              h=0;
            }
          }
    
          // Replace read with complement and mark by changing \n to space
          int rev=rm*fh-fm*rh;
          if (rev>0 && rm>fm+1) {
            for (int j=0, k=99; j<=k; j++, k--) {
              char t1=v[i+j], t2=v[i+k];
              const char* p1=strchr(dna, t1);
              assert(p1);
              const char* p2=strchr(dna, t2);
              assert(p2);
              v[i+k]=dna[4-(p1-dna)];
              v[i+j]=dna[4-(p2-dna)];
            }
            v[i+100]^=' '^'\n';
            ++revcount[2+(v[i+100]=='\n')];
          }
          else
            ++revcount[(v[i+100]==' ')];
    
          // save hashes
          fh=fm=0;
          for (int j=0; j<100; ++j) {  // hash forward
            const char* p=strchr(dna, v[i+j]);
            assert(p);
            h=(h+(p-dna))*10;
            if ((h>>28)==15) {
              h&=(1<<28)-1;
              if (++fh>1) {
                fm+=s[h];
                s[h]=true;
              }
              h=0;
            }
          }
        }
        printf("original=%d backward=%d reversed=%d undone=%d\n",
                revcount[0], revcount[1], revcount[2], revcount[3]);
      }
    
      // write test6
      FILE* out=fopen("test6", "wb");
      if (!out) perror("test6");
      for (int i=0; i<int(v.size()); ++i)
        putc(v[i], out);
      fclose(out);
      printf("created test6\n");
      return 0;
    }

  23. #23
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,887
    Thanks
    428
    Thanked 1,965 Times in 1,137 Posts
    Afaik the best idea for such things is to create artificial additional streams
    (complementary, with reversed blocks for DNA palindromes, etc.) and train some submodels on these.

    Btw, is it possible with zpaql?
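
    As a plain illustration of that idea (a minimal sketch, assuming newline-terminated reads in a file named test2 as in the posts above), one could write a reverse-complement copy of each read as an extra stream for a submodel to train on:

    Code:
    // revcomp_stream.cpp - sketch: emit a reverse-complement copy of each
    // read so a submodel can be trained on both orientations.
    #include <cstdio>
    #include <string>
    using namespace std;
    
    static char comp(char c) {              // complement one base; N stays N
      switch (c) {
        case 'A': return 'T';  case 'T': return 'A';
        case 'C': return 'G';  case 'G': return 'C';
        default:  return 'N';
      }
    }
    
    int main() {
      FILE* in  = fopen("test2", "rb");     // newline-terminated reads
      FILE* out = fopen("test2.rc", "wb");  // artificial complementary stream
      if (!in || !out) return perror("open"), 1;
      string read;
      for (int c; (c = getc(in)) != EOF;) {
        if (c == '\n') {
          for (int i = int(read.size()) - 1; i >= 0; --i)
            putc(comp(read[i]), out);
          putc('\n', out);
          read.clear();
        } else
          read.push_back(char(c));
      }
      fclose(in);
      fclose(out);
      return 0;
    }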

    Also, there's that thing which Osman tried to use...
    http://nishi.dreamhosters.com/chanlo...2014%3A28%3A19

  24. #24
    Member
    Join Date
    Dec 2011
    Location
    Germany / Hessen
    Posts
    18
    Thanks
    0
    Thanked 1 Time in 1 Post

    Hello, I'm new here :)

    What a pity! All you pros working on it ^^
    I already hoped for easy money when I saw the first two entries :D

    Did you already get the 6 GB file?

    6 GB sample data: http://s3.amazonaws.com/sequencesque...4_1.filt.fastq

    77 MB sample data: http://s3.amazonaws.com/sequencesque...634.filt.fastq

    Just open the links :)

    But as I understood it, even the 6 GB file isn't the file used to generate the final results.
    (Otherwise they would probably need some restrictions on the size of the program itself :D )

    An interesting competition, I think.

    But I dislike the winning conditions.

    They should provide a factor for how they weight speed / size / memory...
    As you already mentioned (and I also tested), paq for example outperforms the current leader by far on compression ratio... but is pretty lame ;) Unfortunately you don't know if that is what they want...

    And this:
    First place in the leaderboard in any single category does not necessarily mean overall winner.
    Well... not very transparent... they have some potential for optimization ^^

    But still a very interesting competition, I think.
    And a nice prize ^^

    You already had some nice ideas for FASTQ-specific optimization, keep going :)

    I wish you much success.

  25. #25
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    666
    Thanks
    366
    Thanked 287 Times in 149 Posts
    Quote Originally Posted by Shelwien View Post
    How about simply deleting Ns and LFs from the ACTG file based on phred values?
    I tested this and it seems to work better than a merge. Results:

    Code:
        31.193.446 test2
        31.143.409 test2_no_n
         7.708.641 test2_no_n_2bit
    
         7.387.173 test2.zpq_m4.zpaq
         7.371.207 test2_no_n.zpq_m4.zpaq
    
         6.916.848 test2_no_n_2bit.zpq_m4.zpaq
      +  7.819.988 test4.zpq_m4.zpaq
      = 14.736.836
    
         6.916.848 test2_no_n_2bit.zpq_m4.zpaq
      +  7.756.029 test4_no_hashes.zpq_m4.zpaq
      = 14.672.877
    'no_n' has the 50037 'N' characters removed (those can be reconstructed from test4). After that, 'no_n_2bit' packs 4 ACTG values into a byte, ignoring LFs.

    Together with my previous result for test4, we get 14,672,877 bytes which is better than 15,058,762 bytes from the merge.

    The assumptions needed to keep this lossless are that we can reinsert 'N' and LF afterwards. For this testset we know that an 'N' character always has a phred value of 33 and vice versa, and that every line is 100 characters long (also, LFs are at the same positions in test2 and test4), so it works.

    EDIT: I just noticed that it could be better to replace 'N' with some value instead of removing it, keeping the structure intact, and this indeed gives a slightly better result. In this case the mapping is (T,C,G,A,N -> 00,01,10,11,00). We could even replace each 'N' with a different, chosen value, e.g. to support match finding. (A sketch of this packing follows the results below.)

    Code:
         7.721.150 test2_with_n_00_2bit
         6.914.517 test2_with_n_00_2bit.zpq_m4.zpaq
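
    For illustration, a minimal sketch of the packing described in the edit above (the bit order within a byte is an arbitrary choice here; N positions are assumed recoverable from the quality stream and LFs from the fixed 100-base read length, as noted earlier):

    Code:
    // pack2bit.cpp - sketch: pack 4 bases per byte with LFs dropped and
    // N stored as 00 (same code as T), per the mapping T,C,G,A,N -> 00,01,10,11,00.
    #include <cstdio>
    #include <cstring>
    
    static int code(char c) {               // T,C,G,A -> 0,1,2,3; anything else (N) -> 0
      static const char tab[] = "TCGA";
      const char* p = strchr(tab, c);
      return p ? int(p - tab) : 0;
    }
    
    int main() {
      FILE* in  = fopen("test2", "rb");
      FILE* out = fopen("test2_with_n_00_2bit", "wb");
      if (!in || !out) return perror("open"), 1;
      int acc = 0, nbits = 0;
      for (int c; (c = getc(in)) != EOF;) {
        if (c == '\n') continue;            // LF positions are implicit (100 bases/read)
        acc = (acc << 2) | code(char(c));   // first base of each group in the high bits
        if ((nbits += 2) == 8) {
          putc(acc, out);
          acc = nbits = 0;
        }
      }
      if (nbits) putc(acc << (8 - nbits), out);  // flush a trailing partial byte, if any
      fclose(in);
      fclose(out);
      return 0;
    }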
    Last edited by schnaader; 2nd December 2011 at 11:27.
    http://schnaader.info
    Damn kids. They're all alike.

  26. #26
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    874
    Thanks
    402
    Thanked 378 Times in 237 Posts
    Some things to consider. The DNA strings are all effectively little snapshots of a much larger (*maybe* known) string - the genome being sequenced. They haven't specified which genome, which IMO is unfair, as with real data you usually know. I assume it's human, but it's easy to look up.

    So... we expect matches of a DNA string with the genome, or in this case with other DNA sequences, but only within the sequence itself and not across newline boundaries. Each newline represents a completely different starting point somewhere in the genome. It may not even be in the same orientation. Graphically it looks something like this when assembled together:

    [assembly screenshot from Gap5 - explained in the edit below]

    My own view on optimal compression here is that fastq isn't what the scientists actually want anyway. The optimal fastq compression will almost certainly come from doing a de novo sequence assembly or from mapping the data back to a known reference, and then encoding it as a combination of the (possibly derived, possibly known) reference, positions within it, and differences to it. Maybe we can write an algorithm to do this, but frankly you may as well start with the existing tools. (A toy sketch of the position-plus-differences idea follows.)
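
    To make that concrete, a toy sketch (nothing like a real aligner; brute-force scan, mismatches only, no indels) of encoding a read against a known reference as a position plus a list of differences:

    Code:
    // ref_encode.cpp - toy sketch: represent a read as (position in reference,
    // list of (offset, base) mismatches). Real aligners such as bwa and bowtie
    // use BWT/FM-index searches; the brute-force scan here is only illustrative.
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>
    using namespace std;
    
    struct Encoded {
      size_t pos;                            // best match position in the reference
      vector<pair<int, char>> diffs;         // (offset within read, read base) mismatches
    };
    
    static Encoded encode(const string& ref, const string& read) {
      Encoded best{0, {}};
      size_t bestErrs = read.size() + 1;
      for (size_t p = 0; p + read.size() <= ref.size(); ++p) {
        size_t errs = 0;
        for (size_t i = 0; i < read.size() && errs < bestErrs; ++i)
          errs += (ref[p + i] != read[i]);
        if (errs < bestErrs) {
          bestErrs = errs;
          best.pos = p;
          best.diffs.clear();
          for (size_t i = 0; i < read.size(); ++i)
            if (ref[p + i] != read[i]) best.diffs.push_back({int(i), read[i]});
        }
      }
      return best;                           // decoder: copy ref[pos..], apply diffs
    }
    
    int main() {
      string ref  = "ACGTACGTTAGGCCATACGT";  // stand-in for a genome
      string read = "TAGGCCATTC";            // one read containing an error or SNP
      Encoded e = encode(ref, read);
      printf("pos=%zu diffs=%zu\n", e.pos, e.diffs.size());
      for (auto& d : e.diffs) printf("  offset %d -> %c\n", d.first, d.second);
      return 0;
    }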

    However, here's the rub: what is the point of then throwing away all this useful information - how the data aligns - in order to recreate the original fastq file? It's pointless! The scientists will probably only end up aligning it again themselves anyway. Instead, the real solution lies in encoding the data in the SAM file format - a generally recognised format for mapped and assembled DNA fragments - and forgetting about fastq completely.

    Of course it's probably not what the organisers expect. I also realise I've totally given away my game plan, but frankly with the experts here I'm doomed anyway! All I can hope is to channel people in interesting and appropriate directions. :)

    Edit: an explanation of the above pic. It's from Gap5, the software I work on for a living. The grey-scale background is an indication of the phred quality values, with dark being poor quality. The blue colouration simply shows disagreements with the consensus (which is worked out on the fly). That's an old screenshot with early Illumina data, so the error rate is high, giving lots of discrepancies, and the sequence length is short.

    On modern data, with a deep assembly (about 1000x) I was able to compress the sequence to 0.042 bits per base by modelling each column at a time; this is given the context of the position in the consensus and the alignment CIGAR string, but not the actual consensus itself. That basically shows how few errors there are in more modern data sets.

    So differences then boil down to two things.

    1) Errors, mostly random but sometimes specific to particular sequence motifs. In theory the quality values have been calibrated such that Q10 means a 1 in 10 error rate, Q20 is 1 in 100, Q30 is 1 in 1000, etc., i.e. Q = -10*log10(P(error)). So we can get an estimate of the chance of a base call differing based on its phred score... IF they've been calibrated correctly. (A small conversion sketch follows after point 2.)

    2) SNPs (single nucleotide polymorphisms). Basically these are differences between individuals. A lot of sequencing data is simply your whole genome smashed into little fragments and passed through a sequencing instrument. That means you have two copies of everything - one derived from your mother and one from your father. The assembly then shows up various sites where those two copies differ - you'll likely see a column consisting of a 50/50 ratio of A and G, for example.
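
    The quality-to-probability conversion mentioned in point 1, in code form (a trivial sketch, using the phred+33 ASCII encoding of this data set):

    Code:
    // phred.cpp - sketch: convert phred+33 quality characters to estimated
    // error probabilities, P(error) = 10^(-Q/10).
    #include <cmath>
    #include <cstdio>
    
    int main() {
      const char* quals = "#+5?I";           // example quality characters
      for (const char* p = quals; *p; ++p) {
        int q = *p - 33;                     // phred+33 encoding
        double perr = pow(10.0, -q / 10.0);  // Q10 -> 0.1, Q20 -> 0.01, Q30 -> 0.001
        printf("'%c'  Q=%2d  P(error)=%.5f\n", *p, q, perr);
      }
      return 0;
    }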

    Finally, there's currently a strong relationship between compression technology and bioinformatics. The best of the current crop of DNA sequence aligners, which match DNA fragments against a known reference genome, includes the bowtie and bwa programs - both based around the BWT (or more specifically the FM-index, I think).
    Last edited by JamesB; 2nd December 2011 at 11:37.

  27. #27
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    874
    Thanks
    402
    Thanked 378 Times in 237 Posts
    Quote Originally Posted by Steve View Post
    What a pity! All you pros working on it ^^
    I already hoped for easy money when I saw the first two entries :D
    The cheek of it! I went for pure speed with fqzcomp, but it turns out it's too fast given typical disk speeds (faster than gzip even at decompression). I do have plans, though, for substantially better compression while still not being too bad on speed.

    Good catch on the 6 GB file. Last time I tried, it was still permission denied. I guess they've set the ACLs on that S3 file now.

    James

  28. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,267
    Thanks
    313
    Thanked 831 Times in 499 Posts
    Quote Originally Posted by Shelwien View Post
    2. How about testing m1? https://sites.google.com/site/toffer86/m1-project
    Especially after optimizing some profiles for data types with it?
    After running m1 v0.6 overnight on test2, I got this after 14 iterations, then no further improvement. But it didn't save the profile or the compressed file, so I will have to run it again.

    Code:
    m1 o: test2
    
    14 f* = 1.929044
     0 SSE0_W0 = 00110111 [0, 8, g]
     1 SSE1_W0 = 01111111 [0, 8, g]
     2 SSE0_W = 1111111110000000 [0, 16, g]
     3 FSM0_PARTITION = 0000000001000000000000100000000000000000000010000000000010000000 [4, 4]
     4 FSM0_DECAY_Q_MAX = 11100 [0, 5, vg]
     5 FSM0_DECAY = 0000000000000100000000100000100 [3, 3, b]
     6 FSM0_QDOMAIN = 111100 [0, 6, vg]
     7 MDL_0x40 = 1111 [0, 4, b]
     8 MDL_MASKS = 11111111000111101111001001011101 [0, 32, b]
     9 MDL_MUL = 1111111111110011101010001101000010010001011100110110111010110100 [0, 64]
    10 MIX0_CTXTYPE = 1 [0, 1]
    11 MIX0_SHIFT = 011 [0, 3, g]

  29. #29
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,267
    Thanks
    313
    Thanked 831 Times in 499 Posts
    Quote Originally Posted by JamesB View Post
    Good catch on the 6Gb file. Last time I tried that it was still permission denied. I guess they set the acls on that s3 file now.
    James
    Looks like the same format as the smaller file (Illumina 1.8, phred+33, 100-byte reads). Here are the first few lines.

    Code:
    @SRR062634.1 HWI-EAS110_103327062:6:1:1092:8469/1
    GGGTTTTCCTGAAAAAGGGATTCAAGAAAGAAAACTTACATGAGGTGATTGTTTAATGTTGCTACCAAAGAAGAGAGAGTTACCTGCCCATTCACTCAGG
    +
    @C'@9:BB:?DCCB5CC?5C=?5@CADC?BDB)B@?-A@=:=:@CC'C>5AA+*+2@@'-?>5-?C=@-??)'>>B?D@?*?A#################
    @SRR062634.2 HWI-EAS110_103327062:6:1:1107:21105/1
    ACCGTGAGCAATCAGCTGCCATCAACGTGGAGGTAAGACTCTCCACCTGCAAAAACATTACAACTTGCTGAAGGCTGAGATACTTGTTCGCACATTTTTA
    +
    FDEFF?DFEFE?BEEEEED=DB:DCEAEEB,CC=@B=5?B?CC5C?B+A??=>:CC<9-B2=@>-?:-<A@@A?9>*0<:'0%6,>:9&-:>?:>==B??
    @SRR062634.3 HWI-EAS110_103327062:6:1:1110:17198/1
    TAGATATTTTTGTTTTAACTGCTGTAGAAAATTAAGACATAAACTAAGAAATATCCCATGAAGGAATGAGTATACTGTTTCTACTTGCAGAAAAGCTGCG
    +
    -?3-C22646@-@3@@3-=-====CBB@DB-A-=-AA@C-<AA7>D=ABDA;?CDDDD5D?DD55:>:AB>5?-CCCC######################
    @SRR062634.4 HWI-EAS110_103327062:6:1:1112:12923/1
    AGATGAGTTTCACAAAGTAAGCAACTTGATATCCAAATAATCAACACCCAACTCAAGAAAAAGATCATTACCAGAAACTAATAAACCAGCACATTAGGTG
    +
    ??EEEDB?D-?AAA-AA?>->BC:ADB:--A55ACCA:D6C:?5@CADD6=C5:CD?D4;,::?,CC-5A@C-?D5@+-BB?BC*A-A?C:C@#######
    BTW, the hash characters "#" are phred values of 2, the lowest possible score except for N.

  30. #30
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    874
    Thanks
    402
    Thanked 378 Times in 237 Posts
    A quick benchmark of fqzcomp2 - same parsing code as fqzcomp-1 but using the model+encoder derived from Shelwien's coders6c2 distribution:

    Code:
    jkb@deskpro102485[compress.../fqzcomp-...] time ./fastq2fqz < /dev/shm/s > /dev/shm/b
    Output = 16627423 bytes
    Name 1180443
    Seq 7522909
    Qual 7923497
    
    real 0m1.991s
    user 0m1.836s
    sys 0m0.148s
    
    jkb@deskpro102485[compress.../fqzcomp-...] time ./fqz2fastq < /dev/shm/b > /dev/shm/c
    
    real 0m1.890s
    user 0m1.724s
    sys 0m0.160s
    
    jkb@deskpro102485[compress.../fqzcomp-...] ls -l /dev/shm/b /dev/shm/s
    -rw-r--r-- 1 jkb team117 16627431 Dec 2 14:44 /dev/shm/b
    -rw-r--r-- 1 jkb team117 80755971 Nov 8 14:34 /dev/shm/s

    So: program, compressed size (bytes), compression time (s), decompression time (s):
    Code:
    gzip2     25224902 7.667 0.687
    fqzcomp-1 18825695 1.975 0.483
    fqzcomp-2 16627431 1.991 1.890
    The models used in fqzcomp2 are still pretty noddy. Names are packed into the same 3 arrays as before (this can be greatly improved on with a proper encoder, but zlib limited me majorly there) and passed through an order-1 encoder; the sequence is packed in triplets and passed through an order-1 encoder; and the quality is passed to an encoder whose context is the previous 2 quality bytes plus the distance from the last newline divided by 32.
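
    Not fqzcomp's actual code, but a rough sketch of how such a quality context (previous two quality values plus position/32) might be accumulated; test4 is assumed to be the quality stream from the earlier posts, and a real implementation would drive an arithmetic coder from these per-context statistics:

    Code:
    // qualctx.cpp - sketch: build a context from the previous two quality
    // values plus the within-read position divided by 32, and gather
    // per-context symbol counts (a coder would use these to code each symbol).
    #include <cstdio>
    #include <vector>
    
    int main() {
      FILE* in = fopen("test4", "rb");          // quality lines, one per read (assumed)
      if (!in) return perror("test4"), 1;
      const int QMAX = 64;                      // quality alphabet size after -33 (assumed)
      // context = (q1 * QMAX + q2) * 4 + min(pos/32, 3)
      std::vector<unsigned> counts((size_t)QMAX * QMAX * 4 * QMAX, 0);
      int q1 = 0, q2 = 0, pos = 0;
      for (int c; (c = getc(in)) != EOF;) {
        if (c == '\n') { q1 = q2 = pos = 0; continue; }   // reset context at each read
        int q = c - 33;
        if (q < 0) q = 0;
        if (q >= QMAX) q = QMAX - 1;
        int posbin = pos / 32;
        if (posbin > 3) posbin = 3;             // 100-byte reads give bins 0..3 anyway
        size_t ctx = ((size_t)q1 * QMAX + q2) * 4 + posbin;
        ++counts[ctx * QMAX + q];               // statistics the coder would adapt on
        q2 = q1; q1 = q; ++pos;
      }
      fclose(in);
      printf("context table: %zu entries\n", counts.size());
      return 0;
    }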

    Obviously it can be beaten with better models, but this seems to demonstrate a good size vs speed tradeoff. I'll upload it soon, but the code takes hacky to a new level and currently I'm ashamed to let Shelwien see what I did to his code.

    Edit: I've improved the seq model by using 2-bit encoding and escaping Ns via knowledge of the quality value. This drops it down to 16438212 bytes (7334911 for seq). I tweaked the names slightly too, but there's a lot more scope there. I want to model each name as a series of numeric and non-numeric fields; each field can then be output as "matches previous", "delta to previous" (good for both strings and numbers) or "as is". I think this would substantially reduce the names.
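
    A sketch of that field-wise idea (hypothetical, not what fqzcomp currently does): split each name into alternating numeric and non-numeric tokens and compare with the previous name, emitting a "match" flag, a numeric delta, or a literal token. The example names are taken from the sample data shown above.

    Code:
    // nametok.cpp - sketch of field-wise name modelling: tokenize names into
    // numeric / non-numeric runs and encode each token relative to the
    // corresponding token of the previous name.
    #include <cctype>
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <vector>
    using namespace std;
    
    static vector<string> tokenize(const string& s) {
      vector<string> t;
      for (size_t i = 0; i < s.size();) {
        size_t j = i;
        bool digit = isdigit((unsigned char)s[i]) != 0;
        while (j < s.size() && (isdigit((unsigned char)s[j]) != 0) == digit) ++j;
        t.push_back(s.substr(i, j - i));
        i = j;
      }
      return t;
    }
    
    int main() {
      const char* names[] = {
        "@SRR062634.1 HWI-EAS110_103327062:6:1:1092:8469/1",
        "@SRR062634.2 HWI-EAS110_103327062:6:1:1107:21105/1",
      };
      vector<string> prev;
      for (const char* n : names) {
        vector<string> cur = tokenize(n);
        for (size_t i = 0; i < cur.size(); ++i) {
          bool num = isdigit((unsigned char)cur[i][0]) != 0;
          if (i < prev.size() && cur[i] == prev[i])
            printf("[match] ");
          else if (i < prev.size() && num && isdigit((unsigned char)prev[i][0]))
            printf("[delta %+ld] ", atol(cur[i].c_str()) - atol(prev[i].c_str()));
          else
            printf("[lit %s] ", cur[i].c_str());
        }
        printf("\n");
        prev = cur;
      }
      return 0;
    }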

    There's room for quality shrinkage too. I have a simple command-line processor for just quality which gets it to 7824640 bytes in about 1 second, but ultimately the way to go there is lossy compression frankly. Same with the names - they're not that useful!
    Last edited by JamesB; 2nd December 2011 at 19:59.
