
Thread: Compression Competition -- $15,000 USD

  1. #1
    Member (California, USA)

    Compression Competition -- $15,000 USD

    Forgive me if this is off-topic for this forum, but I know some folks here like to engage in (and win) compression competitions.

    This one is, from what I can tell, not that well specified (for example, they don't appear to charge for the size of the compressor). But in case there is any interest...

    http://www.sequencesqueeze.org/

  2. #2
    Administrator Shelwien

  3. #3
    Member Alexander Rhatushnyak
    Is this the only file you need to compress? Or is it just a sample file?

    Judging

    Upon submission, entries will join a queue for automated judging.

    The total size of /contest-data/compressed will be used to calculate compression ratio. Timing and memory usage will be calculated using /usr/bin/time with the %e and %K options. Header, quality and sequence mismatches are simple counts of the number of lines in the original data which did not appear exactly the same in the decompressed data.

    Results will be emailed to you upon completion of automated judging. The top 5 entries will also appear on the frontpage of this website in the leaderboard section.

    The choice of overall winner of the US$15,000 prize will be a combination of performance in the leaderboard with the professional opinions of the judging panel, as described in the competition terms and conditions. First place in the leaderboard in any single category does not necessarily mean overall winner.
    Okay, let's have a look at Terms & conditions...

    7. The categories of judging and the process used to score and compare entries in each category are: speed of compression, speed of decompression, compression ratio, and round-trip accuracy. In addition judges will assess the novelty of concept, code style, scalability, and their opinion of the future potential of each entry in making their final decision.
    8. Judges decisions are final. Judging will take place within 1 month of the Closing Date.
    In short, the judges may choose any entry, even one that looks inefficient when judged only on compression speed, decompression speed, and compression ratio.


  4. #4
    Administrator Shelwien
    There's another 6 GB file there, but it seems it can't be downloaded over HTTP.
    http://s3.amazonaws.com/sequencesque...stoiaalliance/

    Also http://en.wikipedia.org/wiki/FASTQ_format

  5. #5
    Member (Porto Alegre, Brazil)
    Has anyone successfully gotten the sample data? If so, could you provide it via HTTP/FTP/torrent?

  6. #6
    The Founder encode
    Yep, the judges will receive lots of free source code... Afterwards they can submit their own software to themselves, and it will win the prize! Brilliant!

  7. #7
    Expert Matt Mahoney
    Anyone working on this? I ran a quick (well, several hours) test on the 80 MB fastq file that Shelwien kindly linked to (renamed to test0).

    Code:
    http://s3.amazonaws.com/sequencesqueeze-dev.pistoiaalliance/SRR062634.filt.fastq
    
    15,165,514 test0.paq8px
    16,115,901 test0-6.lpaq9m
    16,235,639 test0-m4.zpaq
    16,375,589 test0.paq9a
    16,393,695 test0.pmm
    16,405,427 test0-cc.nz
    16,462,281 test0.nz
    16,476,068 test0.ash
    16,554,743 test0-6.lpaq1
    16,826,593 test0-cfm100.bbb
    16,941,205 test0-m3.zpaq
    17,183,336 test0.bcm
    17,201,598 test0-d6-n16M-f16M.ctw
    17,238,309 test0-p.bsc
    17,238,344 test0.bsc
    17,252,864 test0-m256-o8.pmd
    17,471,255 test0-m2.zpaq
    17,555,051 test0-256.hook
    17,602,755 test0-m256-o16-r1.pmd
    17,714,291 test0-p-m6.bsc
    17,727,126 test0-m256-o4.pmd
    17,762,644 test0.pmd
    17,827,111 test0-m256-o16.pmd
    17,921,307 test0-p-m3.bsc
    19,133,776 test0-m1.zpaq
    19,531,518 test0-m256-o2.pmd
    19,815,265 test0-mx=9.7z
    20,279,408 test0.bz2
    20,373,589 test0.comprox_ba
    20,426,942 test0.7z
    21,272,446 test0-lzx21.cab
    21,773,012 test0.sr2
    23,202,980 test0.kzip
    23,835,069 TEST0.Z
    24,059,782 test0-12.tor
    24,468,825 test0.comprox_sa
    24,559,540 test0-9.zip
    25,145,150 test0.slug
    25,224,908 test0.gz
    25,225,134 test0.zip
    25,899,315 test0.fpaq0f2
    26,179,986 test0-5.tor
    27,098,436 test0.etincelle
    29,442,764 test0-9.lzo
    32,887,930 test0.lzrw5
    33,006,771 test0.lzrw3-a
    37,424,223 test0.fpaq0p
    39,742,983 test0.lzo
    40,086,254 test0.stz
    40,261,409 test0.lzrw3
    40,931,044 test0.fpaq0
    41,305,492 test0.lzrw2
    41,412,785 test0-1.tor
    42,614,128 test0.lzrw1-1
    43,257,616 test0.lzrw1
    52,827,193 test0.ppp
    53,268,197 test0.flzp
    80,755,971 test0
    The leaderboard at http://www.sequencesqueeze.org/ shows James Bonfield, who I recognized as the author of the SRF format for DNA sequence data that I wrote a compressor for at Ocarina. So I searched sourceforge.net for "fastq" and found 2 compressors out of 11 programs, KungFq by Elena Grassi (last updated Aug. 3 2011) and fqzcomp by James Bonfield (last updated Nov. 11, 2011).

    KungFq is a Windows program that is supposed to do some input transform and compress with either zlib or lzma. However, it crashed on this file (Java bad number format exception). fqzcomp also does some transform, I guess creating separate streams, and compresses with zlib. It came with source only, apparently for Linux, and must be linked with zlib. So I installed zlib (zlib.net, ./configure, make, sudo make install), changed icc to g++ in the Makefile and compiled. There are 2 programs that read stdin to stdout: the compressor, fastq2fqz, and the decompressor, fqz2fastq. The result is 18,825,692, which is beaten by about half of the programs above.

    Of course the contest considers speed by some unspecified formula and some subjective judgement of software quality. fqzcomp compresses in about 3 seconds and decompresses in 1, compared to about 3 hours for paq8px_v69.

    But given that there are not a whole lot of people who know how to write data compression software, it might be something to look at.

  8. #8
    Member Alexander Rhatushnyak
    Quote Originally Posted by Matt Mahoney:
    Anyone working on this? I ran a quick (well, several hours) test on the 80 MB fastq file that Shelwien kindly linked to (renamed to test0).

    Code:
    http://s3.amazonaws.com/sequencesqueeze-dev.pistoiaalliance/SRR062634.filt.fastq
    
    15,165,514 test0.paq8px
    16,115,901 test0-6.lpaq9m
    16,235,639 test0-m4.zpaq
    16,375,589 test0.paq9a
    16,393,695 test0.pmm
    Surprisingly lpaq9m is again compressing better than zpaq and paq9 (just like with prime numbers in various formats: http://encode.su/threads/1367-Synthe...ll=1#post26504 ), though it was written for large English text files like enwik8 and enwik9, and it's not expected to compress well without DRT preprocessing.

    Also I guess lpaq was N times faster than zpaq -m4 and NN times faster than paq8px.
    When you run tests automatically it's easy (imho) to calculate processing times, either by logging start/finish times explicitly or from output-file timestamps.
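
    For example, a minimal C sketch of the explicit start/finish logging (the command line below is just a placeholder for whatever compressor is being timed; assumes a POSIX system):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    static double now(void) {              // wall-clock seconds
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec*1e-9;
    }
    
    int main(void) {
      double t0 = now();
      int rc = system("./compressor test0 test0.cmp");  // placeholder command
      double t1 = now();
      printf("exit=%d  elapsed=%.2f s\n", rc, t1-t0);
      return 0;
    }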


    ADDED two hours later:

    It compresses in about 3 seconds and decompresses in 1
    The current leaderboard table at http://www.sequencesqueeze.org shows Compress.time=188.38 and Decompress.time=100.91, and it's not clear whether these are seconds or milliseconds (or are these results on a 100 times bigger file?). Also, it's not clear from the memory usage numbers 36512 and 16016 whether these are kilobytes, megabytes, or data-specific records.
    Is this all explained somewhere on the site?

    Anyone working on this?
    I am not working on this, mainly because it's like playing a lottery, since the rules sound like "we may choose any entry as a winner, regardless of performance, but nonetheless you are welcome to participate because the first 40 entries received will each receive a US$20 Amazon Web Services voucher. (Only one voucher per person.)"
    Does anyone here need an AWS voucher? Why?

  9. #9
    Expert Matt Mahoney
    I did some experiments in which I split the fastq file into 4 separate files, each containing every fourth line. This improved compression in every case I tested.

    Code:
    // Split input into 4 files each containing every 4'th line
    int c, i=1;
    while ((c=getc(f[0]))!=EOF) {
      putc(c, f[i]);
      if (c=='\n' && ++i>4) i=1;
    }
    
    17,751,387 test1    @SRR062634.321 HWI-EAS110_103327062:6:1:1446:951/2
    31,193,446 test2    TGATCATTTGATTAATACTGACATGTAGACAAGAAGAAAAGTATGTTTCATGCTATTTTGAGTAACTTCCATTTAGAAGCCTACTCCTGAGCACAACATT
       617,692 test3    +
    31,193,446 test4    B5=BD5DAD?:CBDD-DDDDDCDDB+-B:;?A?CCE?;D3A?B?DB??;DDDEEABD+>DAC?A-CD-=D?C5A@::AC-?AB?=:>CA@##########
    
    15,978,394 test?.pmm        1034631, 7127330,  41,  7815392 (-o6-m25, -o16-m556, -o4-m1, -o8-m150)
    15,979,788 test?.lpaq9m      890784, 7271981, 238,  7816785
    16,052,738 test-m4.zpaq      845270, 7387186, 281,  7820001
    16,061,399 test-m4-bs.zpaq   845229, 7390672,  43,  7825455
    16,133,261 test?-nm-cc.nz    883891, 7383307,  88,  7864975
    16,160,211 test-nm-cc.nz
    16,205,798 test?-nm.nz       946801, 7278680,  71,  7980246
    16,294,717 test-nm.nz
    16,583,898 test-m3.zpaq      968929, 7651104, 154,  7963711
    17,141,095 test?.bsc        1924989, 7313659, 128,  7902319
    17,214,532 test-m2-b64.zpaq 1983109, 7270008, 233,  7961182
    17,285,304 test-m2.zpaq     1951151, 7351234, 233,  7982686
    18,356,068 test.7z
    18,356,538 test?.7z         1452977, 7624224, 269,  9279068
    18,909,367 test-m1-b64.zpaq 2291636, 7875915, 283,  8741533
    18,980,351 test-m1.zpaq     2260792, 7957153, 283,  8762123
    21,546,591 test?-9.gz       2299399, 8857636, 643, 10388913
    22,470,515 test?.gz         2492488  9228041, 643, 10749343
    The code snippet shows how the splitting was done; the merge would be a similar procedure (a sketch follows below). The result is 4 files, test1...test4. I showed the first line of each file. test1 has a description of the read. It contains sorted data describing its location in the sequencing machine. test2 contains the DNA read. Each read is the same length, in this case 100 base pairs. test3 contains only a "+". test4 contains the phred (confidence) scores corresponding to the values in test2. Higher ASCII values represent higher confidence in the read values.
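
    Here is a minimal sketch of that merge (a mirror of the split loop; the output filename is just a placeholder, and it assumes every record in the original had exactly 4 lines):

    Code:
    #include <stdio.h>
    
    int main(void) {
      FILE* f[5];
      int c, i;
      f[0]=fopen("test0.merged", "wb");  // reconstructed output (name is a placeholder)
      f[1]=fopen("test1", "rb");         // headers
      f[2]=fopen("test2", "rb");         // DNA reads
      f[3]=fopen("test3", "rb");         // "+" lines
      f[4]=fopen("test4", "rb");         // phred scores
      for (i=0; i<5; ++i) if (!f[i]) return 1;
      i=1;
      while ((c=getc(f[i]))!=EOF) {      // copy one line from f[1]..f[4] in rotation
        putc(c, f[0]);
        if (c=='\n' && ++i>4) i=1;
      }
      for (i=0; i<5; ++i) fclose(f[i]);
      return 0;
    }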

    Of the programs I tested, I got the best result with ppmonstr by tuning the model order for each file. I tried orders 4, 6, 8, 12, 16, and 32. I showed the memory in MB needed for each one.

    Nanozip, zpaq, and 7zip produce archives, but in each case I got better results by compressing the files separately. I showed both ways. For zpaq it makes no difference unless the -bs (solid archive) option is used.

    test1 contains sorted data. All of the BWT methods (zpaq -m1, -m2, nz, bsc) do poorly on this file compared to CM (zpaq -m3, -m4, lpaq9m, nz -cc) and PPM (ppmonstr). test2 does better with BWT than CM.

    gzip -9 does not do as well as fastq2fqz (18,825,692) even though both use deflate.

  10. #10
    Expert Matt Mahoney
    > Does anyone here need an AWS voucher? Why?

    I guess because you need an Amazon EC2 account to access the test data. http://aws.amazon.com/ec2/pricing/

    > Also I guess lpaq was N times faster than zpaq -m4 and NN times faster than paq8px.

    Well, yes, but zpaq is constrained by archive compatibility between versions. And it's not that bad actually :D Tests on a 2.0 GHz T3200, Win32:

    Code:
    7,256,807 test2-8.lpaq9m,    77 sec, 774 MB
    7,270,008 test2-m2-b32.zpaq, 23 sec, 160 MB
    7,271,981 test2-6.lpaq9m,    68 sec, 198 MB
    7,351,234 test2-m2-b16.zpaq, 13 sec, 160 MB (2 threads)

  11. #11
    Programmer schnaader
    Using the split code from above and ZPAQ for compression, test1 and test4 can be improved. For test1, I removed the static parts and coded the 5 resulting numbers differently (delta for the first three, last one decreased by 1):

    Code:
    Original (lines 1499-1503):
    @SRR062634.228723 HWI-EAS110_103327062:6:1:19776:5633/1
    @SRR062634.228724 HWI-EAS110_103327062:6:1:19776:6695/1
    @SRR062634.229115 HWI-EAS110_103327062:6:2:1427:1029/2
    @SRR062634.229561 HWI-EAS110_103327062:6:2:1547:987/2
    @SRR062634.229974 HWI-EAS110_103327062:6:2:1628:922/2
    
    
    New:
    1 0 0 5633 0
    1 0 0 6695 0
    391 1 -18349 1029 1
    446 0 120 987 1
    413 0 81 922 1
    This helps a lot. It's also possible to write the output in binary, as unsigned bytes and signed 16-bit words, so every line becomes 8 bytes (word, byte, word, word, byte), but this improves the result only a little (at least for ZPAQ).
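
    In code, the transform amounts to roughly this (the constant strings and the lane number are assumed fixed for this particular file; the variable names are just labels):

    Code:
    #include <stdio.h>
    
    int main(void) {
      long id, lane, tile, x, y, rd;
      long pid=0, ptile=0, px=0;          // previous values for the deltas
      char buf[256];
      while (fgets(buf, sizeof(buf), stdin)) {
        if (sscanf(buf, "@SRR062634.%ld HWI-EAS110_103327062:%ld:%ld:%ld:%ld/%ld",
                   &id, &lane, &tile, &x, &y, &rd) != 6)
          return 1;                       // unexpected header format
        printf("%ld %ld %ld %ld %ld\n",   // delta, delta, delta, raw, minus 1
               id-pid, tile-ptile, x-px, y, rd-1);
        pid=id; ptile=tile; px=x;
      }
      return 0;
    }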

    For test4, I just removed the hash symbols ('#') that pad out the ends of lines. Note that this has to be done with caution, because sometimes there are still characters between hash symbols (e.g. line 41), so only the trailing run of '#' can be dropped. I also tried delta coding, as the ASCII characters are in the range '!' to 'H', but this worsens compression. (A small sketch of the trimming follows the example below.)

    Code:
    Original (line 41, lines 143-147):
    ?D=A=D=D@>DBBDDDD?D?DBD:DCDD-D;DBDBDD5D-CB:@5B-;::??@BBBB-=B#####################!##################
    GGDEGGGGGGGFGGFFFBFGBFEBFF?FADEEEFDGFEBGBCCC;EEBED@A?BCAACED=BACC5>>>>,:AB;53>8=-,77=<=1<A?=?#######
    EEAEEAEEEEEEEEEECEEEE=BEEEEEBBEEEEEBCEE?-BBBB@5B:E@@::0*>?;;>==2-==85=?=??94;???@?=@5''83):?60:@@??<
    ####################################################################################################
    FB5AFEAEBE?DEEB-?A:AA-A=-7=;C6=C;?.>C=:B:A:?;4;>>4A?################################################
    GGFGGGGFGGGGGGGGGGGGGGGGGGGGGGBGGGGGG?GGEEGBEFEDAFAEFDDFEDFEDFF?FGGFEBBB-BB@?:;@=?:>=?<B<A83==?AA1A?
    
    New:
    ?D=A=D=D@>DBBDDDD?D?DBD:DCDD-D;DBDBDD5D-CB:@5B-;::??@BBBB-=B#####################!
    GGDEGGGGGGGFGGFFFBFGBFEBFF?FADEEEFDGFEBGBCCC;EEBED@A?BCAACED=BACC5>>>>,:AB;53>8=-,77=<=1<A?=?
    EEAEEAEEEEEEEEEECEEEE=BEEEEEBBEEEEEBCEE?-BBBB@5B:E@@::0*>?;;>==2-==85=?=??94;???@?=@5''83):?60:@@??<
    
    FB5AFEAEBE?DEEB-?A:AA-A=-7=;C6=C;?.>C=:B:A:?;4;>>4A?
    GGFGGGGFGGGGGGGGGGGGGGGGGGGGGGBGGGGGG?GGEEGBEFEDAFAEFDDFEDFEDFF?FGGFEBBB-BB@?:;@=?:>=?<B<A83==?AA1A?
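
    A minimal sketch of the trimming (only the trailing run of '#' on each quality line is dropped; a decoder would pad lines back to their original length of 100):

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main(void) {
      char line[4096];
      while (fgets(line, sizeof(line), stdin)) {
        size_t n = strlen(line);
        if (n && line[n-1]=='\n') line[--n] = 0;   // drop the newline
        while (n && line[n-1]=='#') line[--n] = 0; // strip the trailing '#' run
        puts(line);                                // puts() re-adds the newline
      }
      return 0;
    }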
    Results:

    Code:
      17.751.387 test1
       4.338.497 test1_compact
       2.470.760 test1_compact_binary
         845.257 test1.zpq_m4.zpaq
         638.257 test1_compact.zpq_m4.zpaq
         633.721 test1_compact_binary.zpq_m4.zpaq
    
      31.193.446 test4
      29.763.802 test4_no_hashes
       7.819.988 test4.zpq_m4.zpaq
       7.756.029 test4_no_hashes.zpq_m4.zpaq

  12. #12
    Member Alexander Rhatushnyak
    Matt,
    you tried really hard with ppmonstr for a gain of only 0.0087%,
    but by picking the best zpaq options per file you gain 0.27%!

    Code:
    15,935,560 test?.zpaq        845270, 7270008, 281,  7820001 (-m4, -m2-b32, -m4, -m4)
    15,978,394 test?.pmm        1034631, 7127330,  41,  7815392 (-o6-m25, -o16-m556, -o4-m1, -o8-m150)
    15,979,788 test?.lpaq9m      890784, 7271981, 238,  7816785
    16,052,738 test-m4.zpaq      845270, 7387186, 281,  7820001


  13. #13
    Expert Matt Mahoney
    I looked at the DNA and phred data (test2 and test4) to see if there were correlations, in spite of non-solid archives always doing better. First, counting byte values:

    Code:
    Count      DNA    Phred
     10     308846   308846
     33 !        0    50037
     35 #        0  1480648
     36 $        0      451
     37 %        0     2138
     38 &        0     4044
     39 '        0     6715
     40 (        0    13140
     41 )        0    17294
     42 *        0    14716
     43 +        0    16470
     44 ,        0    39925
     45 -        0    75172
     46 .        0    12332
     47 /        0    12180
     48 0        0    20836
     49 1        0    16051
     50 2        0    26568
     51 3        0    34084
     52 4        0    44554
     53 5        0   194873
     54 6        0    45241
     55 7        0    52031
     56 8        0    49703
     57 9        0    59774
     58 :        0   346165
     59 ;        0    85035
     60 <        0   103149
     61 =        0   467880
     62 >        0   173698
     63 ?        0   895855
     64 @        0   541613
     65 A  9211089   962739
     66 B        0  1454517
     67 C  6221322  1251956
     68 D        0  1900604
     69 E        0  5344916
     70 F        0  3118306
     71 G  6317993 11949190
     78 N    50037        0
     84 T  9084159        0
    We see that the only characters in test2 are A, C, G, T, N (meaning unknown base) and linefeed. In test4, phred values are always 33-71. Both files are exactly the same size because the line lengths are always the same (100 + linefeed). Looking at pairwise frequencies, we see that N always has a phred score of 33, and A,C,G,T always 35-71. So it is possible to unambiguously pack the base and phred score into 1 byte, which I do later.

    Code:
    Phred      A        C        G        T        N
     33        0        0        0        0    50037
     34        0        0        0        0        0
     35   347834   378888   414837   339089        0
     36      150       78      116      107        0
     37      730      386      394      628        0
     38     1386      755      731     1172        0
     39     2394     1202     1078     2041        0
     40     4654     2597     2141     3748        0
     41     6622     2897     2407     5368        0
     42     5443     3038     2326     3909        0
     43     6194     3157     2665     4454        0
     44    15074     6795     5393    12663        0
     45    28846    14070    14643    17613        0
     46     3827     2427     2491     3587        0
     47     4183     1995     1736     4266        0
     48     7199     3284     2834     7519        0
     49     5298     2682     2479     5592        0
     50     9469     4334     3778     8987        0
     51    11509     5330     5141    12104        0
     52    15118     7199     6947    15290        0
     53    71902    32960    41091    48920        0
     54    13745     7787     8074    15635        0
     55    16677     8448     7843    19063        0
     56    15380     8047     8192    18084        0
     57    18330     9856     9329    22259        0
     58   119684    59690    75281    91510        0
     59    24828    15542    17458    27207        0
     60    28822    16833    16021    41473        0
     61   159531    81925   106839   119585        0
     62    43870    33640    40476    55712        0
     63   282074   167577   202731   243473        0
     64   164264   103586   115089   158674        0
     65   289289   180581   214231   278638        0
     66   454460   271621   323959   404477        0
     67   344922   253329   276735   376970        0
     68   599242   363961   448486   488915        0
     69  1658414  1005609   998889  1682004        0
     70   958743   617623   732599   809341        0
     71  3470982  2541593  2202533  3734082        0
    But first I look at the distribution of bases and average phred values by read location. We see that reliability drops as we progress to the end of the read. There is also some bias near the start, probably due to the chemistry of attaching DNA strands to the sequencing machine. (It helps to understand a little about how the machines work. http://www.illumina.com/technology/s...echnology.ilmn )

    Code:
    Pos        A        C        G        T        N  Phred
      0    95617    64361    40710   108158        0  70.16
      1   108985    51943    72545    75373        0  70.02
      2    96360    57343    69604    85539        0  69.98
      3    92923    64361    64117    87445        0  69.97
      4    94660    60057    60673    93456        0  69.98
      5    94775    62953    63212    87906        0  70.00
      6    92453    60365    64227    91801        0  69.96
      7    95892    60513    61364    91077        0  69.92
      8    94593    60524    60844    92885        0  69.91
      9    94766    60955    61648    91477        0  69.89
     10    94617    60316    60935    92978        0  69.90
     11    94452    60539    61581    92274        0  69.86
     12    95178    59940    60190    93538        0  69.85
     13    94543    60446    60608    93249        0  69.80
     14    94666    59989    60635    93556        0  69.80
     15    93534    60324    61720    93268        0  69.78
     16    93717    60376    61218    93535        0  69.72
     17    93499    60992    61225    93130        0  69.70
     18    93515    61161    61151    93019        0  69.69
     19    92598    61351    61766    93131        0  69.67
     20    93152    61278    61527    92889        0  69.56
     21    93143    61225    62000    92478        0  69.63
     22    92849    61546    61474    92977        0  69.54
     23    92661    61768    61980    92437        0  69.52
     24    92809    61778    62070    92189        0  69.38
     25    92502    61611    62347    92386        0  69.42
     26    92432    62029    62648    91737        0  69.39
     27    92826    62089    62252    91679        0  69.34
     28    92736    61765    62453    91892        0  69.26
     29    92462    61591    62708    92085        0  69.26
     30    91850    62291    62821    91884        0  69.18
     31    92088    62320    62700    91738        0  69.21
     32    93030    61746    62788    91282        0  69.18
     33    92446    61744    62907    91749        0  69.10
     34    92702    62216    62296    91632        0  69.04
     35    91803    62383    62855    91804        1  68.96
     36    92667    61646    62903    91341      289  68.87
     37    92197    62641    62872    91136        0  68.83
     38    92178    62057    62837    91154      620  68.75
     39    91883    62043    62722    91622      576  68.69
     40    92337    61811    63307    91391        0  68.61
     41    91767    62239    63172    91048      620  68.57
     42    91871    62200    63084    91638       53  68.44
     43    91868    61981    62924    91688      385  68.33
     44    91546    62768    63108    91005      419  68.33
     45    91701    62578    63091    90855      621  68.25
     46    91816    62275    63588    91096       71  68.19
     47    91525    62607    63095    91383      236  68.08
     48    91080    62933    63213    91594       26  68.03
     49    91514    62853    63341    91087       51  67.92
     50    91601    62538    63314    90800      593  67.81
     51    91085    62223    63782    90949      807  67.61
     52    91394    62302    63592    90938      620  67.57
     53    91299    62200    63677    91065      605  67.53
     54    91432    62496    63535    90761      622  67.37
     55    91458    62571    63791    90401      625  67.34
     56    91847    62282    63397    90597      723  67.26
     57    92137    62021    63202    91192      294  67.15
     58    91411    62605    63432    90904      494  66.96
     59    91268    63196    63529    90559      294  66.77
     60    91598    62999    63510    90368      371  66.77
     61    91073    62880    63647    90533      713  66.62
     62    91328    62393    63387    91009      729  66.41
     63    91542    62732    63781    90418      373  66.44
     64    91635    62653    63484    90784      290  66.23
     65    91372    63018    64039    90277      140  66.05
     66    91258    63377    63458    90129      624  65.89
     67    90667    62831    63823    90870      655  65.83
     68    91640    62987    63513    90278      428  65.66
     69    90783    62632    63972    90293     1166  65.55
     70    91091    62924    64306    89521     1004  65.22
     71    90956    63068    64238    89778      806  65.10
     72    90584    62586    64216    90931      529  64.82
     73    91272    62968    64399    89490      717  64.96
     74    90544    63070    64163    90620      449  64.77
     75    90839    63188    64298    89885      636  64.73
     76    91013    63121    63479    90301      932  64.45
     77    91346    63165    64210    89829      296  64.34
     78    91477    62714    64529    89936      190  64.12
     79    90623    63122    64859    89941      301  63.84
     80    91321    63211    64084    90099      131  63.53
     81    90609    63063    63896    89803     1475  63.52
     82    91461    62719    64306    88953     1407  63.25
     83    91013    62298    64231    89127     2177  63.19
     84    90259    62896    64105    89455     2131  62.93
     85    90830    63197    64079    89832      908  62.47
     86    90668    63068    64777    89976      357  62.36
     87    89978    63200    64812    89587     1269  62.11
     88    89904    63402    64443    89401     1696  61.76
     89    89654    63119    64467    88903     2703  61.66
     90    89889    63477    64829    89520     1131  61.16
     91    89244    63907    64882    89466     1347  61.04
     92    90096    63887    64353    89136     1374  60.92
     93    90123    63699    65209    88684     1131  60.69
     94    90551    63361    64956    89134      844  60.28
     95    89328    63860    65164    88496     1998  59.80
     96    90295    63725    64476    88534     1816  59.40
     97    90035    63638    64782    88409     1982  59.06
     98    89806    64095    65008    88343     1594  59.11
     99    89668    63817    65516    88273     1572  58.35
    And then I look at 1-4 gram statistics, just ignoring N and linefeeds. These statistics depend on the biology of whatever is being sequenced. I have no idea what species this is, but it probably matters because higher organisms like humans have more non-coding and repetitive DNA than bacteria due to evolutionary pressure, and coding DNA is constrained by the genetic code, where each 3 bases codes one amino acid. Anyhow, we see some anomalies, such as the sequence CG being rare, and repeats of AAAA or its complement TTTT being common. DNA is double stranded, so we would expect that any sequence should have a similar frequency to its complement that you get by exchanging A with T and C with G and reversing the order like AATC -> GATT. I did find this was the case with e.coli from the Canterbury large corpus.

    Code:
    Seq        A        C        G        T
         9211089  6221322  6317993  9084159
      A  3104744  1538719  2125999  2441628
      C  2210457  1584409   343709  2082747
      G  1881862  1254219  1645800  1536111
      T  2014026  1843975  2202485  3023673
     AA  1220560   459466   623681   801038
     AC   618358   347246    83116   489999
     AG   693851   409154   535294   487699
     AT   641668   424398   584625   790937
     CA   588913   448528   592704   580312
     CC   562692   413772    91453   516492
     CG    83627    73970    97902    88210
     CT   394057   497408   586302   604980
     GA   661414   284568   510606   425274
     GC   430993   337940    78192   407094
     GG   505265   349318   443196   348021
     GT   350408   285033   449240   451430
     TA   633857   346157   399008   635004
     TC   598414   485451    90948   669162
     TG   599119   421777   569408   612181
     TT   627893   637136   582318  1176326
    AAA   497743   179754   232139   310925
    AAC   188501   100211    23699   147055
    AAG   217211   111824   146064   148581
    AAT   230809   135179   205722   229328
    ACA   172731   123237   158569   163821
    ACC   124946    88677    16423   117200
    ACG    19890    17284    22760    23182
    ACT   100188   109784   130567   149460
    AGA   247120   112545   179136   155050
    AGC   145019   112443    23261   128431
    AGG   169053   127437   124270   114534
    AGT   113279    90235   140013   144172
    ATA   206681   103064   113483   218440
    ATC   148313   101909    24931   149245
    ATG   163042   103320   155304   162959
    ATT   179427   174863   132189   304458
    CAA   229452    97946   128766   132749
    CAC   171381   110545    28295   138307
    CAG   174610   129192   158202   130700
    CAT   126389   111142   138674   204107
    CCA   135977   118682   156666   151367
    CCC   151839   113532    32830   115571
    CCG    19701    22647    26574    22531
    CCT    86289   134293   154423   141487
    CGA    26513    11282    25339    20493
    CGC    18418    27565     7917    20070
    CGG    25110    21427    34274    17091
    CGT    15469    18715    30483    23543
    CTA   117358    80095    84569   112035
    CTC   161205   139963    26746   169494
    CTG   148478   124874   155378   157572
    CTT   111215   140850   128284   224631
    GAA   241420    92003   150061   177930
    GAC   113764    64302    16469    90033
    GAG   169110    90802   141788   108906
    GAT   100845    78433   113914   132082
    GCA   116176    81269   129658   103890
    GCC   113071    81442    22084   121343
    GCG    16299    17664    23625    20604
    GCT    78527    87584   131151   109832
    GGA   188979    68100   146155   102031
    GGC   117394    85950    29207   116767
    GGG   135686    85690   131922    89898
    GGT    72901    65356   110879    98885
    GTA   108877    55570    83270   102691
    GTC    92650    66827    14228   111328
    GTG   123609    79916   122271   123444
    GTT    89653    89316    98006   174455
    TAA   251945    89763   112715   179434
    TAC   144712    72188    14653   114604
    TAG   132920    77336    89240    99512
    TAT   183625    99644   126315   225420
    TCA   164029   125340   147811   161234
    TCC   172836   130121    20116   162378
    TCG    27737    16375    24943    21893
    TCT   129053   165747   170161   204201
    TGA   198802    92641   159976   147700
    TGC   150162   111982    17807   141826
    TGG   175416   114764   152730   126498
    TGT   148759   110727   167865   184830
    TTA   200941   107428   117686   201838
    TTC   196246   176752    25043   239095
    TTG   163990   113667   136455   168206
    TTT   247598   232107   223839   472782
    So I merged test2 and test4 into test5 by encoding A,C,G,T -> 0,1,2,3 in the high 2 bits and the phred score in the low 6 bits. N and linefeed are coded as 0 in the high bits. This can be decoded unambiguously because N always has phred=33 and the others never do, and because linefeeds always coincide. Here are the results:

    Code:
    Test5: merge test2 and test4: "ACGTN\n" -> {0,1,2,3,0,0}*64 + phred.
    
    test 1+3+5  test 1..4   test5=2+4
    15,904,313  16,052,738  15,058,762 test5-m4.zpaq
    16,093,434  16,015,074  15,117,928 test5-o16-m454.pmm
    16,027,218  15,961,885  15,137,673 test5-8.lpaq9m
    16,034,507  16,133,261  15,150,528 test5-nm-cc.nz
    16,054,775  15,979,788  15,163,753 test5-6.lpaq9m
    16,211,459  16,583,898  15,242,376 test5-m3.zpaq
    17,287,977  17,141,095  15,362,860 test5.bsc
    17,382,080  17,214,532  15,398,738 test5-m2-b32.zpaq
    17,427,450  17,285,304  15,476,066 test5-m2.zpaq
    18,227,135  18,909,367  15,935,216 test5-m1-b32.zpaq
    18,267,532  18,980,351  16,006,457 test5-m1.zpaq
    18,544,500  18,356,538  17,091,254 test5.7z
    18,396,169  16,205,798  17,449,297 test5-nm.nz
    21,166,767  21,546,591  18,866,725 test5-9.gz
    21,590,328  22,470,515  19,097,197 test5.gz
    49,562,525  80,755,971  31,193,446 test5
    The first column is the total of the 3 files test1,3,5 compressed separately and added together. The second column is for test1,2,3,4 mostly from the previous data. The third is test5 by itself. For ppmonstr I used -o16 for all files because it gave the best compression on test5. I also added lpaq9m -8, which scores best on the 4 separate files.

    As you can see, the results are mixed. In some cases, merging the DNA and phred scores helps and sometimes it hurts. It helps with zpaq -m1, -m3, -m4, nanozip -cc, and gzip. It makes compression worse for zpaq -m2, nanozip (default BWT), lpaq9m, ppmonstr, bsc, and 7zip.

    I can see 2 ways it might help. One is that it makes the input smaller, which would help with small memory compressors like gzip. The other is that it makes the weak correlations between bases and phred scores visible. The problem is that it hides the stronger correlations among bases (independent of phred scores) and among phred scores (independent of bases).

  14. #14
    Administrator Shelwien
    How about simply deleting Ns and LFs from the ACTG file based on phred values?

  15. #15
    Member Alexander Rhatushnyak
    Quote Originally Posted by Matt Mahoney
    you need an Amazon EC2 account to access the test data. http://aws.amazon.com/ec2/pricing/
    Hm... If someone with EC2 access grabs the 6 GB file and puts it online, making it available to everyone, would that be legal or not?

    Quote Originally Posted by Matt Mahoney:
    > Also I guess lpaq was N times faster than zpaq -m4 and NN times faster than paq8px.
    Well, yes, but zpaq is constrained by archive compatibility between versions. And it's not that bad actually :D Tests on a 2.0 GHz T3200, Win32:
    Well, I meant the top three lines in your first table, with zpaq -m4 in the third line. Here they are with timing, plus two more lines, lpaq9m 9 and paq8px_v69 -8:
    Code:
    http://s3.amazonaws.com/sequencesqueeze-dev.pistoiaalliance/SRR062634.filt.fastq
    
    15,035,056 test0-8.paq8px  // 5611 seconds
    15,165,514 test0.paq8px    // with -5 ?  version px_v69 ?
    16,046,060 test0-9.lpaq9m  //  89.3 seconds
    16,115,901 test0-6.lpaq9m  //  83.2 seconds
    16,235,639 test0-m4.zpaq   // 230 seconds
    lpaq, zpaq: time is the minimum of three runs, with nul as the archive name when measuring time.
    paq8px_v69: only one run, and the archive name is not nul.
    Tested on my desktop, Intel Core i7 930, 2.8 GHz. zpaq -m4 is much faster than I expected.

    Quote Originally Posted by Matt Mahoney
    The first column is the total of the 3 files test1,3,5 compressed separately and added together.
    Can you please share either tests 1,3,5 or the code that builds them?


  16. #16
    Member m^2
    Quote Originally Posted by Alexander Rhatushnyak:
    Hm... If someone having EC2 grabs the 6G file and puts it online, makes it available to everyone, will it be legal or not?
    Depends. Under TRIPS, databases are copyrighted, so in countries that signed it (most did) there is a significant barrier. Still it might be fair use or something.

    Though the copyrighted part is the collection, not the data itself. One could select a subset of it, or rearrange it (necessary unless it's arranged in some obvious way), and there would be no copyright held by the original creators.
    Also, if both the selection and the arrangement are obvious already, there might be no copyright at all. So overall it's a tough question.


    Asking for permission could work too. ;)

  17. #17
    Expert Matt Mahoney
    I made test5 like this, where f2, f4, f5 are test2, test4, test5. The upper 2 bits aren't really A,C,G,T because the phred value (c4) might be more than 63. But it is still unambiguous. If I was going to use this for an actual transform I would offset the phred values to 1..40 and use 0 for LF or something like that.

    Code:
      int c2, c4;
      const char dna[]="ACGTN";
      while ((c2=getc(f2))!=EOF && (c4=getc(f4))!=EOF) {
        const char* p=strchr(dna, c2);
        if (p && p-dna<4) {
          assert(c4>=35 && c4<=71);
          putc((p-dna)*64+c4, f5);
        }
        else {
          assert((c2==10 && c4==10) || (c2=='N' && c4==33));
          putc(c4, f5);
        }
      }

  18. #18
    Member (Cambridge, UK)
    Quote Originally Posted by Matt Mahoney:
    The leaderboard at http://www.sequencesqueeze.org/ shows James Bonfield, who I recognized as the author of the SRF format for DNA sequence data that I wrote a compressor for at Ocarina. So I searched sourceforge.net for "fastq" and found 2 compressors out of 11 programs, KungFq by Elena Grassi (last updated Aug. 3 2011) and fqzcomp by James Bonfield (last updated Nov. 11, 2011).

    KungFq is a Windows program that is supposed to do some input transform and compress with either zlib or lzma. However, it crashed on this file (Java bad number format exception). fqzcomp also does some transform, I guess create separate streams, and compresses with zlib. It came with source only, apparently for Linux, and must be linked with zlib. So I installed zlib (zlib.net, ./configure, make, sudo make install), changed icc to g++ in Makefile and compiled. There are 2 programs that read stdin to stdout: the compressor, fastq2fqz, and the decompresser, fqz2fast2. The result is 18,825,692, which is beaten by about half of the programs above.
    OK, it's time to stop being a lurker and register on the forums I guess! Well remembered about SRF too, Matt. (A format that was *reasonably* well compressed in places and completely scuppered with uncompressed data elsewhere - a work in progress which became obsolete!)

    Anyway the fqz coder is pretty noddy. It was something I'd already written a year ago, with minimal interest from the community at the time. I had to tweak it slightly for this, as their fastq format differs in the addition of extra columns in the name, but other than that it's pretty much my old code - hence being able to submit a new baseline (it won't last for long) very quickly.

    It does as you expected, separating by data type and applying some basic transformations. Specifically it does string delta on the names. So @SRR1234.431 followed by @SRR1234.488 becomes <10> bytes the same as before plus "88". Bizarrely this is so much more efficient than LZ encoding, maybe because knowledge of the newline placement means no distance is required. Better still would be to tokenise into alpha and numeric fields and compress each independently, via tokens for string delta or "same as before", and numeric deltas as required. The DSRC fastq compressor does this.
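
    A rough sketch of that kind of name delta (not the actual fqzcomp code): emit the length of the common prefix with the previous name, then the differing suffix.

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main(void) {
      char prev[1024] = "", cur[1024];
      while (fgets(cur, sizeof(cur), stdin)) {
        cur[strcspn(cur, "\n")] = 0;
        size_t i = 0;
        while (cur[i] && cur[i]==prev[i]) i++;   // common prefix length
        printf("<%zu>%s\n", i, cur+i);           // e.g. @SRR1234.488 -> <10>88
        strcpy(prev, cur);
      }
      return 0;
    }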

    The sequence data gets packed into codons - that is, triplets. I simply treat each triplet as a 3-digit base-5 number (ACGTN), i.e. map[seq[i]]*25+map[seq[i+1]]*5+map[seq[i+2]]. There's some hand-wavy gut feeling that triplets make sense biologically because the triplet is the unit of conversion from DNA to protein via tRNAs, but realistically that's nonsense - the benefit comes from alphabet size. Huffman encoding is just rubbish with small alphabets, so going from 5 symbols to 125 massively reduces the Huffman overhead. It has the side effect of making the sequence data 1/3rd the size and dramatically speeding up zlib en/decoding too.
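
    As a sketch, the triplet packing amounts to something like this (the ACGTN -> 0..4 mapping order is an assumption, and handling of the 1-2 leftover bases at the end of each read is omitted):

    Code:
    #include <stdio.h>
    #include <string.h>
    
    static int map5(int c) {                  // A,C,G,T,N -> 0..4 (assumed order)
      const char* dna = "ACGTN";
      const char* p = strchr(dna, c);
      return p ? (int)(p-dna) : 4;            // anything unexpected treated as N
    }
    
    int main(void) {
      char seq[4096];
      while (fgets(seq, sizeof(seq), stdin)) {
        size_t n = strcspn(seq, "\n");
        for (size_t i=0; i+2<n; i+=3)         // one output byte (0..124) per codon
          putchar(map5(seq[i])*25 + map5(seq[i+1])*5 + map5(seq[i+2]));
      }
      return 0;
    }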

    The quality data is hard to compress with zlib, so I left it as is. I've had good luck compressing it with a low-order range coder, using the previous symbols and the distance since the last newline as context. Shelwien, what license is your example code in coders6c2 under? The range coder can be trivially swapped out for the ppmd one, which looks almost identical anyway, and the model in coders6c2 looks to be ppmd-derived too (from the same author), but there is no explicit statement of public-domain status or BSD licensing in your coders6c2, so I'm reluctant to use it in any public code I release without first checking.

    Anyway, back to fqzcomp - it's actually significantly faster than the sequencesqueeze chart implies, too fast in fact. Their baseline looks to be just tar+gzip, and I know I beat it even on decompression by a good margin. However the EC2 filesystem is likely pretty slow, and probably the dominant factor for light-weight compression tools.

    I think there's a good compromise of simple modelling and coding that will give high, but not the best, compression without suffering PAQ-level speed. IIRC I got to 16.3 MB or so while still taking only a few seconds.

    James

    PS. In answer to a few other questions. I'd assume the full data set is available from NCBI's SRA (http://trace.ncbi.nlm.nih.gov/Traces...ew=run_browser) but I haven't looked. I don't know what filtering they did.

    As for the time and memory figures - I've no idea what units they're in. My guess is hundredths of a second and KB, but it's really not very clear. They don't publish the size of their test set either.

  19. #19
    Member (Cambridge, UK)
    Quote Originally Posted by Shelwien:
    How about simply deleting Ns and LFs from the ACTG file based on phred values?
    Some bioinformatics tools do this already (e.g. the MAQ aligner): 2 bits of base type (ACGT) and 6 bits of quality, with quality 0 meaning the base is N. Technically it can be lossy, as the Illumina sequencing instrument software that produces this data sometimes has a habit of assigning different quality values (the "phred" scores) even when the base is N (aNy, or unknown), which is plain nonsense. Such lossiness is probably a feature rather than a problem.

  20. #20
    Administrator Shelwien
    1. All aridemo sources, including the codersXXX updates (there are 6a, 6b, 6c, 6c2), are public domain.
    But the bitwise coders would likely be better for this case, because they allow for more parameter optimization.

    2. How about testing m1? https://sites.google.com/site/toffer86/m1-project
    Especially after optimizing some profiles for data types with it?

    3. The coder can assume that the phred score for N is fixed, but still encode a match flag for it.
    When a flag bit is always the same and a good rangecoder is used, there's nearly no redundancy,
    so the results would be valid even if the initial implementation is lossy and doesn't encode any such flags.
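
    To put a rough number on point 3: the cost of coding an always-zero flag with a simple adaptive bit model (12-bit probability, shift-by-5 update, roughly the style of the bitwise coders discussed here) can be estimated like this:

    Code:
    #include <stdio.h>
    #include <math.h>
    
    int main(void) {
      int p = 2048;                  // P(bit=1), scaled to 0..4096
      double bits = 0;
      long n = 1000000;              // a million constant flags
      for (long i=0; i<n; i++) {
        bits += -log2((4096.0-p)/4096.0);  // cost of coding a 0 at probability 1-p
        p -= p>>5;                         // adapt towards 0
      }
      printf("%ld constant flags cost about %.0f bits total\n", n, bits);
      return 0;
    }
    With this update rule the probability bottoms out near 31/4096, i.e. roughly 0.01 bit per flag once the model has adapted.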

  21. #21
    Expert Matt Mahoney
    The bitwise arithmetic encoder (and everything else) in libzpaq is public domain. Actually it's the MIT license with the requirement to include the copyright notice removed; that was the only restriction, so it's really the same thing.

    Anyway I was looking at the DNA bases (test2) and wondering if it would be worthwhile to do a transform where some of the reads are reverse-complemented (like AAAC -> GTTT) to increase redundancy, in case both strands of the sequence make it into the machine. I'm not sure if the test file is a snippet of a much larger file, in which case the technique wouldn't work. The same read needs to appear several times if you want to reassemble the original sequence.
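
    For reference, a minimal reverse-complement sketch (exchange A with T and C with G, then reverse the order; N maps to itself, and the read is assumed to contain only ACGTN):

    Code:
    #include <stdio.h>
    #include <string.h>
    
    static void revcomp(char* s, size_t n) {
      const char *from = "ACGTN", *to = "TGCAN";
      for (size_t i=0; i<(n+1)/2; i++) {
        char a = s[i], b = s[n-1-i];
        s[i]     = to[strchr(from, b) - from];
        s[n-1-i] = to[strchr(from, a) - from];
      }
    }
    
    int main(void) {
      char read[] = "AAAC";
      revcomp(read, strlen(read));
      puts(read);                    // prints GTTT
      return 0;
    }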

    Following is a list of all sequences not containing N up to length 31 that appear at least 100 times. A sequence is shorter than 31 only if bounded at both ends either by N or by the start or end of the read. Thus, a read of length 100 with no N would account for 70 sequences.

    Code:
    1535 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    1180 ATCGGAAGAGCGTCGTGTAGGGAAAGAGTG
    1105 ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    1059 CGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
    1016 CGGAAGAGCGTCGTGTAGGGAAAGAGTGTAG
     969 GGAAGAGCGTCGTGTAGGGAAAGAGTGTAGA
     930 GAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT
     893 AAGAGCGTCGTGTAGGGAAAGAGTGTAGATC
     846 AGCGTCGTGTAGGGAAAGAGTGTAGATCT
     813 AGCGTCGTGTAGGGAAAGAGTGTAGATCTC
     772 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
     768 AGCGTCGTGTAGGGAAAGAGTGTAGATCTCG
     727 CGTCGTGTAGGGAAAGAGTGTAGATCTCGG
     664 CGTCGTGTAGGGAAAGAGTGTAGATCTCGGT
     659 TCCATTCCATTCCATTCCATTCCATTCCATT
     644 GGAATGGAATGGAATGGAATGGAATGGAA
     641 GTCGTGTAGGGAAAGAGTGTAGATCTCGGTG
     628 TCCATTCCATTCCATTCCATTCCATTCCAT
     627 CCATTCCATTCCATTCCATTCCATTCCATTC
     624 GGAATGGAATGGAATGGAATGGAATGGAAT
     618 CATTCCATTCCATTCCATTCCATTCCATTCC
     617 TCGTGTAGGGAAAGAGTGTAGATCTCGGTGG
     599 AATGGAATGGAATGGAATGGAATGGAATGGA
     585 GAATGGAATGGAATGGAATGGAATGGAATGG
     565 CGTGTAGGGAAAGAGTGTAGATCTCGGTGGT
     556 ATTCCATTCCATTCCATTCCATTCCATTCCA
     538 GGAATGGAATGGAATGGAATGGAATGGAATG
     522 GTGTAGGGAAAGAGTGTAGATCTCGGTGGTC
     493 TGTAGGGAAAGAGTGTAGATCTCGGTGGTCG
     469 GTAGGGAAAGAGTGTAGATCTCGGTGGTCGC
     447 ACACACACACACACACACACACACACACAC
     439 TAGGGAAAGAGTGTAGATCTCGGTGGTCGCC
     423 ACACACACACACACACACACACACACACACA
     418 AGGGAAAGAGTGTAGATCTCGGTGGTCGCCG
     415 ATCGGAAGAGCGTCGTGTAGGGAAAGAGT
     366 GGAAAGAGTGTAGATCTCGGTGGTCGCCGT
     352 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
     338 GGAAAGAGTGTAGATCTCGGTGGTCGCCGTA
     334 TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
     330 GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
     319 GAAAGAGTGTAGATCTCGGTGGTCGCCGTAT
     299 AAAAAAAAAAAAAAAA
     297 AAAGAGTGTAGATCTCGGTGGTCGCCGTATC
     272 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
     271 AGTGTAGATCTCGGTGGTCGCCGTATCA
     265 AAAAAAAAAAAAAAAAA
     263 GGCTCACGCCTGTAATCCCAGCACTTTGGGA
     258 TGGCTCACGCCTGTAATCCCAGCACTTTGGG
     257 AGTGTAGATCTCGGTGGTCGCCGTATCAT
     256 CTCACGCCTGTAATCCCAGCACTTTGGGAGG
     255 GCTCACGCCTGTAATCCCAGCACTTTGGGAG
     250 TCACGCCTGTAATCCCAGCACTTTGGGAGGC
     244 AGTGTAGATCTCGGTGGTCGCCGTATCATT
     232 AAAAAAAAAAAAAAAAAA
     227 CCAGCTACTCGGGAGGCTGAGGCAGGAGAAT
     226 CCCAGCTACTCGGGAGGCTGAGGCAGGAGAA
     219 AGTGTAGATCTCGGTGGTCGCCGTATCATTA
     207 TGTAGATCTCGGTGGTCGCCGTATCATTAA
     206 CCCAAAGTGCTGGGATTACAGGCGTGAGCCA
     204 TCCCAAAGTGCTGGGATTACAGGCGTGAGCC
     201 AAAAAAAAAAAAAAAAAAA
     200 CTCCCAAAGTGCTGGGATTACAGGCGTGAGC
     200 CCAAAGTGCTGGGATTACAGGCGTGAGCCAC
     199 CCCAGCTACTCAGGAGGCTGAGGCAGGAGAA
     199 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT
     195 CCTCCCAAAGTGCTGGGATTACAGGCGTGAG
     195 AAAGAAAGAAAGAAAGAAAGAAAGAAAGAAA
     194 CCAGCTACTCAGGAGGCTGAGGCAGGAGAAT
     194 AAAGAAAGAAAGAAAGAAAGAAAGAAAGAA
     191 AAAGAAAGAAAGAAAGAAAGAAAGAAAGA
     190 GTGGCTCACGCCTGTAATCCCAGCACTTTGG
     190 AAAGAAAGAAAGAAAGAAAGAAAGAAAG
     186 CACGCCTGTAATCCCAGCACTTTGGGAGGCC
     185 TGTAGATCTCGGTGGTCGCCGTATCATTAAA
     183 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     180 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
     171 CCCAGCTACTTGGGAGGCTGAGGCAGGAGAA
     169 AGAGAGAGAGAGAGAGAGAGAGAGAGAGAG
     167 AGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
     166 CCAGCTACTTGGGAGGCTGAGGCAGGAGAAT
     166 AAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     165 TGGCTCATGCCTGTAATCCCAGCACTTTGGG
     165 GGCTCATGCCTGTAATCCCAGCACTTTGGGA
     163 AAAAAAAAAAAAAAAAAAAA
     162 CCTGTAATCCCAGCACTTTGGGAGGCCGAGG
     158 TCATGCCTGTAATCCCAGCACTTTGGGAGGC
     158 GCTCATGCCTGTAATCCCAGCACTTTGGGAG
     157 GTAGATCTCGGTGGTCGCCGTATCATTAAAA
     156 TTTTGTATTTTTAGTAGAGACGGGGTTTCAC
     156 CTCATGCCTGTAATCCCAGCACTTTGGGAGG
     152 CTCCCAAAGTGCTGGGATTACAGGCATGAGC
     151 TCCCAAAGTGCTGGGATTACAGGCATGAGCC
     149 CCTCCCAAAGTGCTGGGATTACAGGCATGAG
     149 CCCAAAGTGCTGGGATTACAGGCATGAGCCA
     149 CAAAGTGCTGGGATTACAGGCGTGAGCCACC
     146 TTTGTATTTTTAGTAGAGACGGGGTTTCACC
     146 CTGTAATCCCAGCACTTTGGGAGGCCGAGGC
     145 CTGTAATCCCAGCTACTCAGGAGGCTGAGGC
     145 CCAAAGTGCTGGGATTACAGGCATGAGCCAC
     144 CTCGGCCTCCCAAAGTGCTGGGATTACAGGC
     143 GCCTCCCAAAGTGCTGGGATTACAGGCGTGA
     142 TGTAATCCCAGCTACTCAGGAGGCTGAGGCA
     142 GGCTCACACCTGTAATCCCAGCACTTTGGGA
     141 GTAATCCCAGCTACTCAGGAGGCTGAGGCAG
     141 AAAAAAAAAAAAAAAAAAAAA
     139 CTCACACCTGTAATCCCAGCACTTTGGGAGG
     138 TGGCTCACACCTGTAATCCCAGCACTTTGGG
     138 GCTCACACCTGTAATCCCAGCACTTTGGGAG
     137 TAGATCTCGGTGGTCGCCGTATCATTAAAAA
     136 TGAAACCCCGTCTCTACTAAAAATACAAAAA
     136 TCACACCTGTAATCCCAGCACTTTGGGAGGC
     130 GTGAAACCCCGTCTCTACTAAAAATACAAAA
     130 CTCAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     130 CCCAGCTACTCAGGAGGCTGAGGCAGGAG
     130 CAGCTACTCAGGAGGCTGAGGCAGGAGAATC
     129 TAATCCCAGCTACTCAGGAGGCTGAGGCAGG
     129 AATCCCAGCTACTCAGGAGGCTGAGGCAGGA
     128 TCAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     128 CCCAGCTACTCAGGAGGCTGAGGCAGGAGA
     127 GTAGTCCCAGCTACTCGGGAGGCTGAGGCAG
     127 CTGTAGTCCCAGCTACTCGGGAGGCTGAGGC
     125 TGTAGTCCCAGCTACTCGGGAGGCTGAGGCA
     125 AAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     123 TGTAATCCCAGCTACTTGGGAGGCTGAGGCA
     122 GCCTGTAATCCCAGCACTTTGGGAGGCCGA
     120 CTGTAATCCCAGCTACTTGGGAGGCTGAGGC
     119 GTTAGCCAGGATGGTCTCGATCTCCTGACCT
     119 GCCTGTAATCCCAGCACTTTGGGAGGCCGAG
     118 CCTCGGCCTCCCAAAGTGCTGGGATTACAGG
     118 AGTCCCAGCTACTCGGGAGGCTGAGGCAGGA
     117 TCCCAGCTACTCGGGAGGCTGAGGCAGGAGA
     117 TCCCAGCTACTCGGGAGGCTGAGGCAGGAG
     117 TAGTCCCAGCTACTCGGGAGGCTGAGGCAGG
     117 TAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     117 TAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     117 AAAAAAAAAAAAAAAAAAAAAA
     116 TTTTGTATTTTTAGTAGAGATGGGGTTTCAC
     116 TGAAACCCCATCTCTACTAAAAATACAAAAA
     116 GTAATCCCAGCTACTTGGGAGGCTGAGGCAG
     116 CTGTAATCCCAGCACTTTGGGAGGCCAAGGC
     116 CCTGTAATCCCAGCACTTTGGGAGGCCAAGG
     116 CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     114 GGTGGCTCACGCCTGTAATCCCAGCACTTTG
     114 ACGCCTGTAATCCCAGCACTTTGGGAGGCCG
     113 CCTCCCAAAGTGCTGGGATTACAGGTGTGAG
     113 AGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     113 AATCCCAGCTACTCGGGAGGCTGAGGCAGGA
     112 CTGTAATCCCAGCTACTCGGGAGGCTGAGGC
     112 CTCAGCCTCCCAAAGTGCTGGGATTACAGGC
     112 AGATCTCGGTGGTCGCCGTATCATTAAAAAA
     111 TTTCTTTCTTTCTTTCTTTCTTTCTTTCTTT
     111 CTCCCAAAGTGCTGGGATTACAGGTGTGAGC
     111 CATGCCTGTAATCCCAGCACTTTGGGAGGCC
     110 TTTGTATTTTTAGTAGAGATGGGGTTTCACC
     110 GTGGCTCATGCCTGTAATCCCAGCACTTTGG
     110 GTGAAACCCCATCTCTACTAAAAATACAAAA
     110 ATATATATATATATATATATATATATATATA
     109 TGTAATCCCAGCTACTCGGGAGGCTGAGGCA
     109 CTTTCTTTCTTTCTTTCTTTCTTTCTTTCTT
     109 ATATATATATATATATATATATATATATAT
     108 TTCTTTCTTTCTTTCTTTCTTTCTTTCTTTC
     108 TCCCAAAGTGCTGGGATTACAGGTGTGAGCC
     108 TAATCCCAGCTACTTGGGAGGCTGAGGCAGG
     108 CCTGTAATCCCAGCACTTTGGGAGGCTGAGG
     108 CCCAGCTACTCGGGAGGCTGAGGCAGGAGA
     107 TCTCCTGCCTCAGCCTCCCGAGTAGCTGGG
     107 GTAATCCCAGCTACTCGGGAGGCTGAGGCAG
     107 CCCAGCTACTCGGGAGGCTGAGGCAGGAG
     107 AATGGAATGGAATGGAATGCAATGGAATGGA
     106 TCTTTCTTTCTTTCTTTCTTTCTTTCTTTCT
     106 GAGATCGGAAGAGCGTCGTGTAGGGAAAGAG
     105 CTGTAATCCCAGCACTTTGGGAGGCTGAGGC
     105 CCCAGCTACTTGGGAGGCTGAGGCAGGAGA
     105 CCCAAAGTGCTGGGATTACAGGTGTGAGCCA
     105 CAGCTACTCGGGAGGCTGAGGCAGGAGAATC
     105 ATTCCATTCCATTCCATTCCATTCCATTCCA
     104 GGAATGGAATGGAATGCAATGGAATGGAA
     104 GCCTCCCAAAGTGCTGGGATTACAGGCATGA
     103 TAATCCCAGCTACTCGGGAGGCTGAGGCAGG
     103 GAAGAAATCCCGTTTCCAACGAAGGCCTCAA
     102 CCTCAGCCTCCCAAAGTGCTGGGATTACAGG
     102 CCCAGCTACTTGGGAGGCTGAGGCAGGAG
     102 AATCCCAGCTACTTGGGAGGCTGAGGCAGGA
     101 TCTCCTGCCTCAGCCTCCCGAGTAGCTGGGA
     101 GATTCCATTCCATTCCATTCCATTCCATTCC
     101 CCCATCTCTACTAAAAATACAAAAATTA
     101 AAAAAAAAAAAAAAAAAAAAAAA
     100 TCACAGAGTTTAACCTTTCTTTTCATAGA
     100 CAAAGTGCTGGGATTACAGGCATGAGCCACC
     100 ATCTCGGTGGTCGCCGTATCATTAAAAAAA
    Following is the number of sequences not containing N that appear between 1 and 99 times. The first column is for sequences of length 1 to 30. The second column is length 31. For example, there are 26 sequences of length 31 that appear 99 times.

    Code:
    Count  1..30   31
     1 13903368 11898146
     2   170686   245876
     3    39081    66966
     4    20465    32782
     5    13318    19941
     6     9173    13402
     7     6900     9899
     8     5284     7140
     9     3895     5513
    10     3353     4324
    11     2501     3635
    12     2044     2992
    13     1813     2555
    14     1401     2006
    15     1126     1727
    16     1080     1526
    17      906     1301
    18      695     1199
    19      611     1043
    20      565      961
    21      515      827
    22      459      765
    23      429      650
    24      318      647
    25      270      605
    26      229      580
    27      251      477
    28      234      449
    29      229      443
    30      220      394
    31      178      388
    32      149      355
    33      152      314
    34      147      304
    35      141      283
    36      128      294
    37      128      252
    38       79      254
    39      101      231
    40       91      225
    41       81      192
    42       66      187
    43       50      175
    44       51      161
    45       64      142
    46       56      147
    47       51      119
    48       47      134
    49       35      101
    50       47      110
    51       33       97
    52       29       99
    53       25      100
    54       29       86
    55       25       82
    56       25       87
    57       31       73
    58       23       71
    59       30       82
    60       28       60
    61       21       74
    62       14       71
    63        9       65
    64       24       62
    65       16       55
    66       18       62
    67       20       63
    68        6       47
    69       13       50
    70       12       50
    71        5       59
    72       15       48
    73        7       46
    74        8       36
    75       12       53
    76        9       47
    77       13       48
    78       10       38
    79       10       47
    80       10       48
    81        5       45
    82        3       50
    83        5       33
    84        3       35
    85        6       44
    86        7       28
    87        8       35
    88        6       42
    89        5       30
    90       10       42
    91        6       39
    92        6       28
    93        3       36
    94        4       30
    95        7       32
    96        3       32
    97        6       31
    98        2       24
    99        3       26

  22. #22
    Expert Matt Mahoney
    I tried an experiment where I complemented DNA reads if doing so improved the number of partial matches to other reads elsewhere in the file. I found plenty of cases of such complementary matches over long strings. Unfortunately, the results were poor. Complementing a read saved on average 0.5 bits with zpaq -m2, ppmonstr, and bsc (and did worse with lpaq9m), but it takes at least 1 bit to encode the direction of the read so it can be decompressed. The result is that the files end up larger.

    The first set below shows the original data. In the second, I reversed reads where there were more matches going the other way but did not encode the direction, so that the effect of reversal alone could be measured. In the third set, I encoded a reversed read by changing the trailing newline to a space. (I tried other encodings with similar effects.)

    Code:
    31,193,446 test2 (308846 DNA reads of length 100, newline terminated)
     7,256,807 test2-8.lpaq9m
     7,270,003 test2-m2-b32.zpaq
     7,128,330 test2-o16-m556.pmm
     7,313,659 test2.bsc
    60,162,245 bytes
    
    31,193,446 test6 lossy, original=261019 backward=47326
     7,283,664 test6-8.lpaq9m
     7,268,396 test6-m2-b32.zpaq
     7,125,006 test6-o16-m556.pmm
     7,311,914 test6.bsc
    60,182,426 bytes
    
    31,193,446 test6 read direction encoded by changing '\n' to ' '
     7,288,219 test6-8.lpaq9m
     7,293,044 test6-m2-b32.zpaq
     7,173,666 test6-o16-m556.pmm
     7,340,591 test6.bsc
    60,288,966 bytes
    I am including code below so you can see exactly what I did, although the code is useless. I reversed a read by exchanging A with T and C with G and reversing the direction. N is unchanged. I made 10 passes over the data, looking at every 3, 5, 7..., 21 reads modulo the data size and comparing to previous reads. (This works because all of these step sizes are relatively prime to the number of reads, which has prime factors 2*154423). I found that more than 6-8 passes had no effect.

    To make the search and comparison reasonably fast, I used Rabin-style rolling-hash parsing (as used for deduplication) to segment the DNA strings on position-independent but content-dependent boundaries. The hashes index a bitmap with no collision detection. If a bit is set, it means the same hash was seen before. I compute the hash in both directions, counting the number of segments and the number of matches each way. If the fraction of matches in the reverse direction is greater than in the forward direction, and the number of reverse matches is greater by at least 2, then I complement the string. I experimented with different thresholds for reversal, but consistently failed to see any improvement. The more strings I complemented, the worse it got.

    The hash function is h = (h+c)*10, where c is mapped "ACNGT" -> 0..4 (or 4..0 when testing in the other direction). When the high 4 bits (of 32) are 1111, then I mark a boundary and take the low 28 bits as the hash index into a bit vector of size 2^28, and set h=0. The hash is such that the low 28 bits depend on the last 28 bases, that the minimum length is 10, and that a boundary occurs with probability 1/16 after that. Thus, segments have an average length of about 18. This is long enough to make chance collisions unlikely. I threw away the first hash, which is of a partial segment, and compared the rest so that they would have content-dependent boundaries on both ends.

    I found experimentally that about 1/3 of the reads were better matched in the forward direction, 1/3 reversed, and 1/3 made no difference. Other hash functions gave fewer reversed matches. I tried changing the multiplier from 10 to 14, which reduced the minimum length, or testing the high 3 bits instead of 4, which gave a shorter average length. Note that if the file were much larger, I would need a 64 bit hash and a larger bit vector.

    To mark a complemented string I XOR the trailing newline with ' '^'\n' which changes it to a space, or changes it back if it gets un-reversed in a later pass. For the second test above, I commented out this line.

    Code:
    // t2.cpp - transform test2 -> test6
    // Replace reads with their complements if it improves compression.
    // Encode reversed reads by changing the trailing '\n' to ' '.
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <vector>
    #include <assert.h>
    using namespace std;
    
    int main() {
      FILE* in=fopen("test2", "rb");
      if (!in) return perror("test2"), 1;
      vector<char> v;  // read input into v
      for (int c; (c=getc(in))!=EOF;) v.push_back(c);
      fclose(in);
      assert(v.size()==31193446);  // prime factors are 101*2*154423
      const char dna[]="ACNGT";  // translation table
      for (int pass=0; pass<10; ++pass) {
        int revcount[4]={0};  // output stats
        vector<bool> s(1<<28);  // hash index
        for (int ii=0; ii<int(v.size()); ii+=101) {
          int i=ii*(pass*2+3)%v.size();
          assert(v[i+100]==10 || v[i+100]==' ');
          int fh=0, fm=0, rh=0, rm=0;  // forward/reverse hashes/matches
          uint32_t h=0;
          for (int j=99; j>=0; --j) {  // hash in reverse -> rm/rh
            const char* p=strchr(dna, v[i+j]);
            assert(p);
            h=(h+4-(p-dna))*10;
            if ((h>>28)==15) {
              h&=(1<<28)-1;
              if (++rh>1) {
                rm+=s[h];
              }
              h=0;
            }
          }
          h=0;
          for (int j=0; j<100; ++j) {  // hash forward -> fm/fh
            const char* p=strchr(dna, v[i+j]);
            assert(p);
            h=(h+(p-dna))*10;
            if ((h>>28)==15) {
              h&=(1<<28)-1;
              if (++fh>1) {
                fm+=s[h];
                s[h]=true;
              }
              h=0;
            }
          }
    
          // Replace read with complement and mark by changing \n to space
          int rev=rm*fh-fm*rh;
          if (rev>0 && rm>fm+1) {
            for (int j=0, k=99; j<=k; j++, k--) {
              char t1=v[i+j], t2=v[i+k];
              const char* p1=strchr(dna, t1);
              assert(p1);
              const char* p2=strchr(dna, t2);
              assert(p2);
              v[i+k]=dna[4-(p1-dna)];
              v[i+j]=dna[4-(p2-dna)];
            }
            v[i+100]^=' '^'\n';
            ++revcount[2+(v[i+100]=='\n')];
          }
          else
            ++revcount[(v[i+100]==' ')];
    
          // save hashes
          fh=fm=0;
          for (int j=0; j<100; ++j) {  // hash forward
            const char* p=strchr(dna, v[i+j]);
            assert(p);
            h=(h+(p-dna))*10;
            if ((h>>28)==15) {
              h&=(1<<28)-1;
              if (++fh>1) {
                fm+=s[h];
                s[h]=true;
              }
              h=0;
            }
          }
        }
        printf("original=%d backward=%d reversed=%d undone=%d\n",
                revcount[0], revcount[1], revcount[2], revcount[3]);
      }
    
      // write test6
      FILE* out=fopen("test6", "wb");
      if (!out) perror("test6");
      for (int i=0; i<int(v.size()); ++i)
        putc(v[i], out);
      fclose(out);
      printf("created test6\n");
      return 0;
    }

  23. #23
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,887
    Thanks
    428
    Thanked 1,965 Times in 1,137 Posts
    Afaik the best idea for such things is to create artificial additional streams
    (complementary, with reversed blocks for DNA palindromes, etc.) and train some submodels on these.

    Btw, is it possible with zpaql?
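
    As a plain illustration of that idea (a minimal sketch, assuming newline-terminated reads in a file named test2 as in the posts above), one could write a reverse-complement copy of each read as an extra stream for a submodel to train on:

    Code:
    // revcomp_stream.cpp - sketch: emit a reverse-complement copy of each
    // read so a submodel can be trained on both orientations.
    #include <cstdio>
    #include <string>
    using namespace std;
    
    static char comp(char c) {              // complement one base; N stays N
      switch (c) {
        case 'A': return 'T';  case 'T': return 'A';
        case 'C': return 'G';  case 'G': return 'C';
        default:  return 'N';
      }
    }
    
    int main() {
      FILE* in  = fopen("test2", "rb");     // newline-terminated reads
      FILE* out = fopen("test2.rc", "wb");  // artificial complementary stream
      if (!in || !out) return perror("open"), 1;
      string read;
      for (int c; (c = getc(in)) != EOF;) {
        if (c == '\n') {
          for (int i = int(read.size()) - 1; i >= 0; --i)
            putc(comp(read[i]), out);
          putc('\n', out);
          read.clear();
        } else
          read.push_back(char(c));
      }
      fclose(in);
      fclose(out);
      return 0;
    }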

    Also, there's that thing which Osman tried to use...
    http://nishi.dreamhosters.com/chanlo...2014%3A28%3A19

  24. #24
    Member
    Join Date
    Dec 2011
    Location
    Germany / Hessen
    Posts
    18
    Thanks
    0
    Thanked 1 Time in 1 Post

    Hello, I'm new here :)

    What a pity! All you pros working on it ^^
    I already hoped for easy money when I saw the first two entries :D

    Did you already get the 6 GB file?

    6 GB sample data: http://s3.amazonaws.com/sequencesque...4_1.filt.fastq

    77 MB sample data: http://s3.amazonaws.com/sequencesque...634.filt.fastq

    Just open the links :)

    But as I understood it, even the 6 GB file isn't the file used to generate the final results.
    (Otherwise they would probably need some restrictions on the size of the program itself :D )

    An interesting competition, I think.

    But I dislike the winning conditions.

    They should provide a factor for how they weight speed / size / memory...
    As you already mentioned (and I also tested), paq for example outperforms the current leader by far on compression ratio... but is pretty lame ;) Unfortunately you don't know if that is what they want...

    And this:
    First place in the leaderboard in any single category does not necessarily mean overall winner.
    Well... not very transparent... they have some potential for optimization ^^

    But still a very interesting competition, I think.
    And a nice prize ^^

    You already had some nice ideas for FASTQ-specific optimization, keep going :)

    I wish you much success.

  25. #25
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    666
    Thanks
    366
    Thanked 287 Times in 149 Posts
    Quote Originally Posted by Shelwien View Post
    How about simply deleting Ns and LFs from the ACTG file based on phred values?
    I tested this and it seems to work better than a merge. Results:

    Code:
        31.193.446 test2
        31.143.409 test2_no_n
         7.708.641 test2_no_n_2bit
    
         7.387.173 test2.zpq_m4.zpaq
         7.371.207 test2_no_n.zpq_m4.zpaq
    
         6.916.848 test2_no_n_2bit.zpq_m4.zpaq
      +  7.819.988 test4.zpq_m4.zpaq
      = 14.736.836
    
         6.916.848 test2_no_n_2bit.zpq_m4.zpaq
      +  7.756.029 test4_no_hashes.zpq_m4.zpaq
      = 14.672.877
    'no_n' has the 50037 'N' characters removed (those can be reconstructed from test4). After that, 'no_n_2bit' packs 4 ACTG values into a byte, ignoring LFs.

    Together with my previous result for test4, we get 14,672,877 bytes which is better than 15,058,762 bytes from the merge.

    The assumptions needed to keep this lossless are that we can reinsert 'N' and LF afterwards. For this testset we know that an 'N' character always has a phred value of 33 and vice versa, and that every line is 100 characters long (also, LFs are at the same positions in test2 and test4), so it works.

    EDIT: I just noticed that it could be better to replace 'N' with some value instead of removing it, keeping the structure intact, and this indeed gives a slightly better result. In this case the mapping is (T,C,G,A,N -> 00,01,10,11,00). We could even replace each 'N' with a different, chosen value, e.g. to support match finding. (A sketch of this packing follows the results below.)

    Code:
         7.721.150 test2_with_n_00_2bit
         6.914.517 test2_with_n_00_2bit.zpq_m4.zpaq
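
    For illustration, a minimal sketch of the packing described in the edit above (the bit order within a byte is an arbitrary choice here; N positions are assumed recoverable from the quality stream and LFs from the fixed 100-base read length, as noted earlier):

    Code:
    // pack2bit.cpp - sketch: pack 4 bases per byte with LFs dropped and
    // N stored as 00 (same code as T), per the mapping T,C,G,A,N -> 00,01,10,11,00.
    #include <cstdio>
    #include <cstring>
    
    static int code(char c) {               // T,C,G,A -> 0,1,2,3; anything else (N) -> 0
      static const char tab[] = "TCGA";
      const char* p = strchr(tab, c);
      return p ? int(p - tab) : 0;
    }
    
    int main() {
      FILE* in  = fopen("test2", "rb");
      FILE* out = fopen("test2_with_n_00_2bit", "wb");
      if (!in || !out) return perror("open"), 1;
      int acc = 0, nbits = 0;
      for (int c; (c = getc(in)) != EOF;) {
        if (c == '\n') continue;            // LF positions are implicit (100 bases/read)
        acc = (acc << 2) | code(char(c));   // first base of each group in the high bits
        if ((nbits += 2) == 8) {
          putc(acc, out);
          acc = nbits = 0;
        }
      }
      if (nbits) putc(acc << (8 - nbits), out);  // flush a trailing partial byte, if any
      fclose(in);
      fclose(out);
      return 0;
    }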
    Last edited by schnaader; 2nd December 2011 at 11:27.
    http://schnaader.info
    Damn kids. They're all alike.

  26. #26
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    874
    Thanks
    402
    Thanked 378 Times in 237 Posts
    Some things to consider. The DNA strings are all effectively little snapshots of a much larger (*maybe* known) string - the genome being sequenced. They haven't specified which genome, which IMO is unfair, as with real data you usually know. I assume it's human, but it's easy to look up.

    So... we expect matches of a DNA string with the genome, or in this case with other DNA sequences, but only within the sequence itself and not across newline boundaries. Each newline represents a completely different starting point somewhere in the genome. It may not even be in the same orientation. Graphically it looks something like this when assembled together:

    [assembly screenshot from Gap5 - explained in the edit below]

    My own view on optimal compression here is that fastq isn't what the scientists actually want anyway. The optimal fastq compression will almost certainly come from doing a de novo sequence assembly or from mapping the data back to a known reference, and then encoding it as a combination of the (possibly derived, possibly known) reference, positions within it, and differences to it. Maybe we can write an algorithm to do this, but frankly you may as well start with the existing tools. (A toy sketch of the position-plus-differences idea follows.)
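
    To make that concrete, a toy sketch (nothing like a real aligner; brute-force scan, mismatches only, no indels) of encoding a read against a known reference as a position plus a list of differences:

    Code:
    // ref_encode.cpp - toy sketch: represent a read as (position in reference,
    // list of (offset, base) mismatches). Real aligners such as bwa and bowtie
    // use BWT/FM-index searches; the brute-force scan here is only illustrative.
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>
    using namespace std;
    
    struct Encoded {
      size_t pos;                            // best match position in the reference
      vector<pair<int, char>> diffs;         // (offset within read, read base) mismatches
    };
    
    static Encoded encode(const string& ref, const string& read) {
      Encoded best{0, {}};
      size_t bestErrs = read.size() + 1;
      for (size_t p = 0; p + read.size() <= ref.size(); ++p) {
        size_t errs = 0;
        for (size_t i = 0; i < read.size() && errs < bestErrs; ++i)
          errs += (ref[p + i] != read[i]);
        if (errs < bestErrs) {
          bestErrs = errs;
          best.pos = p;
          best.diffs.clear();
          for (size_t i = 0; i < read.size(); ++i)
            if (ref[p + i] != read[i]) best.diffs.push_back({int(i), read[i]});
        }
      }
      return best;                           // decoder: copy ref[pos..], apply diffs
    }
    
    int main() {
      string ref  = "ACGTACGTTAGGCCATACGT";  // stand-in for a genome
      string read = "TAGGCCATTC";            // one read containing an error or SNP
      Encoded e = encode(ref, read);
      printf("pos=%zu diffs=%zu\n", e.pos, e.diffs.size());
      for (auto& d : e.diffs) printf("  offset %d -> %c\n", d.first, d.second);
      return 0;
    }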

    However, here's the rub: what is the point of then throwing away all this useful information - how the data aligns - in order to recreate the original fastq file? It's pointless! The scientists will probably only end up aligning it again themselves anyway. Instead, the real solution lies in encoding the data in the SAM file format - a generally recognised format for mapped and assembled DNA fragments - and forgetting about fastq completely.

    Of course it's probably not what the organisers expect. I also realise I've totally given away my game plan, but frankly with the experts here I'm doomed anyway! All I can hope is to channel people in interesting and appropriate directions. :)

    Edit: an explanation of the above pic. It's from Gap5, the software I work on for a living. The grey-scale background is an indication of the phred quality values, with dark being poor quality. The blue colouration simply shows disagreements with the consensus (which is worked out on the fly). That's an old screenshot with early Illumina data, so the error rate is high, giving lots of discrepancies, and the sequence length is short.

    On modern data, with a deep assembly (about 1000x) I was able to compress the sequence to 0.042 bits per base by modelling each column at a time; this is given the context of the position in the consensus and the alignment CIGAR string, but not the actual consensus itself. That basically shows how few errors there are in more modern data sets.

    So differences then boil down to two things.

    1) Errors, mostly random but sometimes specific to particular sequence motifs. In theory the quality values have been calibrated such that Q10 means a 1 in 10 error rate, Q20 is 1 in 100, Q30 is 1 in 1000, etc., i.e. Q = -10*log10(P(error)). So we can get an estimate of the chance of a base call differing based on its phred score... IF they've been calibrated correctly. (A small conversion sketch follows after point 2.)

    2) SNPs (single nucleotide polymorphisms). Basically these are differences between individuals. A lot of sequencing data is simply your whole genome smashed into little fragments and passed through a sequencing instrument. That means you have two copies of everything - one derived from your mother and one from your father. The assembly then shows up various sites where those two copies differ - you'll likely see a column consisting of a 50/50 ratio of A and G, for example.
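
    The quality-to-probability conversion mentioned in point 1, in code form (a trivial sketch, using the phred+33 ASCII encoding of this data set):

    Code:
    // phred.cpp - sketch: convert phred+33 quality characters to estimated
    // error probabilities, P(error) = 10^(-Q/10).
    #include <cmath>
    #include <cstdio>
    
    int main() {
      const char* quals = "#+5?I";           // example quality characters
      for (const char* p = quals; *p; ++p) {
        int q = *p - 33;                     // phred+33 encoding
        double perr = pow(10.0, -q / 10.0);  // Q10 -> 0.1, Q20 -> 0.01, Q30 -> 0.001
        printf("'%c'  Q=%2d  P(error)=%.5f\n", *p, q, perr);
      }
      return 0;
    }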

    Finally, there's currently a strong relationship between compression technology and bioinformatics. The best of the current crop of DNA sequence aligners, which match DNA fragments against a known reference genome, includes the bowtie and bwa programs - both based around the BWT (or more specifically the FM-index, I think).
    Last edited by JamesB; 2nd December 2011 at 11:37.

  27. #27
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    874
    Thanks
    402
    Thanked 378 Times in 237 Posts
    Quote Originally Posted by Steve View Post
    What a pity! All you pros working on it ^^
    I already hoped for easy money when I saw the first two entries :D
    The cheek of it! I went for pure speed with fqzcomp, but it turns out it's too fast given typical disk speeds (faster than gzip even at decompression). I do have plans, though, for substantially better compression while still not being too bad on speed.

    Good catch on the 6 GB file. Last time I tried, it was still permission denied. I guess they've set the ACLs on that S3 file now.

    James

  28. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,267
    Thanks
    313
    Thanked 831 Times in 499 Posts
    Quote Originally Posted by Shelwien View Post
    2. How about testing m1? https://sites.google.com/site/toffer86/m1-project
    Especially after optimizing some profiles for data types with it?
    After running m1 v0.6 overnight on test2, I got this after 14 iterations, then no further improvement. But it didn't save the profile or the compressed file, so I will have to run it again.

    Code:
    m1 o: test2
    
    14 f* = 1.929044
     0 SSE0_W0 = 00110111 [0, 8, g]
     1 SSE1_W0 = 01111111 [0, 8, g]
     2 SSE0_W = 1111111110000000 [0, 16, g]
     3 FSM0_PARTITION = 0000000001000000000000100000000000000000000010000000000010000000 [4, 4]
     4 FSM0_DECAY_Q_MAX = 11100 [0, 5, vg]
     5 FSM0_DECAY = 0000000000000100000000100000100 [3, 3, b]
     6 FSM0_QDOMAIN = 111100 [0, 6, vg]
     7 MDL_0x40 = 1111 [0, 4, b]
     8 MDL_MASKS = 11111111000111101111001001011101 [0, 32, b]
     9 MDL_MUL = 1111111111110011101010001101000010010001011100110110111010110100 [0, 64]
    10 MIX0_CTXTYPE = 1 [0, 1]
    11 MIX0_SHIFT = 011 [0, 3, g]

  29. #29
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,267
    Thanks
    313
    Thanked 831 Times in 499 Posts
    Quote Originally Posted by JamesB View Post
    Good catch on the 6Gb file. Last time I tried that it was still permission denied. I guess they set the acls on that s3 file now.
    James
    Looks like the same format as the smaller file (Illumina 1.8, phred+33, 100-byte reads). Here are the first few lines.

    Code:
    @SRR062634.1 HWI-EAS110_103327062:6:1:1092:8469/1
    GGGTTTTCCTGAAAAAGGGATTCAAGAAAGAAAACTTACATGAGGTGATTGTTTAATGTTGCTACCAAAGAAGAGAGAGTTACCTGCCCATTCACTCAGG
    +
    @C'@9:BB:?DCCB5CC?5C=?5@CADC?BDB)B@?-A@=:=:@CC'C>5AA+*+2@@'-?>5-?C=@-??)'>>B?D@?*?A#################
    @SRR062634.2 HWI-EAS110_103327062:6:1:1107:21105/1
    ACCGTGAGCAATCAGCTGCCATCAACGTGGAGGTAAGACTCTCCACCTGCAAAAACATTACAACTTGCTGAAGGCTGAGATACTTGTTCGCACATTTTTA
    +
    FDEFF?DFEFE?BEEEEED=DB:DCEAEEB,CC=@B=5?B?CC5C?B+A??=>:CC<9-B2=@>-?:-<A@@A?9>*0<:'0%6,>:9&-:>?:>==B??
    @SRR062634.3 HWI-EAS110_103327062:6:1:1110:17198/1
    TAGATATTTTTGTTTTAACTGCTGTAGAAAATTAAGACATAAACTAAGAAATATCCCATGAAGGAATGAGTATACTGTTTCTACTTGCAGAAAAGCTGCG
    +
    -?3-C22646@-@3@@3-=-====CBB@DB-A-=-AA@C-<AA7>D=ABDA;?CDDDD5D?DD55:>:AB>5?-CCCC######################
    @SRR062634.4 HWI-EAS110_103327062:6:1:1112:12923/1
    AGATGAGTTTCACAAAGTAAGCAACTTGATATCCAAATAATCAACACCCAACTCAAGAAAAAGATCATTACCAGAAACTAATAAACCAGCACATTAGGTG
    +
    ??EEEDB?D-?AAA-AA?>->BC:ADB:--A55ACCA:D6C:?5@CADD6=C5:CD?D4;,::?,CC-5A@C-?D5@+-BB?BC*A-A?C:C@#######
    BTW, the hash characters "#" are phred values of 2, the lowest possible score except for N.

  30. #30
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    874
    Thanks
    402
    Thanked 378 Times in 237 Posts
    A quick benchmark of fqzcomp2 - same parsing code as fqzcomp-1 but using the model+encoder derived from Shelwien's coders6c2 distribution:

    Code:
    jkb@deskpro102485[compress.../fqzcomp-...] time ./fastq2fqz < /dev/shm/s > /dev/shm/b
    Output = 16627423 bytes
    Name 1180443
    Seq 7522909
    Qual 7923497
    
    real 0m1.991s
    user 0m1.836s
    sys 0m0.148s
    
    jkb@deskpro102485[compress.../fqzcomp-...] time ./fqz2fastq < /dev/shm/b > /dev/shm/c
    
    real 0m1.890s
    user 0m1.724s
    sys 0m0.160s
    
    jkb@deskpro102485[compress.../fqzcomp-...] ls -l /dev/shm/b /dev/shm/s
    -rw-r--r-- 1 jkb team117 16627431 Dec 2 14:44 /dev/shm/b
    -rw-r--r-- 1 jkb team117 80755971 Nov 8 14:34 /dev/shm/s

    So: program, compressed size (bytes), compression time (s), decompression time (s):
    Code:
    gzip2     25224902 7.667 0.687
    fqzcomp-1 18825695 1.975 0.483
    fqzcomp-2 16627431 1.991 1.890
    The models used in fqzcomp2 are still pretty noddy. Names are packed into the same 3 arrays as before (this can be greatly improved on with a proper encoder, but zlib limited me majorly there) and passed through an order-1 encoder; the sequence is packed in triplets and passed through an order-1 encoder; and the quality is passed to an encoder whose context is the previous 2 quality bytes plus the distance from the last newline divided by 32.
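
    Not fqzcomp's actual code, but a rough sketch of how such a quality context (previous two quality values plus position/32) might be accumulated; test4 is assumed to be the quality stream from the earlier posts, and a real implementation would drive an arithmetic coder from these per-context statistics:

    Code:
    // qualctx.cpp - sketch: build a context from the previous two quality
    // values plus the within-read position divided by 32, and gather
    // per-context symbol counts (a coder would use these to code each symbol).
    #include <cstdio>
    #include <vector>
    
    int main() {
      FILE* in = fopen("test4", "rb");          // quality lines, one per read (assumed)
      if (!in) return perror("test4"), 1;
      const int QMAX = 64;                      // quality alphabet size after -33 (assumed)
      // context = (q1 * QMAX + q2) * 4 + min(pos/32, 3)
      std::vector<unsigned> counts((size_t)QMAX * QMAX * 4 * QMAX, 0);
      int q1 = 0, q2 = 0, pos = 0;
      for (int c; (c = getc(in)) != EOF;) {
        if (c == '\n') { q1 = q2 = pos = 0; continue; }   // reset context at each read
        int q = c - 33;
        if (q < 0) q = 0;
        if (q >= QMAX) q = QMAX - 1;
        int posbin = pos / 32;
        if (posbin > 3) posbin = 3;             // 100-byte reads give bins 0..3 anyway
        size_t ctx = ((size_t)q1 * QMAX + q2) * 4 + posbin;
        ++counts[ctx * QMAX + q];               // statistics the coder would adapt on
        q2 = q1; q1 = q; ++pos;
      }
      fclose(in);
      printf("context table: %zu entries\n", counts.size());
      return 0;
    }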

    Obviously it can be beaten with better models, but this seems to demonstrate a good size vs speed tradeoff. I'll upload it soon, but the code takes hacky to a new level and currently I'm ashamed to let Shelwien see what I did to his code.

    Edit: I've improved the seq model by using 2-bit encoding and escaping Ns via knowledge of the quality value. This drops it down to 16438212 bytes (7334911 for seq). I tweaked the names slightly too, but there's a lot more scope there. I want to model each name as a series of numeric and non-numeric fields; each field can then be output as "matches previous", "delta to previous" (good for both strings and numbers) or "as is". I think this would substantially reduce the names.
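
    A sketch of that field-wise idea (hypothetical, not what fqzcomp currently does): split each name into alternating numeric and non-numeric tokens and compare with the previous name, emitting a "match" flag, a numeric delta, or a literal token. The example names are taken from the sample data shown above.

    Code:
    // nametok.cpp - sketch of field-wise name modelling: tokenize names into
    // numeric / non-numeric runs and encode each token relative to the
    // corresponding token of the previous name.
    #include <cctype>
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <vector>
    using namespace std;
    
    static vector<string> tokenize(const string& s) {
      vector<string> t;
      for (size_t i = 0; i < s.size();) {
        size_t j = i;
        bool digit = isdigit((unsigned char)s[i]) != 0;
        while (j < s.size() && (isdigit((unsigned char)s[j]) != 0) == digit) ++j;
        t.push_back(s.substr(i, j - i));
        i = j;
      }
      return t;
    }
    
    int main() {
      const char* names[] = {
        "@SRR062634.1 HWI-EAS110_103327062:6:1:1092:8469/1",
        "@SRR062634.2 HWI-EAS110_103327062:6:1:1107:21105/1",
      };
      vector<string> prev;
      for (const char* n : names) {
        vector<string> cur = tokenize(n);
        for (size_t i = 0; i < cur.size(); ++i) {
          bool num = isdigit((unsigned char)cur[i][0]) != 0;
          if (i < prev.size() && cur[i] == prev[i])
            printf("[match] ");
          else if (i < prev.size() && num && isdigit((unsigned char)prev[i][0]))
            printf("[delta %+ld] ", atol(cur[i].c_str()) - atol(prev[i].c_str()));
          else
            printf("[lit %s] ", cur[i].c_str());
        }
        printf("\n");
        prev = cur;
      }
      return 0;
    }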

    There's room for quality shrinkage too. I have a simple command-line processor for just quality which gets it to 7824640 bytes in about 1 second, but ultimately the way to go there is lossy compression frankly. Same with the names - they're not that useful!
    Last edited by JamesB; 2nd December 2011 at 19:59.
