> Compress: L.Allison, Computer Science, Monash University 4/1998
>     1 TGATAGGTGA TAGATAGATT GATAGATGAT AGAAGATTGA TAGATGATAG
>    51 ATACATAGGT GATAGTAGAT GTAAGATGAT AGATGATAGA TAGATAGATG
>   101 ATAGACAGAT TGATAGATGA TAGAGAGA 128
>
> order-0 Markov Model
> [plot omitted: vertical axis 0.0 b to 4.0+ b in 0.5 b steps, marked row at 2.0 b]
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis: (H) =10.0 bits
> data: (D|H) =211.0 bits, =1.6487 b/ch
> total: (H)+(D|H) =221.0 bits, =1.7269 b/ch
> ran 00/01/21 from 15:32:55 to 15:32:55
>
> order-1 Markov Model
> [plot omitted: vertical axis 0.0 b to 4.0+ b in 0.5 b steps, marked row at 2.0 b]
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis: (H) =30.0 bits
> data: (D|H) =152.1 bits, =1.1886 b/ch
> total: (H)+(D|H) =182.2 bits, =1.4233 b/ch
> ran 00/01/21 from 15:32:55 to 15:32:55
>
> AED fwd approx repeats
> [Frequencies B:58.6 R:3.2 C:68.2 E:3.2 =:66.4 ~:2.0 i:1.0 d:2.1 tot:204.8]
> [Frequencies B:41.0 R:4.2 C:86.0 E:4.2 =:83.6 ~:2.0 i:1.4 d:3.3 tot:225.8]
> [Frequencies B:37.3 R:4.8 C:89.3 E:4.8 =:87.4 ~:1.7 i:1.5 d:3.6 tot:230.6]
> [plot omitted: vertical axis 0.0 b to 4.0+ b in 0.5 b steps, marked row at 2.0 b]
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis: (H) =49.5 bits
> data: (D|H) =128.5 bits, =1.0040 b/ch
> total: (H)+(D|H) =178.0 bits, =1.3906 b/ch
> ran 00/01/21 from 15:32:55 to 15:32:58
> --- end ---
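As a rough illustration of what the order-0 and order-1 figures measure, the sketch below computes plug-in (empirical) entropies for the same 128-base sequence. This is an assumption-laden stand-in, not the Compress program's method: the function names are hypothetical, and the program's own estimator is not shown in the listing, so the values printed here should not be expected to reproduce the 211.0 or 152.1 bits reported.

```python
from collections import Counter
from math import log2

# The 128-base sequence from the listing, with spaces and position numbers removed.
SEQ = (
    "TGATAGGTGATAGATAGATTGATAGATGATAGAAGATTGATAGATGATAG"
    "ATACATAGGTGATAGTAGATGTAAGATGATAGATGATAGATAGATAGATG"
    "ATAGACAGATTGATAGATGATAGAGAGA"
)

def order0_bits_per_char(seq):
    """Empirical entropy of the base frequencies, in bits per character."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

def order1_bits_per_char(seq):
    """Empirical conditional entropy of a base given the previous base."""
    pair_counts = Counter(zip(seq, seq[1:]))   # (previous, next) pairs
    context_counts = Counter(seq[:-1])         # how often each base acts as context
    n_pairs = len(seq) - 1
    return -sum(
        c / n_pairs * log2(c / context_counts[prev])
        for (prev, _nxt), c in pair_counts.items()
    )

h0 = order0_bits_per_char(SEQ)
h1 = order1_bits_per_char(SEQ)
print(f"order-0: {h0:.4f} b/ch, order-1: {h1:.4f} b/ch")
```

Both values fall below the 2 bits per character of a uniform code over four bases, and conditioning on the previous base lowers the estimate further, which is the same ordering the listing shows.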
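Note that the approximate repeats (AED) model, the most complex one, gives the best compression of this sequence, even when the cost of stating the model's parameters is included: 178.0 bits in total (1.3906 b/ch), against 182.2 bits for the order-1 Markov model and 221.0 bits for the order-0 model.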
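The comparison can be checked with a few lines of arithmetic. The sketch below copies the (H) and (D|H) figures from the program output and adds them to get each two-part message length; the dictionary and function names are hypothetical, chosen only for illustration. Because the inputs are already rounded to one decimal place, the recomputed totals and b/ch values may differ from the listing in the last digit.

```python
SEQ_LEN = 128          # sequence length reported by the program
NULL_BITS = SEQ_LEN * 2.0  # uniform code: log2(4) = 2 bits per base

# (H) and (D|H) in bits, copied from the program output.
reported = {
    "order-0 Markov": (10.0, 211.0),
    "order-1 Markov": (30.0, 152.1),
    "AED approx repeats": (49.5, 128.5),
}

def total_bits(model_bits, data_bits):
    """Two-part message length: cost of the model plus cost of the data given it."""
    return model_bits + data_bits

totals = {name: total_bits(h, d) for name, (h, d) in reported.items()}
best = min(totals, key=totals.get)

for name, t in totals.items():
    print(f"{name}: {t:.1f} bits, {t / SEQ_LEN:.4f} b/ch")
print("shortest total message:", best)
```

The AED model has the largest hypothesis cost of the three, but its saving on the data part more than pays for it, and all three totals are below the 256 bits of the uniform code.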
Also see [http://www.allisons.org/ll/Bioinformatics/Compress/]