DNA searching

DNA searching

Starting with a DNA strand from the NBCI database, we get a long sequence of {A*C*T*G*}*

–E.g. {CTGACAT…}

This long string is broken into substrings of length N as shown below.

–For example, assume N=3. (Extremely small but chosen for simplicity. N could be 30 or 40 or more.)

–The sub-strings from this DNA strand are: {CTG, TGA, GAC, ACA, CAT…}

•Algorithm: First N characters are broken off into a sub-string, then we start reading N characters from the next character in the strand.

•Once we reach the end, we have a large collection of unique and redundant strings in the DNA strand.

        A few (minor) complications

        DNA consists of an inter-leaved helix of two strands



           DNA sequences from the NBCI database only have one strand of the DNA. However, given one strand, it is a simple matter to derive the other strand.

Also, it’s unclear which direction the sequence was captured. So…

• There are four strands to consider when searching the genome.

•1. The original strand from the database. E.g. Strand 1, forward direction.

•2. Strand 1, reverse direction. (flip #1 )

•3. Strand 2, forward direction (Derive this from #1 above) via simple rules.

• Aßà T (E.g. for every A in #1, replace it with a T….and vice versa.)

•C ß à G (Do the same for C and G.)

•4. Strand 2, reverse direction (flip #3 above.)

Optimization: Instead of creating and storing all 4 strands, it is possible to just adjust the query, e.g. a search for “CAT” could also result in these three other searches:

•TAC (reversed), GTA (second strand), ATG (second strand reversed)

•With this optimization, less memory is required to store the DNA information.