(Go back to index)

DNA searching

         Starting with a DNA strand from the NBCI database, we get a long sequence of {A*C*T*G*}*

E.g. {CTGACAT…}

This long string is broken into substrings of length N as shown below. 
For example, assume N=3. (Extremely small but chosen for simplicity. N could be 30 or 40 or more.)
The sub-strings from this DNA strand are: {CTG, TGA, GAC, ACA, CAT…}
Algorithm: First N characters are broken off into a sub-string, then we start reading N characters from the next character in the strand.
Once we reach the end, we have a large collection of unique and redundant strings in the DNA strand.

        A few (minor) complications


       
DNA consists of an inter-leaved helix of two strands
  
       DNA_Helix


           DNA sequences from the NBCI database only have one strand of the DNA. However, given one strand, it is a simple matter to derive the other strand.

 Also, it’s unclear which direction the sequence was captured. So…

There are four strands to consider when searching the genome.

1. The original strand from the database. E.g. Strand 1, forward direction.

2. Strand 1, reverse direction. (flip #1 )

3. Strand 2, forward direction (Derive this from #1 above) via simple rules.

Aßà T    (E.g. for every A in #1, replace it with a T….and vice versa.)
C ß à G   (Do the same for C and G.)

4. Strand 2, reverse direction (flip #3 above.)

Optimization: Instead of creating and storing all 4 strands, it is possible to just adjust the query, e.g. a search for “CAT” could also result in these three other searches:

TAC (reversed), GTA (second strand), ATG (second strand reversed)

With this optimization, less memory is required to store the DNA information.