(Go back to index)
DNA searching
Starting with a DNA strand from the NBCI
database, we get a long
sequence of {A*C*T*G*}*
–E.g.
{CTGACAT…}
This long string is broken into substrings of length N as shown below.
–For
example, assume N=3. (Extremely small but chosen for simplicity. N could be
30 or 40 or more.)
–The
sub-strings from this DNA strand are: {CTG, TGA, GAC, ACA, CAT…}
•Algorithm:
First N characters are broken off into a sub-string, then we start reading N
characters from the next character in the strand.
•Once
we reach the end, we have a large collection of unique and redundant strings
in the DNA strand.
A few (minor) complications
DNA consists of an
inter-leaved helix of two strands

DNA sequences from the NBCI
database only have one strand of the DNA. However, given one strand, it is a
simple matter to derive the other strand.
Also,
it’s unclear which direction the sequence was captured. So…
•
There are
four
strands to consider when searching the genome.
•1.
The original strand from the database. E.g. Strand 1, forward direction.
•2.
Strand 1, reverse direction. (flip #1 )
•3.
Strand 2, forward direction (Derive this from #1 above) via simple rules.
•
Aßà
T (E.g. for every A in #1, replace
it with a T….and vice versa.)
•C
ß
à
G (Do the same for C and G.)
•4.
Strand 2, reverse direction (flip #3 above.)
Optimization: Instead of creating and storing all 4 strands, it is possible
to just adjust the query, e.g. a search for “CAT” could also result in these
three other searches:
•TAC
(reversed), GTA (second strand), ATG (second strand reversed)
•With
this optimization, less memory is required to store the DNA information.