Sample Results
Our tool generates a list of every sub-string of length N in the DNA input file. An extremely small sampling of the sub-strings of length 30 from the Sulfolobus DNA file is shown in this section.
The input file contains 2,694,756 characters from the alphabet {A, C, T, G}.
Our scanning code generates all possible sub-strings of length N, stores them in the tree, and records the location in the file of each sub-string.
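The scanning step can be illustrated with a minimal sketch. This is not the tool's actual code: the kind of tree is not specified here, so the sketch assumes a simple character trie built from nested dictionaries, 0-based offsets, and the hypothetical names build_index and lookup.

```python
def build_index(sequence, n):
    """Insert every length-n sub-string of sequence into a trie.

    Each leaf stores the list of 0-based offsets at which that
    sub-string starts (assumed layout, for illustration only).
    """
    root = {}
    for start in range(len(sequence) - n + 1):
        node = root
        for base in sequence[start:start + n]:
            node = node.setdefault(base, {})
        # "locations" cannot collide with a single-character base key.
        node.setdefault("locations", []).append(start)
    return root


def lookup(root, query):
    """Return the recorded offsets of query, or [] if it is absent."""
    node = root
    for base in query:
        node = node.get(base)
        if node is None:
            return []
    return node.get("locations", [])


index = build_index("ACGTACGT", 4)
print(lookup(index, "ACGT"))
print(lookup(index, "AAAA"))
```

A nested-dictionary trie keeps the example short; for a file with millions of characters, a more compact structure would likely be needed.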
Searching
Our tool currently supports exact-match searches for sub-strings of length N in the tree. If the string is found, the tool returns a message to the user with each location of the string in the text file, along with the "span" for every location after the first. The span is the distance between two consecutive matches in the DNA file.
We also plan to provide graphical results to the user.
Sample Search Result
Found: TTTTATAACTTTTTACTTTATTTAGTTATT.
Locations: 824759 1020445 (Span:195686) 1166209 (Span:145764)
1543241 (Span:377032) 1773710 (Span:230469) 1923991 (Span:150281)
2001272 (Span:77281) 2027882 (Span:26610) 2240658 (Span:212776)
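Each span in the result above is the difference between a location and the one before it, for example 1020445 - 824759 = 195686. The following sketch shows one way such a report could be produced. It is an illustration only: it scans the raw string with str.find instead of searching the tool's tree, and the function names and the "Not found" message are assumptions.

```python
def find_all(sequence, query):
    """Return the 0-based offsets of every (possibly overlapping) match."""
    positions = []
    start = sequence.find(query)
    while start != -1:
        positions.append(start)
        start = sequence.find(query, start + 1)
    return positions


def spans(positions):
    """Distance between each match and the match before it."""
    return [b - a for a, b in zip(positions, positions[1:])]


def format_result(query, positions):
    """Build a report in the style of the sample search result."""
    if not positions:
        return f"Not found: {query}."  # hypothetical message
    parts = [str(positions[0])]
    for pos, span in zip(positions[1:], spans(positions)):
        parts.append(f"{pos} (Span:{span})")
    return f"Found: {query}.\nLocations: " + " ".join(parts)


# Locations copied from the sample search result.
sample = [824759, 1020445, 1166209, 1543241, 1773710,
          1923991, 2001272, 2027882, 2240658]
print(format_result("TTTTATAACTTTTTACTTTATTTAGTTATT", sample))
```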
Small sampling of sub-strings of length N from DNA input file
AAAAAAAAAAAAGATGAACAAAATTCAGA
AAAAAAAAAAAGATGAACAAAATTCAGAA
AAAAAAAAAAAGATGAGTTTAACATCTGC
AAAAAAAAAAAGTTGAATTGACAGAAGAC
AAAAAAAAAACAAATTTAAAAAAATTCCA
AAAAAAAAAACTGGAAGCGCTTAGCATAA
AAAAAAAAAAGATGAACAAAATTCAGAAA
AAAAAAAAAAGATGAGTTTAACATCTGCA
AAAAAAAAAAGTTGAATTGACAGAAGACG
AAAAAAAAAATCAACGATTCTCTCAATAA
. . .
TTTTTTTTTCATAATAAAAAGTCATAGAA
TTTTTTTTTCTTTTAATCTGCTTTTATTT
TTTTTTTTTTAAAAAAAAGAGCGTTAAAC
TTTTTTTTTTAATATGGAATTTCTTTCAC
TTTTTTTTTTTAATATGGAATTTCTTTCA
TTTTTTTTTTTTAATATGGAATTTCTTTC
TTTTTTTTTTTTTAATATGGAATTTCTTT
Sub-string uniqueness
In this particular data file, a high proportion of the sub-strings of length 30 are unique. Our tool reports the number of unique sub-strings out of the total number of sub-strings in the file. The data shows that 98.81% of all sub-strings in the data file are unique, with only a small number of duplicates.
Unique sub-strings: 2,662,746
Total sub-strings: 2,694,726
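The percentage follows directly from the two counts: 2,662,746 / 2,694,726 is approximately 0.9881. The sketch below shows how such counts could be gathered for a small sequence. It assumes that "unique" means distinct sub-strings and uses a hypothetical helper named uniqueness; the tool itself may count differently.

```python
from collections import Counter


def uniqueness(sequence, n):
    """Return (distinct, total, percent) for the length-n sub-strings.

    Assumption: "unique" is read as the number of distinct sub-strings.
    """
    counts = Counter(sequence[i:i + n]
                     for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    distinct = len(counts)
    percent = 100.0 * distinct / total if total else 0.0
    return distinct, total, percent


# Toy input: ACGT occurs twice, so 4 of the 5 sub-strings are distinct.
print(uniqueness("ACGTACGT", 4))

# Check of the reported figure using the counts given in the text.
print(round(100.0 * 2662746 / 2694726, 2))
```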