Since most data come in files and streams, a data scientist must be able to effectively work with them. Python provides many facilities to make this easy. In this class activity, we will review some of python's file, string, and dictionary facilities by examining a file containing the genetic code of the virus that has been disrupting our lives this term. Here is a transmission electron micrograph showing the virus (a public domain image from the CDC, credited to H. A. Bullock and A. Tamin).
The genetic code of each living organism is a long sequence of simple molecules called nucleotides or bases. Although many nucleotides exist in nature, only 4 nucleotides, labeled A, C, G, and T, have been found in DNA. They are abbreviations of Adenine, Cytosine, Guanine, and Thymine. Although it is difficult to put viruses in the category of living organisms, they also have genetic codes made up of nucleotides.
The NCBI (National Center for Biotechnology Information) has recently started maintaining a data hub for genetic sequences related to the virus causing COVID-19. Recall that the name of the virus is SARS-CoV-2 (which is different from the name of the disease, COVID-19), or "Severe Acute Respiratory Syndrome Coronavirus 2" in full. Searching the NCBI website with the proper virus name will help you locate many publicly available data sets.
Let's download NCBI's Reference Sequence NC_045512 giving the
complete genome extracted from a sample of SARS-CoV-2 from the Wuhan seafood market, called the Wuhan-Hu-1 isolate. Here is a code using urllib
that will attempt to directly download from the url specified below. It is unclear if this url would serve as a stable permanent link. In the event you have problems executing the next cell, please just head over to the webpage for NC_045512, click on "FASTA" (a data format) and then click on "Send to" a file. Then save the file in the same relative location mentioned below in f
within the folder where we have been putting all the data files in this course.
# NCBI url:
url = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&' + \
'save=file&log$=seqview&db=nuccore&report=fasta&id=1798174254&' + \
'extrafeat=null&conwithfeat=on&hide-cdd=on'
# your local downloaded file:
f = '../../data_external/SARS-CoV-2-Wuhan-NC_045512.2.fasta'
import os
import urllib
import shutil
if not os.path.isdir('../../data_external/'):
os.mkdir('../../data_external/')
r = urllib.request.urlopen(url)
fo = open(f, 'wb')
shutil.copyfileobj(r, fo)
fo.close()
As mentioned in the page describing the data, this file gives the RNA of the virus.
lines = open(f, 'r').readlines()
The file has been opened in read-only mode. The variable lines
contains a list of all the lines of the file. Here are the first five lines:
lines[0:5]
The first line is a description of the data. The long genetic code is broken up into the following lines. We need to strip end-of-line characters from each such line to re-assemble the RNA string. Here is a way to strip off the end-of-line character:
lines[1].strip()
Let's do so for every line starting ignoring the first. Since lines
is a list object, ignoring the first element of the list is done by lines[1:]
. (If you don't know this already, you must review the list access constructs.)
The following code uses the string operation join
to put together the lines into one long string. This is the RNA of the virus.
rna = ''.join([line.strip() for line in lines[1:]])
The first thousand characters and the last thousand characters of the RNA of the coronavirus are printed below:
rna[:1000]
rna[-1000:]
Here is the total length of the RNA:
len(rna)
While the human genome is over 3 billion in length, the genome of this virus does not even reach the length of 30000.
When describing RNA, the T (Thymine) is often replaced by U (Uracil). This is done for example in an interesting New York Times article that came out last Friday. The article explains how this RNA code makes infected host cells produce a variety of proteins. Scientists have a good understanding of what some of these proteins do, but not all.
Here is a quote from the article on a protein it nicknamed Virus Liberator. ORF7a
When new viruses try to escape a cell, the cell can snare them with
proteins called tetherin. Some research suggests that ORF7a cuts
down an infected cell’s supply of tetherin, allowing more of the
viruses to escape. Researchers have also found that the protein can
trigger infected cells to commit suicide - which contributes to the
damage Covid-19 causes to the lungs.
The article then gives the ORF7a sequence, which I have copied and pasted into the next cell, adding some string breaks. Note how the article has used lower case characters and the character u
instead of T
.
orf7a = \
'augaaaauuauucuuuucuuggcacugauaacacucgcuacuugugagcuuuaucacuaccaag' + \
'aguguguuagagguacaacaguacuuuuaaaagaaccuugcucuucuggaacauacgagggcaa' + \
'uucaccauuucauccucuagcugauaacaaauuugcacugacuugcuuuagcacucaauuugcu' + \
'uuugcuuguccugacggcguaaaacacgucuaucaguuacgugccagaucaguuucaccuaaac' + \
'uguucaucagacaagaggaaguucaagaacuuuacucuccaauuuuucuuauuguugcggcaau' + \
'aguguuuauaacacuuugcuucacacucaaaagaaagacagaaugauugaacuuucauuaauug' + \
'acuucuauuugugcuuuuuagccuuucugcuauuccuuguuuuaauuaugcuuauuaucuuuug' + \
'guucucacuugaacugcaagaucauaaugaaacuugucacgccuaaacgaac'
The next task in this class activity is to find if this sequence occurs in the RNA we just downloaded, and if it does, where it occurs. To this end, we first make the replacements required to read the string in terms of A, T, G, and C.
s=orf7a.replace('u', 'T').replace('a', 'A').replace('g', 'G').replace('c', 'C')
s
The next step is now a triviality in view of python's exceptional string handling mechanisms:
s in rna
We may also easily find the location of the ORF7a sequence and read off the entire string beginning with the sequence.
rna.find(s)
rna[27393:]
The frequency of a base or a nucleotide in a genetic code is the number of times it occurs divided by the length of the code. The varying frequency of different nucleotides, called the nucleotide bias varies between organisms and is known to have biological implications. Biologists also often talk of the GC content, the percentage of nitrogeneous bases (G and C) in an RNA or DNA to get insights into its stability.
The next task in this activity is to make a python dictionary, called freq
, whose keys are the nucleotide characters and whose values are the number of times it occurs in the virus RNA. Once you have made it, freq['A']
, for example, should output the frequency of nucleotide A
.
freq = {b: rna.count(b)/len(rna) for b in 'ATGC'}
freq
A more recent dataset at NCBI, apparently just submitted for peer-review on April 3, claims to contain the genome of a virus sample from our neighboring state of Washington. You can find it labeled there as the data set MT293201. Let us take a look. (Again, if the url below fails, please head over the NCBI webpage, find and download the corresponding data file for this sample, again in FASTA format, and save it using the name f2
below.)
url2 = 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?' + \
'tool=portal&save=file&log$=seqview&db=nuccore&report=fasta&' + \
'id=1828694245&extrafeat=null&conwithfeat=on&hide-cdd=on'
f2 = '../../data_external/SARS-CoV-2-Washington_MT293201.1.fasta'
r2 = urllib.request.urlopen(url2)
fo2 = open(f2, 'wb')
shutil.copyfileobj(r2, fo2)
You might have already heard in the news that there are multiple strains of the virus around the globe. Let's investigate this genetic code a bit closer.
Repeating the previous procedure on this new file, we now make a string object that contains the RNA from the Washington sample. We shall call it rna2
below.
lines = open(f2, 'r').readlines()
rna2 = ''.join([line.strip() for line in lines[1:]])
We should note that not all data sets uses just ATGC. There is a standard notation that extends the four letters, e.g., N is used to indicate any nucleotide. So, it might be a good idea to answer this question first: what are the distinct characters in the new rna2
? There can be very simply done in python if you use the set
data structure, which removes duplicates.
set(rna2)
The next natural question might be this. Are the lengths of rna
and rna2
the same?
len(rna2), len(rna)
We could also look at the first and last 30 characters and check if they are the same, like so:
rna2[:30], rna2[-30:]
rna[:30], rna[-30:]
Clearly, rna
and rna2
are different strings.
freq2 = {b: rna2.count(b)/len(rna2) for b in 'ATGC'}
freq2
Although the Washington genome is not identical to the Wuhan one, their nucleotide frequencies are very close to the Wuhan one, reproduced here:
freq
ORF7a
?¶s in rna2
rna2.find(s)
Thus, we located the same ORF7a instruction in this virus at a different location. Although the genetic code from the Washington sample and the Wuhan sample are different, they can make the same protein ORF7a and their nucleotide frequencies are very close.
This activity provided you with just a glimpse into the large field of bioinformatics, which studies, among other things, patterns of nucleotide arrangements. If you are interested in this field, you should take a look at Biopython, a bioinformatics python package.
Author: Jay Gopalakrishnan
License: ©2020. CC-BY-SA
$\ll$Table of Contents