EST Sequence Assembly
BIOL 265/COMP 113 Computer Laboratory
M. Weir / M. Rice / D. Krizanc
Assembly of long DNA or RNA sequences from
overlapping shorter sequences is a computational challenge for biologists. For example, an important component of the
genome projects is to assemble mRNA sequence information
for all genes. One approach towards this goal is to systematically sequence
large numbers of cDNAs from cDNA
libraries (cDNAs are DNA copies of mRNAs).
High-quality sequence runs are typically about 500 bp,
whereas mRNAs are typically longer.
Therefore it is necessary to perform computational analysis of cDNA
sequences to identify overlaps and thereby predict larger sequence fragments of
mRNAs. This analysis can be complicated by issues including
the possibility of alternative splicing and sequence duplications.
We have assembled several
Drosophila sequence segments with overlapping sequence -- an example of the
kind of data that might emerge from this type of analysis.
S1
S2
S3
S4
- Use the "BLAST 2
sequences" server to determine the regions of overlap between
these four sequences (use FASTA format: enter >sequencename
on the line above each sequence).
Record (e.g. with screen shots) the BLAST alignments of the overlap
regions. (Notice that the alignment results are distributed over several lines as
the sequences are quite long.)
- Draw a line diagram that
illustrates the regions of overlap between the four sequences (S1, S2, S3,
S4) showing the overlap coordinates for both sequences of each
overlap. This is analogous to
constructing a contig from the four sequences.
- Based on your alignments in questions 1 and 2,
define a set of string slices of S1, S2, S3 and S4 that, when concatenated
(in the correct order), create a contig representing the composite cDNA segment.
Remember that Python string indices start at 0 whereas BLAST coordinates start at 1,
and that a Python slice excludes its end index whereas a BLAST range includes its end position.
Be sure that each overlapping region is included only once.
Using Python, run your concatenation to create the contig sequence.
- Notice that the overlap between
two of the sequences contains some mismatches: what are three possible
explanations for this? Think about how the sequence information is obtained
as well as biological processes that might be involved. (Might the overlap with mismatches be misleading us?)
- To resolve this issue, and to assess
whether the composite cDNA represents a real
mRNA, it is useful to compare the composite cDNA
with Drosophila genomic sequence. Go to the Drosophila FlyBase BLAST
server. Use your composite cDNA as a query
against the whole euchromatic genome sequence
(i.e. choose the Genome Section "Genome Assembly (NT)"). Use the
program Blastn nt->NT.
- Does your output allow you to
distinguish between the possible explanations for the mismatches (step 4
above)? Discuss the orientations of your cDNA fragments. (Assume that all the cDNA sequences correspond to the mRNA single-strand
sequences, not the antisense sequences.) Develop a model (an explanation) that accounts for all
the results of your BLAST search.
- To test your model, perform a
BLAST search with the composite cDNA as the query
against the dataset of predicted Drosophila genes (using Database
"Annotated Genes (NT)" on the FlyBase BLAST server). Does the BLAST search
confirm your model?
- Use the BLAST result to link to
matching predicted gene(s).
- View the genes using the "Map(GBrowse)" link (on the left-hand side of the gene report).
This will help you assess your model. You may find it useful to
reduce the scale of the map (e.g. change to "show 100 kbp") in order to see neighboring genes. Notice
that the Gene Region Map can show several maps based on your choices in
the "select tracks" tab:
- DNA sequence map (we are looking at 7M on chromosome
2R)
- cytologic map showing chromosome band names ("cytologic band"; we are in the region of band
47F17)
- mutation map ("point_mutation")
- gene model map ("Gene span"; notice the genes en
and inv)
- predicted gene map (e.g. "Genescan
prediction")
- mRNA map ("mRNA")
- protein coding sequence map ("CDS")
- DNA maps referring to DNA clones used in the sequence
assembly ("Tiling BAC")
- sequenced cDNA clones ("cDNA and other aligned sequences")
- microarray probes ("Affymetrix v1
or v2")
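For step 3 above, converting BLAST coordinates into Python slices is an easy place to make off-by-one errors. The following is a minimal sketch of one way to do the conversion; the helper name blast_to_slice and the toy sequences a and b are hypothetical and are not the lab's S1-S4 or their real overlap coordinates.

```python
def blast_to_slice(start, end):
    """Convert a 1-based, inclusive BLAST range into a Python slice.

    Python slices are 0-based and exclude the end index, so only the
    start needs to be shifted down by one.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return slice(start - 1, end)


# Toy sequences (hypothetical): the last 4 bases of a match the first 4 of b.
a = "ACGTACGTGG"
b = "GTGGTTCCAA"

# Suppose BLAST reported a[7..10] aligned with b[1..4] (1-based, inclusive).
# Keep all of a, then b from position 5 onward, so the overlap appears once.
contig = a[blast_to_slice(1, len(a))] + b[blast_to_slice(5, len(b))]
print(contig)
```

The same pattern extends to four sequences: take one slice per sequence, each starting just past the previous overlap, and concatenate them in contig order.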
This analysis illustrates
the kinds of issues that arise during the assembly of both
genomic and mRNA sequences. It is wise to try to confirm interpretations using
independent data; in this case, by comparing cDNA and
genomic sequences.
Assignment:
Answer questions 3, 4,
6 and 7 above (counting the steps in the list above in order).
Additional challenges
(not part of assignment):
- Using artificial sequence
constructs (using S1, S2, S3, S4), determine a way to deduce ALL overlaps
between these four sequences using a SINGLE call of the "BLAST 2
sequences" server. Provide your input and output.
- Using Python, consider how you would design a
prefix-suffix overlap detection function that takes two strings as input
and, if the two strings overlap, outputs the inferred combined sequence.
You may consider an exact-match version or an imprecise-match version (without gaps).
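As a starting point for the second challenge, here is one possible sketch of a suffix-prefix overlap function. It is not the only design: the name merge_overlap, the min_overlap and max_mismatches parameters, and the choice to keep the left string's bases in a mismatched overlap are all assumptions made for illustration.

```python
def merge_overlap(left, right, min_overlap=3, max_mismatches=0):
    """Merge two strings on the longest suffix(left)/prefix(right) overlap.

    Returns the combined sequence, or None if no overlap of at least
    min_overlap characters has at most max_mismatches mismatches.
    No gaps are considered; positions are compared one-to-one.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first, then shorter ones.
    for k in range(longest, min_overlap - 1, -1):
        suffix, prefix = left[-k:], right[:k]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            # The overlap region is taken from left (an arbitrary choice).
            return left + right[k:]
    return None


exact = merge_overlap("ACGTACGTGG", "GTGGTTCCAA")
fuzzy = merge_overlap("AAAACGTT", "CGATGGGG", min_overlap=4, max_mismatches=1)
print(exact, fuzzy)
```

A fixed mismatch count is a simplification: for long overlaps a mismatch fraction may be more appropriate, and a generous tolerance combined with a small min_overlap can accept spurious overlaps.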
Copyright 2019 Wesleyan
University