Sequence Alignment/Conservation: CLUSTAL W and BLASTP
BIOL 265/COMP 113
Computer Laboratory
M. Weir / D. Krizanc / M. Rice
Introduction to sequence alignments: Sequence alignments are used extensively in biology.
In this lab, we will explore sequence alignments from several perspectives.
We will first carry out sequence alignments of
fragments of several proteins. We will then discuss one of the major reasons why
sequence alignments are so important -- the relationship between sequence
alignment, evolutionary conservation, and gene function.
ALIGNING SEQUENCES USING CLUSTALW
Here are several fragments of protein sequences
(in FASTA format) taken from different organisms. Let us use an alignment
program, CLUSTALW, to see if there is sequence conservation between these
sequences. The CLUSTALW program can be run from several web sites including Kyoto, EMBL, or EMBnet.
Paste the list of sequences into the sequence
box and run CLUSTALW (e.g. at Kyoto).
For now, do not worry about program options -- use the default options for
protein sequences. The Kyoto site gives you results in several formats (here is a partial output from a previous
submission): alignment scores for pairs of sequences, a multiple alignment, and alignment trees showing distances
between sequences, which are linked at the bottom of the page.
[If you would like a color block diagram of
the alignment where similar amino acids have the same
color, use the EMBL site and select the
“show colors” option.]
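To get a feel for what a pairwise alignment score summarizes, here is a minimal Python sketch that computes percent identity between rows of an already aligned set of sequences. The sequences and names (seqA, seqB, seqC) are hypothetical toy data, not the fragments used in this lab, and the scores CLUSTALW reports come from the program's own procedure, so they may not match this simple calculation.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns (one possible convention)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical toy alignment (rows are already aligned; "-" marks a gap).
toy_alignment = {
    "seqA": "RRGRQTYTRY",
    "seqB": "RRGRQTYSRY",
    "seqC": "KRGRTAYT-Y",
}

for (name1, row1), (name2, row2) in combinations(toy_alignment.items(), 2):
    print(f"{name1} vs {name2}: {percent_identity(row1, row2):.1f}% identity")
```

Skipping gapped columns is only one convention; counting gaps as mismatches would give lower values for seqC.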
Notice that these sequence
fragments were chosen because they
contain a conserved motif as discussed below. Different organisms have similar
versions of the sequence (sequence conservation).
Notice that the Drosophila sequences are
somewhat different from the vertebrate sequences, and that the nematode
sequence is even less similar. [This property, observed when aligning
sequences from more and less closely related organisms, is used in the design
of sequence alignment algorithms to be discussed in class.]
Also notice that a single organism, e.g.,
Drosophila, has different proteins with similar motifs. Indeed, the sequence
conservation is seen in several other Drosophila proteins. For example, use
CLUSTALW to align the following sequences. [previous run]
A CONSERVED DEVELOPMENTAL GENE
The examples above are artificial in that we
pre-selected sequence fragments that show sequence conservation. Let us imagine
instead that we were studying a protein involved in animal development.
Studies in Drosophila have identified homeotic and segmentation genes which
play a central role in embryo development. When these genes are mutated, the
resulting phenotypes tell us about the functions of these genes.
For example, mutation of the Ubx homeotic gene gives rise
to fruit flies with four wings (right-hand side) instead of the normal
two wings (left-hand side). Analysis of the sequence of the Ubx protein
can provide insights into how the gene functions in development.
You may get the full Ubx
protein sequence (fasta) (previous download)
from NCBI. A fragment of this sequence was included in one of the
sequence alignments above. However, we do not know a priori which regions of a protein
sequence might be conserved. To find conserved regions of a protein, we can use a
pattern searching program such as BLAST (Basic Local
Alignment Search Tool) [Notice that this is local, not global, alignment].
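The difference between local and global alignment can be illustrated with a short Python sketch of the Smith-Waterman local alignment method. This is not how BLAST itself works (BLAST uses a faster heuristic search and a substitution matrix); the scoring values below (match +2, mismatch -1, gap -2) and the two test sequences are assumptions chosen only for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best local score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences that share only an internal block.
score, part_a, part_b = smith_waterman("AAAWFQNRRAAA", "GGGWFQNRRGGG")
print(score, part_a, part_b)
```

Only the shared internal block is reported; the unrelated flanking residues are ignored, whereas a global alignment would be forced to pair them up from end to end.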
SEQUENCE CONSERVATION
In NCBI BLAST, go to Standard protein-protein BLAST [BLASTP]
and paste the complete Ubx sequence (with amino acid
numbers) into the input box and run the BLASTP program (use the "nr"
database). The result may take some time to appear. Once it does, mouse over
the color-coded alignment box. You should see the highly conserved proteins as red
bars. Less well conserved proteins are shown in different
colors (see color key). Click on one of the bars to see the actual sequence
alignment. The input sequence is the query; the subject sequence is the similar
protein sequence. Notice that parts of the input sequence are highly conserved
in many other protein sequences.
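One simple way to make "parts of the sequence are highly conserved" concrete is to score each column of a multiple alignment. The Python sketch below is only an illustration: the three aligned rows are hypothetical toy data, and the measure used (fraction of rows sharing the most common residue) is one of many possible conservation measures, not the one BLAST reports.

```python
from collections import Counter


def column_conservation(aligned):
    """For each column, the fraction of rows that share the most common residue.

    Gaps ("-") never count toward the most common residue.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned rows must have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


# Hypothetical toy alignment with a variable start and a conserved end.
rows = ["MKTAWFQNRR", "MRSAWFQNRR", "-KPGWFQNRR"]
scores = column_conservation(rows)
for row in rows:
    print(row)
# Mark fully conserved columns with "*".
print("".join("*" if s == 1.0 else " " for s in scores))
```

Runs of fully conserved columns stand out as candidate motifs, much like the strongly matching stretches in the BLASTP alignments.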
You may want to restrict the dataset searched
to, e.g., human sequences, using the general BLAST page -- this will show you the
closest human sequences to the Drosophila Ubx
protein. [For your search, use BLASTP, the curated RefSeq protein database (refseq), and limit the search to Homo sapiens.] [A previous
blast search is available here.] Notice the
block of very strong conservation towards the C-terminus of the Ubx protein.
Let us follow up on this apparent
conservation: go to Search for
conserved domains and test the Ubx protein.
You will find that Ubx has a conserved motif called
the homeodomain -- this is the conserved domain that we examined
in the first section. Click on cd00086,
homeodomain, to see information on the motif. Homeodomain-containing proteins are transcription factors
that bind to cis-regulatory DNA of genes that they
regulate. The proteins bind DNA through their homeodomain.
Vertebrates have a large family of proteins with homeodomains,
including the Hox proteins, which play a role in
specifying the appropriate development of groups of cells.
If regions of proteins are conserved, this
suggests that they are important for functioning of the protein -- in this
case, DNA-binding function. We can gain insight into how conserved motifs
function by looking at their structures.
STRUCTURE CONSERVATION
The iCn3D program allows you to see sequence alignments for a conserved motif, and the structure of that motif based on X-ray crystallography or NMR studies. Choose one of the homeodomain family links; e.g., cd00086, Homeodomain. Listed at the bottom of the page is a group of aligned homeodomain-containing protein sequences -- similar to the homeodomain protein sequences we aligned above with CLUSTALW. Enter the homeodomain structure ID (1AKH) into the iCn3D web page.
The homeodomain structure should be loaded. Try using your mouse to rotate the structure (move the mouse on the image while clicking down). Mouse over residues to see their identities.
Let us now relate the 3D structure to the homeodomain sequence. Using the Analysis pulldown menu, apply: Analysis > view sequence and annotation. Switch the sequence window to “Details.” Highlight Arg 122. This should show up as yellow in the structure; this amino acid makes a key contact with the DNA. Switching this amino acid identity changes the DNA-binding specificity of the homeodomain.
Try changing the representation of this Arg residue using the Style pulldown menu; apply Style > side chains > sphere. This will show the Arg side chain which is near the DNA.
Use the Select pulldown menu: Select > all. Now try: Style > nucleotides > stick. This will show all the base-paired nucleotides with their aligned bases. Try looking at the solvent accessible surface area: Style > surface type > solvent accessible.
Although the structures of only some homeodomains
have been determined, it is likely that other homeodomains
have similar structures given that their sequences are conserved.
This example illustrates how protein sequence
motifs can be conserved because they have particular functions and corresponding
structures associated with those functions. Hence, sequence alignments to
identify conserved motifs can provide important insights into possible functions
and structures associated with that protein.
You can visit the page CSH: Homeodomains in development for more information on how
homeodomains function in development. Notice that many of the core mechanisms identified in model
organisms like Drosophila are likely to also operate in other organisms
including humans.
Assignment:
Summarize briefly what is meant by
"sequence conservation" and discuss its relationship to protein
structure.
Propose different kinds of criteria that
might be used to define permitted amino acid substitutions. Think about
theoretical criteria based on amino acid properties, and practical criteria
based on observed sequence alignments (consider how BLOSUM matrix values are computed).
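As a starting point for thinking about criteria based on observed alignments, here is a minimal Python sketch of a BLOSUM-style log-odds calculation: count the residue pairs seen in each alignment column, compare how often each pair is observed with how often it would be expected by chance, and report 2 x log2(observed/expected) rounded to an integer. The three columns are hypothetical toy data, and the sketch leaves out steps used for the real matrices, such as clustering similar sequences before counting.

```python
from collections import Counter
from itertools import combinations
from math import log2


def log_odds_scores(columns):
    """Log-odds scores (half-bit units) for residue pairs seen in the alignment columns."""
    pair_counts = Counter()
    for column in columns:
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1  # unordered pair
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        expected = background[x] ** 2 if x == y else 2 * background[x] * background[y]
        scores[(x, y)] = round(2 * log2(freq / expected))
    return scores


# Hypothetical toy columns: each string is one column of a four-sequence alignment.
toy_columns = ["RRRK", "LLLL", "YYFY"]
for pair, score in sorted(log_odds_scores(toy_columns).items()):
    print(pair, score)
```

Pairs that are never observed get no score here, because the logarithm of zero is undefined. With so few columns the values are not biologically meaningful; the point is only the observed-versus-expected comparison.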
Copyright 2021 Wesleyan University