Information Theoretic Analysis of Sequences:

Drosophila Splice Site Database

BIOL 265/COMP 113 Computer Laboratory

M. Weir / D. Krizanc / M. Rice

To analyze how biological machines such as the spliceosome recognize their target DNA or RNA sequences, it is useful to examine large numbers of aligned target sites. Information-theoretic analysis of these aligned sequences can show which positions are conserved and how strongly.

The Weir research group has constructed a relational database containing nucleotide sequences in the vicinity of 11,161 donor and acceptor splice sites in 3,375 Drosophila cDNAs (Weir and Rice, 2004). All the donors, and separately, the acceptors, are aligned, and the nucleotides at positions -32 to 32 (where the splice site position is 0) are stored in the database.

Several stored procedures are available for analyzing the data set through the web interface at https://numana2.wesleyan.edu/~mweir/igs2/project_files/main/index.htm/. At this site, click "Databases & Tools" in the left-hand menu, choose the option "Use WesQL to run stored procedures on public IGS Splice Site and RNA Databases", and then click the stored procedures tab to see the list of stored procedures. (Outputs from previous runs are also provided below.)

1. Select the stored procedure Compute Splice Site Information and click the Continue button. You will see a number of pre-set parameters, including the cDNA Table (Wesleyan Known cDNA), Intron Table (Wesleyan Known Introns), Splice Site Table (Wesleyan Known Splice Sites), and Minimum Splice Element Length (20).

Change the default parameters by setting the Type of Site to Donor Sites and the Start and Finish positions to -4 and 8.

Click the Execute Stored Procedure button. Executing the procedure generates several HTML tables that contain the following entries:

-Transcripts - number of transcripts meeting filter criteria

-SpliceElements - number of introns

-Sitetype - donor (D) or acceptor (A)

-Nuclposition - nucleotide position with respect to aligned donor or acceptor sites (with splice sites at position 0)

-Information - the information (in bits) at each nucleotide position j, calculated using the formula

Info(s_j) = 2 - [-f_A * log₂(f_A) - f_C * log₂(f_C) - f_G * log₂(f_G) - f_T * log₂(f_T)] - g

where the quantity in brackets is the uncertainty (entropy) at position j based on the frequency of occurrence f_A, ..., f_T of the nucleotides A, ..., T, and the correction factor g depends on the number of splice sites that are being aligned.

-nA, nC, nG, nT - the numbers of nucleotides at position j

-pA, pC, pG, pT - the probability of each nucleotide at position j

[Note: The T in the cDNA corresponds to the U in the RNA.]
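
As a minimal sketch of the Information formula above, the following Python function computes Info(s_j) from the four nucleotide counts at one position. The function name position_information and the example counts are hypothetical. The correction factor g is taken here to be a commonly used small-sample approximation, 3 / (2 ln(2) n) for n aligned sites; this is an assumption, and the stored procedure may compute g differently.

```python
import math

def position_information(counts, correct=True):
    """Return the information (bits) at one aligned nucleotide position.

    counts: dict of nucleotide counts, e.g. {"A": 10, "C": 0, "G": 80, "T": 10}.
    """
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no sequences at this position")
    # Uncertainty (entropy): -sum f * log2(f), with 0 * log2(0) taken as 0.
    entropy = 0.0
    for c in counts.values():
        if c > 0:
            f = c / n
            entropy -= f * math.log2(f)
    # ASSUMPTION: approximate small-sample correction g = 3 / (2 ln 2 * n).
    g = 3.0 / (2.0 * math.log(2) * n) if correct else 0.0
    return 2.0 - entropy - g

# Hypothetical counts, not taken from the database.
invariant = {"A": 0, "C": 0, "G": 1000, "T": 0}
uniform = {"A": 250, "C": 250, "G": 250, "T": 250}
print(position_information(invariant, correct=False))  # fully conserved position
print(position_information(uniform, correct=False))    # no conservation
```

A fully conserved position gives the maximum of 2 bits before correction, and a position where all four nucleotides are equally frequent gives 0 bits. With the correction switched on, both values are slightly lower, and the difference shrinks as the number of aligned sites grows.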

Previous run: input; output; SQL code

(a) How many cDNAs were used? (3,090)

(b) How many introns were analyzed? (10,057)

(d) What are the percentages of occurrence of the predominant nucleotides at these positions? (99.8%, 99.2%)

Store (copy and paste) the HTML tables in Microsoft Excel worksheets; we will use these data in part 3.

2. Repeat part 1 without any restriction on the lengths of the introns and exons (i.e. without restricting the minimum splice length to 20).

Previous run: input; output

(a) What are the percentages of occurrence of the predominant nucleotides at positions D+1 and D+2?

(b) What are some possible reasons why the GT consensus is not as well represented as in part 1? [Consider the algorithm used to compute the splice sites; see Weir and Rice (2004).]

(c) For the set of cDNAs that contain either an intron or an exon with length less than 20, calculate the frequency at each of the two positions of the canonical GT [Hint: compare nucleotide counts from 1(d) and 2(a) using Excel to subtract corresponding counts].
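
The subtraction suggested in the hint for 2(c) can also be sketched outside Excel. The Python function below subtracts the part 1 counts (minimum splice element length 20) from the part 2 counts (no length restriction) at one position and converts the remainder to frequencies. The function name and all counts shown are hypothetical placeholders, not values from the database.

```python
def short_element_frequencies(all_counts, filtered_counts):
    """Frequencies at one position for the sites excluded by the length filter.

    all_counts:      counts from the unrestricted run (part 2).
    filtered_counts: counts from the length-restricted run (part 1).
    """
    diff = {b: all_counts[b] - filtered_counts[b] for b in "ACGT"}
    if any(v < 0 for v in diff.values()):
        raise ValueError("filtered counts exceed unrestricted counts")
    total = sum(diff.values())
    if total == 0:
        raise ValueError("the two runs contain the same sites")
    return {b: diff[b] / total for b in "ACGT"}

# Hypothetical counts at one position (illustration only).
part2_counts = {"A": 30, "C": 20, "G": 1040, "T": 10}
part1_counts = {"A": 2, "C": 0, "G": 998, "T": 0}
freqs = short_element_frequencies(part2_counts, part1_counts)
print(freqs["G"])
```

Apply the same subtraction separately to each of the two positions of the canonical GT, using your own counts from the two runs.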

3. Using the higher quality data set from part 1, identify a consensus nucleotide sequence at the nucleotide positions with greater than 0.5 bits of information.

The sequence [3' UCCAUUCA 5'] in the Drosophila U1 snRNA is thought to bind near the donor site to the consensus sequence.

Draw a diagram of the predicted RNA/RNA base pairing.
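
One way to check a hand-drawn pairing diagram is the short Python sketch below, which marks Watson-Crick pairs between a target written 5' to 3' and the U1 snRNA segment written 3' to 5'. The function name pairing_diagram is hypothetical, only A-U and G-C pairs are counted (G-U wobble pairs are ignored, which is a simplifying assumption), and the target shown is simply the perfect complement of the U1 segment, not the consensus from the database.

```python
# Watson-Crick partners in RNA.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def pairing_diagram(target_5to3, snrna_3to5):
    """Return a three-line text diagram of base pairs between antiparallel strands."""
    if len(target_5to3) != len(snrna_3to5):
        raise ValueError("strands must have the same length")
    marks = "".join(
        "|" if PAIR.get(t) == s else " "
        for t, s in zip(target_5to3, snrna_3to5)
    )
    return "\n".join([
        "5' " + target_5to3 + " 3'",
        "   " + marks,
        "3' " + snrna_3to5 + " 5'",
    ])

u1 = "UCCAUUCA"  # written 3' to 5', as in the text
perfect_target = "".join(PAIR[b] for b in u1)  # perfect complement, 5' to 3'
print(pairing_diagram(perfect_target, u1))
```

Replace perfect_target with your own consensus (with T written as U) to see which positions pair with the snRNA and which do not.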

4. To study the effects of intron length on information values, first set a range of small intron lengths (e.g. 64-80 using the Minimum and Maximum Intron Length parameters; also reset minimum splice element length to 20) and compute the information at both donor and acceptor sites for nucleotide positions -10 to 10. Store the output tables in Excel.

Previous run: input; output

Next, compute the information for positions -10 to 10 using a Minimum Intron Length of 8192. (Note - you will also need to set the Maximum Intron Length to 0.)

Previous run: input; output

(a) Compare the total information (summed over nucleotide positions -10 to 10) at the donor and acceptor sites for the sets of longer and shorter introns. You will find it useful to also graph the information values at EACH nucleotide position in the site. You can graph these values for the longer and shorter introns side-by-side in the same graph.

(b) Compare the amounts of information at each nucleotide position for the sets of longer and shorter introns.

(c) For the positions with large information differences, compare the nucleotide content. How does your result relate to the U1 snRNA binding sequence discussed in part 3?

(d) Do you notice any other differences between the longer and shorter intron data sets? For example, compare the A content in each data set. What conclusions can you draw?

See Weir and Rice (2004) for a more complete analysis of the two datasets.
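
The comparisons in 4(a) and 4(b) can be sketched as follows, assuming the Information column of each output table has been read into a Python dictionary keyed by nucleotide position. The function names and the information values are hypothetical.

```python
def total_information(info_by_position, start=-10, finish=10):
    """Sum the information (bits) over positions start..finish inclusive."""
    return sum(v for pos, v in info_by_position.items() if start <= pos <= finish)

def position_differences(short_info, long_info):
    """Per-position difference (long minus short) for positions present in both sets."""
    shared = sorted(set(short_info) & set(long_info))
    return {pos: long_info[pos] - short_info[pos] for pos in shared}

# Hypothetical information values at four positions (illustration only).
short_introns = {-1: 0.25, 1: 2.0, 2: 1.75, 3: 0.5}
long_introns = {-1: 0.5, 1: 2.0, 2: 1.75, 3: 1.0}

print(total_information(short_introns), total_information(long_introns))
print(position_differences(short_introns, long_introns))
```

Positions with the largest differences are the ones to examine for nucleotide content in 4(c).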

5. In addition to using information to measure the degree of conservation in aligned sequences, we can also measure the individual information of each sequence to see how well it conforms to the conserved pattern.

Suppose t₁, t₂, . . . , t_n is a set of DNA or RNA sequences each of length m. For each base a and each position j between 1 and m, define the weight

w(a,j) = 2 + log₂(f_aj) = 2 - (-log₂(f_aj))

where f_aj denotes the frequency of base a at position j.

The individual information score for a sequence t_i is

infoscore(t_i) = sum {w(t_ij, j) | 1 ≤ j ≤ m }

In other words, the score of a sequence t_i is the sum of the bits contributed by the symbols found at each of the positions in t_i.
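
The definitions above can be sketched in Python as follows. The function names and the tiny reference set are hypothetical. The weights follow the formula w(a,j) = 2 + log₂(f_aj) as written, with no small-sample correction, and a base never observed at a position is given a weight of negative infinity; that last choice is an assumption, and the database tools may handle zero frequencies differently.

```python
import math

def weight_matrix(sequences):
    """Return a list (one entry per position) of {base: weight} dictionaries."""
    m = len(sequences[0])
    if any(len(s) != m for s in sequences):
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    weights = []
    for j in range(m):
        column = [s[j] for s in sequences]
        w = {}
        for base in "ACGT":
            f = column.count(base) / n
            # ASSUMPTION: a base never seen at position j gets weight -infinity.
            w[base] = 2.0 + math.log2(f) if f > 0 else float("-inf")
        weights.append(w)
    return weights

def infoscore(seq, weights):
    """Sum the weights of the bases found at each position of seq."""
    if len(seq) != len(weights):
        raise ValueError("sequence length does not match the weight matrix")
    return sum(weights[j][base] for j, base in enumerate(seq))

# Hypothetical reference set of four aligned two-nucleotide sequences.
reference = ["GT", "GT", "GT", "GA"]
W = weight_matrix(reference)
print(infoscore("GT", W))  # common sequence
print(infoscore("GA", W))  # rarer sequence scores lower
```

A sequence that matches the frequent bases at every position scores close to the maximum, while one containing rare bases scores lower and can be negative.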

(a) You can compute individual information scores using the stored procedure Compute Splice Site Information Matrix in the database dbDrosophilaSplice2. [You need to switch to the new database in the starting WesQL page.]

Compute the individual information scores for the set of Drosophila introns of length 16000-32000 using the default parameters and the following settings:

-Start position = -4

-Finish position = 8

-Site type = Donor

-Number of reference sequences = 250

-Reference set: Minimum intron length = 16000

-Reference set: Maximum intron length = 32000

-Test set: Minimum intron length = 16000

-Test set: Maximum intron length = 32000

Store your output results in Excel. You can use the histogram output to look at the distribution of individual information scores.
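
If you want to rebuild the score distribution yourself from the stored scores, a minimal Python sketch is shown below. The function name, the bin width of 1 bit, and the scores are all hypothetical; the histogram produced by the stored procedure may use different bins.

```python
import math
from collections import Counter

def score_histogram(scores, bin_width=1.0):
    """Count scores per bin; each key is the lower edge of a bin of width bin_width."""
    bins = Counter(math.floor(s / bin_width) * bin_width for s in scores)
    return dict(sorted(bins.items()))

# Hypothetical individual information scores (bits).
scores = [9.5, 8.25, 8.75, 3.0, -1.5]
hist = score_histogram(scores)
for lower_edge, count in hist.items():
    print(lower_edge, "#" * count)
```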

Previous run: input; output

(b) Compute the distribution of scores for introns of length 64 by using the same Reference set as in part (a) and setting Test set Minimum and Maximum intron length = 64. How does this distribution compare with the one in part (a)? Why is there some spread in both of these distributions?

Previous run: input; output; SQL code

(c) The calculation of the individual information scores (infoscore(t_i)) uses a matrix of the weights defined above. In order to see this weight matrix, follow the link

IGS Database Tools > Submit sequence alignments for Information Theoretic Analysis

Paste into the sequence window the 160 donor sequences from your output of reference sequences in part (a) (which you stored in Excel). Adjust the format setting to flat and click the Submit Alignment button.

You can display the weight matrix for the reference sequences by selecting the following options in the View Information Profile & Weight Matrix area:

Show residue counts
Show information weights
Show residue frequencies

and clicking on the View Information Profile button.

Previous run: input; output

(d) Working in Excel, use the weight matrix to calculate the individual information of the sequence AAAGGTAAGTAT (sum the "infoWeight" values for the nucleotide present at each position). In this sequence, the nucleotide at each position is the one with the highest frequency, so the sequence has the highest possible individual information score. [Compare this value with the scores in the distribution in part (a); notice that this "perfect" sequence is rarely if ever seen.]
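
As a cross-check on the Excel calculation, the following Python sketch picks the highest-weight nucleotide at each position of a weight matrix and sums those weights. The function name and the three-position matrix are hypothetical; substitute the "infoWeight" values from your own output.

```python
def best_sequence(weights):
    """Return the highest-scoring sequence and its score for a weight matrix.

    weights: list (one entry per position) of {base: infoWeight} dictionaries.
    """
    seq = "".join(max(col, key=col.get) for col in weights)
    score = sum(max(col.values()) for col in weights)
    return seq, score

# Hypothetical infoWeight values for three positions (illustration only).
example_weights = [
    {"A": 0.5, "C": -1.0, "G": 1.0, "T": -0.5},
    {"A": -2.0, "C": -3.0, "G": 1.75, "T": -1.0},
    {"A": -1.5, "C": -2.0, "G": -3.0, "T": 1.5},
]
print(best_sequence(example_weights))
```

Because the weight increases with frequency, the nucleotide with the highest weight at a position is also the most frequent one, so no other sequence can score higher against the same matrix.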

Assignment

A. Answer the questions in parts 3, 4(a), and 5(d).

B. State at least two general conclusions that you can draw from our analysis of splice sites. What possible molecular-based hypotheses are suggested by the analysis?

References:

Stephens, R.M. and Schneider, T.D. 1992. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J Mol Biol 228:1124-1136.

Weir, M.P. and Rice, M.D. 2004. Ordered Partitioning Reveals Extended Splice Site Consensus Information. Genome Research 14:67-78.

Weir, M., Eaton, M. and Rice, M. 2006. Challenging the spliceosome machine. Genome Biology 7:R3.