Monday, July 07, 2014

Complementing Codons: A Riddle Solved?

For some time now, I've been puzzling over a fairly big riddle, and I think an answer is becoming clear.

The riddle is: Why, in so many organisms, do codons turn up at a rate approximately equal to the rate of usage of reverse-complement codons? Take a good look at the symmetry of the following graph (of codon usage rates in Frankia alni, a bacterium that causes nitrogen-fixing nodules to appear on the roots of alder plants).

Codon usage in Frankia alni. Notice that a given codon's usage corresponds, roughly, to the rate of usage of the corresponding reverse-complement codon.

This graph of codon freequencies in F. alni shows the strange correspondence (which I've commented on before) between codons and their reverse complements. If GGC occurs at a high frequency (which it does, in this organism's protein-coding genes), the reverse-complement codon GCC is also high in frequency. If a codon (say TAA) is low, its reverse complement (TTA) is also low.

I've seen this relationship in many organisms (hundreds, by now); too often to be by chance. The question is why codons so often occur in direct proportion to the rate of occurrence of corresponding reverse complements. It doesn't make sense. The notion of base pairing should not come into play when an organism (or natural selection) chooses codons, because all of a protein gene's codons are collinear, on one and the same strand of DNA; base-pairing rules do not play a role in choosing codons.

Or do they?

I think, in fact, base-pairing does a play a role. The answer is obvious, when you think about it. We know that (single-stranded) RNA, if properly constructed, will fold back on itself to form loops and stems: complementary regions will base-pair with each other. Certainly, if secondary structure in mRNA is widespread, it will have consequences for codon selection. Codons in "stem" regions will complement each other.

And so it's fairly obvious, it seems to me, that a reasonable explanation for the riddle of "reverse complement codon selection" is that secondary structure of mRNA (or possibly single-stranded DNA) is far more pervasive than any of us might have suspected. It's pervasive enough to affect codon usage in the way shown in the graph above.

Is there any evidence that secondary structure is widespread? I think there is. If you go looking for complementary sequences inside protein-coding genes in F. alni, for example, you find many. As a probe, I had a script check for intragenic complementing length-12 sequences ("12-mers") in all 6,711 protein-coding genes of F. alni. (I presented pseudocode for the script in an earlier post.) Based on the known base-composition stats of the organism, I expected to find 5,440 such 12-mer pairs by chance. What I found was 6,319 such pairs located in 2,689 genes. (When I looked for  complementing 13-mers, I expected to find 1,467 occurring by chance, but instead found 3,592 such  pairs in 2,086 genes.) In a previous post, I showed similar results for Sorangium cellulosum (a bacterium with an enormous genome). Previous to that, I showed similar results for Mycoplasma genitalium (which has one of the tiniest genomes of any free-living microbe).

But do these regions of internal complementarity affect codon choice? Indeed they do. When I looked at the top 40% of F. alni genes in terms of the number of internal complementing 12-mers, I found a Pearson correlation between codons and reverse-complement codons of 0.889. Looking at the bottom 60% of genes, I found the correlation to be lower: 0.766. These numbers, moreover, were virtually unchanged (0.888 and 0.763) when I re-calculated the Pearson coefficients using expectation-adjusted codon frequencies. That is to say, I used base composition stats to "predict" the frequencies of each codon, then I subtracted the predicted number from the actual number, for each codon. (Example: The frequency of occurrence of guanine, in F. alni protein genes, is 0.35794, and the frequency of cytosine is 0.37230, hence the expected frequency of GCC is 0.35794 * 0.37230 * 0.37230, or 0.04961. The actual frequency is 0.07802.) The correlation still existed, practically unchanged, after adjusting for expected rates of occurrence of codons.

The bottom line is that the correlation between the frequency of occurrence of a given codon and the frequency of its reverse-complement codon, which is otherwise very hard to explain, is quite readily explained by the presence, in protein-coding genes, of a significant amount of single-strand complementarity (of the type that could be expected to give rise to secondary structure in mRNA). On this basis, it's reasonable to suppose that conserved secondary structure is actually a major driver of codon usage bias.



Please show this post to your biogeek friends; thanks!

Wednesday, July 02, 2014

Pearson vs. Spearman: A Tale of Two Correlations

One of the most rudimentary yet most valuable types of statistics you can calculate for two data sets is their correlation value. Two widely used correlation methods are the Pearson method (which is what we normally think of when we think "correlation coefficient"),  and the Spearman Rank Coefficient method. Which is which? When would you use one rather than the other?

The Pearson method is based on the idea that if Measurement 1 tracks Measurement 2 (whether directly or inversely), you can get some idea of how "linked" they are by calculating Pearson's r (the correlation coefficient), which is a quantity derived from the products of the differences between each M1 and its average and each M2 and its average, duly normalized. The exact formula is here. Rather than talk about the math, I want to talk about the intuitive interpretation. The crucial point is that Pearson's r will be a real value between minus-one and plus-one. Minus-one means the data are negatively correlated, like CEO performance and pay. Okay, that was a lame example. How about age and beauty? No. Wait. That's kind of lame too. How about the mass of a car and its gas mileage? One goes up, the other goes down. That's negative correlation.

The statistical significance of a correlation depends on the magnitude of the correlation and the number of data points used in its computation. To get an idea of how that works, play around with this calculator. Basically, a low correlation value can still be highly significant if there are enough data points. That's the main idea.

Spearman's rank coefficient is similar to Pearson in producing a value from -1 to +1, but you would use Spearman (instead of Pearson) when the rank order of the data are important in some way. Let's consider a couple of examples. A hundred people take a standardized test (like the SAT or GRE), producing 100 English scores and 100 Math scores. You want to know if one is correlated with the other. Pearson's method is the natural choice.

But say you hold a wine-tasting party and you have guests rate ten wines on a decimal scale from zero to ten. You want to know how the judges' scores correlate with the wines' prices. Is the best-tasting wine the most expensive wine? Is the second-best-tasting wine the second-most-expensive? Etc. This is a situation calling for Spearman rather than Pearson. It's perfectly okay to use Pearson here, but you might not be as satisfied with the result. 

In the Spearman test, you would sort the judges' scores to obtain a rank ordering of wines by the "taste test." Then you would calculate scores using the Spearman formula, which values co-rankings rather than covariances per se. Here's the intuitive explanation: Say the most expensive wine costs 200 times more than the cheapest wine. Is it reasonable to expect that the most expensive wine will taste 200 times better than the cheapest wine? Will people score the best wine '10' and the worst wine '0.05'? Probably not. What you're interested in is whether the taste rankings track price in an orderly way. That's what Spearman is designed to find out.

If the taste scores are [10,9.8,8,7.8,7.7,7,6,5,4,2] and the wine prices are [200,44,32,24,22,17,15,12,8,4], the Spearman coefficient (calculator here) will be 1.0, because the taste scores (in rank order) exactly tracked the prices, whereas Pearson's r (calculator here) will be 0.613, because the taste scores didn't vary in magnitude the same way that the prices did. But say the most expensive wine comes in 4th place for taste. In other words, the second array is [44,32,24,200,22,17,15,12,8,4] but the first array is unchanged. Now Spearman gives 0.927 whereas Pearson gives 0.333. The Spearman score achieves statistical significance (p<.001) whereas Pearson does not.

There are plenty of caveats behind each method, which you can and should read up on at Wikipedia or elsewhere. But the main intuition is that if rank order is important for your data, consider Spearman. Otherwise Pearson will probably suffice.

Sunday, June 29, 2014

A Large Genome with Lots of Structure

In contrast to higher life forms, bacteria usually have compact genomes, with few duplicate genes and very little non-coding DNA. But some bacteria, for reasons not entirely understood, accumulate relatively large genomes. A good example is Sorangium cellulosum, a soil-dweller of the Myxococcales group, whose 14-million-base-pair genome (comprising 10,400 protein-coding genes) dwarfs that of E. coli B (with 4.6 million base pairs and 4,205 genes) and makes the 476-gene genome of Mycoplasma genitalium look puny. Bear in mind that the fruit fly genome contains about 14,000 protein genes (although several times that number of proteins may be produced through alternative splicing of exons).

Exactly why S. cellulosum needs more genes than, for example, baker's yeast (with 12 million base pairs and around 6,700 genes) is anybody's guess. It does have many accessory genes for producing secondary metabolites with interesting antifungal, antibacterial, and other properties (including anti-tumor properties). As a result, many labs are busy mining the Sorangium genome for genes of possible commercial importance.
Sorangium cellulosum

I recently decided to poke around inside the genome of S. cellulosum myself, looking for evidence of latent secondary structure (internal complementarity regions) in its genes. I was stunned at what I found. When I had my scripts look for complementing intragenic 11-mers (pairs of complementary sequencces of length 11), I found over 36,000 such pairs in Sorangium's genes.

Next, I went a step further and checked each gene for internal complementing sequences of length 14.

Based on Sorangium's actual A, G, C, and T composition stats, and considering all the kinds of 14-mers that actually exist in the coding regions of the genome, I expected to find 991 matching (complementing) pairs of 14-mers in 10,400 genes. What I actually found were 2,942 matching pairs inside 1,928 genes.

To make this clearer, I plotted the expected number of complementary 14-mers per gene versus the actual number per gene, in a graph:

Expected vs. actual complementing intragenic 14-mers for Sorangium. Expectation statistics were calculated individually, for each gene, based on actual A,G,C,T composition stats and gene length.
The points are arrayed in horizontal lines because while expectations can be calculated to several decimal points, actual occurrences are discrete (whole numbers).

It's fairly evident that length-14 complementary pairs tend to occur at higher than the expected rate(s). In fact, that's true for 98% of occurrences. It might not be obvious (from the above plot) that 98% of points lie above the 1:1 slope line, but that's only because so many points overlap each other.

The bottom line? These results strongly suggest that substantial amounts of secondary structure exist in a significant fraction of Sorangium's genes. The secondary structure could be tied to thermal regulation of gene expression (via RNA thermometers), or some mRNAs could incorporate metallo-sensitive riboswitches; or maybe secondary structure in mRNA is important for certain translocon-targeted genes. There could be other explanations as well. (If you have one, leave a comment.)

Why is this important? For one thing, we need a better understanding of how an organism with 10,400 protein genes regulates and coordinates gene expression. Secondary structure of regulatory and coding-region RNA might well hold important clues. But also, if secondary structure is conserved in large numbers of genes (as I believe it is), it has to affect codon bias. "Complementing" codons would be preferred (at least for certain regions) over non-complementing codons, and this would affect codon choice. It's a factor that has not been considered, to date, in arguments over why codon bias exists.

Friday, June 27, 2014

A Tiny Genome with Lots of Structure

Mycoplasma genitalium is a super-tiny bacterial parasite of the human urinary system, causing about 15% of urinary infections in men. Its genome, encoding just 476 protein-coding genes, is often cited as the smallest genome of any organism that can be grown in pure culture. The genome is so small that scientists have for years used M. genitalium as a kind of litmus test for the essentiality of genes: If a particular kind of gene exists in M. genitalium (the reasoning goes), it must be truly essential for life.
Mycoplasma genitalium

Recently, I looked at the genome of M. genitalium from the point of view of latent nucleic-acid secondary structure. (Secondary structure refers to the ability of a single strand of DNA or RNA to base-pair with itself.) I probed each gene with a script designed to detect intragenic self-complementing regions of length eleven (so-called 11-mers), the idea being that complementary runs of bases shorter than that might occur at a high rate by chance. Given M. genitalium's base composition stats (adenine being most prevalent, at 36.24% of all bases in protein-coding message regions, thymine being next-most-abundant, at 32.21% of protein-coding bases), the most likely 11-mer, 'AAAAAAAAAAA', could be expected to occur by chance once every 70,672 bases. In actuality, that particular 11-mer doesn't occur in the M. genitalium genome, but if it did we could expect around 8 occurrences of it in a genome of 528,500 base pairs. Instead, what we actually find are 749 occurrences of complementary 11-mers in 283 genes.

The procedure used to check for 11-mers is as follows (in pseudocode):

for ( i = 0; i < genes.length; i++ )  // for each gene
    for ( k = 0; k < genes[i].length - 11; k++ ) // for all bases
        sequence = genes[i].substring( k, k + 11 ) // get 11 bases
        complement = getReverseComplement( sequence ) 
        if ( genes[i].match( complement ) ) // a match exists
        if ( matchNotWithinMask( mask ) ) // not inside a previous match
        if ( matchNotInsideSequence( sequence ) ) // not inside sequence 
             matches++    // count it as a hit
             updateMask( )  // update mask

Note that with even-length sequences it would be important to guard against self-matching sequences, since a sequence like AGCT is its own reverse-complement. With length-11 sequences this is not an issue. Nevertheless, it's possible for successive matches to overlap, and I wrote a check to guard against that. That's what  matchNotWithinMask( mask ) and updateMask() are all about.

It should go without saying, but I'll say it anyway: 749 complementing 11-mers in 283 genes is a very substantial (and quite unexpected) amount of internal complementarity. The question is: What's the purpose of all that (putative) secondary structure?

One possibility (which I've talked about in a previous post) is that the secondary structure allows for thermostatic control of gene expression. RNA thermometers are a well established phenomenon and it could be that M. genitalium needs sensitive control over thermal expression of certain genes.

Another possibility is that many genes incorporate magnesium-, manganese-, or calcium-sensitive riboswitches. RNA is a potent chelator of metal ions, particularly doubly-charged ions like Mg+2, which is smaller than sodium or potassium (with twice the charge) and thus has unusually high charge density. It's possible that under conditions of osmotic stress (with high influx of water and low ion concentrations) certain mRNAs relax or uncoil and thereby become translatable. If this is true (if certain genes are under osmotic control), we might expect to find that many M. genitalium genes with high secondary structure potential are membrane-targeted genes tasked with managing the import and export of various things. And that's indeed what we find.

Further below, I present a table with the top 100 M. genitalium genes containing the most putative secondary structure based on the 11-mer probe outlined above. The table shows the gene name, gene product or function (where known), gene size in base pairs, and the number of complementing 11-mers in the gene. The genes tend to fall into only a few categories. About a third of the genes (34 out of 100) are DNA- or RNA-binding genes. A dozen genes are transporters or permeases; another 12 fall in the category of "other membrane-associated genes" (these are marked with asterisks); and ten encode lipoproteins. Sixteen are "hypothetical proteins" (which should probably be reannotated as proteins of unknown function, since we can be fairly sure the genes are expressed).

It's interesting that so many genes with (putative) secondary-structure potential are nucleic acid processing genes. Among genes in this category are ten genes that either acylate or modify transfer RNAs. What's interesting about the latter group is that most involve tRNAs for non-polar amino acids (alanine, leucine, isoleucine, valine, phenylalanine, and methionine). The reason this is interesting is that non-polar amino acids are extensively used in membrane-associated proteins. Thus we have a situation, possibly, in which osmo-switches (osmotically sensitive mRNAs) control the expression of tRNA synthetases for amino acids used in membrane proteins.

It is quite possible that some of the (few) metabolic genes listed in the table are membrane-associated. This is likely true for phosphomannomutase, UDP-galactopyranose mutase, and glycerophosphoryl diester phosphodiesterase, for example. Also interesting is that 2,3-bisphosphoglycerate-independent phosphoglycerate mutase from Thermoplasma has been shown to be manganese-stimulated. Riboswitch modulation of such an enzyme would not be unexpected.

Altogether, 34 to 37 out of 100 genes listed in the table are membrane-associated, and another 7 are tRNA synthetases involving non-polar amino acids (heavily used in membrane proteins), tending to support the hypothesis that M. genitalium uses secondary structure of mRNA (and/or ssDNA) to modulate gene expression in osmotically sensitive manner.

Table 1. Genes with high secondary structure potential in M. genitalium. Genes marked with asterisks are membrane-associated genes other than transporters or permeases. The final column shows the number of complementary 11-mer pairs found in the gene.
Gene Product
Size (bp)
11-mers
MG_468 ABC transporter, permease protein
5353
22
MG_064 ABC transporter, permease protein, putative
3997
17
MG_218 HMW2 cytadherence accessory protein *
5419
16
MG_386 P200 protein
4852
15
MG_414 conserved hypothetical protein
3112
13
MG_075 116 kDa surface antigen *
3076
12
MG_422 conserved hypothetical protein
2509
12
MG_244 UvrD/REP helicase
2113
11
MG_191 MgPa adhesin *
4336
11
MG_292 alanyl-tRNA synthetase
2704
10
MG_018 helicase SNF2 family, putative
3097
10
MG_298 chromosome segregation protein SMC
2950
10
MG_345 isoleucyl-tRNA synthetase
2689
9
MG_338 lipoprotein, putative
3814
9
MG_307 lipoprotein, putative
3535
8
MG_031 DNA polymerase III, alpha subunit, Gram-positive type
4357
8
MG_080 oligopeptide ABC transporter, ATP-binding protein
2548
8
MG_321 lipoprotein, putative
2806
7
MG_340 DNA-directed RNA polymerase, beta' subunit
3880
7
MG_069 PTS system, glucose-specific IIABC component *
2728
7
MG_390 ABC transporter, ATP-binding/permease protein
1984
7
MG_525 conserved hypothetical protein
1996
7
MG_226 amino acid-polyamine-organocation (APC) permease family protein
1480
6
MG_378 arginyl-tRNA synthetase
1615
6
MG_192 P110 protein
3163
6
MG_312 HMW1 cytadherence accessory protein *
3421
6
MG_328 conserved hypothetical protein
2272
6
MG_341 DNA-directed RNA polymerase, beta subunit
4174
6
MG_291 phosphonate ABC transporter, permease protein (P69), putative
1633
5
MG_001 DNA polymerase III, beta subunit
1144
5
MG_411 phosphate ABC transporter, permease protein PstA
1966
5
MG_136 lysyl-tRNA synthetase
1474
5
MG_336 aminotransferase, class V
1228
5
MG_261 DNA polymerase III, alpha subunit
2626
5
MG_430 2,3-bisphosphoglycerate-independent phosphoglycerate mutase
1525
5
MG_053 phosphoglucomutase/phosphomannomutase, putative
1654
5
MG_195 phenylalanyl-tRNA synthetase, beta subunit
2422
5
MG_260 lipoprotein, putative
2299
5
MG_277 membrane protein, putative *
2914
5
MG_366 conserved hypothetical protein
2005
5
MG_250 DNA primase
1825
4
MG_447 membrane protein, putative *
1645
4
MG_254 DNA ligase, NAD-dependent
1981
4
MG_003 DNA gyrase, B subunit
1954
4
MG_397 conserved hypothetical protein
1702
4
MG_096 conserved hypothetical protein
1954
4
MG_334 valyl-tRNA synthetase
2515
4
MG_141 transcription termination factor NusA
1597
4
MG_241 conserved hypothetical protein
1864
4
MG_278 GTP pyrophosphokinase
2164
4
MG_012 alpha-L-glutamate ligases, RimK family, putative
865
4
MG_123 conserved hypothetical protein
1417
4
MG_266 leucyl-tRNA synthetase
2380
4
MG_119 ABC transporter, ATP-binding protein
1696
4
MG_068 lipoprotein, putative
1426
3
MG_047 S-adenosylmethionine synthetase
1153
3
MG_456 conserved hypothetical protein
1006
3
MG_122 DNA topoisomerase I
2131
3
MG_421 excinuclease ABC, A subunit
2866
3
MG_223 conserved hypothetical protein
1237
3
MG_375 threonyl-tRNA synthetase
1696
3
MG_364 expressed protein of unknown function
676
3
MG_185 lipoprotein, putative
2107
3
MG_423 conserved hypothetical protein
1687
3
MG_067 lipoprotein, putative
1552
3
MG_008 tRNA modification GTPase TrmE
1330
3
MG_306 membrane protein, putative *
1183
3
MG_242 expressed protein of unknown function
1894
3
MG_040 lipoprotein, putative
1777
3
MG_281 conserved hypothetical protein
1672
3
MG_259 modification methylase, HemK family
1372
3
MG_204 DNA topoisomerase IV, A subunit
2347
3
MG_216 pyruvate kinase
1528
3
MG_229 ribonucleoside-diphosphate reductase, beta chain
1024
3
MG_187 ABC transporter, ATP-binding protein
1759
3
MG_094 replicative DNA helicase
1408
3
MG_184 adenine-specific DNA modification methylase
955
3
MG_303 metal ion ABC transporter, ATP-binding protein, putative
1075
3
MG_045 spermidine/putrescine ABC transporter, spermidine/putrescine binding protein, putative
1453
3
MG_051 pyrimidine-nucleoside phosphorylase
1267
3
MG_004 DNA gyrase, A subunit
2512
3
MG_385 glycerophosphoryl diester phosphodiesterase family protein *
712
3
MG_360 ImpB/MucB/SamB family protein
1237
3
MG_457 ATP-dependent metalloprotease FtsH *
2110
3
MG_065 ABC transporter, ATP-binding protein
1402
3
MG_419 DNA polymerase III, subunit gamma and tau
1795
3
MG_295 tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase
1105
3
MG_072 preprotein translocase, SecA subunit *
2422
3
MG_464 membrane protein, putative *
1159
3
MG_089 translation elongation factor G
2068
2
MG_439 lipoprotein, putative
820
2
MG_309 lipoprotein, putative
3679
2
MG_032 conserved hypothetical protein
2002
2
MG_194 phenylalanyl-tRNA synthetase, alpha subunit
1027
2
MG_137 UDP-galactopyranose mutase
1216
2
MG_203 DNA topoisomerase IV, B subunit
1903
2
MG_021 methionyl-tRNA synthetase
1540
2
MG_029 DJ-1/PfpI family protein
562
2
MG_255 conserved hypothetical protein
1099
2
MG_314 conserved hypothetical protein
1333
2

Thursday, June 26, 2014

Books Everyone Says Everyone Should Read

I found the graphic below in David McCandless's book Information Is Beautiful and felt it was worth sharing; I couldn't stop looking at it. Click to enlarge.


Sad to say, I've read shockingly few of these titles (maybe twenty percent of them?). Some, I read at too early an age and remember dimly. Most of the ones I did read do, IMHO, deserve to be on this list. I take issue, however, with The Da Vinci Code (an execrable dung-pile between two covers) and have to wonder why The Chronicles of Narnia made the list but Gravity's Rainbow didn't. Certainly Catch-22 deserves its prominent position in the word-cloud, but does the turgid Dune (look to the right of Lolita) deserve to displace The Road or [pick your favorite dystopian epic]? Does Stranger in a Strange Land (a wooden, laughably kitsch fable about a man from Mars) even hold a candle to a book like Flowers for Algernon? Does Ayn Rand rate not one, but two placements (for The Fountainhead and Atlas Shrugged)? Really??

Of course, David McCandless did not make this list. No single person did. It was drawn from Oprah's Book Club List, Goodreads.com, and other public sources. (Read the fine print at the bottom.) Garbage in, garbage out. Still, as infographics go, and considering how many word-clouds all of us (by now) have seen, it's surprisingly compelling.