Thursday, August 21, 2014

Complexity Begets Charm

Giamatti and Church in Sideways.
Today's post is reblogged (with permission) from

It took me ten years to finally get around to watching the quirky 2004 bromance Sideways, starring Paul Giamatti, Thomas Haden Church, Virginia Madsen, and the ever-piquant Sandra Oh. While the film does many things well and is, on the whole, watchable, it's also the kind of highly forgettable flick that will spawn no sitcoms, inspire no remakes, prompt no sequels, but in all likelihood simply linger in cable purgatory for untold aeons, to perplex future millions.

One of the film's main characters is not a person but a thing: wine. (Stop. There's Lesson No. 1. Are you making at least one thing into a character?) Two buddies go on a road trip (and try to get laid) before one of them has to appear at his wedding; that's the logline. (Already forgettable, right?) But the road trip involves winery-hopping, and we get to watch and listen as characters constantly fuss over wines, swirling them, sipping them, describing their "sturdiness" or their spicy oakiness, or their past-their-peakness, etc., to the point where you want to reach for a paper bag and puke up something well past its primeness.

One of the failings of the movie, IMHO, is that for all the wine talk, it fails to draw parallels between the characteristics of wine and the themes of the story. The opportunities for synergistic symbolism were (dare I say) ripe. They mostly went (dare I say) to waste.

But the script (by Alexander Payne and Jim Taylor) did do a lot of things right, and I suspect Rex Pickett, in writing the novel on which the film is based, did even more of those things right. I'm talking mostly about character development. The characters could easily have been cardboard cutouts, stereotypical bachelor-buddies, and in some ways they were, of course; but below the surface-level depictions lurked dark, thwarted dreams, conflicting desires, countervailing motivations, self-doubts, under-realized virtues, and unexpected flaws as well as unexpected behaviors. All of these sorts of things make the characters more believable (if not always more sympathetic), and that's what a reader or a viewer wants: real people with real faults, conflicting desires, complex agendas, the capacity for change (maybe only in some ways, not in all ways); and a capacity for optimism, despite an overlay of gloom. (In Sideways, the Giamatti character is a depressed school teacher whose novels keep getting rejected.)

A character with no faults is a caricature. No real human being is all good or all bad. Hitler had a dog, for crying out loud. (Of course, he famously euthanized it in the end.)

Just as a wine can have many small flaws and still exude charm, so can (and should) the characters in your story. Give all your characters real flaws, along with inner turmoil. Be sure your villain is not one-hundred-and-ten-percent villainous; give him or her some endearing qualities, and show him or her to have inner doubts as well.

And by the end of the story, force the characters to confront some demons. Force them to change, or at least consider what they lost by not changing.

People are complex, in the real world. Make it so in your story-world.

Get free tips in your inbox: sign up for the Author-Zone Newsletter.

Monday, August 18, 2014

When Is It Really Necessary to Hold a Meeting?

Jason Fried explains very nicely, in a 15-minute TED talk (watch it here), why the modern office is a singularly poor place in which to get work done. Bottom line: The modern office is specifically designed to facilitate interruptions. It's what one might call a high-interruptivity environment (by intent; by design).

For much the same reason that eight one-hour naps interspersed with short periods of wakefulness do not add up to a full night's sleep, a succession of 15- and 20-minute periods of work (in an office environment where you're interrupted by texts, phone calls, e-mails, IM, meetings, coworkers coming in to see you, bosses checking on you, trips to the coffee machine, and so forth) seldom adds up to a day's work. To get serious work done requires long periods of uninterrupted private time. This is why so many people say that their favorite place to get serious work done is the back porch, the basement, the attic, the shower, the library, the park, Starbucks, etc., or (if it's indeed the office) the office after everyone leaves to go home, or before everyone arrives in the morning. No one names the office during normal working hours as their most productive environment.

Through technologies like WebEx and GoToMeeting, we're now able to introduce these same productivity losses even to people working from home.

Toward the end of his TED talk, Jason Fried suggests, as an exercise (not as a policy), that any manager who has a meeting planned for Monday simply cancel (and not reschedule) the meeting and see what happens. Does the world come to an end? Do sales drop? Do people stop getting work done? Or, to the contrary, does the cancellation of a one-hour meeting involving ten people result in ten additional person-hours of productivity, reclaimed for free?

I wish Fried had taken his own logic to its ultimate conclusion, which is that meetings (particularly regularly scheduled staff meetings) should simply be abolished, outlawed.

Over the years, I've worked for small, medium, and large companies, mostly in software development, both as an individual contributor and as a manager. As a manager at Novell, I attended an average of three meetings a week, or about 1000 meetings total. At Adobe, I attended a weekly status meeting (which was often cancelled, to no ill effect whatever) plus one or two division-level all-hands meetings per quarter, and one to two company-level all-hands meetings per quarter. I can say without reservation that all meetings I ever attended at Novell and Adobe, except for some one-on-one and impromptu 3- or 4-person meetings, were utterly without effect, aside from the undeniable effect of resetting everyone's productivity to zero at the time of the meeting.

You might ask yourself, if you're involved in setting up meetings, what it says about the fragility of your company or the ineffectuality of your organization (or your management style) if the success of the enterprise is somehow compromised by not holding that weekly status meeting, or that monthly planning meeting, or that quarterly all-hands meeting that you (vainly, stupidly) imagine is so necessary to inspiring team spirit and getting everyone "on the same page." In all my time at big companies (and smaller ones too) I can't remember a single meeting being indispensable (except, as I say, for a few one-on-one or two-on-one events). The meetings invariably distracted me from getting real work done. The bigger the meeting, the worse the productivity loss.

Whenever I missed a meeting, I found out later that I either missed nothing substantive whatsoever, or I could easily (and in a few minutes' time) catch up on whatever I'd missed simply by reading a transcript, a key memo, or someone's "takeaways" summary.

Meetings are a classic example of a central-command push technology that should be replaced by an opt-in/pull technology. Quarterly all-hands meetings, in particular, are surreal in the way they evoke 1980s business culture. Can we get beyond the 1980s? Just post what I need to know somewhere. Drop me a memo. I'll look at it when it's time for me to look at it. When I can spare some time to be nonproductive.

Saturday, August 09, 2014

Hachette vs. Amazon: A Trader's Perspective

Hachette and Amazon are locked in a pricing dispute that's gone on for some time now. It may eventually be resolved, although frankly, it could well go unresolved, too; Hachette doesn't have to cave in to Amazon's demands. Hachette may well hold its hard-line position forever (and suffer the consequences in terms of poor placement on Amazon's sites, shipping delays by Amazon on Hachette books, and so forth). But many people believe the dispute will eventually be resolved. Is there a way for investors to come out ahead?

Bear in mind, none of what follows constitutes investing advice. I mention it as a hypothetical scenario.

Here's how some investors would play this situation. We know that if the dispute is suddenly resolved, there's a good chance Amazon shares will go up in value. But there's also a chance Hachette will pull the plug on Amazon and announce a deal with another (r)etailer, in which case Amazon shares would likely plummet. A resolution to the dispute could send AMZN shares sharply higher, or sharply lower.

This is a classic example of a situation in which an options trader would hope to make money on a straddle. A straddle is where you buy a call option and a put option (one is a bullish bet; the other a bearish bet), simultaneously, at the same strike price.

Let's say you think the dispute will be resolved in the next 30 days. You might decide to buy an October AMZN call option for $16.05 (full contract price: $1605, covering 100 shares) at the $315 strike. An October 2014 put at the same strike is $13.95 ($1395). Total cost: $3000. (These prices are based on the 8 Aug 2014 closing prices.)

If the Hachette dispute is resolved and the resolution moves the AMZN share price by $30 (which is not unreasonable; it's less than 10% of Amazon's current share price), one option will move into the money and the other will move out of the money. Currently, a $30 into-the-money move on the call makes it worth $37.30, while a $30 out-of-the-money move makes it worth $5.25. On the put, a $30 move makes it worth either $33.20 (if the stock goes down $30) or $4.55 (if the stock moves up). In one case you end up with a total of $4185, and in the other case you end up with $3845. Your initial investment was $3000.
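For concreteness, the arithmetic above can be sketched in a few lines of Python. The prices are simply the 8 Aug 2014 quotes from the example; this is an illustration of the payoff math, not investing advice:

```python
# Straddle economics from the example above: one $315-strike call at $16.05
# and one $315-strike put at $13.95, each contract covering 100 shares.
SHARES = 100
cost = (16.05 + 13.95) * SHARES   # $3,000 to open the straddle

# Option values quoted above for a $30 move in AMZN:
up   = (37.30 + 4.55) * SHARES    # stock rises $30: call deep in the money
down = (5.25 + 33.20) * SHARES    # stock falls $30: put deep in the money

profit_up, profit_down = up - cost, down - cost
```

Either direction of a $30 move leaves you ahead ($1185 or $845, before commissions); the real risk is that the stock goes nowhere while both options decay.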

If you think the dispute won't resolve in September (and there's certainly no indication that it will), you could choose to buy January options. The prices of the options will be higher but the straddle would still work, in theory. The problem with all options contracts, though, is that they suffer substantial price decay over time (for January contracts, figure roughly ten to twelve cents a share per day, per option). Bottom line, 40 days' worth of waiting could cost you $800+ in decay (on one call and one put). So there's a substantial penalty for having to wait. And it looks like this dispute could go on for some time. In fact, it could go on forever, since the parties are not obligated to reach an agreement.

Also, bear in mind there's no guarantee AMZN stock price will move $30/share when the Hachette dispute is resolved (if indeed it ever is resolved). It could move a lot less, or not at all.

So use caution. This is far from a risk-free strategy.

Disclaimer: Consult your investment advisor (I am not one) before doing any trades.

Wednesday, August 06, 2014

A Workflow Wish List

In my day job, I evaluate content management systems (enterprise-grade systems, not light-duty blogware), and I get to see and touch a lot of so-called workflow systems. This is the part of the system that (for example) routes documents to people who can approve them (or not) before they're pushed out to a web site.

What I find is that almost all enterprise-grade content management systems have a workflow subsystem of some kind, but it's typically not well implemented. I understand the reason for this. Workflow, like so many things in life, is easy to do, hard to do well. It's hard to make a good workflow system. That's the bottom line.

Still, there are certain things I like to see. (If you're in the market for an enterprise-grade CMS, here's your checklist.)

1. A good visual designer. I like to be able to plop new activities down on a canvas, in arbitrary locations (and give them arbitrary names), and move them around, then draw links between them.

2. There should be a small number of easy-to-understand activity types.

3. I should be able to see what the input and output documents (including flow metadata) are for any given activity in the flow. Ideally, every activity should be capable of taking one or more XML documents in and spitting one or more XML documents out.

4. I should be able to draw arrows backwards from any activity to an upstream activity. In other words, the system shouldn't support only acyclic graphs; loops (for example, rework cycles) should be possible.

5. Each activity should support the notion of a timeout value.

6. Each activity should support the notion of a retry count.

7. Each activity should support the notion of a retry interval.

8. The visual designer should let me (through some kind of easy gesture, whether a mouse click, doubleclick, right-click, etc.) expose an editable property sheet for the activity, where I can inspect and set values like timeout and retry interval.

9. I should be able to specify a logging activity at each step of the flow. The more configurable the logging activity, the better.

10. I should be able to specify a fault handler for any given activity. Including a default fault handler.

11. There should be an easy facility in the GUI for attaching JavaScript logic to any step. (And please don't invent a new scripting language. Use something standard. That means JavaScript.) Ideally, each step in the flow will have event handlers to which I can attach my own scripts. Scripts should have access to any XML that travels with the flow.

12. Ideally (and I know this is a lot to ask), the design-time tool will have an animation capability (for testing workflows) so that I can do certain kinds of QA testing right in the design environment.

The real point of Item 12 is that it should be easy to test a workflow and debug it. Very seldom is this actually the case, even in systems costing $50,000.
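Items 5 through 7 and 10 on the list are easy to picture as a property sheet. Here's a minimal, hypothetical sketch (the class and field names are mine, not drawn from any actual CMS):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Activity:
    """Hypothetical per-activity settings (checklist items 5-7 and 10)."""
    name: str
    timeout_seconds: float = 60.0
    retry_count: int = 3
    retry_interval_seconds: float = 10.0
    fault_handler: Optional[Callable[[Exception], None]] = None

def run_activity(activity: Activity, work: Callable[[], object]):
    """Run `work` up to retry_count + 1 times; if every attempt fails,
    route the last error to the activity's fault handler."""
    last_error = None
    for _attempt in range(activity.retry_count + 1):
        try:
            return work()
        except Exception as err:
            last_error = err
            # A real engine would sleep retry_interval_seconds here and
            # enforce timeout_seconds around each attempt.
    if activity.fault_handler is not None:
        activity.fault_handler(last_error)
    return None
```

The point of the sketch is that these four properties buy you a lot of robustness for very little design-surface, which is why their absence from real products is so conspicuous.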

In addition to the 12 items above, a system that really "goes the distance" will also have some way to delegate approval tasks.

If all workflow systems supported these features, my job would be easier, and (needless to say) so would the jobs of the people who buy and use these systems. And that's the real point.

Monday, July 07, 2014

Complementing Codons: A Riddle Solved?

For some time now, I've been puzzling over a fairly big riddle, and I think an answer is becoming clear.

The riddle is: Why, in so many organisms, do codons turn up at a rate approximately equal to the rate of usage of reverse-complement codons? Take a good look at the symmetry of the following graph (of codon usage rates in Frankia alni, a bacterium that causes nitrogen-fixing nodules to appear on the roots of alder plants).

Codon usage in Frankia alni. Notice that a given codon's usage corresponds, roughly, to the rate of usage of the corresponding reverse-complement codon.

This graph of codon frequencies in F. alni shows the strange correspondence (which I've commented on before) between codons and their reverse complements. If GGC occurs at a high frequency (which it does, in this organism's protein-coding genes), the reverse-complement codon GCC is also high in frequency. If a codon (say TAA) is low, its reverse complement (TTA) is also low.

I've seen this relationship in many organisms (hundreds, by now); too often to be by chance. The question is why codons so often occur in direct proportion to the rate of occurrence of corresponding reverse complements. It doesn't make sense. The notion of base pairing should not come into play when an organism (or natural selection) chooses codons, because all of a protein gene's codons are collinear, on one and the same strand of DNA; base-pairing rules do not play a role in choosing codons.

Or do they?

I think, in fact, base-pairing does play a role. The answer is obvious, when you think about it. We know that (single-stranded) RNA, if properly constructed, will fold back on itself to form loops and stems: complementary regions will base-pair with each other. Certainly, if secondary structure in mRNA is widespread, it will have consequences for codon selection. Codons in "stem" regions will complement each other.

And so it's fairly obvious, it seems to me, that a reasonable explanation for the riddle of "reverse complement codon selection" is that secondary structure of mRNA (or possibly single-stranded DNA) is far more pervasive than any of us might have suspected. It's pervasive enough to affect codon usage in the way shown in the graph above.

Is there any evidence that secondary structure is widespread? I think there is. If you go looking for complementary sequences inside protein-coding genes in F. alni, for example, you find many. As a probe, I had a script check for intragenic complementing length-12 sequences ("12-mers") in all 6,711 protein-coding genes of F. alni. (I presented pseudocode for the script in an earlier post.) Based on the known base-composition stats of the organism, I expected to find 5,440 such 12-mer pairs by chance. What I found was 6,319 such pairs located in 2,689 genes. (When I looked for complementing 13-mers, I expected to find 1,467 occurring by chance, but instead found 3,592 such pairs in 2,086 genes.) In a previous post, I showed similar results for Sorangium cellulosum (a bacterium with an enormous genome). Previous to that, I showed similar results for Mycoplasma genitalium (which has one of the tiniest genomes of any free-living microbe).
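A brute-force version of that probe fits in a few lines of Python. This is my reconstruction of the idea, not the pseudocode from the earlier post; note that it counts overlapping windows:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def complementing_pairs(gene: str, k: int = 12) -> int:
    """Count window pairs (i < j) where the k-mer at j is the exact
    reverse complement of the k-mer at i. O(n^2), but fine as a probe."""
    windows = [gene[i:i + k] for i in range(len(gene) - k + 1)]
    count = 0
    for i in range(len(windows)):
        target = revcomp(windows[i])
        for j in range(i + 1, len(windows)):
            if windows[j] == target:
                count += 1
    return count
```

Run over every coding sequence in a genome, a counter like this gives the observed tallies; comparing them to the chance expectation is what reveals the excess.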

But do these regions of internal complementarity affect codon choice? Indeed they do. When I looked at the top 40% of F. alni genes in terms of the number of internal complementing 12-mers, I found a Pearson correlation between codons and reverse-complement codons of 0.889. Looking at the bottom 60% of genes, I found the correlation to be lower: 0.766. These numbers, moreover, were virtually unchanged (0.888 and 0.763) when I re-calculated the Pearson coefficients using expectation-adjusted codon frequencies. That is to say, I used base composition stats to "predict" the frequencies of each codon, then I subtracted the predicted number from the actual number, for each codon. (Example: The frequency of occurrence of guanine, in F. alni protein genes, is 0.35794, and the frequency of cytosine is 0.37230, hence the expected frequency of GCC is 0.35794 * 0.37230 * 0.37230, or 0.04961. The actual frequency is 0.07802.) The correlation still existed, practically unchanged, after adjusting for expected rates of occurrence of codons.
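Both calculations in that paragraph (the codon-versus-reverse-complement correlation, and the base-composition expectation) are straightforward to sketch. The Pearson helper below is the textbook formula, and the base frequencies used in the GCC example are the F. alni figures quoted above; everything else is generic:

```python
from math import sqrt

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def revcomp_codon(codon: str) -> str:
    """Reverse complement of a codon, e.g. GGC -> GCC."""
    return "".join(COMPLEMENT[b] for b in reversed(codon))

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def revcomp_codon_correlation(freq):
    """Pearson r between each codon's frequency and the frequency of
    its reverse complement, over whatever codons `freq` contains."""
    codons = sorted(freq)
    return pearson([freq[c] for c in codons],
                   [freq[revcomp_codon(c)] for c in codons])

def expected_codon_freq(codon, base_freq):
    """Expected codon frequency from base composition alone, e.g.
    GCC in F. alni: 0.35794 * 0.37230 * 0.37230 = 0.04961."""
    f = 1.0
    for b in codon:
        f *= base_freq[b]
    return f
```

Subtracting `expected_codon_freq` from each observed frequency before correlating is exactly the expectation-adjustment step described above.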

The bottom line is that the correlation between the frequency of occurrence of a given codon and the frequency of its reverse-complement codon, which is otherwise very hard to explain, is quite readily explained by the presence, in protein-coding genes, of a significant amount of single-strand complementarity (of the type that could be expected to give rise to secondary structure in mRNA). On this basis, it's reasonable to suppose that conserved secondary structure is actually a major driver of codon usage bias.

Please show this post to your biogeek friends; thanks!

Wednesday, July 02, 2014

Pearson vs. Spearman: A Tale of Two Correlations

One of the most rudimentary yet most valuable types of statistics you can calculate for two data sets is their correlation value. Two widely used correlation methods are the Pearson method (which is what we normally think of when we think "correlation coefficient"), and the Spearman Rank Coefficient method. Which is which? When would you use one rather than the other?

The Pearson method is based on the idea that if Measurement 1 tracks Measurement 2 (whether directly or inversely), you can get some idea of how "linked" they are by calculating Pearson's r (the correlation coefficient), which is a quantity derived from the products of the differences between each M1 and its average and each M2 and its average, duly normalized. The exact formula is here. Rather than talk about the math, I want to talk about the intuitive interpretation. The crucial point is that Pearson's r will be a real value between minus-one and plus-one. Minus-one means the data are negatively correlated, like CEO performance and pay. Okay, that was a lame example. How about age and beauty? No. Wait. That's kind of lame too. How about the mass of a car and its gas mileage? One goes up, the other goes down. That's negative correlation.

The statistical significance of a correlation depends on the magnitude of the correlation and the number of data points used in its computation. To get an idea of how that works, play around with this calculator. Basically, a low correlation value can still be highly significant if there are enough data points. That's the main idea.
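The standard way to make that concrete is the t-statistic for a Pearson correlation: t = r * sqrt((n - 2)/(1 - r^2)), with n - 2 degrees of freedom. A quick sketch (the "roughly 2" cutoff below is the usual rule of thumb, not an exact p-value):

```python
from math import sqrt

def correlation_t(r: float, n: int) -> float:
    """t-statistic for testing whether Pearson's r differs from zero,
    with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))

# The same weak correlation, r = 0.2, at two sample sizes:
t_small = correlation_t(0.2, 12)    # ~0.65, well under ~2: not significant
t_large = correlation_t(0.2, 1000)  # ~6.4, far above ~2: highly significant
```

Same r, wildly different significance; the sample size does the work.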

Spearman's rank coefficient is similar to Pearson in producing a value from -1 to +1, but you would use Spearman (instead of Pearson) when the rank order of the data is important in some way. Let's consider a couple of examples. A hundred people take a standardized test (like the SAT or GRE), producing 100 English scores and 100 Math scores. You want to know if one is correlated with the other. Pearson's method is the natural choice.

But say you hold a wine-tasting party and you have guests rate ten wines on a decimal scale from zero to ten. You want to know how the judges' scores correlate with the wines' prices. Is the best-tasting wine the most expensive wine? Is the second-best-tasting wine the second-most-expensive? Etc. This is a situation calling for Spearman rather than Pearson. It's perfectly okay to use Pearson here, but you might not be as satisfied with the result. 

In the Spearman test, you would sort the judges' scores to obtain a rank ordering of wines by the "taste test." Then you would calculate scores using the Spearman formula, which values co-rankings rather than covariances per se. Here's the intuitive explanation: Say the most expensive wine costs 200 times more than the cheapest wine. Is it reasonable to expect that the most expensive wine will taste 200 times better than the cheapest wine? Will people score the best wine '10' and the worst wine '0.05'? Probably not. What you're interested in is whether the taste rankings track price in an orderly way. That's what Spearman is designed to find out.

If the taste scores are [10,9.8,8,7.8,7.7,7,6,5,4,2] and the wine prices are [200,44,32,24,22,17,15,12,8,4], the Spearman coefficient (calculator here) will be 1.0, because the taste scores (in rank order) exactly tracked the prices, whereas Pearson's r (calculator here) will be 0.613, because the taste scores didn't vary in magnitude the same way that the prices did. But say the most expensive wine comes in 4th place for taste. In other words, the second array is [44,32,24,200,22,17,15,12,8,4] but the first array is unchanged. Now Spearman gives 0.927 whereas Pearson gives 0.333. The Spearman score achieves statistical significance (p<.001) whereas Pearson does not.
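The wine example is easy to reproduce yourself. Here's a self-contained sketch in Python (Spearman is computed via the no-ties rank formula, so no stats library is needed; with tied scores you'd want a proper implementation):

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

def spearman(xs, ys):
    """Spearman rho via 1 - 6*sum(d^2)/(n*(n^2-1)); assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

taste  = [10, 9.8, 8, 7.8, 7.7, 7, 6, 5, 4, 2]
price1 = [200, 44, 32, 24, 22, 17, 15, 12, 8, 4]  # ranks track exactly
price2 = [44, 32, 24, 200, 22, 17, 15, 12, 8, 4]  # priciest wine tastes 4th-best
```

With price1, spearman(taste, price1) is exactly 1.0 while pearson(taste, price1) is only about 0.61; with price2, Spearman drops just to about 0.93 while Pearson falls to about 0.33, matching the figures above.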

There are plenty of caveats behind each method, which you can and should read up on at Wikipedia or elsewhere. But the main intuition is that if rank order is important for your data, consider Spearman. Otherwise Pearson will probably suffice.

Sunday, June 29, 2014

A Large Genome with Lots of Structure

In contrast to higher life forms, bacteria usually have compact genomes, with few duplicate genes and very little non-coding DNA. But some bacteria, for reasons not entirely understood, accumulate relatively large genomes. A good example is Sorangium cellulosum, a soil-dweller of the Myxococcales group, whose 14-million-base-pair genome (comprising 10,400 protein-coding genes) dwarfs that of E. coli B (with 4.6 million base pairs and 4,205 genes) and makes the 476-gene genome of Mycoplasma genitalium look puny. Bear in mind that the fruit fly genome contains about 14,000 protein genes (although several times that number of proteins may be produced through alternative splicing of exons).

Exactly why S. cellulosum needs more genes than, for example, baker's yeast (with 12 million base pairs and around 6,700 genes) is anybody's guess. It does have many accessory genes for producing secondary metabolites with interesting antifungal, antibacterial, and other properties (including anti-tumor properties). As a result, many labs are busy mining the Sorangium genome for genes of possible commercial importance.
Sorangium cellulosum

I recently decided to poke around inside the genome of S. cellulosum myself, looking for evidence of latent secondary structure (internal complementarity regions) in its genes. I was stunned at what I found. When I had my scripts look for complementing intragenic 11-mers (pairs of complementary sequences of length 11), I found over 36,000 such pairs in Sorangium's genes.

Next, I went a step further and checked each gene for internal complementing sequences of length 14.

Based on Sorangium's actual A, G, C, and T composition stats, and considering all the kinds of 14-mers that actually exist in the coding regions of the genome, I expected to find 991 matching (complementing) pairs of 14-mers in 10,400 genes. What I actually found were 2,942 matching pairs inside 1,928 genes.
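The chance expectation can be approximated per gene from base composition alone. A sketch of the calculation (the base frequencies below are placeholders standing in for Sorangium's roughly 72% GC content, not the exact published values):

```python
K = 14
# Hypothetical GC-rich base frequencies (placeholders, ~72% GC):
p = {"A": 0.14, "C": 0.36, "G": 0.36, "T": 0.14}

# Probability that two independent random bases are Watson-Crick complements:
p_base = 2 * p["A"] * p["T"] + 2 * p["G"] * p["C"]
# Probability that two independent random k-mers are exact reverse complements:
p_kmer = p_base ** K

def expected_pairs(gene_length: int, k: int = K) -> float:
    """Expected number of complementing k-mer window pairs in one gene,
    under an independence (no-structure) null model."""
    n_windows = gene_length - k + 1
    n_pairs = n_windows * (n_windows - 1) / 2
    return n_pairs * p_kmer
```

For a typical gene the expectation is a small fraction of a pair; summed over thousands of genes it produces a baseline on the order of the 991 figure above, and it's the consistent excess over that baseline that constitutes the signal.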

To make this clearer, I plotted the expected number of complementary 14-mers per gene versus the actual number per gene, in a graph:

Expected vs. actual complementing intragenic 14-mers for Sorangium. Expectation statistics were calculated individually, for each gene, based on actual A,G,C,T composition stats and gene length.
The points are arrayed in horizontal lines because while expectations can be calculated to several decimal points, actual occurrences are discrete (whole numbers).

It's fairly evident that length-14 complementary pairs tend to occur at higher than the expected rate(s). In fact, that's true for 98% of occurrences. It might not be obvious (from the above plot) that 98% of points lie above the 1:1 slope line, but that's only because so many points overlap each other.

The bottom line? These results strongly suggest that substantial amounts of secondary structure exist in a significant fraction of Sorangium's genes. The secondary structure could be tied to thermal regulation of gene expression (via RNA thermometers), or some mRNAs could incorporate metallo-sensitive riboswitches; or maybe secondary structure in mRNA is important for certain translocon-targeted genes. There could be other explanations as well. (If you have one, leave a comment.)

Why is this important? For one thing, we need a better understanding of how an organism with 10,400 protein genes regulates and coordinates gene expression. Secondary structure of regulatory and coding-region RNA might well hold important clues. But also, if secondary structure is conserved in large numbers of genes (as I believe it is), it has to affect codon bias. "Complementing" codons would be preferred (at least for certain regions) over non-complementing codons, and this would affect codon choice. It's a factor that has not been considered, to date, in arguments over why codon bias exists.