Likely selection for D614G S

There’s some foolishness going around about this recent preprint from Los Alamos.   They’ve spotted several mutations in collected SARS2 sequences from around the world that look as though they may be under positive selection.  We’re going to look into one in particular, a D614G mutation in the spike protein.  This mutation looks as though it’s increasing in relative prevalence in every region with decent sampling through March and early April.  

As for the foolishness, there are people responding to this by arguing that this pattern is more likely to be caused by bottlenecking or drift than selection.  Here’s an epidemiology prof at Harvard saying this.  Here’s a group lead at EMBL in the UK.  Here’s a Cambridge virologist.  The argument here is that the impact is random:  Italy happened to get this particular strain and then spread it across Europe.

Let’s dive into some specific regions in more detail.  The Harvard guy claims that Washington State data is consistent with a new seeding of the new strain from Europe or New York followed by roughly equivalent suppression of the old and new outbreaks by various measures.  I don’t have access to the raw GISAID database at the moment, so we’ll use the visualization provided by  Pull it up and set to north-america and color by the genotype at S614 to look at these two strains.

Then scroll down to the bottom and click on “Washington” in the “Filter by Admin Division” section.  That’ll show you only the (496 as of this writing) samples collected from Washington state.

Then scroll up and look at the tree.  The color tells you which strain you’re looking at.  Blue is the old strain, yellow is the new.  The very first dot was from January 1st, with the original aspartic acid.  Then no sequenced cases for a month and a half.  From Feb 18 to Mar 9, there are 148 new cases in Washington, all old strain, and all but two descended from the original January case.  So through early March we’re mostly seeing an epidemic growing from only a few initial introductions.

The first cases with a G at this locus are from March 10.  There are two of them, and they’re fairly diverged at other loci, suggesting multiple introductions of this new strain into Washington State, probably from the East Coast or Europe.  Let’s look at weekly cases in Washington after those first introductions.  You can get these by downloading the data at the bottom of the page.  The new strain is clade A2a, old strain is everything else.

Washington          Total cases   Old Strain   New Strain   Fraction New
March 10-16                  77           55           22            29%
March 17-23                  54           26           28            52%
March 24-30                  79           33           46            58%
March 31-April 6            180           89           91            51%
April 7-13                   16            0           16           100%
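
If you want to redo the weekly tally yourself from the downloaded metadata, here is a minimal sketch of the bookkeeping.  It assumes you have already reduced each sample to a (collection date, clade) pair.  That layout, the function name, and the clade label "B1" in the usage lines are my own illustration, not the format of the actual download.  The only thing taken from the post is the rule that A2a is the new strain and everything else is old.

```python
from collections import Counter
from datetime import date, timedelta

NEW_CLADE = "A2a"  # per the post: clade A2a is the new strain, everything else is old


def weekly_fraction_new(samples, start, weeks):
    """Tally samples into 7-day bins beginning at `start`.

    `samples` is an iterable of (collection_date, clade) pairs; this layout is
    an assumption for illustration, not the real download format.
    Returns one (week_start, week_end, total, old, new, fraction_new) row per week.
    """
    counts = Counter()
    for day, clade in samples:
        offset = (day - start).days
        if offset < 0 or offset >= 7 * weeks:
            continue  # outside the window we are tallying
        counts[(offset // 7, clade == NEW_CLADE)] += 1

    rows = []
    for week in range(weeks):
        old = counts[(week, False)]
        new = counts[(week, True)]
        total = old + new
        fraction = new / total if total else None  # undefined when nothing was sequenced
        week_start = start + timedelta(days=7 * week)
        rows.append((week_start, week_start + timedelta(days=6), total, old, new, fraction))
    return rows


# Made-up toy input, only to show the call shape.
toy = [
    (date(2020, 3, 10), "A2a"),
    (date(2020, 3, 11), "B1"),
    (date(2020, 3, 17), "A2a"),
    (date(2020, 3, 9), "A2a"),  # before the window, ignored
]
toy_rows = weekly_fraction_new(toy, date(2020, 3, 10), 2)
```

Weeks with no sequenced samples come back with a fraction of None rather than 0%, since "no data" and "no new strain" are different things.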

So this data agrees with the thesis presented in the article:  from early March to early April, the fraction of Washington cases with the new strain rises from 0% to 100%.  Now, the 100% is probably an overestimate.  In particular, the April 7-13 samples were all gathered by the UW Virology lab, the importance of which I’ll get to in a minute.  However, they’re divergent, not all from a single cluster.  But it certainly looks as though the new strain is becoming dominant in Washington State.
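
One way to put a rough number on how soft that 100% is: a 95% Wilson score interval on 16 new-strain samples out of 16.  This is my own addition, not something from the preprint, and it only covers sampling noise under the assumption that the 16 samples are independent random draws.  It says nothing about the bias from which lab collected them.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Last Washington week in the table: 16 of 16 samples were the new strain.
low, high = wilson_interval(16, 16)
```

With only 16 samples the lower end of that interval comes out at roughly 80%, so "most cases" is a safer reading of that week than "all cases".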

Now, most of these samples come from either the UW Virology lab or the Washington State Department of Health.  The state-gathered samples also have county information.  Let’s dig into that a little bit.  The two counties I’m going to focus on are King County and Yakima County.  King County contains Seattle, while Yakima County is much smaller and on the other side of the mountains.  The hope is that we can pick up separate trends due to the geographical separation.

First King County.  Remember that this doesn’t include the UW samples, so we get a much smaller count.

King County         Total cases   Old Strain   New Strain   Fraction New
March 10-16                   8            8            0             0%
March 17-23                   3            2            1            33%
March 24-30                   7            3            4            57%
March 31-April 6             15            3           12            80%

Then Yakima County.  Here the state didn’t start testing until late March.

Yakima County       Total cases   Old Strain   New Strain   Fraction New
March 10-16                   0
March 17-23                   0
March 24-30                  20           17            3            15%
March 31-April 6             63           50           13            21%

So it looks like we’re seeing what I expected to see:  later introduction of the new strain into Yakima county data, followed by growth there.

We can do a similar kind of look at the California samples, though there aren’t nearly as many to work with (mostly San Francisco, from UCSF, with some other scattered samples).  There we see the first of the new strain show up on March 7th in Sacramento County.  From then:

California          Total cases   Old Strain   New Strain   Fraction New
March 7-13                    7            3            4            57%
March 14-20                  10            3            7            70%
March 21-27                  27           15           12            44%
March 28-April 4             42           24           18            43%
April 5-                     17            7           10            59%

Here it’s more ambiguous and we don’t have as much data, but a month after the new strain is first seen, it’s quite firmly established in California as well.

As far as the foolishness goes: it’s entirely possible to have a particular clade or strain take over in a single location due to chance: person X spreads it there, person X happens to have that strain, no other strain gets introduced for a while so that strain becomes predominant. But in both Washington and California we see new introduction of D614G in early March when there are already spreading epidemics, followed by rapid growth such that a month later ~half of all cases originate from the later introduction. I don’t yet have access to the full GISAID dataset, but the authors state that the same thing is happening in England, Germany, Japan, and Australia. The same thing happening multiple places is not random chance, it’s something else. As for the functional speculations about higher viral load, they’re suggestive but not dispositive. We ought to look at them. And as for the fools at various institutions, they should – but won’t – shut up when they don’t know what they’re talking about.
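
To put a crude number on "not random chance", here is a back-of-the-envelope sign test.  The null model is entirely my assumption: in each region the new strain's share is equally likely to go up or down, and regions are independent of each other.  Neither is strictly true (regions share introductions, and I'm taking the four non-US regions on the authors' word), so treat this as an illustration of the argument, not a real test.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Binomial tail: probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Regions where the new strain's share is reported to rise: the two examined
# above plus the four the authors list.
regions = ["Washington", "California", "England", "Germany", "Japan", "Australia"]

# Chance that every region rises under the coin-flip null (assumed, see above).
p_all_rise = prob_at_least(len(regions), len(regions))
```

Under that null the chance of all six regions moving the same way is 0.5 to the sixth power, or about 1.6%.  Anyone arguing for drift has to explain why the coin keeps landing the same side up.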

Join the Conversation


    1. I just got access to the raw GISAID database of SARS2 genome sequences. Took a quick look and there’s quite a lot of high quality data from NYU. No zip code data, but borough and county. I can’t share the data, but let me know if there’s something you think might be worth looking at.


      1. I’m not capable of dealing with the finer points of genetic variation – or even the rougher points. Although I will screw up the courage to ask a question later when I’ve looked at the sequences again – I had an interesting time comparing the various boros earlier today.

        I do have a general question, which I posed on West Hunt, but no one saw it. It’s this. Actually two questions.

        It’s official: NYC is 20% sero-positive. Yet deaths have been very unequally distributed – take a look at the zip code map in the link I provided. So is it reasonable to conclude that Manhattan has less than 20% infected, and Bronx and Queens have 20%+?

        Second question. Many wealthy Manhattanites have left town for the duration. They’ll eventually return, probably in September. See above: they left a place that wasn’t terribly affected. When they come back, it will be flu season. We will still have the virus. Social distancing will be spotty (although the crap weather will help).

        This is when Manhattan gets hit.

        Makes sense?

