Revisiting viral selection

This preproof from Nathan Grubaugh (Yale, PhD in microbiology, 2016), Bill Hanage (Harvard, PhD in epidemiology), and Angela Rasmussen (Columbia, PhD in microbiology, 2009) came across my desk. It’s about the spike protein variant that, as Bette Korber at Los Alamos demonstrated, has become the dominant strain of the coronavirus in the last couple of months.

These guys present no data, just opinion. Here’s the core bit of their paper:

[Korber et al.] present compelling data that an amino acid change in the virus’ spike protein, D614G, emerged early during the pandemic, and viruses containing G614 are now dominant in many places around the world. The crucial questions are whether this is the result of natural selection, and what it means for the COVID-19 pandemic. Korber et al. hypothesized that the rapid spread of G614 was because it is more infectious than D614…As an alternative hypothesis… the increase in the frequency of G614 may be explained by chance, and the epidemiology of the pandemic.

In February, the area with the most COVID-19 cases shifted from China to Europe, and then in March on to the US. As this and other work shows, the great majority of SARS-CoV-2 lineages in the US arrived from Europe, which is unsurprising considering the amounts of travel between the continents. Whether lineages become established in a region is a function not only of transmission, but also the number of times they are introduced. There is good evidence that for SARS-CoV-2, a minority of infections are responsible for the majority of transmission (Endo et al., 2020). Therefore, while most introductions go extinct, those that make it, make it big (Lloyd-Smith et al., 2005). Over the period that G614 became the global majority variant, the number of introductions from China where D614 was still dominant were declining, while those from Europe climbed. This alone might explain the apparent success of G614.

Got that? They think that random chance is a likely hypothesis. Europe happened to be the next big outbreak after China, Europe happened to get this G614 variant, and then Europe spread that variant to the rest of the world.

There are some historical problems with this account. But before I get to those, I want to go through some math. How likely is it that a new introduction of the virus “succeeds” – that is, seeds an infection chain that doesn’t die out? We can approximate this based on three parameters:

  1. The relative advantage s of the introduced strain as compared with the old strain.
  2. The growth rate in the new location R_t, which gives the mean new infections per infection.
  3. The overdispersion \alpha, which describes the variance in new infections per infection.

The assumptions I’ll need for the approximation are that s is small and R_t is close to 1. While the second wasn’t true early on in the pandemic (it was much bigger), social distancing and other measures have made that true for most areas since about early April. The authors of the preproof helpfully linked a paper that estimated the overdispersion parameter based on Chinese data to be about 10.

Novel Introduction

First, let’s estimate how likely it is that a single infection can seed a completely new outbreak in a new location – one where there isn’t another outbreak already. I won’t go through the details here, but we get

p_{new} = \frac{2(R_t - 1)}{1 + \alpha}

We can then use Haldane’s approximation to estimate how many introductions are needed to get a 50% chance of seeding a new outbreak:

n_{new} = \frac{ln 2(1 + \alpha)}{2(R_t - 1)}.

A couple of points. First, if the mean growth rate in the new location is less than 1, there are no new outbreaks. You could have an unlucky first “superspreader” event, but those secondary infections would on average die out. The higher the growth rate, the more likely the outbreak is to seed. Second, the overdispersion makes the infection harder to seed relative to a Poisson distribution: the more variable the number of new infections per infection, the less likely any single introduction is to seed a big outbreak.

To feed in some actual numbers, if R_t=1.15 and \alpha=10, we get p_{new} \approx 2.7% and n_{new} \approx 25. Any single new infected traveler has under a 3% chance of seeding a new outbreak, and you’d need ~25 to have even odds of getting a new outbreak.
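
To keep the arithmetic easy to check, here’s a minimal sketch of these two approximations in Python. The function names are mine; the extra parameter s is the fitness advantage from the next section (0 for an identical strain).

import numpy as np

def p_seed(R_t, alpha, s=0.0):
    # Approximate probability that a single introduction seeds a lasting outbreak
    # (small s, R_t near 1). With s = 0 this is p_new; with s > 0 it is p_compete.
    return 2 * (s + R_t - 1) / (1 + alpha)

def n_even_odds(R_t, alpha, s=0.0):
    # Introductions needed for ~50% odds that at least one of them succeeds.
    return np.log(2) * (1 + alpha) / (2 * (s + R_t - 1))

print(p_seed(1.15, 10), n_even_odds(1.15, 10))  # ~0.027 and ~25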

Introduction Under Competition

What happens if we try to seed an outbreak where there are already cases? That depends on the difference in growth rate between the existing strain and the new one. Here (1 + s) acts as a multiplier on the growth rate. We have

p_{compete} = \frac{2(s + R_t - 1)}{1 + \alpha} and

n_{compete} = \frac{ln 2(1 + \alpha)}{2(s + R_t - 1)}.

Increased fitness makes seeding more likely. Note that even if the local growth rate is below 1 (cases declining), you can still get a new outbreak if the fitness advantage pushes the effective growth rate above 1.
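
Using the same sketch from above, the competition case is just a nonzero s. For example, a strain with a 10% fitness advantage arriving where the resident strain has R_t = 0.95:

print(p_seed(0.95, 10, s=0.1), n_even_odds(0.95, 10, s=0.1))  # ~0.009 and ~76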

History

Alright, let’s talk West Coast numbers. I’ll be using nextstrain.org, since that data’s publicly available. The West Coast (California, Oregon, and Washington) had very early outbreaks, seeded by travelers from China with the original D614 strain. By May 14, there are no more sequenced genomes of the original D614 in the database. The last one in Washington is May 6, in California May 12, and in Oregon May 13. Back in February, the original strain was the only one sequenced in Washington, and the dominant one in both Oregon and California. Note that Washington then was the center of the US epidemic, with several deaths by the end of February. The first of the new strain to show up in Washington is March 11, Oregon March 11, and California February 4. If we focus on Washington, that means in two months (or about 10 generations, with a 6 day generation time) the new strain completely takes over.

Now, the test volume in Washington was about 4000-5000 tests per weekday over the whole of that two months, with about 200-450 positive tests per day, peaking in late March and then dropping slowly until early May. That suggests the growth rate was slightly below 1 over those two months (cases slowly declining), making it very hard indeed to seed a new outbreak of the new strain from only a couple of introductions.

Strain Replacement

Mere introduction of the new strain by New Yorkers or Europeans won’t, by itself, cause complete replacement of the old strain. If there’s no fitness differential, all it will do is seed new infection chains with the new strain. Suppose you have 500 cases of the old strain and then 4,500 infected New Yorkers come to Seattle. Most of the new cases will be of the new strain, but not all of them; there’s no reason for the old strain’s infection chains to stop producing new cases.

One way for this to happen is if there is a differential fitness, so that the old strain has a negative growth rate while the new strain has a positive or less negative growth rate. That lets the new strain outcompete the old strain, until the population of the old strain is small enough that it can disappear by chance.

The other way is for all the existing cases to get swamped by new introductions under conditions of very low growth. Suppose the intrinsic growth rate in Washington was extremely low, say 0.5. So the number of cases drops by half every generation. But every generation new cases get introduced from another location dominated by the new strain (e.g., New York). So we start off with say, 400 cases in Washington. Then in the next generation, those cases only generate 200 new cases. But we also get another 150 cases imported from New York, for 350 total cases, 43% new. Repeat for 10 generations, and you’ll end up with no cases of the old strain, but a fair number of cases of the new. Note that you need a really low growth rate for this to happen in only 10 generations: R_t=0.8 only drops the population 10-fold.
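
Here’s a quick sketch of that swamping scenario, using the same illustrative numbers as above (R_t = 0.5 for both strains, 150 new-strain imports per generation):

old, new = 400, 0
for gen in range(1, 11):
    old *= 0.5                # old strain declines with R_t = 0.5
    new = new * 0.5 + 150     # new strain declines too, but is topped up by imports
    print(gen, round(old, 1), round(new), f"{new / (old + new):.0%} new")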

Effective Local Response to SARS-CoV-2

I think it’s become fairly apparent that effective federal and even state responses to this pandemic are not going to happen in the short term. I had a discussion over at Greg’s blog where I pointed out that effective local responses are still perfectly possible in most places. So let’s take a brief look at what that might look like.

We’re going to look at this from the perspective of a small or medium size city that is:

  • Politically independent (has its own government and police force)
  • Reasonably geographically separated (not a suburb of a much larger metropolitan area)
  • Reasonably wealthy (such that the measures I’ll suggest are affordable without external support)
  • Not yet heavily infected

Let’s suppose it’s about 80,000 people, 20% children and 15% seniors. First let’s figure out what the impact on the city would be without an effective response. We’ll conservatively guess that about 60% of the population would eventually be infected, with hospitalization rates for kids, adults, and seniors of 2%, 20%, and 35%, and case fatality rates of 0%, 0.5%, and 6%, respectively. By the end of the epidemic that would give us 192 hospitalized kids, 6,240 hospitalized adults, and 2,520 hospitalized seniors, plus 156 dead adults and 432 dead seniors. That’s not great. Depending on how big your local hospital is and how bad the cases are, the death rate could be higher if those infections happen too fast.
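
The arithmetic behind those numbers, as a quick sketch (the rates are the guesses above, not data):

pop = 80_000
attack_rate = 0.60
groups = {          # fraction of population, hospitalization rate, case fatality rate
    "kids":    (0.20, 0.02, 0.000),
    "adults":  (0.65, 0.20, 0.005),
    "seniors": (0.15, 0.35, 0.060),
}

for name, (frac, hosp, cfr) in groups.items():
    infected = pop * frac * attack_rate
    print(name, round(infected * hosp), "hospitalized,", round(infected * cfr), "dead")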

So. Suppose you run this city. What can you do about it without federal or state help?

Random Testing

Let’s assume you have access to at least some testing capacity. If you’re like most places, it’s under-used right now. Let’s say you’ve got 100 tests per day that aren’t needed for testing of healthcare workers or new patients. Use that capacity: Randomly choose households in your town, call them up and read this script:

Hello. My name is X, with city Y. We’re running a testing program to help our citizens know whether they’ve been infected with the new coronavirus. We’ve chosen your household to be a part of this program. If you’re interested, bring at most two members of your household to the local hospital tomorrow and give them this ID number: #. Thanks.

At 100 tests a day, you’re randomly testing 0.125% of the population. Just this level of random testing is enough to spot a new outbreak by the time it gets to the size of ~500 people or so. That’s small enough it’s still containable.
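
As a rough check on that claim, assume every active case would test positive and the 100 daily tests are a uniform random sample of the population:

def p_detect_per_day(outbreak_size, pop=80_000, tests_per_day=100):
    # Chance that one day's random sample contains at least one detectable case
    return 1 - (1 - outbreak_size / pop) ** tests_per_day

print(p_detect_per_day(500))                 # ~0.47: roughly even odds on any given day
print(1 - (1 - p_detect_per_day(500)) ** 7)  # ~0.99 over a week of sampling

Those are optimistic assumptions (tests lag infection, sampling isn’t perfectly uniform), so treat ~500 as a best case.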

Contact Tracing

Set up your own local contact tracing team. Once you’ve spotted an outbreak from random testing, a sick patient presenting, or a lead from another jurisdiction’s tracing team, you’ll want to figure out how much it’s spread and contain it as soon as possible.

Take your index patient and interview them: figure out how long they’ve been infectious. Where do they work, who do they live with, what shops have they been to. You’re trying to figure out who else could have been exposed to them while they were infectious. Once you’ve got potential links you call them up too and do the same thing: are they sick or showing any symptoms, were they at location X at time Y, and who have they been in contact with since then.

Recommend self-quarantine for all these folks and refer them for testing.

Basically all of this can be done with volunteers. You’ll need a script for each stage, volunteers with phones, and interpreters. In some cases you’ll want some boots on the ground as well. You could find underemployed public employees who want to help or just ask around in the community. There’s lots of people out of work who would be interested in helping. I’d look especially at librarians and public school administrators, but lots of people can be useful here.

Regulations to Help Tracing

You can do tracing with just phone work, but some additional regulation will make it easier. In particular we’re looking for things that will make it easier to see when two people were close to each other at the same time. Workplaces generally have timesheets that will let you see overlapping employees, but shops don’t keep track of customers. So ask them to: require that they keep a paper ledger that customers and employees need to sign into with the date and time when they entered the shop. A book like this costs ~$10; just buy one for every shop in your town and pick them up/drop them off weekly.

Centralized Quarantine

Intra-household infections are hard to stop with self-quarantine. So offer your citizens an alternative to getting grandma infected. Find some hotels in the area. They’re likely to be nearly empty. See if they’ll donate rooms for quarantine, or maybe offer a tax rebate for them. Set them up for quarantine of infected or contacted citizens. Make sure there’s food delivery and internet service available.

Then when you do your contact tracing and you tell someone they were in long contact with a known infection, you say “If you’re worried about infecting your family members, we can put you up in the hotel for free.”

Travel Restrictions

All of the above measures are meant to keep the infection from spreading widely in your town, but you’ll also have to keep the number of new infections coming in from outside as low as possible. So impose some simple travel restrictions: restrict entry to your town to residents or essential workers. Put some police or volunteers at the main roads and screen entrants. You could make it easier by, say, mailing every household a couple of stickers to put on the windshield. Then cars with the sticker can get waved past.

The strength of these restrictions probably depends on how common the infection is in your local area. If your town has or is near an airport, you’ll want to be a lot more careful, especially with long-distance travelers.

Other Public Health Measures

Standard stuff that everyone knows already should be kept up: have people wear masks outside, encourage social distancing, close schools, ban large gatherings. Try not to be a jerk about it; voluntary compliance is a lot better than forced compliance. That means, for example, having your cops hand out masks to people who don’t have them instead of roughing up those who aren’t wearing them.

Results

There’s no reason why sensible and mostly voluntary restrictions like these can’t let your town open its stores and factories and keep its economy going while also keeping the coronavirus from running rampant.

Likely selection for D614G S

There’s some foolishness going around about this recent preprint from Los Alamos.   They’ve spotted several mutations in collected SARS2 sequences from around the world that look as though they may be under positive selection.  We’re going to look into one in particular, a D614G mutation in the spike protein.  This mutation looks as though it’s increasing in relative prevalence in every region with decent sampling through March and early April.  


As for the foolishness, there’s people responding to this by arguing that this pattern is more likely to be caused by bottlenecking or drift than selection.  Here’s an epidemiology prof at Harvard saying this.  Here’s a group lead at EMBL in the UK.  Here’s a Cambridge virologist.  The argument here is that the impact is random:  Italy happened to get this particular strain and then spread it across Europe.


Let’s dive into some specific regions in more detail.  The Harvard guy claims that Washington State data is consistent with a new seeding of the new strain from Europe or New York followed by roughly equivalent suppression of the old and new outbreaks by various measures.  I don’t have access to the raw GISAID database at the moment, so we’ll use the visualization provided by nextstrain.org.  Pull it up and set to north-america and color by the genotype at S614 to look at these two strains.

Then scroll down to the bottom and click on “Washington” in the “Filter by Admin Division” section.  That’ll show you only the (496 as of this writing) samples collected from Washington state.

Then scroll up and look at the tree.  The color tells you which strain you’re looking at.  Blue is the old strain, yellow is the new.  The very first dot was from January 1st, with the original aspartic acid.  Then no sequenced cases for a month and a half.  From Feb 18 to Mar 9, there are 148 new cases in Washington, all old strain, and all but two descended from the original January case.  So through early March we’re mostly seeing an epidemic growing from only a few initial introductions.

The first case with a G at this locus is March 10, there’s two of them, and they’re fairly diverged at other loci, suggesting multiple introductions of this new strain into Washington State, probably from the East Coast or Europe.  Let’s look at weekly cases in Washington after those first introductions.  You can get these by downloading the data at the bottom of the page.  The new strain is clade A2a, old strain is everything else.
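
For anyone who wants to reproduce the weekly tabulation, here’s roughly the shape of the computation in pandas. The filename and column names (date, division, clade_membership) are my guesses at the layout of the nextstrain metadata download, so check them against the actual file.

import pandas as pd

meta = pd.read_csv("nextstrain_ncov_north-america_metadata.tsv", sep="\t")  # hypothetical filename
meta = meta[meta["division"] == "Washington"]           # keep only Washington samples
meta["date"] = pd.to_datetime(meta["date"], errors="coerce")
meta["new_strain"] = meta["clade_membership"] == "A2a"  # new strain = clade A2a, old = everything else

weekly = meta.set_index("date").resample("W")["new_strain"].agg(["count", "sum"])
weekly.columns = ["total", "new"]
weekly["fraction_new"] = weekly["new"] / weekly["total"]
print(weekly)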

Washington          Total cases   Old Strain   New Strain   Fraction New
March 10-16         77            55           22           29%
March 17-23         54            26           28           52%
March 24-30         79            33           46           58%
March 31-April 6    180           89           91           51%
April 7-13          16            0            16           100%

So this data agrees with the thesis presented in the article:  from early March to early April, the fraction of Washington cases with the new strain rises from 0% to 100%.  Now, the 100% is probably an overestimate.  In particular, they’re all gathered by the UW Virology lab, the importance of which I’ll get to in a minute.  However, they’re divergent, not all from a single cluster.  But it certainly looks as though the new strain is becoming dominant in Washington State.


Now, most of these samples come from either the UW Virology lab or the Washington State Department of Health.  The state gathered samples also have county information.  Let’s dig into that a little bit.  The two counties I’m going to focus on are King county and Yakima county.  King County contains Seattle, while Yakima County is much smaller and on the other side of the mountains.  The hope is that we can pick up separate trends due to the geographical separation.

First King County.  Remember that this doesn’t include the UW samples, so we get a much smaller count.

King County         Total cases   Old Strain   New Strain   Fraction New
March 10-16         8             8            0            0%
March 17-23         3             2            1            33%
March 24-30         7             3            4            57%
March 31-April 6    15            3            12           80%

Then Yakima County.  Here the state didn’t start testing until late March.

Yakima County       Total cases   Old Strain   New Strain   Fraction New
March 10-16         0
March 17-23         0
March 24-30         20            17           3            15%
March 31-April 6    63            50           13           21%

So it looks like we’re seeing what I expected to see:  later introduction of the new strain into Yakima county data, followed by growth there.


We can do a similar kind of look at the California samples, though there aren’t nearly as many to work with (mostly San Francisco, from UCSF, with some other scattered samples).  There we see the first of the new strain show up on March 7th in Sacramento County.  From then:

California          Total cases   Old Strain   New Strain   Fraction New
March 7-13          7             3            4            57%
March 14-20         10            3            7            70%
March 21-27         27            15           12           44%
March 28-April 4    42            24           18           43%
April 5-            17            7            10           59%

Here it’s more ambiguous and we don’t have as much data, but a month after the new strain is first seen, it’s quite firmly established in California as well.

As far as the foolishness goes: it’s entirely possible to have a particular clade or strain take over in a single location due to chance: person X spreads it there, person X happens to have that strain, no other strain gets introduced for a while so that strain becomes predominant. But in both Washington and California we see new introduction of D614G in early March when there are already spreading epidemics, followed by rapid growth such that a month later ~half of all cases originate from the later introduction. I don’t yet have access to the full GISAID dataset, but the authors state that the same thing is happening in England, Germany, Japan, and Australia. The same thing happening multiple places is not random chance, it’s something else. As for the functional speculations about higher viral load, they’re suggestive but not dispositive. We ought to look at them. And as for the fools at various institutions, they should – but won’t – shut up when they don’t know what they’re talking about.

IHME projections are absurdly optimistic

The state-specific projections from IHME have been widely disseminated and used for federal, state, and hospital planning. For example, here’s an interview with one of the UW professors on the project describing how their projections are being used by the feds. They have done yeoman’s work in collecting appropriate data, but their actual model has some serious flaws.

The details of the model itself can’t be found on their projection webpage; you have to go to the medRxiv article. Their introduction gives a brief criticism of the SIR model that I’ve been using as a base for my estimates. In particular, SIR models generally don’t include the effect of changes in social behavior in response to the epidemic; they assume panmixia (random mixing among the population); and they generally model using reported case numbers, which are distorted by variance in testing rates and testing strategies.

These criticisms are all pretty reasonable. It’s a good idea to look at the effect of controls (i.e., reducing R_0), which many simple models haven’t done. Panmixia isn’t accurate, and likely overestimates the speed of the epidemic, especially when the modeling is done on the scale of the whole country. And death rates are likely more accurate than case numbers, due to limited testing.

Let’s see what they’re suggesting as an alternative. Here’s how they’re doing the projections of total deaths in each state:

  • Gather state-specific data on imposed controls, hospital capacity and utilization, local age structure and death rates.
  • Estimate death rates for infection after correcting for local age structure.
  • Code controls on a scale from 0 to 4 according to how many of the following are in place: school closures, non-essential business closures (incl. bars/restaurants), stay-at-home recommendations, and travel restrictions.
  • Model the impact of controls on maximum death rate and time to reach the inflection point using published data from Wuhan, scaled based on the coded controls for each state, with Wuhan = 4.
  • Fit cumulative death rate (i.e., deaths per capita) to a Gaussian error function with three parameters: a location-specific inflection time where deaths stop accelerating, a location-specific growth rate, and a global (age-normalized) maximum death rate. Either the inflection time or the maximum death rate was assumed to depend on the delay between initial deaths and imposition of controls. (A rough sketch of this fitting step follows the list.)
  • Assume that states without full controls impose full controls a week after the current date.
  • Project out the fits, incorporating variance in the parameter fits to generate a set of projections.
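
To make the fitting step concrete, here is a minimal sketch of what fitting a cumulative death curve to a Gaussian error function looks like. The parameter names and the synthetic data are mine, not IHME’s.

import numpy as np
from scipy.special import erf
from scipy.optimize import curve_fit

def cum_deaths_per_capita(t, d_max, t_infl, rate):
    # Symmetric sigmoid: deaths per day form a Gaussian centered on the inflection time
    return 0.5 * d_max * (1 + erf(rate * (t - t_infl)))

days = np.arange(60)
observed = cum_deaths_per_capita(days, 6e-4, 35.0, 0.08)   # synthetic "data"
observed += np.random.normal(0, 1e-5, size=days.size)      # plus reporting noise
fit, _ = curve_fit(cum_deaths_per_capita, days, observed, p0=[1e-3, 30.0, 0.1])
print(fit)  # recovers roughly (6e-4, 35, 0.08)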

There are some really serious problems with this model, and all of them push the results towards underestimating the problem in the US. Let’s go through the problems.

Unqualified acceptance of CCP data

The data for deaths in Wuhan come from the provincial Hubei government, straight from the Chinese Communist Party. There are very good reasons to be skeptical of these numbers. The CCP lies constantly, and it is very hard for outsiders to verify these numbers. However, there is technical data that strongly suggests the CCP is undercounting. A late-February analysis looks at the activity of crematory facilities in Wuhan as well as temporarily-leaked numbers on Tencent. This data suggests a 5-10 fold higher rate of deaths than was reported by the CCP in late February. There have also been recent reports of very high numbers of funerary urns needed in Wuhan, in excess of what would be required by the officially reported deaths.

This is a huge problem for the model, because the model is heavily parameterized by the assumption that the Wuhan epidemic was brought to a quick end by the imposition of social distancing controls.

Undercounting of deaths elsewhere

There is good reason to believe that official COVID19 deaths are undercounted in most areas. In particular, total deaths for several hard-hit Italian towns went up 3-5 times compared to the same period last year, with only a fraction of that excess officially counted as COVID19 deaths.

In this model, current data on deaths in other polities are used to fit total deaths in the future, so a significant undercounting of current deaths would undercount the future projections.

Assumption of control imposition

They assume that states that do not have infection controls impose full controls within a week. This is unlikely at best.

Bad model choices

They choose to model the cumulative death rate as a Gaussian error function based on, AFAICT, no theoretical reason, only that it fit “the data” better. Given that the only data in their set that has run to completion is Wuhan’s, and that the Wuhan dataset is compromised, this is an extremely foolish decision.

Further, this model assumes that the acceleration of deaths as the epidemic grows is the same as the projected deceleration of deaths after infection controls are imposed. There is absolutely no reason to expect this. As the acceleration is related to R_0, this is equivalent to saying that R_0 before controls is equal to 1/R_{ctrl}. Even if controls are effective at bringing the growth rate negative, there is no reason why the deceleration can’t be much slower than the initial growth rate (e.g., if R_0 before controls is ~3 and after controls is ~0.95).

A better approach would be a modified SIR model using European, US, and non-CCP Asian data to estimate the impact of various infection control measures.

Overestimation of control effectiveness

They assume that state-wide infection controls are equivalent in effect with the controls the CCP imposed in Wuhan.

This is crazy. I was watching media reports and social media in January and February, keeping an eye on what was happening in Wuhan. I saw officials welding apartment doors shut, abducting individuals with fevers from their homes or off the street, mass-disinfecting whole cities, setting up walls outside villages to keep the infection out, and running internal checkpoints to identify people with fevers; people not wearing facemasks were harassed and assaulted.

I live in an area with fairly strong infection controls by US standards. We are nowhere close to that level of infection control. I went out for the first time in two weeks to the grocery store to stock up. There was one other person wearing a face mask, and they were wearing it wrong (not covering their nose). I repeatedly had to dodge back out of aisles to avoid people coming straight at me. This assumption is ridiculous and crazy, and obviously so to anyone paying attention. They’re in Washington state, they should know better.

Chinese reports suggest that the infection controls imposed in Wuhan dropped the R_0 from ~3.5 to ~0.35, or about 10-fold. Note that they also include a period of partial controls, which in their case means:

  • Blocking outward transportation from Wuhan
  • Closing public transit and vehicular traffic inside Wuhan
  • Compulsory mask-wearing in public places
  • Cancellation of public events
  • Self-quarantine of confirmed or suspected cases

They estimate that during this period, R_0 dropped to ~1.25. For comparison, the full controls additionally included:

  • Full quarantine of confirmed or suspected cases (i.e., extraction to a separate quarantine site), including contacts of confirmed cases
  • Temperature monitoring of all residents
  • Universal and strict stay-at-home orders for all residents

My expectation is that the current set of controls imposed by most European and US State governments is closer to the partial Wuhan controls than the full controls, and I fully expect growth rates to remain positive, though slower.

Conclusion

There are at least 5 serious, obvious problems with the IHME model, most seriously uncritical acceptance of Communist government data and wild overestimation of the effectiveness of US infection control measures. Given that I know from personal communication that multiple hospitals are using this to project out expectations for medical care required – and taking as valid that peak care will be required for only a matter of weeks – these reports are hugely irresponsible.

Further, given that every problem I identified with their model biases it towards underestimating the scope of the problem, I suspect there are serious prejudices or biases in the people generating the model. Frankly, I think they’re projecting wishful thinking and looking for reasons to support optimism.

Updated Epidemiological Modeling

The previous model I built was a toy model: the simplest possible dynamics that would give you a feel for what’s going on, and parameter selection that was educated guesswork.

There was sufficient interest in it, though, that I decided it would be worth building a more sophisticated and hopefully accurate model.

State dynamics and parameterization

The previous model was a simple SIR model with only three states. Individuals were either susceptible (vulnerable to infection but not yet infected), infected (and infectious to susceptible individuals), or recovered (or dead). This is very simple to calculate, but it has some obvious limitations. It doesn’t show the progression of the disease in individuals or the delay between infection and outcomes. Critical cases, for example, were simply calculated as a static fraction of all infected states at any point, when in fact they should be time-delayed (it takes time for an infection to progress to a critical state). It’s also hard to use to examine the effects of specific controls; all I did was vary the effective R_0 after imposing controls. There’s no way to, e.g., specifically reduce the contact rate for symptomatic patients.

For the new model, I’ve spent some time reading case reports and aggregated studies. Key references are:

  • A clinical analysis of 1099 cases throughout China describing clinical course (Feb 28)
  • A case study of 9 clustered German cases with daily testing using both PCR and viral culturing (Mar 8)
  • A survey of hospital associations estimating total hospital bed and ICU bed capacity for the U.S. (Mar 17)
  • A survey/modeling study in two Chinese cities counting contact numbers, including between different age groups (Mar 20)
  • A CDC report of hospitalization rates, ICU admission rates, and mortality rates of confirmed American cases (Mar 26)
  • A CDC report of testing results from two cruise ships, including one where all crew and passengers were tested (Mar 26)
  • A detailed report of Chinese cases in Wuhan with estimation of transmission dynamics (Mar 26)

Based on my reading, it looks like the progression of the disease is as follows:

  • An incubation period of 4.1-7.0 days after infection, during which the person is not yet infectious to others.
  • A short pre-symptomatic period of ~2 days when the person is infectious but before onset of symptoms.
  • Some fraction of people are fully asymptomatic, and can spread the disease until it is cleared and they recover, about 12 days after the end of the incubation period.
  • The rest will develop noticeable upper respiratory symptoms, for a duration of 12 days before recovery or decline.
  • Some fraction of symptomatic cases will progress to needing support in a hospital bed (non-severe cases) or an ICU (severe cases). Hospitalization lasts a further 11-13 days, depending on severity.
  • Some fraction of hospitalized cases will die; the rest will recover and retain immunity to the disease.

Here is a diagram of the disease course:

Of particular note is that I have modeled hospitalized patients as no longer being infectious; this is due to the German study suggesting low viral shedding 10 days post symptom onset.

The basic R_0 value is estimated at 2.2 based on the Chinese Wuhan data. This can be converted into estimated time between contacts causing a new infection for infectious people of ~5.9 days.

The asymptomatic fraction is estimated at 50% based on data from the Diamond Princess cruise ship. All passengers and crew on the ship were tested and ~50% of the positive test results had shown no symptoms.

The results of symptomatic cases are age-stratified based on the CDC report of American cases, including rates of hospitalization, ICU admission, and death. New infections are assumed to be randomly distributed among age brackets proportional to their fraction of the susceptible population.

The most speculative part of the model is the effects of health care saturation. While I have reasonable estimates of the available hospital and ICU beds, it’s necessarily speculative to estimate what happens to people with cases requiring hospitalization if that hospitalization is not available. I’ve made what I feel are likely conservative guesses. I’ve assumed triage of scarce beds to younger and less-severe patients.

Model with no controls

The first test is simply to run the model without attempting to impose infection controls and see what the result is. Here is a log-scale plot of total hospitalizations, ICU admissions, and deaths over the 2-year course of the epidemic.

The deaths top out at ~10 million total. Note that this is higher than my previous model, in part because of better estimation of available beds and in part because of longer estimated hospital stays. The death rate is strongly biased towards the elderly, in part due to intrinsically worse outcomes and in part due to triage of scarce resources.

Nearly 20% of 85+ Americans would be expected to die, and even 1 in 19 of 55-64 year olds, 1 in 47 of 45-54 year olds, and 1 in 164 of 20-44 year olds.

Without the impact of scarce hospital resources, the results look much better:

You can see that the red line no longer bends upward when the hospitalized cases hit the ICU or hospital bed capacity, reducing the total deaths almost 4-fold to 2.8 million. As before, the deaths are heavily skewed towards the elderly:

Unfortunately, avoiding running out of hospital capacity at the peak would require millions of available ICU beds simultaneously, which is rather unrealistic. We’ll look into the effects of imposed infection controls in my next post.

Appendix: code

PopState = namedtuple("PopState", [
    "susceptible",
    "incubating",
    "presymptomatic",
    "asymptomatic",
    "symptomatic",
    "nonsevere",
    "severe",
    "dead", 
    "recovered",
])

GeneralParams = namedtuple("GeneralParams", [
    "incubating_days",
    "presympt_days",
    "asympt_days",
    "sympt_days",
    "nonsev_days",
    "severe_days",
    "R0",
    "frac_asymptomatic",
])

AgeBracketParams = namedtuple("AgeBracketParams", [
    "frac_nonsevere",
    "frac_severe",
    "nonsevere_dead_home",
    "severe_dead_icu",
    "severe_dead_hosp",
    "severe_dead_home",
])

shared_params = GeneralParams(
    5.0, # Days incubating before infectious
    2.0, # Days infectious before symptomatic
    10.0, # Days infectious for asymptomatic cases
    12.5, # Days infectious for symptomatic cases
    11.0, # Days hospitalized for non-severe cases
    13.0, # Days hospitalized for severe cases
    2.2, # R0, base infectivity rate
    0.5, # Estimated fraction asymptomatic cases
)

# These are high-end estimates from the CDC publication
bracket_params = {
    "age_0_19": AgeBracketParams(
        0.025, 0.0,
        0.10, 
        0.0, 0.0, 0.0,
    ),
    "age_20_44": AgeBracketParams(
        0.166, 0.042,
        0.15, 
        0.05, 0.10, 0.15,
    ),
    "age_45_54": AgeBracketParams(
        0.179, 0.104,
        0.18, 
        0.08, 0.16, 0.24,
    ),
    "age_55_64": AgeBracketParams(
        0.189, 0.112,
        0.20, 
        0.23, 0.45, 0.85,
    ),
    "age_65_74": AgeBracketParams(
        0.247, 0.188,
        0.25, 
        0.26, 0.55, 0.95,
    ),
    "age_75_84": AgeBracketParams(
        0.277, 0.310,
        0.31, 
        0.34, 0.67, 1.0,
    ),
    "age_85": AgeBracketParams(
        0.413, 0.290,
        0.35, 
        0.94, 1.0, 1.0,
    ),
}

bracket_states = {
    "age_0_19":  PopState(81.98e6, 0, 0, 0, 0, 0, 0, 0, 0),
    "age_20_44": PopState(108.84e6, 10, 0, 0, 0, 0, 0, 0, 0),
    "age_45_54": PopState(41.63e6, 0, 0, 0, 0, 0, 0, 0, 0),
    "age_55_64": PopState(42.27e6, 0, 0, 0, 0, 0, 0, 0, 0),
    "age_65_74": PopState(30.48e6, 0, 0, 0, 0, 0, 0, 0, 0),
    "age_75_84": PopState(15.39e6, 0, 0, 0, 0, 0, 0, 0, 0),
    "age_85":    PopState(6.55e6, 0, 0, 0, 0, 0, 0, 0, 0),
}
brackets = [
    "age_0_19",
    "age_20_44",
    "age_45_54",
    "age_55_64",
    "age_65_74",
    "age_75_84",
    "age_85",
]

ICU_BEDS = 31482
HOSPITAL_BEDS = 501470

def update_state(states, general, params, day=1, start_day=70, control_rate=2.2, icu_beds=ICU_BEDS, hospital_beds=HOSPITAL_BEDS):
    def total_state(state):
        return sum(states[age].__getattribute__(state) for age in brackets)
    
    total_S = total_state("susceptible")
    total_N = sum(sum(states[age]) for age in brackets)
    
    # Generate the new states as dicts so they're mutable
    new_states = {age: states[age]._asdict() for age in brackets}
    
    # Adjust the controlled rate
    R0 = control_rate if day >= start_day else general.R0
    
    # Adjust for constant time between contacts
    mean_infectious_days = general.presympt_days
    mean_infectious_days += general.asympt_days * general.frac_asymptomatic
    mean_infectious_days += general.sympt_days * (1 - general.frac_asymptomatic)
    days_betw_contacts = mean_infectious_days / R0
    
    # susceptible --> incubating
    total_preI = total_state("presymptomatic")
    new_infections_from_presympt = (total_preI * total_S) / (total_N * days_betw_contacts)
    total_asymI = total_state("asymptomatic")
    new_infections_from_asympt = (total_asymI * total_S) / (total_N * days_betw_contacts)
    total_symI = total_state("symptomatic")
    new_infections_from_sympt = (total_symI * total_S) / (total_N * days_betw_contacts)
    
    new_infections = new_infections_from_presympt + new_infections_from_asympt + new_infections_from_sympt
    for age in brackets:
        # each age gets new infections proportionate to its fraction of the susceptible population
        # (i.e., panmixia across all age groups)
        transfer = (states[age].susceptible / total_S) * new_infections
        assert transfer <= states[age].susceptible
        assert transfer >= 0
        new_states[age]["susceptible"] -= transfer
        new_states[age]["incubating"] += transfer

    # incubating --> presymptomatic
    for age in brackets:
        transfer = (states[age].incubating / general.incubating_days)
        assert transfer <= states[age].incubating
        assert transfer >= 0
        new_states[age]["incubating"] -= transfer
        new_states[age]["presymptomatic"] += transfer
    
    # presymptomatic --> asymptomatic / symptomatic
    for age in brackets:
        transfer = (states[age].presymptomatic / general.presympt_days)
        assert transfer <= states[age].presymptomatic
        assert transfer >= 0
        new_states[age]["presymptomatic"] -= transfer
        new_states[age]["asymptomatic"] += transfer * general.frac_asymptomatic
        new_states[age]["symptomatic"] += transfer * (1 - general.frac_asymptomatic)
    
    # asymptomatic --> recovered
    for age in brackets:
        transfer = (states[age].asymptomatic / general.asympt_days)
        assert transfer <= states[age].asymptomatic
        assert transfer >= 0
        new_states[age]["asymptomatic"] -= transfer
        new_states[age]["recovered"] += transfer
    
    # symptomatic --> recovered / nonsevere / severe
    for age in brackets:
        transfer = (states[age].symptomatic / general.sympt_days)
        assert transfer <= states[age].symptomatic
        assert transfer >= 0
        new_states[age]["symptomatic"] -= transfer
        new_states[age]["severe"] += transfer * params[age].frac_severe
        new_states[age]["nonsevere"] += transfer * params[age].frac_nonsevere
        new_states[age]["recovered"] += transfer * (1 - (params[age].frac_severe + params[age].frac_nonsevere))
    
    # nonsevere --> recovered / dead
    # and 
    # severe --> recovered / dead
    free_hospital_beds = hospital_beds
    free_icu_beds = icu_beds
    for age in brackets:
        # nonsevere patients get first pick of hospital beds for triage
        transfer = (states[age].nonsevere / general.nonsev_days)
        assert transfer <= states[age].nonsevere
        assert transfer >= 0
        if transfer > 0:
            new_states[age]["nonsevere"] -= transfer
            # youngest patients get first pick of hospital beds for triage
            nonsevere = states[age].nonsevere
            in_bed = min(nonsevere, free_hospital_beds)
            at_home = nonsevere - in_bed
            free_hospital_beds -= in_bed
            dying = (at_home / nonsevere) * transfer * params[age].nonsevere_dead_home
            assert dying <= transfer
            new_states[age]["dead"] += dying
            new_states[age]["recovered"] += transfer - dying
        
        # severe patients get what's left over
        transfer = (states[age].severe / general.severe_days)
        assert transfer <= states[age].severe
        assert transfer >= 0
        if transfer > 0:
            new_states[age]["severe"] -= transfer
            # bed shortage
            severe = states[age].severe
            in_icu = min(severe, free_icu_beds)
            free_icu_beds -= in_icu
            in_bed = min(severe - in_icu, free_hospital_beds)
            free_hospital_beds -= in_bed
            at_home = severe - (in_icu + in_bed)
            dying = transfer * (
                (at_home / severe) * params[age].severe_dead_home + 
                (in_bed / severe) * params[age].severe_dead_hosp + 
                (in_icu / severe) * params[age].severe_dead_icu
            )
            assert dying <= transfer
            new_states[age]["dead"] += dying
            new_states[age]["recovered"] += transfer - dying
    
    # Now cast everything back to states
    final_states = {
        age: PopState(
            new_states[age]["susceptible"],
            new_states[age]["incubating"],
            new_states[age]["presymptomatic"],
            new_states[age]["asymptomatic"],
            new_states[age]["symptomatic"],
            new_states[age]["nonsevere"],
            new_states[age]["severe"],
            new_states[age]["dead"],
            new_states[age]["recovered"],
        )
        for age in brackets
    }
    return final_states, day + 1

def run_model(states, general, params):
    def total_state(state):
        return sum(states[age].__getattribute__(state) for age in brackets)
    
    def total_infections():
        return sum([
            total_state("incubating"), 
            total_state("presymptomatic"), 
            total_state("asymptomatic"), 
            total_state("symptomatic"),
            total_state("nonsevere"),
            total_state("severe"),
        ])

    infections = [total_infections()]
    nonsevere = [total_state("nonsevere")]
    severe = [total_state("severe")]
    deaths = [total_state("dead")]
    
    day = 1
    
    while total_infections() >= 0.25:
        states, day = update_state(states, general, params, day=day, start_day=230, control_rate=2.2,
                                  icu_beds=ICU_BEDS, hospital_beds=HOSPITAL_BEDS
                                  )
        infections.append(total_infections())
        nonsevere.append(total_state("nonsevere"))
        severe.append(total_state("severe"))
        deaths.append(total_state("dead"))
    return np.array(infections), np.array(nonsevere), np.array(severe), np.array(deaths), states

I, H, C, D, states = run_model(bracket_states, shared_params, bracket_params)

plt.figure(figsize=(12, 8))
plt.grid()

plt.plot(H, label="All Hospitalizations", color='k')
plt.plot(C, label="ICU admissions", color='orange')
plt.plot(D, label="Deaths", color='red')
plt.legend()
plt.yscale('log')
plt.ylim(bottom=1)
plt.xlim((0, 600))

plt.hlines(ICU_BEDS, 0, 600, linestyle='dashed', color='g')
plt.hlines(HOSPITAL_BEDS + ICU_BEDS, 0, 600, linestyle='dashed', color='b')
plt.ylabel("# of People")
plt.xlabel("Days After Initial Infection")

death_rates = [100 * states[age].dead / (states[age].dead + states[age].recovered + states[age].susceptible) for age in brackets]

plt.figure(figsize=(8, 6))
plt.bar(range(len(brackets)), death_rates)
plt.grid()
labels = [
    "0-19",
    "20-44",
    "45-54",
    "55-64",
    "65-74",
    "74-85",
    "85+",
]
plt.xticks(range(len(brackets)), labels, rotation=45)
plt.xlabel("Age Bracket")
plt.ylabel("% of Bracket Dying of COVID-19")

Epidemiological modeling – costs of controls

One point that’s come up a couple of times is the expense of imposing serious infection controls. China probably lost hundreds of billions of dollars on controlling the outbreak in Wuhan. Let’s try to make a guess of how much it would cost us, and use that to figure out what the best path forward is.

For starters, let’s take some reference points. Assume that completely uncontrolled, the R0 for the virus is 2.0. This probably incorporates small personal, relatively costless changes like better handwashing, sneeze and cough etiquette, etc. Evidence from China suggests that their very strong controls brought the R0 down to about 0.35. We’ll assume that that level of controls mostly shuts down the economy, reducing economic activity by 90%. U.S. 2019 GDP was about $21.44 trillion, so we can estimate the daily cost of these extremely stringent controls in the U.S. at about $50 billion.

Next, we’ll guess that the daily cost of imposing controls rises as you push R0 lower – it’s not linear in R0. Instead we figure that every 1% reduction in R0 costs the same amount. So dropping R0 from 2.0 to 0.35, which we estimated would cost about $50 billion per day, is 173.42 1% reductions, so each 1% reduction costs about $288 million per day. There’s one other thing to keep in mind for cost accounting: it’s much, much cheaper to impose controls when the total number of infections is low. In the low regime, you can do contact tracing and targeted quarantine rather than wholesale lockdowns. So if the total number of infections in the population is below 1,000, we’ll divide the cost of controls for that day by 500.

Finally, we need to balance the costs of infection controls against the benefits of reduced infection and mortality. All cases, including mild, non-critical cases, cost personal lost income: about $200 a day (median income of ~$70K, divided by 365). Critical cases in the ICU cost about $5,000 a day; critical cases in a general hospital bed, about $1,200 a day; critical cases at home, no additional marginal cost. Deaths will cost $5 million each.
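
Here’s the bookkeeping for the control-cost side, as a sketch; all the dollar figures are the round numbers guessed above.

import numpy as np

FULL_LOCKDOWN_COST = 50e9    # ~90% of daily GDP ($21.44T / 365), rounded as above
BASE_R0, FLOOR_R0 = 2.0, 0.35

def reductions_needed(target_R0, base_R0=BASE_R0):
    # Number of compounding 1% reductions to get from base_R0 down to target_R0
    return np.log(target_R0 / base_R0) / np.log(0.99)

COST_PER_REDUCTION = FULL_LOCKDOWN_COST / reductions_needed(FLOOR_R0)  # ~$288M per day

def daily_control_cost(target_R0, total_infections):
    cost = COST_PER_REDUCTION * reductions_needed(target_R0)
    # Below ~1,000 infections, contact tracing replaces lockdowns and is far cheaper
    return cost / 500 if total_infections < 1000 else cost

print(round(reductions_needed(FLOOR_R0), 1), COST_PER_REDUCTION / 1e6)  # ~173.4, ~288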

Now we can make a back of the envelope estimate of the costs and benefits of different courses of infection control.

We’ll impose infection controls on day 130, when there’s 30 dead and 30,500 infections out there. Then we’ll sweep through different strengths of controls and see what the total costs look like over the whole course of the epidemic.

The red line is the cost of the disease itself, largely due to deaths. The green line is the cost of maintaining infection controls throughout the epidemic. The dashed line is the total cost, all in billions of dollars. Note that with no controls, our estimate of the total cost to the US is $14.8 trillion, or about 2/3 of annual US GDP.

There are two local minima on the total cost curve. The first is right around R0=1.50. This makes the spread slow enough that hospital capacity is never overloaded, greatly reducing the number of deaths and thus the cost of the disease. This brings the total cost down to $7.45 trillion, just about half of what the uncontrolled cost would be. Note that even with this small amount of reduction, the majority of the cost is due to the controls rather than directly due to the disease.

The second local minimum is at the far left, right around R0=0.35, or about what the Chinese managed in Wuhan. This slows the spread of the disease so fast that we rapidly get back into the much cheaper contact-tracing regime, where wholesale lockdowns are no longer necessary. This brings the total cost down to $3.11 trillion, less than half the cost of the moderate-control scenario.

The cost savings with very serious controls imposed as soon as possible comes from two sources: First, lower disease burden in deaths, hospitalization, and lost productivity. Secondly, it’s cheaper to impose very serious controls because they need to be imposed for a much shorter period of time to control the disease.

Epidemiological modeling

A basic SIR model for the epidemic.

Total population 327 million, with a single initiating infection. 12 day course of disease, initial R0 of 3.5 (as per the stats from China that Steve linked.) Assume 5% of cases are critical, 2% of those critical cases die with ICU care, 5% with general hospital care, and 50% with no care. 90,000 available ICU beds and 900,000 available hospital beds.

Run this model through to completion and it sweeps through the population in about a year, infecting most everyone (less about 9 million who escape), killing 5.4 million.

Now, suppose we impose infection controls on day 80, right about when there’s 1000 deaths from this thing. And then we vary how strong those controls are: from 0.35 (what the Chinese managed) up to nothing at all.

Here we see how the # of deaths varies with the strength of our controls. If we impose Chinese-style controls, we get away with only 5K deaths, or 1000-fold fewer than without the controls. But the inflection point isn’t just at 1.0. In particular if we can get the R0 below about 1.5 that gets us down under 500K, or a 10-fold reduction. At 1.0 we get down to about 50K.

If we take the slope of this plot, we can see the marginal impact of better controls at each point, in terms of # of deaths saved:

Why the discontinuity at R0 of just below 1.5? Well, here’s the number of critical cases over time for several different scenarios. The vertical line is when we impose controls, the horizontal lines are ICU bed capacity and total hospital bed capacity. Right below an R0 of 1.5, the # of critical cases dips low enough that we never exceed hospital capacity. The trick here, though, is that until you get R0 below 1.0 you’re still sweeping through most of the population, just more slowly, slowly enough that the hospitals can handle it. And that means that, for example, if you get R0 to exactly 1.0 you have to keep those infection controls in place for ~5 years or else you’ll be back up close to 5 million dead again.

Appendix: code to generate this model.

from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from collections import namedtuple

PopState = namedtuple("PopState", [
    "day", 
    "susceptible", 
    "infected", 
    "dead", 
    "recovered",
])
EpiParams = namedtuple("EpiParams", [
    "days_infected", 
    "infection_rate", 
    "frac_critical", 
    "frac_dead_icu",
    "frac_dead_hospital",
    "frac_dead_home",
    "start_day",
    "controlled_infection_rate",
])

ICU_BEDS = 90000
HOSPITAL_BEDS = 900000

def update_state(state, params):
    day = state.day
    S = state.susceptible
    I = state.infected
    D = state.dead
    R = state.recovered
    N = S + I + D + R
    
    days_betw_contacts = params.days_infected / params.infection_rate
    if day >= params.start_day:
        days_betw_contacts = params.days_infected / params.controlled_infection_rate
    
    new_infections = (I * S) / (N * days_betw_contacts)
    recovered_infections = (I / params.days_infected) * (1 - params.frac_critical)

    # For critical cases, we split into ICU, hospital, and home care, in descending order
    C = I * params.frac_critical
    C_icu = min(C, ICU_BEDS)
    C_hospital = 0 if C < ICU_BEDS else min(C - ICU_BEDS, HOSPITAL_BEDS)
    C_home = 0 if C < (ICU_BEDS + HOSPITAL_BEDS) else C - (ICU_BEDS + HOSPITAL_BEDS)

    recovered_critical = (
        ((C_icu / params.days_infected) * (1 - params.frac_dead_icu)) + 
        ((C_hospital / params.days_infected) * (1 - params.frac_dead_hospital)) + 
        ((C_home / params.days_infected) * (1 - params.frac_dead_home))
    )
    
    dead_critical = (
        ((C_icu / params.days_infected) * (params.frac_dead_icu)) + 
        ((C_hospital / params.days_infected) * (params.frac_dead_hospital)) + 
        ((C_home / params.days_infected) * (params.frac_dead_home))
    )
    
    newS = S - new_infections
    newI = I + new_infections - recovered_infections - recovered_critical - dead_critical
    newR = R + recovered_infections + recovered_critical
    newD = D + dead_critical
    
    return PopState(day + 1, newS, newI, newD, newR)

def run_model(state, params):
    infections = [state.infected]
    deaths = [state.dead]
    while (state.infected) >= 1:
        state = update_state(state, params)
        infections.append(state.infected)
        deaths.append(state.dead)
    return infections, deaths, state
	
total_population = 327 * 1e6
init_state = PopState(
    day=1,
    susceptible=total_population - 1, 
    infected=1, 
    dead=0, 
    recovered=0)
params = EpiParams(
    days_infected=12, 
    infection_rate=3.5, 
    frac_critical=0.05, 
    frac_dead_icu=0.02,
    frac_dead_hospital=0.05,
    frac_dead_home=0.5,
    start_day=79,
    # Modify this parameter to vary the strength of infection controls
    controlled_infection_rate=3.5,
)

infections_over_time, deaths_over_time, final_state = run_model(init_state, params)

Selection pressure on prevalence of a dominant mutation

Suppose we have a dominant mutation in the population with some prevalence p. That means that a fraction p of the alleles at that locus in the population are our mutation, while the rest are wild-type. We’d like to know how that prevalence changes over time. Some of those changes will be random, but we can average out to ask what the expectation of the change is at any particular point. We’ll have to make some assumptions first:

  • Generations will be single events. We’ll start off with some generation, apply a selection model, and use that to generate the next generation, ad infinitum.
  • Individuals will randomly mate to produce the next generation. This means there won’t be any population substructure and we can calculate the fraction of homozygotes and heterozygotes using only the mutation prevalence p.
  • The mutation confers a selective advantage s that is equal in both homozygote and heterozygote carriers.

Given this, we can work out a table comparing different parts of the population, where q = 1 - p is the probability of the wild-type allele:

                                Wild-type   Heterozygote   Homozygote
Fraction of current generation  q^2         2pq            p^2
Relative fitness                1           1 + s          1 + s

Relative fitness here means that parents with the allele will have on average 1 + s children if wild type parents have 1. That lets us calculate the fraction of the parents of the next generation, adjusting for increased reproduction of the carriers.

                                Wild-type                  Heterozygote                    Homozygote
Relative proportion of parents  q^2                        2pq(1 + s)                      p^2(1 + s)
Normalized fraction of parents  \frac{q^2}{1 + sp + spq}   \frac{2pq(1+s)}{1 + sp + spq}   \frac{p^2(1+s)}{1 + sp + spq}

Now that we know what fraction of the parental population has each status, we can calculate the expected prevalence of the allele after one generation. Because mating is random, the next generation has the same allele prevalence as this reproduction-weighted pool of parents.

p' = \frac{pq(1+s) + p^2(1+s)}{1 + sp + spq}.

That is, we sum up half the fraction of heterozygotes and the fraction of homozygotes. This simplifies to

p' = \frac{p (1 + s)}{1 + sp + spq}.

Note that for the case where s=0, this reduces to p' = p, indicating no expected change in allele prevalence when there is no selective pressure, as we would expect. We can then calculate the change in prevalence as \Delta p = p' - p, which simplifies down to

\Delta p = \frac{spq^2}{1 + sp + spq}.

This change is zero if and only if spq^2 is zero, that is, if any of s, p, or q is zero. This means the allele changes frequency unless there is no selective pressure or the frequency is already fixed at 0 or 1.

We can get a better handle on the behavior of this by treating it as a differential equation and solving. We can rewrite it as

\frac{dp}{dt} \frac{-p^2 + 2p + \frac{1}{s}}{p^3 - 2p^2 + p} = 1.

Then we integrate both sides with respect to t:

\int \frac{-p^2 + 2p + \frac{1}{s}}{p^3 - 2p^2 + p} \frac{dp}{dt} dt = \int 1 dt.

Simplify the left side by partial fraction decomposition and solve the right side:

\int \left [\frac{1}{sp} + \frac{1}{1-p} + \frac{1}{s(1-p)} + \frac{1}{(1-p)^2} + \frac{1}{s(1-p)^2} \right] dp = t + c.

Then we simply integrate each piece of the left integral separately and combine the constants of integration.

\frac{1 + s}{sq} + \frac{1}{s}ln(\frac{p}{q}) - ln(q) = t + c.

There’s no way to solve explicitly for p at any particular number of generations. Instead what this gives us is a solution for t, the number of generations it takes to reach any particular prevalence, where the constant of integration is provided by the initial prevalence of the allele.

t(p | s) = \frac{1+s}{s} \left ( \frac{q_0 - q}{q_0q} - ln \frac{q}{q_0} \right ) + \frac{1}{s} ln \frac{p}{p_0}.

This lets us answer the question of how changes in the selection pressure affect the speed of selection. Suppose we keep the same starting prevalence and want to know how fast the population reaches some ending prevalence. The time needed is of the form

t(p | s) = \frac{1}{s}(s K(p) + Q(p))

K(p) = \frac{q_0 - q}{q_0q} - ln \frac{q}{q_0}

Q(p) = \frac{q_0 - q}{q_0q} - ln \frac{q}{q_0} + ln\frac{p}{p_0}

If sK(p) >> Q(p) then the time becomes approximately independent of s. If sK(p) << Q(p) then the time becomes approximately inversely proportional to the selection pressure. This gives us a bound on how fast increased selection can work: at most, doubling the selection pressure will halve the time needed to increase allele prevalence to a given point.
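As a sanity check on that bound, here is a small Python sketch (the helper name is mine, nothing standard) that evaluates the closed-form t(p | s) above and compares a doubled selection pressure:

import math

def time_to_prevalence(p, p0, s):
    """Closed-form t(p | s): generations to go from prevalence p0 to p
    under selective advantage s, in the continuous approximation."""
    q, q0 = 1 - p, 1 - p0
    K = (q0 - q) / (q0 * q) - math.log(q / q0)
    Q = K + math.log(p / p0)
    return (s * K + Q) / s          # equivalently K + Q / s

t1 = time_to_prevalence(0.5, 0.01, 0.05)
t2 = time_to_prevalence(0.5, 0.01, 0.10)
print(t1, t2, t1 / t2)   # the ratio is at most 2: doubling s at most halves the time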

Variant fixation probability

Introduction

What happens to a new mutation?

Every person carries some new mutations in their germline genome, mostly caused by errors in DNA replication, and can pass those mutations on to their descendants. We know of a couple of historical mutations whose modern copies all appear to descend from a single common ancestor – for example, the common variant that’s responsible for blue eyes. How likely is a new mutation to succeed and become common?

This depends on the selective advantage s of the new mutation: relative to non-carriers, how many more surviving children do carriers of the mutation have? If non-carriers have on average 1 child, then carriers will have on average 1 + s children. Intuitively, a higher selective advantage should make it more likely for a mutation to succeed.

We’re most interested in humans right now, and humans are diploid; they have two copies of each chromosome. Any variant can either be on both chromosomes or only one, and the selective advantage can be different depending on this. There are many common variants that are helpful when only one copy is present and harmful when two copies are present. Whether a mutation succeeds and becomes common mostly depends on the selective advantage in carriers with a single copy, because carriers with two copies are very rare when the mutation is uncommon.

There have been multiple approaches to solving this problem. Today we’ll look at one of the earliest and simplest. It’s straightforward, though it relies on a couple of assumptions that are not always valid. Even so, we can get a lot of utility out of the simple result.

Haldane’s approach

Haldane’s original 1927 paper has a derivation of the likelihood of a new mutation becoming common, but it can be somewhat difficult to understand. We’ll work out his results in slightly more detail.

Suppose we have a large population with a new mutation present in a single individual. Let p_r be the chance that this carrier passes the mutation down to r of its children in the next generation. We’ll define a new function

f(x) = \sum_{r=0}^{\infty} p_r x^r

We’ll see why this function is useful in more detail later. By construction, the coefficient of x^r in f(x) is the probability of r carriers in the next generation. Suppose instead we have multiple carriers in this generation. If the fraction of carriers is small, then we can ignore potential matings between carriers. That means that with m carriers, the probability of r carriers in the next generation is the coefficient of x^r in [f(x)]^m.

Expanding on this a bit, suppose that p_0=0.5 and p_1=0.5, so that each carrier is equally likely to have zero or one carrier children. Then

f(x) = 0.5 + 0.5x

If we have two carriers,

[f(x)]^2 = (0.5 + 0.5x)^2 = 0.25 + 0.5x + 0.25x^2

And we can see that the coefficients of the powers of x compose the probability distribution of carriers in the next generation.
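This is just polynomial multiplication, so we can check it numerically. Here is a minimal sketch using NumPy’s polynomial helpers; the offspring distribution is the toy one above, not anything estimated from data.

import numpy as np

# Coefficients of f(x) = 0.5 + 0.5x, lowest power first:
# the coefficient of x^r is the probability of r carrier children.
f = np.array([0.5, 0.5])

# With m carriers (and carrier-carrier matings ignored), the distribution of
# carriers in the next generation is the coefficient list of [f(x)]^m.
for m in (1, 2, 3):
    coeffs = np.polynomial.polynomial.polypow(f, m)
    print(m, coeffs)   # m=2 prints [0.25 0.5 0.25], matching the example above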

Now, what happens if, instead of multiplying the function by itself, we iteratively compose it with itself? This forms a composite function

S_f^N(x) = f( f( f( ... f(x) ... )))

Here the superscript N denotes the number of times the function has been applied. Again, let’s look at this for our simple example. Applied twice,

S_f^2(x) = f(f(x)) = 0.5 + 0.5 (0.5 + 0.5x) = 0.75 + 0.25x

We can see that this function S_f^N(x) represents the probability distribution of carriers after N generations of reproduction. In this case, each generation every carrier has a 50% chance of leaving no carrier children, so after two generations there is only a 0.25 chance that a single carrier remains.

We can then represent the likelihood that the variant disappears from the population as \lim_{N \to \infty} S_f^N(0). Evaluating at x = 0 kills every term except the constant, so S_f^N(0) is exactly the coefficient of x^0: the probability of no carriers after N generations. The limit asks what happens to that probability as the number of generations grows.

This kind of repeated application of a function to its own output is called fixed-point iteration, and it will converge to a root x_0 of the equation f(x) = x (a fixed point) if

  1. f(x) is continuously differentiable around the fixed point.
  2. |f'(x_0)| < 1.

Now, to be able to go further, we have to make some assumptions about the form of the function f(x). We want to model the case of a new mutation with selective advantage s, so that the expected number of carrier children of carriers with this allele is 1 + s. We will model the probabilities p_r as sampling from a Poisson distribution with mean 1+s, giving:

f(x) = \sum_{r=0}^{\infty} \frac{1}{r!} x^r (1+s)^r e^{-(1+s)}

This can be simplified with some algebraic work I won’t show here to

f(x) = e^{(x - 1)(1 + s)}

So the probability of extinction is 1-y, where y is the probability of survival, and we know that

1 - y = \lim_{N \to \infty} S_f^N(0)

So let’s plug x=1-y into the fixed point equation x = f(x), giving us

1 - y = e^{-y(1 + s)}

We take the natural log of both sides to get

y(1 + s) = -ln(1 - y).

Then we can take the Taylor series of -ln(1-y) to give us

y(1 + s) = y + \frac{y^2}{2} + \frac{y^3}{3} + ...

Or simplify to give

s = \frac{y}{2} + \frac{y^2}{3} + ....

Now, if the selective advantage s is small, we can drop higher order terms, and we get Haldane’s result: the probability that the mutation survives from a single starting copy is y=2s.
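We can check the approximation numerically. The sketch below (plain Python, my own function name) applies f repeatedly to 0 – exactly the limit \lim_{N \to \infty} S_f^N(0) above – for the Poisson offspring model and compares the resulting survival probability to 2s:

import math

def extinction_probability(s, generations=10_000):
    """Iterate x -> f(x) = exp((x - 1)(1 + s)) starting from x = 0.
    The limit is the probability that the mutation eventually dies out."""
    x = 0.0
    for _ in range(generations):
        x = math.exp((x - 1.0) * (1.0 + s))
    return x

for s in (0.01, 0.05, 0.1):
    survival = 1.0 - extinction_probability(s)
    print(f"s={s}: survival {survival:.4f} vs Haldane 2s = {2 * s:.4f}")

For small s the two numbers track closely; as s grows, the dropped higher-order terms start to matter and 2s overestimates the survival probability somewhat.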

Implications

First let’s remember the assumptions that went into this calculation:

  • s is “small”, so that we can drop higher-order terms in the Taylor series.
  • The population size is “big”, so that we can ignore the likelihood of carrier-carrier matings.
  • The number of surviving children per individual is Poisson-distributed.

The biggest implication of this result is that the likelihood of the mutation surviving doesn’t depend on the size of the population! It’s just as easy for a mutation to become common in a small population as in a large one.

Second, the likelihood of survival is directly proportional to the selective advantage of the mutation. But since even highly adaptive mutations have low absolute values of s, most new adaptive mutations are eventually lost.

Finally, if you repeatedly introduce the same beneficial mutation into a population, it’s very likely to become common. This suggests that even small amounts of crossbreeding will result in beneficial mutations from one population becoming common in the other population.
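To put a rough number on that last point: if each introduction independently survives with probability 2s, then after n introductions the chance that at least one takes hold is 1 - (1 - 2s)^n. A quick sketch (assuming independent introductions, which the analysis above already implies):

def chance_any_establishes(s, n):
    """Probability that at least one of n independent introductions of a
    mutation with per-copy survival probability 2s eventually becomes common."""
    return 1 - (1 - 2 * s) ** n

# Even a 1% advantage is very likely to establish after enough introductions.
print(chance_any_establishes(0.01, 35))    # roughly 50%
print(chance_any_establishes(0.01, 300))   # roughly 99.8%

So under these assumptions, sustained gene flow between populations makes the eventual spread of a beneficial mutation close to certain.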
