Tag Archives: featured

Trees, Population Structure, F-statistics!

I recently uploaded a new preprint to bioRxiv, discussing the F-statistic framework developed by David Reich and Nick Patterson. This is my author post for Haldane’s Sieve:

I began thinking about this paper more than a year ago, when Joe Pickrell and David Reich posted their perspective paper on human genetic history on bioRxiv. In that paper, they presented a very critical view of the serial founder model, the model I happened to be working on at the time. Needless to say, my perspective on the use (and usefulness) of the model was, and still is, quite different.

Part of their argument was based on the usage of the F3-statistic, and the fact that it is negative for many human populations, indicating admixture. Now, at that time, I was familiar with the basic idea of the statistic and had convinced myself – following the algebraic argument in Patterson et al. (2012) – that it should be positive under models of no admixture. However, I still had many open questions that this paper did not answer. Why should we use F2 as a measure of genetic drift to begin with? Why does F3 have this positivity property? How robust is this property under other models of population structure? Personally, I did not find the ‘path’ diagrams that Patterson et al. (2012) used helpful, because I am not familiar with Feynman diagrams, and I did not understand how drift could have ‘opposite’ directions.

The other primary sources did not help me either, partly because they are buried in supplements and are repetitive. Unfortunately, I initially missed what I now find the most comprehensive resource, the Supplementary Material of Reich et al. (2009), and missing it did not help my understanding. However, at that time (early summer last year) I had a thesis to finish, and so the F-statistics left my mind.

I finished my Ph.D. in July, moved to Chicago in October 2014, and forgot about F-statistics in the meantime. When I started my postdoc, John Novembre proposed that I have a look at EEMS, a program that one of Matthew’s former students, Desi Petkova, had developed to visualize migration patterns. Strikingly, Desi also used a matrix of squared differences in allele frequency, but she did so in a coalescent framework and for diploid samples, as opposed to the diffusion framework and population samples used for the F-statistics. However, the connection was immediately obvious, and it took only a few pages of algebra to figure out what is now Equation 5 in the paper; namely, that F2 has a very easy interpretation under the coalescent.

This was a very useful result, and it was what eventually made me decide to start writing a paper and to research the other issues I did not understand about F-statistics. It takes very little algebra (or some digging through supplementary materials) to figure out that F3 and F4 can be written in terms of F2. The interesting bit, however, is the form of these expressions: they immediately reminded me of quantities that are used in distance-based phylogenetics, namely the Gromov product and tree splits. This made it obvious that the statistics should be interpreted in that context, as tests of treeness with admixture as the alternative model. Under the tree model, F3 and F4 are just lengths of external and internal branches on a tree, and the workings of the tests can be neatly explained using that phylogenetic theory.
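As a small illustration of the algebra mentioned above, here is a minimal sketch that computes F2, F3 and F4 directly from allele frequencies and then again from F2 alone. It assumes the usual definitions from Patterson et al. (2012) as averages over loci of products of allele-frequency differences, uses made-up random frequencies, and ignores the sample-size corrections that real estimators need. The function names are my own and do not come from any package.

```python
import random

def f2(p, q):
    """Mean squared allele-frequency difference between two populations."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)

def f3(c, a, b):
    """F3(C; A, B): mean of (c - a) * (c - b) over loci."""
    return sum((z - x) * (z - y) for z, x, y in zip(c, a, b)) / len(c)

def f4(a, b, c, d):
    """F4(A, B; C, D): mean of (a - b) * (c - d) over loci."""
    return sum((w - x) * (y - z) for w, x, y, z in zip(a, b, c, d)) / len(a)

def f3_from_f2(c, a, b):
    """F3 written in terms of F2 (same form as a Gromov product)."""
    return 0.5 * (f2(c, a) + f2(c, b) - f2(a, b))

def f4_from_f2(a, b, c, d):
    """F4 written in terms of F2 (same form as an internal branch length)."""
    return 0.5 * (f2(a, d) + f2(b, c) - f2(a, c) - f2(b, d))

# Hypothetical data: 1000 loci with arbitrary frequencies in four populations.
rng = random.Random(1)
n_loci = 1000
A, B, C, D = ([rng.random() for _ in range(n_loci)] for _ in range(4))

# The two ways of computing each statistic agree up to rounding error.
f3_gap = abs(f3(C, A, B) - f3_from_f2(C, A, B))
f4_gap = abs(f4(A, B, C, D) - f4_from_f2(A, B, C, D))
print(f3_gap, f4_gap)
```

The identities hold for any set of frequencies, tree-like or not; it is only the interpretation of the values as branch lengths that relies on the tree model.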

Now, essentially a year later, I have finished a version of my paper that I am comfortable sharing. Because of my initial difficulties with the subject, and my suspicion that I might not be the only one with just a vague understanding of the statistics, I kept the first part as basic as possible. I start with how drift is measured as a decay in heterozygosity, or as an increase in uncertainty or relatedness, then explore in depth the phylogenetic theory underlying the null model of the admixture tests, and briefly talk about the path interpretation of the admixture model. Only then do I present my main result, the interpretation in terms of coalescent times and internal branch lengths, followed by some small simulations as sanity checks and some applications to population structure models.

A big challenge has been to attribute ideas correctly, sometimes because sources were difficult to find, and sometimes because key ideas were only implicitly stated. So if I misattributed anything, please let me know, and I am happy to fix it. Similarly, if there are parts of the manuscript that are unclear or hard to understand, please contact me; this paper is meant both to serve as a useful introduction to the topic and to present some interesting results.

Detecting range expansions from genetic data

Co-Author: Monty Slatkin

Populations are often structured, in the sense that individuals living next to each other are often related. In this paper, we develop a statistic to test for equilibrium isolation-by-distance. Equilibrium isolation-by-distance implies a constant and symmetric population structure, and detecting deviations from it can be useful in many contexts. Here, we look at range expansions, the process where a species expands its range from a small origin. We show that we can detect these expansions, as well as estimate the most likely origin.

Google Scholar


Altitude adaptation in Tibetans caused by introgression of Denisovan-like DNA

Co-Authors: Emilia Huerta-Sanchez (first author), Rasmus Nielsen and 23 others.

Many Tibetans have a variant of the EPAS1 gene that facilitates survival at very high altitudes. In this paper, we show that the Tibetan variant of EPAS1 is the same as that of the Denisovans, an extinct hominin group known only from a few fossil fragments, including a tooth, and from its genome sequence. This supports our hypothesis that Denisovans interbred with an ancestor of modern Tibetans, and that the high-altitude allele increased in frequency due to natural selection.

Google Scholar link

Nature Link

Distinguishing between population bottleneck and population subdivision by a Bayesian model choice procedure

Co-Authors: Daniel Wegmann, Laurent Excoffier

This is my first paper, published in Molecular Ecology in 2010. Using simulations and an inference procedure called Approximate Bayesian Computation (ABC), we showed that some methods to estimate population size changes may be biased if population structure is present.

We also showed, using two explicit models, that it is in principle possible to distinguish a bottleneck from population subdivision. For application purposes, we demonstrated that the inbreeding coefficient FIS can be used to detect structure. The basic idea is that FIS will be non-zero if the population is structured, as the two gene copies in a diploid individual are from the same deme, and are therefore more closely related than expected by chance.
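To make the FIS idea concrete, here is a minimal sketch using the textbook estimator FIS = 1 - H_obs / H_exp at a single biallelic locus. This is not the ABC procedure from the paper; the genotype counts are invented, and the function name `fis` is my own. The two demes are each in Hardy-Weinberg proportions, but with different allele frequencies (0.2 and 0.8).

```python
def fis(n_AA, n_Aa, n_aa):
    """Inbreeding coefficient 1 - H_obs / H_exp from genotype counts at one locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    h_obs = n_Aa / n                  # observed heterozygosity
    h_exp = 2 * p * (1 - p)           # expected under random mating
    if h_exp == 0:
        raise ValueError("locus is monomorphic; FIS is undefined")
    return 1 - h_obs / h_exp

# Hypothetical demes of 100 diploid individuals each, counts are (AA, Aa, aa).
deme1 = (4, 32, 64)    # allele frequency 0.2, Hardy-Weinberg proportions
deme2 = (64, 32, 4)    # allele frequency 0.8, Hardy-Weinberg proportions

# Ignoring the structure and pooling the two demes into one sample.
pooled = tuple(x + y for x, y in zip(deme1, deme2))

fis_deme1 = fis(*deme1)     # essentially zero: random mating within the deme
fis_pooled = fis(*pooled)   # clearly positive: heterozygote deficit from pooling
print(fis_deme1, fis_pooled)
```

Within each deme the estimate is zero up to rounding, while the pooled sample shows a deficit of heterozygotes, because both gene copies of every individual were drawn from the same deme.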

Google Scholar Link

Molecular Ecology Link