Trees, Population Structure, F-statistics!

I recently uploaded a new preprint to bioRxiv, discussing the F-statistic framework developed by David Reich and Nick Patterson. This is my author post for Haldane’s Sieve:

I began thinking about this paper more than a year ago, when Joe Pickrell and David Reich posted their perspective paper on human genetic history on bioRxiv. In that paper, they presented a very critical view of the serial founder model, the model I happened to be working on at the time. Needless to say, my perspective on the use (and usefulness) of the model was, and still is, quite different.

Part of their argument was based on the usage of the F3-statistic, and the fact that it is negative for many human populations, indicating admixture. Now, at that time, I was familiar with the basic idea of the statistic and had convinced myself – following the algebraic argument in Patterson et al. (2012) – that it should be positive under models of no admixture. However, I still had many open questions that this paper did not answer. Why should we use F2 as a measure of genetic drift to begin with? Why does F3 have this positivity property? How robust is this to other structure models? The ‘path’-diagrams that Patterson et al. (2012) used did not help me personally, because I am not familiar with Feynman diagrams, and I did not understand how drift could have ‘opposite’ directions.
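To make the sign of F3 concrete, here is a minimal toy sketch in Python. It is not taken from the paper: the allele frequencies are independent uniform draws (an assumption made purely for illustration, not a population-genetic simulation), `alpha` is a hypothetical admixture proportion, and the admixed population experiences no drift after mixing. Under those assumptions F3(X; 1, 2) equals -alpha * (1 - alpha) * F2(1, 2), which is negative, whereas placing one of the source populations in the target position gives a positive value.

```python
import random


def f2(p1, p2):
    """Sample mean of the squared allele frequency difference over loci."""
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)


def f3(px, p1, p2):
    """Sample mean of (px - p1) * (px - p2) over loci; px is the target."""
    return sum((x - a) * (x - b) for x, a, b in zip(px, p1, p2)) / len(px)


rng = random.Random(1)   # fixed seed so the toy example is reproducible
n_loci = 10_000
alpha = 0.3              # hypothetical admixture proportion (assumption)

# Toy allele frequencies for two source populations (assumption: uniform draws).
p1 = [rng.random() for _ in range(n_loci)]
p2 = [rng.random() for _ in range(n_loci)]

# Admixed population: a mixture of the two sources, with no drift afterwards.
px = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]

f3_admixed = f3(px, p1, p2)                       # admixed population as target
f3_source = f3(p1, px, p2)                        # a source population as target
expected = -alpha * (1 - alpha) * f2(p1, p2)      # algebraic value for this toy

print("F3 with admixed target:", f3_admixed)
print("F3 with source target: ", f3_source)
```

In real data, drift in the admixed population after the admixture event adds a positive term, so a negative F3 is evidence of admixture while a positive F3 does not rule it out.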

The other primary sources did not help me either, partly because the relevant material is buried in supplements and is repetitive. Unfortunately, I initially missed what I now find the most comprehensive resource – the Supplementary Material of Reich et al. (2009) – and that oversight did not help my understanding. However, at that time – early summer last year – I had a thesis to finish, and so the F-statistics left my mind.

I finished my Ph.D. in July, moved to Chicago in October 2014 and forgot about F-statistics in the meantime. When I started my postdoc, John Novembre proposed that I have a look at EEMS, a program one of Matthew’s former students, Desi Petkova, had developed to visualize migration patterns. Strikingly, Desi also used a matrix of squared differences in allele frequency, but she did so in a coalescent framework and for diploid samples, as opposed to the diffusion framework and population samples used for the F-statistics. Still, the connection is immediately obvious, and it took only a few pages of algebra to figure out what is now Equation 5 in the paper; namely, that F2 has a very easy interpretation under the coalescent.

This was a very useful result, and it is what eventually made me decide to start writing a paper and to research the other issues I did not understand about F-statistics. It takes very little algebra (or some digging through supplementary materials) to figure out that F3 and F4 can be written in terms of F2. The interesting bit, however, is the form of these expressions: they immediately reminded me of quantities used in distance-based phylogenetics, the Gromov product and tree splits. This made it obvious that the statistics should be interpreted in that context, as tests of treeness with admixture as the alternative model; that F3 and F4 are just lengths of external and internal branches on a tree; and that the workings of the tests can be neatly explained using that phylogenetic theory.
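The branch-length reading can be checked on a small example. The sketch below is only an illustration, not code from the paper: it assumes a four-population unrooted tree ((A,B),(C,D)) with made-up branch lengths, treats F2 as the additive path length between two leaves, and uses the standard expressions of F3 and F4 in terms of F2. The names `external`, `internal` and `side` are hypothetical helpers introduced for this example.

```python
# Hypothetical branch lengths for the unrooted tree ((A,B),(C,D)).
external = {"A": 0.10, "B": 0.20, "C": 0.30, "D": 0.40}  # leaf branches
internal = 0.05                                          # branch joining the two pairs
side = {"A": 0, "B": 0, "C": 1, "D": 1}                  # which end of the internal branch


def f2(x, y):
    """F2 as an additive tree distance: the path length between leaves x and y."""
    if x == y:
        return 0.0
    dist = external[x] + external[y]
    if side[x] != side[y]:
        dist += internal  # the path crosses the internal branch
    return dist


def f3(x, a, b):
    """F3(x; a, b) written in terms of F2 (the Gromov product of a and b at x)."""
    return 0.5 * (f2(x, a) + f2(x, b) - f2(a, b))


def f4(a, b, c, d):
    """F4(a, b; c, d) written in terms of F2."""
    return 0.5 * (f2(a, d) + f2(b, c) - f2(a, c) - f2(b, d))


print("F3(A; B, C):", f3("A", "B", "C"))            # external branch leading to A
print("F4(A, B; C, D):", f4("A", "B", "C", "D"))    # pairs match the tree split
print("F4(A, C; B, D):", f4("A", "C", "B", "D"))    # pairs cut across the split
```

Up to floating-point rounding, F3(A; B, C) recovers the external branch to A, F4(A, B; C, D) is zero because the two pairs sit on opposite sides of the internal branch, and F4(A, C; B, D) recovers the internal branch length. Note that F3(A; C, D) would return the external branch plus the internal branch, since that is the part of the path from A that is shared on the way to both C and D.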

Now, essentially a year later, I have finished a version of my paper that I am comfortable sharing. Because of my initial difficulties with the subject – and my suspicion that I might not be the only one with only a vague understanding of the statistics – I kept the first part as basic as possible. I start with how drift is measured as a decay in heterozygosity or as an increase in uncertainty or relatedness, then explore in depth the phylogenetic theory underlying the null model of the admixture tests, and briefly talk about the path interpretation of the admixture model. Only then do I present my main result, the interpretation in terms of coalescent times and internal branch lengths, followed by some small simulations as sanity checks and some applications and population structure models.

A big challenge has been to attribute ideas correctly, sometimes because sources were difficult to find, and sometimes because key ideas were only implicitly stated. So if I misattributed anything, please let me know, and I will be happy to fix it. Similarly, if there are parts of the manuscript that are unclear or hard to understand, please contact me: the paper is meant to serve both as a useful introduction to the topic and as a presentation of some interesting results.