Genome evolution: a sequence-centric approach

- Lecture 8-9: Concepts in population genetics

(Prerequisites: probability, calculus/matrix theory, some graph theory, some statistics)

Course outline:

- Genome structure: tree of life, genome size, elements of genome structure, elements of genomic information
- Probabilistic models: simple tree models, HMMs and variants, PhyloHMM, DBN, context-aware MM, factor graphs
- Inference: DP, sampling, variational approximation, LBP
- Parameter estimation: EM, generalized EM (optimize free energy)
- Mutations
- Population
- Inferring selection

Today's references: Hartl and Clark, topics from Chapters 3-7. See Graur and Li, Chapter 2 (an easy-to-read overview) and Lynch, Chapter 4 (more advanced).

Studying Populations

Models: a set of individuals and genomes; ancestry relations or hierarchies.

Experiments: field studies, diversity/genotyping; experimental evolution.

[Figure: mtDNA and human migration patterns]

[Figure: Åland Islands, Glanville fritillary population]

Species and populations

What is a species? There are multiple definitions; most of them rely on free flow of genetic information within a species and weak flow of information across its boundary.

[Figure: Species 1 and Species 2]

Species can emerge through the formation of reproductive barriers:

- Allopatric speciation occurs through geographical separation.
- Parapatric speciation occurs without geographical separation, but with weak flow of genetic information.
- Sympatric speciation occurs while information is flowing (controversial).

Barriers can be genetic, physical, or behavioral.

Population dynamics

We think of a species genome as representing the population average genomic information.

Individuals have genomes that are closely related to the species genome, but differ from it at certain loci (alleles).

As the population evolves there are continuous changes in allele frequencies, which may ultimately result in changes in the genome (fixation).

In haploid populations (bacteria), genotypes are determined by one haplotype and ancestral relations are simple trees. In diploid populations things are a bit more complex, as genotypes can be homozygous or heterozygous at each locus.

We can measure and quantify just a few aspects of these evolutionary dynamics:

- Size of populations
- Allele frequencies
- The average homozygosity/heterozygosity of an allele
- How many alleles there are at a locus

Population genetics deals with theories that predict the behavior of these quantities using simple assumptions on the evolutionary dynamics.

Frequency estimates

We will be dealing with estimation of allele frequencies. As a reminder, when sampling n times from a population in which an allele has frequency p, the number of times k that we observe the allele is a binomial variable. This can be further approximated using a normal distribution:

$$ k \sim \mathrm{Bin}(n, p) \approx \mathcal{N}\big(np,\ np(1-p)\big) $$

When estimating the frequency from the number of successes, $\hat{p} = k/n$, we therefore have a standard error of

$$ \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} $$
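As a minimal sketch of this sampling error, the snippet below computes the frequency estimate and its standard error under the normal approximation, sqrt(p(1-p)/n), with the estimate plugged in for p. The function name and the example counts (30 successes out of 100 samples) are illustrative assumptions, not values from the lecture.

```python
import math

def frequency_estimate(successes, n):
    """Return the estimated allele frequency and its binomial standard error."""
    p_hat = successes / n
    # Normal approximation to the binomial: SE = sqrt(p(1-p)/n),
    # with the estimate p_hat plugged in for the unknown p.
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, se

# Hypothetical sample: the allele was seen 30 times in 100 draws.
p_hat, se = frequency_estimate(successes=30, n=100)
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"p_hat={p_hat:.3f}, SE={se:.4f}, approx. 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Note that the error shrinks only as the square root of the sample size, so halving it requires four times as many samples.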

Simplest model: Hardy-Weinberg

Studying the dynamics of the frequencies of two alleles, A and a, of a gene. Assume:

- Diploid organisms
- Sexual reproduction
- Non-overlapping generations
- Random mating
- Males and females have the same allele frequencies
- Large population
- No migration
- No mutations, and no selection on the alleles under study

Hardy-Weinberg equilibrium

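[Figure: random mating among genotypes AA, Aa, aA and aa across non-overlapping generations]

Under the model assumptions, equilibrium is reached within one generation: with allele frequencies p (for A) and q = 1 - p (for a), the genotype frequencies become $p^2$ (AA), $2pq$ (Aa) and $q^2$ (aa).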
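The one-generation claim can be checked with a small sketch. It assumes the standard Hardy-Weinberg result that random union of gametes gives genotype frequencies p^2, 2pq and q^2; the function name and the starting genotype frequencies are hypothetical.

```python
def next_generation(f_AA, f_Aa, f_aa):
    """One round of random mating: return offspring genotype frequencies."""
    p = f_AA + 0.5 * f_Aa   # frequency of allele A in the gamete pool
    q = 1.0 - p
    return p * p, 2 * p * q, q * q

# Hypothetical starting point, far from Hardy-Weinberg proportions:
# no heterozygotes at all.
start = (0.5, 0.0, 0.5)
gen1 = next_generation(*start)
gen2 = next_generation(*gen1)
print("generation 1:", gen1)
print("generation 2:", gen2)
```

The second generation reproduces the first, because random mating leaves the allele frequency p unchanged.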

Testing Hardy-Weinberg using chi-square statistics

HW oversimplifies everything, but it can be used as a baseline to test whether interesting evolution is going on for some allele. A classical example is the blood group genotypes M/N (Sanger 1975). This genotype determines the expression of a polysaccharide on red blood cell surfaces, so it was quantifiable before the genomic era.

Genotype   Observed   HW expected
MM         298        294.3
MN         489        496
NN         213        209.3

The statistic is $\chi^2 = \sum (\text{observed} - \text{expected})^2 / \text{expected}$. Its significance can be computed from the chi-square distribution with df degrees of freedom. Here df = classes - parameters - 1 = 3 (MM/MN/NN) - 1 (p) - 1 = 1.
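A minimal sketch of this test on the M/N counts, using only the standard library. The observed counts are the ones on the slide. The p-value uses the fact that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x/2)). Variable names are illustrative.

```python
import math

observed = {"MM": 298, "MN": 489, "NN": 213}
n = sum(observed.values())

# Estimate the frequency of allele M from the genotype counts.
p = (2 * observed["MM"] + observed["MN"]) / (2 * n)
q = 1.0 - p

# Expected genotype counts under Hardy-Weinberg proportions.
expected = {"MM": n * p * p, "MN": n * 2 * p * q, "NN": n * q * q}

chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)

# df = 3 classes - 1 estimated parameter - 1 = 1.
# For 1 df the chi-square survival function is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2.0))

print(f"p(M)={p:.4f}, chi2={chi2:.3f}, p-value={p_value:.3f}")
```

The statistic is small and the p-value large, so these data give no reason to reject Hardy-Weinberg proportions.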

Recombination and linkage

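Assume two loci with alleles A1, A2 and B1, B2.

Only double heterozygotes allow recombination to change gamete (haplotype) frequencies.

Linkage equilibrium: gamete frequencies equal the product of the allele frequencies, for example $P_{A_1B_1} = p_1 q_1$.

[Figure: double heterozygotes A1B1/A2B2 and A1B2/A2B1, with gametes A1B1, A2B2, A1B2 and A2B1]

The recombination fraction r is the proportion of recombinant gametes generated from a double heterozygote.

For loci on different chromosomes, r = 0.5. For loci on the same chromosome, r is a function of the distance and possibly other factors.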

Linkage disequilibrium (LD)

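Let $P_{11}$ be the frequency of the gamete $A_1B_1$, and $p_1$, $q_1$ the frequencies of $A_1$ and $B_1$.

No recombination (probability $1-r$): the gamete is $A_1B_1$ with probability $P_{11}$.

Recombination between any $A_1$- and -$B_1$ chromosomes (probability $r$): the gamete is $A_1B_1$ with probability $p_1 q_1$.

Next generation:

$$ P'_{11} = (1-r)\,P_{11} + r\,p_1 q_1 $$

Define the linkage disequilibrium parameter D as

$$ D = P_{11} - p_1 q_1, \qquad D_t = (1-r)^t D_0 $$

[Figure: D as a function of generation, for r = 0.05, r = 0.2 and r = 0.5]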
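The decay curves for the three recombination fractions can be sketched as follows. This assumes the standard recursion D_t = (1 - r)^t D_0, which follows from P11' = (1 - r) P11 + r p1 q1. The starting value D_0 = 0.25 is a hypothetical choice (the largest possible D for two biallelic loci).

```python
def ld_after(d0, r, generations):
    """Linkage disequilibrium after t generations: D_t = (1 - r)^t * D_0."""
    return d0 * (1.0 - r) ** generations

d0 = 0.25  # hypothetical starting disequilibrium
for r in (0.05, 0.2, 0.5):
    trajectory = [ld_after(d0, r, t) for t in (0, 5, 10, 20)]
    print(f"r={r}: " + ", ".join(f"{d:.4f}" for d in trajectory))
```

Even for unlinked loci (r = 0.5) D only halves each generation, while for tightly linked loci it persists for many generations.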

Linkage disequilibrium (LD) - example

Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium.

For M/N: p1 = 0.5425, p2 = 0.4575. For S/s: q1 = 0.3080, q2 = 0.6920.

Gamete   Observed   Expected if unlinked
MS       474        334.2
Ms       611        750.8
NS       142        281.8
Ns       773        633.2

Linkage equilibrium is highly unlikely!

Sources of Linkage disequilibrium

- LD in the original population that has not yet decayed because r is low
- Genetic coadaptation: regions of the genome that are not subject to recombination (for example, inverted chromosomal fragments)
- Admixture of populations with different allele frequencies

Population substructure

The HW theory assumes that populations mate randomly. We mentioned that species are supposed to be genetically isolated, but even inside a species the flow of information is never uniform.

Subpopulation structure results in low heterozygosity, because different alleles become fixed in different subpopulations.

We can compute the average heterozygosity predicted by HWE from allele frequencies, H = 2pq, at each level:

- HS: in each subpopulation, use its allele frequency to compute the HWE heterozygosity, and average.
- HR: in each region, use its allele frequency to compute the HWE heterozygosity, and take a weighted average.
- HT: for the entire population, use its allele frequency to compute the HWE heterozygosity.

Wright's fixation index F compares one level in the hierarchy to another, for example

$$ F_{ST} = \frac{H_T - H_S}{H_T} $$

It provides an indication of the level of genetic differentiation in the population. 0 < F < 1; F < 0.05 is considered quite low, and F > 0.25 is considered very high.
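A minimal sketch of the fixation index for a two-level hierarchy, assuming the usual definition F_ST = (H_T - H_S) / H_T with H = 2pq and size-weighted averaging. The function names, subpopulation frequencies and sizes are hypothetical illustrations, not data from the lecture.

```python
def heterozygosity(p):
    """Expected HWE heterozygosity 2pq for a biallelic locus."""
    return 2.0 * p * (1.0 - p)

def fixation_index(freqs, sizes):
    """F_ST = (H_T - H_S) / H_T from subpopulation allele frequencies and sizes."""
    total = sum(sizes)
    weights = [s / total for s in sizes]
    # H_S: size-weighted average of within-subpopulation heterozygosity.
    h_s = sum(w * heterozygosity(p) for w, p in zip(weights, freqs))
    # H_T: heterozygosity computed from the pooled allele frequency.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = heterozygosity(p_bar)
    return (h_t - h_s) / h_t

# Two hypothetical subpopulations of equal size.
print(fixation_index([0.2, 0.8], [100, 100]))   # strongly differentiated
print(fixation_index([0.5, 0.5], [100, 100]))   # identical frequencies
```

Identical subpopulation frequencies give F = 0, and F grows as the subpopulations drift toward fixing different alleles.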

Population substructure (Dobzhansky and Epling 1942)

Frequency of the recessive allele (blue flower color) in desert snow flowers (Linanthus parryae). Differences among regions are more significant than differences inside them.

[Figure: map of sampled sites, each labeled with its allele frequency: 0.717, 0.005, 0.000, 0.000, 0.032, 0.573, 0.657, 0.000, 0.009, 0.000, 0.002, 0.302, 0.007, 0.004, 0.000, 0.000, 0.126, 0.504, 0.005, 0.106, 0.008, 0.000, 0.339, 0.000, 0.224, 0.068, 0.010, 0.000, 0.014, 0.411]

Each point represents 4000 plants over 30 square miles of the Mohave desert.

Inbreeding

A population with inbreeding will undergo a reduction in heterozygosity; for example, through self-fertilization in plants.

The inbreeding coefficient is

$$ F = \frac{H_0 - H_I}{H_0} $$

where $H_0$ is the random-mating heterozygosity and $H_I$ is the observed (inbreeding) heterozygosity.

In fact, F is identical to the fixation index F and can be interpreted as the probability that two alleles are identical by descent (autozygous).

The increase in homozygosity of rare alleles in inbred populations is frequently detrimental.

Regular mating schemes in the lab and field:

- Selfing, sib-mating, backcrossing to a single individual from a random-bred strain
- Assortative mating: positive (height in humans) or negative (cases in plants)

The HapMap project

- 1 million SNPs (single nucleotide polymorphisms)
- 4 populations:
  - 30 trios (parents/child) from Nigeria (Yoruba, YRI)
  - 30 trios (parents/child) from Utah (CEU)
  - 45 Han Chinese (Beijing)
  - 44 Japanese (Tokyo)
- Haplotyping each SNP/individual: not just determining heterozygosity/homozygosity; haplotyping completely resolves the genotypes (phasing)

Because of linkage, a partial SNP map largely determines all other SNPs! The idea is that a group of tag SNPs can be used to represent all genetic variation in the human population. This is extremely important in association studies that look for the genetic causes of disease.

Correlation on SNPs between populations

Recombination rates in the human population: LD blocks

Recombination rates in the human population

Recombination rates are highly non-uniform, with major effects on genome structure!

Mutations

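Simplest model: assume two alleles, A and a, with mutation probabilities $\mu$ (A to a) and $\nu$ (a to A) per generation.

If the process runs long enough, we will converge to a stationary distribution, in which the frequency of A is $\nu/(\mu+\nu)$.

[Figure: two-state diagram with mutation between A and a]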
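The convergence to a stationary distribution can be sketched by iterating the two-allele mutation model. The symbols mu (A to a) and nu (a to A), the function name and the numerical rates are assumptions made for illustration; the rates are chosen unrealistically large so that convergence is fast.

```python
def iterate_mutation(p0, mu, nu, generations):
    """Iterate p' = p(1 - mu) + (1 - p) nu, where p is the frequency of A."""
    p = p0
    for _ in range(generations):
        p = p * (1.0 - mu) + (1.0 - p) * nu
    return p

mu, nu = 1e-3, 5e-4            # hypothetical mutation probabilities
p_stationary = nu / (mu + nu)  # fixed point of the recursion

for p0 in (0.05, 0.9):
    print(p0, iterate_mutation(p0, mu, nu, 20000), p_stationary)
```

Both starting frequencies end at the same value: the stationary frequency depends only on the ratio of the two mutation probabilities, while their sum sets how fast it is approached.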

Populations are, however, finite, and this creates random genetic drift. A random allele has a significant chance of being eliminated, even in one generation.

[Figure 7.4: sampling and drift]

Experiments with drifting fly populations: 107 Drosophila melanogaster populations. Each consisted originally of 16 brown eye (bw) heterozygotes. At each generation, 8 males and 8 females were selected at random from the progeny of the previous generation. The bars show the distribution of allele frequencies in the 107 populations.

Drift, fixation, and the neutral theory

If sampling is random, the chance of ultimate fixation of a given allele copy is $1/(2N)$, simply because one allele must become fixed (and there are 2N to begin with).

According to the neutral theory, fixation of neutral alleles plays a major role in driving the divergence of populations. This is in contrast to the selectionist view, which stresses adaptive evolution as the major force behind the fixation of new alleles.

The controversy around the neutral theory seems like something that belongs to the past, since it was heated around the question of evolution in protein-coding loci and densely coded genomes. Today we realize that genomic information is distributed in a way that should certainly allow neutral or almost neutral mutations considerable freedom in large parts of the genome.

There are still critically important questions about how well the neutrality assumption holds in different parts of the genome; we'll look at this question later.

Wright-Fisher model for genetic drift

[Figure: N individuals produce ∞ gametes, from which the N individuals of the next generation are sampled, and so on]

We follow the frequency of an allele in the population until fixation (f = 2N) or loss (f = 0).

We can model the frequency as a Markov process with transition probabilities

$$ T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j} $$

that is, sampling j alleles from a population of 2N gene copies that contains i alleles.

In larger populations the frequency changes more slowly (the variance of the binomial variable is pq/2N, so sampling wouldn't change it that much).
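A minimal sketch of the transition probabilities, assuming plain binomial sampling of 2N gene copies (the standard Wright-Fisher form). The function name and the small population size (2N = 16, matching the 16 flies of the drift experiment only by way of illustration) are assumptions. The sketch checks that one row of the matrix sums to 1, that the expected count is unchanged, and that the variance of the frequency is pq/2N.

```python
import math

def wf_transition(i, j, two_n):
    """P(j copies next generation | i copies now) under binomial sampling."""
    x = i / two_n
    return math.comb(two_n, j) * x ** j * (1.0 - x) ** (two_n - j)

two_n = 16          # hypothetical: 8 diploid individuals
i = 4               # current number of copies of the allele
row = [wf_transition(i, j, two_n) for j in range(two_n + 1)]

mean_copies = sum(j * pj for j, pj in enumerate(row))
p = i / two_n
var_freq = sum((j / two_n - p) ** 2 * pj for j, pj in enumerate(row))

print("row sum:", sum(row))
print("mean copies:", mean_copies)
print("variance of frequency:", var_freq, "pq/2N:", p * (1 - p) / two_n)
print("P(loss in one generation):", row[0])
```

The states 0 and 2N are absorbing: once the allele is lost or fixed, the process stays there.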

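Diffusion approximation and Kimura's solution

Fisher, and then Kimura, approximated the drift process using a diffusion equation.

$\phi(x,t)$: the density of populations with frequency in $[x, x+dx]$ at time t.

$J(x,t)$: the flux of probability at time t and frequency x.

The change in the density equals the difference between the fluxes $J(x,t)$ and $J(x+dx,t)$; taking dx to the limit we have

$$ \frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial J(x,t)}{\partial x} $$

If M(x) is the mean change in allele frequency when the frequency is x, and V(x) is the variance of that change, then the probability flux equals

$$ J(x,t) = M(x)\,\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\big[V(x)\,\phi(x,t)\big] $$

so that

$$ \frac{\partial \phi}{\partial t} = -\frac{\partial}{\partial x}\big[M(x)\,\phi\big] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\big[V(x)\,\phi\big] $$

(the heat diffusion, Fokker-Planck, or Kolmogorov forward equation).

Changes in allele frequencies, Fisher-Wright model

After about 4N generations, just 10% of the cases are not fixed, and the distribution becomes flat.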

Absorption time and Time to fixation

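According to Kimura's solution, the mean time to fixation of an allele, assuming initial frequency p and assuming it was not lost, is

$$ \bar{t}_1(p) = -\frac{4N\,(1-p)\ln(1-p)}{p} $$

The mean time to loss of the allele (the fixation time of the complementary event) is

$$ \bar{t}_0(p) = -\frac{4N\,p\ln p}{1-p} $$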
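A short numerical sketch of these two times, assuming the standard diffusion results for a neutral allele, t1(p) = -4N (1-p) ln(1-p) / p for fixation and t0(p) = -4N p ln(p) / (1-p) for loss. The function names and the population size are hypothetical.

```python
import math

def mean_fixation_time(p, n):
    """Mean generations to fixation of a neutral allele, given that it fixes."""
    return -4.0 * n * (1.0 - p) * math.log(1.0 - p) / p

def mean_loss_time(p, n):
    """Mean generations to loss of a neutral allele, given that it is lost."""
    return -4.0 * n * p * math.log(p) / (1.0 - p)

n = 1000                    # hypothetical population size
p_new = 1.0 / (2 * n)       # a single new mutation
t_fix = mean_fixation_time(p_new, n)
t_loss = mean_loss_time(p_new, n)
print(f"new mutation: fixation ~{t_fix:.0f} generations, loss ~{t_loss:.1f} generations")
```

For a new mutation the fixation time approaches 4N generations, while loss, the far more likely outcome, takes only about 2 ln(2N) generations.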

Effective population size

4N generations looks like a huge number (in a population of billions!). But in fact, the Wright-Fisher model (like the Hardy-Weinberg model) is based on many unrealistic assumptions, including random mating: any two individuals can mate.

The effective population size is defined as the size of an idealized population for which the predicted dynamics of changes in allele frequency are similar to the observed ones.

For each measurable statistic of population dynamics, a different effective population size can be computed. For example, the expected variance in allele frequency is expressed as

$$ \mathrm{Var}(p_{t+1}) = \frac{p_t q_t}{2N} $$

But we can use the same formula to define the effective population size given the variance:

$$ N_e = \frac{p_t q_t}{2\,\mathrm{Var}(p_{t+1})} $$

Effective population size: changing populations

If the population size is changing over time, the dynamics will be affected by the harmonic mean of the sizes:

$$ \frac{1}{N_e} = \frac{1}{t}\sum_{i=1}^{t}\frac{1}{N_i} $$

So the effective population size is dominated by the size of the smallest bottleneck.

Bottlenecks can occur during migration, environmental stress, or isolation. Such effects greatly decrease heterozygosity (founder effect; for example, Tay-Sachs in Ashkenazi Jews).

Bottlenecks can accelerate fixation of neutral or even deleterious mutations, as we shall see later.

Human effective population size over the last 2 My is estimated at around 10,000 (due to bottlenecks).
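The effect of a bottleneck on the harmonic mean can be sketched as below, assuming 1/N_e is the average of 1/N_i over generations. The function name and the sequence of population sizes are hypothetical.

```python
def effective_size(sizes):
    """Harmonic mean of per-generation population sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical history: a large population with a one-generation bottleneck.
sizes = [10000, 10000, 100, 10000, 10000]
arithmetic_mean = sum(sizes) / len(sizes)
print("arithmetic mean:", arithmetic_mean)
print("effective size :", round(effective_size(sizes), 1))
```

A single generation at size 100 pulls the effective size down to a few hundred, even though the arithmetic mean stays near 8000.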

Effective population size: unequal sex ratio, and sex chromosomes

If there are more females than males, or fewer males participate in reproduction, then the effective population size will be smaller:

$$ N_e = \frac{4 N_m N_f}{N_m + N_f} $$

(any combination of alleles from a male and a female).

So if there are 10 times more females in the population (x males and 10x females), the effective population size is $4 \cdot x \cdot 10x/(11x) \approx 3.6x$, much less than the size of the population (11x).
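The sex-ratio example can be checked with a few lines, assuming the standard formula N_e = 4 N_m N_f / (N_m + N_f). The function name and the counts (100 males, 1000 females) are hypothetical.

```python
def effective_size_sex_ratio(n_males, n_females):
    """N_e = 4 * N_m * N_f / (N_m + N_f) for unequal numbers of breeding sexes."""
    return 4.0 * n_males * n_females / (n_males + n_females)

# Ten times more females than males (x = 100 males).
print(effective_size_sex_ratio(100, 1000))   # about 3.6 * x
# Equal sex ratio recovers the census size.
print(effective_size_sex_ratio(500, 500))
```

The rarer sex dominates the result: with very few breeding males, N_e approaches four times the number of males regardless of how many females there are.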

Another example is the X chromosome, which is present in only one copy in males.

Testing neutrality

The drift process has clear dynamics. We are usually interested in these dynamics as a baseline for testing hypotheses on non-neutral evolution.

Such tests require predictions on the behavior of concrete statistics that we can measure from a population. For example, we can sequence alleles and count how many polymorphic sites exist in a gene and what their frequencies are.

We can also perform evolutionary comparisons among different sites; we will focus on these later in the course.

Non neutral population dynamics

[Figure: species sp1, sp2, sp3, sp4 and sp5; label: slow evolution]

Infinite alleles model

Assuming a gene with multiple loci, we can think of the number of possible alleles as much larger than the population size. In this model, the probability of generating the same mutation twice is considered to be 0.

One can then ask how many distinct alleles we should observe given a neutral process and a certain mutation probability $\mu$.

Alternatively, one can ask what the probability of autozygosity F (identity by descent) will be:

$$ F_t = (1-\mu)^2\left[\frac{1}{2N} + \left(1-\frac{1}{2N}\right)F_{t-1}\right] $$

(picking up two autozygous alleles and not mutating them, or picking up the same allele twice and not mutating it).

Looking for the steady state and neglecting factors that depend on $\mu^2$ and $\mu/N$:

$$ F = \frac{1}{1+4N\mu} = \frac{1}{1+\theta}, \qquad \theta = 4N\mu $$

Because of our model, F is also the fraction of homozygous individuals.
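As a sketch of the steady-state argument, the recursion for autozygosity can be iterated numerically and compared with the approximation F = 1 / (1 + 4N mu). The recursion used here, F_t = (1 - mu)^2 [1/(2N) + (1 - 1/(2N)) F_(t-1)], is the standard infinite alleles form and is an assumption about what the slide showed. The parameter values and the function name are hypothetical.

```python
def autozygosity_steady_state(n, mu, generations=50000):
    """Iterate the infinite-alleles recursion for F until it settles."""
    f = 0.0
    no_mutation = (1.0 - mu) ** 2      # neither sampled allele mutated
    same_copy = 1.0 / (2 * n)          # picked the same parental copy twice
    for _ in range(generations):
        f = no_mutation * (same_copy + (1.0 - same_copy) * f)
    return f

n, mu = 1000, 1e-4                      # hypothetical values
theta = 4 * n * mu
f_iterated = autozygosity_steady_state(n, mu)
f_approx = 1.0 / (1.0 + theta)
print(f"iterated F={f_iterated:.5f}, 1/(1+theta)={f_approx:.5f}")
```

The two values agree to several decimal places, which is the sense in which terms of order mu^2 and mu/N are negligible.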

Testing the infinite alleles model

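The Ewens formula enables us to predict the number of alleles (k) we should observe when sampling n times from a population with $\theta = 4N\mu$, assuming the infinite alleles model:

$$ E(k) = 1 + \frac{\theta}{\theta+1} + \frac{\theta}{\theta+2} + \dots + \frac{\theta}{\theta+n-1} $$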

The Chinese restaurant process

Testing the infinite alleles model

We can estimate F from k (by finding $\theta$ from the E(k) formula and using $F = 1/(1+\theta)$).

We use this statistic to test whether a given gene behaves neutrally (or at least according to the model).
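The estimation step can be sketched as follows, assuming the Ewens expectation E(k) = sum over i from 0 to n-1 of theta / (theta + i) for n sampled gene copies, and F = 1 / (1 + theta). Solving for theta by bisection is a choice made for this sketch, not necessarily the procedure used in the lecture; the function names and the sample (k = 5 alleles among n = 50 copies) are hypothetical.

```python
def expected_alleles(theta, n):
    """Ewens expectation of the number of distinct alleles in n gene copies."""
    return sum(theta / (theta + i) for i in range(n))

def estimate_theta(k, n, lo=1e-9, hi=1e6, iterations=200):
    """Solve E(k) = k for theta by bisection; requires 1 < k < n."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        # E(k) increases monotonically with theta.
        if expected_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta_hat = estimate_theta(k=5, n=50)   # hypothetical sample
f_hat = 1.0 / (1.0 + theta_hat)
print(f"theta_hat={theta_hat:.3f}, expected F={f_hat:.3f}")
```

The expected F obtained this way can then be compared with the homozygosity actually observed in the sample.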

[Figures 7.16, 7.17: one case that is not quite neutral and one that is highly non-neutral]

VNTR locus in humans: observed (open columns) and Ewens-predicted allele counts.

F computed from the number of Xdh alleles in 89 D. pseudoobscura lines: 52 had a common allele, and 8 alleles were singletons. Compared to a simulation assuming the infinite alleles model.

Infinite sites model

Instead of looking at an entire gene with many alleles, consider the many loci (sites) that make up the gene, and assume that these are changing slowly, so that most loci are monomorphic or dimorphic.

Probability of i mismatches between two random sequences:

$$ P(i) = \frac{1}{\theta+1}\left(\frac{\theta}{\theta+1}\right)^{i} $$

In particular, autozygosity is $P(0) = 1/(1+\theta)$, just like we had for the infinite alleles model.

If we sample n alleles, the number of segregating sites S has expectation

$$ E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i} $$

assuming no intragenic recombination.

So we can test neutrality by looking at the number of alleles in a certain sample.

Coalescent theory

Any set of individuals in a population is the outcome of a coalescence process: a common ancestor giving rise to multiple alleles through mutation, duplication and recombination. Such models are in wide use for simulating populations. Applications for inferring selection/neutrality or other population dynamics are becoming feasible as more data becomes available.

A simple coalescent model: look at the gene tree of the k observed alleles.

[Figure: gene tree running from the past (common ancestor) to the present (observed alleles)]

Selection

Fitness: the relative reproductive success of an individual (or genome). Fitness is only defined with respect to the current population, and is unlikely to remain constant in all conditions and environments.

Sampling probability is multiplied by a selection factor 1+s.

Mutations can change fitness:

- A deleterious mutation decreases fitness. It would therefore be selected against; this process is called negative or purifying selection.
- An advantageous or beneficial mutation increases fitness. It would therefore be subject to positive selection.
- A neutral mutation is one that does not change fitness.

For haploid (mono-allelic) populations, selection acts directly on the fitness of an allele. For diploid organisms, we should define how the combination of alleles affects fitness.

Selection in haploid populations

Example (Hartl and Dykhuizen 1981): E. coli with two gnd alleles. One allele is beneficial for growth on gluconate. A population of E. coli was tracked for 35 generations, evolving on two media. The observed frequencies were:

Gluconate: 0.455 → 0.898
Ribose: 0.594 → 0.587

For gluconate: log(0.898/0.102) - log(0.455/0.545) = 35 log w, so log10(w) = 0.0292 and w = 1.0696. Compare to w = 0.999 in ribose.
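The fitness estimate can be reproduced with a short sketch. It assumes the haploid selection model in which the allele ratio p/q is multiplied by w each generation, so that the change in log(p/q) over t generations equals t log w. The function name is illustrative; the frequencies are the ones quoted in the example.

```python
import math

def fitness_from_frequencies(p_start, p_end, generations):
    """Relative fitness w from the change in the allele ratio p/q over time."""
    log_ratio_change = (math.log(p_end / (1.0 - p_end))
                        - math.log(p_start / (1.0 - p_start)))
    return math.exp(log_ratio_change / generations)

w_gluconate = fitness_from_frequencies(0.455, 0.898, 35)
w_ribose = fitness_from_frequencies(0.594, 0.587, 35)
print(f"gluconate: w={w_gluconate:.4f}")
print(f"ribose:    w={w_ribose:.4f}")
```

The base of the logarithm does not matter as long as the same base is used on both sides; natural logarithms are used here.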

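Allele                    A         a
Frequency (generation t)  $p_t$     $q_t$
Relative fitness          w         1
Gamete after selection    $w p_t$   $q_t$

Ratio as a function of time:

$$ \frac{p_t}{q_t} = w^t\,\frac{p_0}{q_0} $$

Considering a continuous time model with a small selective advantage $s = w - 1$, the change in allele frequency is approximately

$$ \frac{dp}{dt} = s\,p\,q $$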

Selection and allele frequency dynamics

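Assume:

Genotype    AA         Aa         aa
Fitness     $w_{11}$   $w_{12}$   $w_{22}$
Frequency   $p^2$      $2pq$      $q^2$      (Hardy-Weinberg!)

The change in frequency is given by

$$ \Delta p = \frac{pq\,\big[p\,(w_{11}-w_{12}) + q\,(w_{12}-w_{22})\big]}{\bar{w}}, \qquad \bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22} $$

In the case of codominance, for example with fitnesses $1+2s$, $1+s$ and $1$, this becomes $\Delta p = s\,pq/(1+2sp)$.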
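A minimal sketch of one generation of diploid selection, assuming genotypes AA, Aa and aa start in Hardy-Weinberg proportions with fitnesses w11, w12 and w22 (these symbol names and all numerical values are assumptions for illustration). It evaluates the standard expression delta_p = p q [p (w11 - w12) + q (w12 - w22)] / w_bar.

```python
def delta_p(p, w11, w12, w22):
    """Change in the frequency of allele A after one generation of selection."""
    q = 1.0 - p
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22   # mean fitness
    return p * q * (p * (w11 - w12) + q * (w12 - w22)) / w_bar

s = 0.01
# Codominance (additive fitness), hypothetical values 1+2s, 1+s, 1.
print(delta_p(0.1, 1 + 2 * s, 1 + s, 1.0))
# Neutral alleles do not change in frequency.
print(delta_p(0.1, 1.0, 1.0, 1.0))
# Overdominance: the heterozygote is fittest, giving an internal equilibrium.
print(delta_p(2.0 / 3.0, 0.9, 1.0, 0.8))
```

In the overdominant case the change vanishes at p = (w12 - w22) / ((w12 - w11) + (w12 - w22)), here 2/3, so both alleles are maintained.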

Selection and fixation

An allele with a beneficial mutation will have an increased frequency in the gamete pool.

Its chances of avoiding immediate extinction are

$$ 1 - e^{-(1+s)} $$

This is a rather modest increase, so even beneficial alleles are likely to be eliminated. For example, s = 0.1 gives a loss probability of 0.333, compared to 0.368 for a neutral allele.

For a diploid population, if we assume that the fitness of a heterozygote is 1+s and of a homozygote is 1+2s, it can be computed from the diffusion approximation that the overall fixation probability of an allele at initial frequency p will be

$$ P = \frac{1 - e^{-4N_e s p}}{1 - e^{-4N_e s}} $$
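The numbers quoted above can be reproduced with a short sketch. It assumes that the number of offspring copies is Poisson with mean 1 + s (the large-population limit), and uses Kimura's diffusion result P = (1 - exp(-4 N s p)) / (1 - exp(-4 N s)) with N_e taken equal to N. Function names and parameter values are hypothetical.

```python
import math

def loss_probability_one_generation(s):
    """P(no copies left) when the offspring number is Poisson with mean 1 + s."""
    return math.exp(-(1.0 + s))

def fixation_probability(s, n, p0=None):
    """Diffusion approximation for fitnesses 1, 1+s, 1+2s; assumes N_e = N."""
    if p0 is None:
        p0 = 1.0 / (2 * n)          # a single new mutation
    if s == 0:
        return p0                   # neutral limit
    return (1.0 - math.exp(-4.0 * n * s * p0)) / (1.0 - math.exp(-4.0 * n * s))

print("loss, s=0.1 :", round(loss_probability_one_generation(0.1), 3))
print("loss, s=0   :", round(loss_probability_one_generation(0.0), 3))
print("fixation, s=0.01, N=10000:", fixation_probability(0.01, 10000))
print("fixation, neutral, N=10000:", fixation_probability(0.0, 10000))
```

For a new mutation in a large population the fixation probability is close to 2s, far above the neutral value 1/(2N) but still small in absolute terms.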

Selection and fixation

The fixation time for a neutral allele (assuming fixation was achieved), as we said before, averages

$$ \bar{t} = 4N \text{ generations} $$

With a selective advantage, the fixation time is approximated by

$$ \bar{t} \approx \frac{2}{s}\ln(2N) \text{ generations} $$

Substitutions

Considering now the entire population, the rate of substitution at a locus equals the number of new mutations per generation times their fixation probability. In the neutral case, this is very simple:

$$ K = 2N\mu \cdot \frac{1}{2N} = \mu $$

So neutral evolution is unaffected by the size of the population.

With a selective advantage, the fixation probability is approximated by $2s$, giving

$$ K = 2N\mu \cdot 2s = 4N\mu s $$

So evolution will be more efficient when the population is larger, the mutation rate is higher and selection is stronger. The parameter $4N_e s$ describes the speed-up.
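A minimal sketch of the two substitution rates, assuming 2N mu new mutations per generation, a neutral fixation probability of 1/(2N), and the approximation 2s for an advantageous mutation. The function name and all parameter values are hypothetical.

```python
def substitution_rate(n, mu, fixation_probability):
    """New mutations per generation (2N * mu) times their fixation probability."""
    return 2 * n * mu * fixation_probability

mu, s = 1e-8, 0.01          # hypothetical mutation rate and selective advantage
for n in (1_000, 1_000_000):
    neutral = substitution_rate(n, mu, 1.0 / (2 * n))
    advantageous = substitution_rate(n, mu, 2 * s)
    print(f"N={n}: neutral K={neutral:.2e}, advantageous K={advantageous:.2e}, "
          f"ratio={advantageous / neutral:.0f}")
```

The neutral rate stays at mu for both population sizes, while the ratio of the advantageous to the neutral rate equals 4Ns and grows with N.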

Other types of selection

- Overdominance: heterozygotes are fitter, so there is a possibility of an equilibrium in allele frequencies. There are few examples, but one famous case is resistance to malaria and sickle cell anemia in Africa.
- Frequency- and density-dependent selection: when the fitness depends on the frequency of the allele or on the population size.
- Fecundity selection: different reproductive potential for mating pairs.
- Effects of a heterogeneous environment (overdominance?)
- Different effects in males and females
- Effects that apply directly to the haplotype: gametic selection/meiotic drive (e.g., killing the reproductive potential of your homologous chromosome)
- Kin selection: origin of altruism?

Recombination and selection

Linkage and selection

Linkage interferes with the purging of deleterious mutations and reduces the efficiency of positive selection!

[Figure: linked beneficial and weakly deleterious mutations, illustrating the selective sweep / hitchhiking effect / genetic draft and the Hill-Robertson effect]

Linkage and selection

The variance in allele frequency is used to define the effective population size.

Simplistically, assume a neutral locus is evolving such that selective sweeps affect a fully linked locus at rate d. A sweep will fix the neutral allele with probability p, and we further assume that the sweep happens instantly.

This is very rough, but it demonstrates the basic intuition here: sweeps reduce the efficiency of selection in a way that can be quantified through a reduction in the effective population size.

C: the average frequency of the neutral allele after the sweep.