Genome evolution: a sequence-centric approach - PowerPoint PPT Presentation

Slides: 46
Provided by: acil150


Genome evolution a sequence-centric approach
  • Lecture 8-9 Concepts in population genetics

(Probability, Calculus/Matrix theory, some graph
theory, some statistics)
Tree of life Genome Size Elements of genome
structure Elements of genomic information
Simple Tree Models HMMs and variants PhyloHMM,DBN
Context-aware MM Factor Graphs
Probabilistic models
Genome structure
DP Sampling Variational apx. LBP
Parameter estimation
EM Generalized EM (optimize free energy)
Inferring Selection
Today's references: Hartl and Clark, topics from Chapters 3-7. See also Graur and Li, Chapter 2 (an easy-to-read overview), and Lynch, Chapter 4 (more advanced).
Studying Populations
Models: a set of individuals and their genomes, with ancestry relations or hierarchies.
Experiments: field studies and diversity/genotyping surveys.
Experimental

mtDNA human migration patterns
Åland Islands, Glanville fritillary population
Species and populations
What is a species? There are multiple definitions; most of them rely on a free flow of genetic information within the species and a weak flow of information across its boundary.
[Figure: Species 1 and Species 2]
Species can emerge through the formation of reproductive barriers:
  • Allopatric speciation occurs through geographical separation.
  • Parapatric speciation occurs without geographical separation but with a weak flow of genetic information.
  • Sympatric speciation occurs while information is still flowing (controversial).
Barriers can be genetic, physical, or behavioral.
Population dynamics
We think of a species genome as representing the population-average genomic information. Individuals have genomes that are closely related to the species genome but differ from it at certain loci (alleles).
As the population evolves, allele frequencies change continuously, which may ultimately change the species genome (fixation).
In haploid populations (e.g. bacteria), genotypes are determined by one haplotype and ancestral relations are simple trees. In diploid populations things are a bit more complex, as genotypes can be homozygous or heterozygous at each locus.
We can measure and quantify just a few aspects of these evolutionary dynamics:
  • the size of populations
  • allele frequencies
  • the average homozygosity/heterozygosity of an allele
  • the number of alleles at a locus
Population genetics deals with theories that predict the behavior of these quantities using simple assumptions about the evolutionary dynamics.
Frequency estimates
We will be dealing with the estimation of allele frequencies. As a reminder, when sampling n times from a population in which an allele has frequency p, the number of times we observe the allele is a binomial variable. For large n this can be further approximated using a normal distribution.
When estimating the frequency as the number of successes divided by n, the estimate $\hat{p}$ therefore has a standard error of
$\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$
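As a quick numeric check of this sampling error, here is a minimal Python sketch. It assumes the usual normal approximation with standard error sqrt(p(1-p)/n). The function name `allele_freq_estimate` is illustrative, and the counts (1085 copies of M among 2000 sampled alleles) are borrowed from the M/N blood group example used later in the lecture.

```python
import math

def allele_freq_estimate(successes, n):
    """Return (p_hat, standard_error) for `successes` copies seen in n draws."""
    p_hat = successes / n
    # Standard error of a binomial proportion (normal approximation).
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

# Assumed example: 1085 M alleles out of 2000 sampled alleles.
p_hat, se = allele_freq_estimate(1085, 2000)
print(f"p_hat = {p_hat:.4f}, standard error = {se:.4f}")
```

Quadrupling the sample size halves the standard error, which is why frequency estimates from small samples should be read with wide error bars.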
Simplest model Hardy-Weinberg
We study the dynamics of the frequencies of two alleles, A and a, of a gene. Assume:
  • diploid organisms
  • sexual reproduction
  • non-overlapping generations
  • random mating
  • males and females have the same allele frequencies
  • a large population
  • no migration
  • no mutations and no selection on the alleles under study
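Hardy-Weinberg equilibrium
With random mating and non-overlapping generations, alleles A and a at frequencies p and q = 1 - p yield the genotype frequencies
$P(AA) = p^2,\quad P(Aa) = 2pq,\quad P(aa) = q^2$
Under the model assumptions, equilibrium is reached within one generation.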
Testing Hardy-Weinberg using chi-square statistics
HW oversimplifies everything, but it can be used as a baseline to test whether interesting evolution is going on for some allele. A classical example is the M/N blood group genotypes (Sanger 1975). This genotype determines the expression of a polysaccharide on red blood cell surfaces, so it was quantifiable before the genomic era.
Genotype   Observed   Expected (HW)
MM         298        294.3
MN         489        496
NN         213        209.3
The statistic is $\chi^2 = \sum (O - E)^2 / E$ over the genotype classes. Its significance can be computed from the chi-square distribution with df degrees of freedom. Here df = classes - parameters - 1 = 3 (MM/MN/NN) - 1 (p) - 1 = 1.
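The test above can be sketched in a few lines of Python. This is an illustration under the Hardy-Weinberg assumptions, not the lecture's own code: the function name is hypothetical, the allele frequency is estimated from the observed genotype counts (298 MM, 489 MN, 213 NN), and 3.841 is the standard 5% critical value of the chi-square distribution with one degree of freedom.

```python
def hardy_weinberg_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic of observed genotype counts against HW expectations."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # estimated frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi2

p, expected, chi2 = hardy_weinberg_chi_square(298, 489, 213)
CRITICAL_5PCT_DF1 = 3.841  # 5% critical value, chi-square with df = 1
print(f"p = {p:.4f}, chi2 = {chi2:.3f}, reject HW: {chi2 > CRITICAL_5PCT_DF1}")
```

The statistic comes out far below the critical value, so the M/N data give no reason to reject Hardy-Weinberg proportions.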
Recombination and linkage
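Assume two loci with alleles A1, A2 and B1, B2.
Only double heterozygotes allow recombination to change gamete (haplotype) frequencies.
Linkage equilibrium
[Figure: the double heterozygote A1B1/A2B2 produces the non-recombinant gametes A1B1 and A2B2 and the recombinant gametes A1B2 and A2B1; for the double heterozygote A1B2/A2B1 the roles are reversed.]
The recombination fraction r is the proportion of recombinant gametes generated from double heterozygotes.
For loci on different chromosomes, r = 0.5. For loci on the same chromosome, r is a function of the distance and possibly other factors.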
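Linkage disequilibrium (LD)
Let $P_{11}$ be the frequency of the A1B1 gamete, and $p_1$, $q_1$ the frequencies of alleles A1 and B1. In the next generation an A1B1 gamete arises either without recombination (probability $1-r$) from an A1B1 chromosome, or through recombination (probability $r$) between any A1- chromosome and any -B1 chromosome:
$P'_{11} = (1-r)\,P_{11} + r\,p_1 q_1$
Define the linkage disequilibrium parameter D as
$D = P_{11}P_{22} - P_{12}P_{21} = P_{11} - p_1 q_1$
so that $D' = (1-r)\,D$: LD decays by a factor of $1-r$ per generation.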
Linkage disequilibrium (LD) - example
Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium.
For M/N: p1 = 0.5425, p2 = 0.4575. For S/s: q1 = 0.3080, q2 = 0.6920.
Gamete   Observed   Expected (linkage equilibrium)
MS       474        334.2
Ms       611        750.8
NS       142        281.8
Ns       773        633.2
Linkage equilibrium is highly unlikely!
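To make "highly unlikely" concrete, the following Python sketch computes D and the chi-square statistic for the gamete counts. It is a hedged illustration: the function name is hypothetical, D is taken as the observed MS frequency minus the product of the allele frequencies, and the MS count is taken as 474, the value consistent with the stated frequencies p1 = 0.5425 and q1 = 0.3080 over 2000 gametes.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, expected_counts, chi2) for four gamete counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n           # frequency of allele A (here M)
    q_B = (n_AB + n_aB) / n           # frequency of allele B (here S)
    D = n_AB / n - p_A * q_B          # deviation from linkage equilibrium
    expected = (
        p_A * q_B * n,
        p_A * (1 - q_B) * n,
        (1 - p_A) * q_B * n,
        (1 - p_A) * (1 - q_B) * n,
    )
    observed = (n_AB, n_Ab, n_aB, n_ab)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return D, expected, chi2

# Gamete counts in the order MS, Ms, NS, Ns (MS = 474 is an assumption, see above).
D, expected, chi2 = linkage_disequilibrium(474, 611, 142, 773)
print(f"D = {D:.4f}, chi2 = {chi2:.1f}")
```

A chi-square value in the hundreds with one degree of freedom is far beyond any conventional significance threshold.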
Sources of Linkage disequilibrium
  • LD in the original population that has not yet decayed because r is low
  • Genetic coadaptation
  • Regions of the genome that are not subject to recombination (for example, inverted chromosomal fragments)
  • Admixture of populations with different allele frequencies
Population substructure
The HW theory assumes that populations mate randomly. We mentioned that species are supposed to be genetically isolated, but even inside a species the flow of information is never uniform.
Subpopulation structure results in lower heterozygosity than expected, because different alleles become fixed in different subpopulations.
We can compute the average heterozygosity predicted by HWE from allele frequencies, H = 2pq, at each level of the hierarchy:
  • HS: in each subpopulation, use its allele frequency to compute the HWE heterozygosity, then average.
  • HR: in each region, use its allele frequency to compute the HWE heterozygosity, then take a weighted average.
  • HT: for the entire population, use the overall allele frequency to compute the HWE heterozygosity.
Wright's fixation index F compares one level in the hierarchy to another, for example
$F_{SR} = (H_R - H_S)/H_R,\quad F_{RT} = (H_T - H_R)/H_T,\quad F_{ST} = (H_T - H_S)/H_T$
It indicates the level of genetic differentiation in the population: 0 < F < 1, where F < 0.05 is considered quite low and F > 0.25 is considered very high.
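The hierarchy of heterozygosities can be sketched in Python for the simplest case of two levels (subpopulations within a total population). This is a minimal illustration, assuming the form F_ST = (HT - HS)/HT with H = 2pq at each level; the function names and the example frequencies (0.2 and 0.8) are invented for the demonstration.

```python
def hw_heterozygosity(p):
    """Expected heterozygosity under Hardy-Weinberg for allele frequency p."""
    return 2 * p * (1 - p)

def fixation_index_st(freqs, sizes=None):
    """F_ST = (H_T - H_S) / H_T from subpopulation allele frequencies.

    sizes are optional subpopulation sizes used as weights (equal by default).
    Assumes the allele is not fixed in the total population (H_T > 0).
    """
    if sizes is None:
        sizes = [1] * len(freqs)
    total = sum(sizes)
    weights = [s / total for s in sizes]
    h_s = sum(w * hw_heterozygosity(p) for w, p in zip(weights, freqs))
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = hw_heterozygosity(p_bar)
    return (h_t - h_s) / h_t

f_st = fixation_index_st([0.2, 0.8])
print(f"F_ST = {f_st:.2f}")
```

Two equally sized subpopulations at frequencies 0.2 and 0.8 land above the 0.25 mark, so they would count as very highly differentiated on the scale given above.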
Population substructure (Dobzhansky and Epling
Frequency of the recessive allele (blue flower color) in desert snow flowers (Linanthus parryae).
Differences among regions are more significant than differences within them.
Each point represents 4000 plants over 30 square miles of the Mohave desert.
A population with inbreeding will undergo a reduction in heterozygosity, for example through self-fertilization in plants.
The inbreeding coefficient is
$F = (H_0 - H_I)/H_0$
where H0 is the random-mating heterozygosity and HI is the observed (inbreeding) heterozygosity.
In fact this F is identical to the fixation index F, and it can be interpreted as the probability that two alleles are identical by descent (autozygous).
The increase in homozygosity of rare alleles in an inbred population is frequently detrimental.
Regular mating schemes in the lab and field:
  • selfing, sib-mating, backcrossing to a single individual from a random-bred strain
  • assortative mating: positive (height in humans) or negative (cases in plants)
The hapmap project
  • 1 million SNPs (single nucleotide polymorphisms)
  • 4 populations:
    30 trios (parents/child) from Nigeria (Yoruba, YRI)
    30 trios (parents/child) from Utah (CEU)
    45 Han Chinese (Beijing)
    44 Japanese (Tokyo)
  • Haplotyping each SNP/individual: not just determining heterozygosity/homozygosity; haplotyping completely resolves the genotypes (phasing).
Because of linkage, a partial SNP map largely determines all other SNPs. The idea is that a group of tag SNPs can be used to represent all genetic variation in the human population. This is extremely important in association studies that look for the genetic causes of disease.
Correlation on SNPs between populations
Recombination rates in the human population LD
Recombination rates in the human population
Recombination rates are highly non uniform with
major effects on genome structure!
Simplest model: assume two alleles, and mutations between them. If the process runs long enough, we converge to a stationary distribution.
Populations are, however, finite, and this creates random genetic drift. A random allele has a significant chance of being eliminated, even in one generation.
Figure 7.4
Experiments with drifting fly populations: 107 Drosophila melanogaster populations, each consisting originally of 16 brown eyes (bw) heterozygotes. At each generation, 8 males and 8 females were selected at random from the progeny of the previous generation. The bars show the distribution of allele frequencies in the 107 populations.
Drift, fixation, and the neutral theory
If sampling is random, the chance of ultimate fixation of a given allele copy is $1/(2N)$, simply because one copy must eventually become fixed and there are 2N to begin with.
According to the neutral theory, fixation of neutral alleles plays a major role in driving the divergence of populations. This is in contrast to the selectionist view, which stresses adaptive evolution as the major force for fixation of new alleles.
The controversy around the neutral theory seems to belong to the past, since it was heated around questions of evolution in protein-coding loci and densely coded genomes. Today we realize that genomic information is distributed in a way that should certainly allow neutral or almost-neutral mutations considerable freedom in large parts of the genome.
There are still critically important questions about how well the neutrality assumption holds in different parts of the genome; we'll look at this question later.
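Wright-Fisher model for genetic drift
[Figure: N individuals contribute a pool of gametes from which the N individuals of the next generation are drawn; 8 gametes are shown per generation.]
We follow the number of copies f of an allele in the population until fixation (f = 2N) or loss (f = 0).
We can model this as a Markov process with transition probabilities
$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}$
that is, sampling j copies of the allele from a population of 2N gametes that currently holds i copies.
In a larger population the frequency changes more slowly: the variance of the binomial sampling step is pq/2N, so sampling does not change the frequency by much.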
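The Markov chain described above is small enough to build explicitly. The sketch below is an illustration, not the lecture's code: it assumes binomial sampling of 2N gametes with success probability i/2N, uses hypothetical function names, and picks N = 4 (8 gametes) so that the matrix is tiny. Propagating a single new copy for many generations recovers the neutral fixation probability 1/(2N).

```python
from math import comb

def wf_transition_matrix(N):
    """T[i][j] = P(j copies next generation | i copies now), 2N gametes."""
    m = 2 * N
    T = []
    for i in range(m + 1):
        p = i / m
        T.append([comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(m + 1)])
    return T

def evolve(dist, T, generations):
    """Propagate a distribution over copy numbers through the chain."""
    size = len(dist)
    for _ in range(generations):
        dist = [sum(dist[i] * T[i][j] for i in range(size)) for j in range(size)]
    return dist

N = 4
T = wf_transition_matrix(N)
start = [0.0] * (2 * N + 1)
start[1] = 1.0                      # one new copy of the allele
final = evolve(start, T, 500)
print(f"P(loss) = {final[0]:.4f}, P(fixation) = {final[-1]:.4f}")
```

With N = 4 almost all probability mass ends up in the two absorbing states, with 1/8 of it at fixation.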
Diffusion approximation and Kimura's solution
Fisher, and then Kimura, approximated the drift process using a diffusion equation.
$\phi(x,t)$: the density of populations with allele frequency in $[x, x+dx]$ at time t.
$J(x,t)$: the flux of probability at time t and frequency x.
The change in the density equals the difference between the fluxes $J(x,t)$ and $J(x+dx,t)$; taking dx to the limit we have
$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial J(x,t)}{\partial x}$
If M(x) is the mean change in allele frequency when the frequency is x, and V(x) is the variance of that change, then the probability flux equals
$J(x,t) = M(x)\,\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\,\phi(x,t)\right]$
This is the heat diffusion / Fokker-Planck / Kolmogorov forward equation.
[Figure: changes in allele frequencies under the Wright-Fisher model]
After about 4N generations, just 10% of the cases are not fixed, and the distribution becomes flat.
Absorption time and time to fixation
According to Kimura's solution, the mean time to fixation of an allele with initial frequency p, conditional on it not being lost, is
$\bar{t}_1(p) = -\frac{4N(1-p)\ln(1-p)}{p}$ generations.
The mean time to loss is the fixation time of the complementary event:
$\bar{t}_0(p) = -\frac{4Np\ln p}{1-p}$ generations.
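For a feel of the magnitudes involved, here is a small Python sketch of the two mean absorption times. It assumes the standard neutral diffusion results, t1(p) = -4N(1-p)ln(1-p)/p for fixation and t0(p) = -4Np ln(p)/(1-p) for loss; the function names and the choice N = 1000 are illustrative.

```python
import math

def mean_time_to_fixation(p, N):
    """Mean generations to fixation, given that the allele is not lost."""
    return -4 * N * (1 - p) * math.log(1 - p) / p

def mean_time_to_loss(p, N):
    """Mean generations to loss, given that the allele is not fixed."""
    return -4 * N * p * math.log(p) / (1 - p)

N = 1000
p_new = 1 / (2 * N)                 # a single new mutation
t_fix = mean_time_to_fixation(p_new, N)
t_loss = mean_time_to_loss(p_new, N)
print(f"fixation: {t_fix:.0f} generations, loss: {t_loss:.1f} generations")
```

For a new mutation the fixation time is close to 4N generations, while loss takes only about 2 ln(2N) generations: most new neutral alleles disappear quickly, and the rare ones that fix take a long time to do so.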
Effective population size
4N generations looks like a huge number (in a population of billions!). But in fact the Wright-Fisher model (like the Hardy-Weinberg model) is based on many unrealistic assumptions, including random mating: any two individuals can mate.
The effective population size is defined as the size of an idealized population for which the predicted dynamics of changes in allele frequency are similar to the observed ones.
For each measurable statistic of population dynamics, a different effective population size can be computed. For example, the expected variance in allele frequency after one generation is expressed as
$\mathrm{Var}(p') = \frac{pq}{2N}$
But we can use the same formula to define the effective population size given the variance:
$N_e = \frac{pq}{2\,\mathrm{Var}(p')}$
Effective population size: changing populations
If the population size changes over time, the dynamics are governed by the harmonic mean of the sizes over t generations:
$\frac{1}{N_e} = \frac{1}{t}\sum_{i=1}^{t}\frac{1}{N_i}$
So the effective population size is dominated by the size of the smallest bottleneck.
Bottlenecks can occur during migration, environmental stress, or isolation. Such effects greatly decrease heterozygosity (the founder effect; for example, Tay-Sachs in Ashkenazim). Bottlenecks can accelerate fixation of neutral or even deleterious mutations, as we shall see later.
Human effective population size over the last 2 My is estimated at around 10,000 (due to bottlenecks).
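The dominance of the smallest bottleneck is easy to see numerically. The sketch below assumes the harmonic-mean form 1/Ne = (1/t) * sum(1/N_i); the function name and the population sizes are made up for illustration.

```python
def harmonic_mean_ne(sizes):
    """Effective size over several generations: harmonic mean of the sizes."""
    return len(sizes) / sum(1 / n for n in sizes)

# Hypothetical history: four generations at 1000 and one bottleneck of 10.
history = [1000, 1000, 10, 1000, 1000]
ne = harmonic_mean_ne(history)
arithmetic_mean = sum(history) / len(history)
print(f"Ne = {ne:.1f} (arithmetic mean = {arithmetic_mean:.0f})")
```

A single generation at size 10 pulls the effective size down to about 48, even though the arithmetic mean is about 800.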
Effective population size: unequal sex ratio and sex chromosomes
If there are more females than males, or fewer males participate in reproduction, then the effective population size will be
$N_e = \frac{4 N_m N_f}{N_m + N_f}$
Any combination of alleles from a male and a
So if there are 10 times more females than males in the population (x males, 10x females), the effective population size is 4·x·10x/(11x) ≈ 3.6x, much less than the size of the population (11x).
Another example is the X chromosome, which is present in only one copy in males.
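The sex-ratio example can be checked with a couple of lines of Python. This sketch assumes the standard form Ne = 4 Nm Nf / (Nm + Nf) for Nm breeding males and Nf breeding females; the function name and the value x = 100 are illustrative.

```python
def sex_ratio_ne(n_males, n_females):
    """Effective population size with unequal numbers of breeding males/females."""
    return 4 * n_males * n_females / (n_males + n_females)

x = 100
ne = sex_ratio_ne(x, 10 * x)        # ten times more females than males
census = 11 * x
print(f"Ne = {ne:.1f} = {ne / x:.2f}x, census size = {census}")
```

With equal numbers of males and females the formula returns the census size, so the reduction comes entirely from the skew.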
Testing neutrality
The drift process has clear dynamics. We are usually interested in these dynamics as a baseline for testing hypotheses about non-neutral evolution. Such tests require predictions about the behavior of concrete statistics that we can measure from a population. For example, we can sequence alleles and count how many polymorphic sites exist in a gene and what their frequencies are. We can also perform evolutionary comparisons among different sites; we will focus on these later in the course.
Non neutral population dynamics
Slow evolution
Infinite alleles model
Assuming a gene with multiple loci, we can think of the number of possible alleles as much larger than the population. In this model, the probability of generating the same mutation twice is considered to be 0.
One can then ask how many distinct alleles we should observe given a neutral process and a certain mutation probability $\mu$.
Alternatively, one can ask what the probability of autozygosity F (identity by descent) will be:
$F_t = (1-\mu)^2\left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)F_{t-1}\right]$
(picking two autozygous alleles and not mutating them, or picking the same allele twice and not mutating it).
Looking for the steady state and neglecting terms of order $\mu^2$ and $\mu/N$:
$F = \frac{1}{1 + 4N\mu}$
Because of our model, F is also the fraction of homozygous individuals.
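The steady state can be checked by simply iterating the autozygosity recursion. The sketch below assumes the recursion F_t = (1-mu)^2 [1/(2N) + (1 - 1/(2N)) F_(t-1)] and compares its fixed point with the approximation 1/(1 + 4N mu); the function name and the parameter values (N = 1000, mu = 1e-4) are invented for the demonstration.

```python
def autozygosity_steady_state(N, mu, generations=50000):
    """Iterate the autozygosity recursion from F = 0 until it settles."""
    F = 0.0
    for _ in range(generations):
        # Either the same gamete is picked twice (1/2N), or two different
        # gametes that were already autozygous; neither copy may mutate.
        F = (1 - mu) ** 2 * (1 / (2 * N) + (1 - 1 / (2 * N)) * F)
    return F

N, mu = 1000, 1e-4
F_iter = autozygosity_steady_state(N, mu)
F_approx = 1 / (1 + 4 * N * mu)
print(f"iterated F = {F_iter:.4f}, approximation 1/(1+4N*mu) = {F_approx:.4f}")
```

The two values agree to about four decimal places here, which shows how little is lost by dropping the small terms.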
Testing the infinite alleles model
The Ewens formula enables us to predict the number of alleles (k) we should observe when sampling n times from a population with $\theta = 4N\mu$, assuming the infinite alleles model:
$E(k) = \sum_{i=0}^{n-1}\frac{\theta}{\theta + i}$
The Chinese restaurant process
Testing the infinite alleles model
We can estimate F from k (by finding $\theta$ from the E(k) formula).
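The estimation step just described (find theta from the observed k, then get F) can be sketched as follows. This is a hedged illustration: it assumes the Ewens expectation E(k) = sum over i from 0 to n-1 of theta/(theta+i) and the steady-state relation F = 1/(1+theta), and it inverts E(k) by plain bisection. The function names, search bounds, and example numbers are all hypothetical.

```python
def expected_num_alleles(theta, n):
    """Ewens expectation of the number of distinct alleles in a sample of n."""
    return sum(theta / (theta + i) for i in range(n))

def estimate_theta(k, n, lo=1e-9, hi=1e6, iterations=200):
    """Solve E(k) = k for theta by bisection (E(k) increases with theta).

    Requires 1 < k < n, otherwise no finite positive theta matches.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if expected_num_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Round trip on made-up values: theta = 2.5, sample of n = 50 alleles.
k_expected = expected_num_alleles(2.5, 50)
theta_hat = estimate_theta(k_expected, 50)
F_hat = 1 / (1 + theta_hat)
print(f"E(k) = {k_expected:.2f}, theta_hat = {theta_hat:.3f}, F_hat = {F_hat:.3f}")
```

Bisection is used only because E(k) is monotone in theta and the code stays dependency-free; any one-dimensional root finder would do.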
We use this statistic to test whether a given gene behaves neutrally (or at least according to the
Not quite neutral
Highly non neutral
Figure 7.16, 7.17
VNTR locus in humans: observed (open columns) and Ewens-predicted allele counts.
F computed from the number of Xdh alleles in 89 D. pseudoobscura lines: gene 52 had a common allele, 8 singletons. Compared to a simulation assuming the infinite alleles model.
Infinite sites model
Instead of looking at an entire gene with many alleles, consider the many sites that make up the gene and assume that these are changing slowly, so that most sites are monomorphic or dimorphic.
Probability of i mismatches in two random sequences:
$P(i) = \left(\frac{\theta}{1+\theta}\right)^{i}\frac{1}{1+\theta}$
In particular, autozygosity corresponds to i = 0, $P(0) = \frac{1}{1+\theta}$, just like we had for the infinite alleles model.
If we sample n alleles, the expected number of segregating sites S is
$E(S) = \theta\sum_{i=1}^{n-1}\frac{1}{i}$
assuming no intragenic recombination.
So we can test neutrality by looking at the number of alleles in a certain sample.
Coalescent theory
Any set of individuals in a population is the outcome of a coalescence process: a common ancestor giving rise to multiple alleles through mutation, duplication and recombination. Such models are in wide use for simulating populations. Applications for inferring selection/neutrality or other population dynamics are becoming reasonable as more data becomes available.
A simple coalescent model: look at the gene tree of the k observed alleles.
Fitness: the relative reproductive success of an individual (or genome). Fitness is only defined with respect to the current population, and it is unlikely to remain constant across all conditions and environments.
Sampling probability is multiplied by a selection factor 1+s.
Mutations can change fitness:
  • A deleterious mutation decreases fitness. It is therefore selected against; this process is called negative or purifying selection.
  • An advantageous or beneficial mutation increases fitness. It is therefore subject to positive selection.
  • A neutral mutation is one that does not change fitness.
For mono-allelic (haploid) populations, selection acts directly on the fitness of an allele. For diploid organisms, we must define how the combination of alleles affects fitness.
Selection in haploid populations
Example (Hartl and Dykhuizen 81): E. coli with two gnd alleles. One allele is beneficial for growth on gluconate. A population of E. coli was tracked for 35 generations while evolving on two media; the observed frequencies were:
  • Gluconate: 0.455 → 0.898
  • Ribose: 0.594 → 0.587
For gluconate: log(0.898/0.102) - log(0.455/0.545) = 35 log w, so log10(w) = 0.0292 and w = 1.0696. Compare to w = 0.999 in ribose.
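The arithmetic of this example can be reproduced with a short Python sketch. It assumes the haploid model in which the allele ratio p/q is multiplied by the relative fitness w each generation, so that w = exp((logit(p_t) - logit(p_0)) / t). The function name is illustrative, and the gluconate starting frequency is taken as 0.455, the value used in the calculation above.

```python
import math

def relative_fitness(p0, pt, generations):
    """Per-generation relative fitness w from start and end allele frequencies."""
    log_ratio_change = math.log(pt / (1 - pt)) - math.log(p0 / (1 - p0))
    return math.exp(log_ratio_change / generations)

w_gluconate = relative_fitness(0.455, 0.898, 35)
w_ribose = relative_fitness(0.594, 0.587, 35)
print(f"w (gluconate) = {w_gluconate:.4f}, w (ribose) = {w_ribose:.4f}")
```

The base of the logarithm does not matter as long as the same base is used to exponentiate, which is why natural logs give the same w as the base-10 calculation.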
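Relative fitness
Let w be the fitness of allele A relative to allele a, with frequencies p and q = 1 - p.
Gamete ratio after selection in generation t: $\frac{p_t}{q_t} = w\,\frac{p_{t-1}}{q_{t-1}}$
Ratio as a function of time: $\frac{p_t}{q_t} = w^{t}\,\frac{p_0}{q_0}$
Consider a continuous-time model with $s = \ln w$. The change in allele frequency is
$\frac{dp}{dt} = s\,p\,(1-p)$
Selection and allele frequency dynamics
For a diploid population with genotype fitnesses $w_{11}$, $w_{12}$, $w_{22}$ and genotypes in Hardy-Weinberg proportions (Hardy-Weinberg!), the mean fitness is $\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}$.
Change in frequency is given by
$\Delta p = \frac{pq\left[p\,(w_{11} - w_{12}) + q\,(w_{12} - w_{22})\right]}{\bar{w}}$
In the case of codominance, $w_{12} = (w_{11} + w_{22})/2$ and
$\Delta p = \frac{pq\,(w_{11} - w_{22})}{2\bar{w}}$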
Selection and fixation
An allele with a beneficial mutation will have an increased frequency in the gamete pool. Its chance of avoiding immediate extinction is
$1 - e^{-(1+s)}$
This is a rather modest increase, so even beneficial alleles are likely to be eliminated. For example, s = 0.1 gives a loss probability of 0.333, compared to 0.368 for a neutral allele.
For a diploid population, if we assume the fitness of a heterozygote is 1+s and that of a homozygote is 1+2s, it can be computed from the diffusion approximation that the overall fixation probability of an allele at initial frequency p will be
$u(p) = \frac{1 - e^{-4Nsp}}{1 - e^{-4Ns}}$
which for a new mutation ($p = 1/2N$) and small s is approximately 2s.
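Both numbers quoted above, and the fixation probability, can be checked with a few lines of Python. This is a sketch under stated assumptions: immediate loss is modeled as drawing zero copies from a Poisson offspring distribution with mean 1+s, and fixation uses the diffusion result u(p) = (1 - exp(-4Nsp)) / (1 - exp(-4Ns)) with a new mutation starting at p = 1/(2N). The function names and the values N = 1000, s = 0.01 are illustrative.

```python
import math

def immediate_loss_probability(s):
    """P(zero copies next generation) with Poisson offspring, mean 1 + s."""
    return math.exp(-(1 + s))

def fixation_probability(s, N, p=None):
    """Diffusion approximation; p defaults to a single new copy, 1/(2N)."""
    if p is None:
        p = 1 / (2 * N)
    if s == 0:
        return p                     # neutral limit
    return (1 - math.exp(-4 * N * s * p)) / (1 - math.exp(-4 * N * s))

loss_neutral = immediate_loss_probability(0.0)
loss_beneficial = immediate_loss_probability(0.1)
u = fixation_probability(0.01, 1000)
print(f"loss: neutral {loss_neutral:.3f}, s=0.1 {loss_beneficial:.3f}; u = {u:.4f}")
```

For s = 0.01 and N = 1000 the result is close to 2s = 0.02, forty times the neutral value 1/(2N) = 0.0005, yet still a 98% chance of eventual loss.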
Selection and fixation
The fixation time for a neutral allele (assuming fixation was achieved), as we said before, averages
$\bar{t} \approx 4N$ generations.
With a selective advantage, the fixation time is approximated by
$\bar{t} \approx \frac{2}{s}\ln(2N)$ generations.
Considering now the entire population, the rate of substitution at a locus equals the number of new mutations per generation times their fixation probability. In the neutral case this is very simple:
$K = 2N\mu \cdot \frac{1}{2N} = \mu$
So neutral evolution is unaffected by the size of the population.
With a selective advantage, the fixation probability is approximated by 2s, so
$K = 2N\mu \cdot 2s = 4N\mu s$
So evolution will be more efficient when the population is larger, the mutation rate is higher and selection is stronger. The parameter 4Nes describes the speed-up.
Other types of selection
  • Over-dominance: heterozygotes are fitter, so there is a possibility of equilibrium in allele frequencies. There are few examples, but one famous one is resistance to malaria and sickle cell anemia in Africa.
  • Frequency- and density-dependent selection: when fitness depends on the frequency of the allele or on the population size.
  • Fecundity selection: different reproductive potential for mating pairs.
  • Effects of heterogeneous environments (overdominance?)
  • Different effects in males and females
  • Effects that apply directly to the haplotype: gametic selection/meiotic drive (e.g., killing your homologous chromosome reproductive potential)
  • Kin selection: origin of
Recombination and selection
Linkage and selection
Linkage interferes with the purging of deleterious mutations and reduces the efficiency of positive selection.
Weakly deleterious
Selective sweep/Hitchhiking effect /genetic
Hill-Robertson effect
Linkage and selection
The variance in allele frequency is used to define the effective population size.
Simplistically, assume a neutral locus is evolving such that selective sweeps affect a fully linked locus at rate d. A sweep will fix the neutral allele with probability p, and we further assume that the sweep happens instantly.
This is very rough, but it demonstrates the basic intuition here: sweeps reduce the effectiveness of selection in a way that can be quantified through a reduction in the effective population size.
C: the average frequency of the neutral allele after the sweep