c'est la vie...
by minjong :)
Equine molecular genetics | Category: Uncategorized | 2017.12.09 23:04

Equine genome sequence and assembly

The First International Equine Gene Mapping Workshop took place in October 1995 in

Lexington, Kentucky, and signified the beginning of an organised equine genomics group.

First-generation maps of the equine genome contained markers that were assigned to various

equine chromosomes using approaches such as synteny analysis (preserved colocalisation of

markers on chromosomes of different species) [4,5], genetic linkage mapping (tendency of

markers that are located close together to be inherited together) [6-9] and fluorescent in situ

hybridisation (the ability to detect specific DNA sequences on chromosomes) [10-14].

Markers consisted of type I markers (associated with genes of known function such as

expressed sequence tags) and type II markers (anonymous genomic segments, including

microsatellites [repeating sequences of 2–6 bases of DNA]). Radiation hybrid (RH) maps,

which use x-ray breakage of chromosomes to determine the distance between markers, were

generated for equine chromosomes in order to develop a high-resolution, ordered physical

map consisting of uniformly distributed polymorphic markers. The first-generation RH map

contained 733 markers [15] and the second-generation RH map contained 4103 markers

[16]. These maps provided the initial tools required to assemble the whole genome sequence

of the horse.

The first genome sequence of the domestic horse was published in November 2009 as a

collaborative effort by the worldwide equine research community [17]. DNA from a single

Thoroughbred mare, Twilight, was used to construct the genome sequence. From a panel of

candidate horses, Twilight was selected based on a high level of homozygosity within her

major histocompatibility complex, a region of high diversity relevant to immune system

function that is normally challenging to assemble. A whole-genome shotgun method was

used to sequence Twilight, where large fragments of genomic DNA were randomly sheared

and subsequently inserted into libraries for replication and sequencing. DNA libraries are

collections of DNA fragments that have been inserted into vectors for sequencing. For

Twilight, the libraries were of various sizes, including 4, 10 and 40 kb, allowing easier

assembly of the sequence data. Due to the shearing process used, different sized DNA

fragments were created that were overlapped and joined to form contigs, or consensus

regions. Overlapping contigs were then joined together into larger sequences called

scaffolds. A high-quality draft assembly was constructed and additional sequences were

provided by the inclusion of bacterial artificial chromosome end sequences from a related

male Thoroughbred horse [18]. The resulting assembly (EqCab2.0) has 6.8-fold sequence

coverage. The genome size of the horse was estimated to be around 2.7 Gb [17].
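
The overlap-and-join step described above can be illustrated with a toy greedy merger. This is only a minimal sketch, assuming error-free reads from one strand and exact suffix/prefix overlaps; the function names and the example fragments are hypothetical, and it is not the assembler used for EqCab2.0.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap.

    Whatever cannot be merged any further is returned as separate contigs.
    """
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: the fragments stay as distinct contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical sheared fragments that tile a 12-bp region.
reads = ["ATGCGT", "CGTACC", "ACCTTA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Real assemblers must also handle sequencing errors, both strands, repeats and paired-end distance constraints, which is why libraries of several sizes make the problem more tractable.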

In order to determine the chromosomal locations and orientation of the scaffolds within the

equine genome, the genome assembly was compared with the known maps for the horse

[8,16,19,20]. The equine gene set, as annotated by the Ensembl pipeline, predicts 20,322

protein-coding genes (Ensembl build 52.2b), similar to human (16,617; Ensembl build 73).

EqCab2.0 is hosted through public genome browsing sites, including the University of

California Santa Cruz, Ensembl and the National Center for Biotechnology Information.

As part of the equine genome project, partial genome sequences were obtained from seven

additional horses from seven different breeds (Akhal-Teke, Andalusian, Arabian, Icelandic,

Quarter Horse, Standardbred, Thoroughbred) to provide a database of genetic markers [17].

A SNP map of more than one million markers was generated from the approximately

700,000 SNPs discovered in the Twilight genome and the additional 400,000 SNPs

discovered from approximately 100,000 whole genome shotgun reads from these seven

horses. As a result, in addition to microsatellite markers, the SNP map was also available as

a genomic tool to investigate traits and inherited diseases.

In 2011, whole-genome sequencing of an individual American Quarter Horse mare was

performed using massively parallel paired-end sequencing [21]. This particular mare was

selected based on having no introgression of Thoroughbred lines during the preceding 4

generations. Approximately 97% of the 75-bp paired end reads aligned to the reference

genome, resulting in an average of 24.7× sequence coverage of the Quarter Horse mare’s

genome. Almost 82,000 reads mapped to the reference mitochondrial genome, resulting in

an average of 355.6× coverage, and approximately 12.8 million reads were mapped to the

unassembled chromosomes. The remaining 12.6 million reads were de novo assembled,

generating 19.1 Mb of new horse genomic sequence.
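
Average sequence coverage figures such as those quoted above follow from simple arithmetic: aligned bases divided by genome size. The sketch below assumes the 2.7 Gb genome size and 75-bp read length given in the text and treats coverage as uniform, which is an idealisation; the function names are hypothetical.

```python
def mean_coverage(n_aligned_reads: float, read_length_bp: int, genome_size_bp: float) -> float:
    """Average depth = total aligned bases / genome size."""
    return n_aligned_reads * read_length_bp / genome_size_bp


def reads_for_coverage(target_coverage: float, read_length_bp: int, genome_size_bp: float) -> float:
    """Number of aligned reads needed to reach a target average depth."""
    return target_coverage * genome_size_bp / read_length_bp


GENOME_SIZE_BP = 2.7e9   # estimated horse genome size quoted in the text
READ_LENGTH_BP = 75      # read length quoted in the text

needed = reads_for_coverage(24.7, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"about {needed:.3g} aligned reads for 24.7x coverage")
```

Under these assumptions, 24.7× coverage corresponds to roughly 890 million aligned 75-bp reads.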

One of the most exciting results from the sequencing of the Quarter Horse came from the

extensive variant detection analysis performed. Prior to this study, the catalogue of genetic

variants in the horse consisted of 1,163,580 SNPs, with no annotated

insertion/deletion polymorphisms or copy number variants. Upon sequencing of the Quarter

Horse, 3.1 million SNPs, 193,000 insertions/deletions, and 282 copy number variants were

detected and subsequently annotated [21]. Pathway analyses of biological pathways

containing heterozygous nonsynonymous SNPs were performed and results compared

between the Quarter Horse and reference Thoroughbred mare. It was discovered that the

Quarter Horse had SNPs enriched in pathways for sensory proprioception, cellular processes

and signal transduction. As this particular mare was not selected for sequencing based on

homozygosity and is a different breed than the reference sequence, this genome provides an

excellent resource for studies of genetic variation.

In addition to sequencing of contemporary breeds, the genome sequence of the ancient horse

has recently been investigated, revealing that the Equus lineage gave rise to all

contemporary horses, zebras and donkeys and that the lineage originated 4.0–4.5 million years

ago [22]. Additional sequencing of domestic horse breeds and a Przewalski’s horse has

revealed no evidence of recent admixture between the domestic horse breeds and

Przewalski’s horse [22], thereby supporting the notion that Przewalski’s horses represent the

last surviving wild horse population. Readers are directed to publications regarding ancient

DNA sequences for further information [23-25].

Efforts are currently underway to improve upon the equine reference sequence through the

creation of EquCab3.0 by improving upon the Twilight sequence.

Genomics tools

SNP beadchip

With the discovery of approximately one million SNPs from the sequencing efforts

described above [17], sufficient markers were available to construct a whole genome SNP

array. Preference was given to SNPs that were discovered in the alternate breeds (Akhal-Teke,

Andalusian, Arabian, Icelandic, Quarter Horse, Standardbred, Thoroughbred),

resulting in > 67% of the SNPs selected from one of these other breeds relative to Twilight

[17]. The first-generation array (Illumina EquineSNP50 Beadchip, San Diego, California,

USA; 2008) contained 54,602 SNPs that reliably produced genotypes when assessed on a

group of 354 horses representing 14 breeds [26]. Of the ~54 k SNPs, 53,524 were

polymorphic (i.e. having at least one heterozygote within the sample set). The EquineSNP50

Beadchip spanned the entire equine genome, with the exception of the Y chromosome, with

an average spacing between SNPs of 43.1 kb across the 31 autosomes and few gaps larger

than 500 kb.
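
The working definition of a polymorphic SNP used above (at least one heterozygote within the sample set) is easy to express in code. This is a minimal sketch, assuming genotypes are coded as the count of alternate alleles (0, 1 or 2) with None for a failed call; the marker names and genotypes are hypothetical.

```python
from typing import Dict, Optional, Sequence


def is_polymorphic(genotypes: Sequence[Optional[int]]) -> bool:
    """True if at least one called genotype is heterozygous (coded 1)."""
    return any(g == 1 for g in genotypes)


def count_polymorphic(panel: Dict[str, Sequence[Optional[int]]]) -> int:
    """Number of markers in the panel with at least one heterozygote."""
    return sum(is_polymorphic(g) for g in panel.values())


# Hypothetical genotypes for three markers typed in four horses.
panel = {
    "snp_a": [0, 1, 2, 0],      # one heterozygote: polymorphic
    "snp_b": [0, 0, 0, None],   # monomorphic among the called genotypes
    "snp_c": [2, 2, 1, 1],      # two heterozygotes: polymorphic
}
print(count_polymorphic(panel))
```

Note that, under this definition, a marker with both homozygous classes but no heterozygote in the sample would not be counted.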

In the original report describing the sequencing of Twilight, power estimates based on the

length of linkage disequilibrium (LD; level of association between markers) in the horse, the

number of haplotypes (i.e. combination of adjacent DNA sequences on a chromosome)

within haplotype blocks and the polymorphism rate, suggested that more than 100,000 SNPs

would be required to map traits within and across breeds [17]. The first generation SNP

array was validated on a panel of samples representing 14 domestic horse breeds and 18

evolutionarily related species [26]. Based on the extent of LD in breeds such as the Quarter

Horse and Mongolian horse, it has been recommended that more markers are required for

effective mapping in ancient breeds and those with a large effective population size [26].

Therefore, the first-generation Equine SNP50 Beadchip represented about one-half of the

estimated marker density required for adequately powered association studies in breeds with

an average or high degree of LD.
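
LD between two markers is often summarised by the squared correlation r² between their alleles. The sketch below shows that calculation under the assumption of phased, biallelic haplotypes coded 0/1; r² is one common measure and is not necessarily the statistic used in the cited studies, and the example haplotypes are hypothetical.

```python
from typing import Sequence, Tuple


def ld_r2(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """Squared correlation between alleles at two biallelic markers.

    Each haplotype is a pair (allele at marker A, allele at marker B), coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either marker is monomorphic")
    return d * d / denom


# Hypothetical haplotype samples.
complete_ld = [(0, 0), (0, 0), (1, 1), (1, 1)]   # alleles always travel together
no_ld = [(0, 0), (0, 1), (1, 0), (1, 1)]         # alleles are independent
print(ld_r2(complete_ld), ld_r2(no_ld))
```

The faster r² decays with distance in a breed, the more densely markers must be spaced for one of them to tag a causal variant.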

The Equine SNP50 Beadchip was used to evaluate population structure in 744 individuals

from 33 breeds of horses [27]. Variation found among breeds was used to identify genes and

genetic variants targeted by selective breeding (i.e. signatures of selection). This study

identified variants in the American Paint Horse and American Quarter Horse breeds

significantly associated with altered muscle fibre type proportions favourable for sprinting

ability, variants in breeds that perform alternative gaits and genomic regions involved in the

determination of size [27].

In January 2011, the Equine SNP50 Beadchip was replaced by a second-generation SNP

array, the Equine SNP70 Beadchip, which contains approximately 74,500 SNP markers with

an average of 1.5 SNPs per 50 kb. This platform contains the original 53,500 markers from

the Equine SNP50 Beadchip and additional SNPs were chosen to address gaps and improve

global coverage across the genome. Additional SNPs were provided from the 7 discovery

breeds, Twilight and RNA sequencing (RNA-seq) data [28] (see below). The equine SNP70

Beadchip contains additional SNPs to enhance the coverage of the equine major

histocompatibility complex on chromosome 20 as well as SNPs on the X chromosome and 2

SNPs on the Y chromosome.

Association studies, using the equine SNP chips, were used to identify a chromosomal

region containing a strong candidate gene for lavender foal syndrome and subsequent

sequencing discovered the genetic mutation responsible for the disease [29]. In addition to

lavender foal syndrome, the SNP50 Beadchip was used to identify associations with SNP

markers that led to the subsequent identification of genetic mutations for foal

immunodeficiency syndrome [30] and a mutation that is permissive for gaitedness in the

horse [31]. Association studies using the equine SNP chips have also identified quantitative

trait loci for further investigation in osteochondritis dissecans in Thoroughbreds [32], risk

loci for recurrent laryngeal neuropathy [33], loci for body size [34], and candidate regions

for guttural pouch tympany [35], equine uveitis [36] and insect bite hypersensitivity [37]. As

the estimated marker density of 100,000 SNPs has still not been achieved, efforts are

currently underway to develop a third generation SNP Beadchip, with a targeted 700,000

SNPs. Estimated availability of this array is scheduled for 2014.

DNA microarrays

The study of tissue-specific gene expression in the horse under particular conditions and

considering certain disease processes is an ever-expanding area of research. The first tools

developed to study gene expression, through evaluation of the mRNA transcriptome (all of

the RNAs transcribed from the genome that code for proteins), included expressed sequence

tags [38], serial analysis of gene expression [39] and microarrays [40]. Until recently,

microarrays were used as the primary experimental method for analysing gene expression in

the horse at the transcriptome level. Microarray technology involves isolation of RNA

(target) and subsequent hybridisation to specific, known DNA sequences on the microarray

(probe). Hybridisation patterns are then compared to enable the identification of mRNAs

that differ in abundance in ≥2 target samples [41].

Initially, human and mouse-specific arrays were used to profile gene expression in the horse.

Upon completion of the sequencing of the equine genome, several groups initiated efforts to

improve equine-specific microarrays, using the gene prediction models from Ensembl and

National Center for Biotechnology Information. Equine-specific microarrays have been used

to evaluate gene expression in laminitis [42] and articular cartilage repair [43]. A recent

study using microarray technology on placental tissues identified a >900-fold upregulation

of mRNA encoding the cytokine interleukin (IL)-22 in chorionic girdle, which is the first

time IL-22 has been reported in any cells other than immune cells [44]. As is required for

any expression study using microarray technology, these results were confirmed using

quantitative RT-PCR. Currently, Agilent provides a horse gene expression microarray with

43,803 probes that can be customised to meet specific research needs (Agilent eArray

custom microarray; available at https://earray.chem.agilent.com/earray/).

Next-generation sequencing

Most recently, RNA-seq methods have been used to refine gene structure models and

evaluate gene expression patterns [28,45]. Using RNA-seq generates quantitative and

qualitative data concurrently, while also providing insight into alternative transcripts from

the same gene. These RNA-seq techniques are currently being applied in the investigation of

many equine diseases. Also, RNA-seq methods were used to investigate the role of genomic

imprinting in the horse. Genomic imprinting is an epigenetic phenomenon by which certain

genes can be expressed in a parent-of-origin-specific manner. In 2012, Wang et al. used

RNA-seq methodologies to exclude the role of X-imprinted activation, which has been

demonstrated to occur in extra-embryonic studies of other species, in X-inactivation within

the horse and mule placenta [46]. The role of imprinting was further evaluated, using RNA-seq,

to determine that there is a paternal bias to expressed genes in the horse placenta,

thereby highlighting the importance of the placenta as the tissue for genomic imprinting.


Next-generation sequencing techniques are also being employed at a DNA level (DNA-seq).

These more efficient technologies allow sequences representing the entire genome of a horse

to now be obtained at an affordable cost and offer unprecedented insight into the number of

variants (SNPs, insertions, deletions, rearrangements) within the genome. Many studies that

initially employed genome-wide association techniques (see above) are currently using next-generation

sequencing to investigate further a region of association.

As the next-generation sequencing continues to become more and more affordable, whole

genome sequences in the horse are being generated worldwide. Analysis and storage of

these large amounts of sequence data can become problematic. At this time, there are many

software programs available to perform quality control, alignment to the reference sequence,

assembly of unaligned sequences, and variant detection in next-generation sequence data

[48]. For RNA-seq, there are various algorithms for quantifying and comparing gene

expression between conditions [49]. An extensive knowledge of bioinformatics has become

essential in processing the sequences obtained through next-generation sequencing.
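
As a simple illustration of what quantifying and comparing expression involves, raw read counts can be scaled to counts per million (CPM) and compared as a log2 fold change. This is only a sketch: the gene names and counts are hypothetical, and dedicated RNA-seq packages use more elaborate normalisation and statistical models that depend on biological replicates.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw read counts by library size so that samples are comparable."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(cpm_a: Dict[str, float], cpm_b: Dict[str, float],
                     gene: str, pseudocount: float = 1.0) -> float:
    """log2(B / A) for one gene; the pseudocount avoids division by zero."""
    return math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))


# Hypothetical counts for two genes in two conditions.
control = {"gene_1": 100, "gene_2": 900}
treated = {"gene_1": 400, "gene_2": 600}

cpm_control = counts_per_million(control)
cpm_treated = counts_per_million(treated)
print(round(log2_fold_change(cpm_control, cpm_treated, "gene_1"), 2))
```

A log2 fold change of about 2 corresponds to a roughly 4-fold increase in relative abundance.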


Equine Vet J. Author manuscript; available in PMC 2015 February 13.


Appropriate strategies to investigate genetic traits

With the increased availability and affordability of genetic tools in horses, ample

opportunities exist to apply these molecular tools to a wide array of equine diseases. To

investigate genetic diseases in horses, both the sample population and the tools chosen

warrant consideration. For disease susceptibility traits, accurate phenotyping is essential.

Often, selection of affected cases is more straightforward than selection of an appropriate

control group. However, the phenotype to be studied should be decided a priori and its

severity graded, as this becomes useful when selecting cases for sequencing. Whenever

possible, a control population should be selected to maximise the chance that the control

horses would never manifest the phenotype and with consideration placed on the age of

disease onset, environmental risk factors and degree of relatedness between the control and

the affected populations. The utility of genetic tools is affected by the sample size available.

With moderate sized populations with easily discernible phenotypes and an autosomal

recessive mode of inheritance, an association between phenotype and a chromosomal locus

(loci) may be identified with the currently available techniques through a genome-wide

association study using the current Equine SNP BeadArrays. However, with small sample sizes, a

phenotype influenced by multiple risk factors, poor diagnostic accuracy or short LD in certain

breeds (e.g. Quarter Horse), such studies can fail to detect significant associations because of low statistical

power. Both DNA-seq and RNA-seq can be used concurrently to explore gene expression

and correlate with whole genome sequence variants. When designing an RNA-seq

experiment, it is necessary to factor in the fundamental aspects of sound experimental

design: replication, randomisation and blocking. Biological replicates are more important

than technical replicates in RNA-seq study design [50]. It is also imperative that the RNA

obtained in all cases is from a standardised site in the tissue of interest. Simultaneously

performing RNA-seq and whole genome DNA-seq on matching samples enhances the

power to detect biologically relevant variants in smaller sample sizes and should be

considered for investigating complex genetic traits.
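
At its simplest, a case-control association test at a single SNP compares allele counts between affected and control horses. The sketch below uses a 2 × 2 allelic chi-squared test with one degree of freedom. It is an illustration under stated assumptions (unrelated animals, no correction for population structure), not the analysis used in any cited study, and the counts are hypothetical.

```python
import math
from typing import Tuple


def allelic_chi2(case_alt: int, case_ref: int, ctrl_alt: int, ctrl_ref: int) -> Tuple[float, float]:
    """Chi-squared statistic and P value for a 2 x 2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    n = sum(row)
    if 0 in row or 0 in col:
        raise ValueError("the test is undefined when a row or column total is zero")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 25 cases and 25 controls, two alleles per horse.
chi2, p = allelic_chi2(case_alt=40, case_ref=10, ctrl_alt=15, ctrl_ref=35)

# One conservative multiple-testing threshold: Bonferroni over the 54,602 SNPs on the first array.
threshold = 0.05 / 54_602
print(round(chi2, 2), p < threshold)
```

Because every SNP on the array is tested, the per-marker threshold is very stringent, which is one reason small sample sizes so often lack the power to detect associations.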

Mendelian diseases and traits

Initial genetic mutations in horses were discovered through the use of comparative

genomics. For a certain disease, specific ‘candidate genes’ were investigated based on

equivalent diseases in man. In the horse, the genetic mutations for many diseases that have

genetic tests currently available, including hyperkalaemic periodic paralysis [51] and severe

combined immunodeficiency [52], were uncovered by evaluating candidate genes that had

been associated with similar diseases in man. With the sequencing and annotation of whole

genome maps, other diseases were discovered through whole genome linkage mapping

(hereditary equine regional dermal asthenia [53]), genome-wide association studies with

microsatellites (type I polysaccharide storage myopathy [54]) and genome-wide association

studies using SNP array technology (lavender foal syndrome [29]). At the time of

publication, 35 Mendelian diseases and traits have their key genetic mutations identified in

the horse, including the mutations encoding for coat colour loci (Supplementary Items 1 and

2). There are an additional 13 diseases or traits that appear to be inherited in a Mendelian

fashion but an underlying genetic mutation has not yet been identified or published

(Supplementary Item 3). An updated list of equine diseases can be found at the Online



Mendelian Inheritance for Animals webpage (http://omia.angis.org.au/home). With the

current technologies available through SNP-association mapping and next-generation

sequencing, we should expect to further our understanding of Mendelian traits and diseases.

Diseases with suspected heritable basis

Currently, there are many diseases in the horse with a suspected heritable basis for which a

genetic test is not yet available and the mode of inheritance is unclear (Supplementary

Item 4). Diseases and traits in this table include both those with strong evidence for a genetic

basis based on current research and those that have strong comparative correlates in other

species. Many of these more complex diseases and traits have strong environmental

influences and may be polygenic. Researchers studying these diseases have the opportunity

to utilise new technologies, including the equine SNP70 Beadchip, DNA-seq and RNA-seq,

to advance our understanding of genetic variants and gene expression.

Performance traits

Selection for performance traits has been most extensively studied in Warmblood horses

[55-57] and Thoroughbred racehorses [58-60]. In Warmbloods, based on the genetic

correlations between conformation, performance and radiographic health of the limbs [61],

research has been targeted at selecting breeding horses based on these multiple traits [55].

Heritability for showjumping was estimated at 0.39–0.61 in Hanoverian horses [62] and

0.12–0.28 for Swedish Warmblood horses [63]. Recently, a genome-wide association study

was performed for quantitative trait loci for showjumping abilities in Hanoverians [56]. This

study identified 6 QTL regions that contained genes previously identified as performance-related

genes in man, including PAPSS2 (3′-phosphoadenosine 5′-phosphosulfate synthase

2), MYL2 (myosin, light chain 2, regulatory, cardiac, slow), TRHR (thyrotropin-releasing

hormone receptor) and GABPA (GA binding protein transcription factor, α subunit 60 kDa).


In Thoroughbred racehorses, groups of genes associated with the control of substrate

utilisation, insulin signalling and muscle strength seem to be of the greatest importance to

performance [58-60,64]. Sequence variants associated with performance traits have been

reported in the genes MSTN (myostatin) [59,60,64,65], CKM (creatine kinase), COX4I2

(cytochrome c oxidase, subunit 4 isoform 2) [66] and PDK4 (pyruvate dehydrogenase kinase

isoenzyme 4, mitochondrial gene) [67]. Of these genes, the most extensively studied locus

has been the myostatin gene (MSTN, GDF-8). Myostatin is a secreted growth differentiation

factor that inhibits muscle differentiation and growth during myogenesis. Sequence and

structural variation has been discovered in the proximal upstream, downstream and

intergenic sequences of the MSTN gene that are associated with optimum racing distance in

Thoroughbreds [59,60,65]. There are no variants identified within coding sequence of this

gene; all of these associated variants are outside of exons. The variants include: 2 SNPs in

intron 1, a 227 bp insertion located 145 bp upstream of the transcriptional start site and four

3′ untranslated region SNPs [60]. A BLAST search identified the 227 bp insertion as a

horse-specific repetitive DNA sequence element (SINE) known as ERE-1 [60]. Of these

variants, one of the SNPs in intron 1 (g.66493737C>T; P = 5.24 × 10−13) and the SINE

insertion (P = 5.54 × 10−10) were highly associated with the quantitative trait of best race

distance in 165 samples [60]. A genetic test was made available, termed the Equinome

Speed Gene Test, to predict the type of distance best suited to a particular horse based on the

genotype at this intronic locus. Individuals homozygous for the ‘C’ allele (i.e. CC) appear to

compete best in faster, shorter-distance races, heterozygous horses (CT) are best at middle-distance

races and horses homozygous for the ‘T’ allele (i.e. TT) are best suited to longer-distance

races [64]. Short-distance races were considered to have a mean distance of 6.5 ±

1.5 furlongs, medium-distance 9.1 ± 2.3 furlongs and long-distance 11.0 ± 2.1 furlongs [64].
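
The reported genotype-to-distance association amounts to a three-way lookup on the intronic SNP. The sketch below simply encodes the association as stated in the text; it is a hypothetical illustration, not the algorithm of the commercial test, and the function and constant names are assumptions.

```python
# Association reported in the text for the MSTN intron 1 SNP (C>T).
DISTANCE_BY_GENOTYPE = {
    "CC": "short-distance",
    "CT": "middle-distance",
    "TT": "long-distance",
}


def predicted_distance_category(allele_1: str, allele_2: str) -> str:
    """Map an unordered pair of alleles to the distance category reported for it."""
    genotype = "".join(sorted((allele_1.upper(), allele_2.upper())))
    if genotype not in DISTANCE_BY_GENOTYPE:
        raise ValueError(f"unexpected genotype: {genotype!r}")
    return DISTANCE_BY_GENOTYPE[genotype]


print(predicted_distance_category("T", "C"))
```

Sorting the alleles makes the lookup independent of the order in which they are reported. The wide, overlapping distance ranges quoted above show that the genotype shifts the likely optimum rather than determining it.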

Many horse breeds demonstrate alternate gaits, including pace, regular rhythm ambling,

lateral ambling and diagonal ambling. The Icelandic horse has a characteristic gait, termed

the tölt, which is a regular ambling gait. A genome-wide association study was performed in

70 Icelandic horses that segregated by gait. Thirty of these horses were classified as 4-gaited

(walk, tölt, trot and gallop) and 40 were classified as 5-gaited (walk, tölt, trot, gallop

and pace) [31]. A significant association between the ability to pace and an SNP on

chromosome 23 was discovered. Subsequent re-sequencing of the region revealed a

homozygous haplotype block in the 5-gaited horses, which contained a family of doublesex

and mab-3 related transcription factors, DMRT1-3. Upon whole-genome re-sequencing, a

single base pair change in DMRT3 was discovered that led to a premature stop

at codon 301. All 5-gaited Icelandic horses are homozygous for this nonsense mutation

while nongaited horses are homozygous wild-type. A high frequency of the DMRT3

mutation was found in horses bred for harness racing [31]. These researchers created Dmrt3-null

mice, demonstrating that Dmrt3 is expressed in the spinal cord and is critical for

normal development of a coordinated locomotor network controlling limb movements [31]. A

recent study demonstrated worldwide distribution of the DMRT3 mutation, occurring in 68

out of the 141 breeds genotyped, most abundant in breeds classified as gaited [68].

Genetic testing

DNA tests can be divided into two categories: mutation tests and linked-marker or haplotype

tests. Mutation tests are based on assaying an actual mutation that causes disease, whereas

the linked-marker or haplotype test is based on an assay of the genomic region that is known

to cause disease, but which is not necessarily the actual mutation. Usually, haplotype tests

are offered instead of a mutation test where the functional mutation has not yet been identified.


Mutations that cause disease appear in many different forms. A SNP can cause a disease

either by changing an amino acid (‘missense’ mutation), truncating the amino acid chain

(‘nonsense’ mutation), or altering expression or proper splicing. For example, a missense

mutation has been shown to cause type I polysaccharide storage myopathy [54]

(Supplementary Item 1). Insertions or deletions of a single base pair can cause mutations in

the coding sequence by altering the translational frame, which ultimately causes either

protein truncation or an elongated abnormal protein.
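
The categories above can be made concrete by translating the reference and variant codons with the standard genetic code. This is a minimal sketch: the example codons are hypothetical and are not the actual equine mutations, and effects on splicing or expression are outside its scope.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"    # premature stop: truncated amino acid chain
    if ref_aa == "*":
        return "stop-lost"   # stop codon removed: elongated protein
    return "missense"        # one amino acid replaced by another


def is_frameshift(indel_length_bp: int) -> bool:
    """An insertion or deletion shifts the reading frame unless its length is a multiple of 3."""
    return indel_length_bp % 3 != 0


# Hypothetical examples.
print(classify_coding_snp("CGT", "CAT"))   # arginine to histidine
print(classify_coding_snp("TCA", "TAA"))   # serine to stop
print(is_frameshift(1), is_frameshift(3))
```

A single-base insertion or deletion is the smallest possible frameshift, which is why it so often results in a truncated or abnormally elongated protein.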

The basis for DNA testing is PCR. Primers can be designed specifically to amplify the DNA

fragment containing either the disease-causing allele or the normal allele. Direct sequencing



of a section of DNA can also be used to determine the animal’s genotype. Alternatively, the

PCR product can be digested with a restriction enzyme that cleaves the DNA at a particular

sequence of bases. To test for the mutation, a restriction enzyme is chosen that shows a

different cleavage pattern between the mutant and the normal forms of the DNA. Many

different methods are available to assay changes in DNA that lead to disease. Each company

that offers a test may choose a different type of assay for the same mutation.
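
The restriction-digest approach can be sketched in a few lines: if a mutation destroys (or creates) a recognition site, the fragment pattern of the PCR product changes. The example assumes EcoRI, which recognises GAATTC and cuts after the first G; the amplicon sequences are hypothetical and only the top strand is considered.

```python
from typing import List


def digest(amplicon: str, site: str, cut_offset: int) -> List[int]:
    """Fragment lengths after cutting at every occurrence of `site`.

    `cut_offset` is the position of the cut within the recognition site
    (1 for EcoRI, which cuts G^AATTC).
    """
    fragments = []
    start = 0
    pos = amplicon.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = amplicon.find(site, pos + 1)
    fragments.append(len(amplicon) - start)
    return fragments


ECORI_SITE, ECORI_CUT = "GAATTC", 1

# Hypothetical 22-bp amplicons differing by a single base inside the site.
normal_allele = "ATGCCGAATTCGGTACCTTAGC"
mutant_allele = "ATGCCGAGTTCGGTACCTTAGC"

print(digest(normal_allele, ECORI_SITE, ECORI_CUT))   # site present: two fragments
print(digest(mutant_allele, ECORI_SITE, ECORI_CUT))   # site destroyed: one uncut fragment
```

On a gel, a heterozygous animal would show all three bands in this hypothetical case: the uncut product from one allele and the two digestion fragments from the other.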

There are limits to all genetic testing. In mutation tests, the specific mutation being assayed

is the only factor being evaluated. An animal may have an alternative mutation in the same

gene or a mutation in a different gene that causes the same phenotype (phenocopy). It is

therefore correct to state that an animal has been ‘DNA tested negative’ for this specific

mutation rather than ‘DNA tested clear’ of the disease. Linked-marker tests have these same

and additional sources of error. In the case of linked-marker tests, recombination events

between the marker(s) and the true disease mutation can lead to false-positive and false-negative

results. The use of multiple markers that flank the gene of interest (haplotype test)

can increase the probability that a recombination event will be identified; and if one is

identified, the laboratory will know that the test is not valid for that individual.
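
The logic of a flanking-marker (haplotype) test can be sketched as follows. This is a deliberately simplified illustration: it assumes phased chromosomes and that each risk allele occurs only on the disease-associated haplotype. The marker alleles and function names are hypothetical.

```python
from typing import List, Tuple


def interpret_chromosome(left_allele: str, right_allele: str,
                         risk_left: str, risk_right: str) -> str:
    """Interpret one chromosome typed at two markers flanking the gene of interest."""
    carries_left = left_allele == risk_left
    carries_right = right_allele == risk_right
    if carries_left and carries_right:
        return "risk haplotype"
    if not carries_left and not carries_right:
        return "non-risk haplotype"
    # The flanking markers disagree: a recombination between them is inferred.
    return "recombinant (test not valid)"


def interpret_horse(chromosomes: List[Tuple[str, str]],
                    risk_left: str, risk_right: str) -> List[str]:
    """Interpret both chromosomes of one animal."""
    return [interpret_chromosome(l, r, risk_left, risk_right) for l, r in chromosomes]


# Hypothetical animal: one intact risk haplotype and one recombinant chromosome.
result = interpret_horse([("A", "G"), ("C", "G")], risk_left="A", risk_right="G")
print(result)
```

A double recombination between the two markers would go undetected by this scheme, which is one reason a direct mutation test is preferable once the causal variant is known.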

It is important to recognise that no authority, association or committee examines quality

control of DNA tests that are available in animals. Most tests are published in the scientific

literature, not as diagnostic tests, but as articles describing the discovery of the mutation.

Much of the research done to identify the mutations involved in the tests is performed at

universities and funded by granting agencies that have both financial and intellectual interest

in patenting the tests. Companies then license the rights to offer the tests. Veterinarians

should contact the laboratories to inquire about available genetic tests for horses and

determine if the laboratory maintains a license to run a particular test.


The past two decades have resulted in an explosion of research in the field of equine

genomics. With the creation of the original marker maps in the horse, subsequent

sequencing and annotation of the complete equine genome and the availability of genomic

tools to investigate specific traits and diseases, the study of equine genomics has rapidly

accelerated. Efforts are currently underway to improve upon the equine reference sequence

through the creation of EquCab3.0 and develop variant databases to expand our knowledge

of common variants in the equine genome. Undoubtedly, the next decade will continue to

see an increase in the number of available DNA tests for horses, in addition to an enhanced

understanding of specific traits and diseases at the molecular level.

