Equine genome sequence and assembly
The First International Equine Gene Mapping Workshop took place in October 1995 in
Lexington, Kentucky, and signified the beginning of an organised equine genomics group.
First-generation maps of the equine genome contained markers that were assigned to various
equine chromosomes using approaches such as synteny analysis (preserved colocalisation of
markers on chromosomes of different species) [4,5], genetic linkage mapping (tendency of
markers that are located close together to be inherited together) [6-9] and fluorescent in situ
hybridisation (the ability to detect specific DNA sequences on chromosomes) [10-14].
Markers consisted of type I markers (associated with genes of known function such as
expressed sequence tags) and type II markers (anonymous genomic segments, including
microsatellites [repeating sequences of 2–6 bases of DNA]). Radiation hybrid (RH) maps,
which use X-ray breakage of chromosomes to determine the distance between markers, were
generated for equine chromosomes in order to develop a high-resolution, ordered physical
map consisting of uniformly distributed polymorphic markers. The first-generation RH map
contained 733 markers [15] and the second-generation RH map contained 4103 markers
[16]. These maps provided the initial tools required to assemble the whole genome sequence
of the horse.
The first genome sequence of the domestic horse was published in November 2009 as a
collaborative effort by the worldwide equine research community [17]. DNA from a single
Thoroughbred mare, Twilight, was used to construct the genome sequence. From a panel of
candidate horses, Twilight was selected based on a high level of homozygosity within her
major histocompatibility complex, a region of high diversity relevant to immune system
function that is normally challenging to assemble. A whole-genome shotgun method was
used to sequence Twilight, where large fragments of genomic DNA were randomly sheared
and subsequently inserted into libraries for replication and sequencing. DNA libraries are
collections of DNA fragments that have been inserted into vectors for sequencing. For
Twilight, the libraries were various sizes including 4, 10 and 40 Kb, allowing easier
assembly of the sequence data. Due to the shearing process used, different sized DNA
fragments were created that were overlapped and joined to form contigs, or consensus
regions. Overlapping contigs were then joined together into larger sequences called
scaffolds. A high-quality draft assembly was constructed and additional sequences were
provided by the inclusion of bacterial artificial chromosome end sequences from a related
male Thoroughbred horse [18]. The resulting assembly (EquCab2.0) has 6.8-fold sequence
coverage. The genome size of the horse was estimated to be around 2.7 Gb [17].
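The overlap-and-join step described above can be illustrated with a toy example. The sketch below is a minimal greedy merge, assuming error-free reads and a hypothetical minimum overlap of 3 bases; the function names and the example reads are illustrative only, and the assembler used for the horse genome is far more elaborate than this.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return seqs  # nothing left to join: each remaining sequence is a contig
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]


# Hypothetical reads: the first three overlap, the fourth does not.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGTTTAA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Here the three overlapping reads collapse into one contig and the unrelated read remains as its own contig, which is the situation that scaffolding and map information are then used to resolve.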
In order to determine the chromosomal locations and orientation of the scaffolds within the
equine genome, the genome assembly was compared with the known maps for the horse
[8,16,19,20]. The equine gene set, as annotated by the ENSEMBL pipeline, predicts 20,322
protein-coding genes (Ensembl build 52.2b), similar to human (16,617; Ensembl build 73).
EquCab2.0 is hosted through public genome browsing sites, including the University of
California Santa Cruz, Ensembl and the National Center for Biotechnology Information.
As part of the equine genome project, partial genome sequences were obtained from seven
additional horses from seven different breeds (Akhal-Teke, Andalusian, Arabian, Icelandic,
Quarter Horse, Standardbred, Thoroughbred) to provide a database of genetic markers [17].
A SNP map of more than one million markers was generated from the approximately
700,000 SNPs discovered in the Twilight genome and the additional 400,000 SNPs
discovered from approximately 100,000 whole genome shotgun reads from these seven
horses. As a result, in addition to microsatellite markers, the SNP map was also available as
a genomic tool to investigate traits and inherited diseases.
In 2011, whole-genome sequencing of an individual American Quarter Horse mare was
performed using massively parallel paired-end sequencing [21]. This particular mare was
selected based on having no introgression of Thoroughbred lines during the preceding 4
generations. Approximately 97% of the 75-bp paired end reads aligned to the reference
genome, resulting in an average of 24.7× sequence coverage of the Quarter Horse mare’s
genome. Almost 82,000 reads mapped to the reference mitochondrial genome, resulting in
an average of 355.6× coverage, and approximately 12.8 million reads were mapped to the
unassembled chromosomes. The remaining 12.6 million reads were de novo assembled,
generating 19.1 Mb of new horse genomic sequence.
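Coverage figures such as these follow from a simple relationship: mean coverage = (number of aligned reads × read length) / genome size. The sketch below works through that arithmetic. It assumes the 2.7 Gb genome size estimate quoted earlier as the denominator (the original study may have used a different length), and the function names are illustrative.

```python
def mean_coverage(n_reads: float, read_length_bp: float, genome_size_bp: float) -> float:
    """Average number of times each base is covered by a read."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(coverage: float, read_length_bp: float, genome_size_bp: float) -> float:
    """Number of aligned reads implied by a given mean coverage."""
    return coverage * genome_size_bp / read_length_bp


GENOME_SIZE_BP = 2.7e9   # assumption: estimated horse genome size from the text
READ_LENGTH_BP = 75      # read length reported for the Quarter Horse study

implied_reads = reads_needed(24.7, READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"{implied_reads / 1e6:.0f} million aligned reads")
```

Under these assumptions, 24.7× coverage corresponds to roughly 890 million aligned 75-bp reads.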
One of the most exciting results from the sequencing of the Quarter Horse came from the
extensive variant detection analysis performed. Prior to this study, the catalogue of genetic
variants in the horse consisted of 1,163,580 SNPs, with no annotated
insertion/deletion polymorphisms or copy number variants. Upon sequencing of the Quarter
Horse, 3.1 million SNPs, 193,000 insertions/deletions, and 282 copy number variants were
detected and subsequently annotated [21]. Pathway analyses of biological pathways
containing heterozygous nonsynonymous SNPs were performed and results compared
between the Quarter Horse and reference Thoroughbred mare. It was discovered that the
Quarter Horse had SNPs enriched in pathways for sensory proprioception, cellular processes
and signal transduction. As this particular mare was not selected for sequencing based on
homozygosity and is a different breed than the reference sequence, this genome provides an
excellent resource for studies of genetic variation.
In addition to sequencing of contemporary breeds, the genome sequence of the ancient horse
has recently been investigated, revealing that the Equus lineage gave rise to all
contemporary horses, zebras and donkeys and that the lineage originated 4.0–4.5 million years
ago [22]. Additional sequencing of domestic horse breeds and a Przewalski’s horse has
revealed no evidence of recent admixture between the domestic horse breeds and
Przewalski’s horse [22], thereby supporting the notion that Przewalski’s horses represent the
last surviving wild horse population. Readers are directed to publications regarding ancient
DNA sequences for further information [23-25].
Efforts are currently underway to improve the equine reference sequence by building on
the Twilight assembly to create EquCab3.0.
Genomics tools
SNP beadchip
With the discovery of approximately one million SNPs from the sequencing efforts
described above [17], sufficient markers were available to construct a whole genome SNP
array. Preference was given to SNPs that were discovered in the alternate breeds (Akhal-Teke,
Andalusian, Arabian, Icelandic, Quarter Horse, Standardbred, Thoroughbred),
resulting in > 67% of the SNPs selected from one of these other breeds relative to Twilight
[17]. The first-generation array (Illumina EquineSNP50 Beadchip, San Diego, California,
USA; 2008) contained 54,602 SNPs that reliably produced genotypes when assessed on a
group of 354 horses representing 14 breeds [26]. Of the ~54 k SNPs, 53,524 were
polymorphic (i.e. having at least one heterozygote within the sample set). The EquineSNP50
Beadchip spanned the entire equine genome, with the exception of the Y chromosome, with
an average spacing between SNPs of 43.1 kb across the 31 autosomes and few gaps larger
than 500 kb.
In the original report describing the sequencing of Twilight, power estimates based on the
length of linkage disequilibrium (LD; level of association between markers) in the horse, the
number of haplotypes (i.e. combination of adjacent DNA sequences on a chromosome)
within haplotype blocks and the polymorphism rate, suggested that more than 100,000 SNPs
would be required to map traits within and across breeds [17]. The first generation SNP
array was validated on a panel of samples representing 14 domestic horse breeds and 18
evolutionarily related species [26]. Based on the extent of LD in breeds such as the Quarter
Horse and Mongolian horse, it has been recommended that more markers are required for
effective mapping in ancient breeds and those with a large effective population size [26].
Therefore, the first-generation Equine SNP50 Beadchip represented about one-half of the
estimated marker density required for adequately powered association studies in breeds with
an average or high degree of LD.
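To make the notion of LD concrete, the sketch below computes r², one commonly used measure of association between two markers. It assumes phased haplotypes are available and uses hypothetical allele data; the text does not state which LD statistic the cited studies used, so this is an illustration and not their method.

```python
def ld_r2(haplotypes: list[tuple[str, str]], allele_a: str, allele_b: str) -> float:
    """r^2 between allele_a at marker 1 and allele_b at marker 2.

    haplotypes: one (marker1_allele, marker2_allele) pair per phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h == (allele_a, allele_b) for h in haplotypes) / n
    d = p_ab - p_a * p_b  # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("one of the markers is monomorphic in this sample")
    return d * d / denom


# Hypothetical samples: alleles always travel together (complete LD) ...
complete_ld = [("A", "C")] * 5 + [("G", "T")] * 5
# ... or occur in every combination equally often (no LD).
no_ld = [("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")] * 3

print(ld_r2(complete_ld, "A", "C"), ld_r2(no_ld, "A", "C"))
```

When r² between neighbouring markers decays over short distances, as described for breeds with large effective population sizes, more densely spaced markers are needed for a genotyped SNP to tag an ungenotyped causal variant.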
The Equine SNP50 Beadchip was used to evaluate population structure in 744 individuals
from 33 breeds of horses [27]. Variation found among breeds was used to identify genes and
genetic variants targeted by selective breeding (i.e. signatures of selection). This study
identified variants in the American Paint Horse and American Quarter Horse breeds
significantly associated with altered muscle fibre type proportions favourable for sprinting
ability, variants in breeds that perform alternative gaits and genomic regions involved in the
determination of size [27].
In January 2011, the Equine SNP50 Beadchip was replaced by a second-generation SNP
array, the Equine SNP70 Beadchip, which contains approximately 74,500 SNP markers with
an average of 1.5 SNPs per 50 kb. This platform contains the original 53,500 markers from
the Equine SNP50 Beadchip and additional SNPs were chosen to address gaps and improve
global coverage across the genome. Additional SNPs were provided from the 7 discovery
breeds, Twilight and RNA sequencing (RNA-seq) data [28] (see below). The equine SNP70
Beadchip contains additional SNPs to enhance the coverage of the equine major
histocompatibility complex on chromosome 20 as well as SNPs on the X chromosome and 2
SNPs on the Y chromosome.
Association studies, using the equine SNP chips, were used to identify a chromosomal
region containing a strong candidate gene for lavender foal syndrome and subsequent
sequencing discovered the genetic mutation responsible for the disease [29]. In addition to
lavender foal syndrome, the SNP50 Beadchip was used to identify associations with SNP
markers that led to the subsequent identification of genetic mutations for foal
immunodeficiency syndrome [30] and a mutation that is permissive for gaitedness in the
horse [31]. Association studies using the equine SNP chips have also identified quantitative
trait loci for further investigation in osteochondritis dissecans in Thoroughbreds [32], risk
loci for recurrent laryngeal neuropathy [33], loci for body size [34], and candidate regions
for guttural pouch tympany [35], equine uveitis [36] and insect bite hypersensitivity [37]. As
the estimated required marker density of 100,000 SNPs has still not been achieved, efforts are
currently underway to develop a third-generation SNP Beadchip targeting 700,000
SNPs. This array is expected to become available in 2014.
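The core calculation behind a case-control association study at a single SNP can be sketched as an allelic chi-square test. The example below is a minimal illustration with hypothetical allele counts. The Bonferroni correction for the 54,602 markers on the first-generation array is an assumption made here for illustration; the cited studies may have used other tests and other significance thresholds.

```python
import math


def allelic_chi_square(case_counts: tuple[int, int], control_counts: tuple[int, int]) -> float:
    """Chi-square statistic (1 degree of freedom) for a 2x2 table of allele counts."""
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the 2x2 table is empty")
    return n * (a * d - b * c) ** 2 / denom


def p_value_1df(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical data: 20 cases and 20 controls, so 40 alleles per group.
cases = (30, 10)      # (allele 1, allele 2) counts in affected horses
controls = (15, 25)   # counts in unaffected horses

chi2 = allelic_chi_square(cases, controls)
p = p_value_1df(chi2)

N_MARKERS = 54_602             # markers on the first-generation array
threshold = 0.05 / N_MARKERS   # assumption: Bonferroni-corrected threshold
passes = p < threshold
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, genome-wide significant: {passes}")
```

In this hypothetical sample the difference in allele frequency looks substantial, yet it does not survive correction for tens of thousands of tests, which illustrates why sample size matters so much for statistical power.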
DNA microarrays
The study of tissue-specific gene expression in the horse under particular conditions and
considering certain disease processes is an ever-expanding area of research. The first tools
developed to study gene expression, through evaluation of the mRNA transcriptome (all of
the RNAs transcribed from the genome that code for proteins), included expressed sequence
tags [38], serial analysis of gene expression [39] and microarrays [40]. Until recently,
microarrays were used as the primary experimental method for analysing gene expression in
the horse at the transcriptome level. Microarray technology involves isolation of RNA
(target) and subsequent hybridisation to specific, known DNA-sequences on the microarray
(probe). Hybridisation patterns are then compared to enable the identification of mRNAs
that differ in abundance between two or more target samples [41].
Initially, human and mouse-specific arrays were used to profile gene expression in the horse.
Upon completion of the sequencing of the equine genome, several groups initiated efforts to
improve equine-specific microarrays, using the gene prediction models from Ensembl and
National Center for Biotechnology Information. Equine-specific microarrays have been used
to evaluate gene expression in laminitis [42] and articular cartilage repair [43]. A recent
study using microarray technology on placental tissues identified a >900-fold upregulation
of mRNA encoding the cytokine interleukin (IL)-22 in chorionic girdle, which is the first
time IL-22 has been reported in any cells other than immune cells [44]. As is required for
any expression study using microarray technology, these results were confirmed using
quantitative RT-PCR. Currently, Agilent provides a horse gene expression microarray with
43,803 probes that can be customised to meet specific research needs (Agilent eArray
custom microarray; available at https://earray.chem.agilent.com/earray/).
Next-generation sequencing
Most recently, RNA-seq methods have been used to refine gene structure models and
evaluate gene expression patterns [28,45]. Using RNA-seq generates quantitative and
qualitative data concurrently, while also providing insight into alternative transcripts from
the same gene. These RNA-seq techniques are currently being applied in the investigation of
many equine diseases. Also, RNA-seq methods were used to investigate the role of genomic
imprinting in the horse. Genomic imprinting is an epigenetic phenomenon by which certain
genes can be expressed in a parent-of-origin-specific manner. In 2012, Wang et al. used
RNA-seq methodologies to exclude a role for imprinting, which has been
demonstrated to occur in the extra-embryonic tissues of other species, in X-inactivation within
the horse and mule placenta [46]. The role of imprinting was further evaluated, using RNA-seq,
to determine that there is a paternal bias to expressed genes in the horse placenta,
thereby highlighting the importance of the placenta as the tissue for genomic imprinting
[47].
Next-generation sequencing techniques are also being employed at a DNA level (DNA-seq).
These more efficient technologies allow sequences representing the entire genome of a horse
to now be obtained at an affordable cost and offer unprecedented insight into the number of
variants (SNPs, insertions, deletions, rearrangements) within the genome. Many studies that
initially employed genome-wide association techniques (see above) are currently using next-generation
sequencing to investigate further a region of association.
As next-generation sequencing becomes increasingly affordable, whole
genome sequences in the horse are being generated worldwide. Analysis and storage of
these large amounts of sequence data can become problematic. At this time, there are many
software programs available to perform quality control, alignment to the reference sequence,
assembly of unaligned sequences, and variant detection in next-generation sequence data
[48]. For RNA-seq, there are various algorithms for quantifying and comparing gene
expression between conditions [49]. An extensive knowledge of bioinformatics has become
essential in processing the sequences obtained through next-generation sequencing.
FINNO and BANNASCH Page 6
Equine Vet J. Author manuscript; available in PMC 2015 February 13.
Appropriate strategies to investigate genetic traits
With the increased availability and affordability of genetic tools in horses, ample
opportunities exist to apply these molecular tools to a wide array of equine diseases. To
investigate genetic diseases in horses, both the sample population and the tools chosen
warrant consideration. For disease susceptibility traits, accurate phenotyping is essential.
Often, selection of affected cases is more straightforward than selection of an appropriate
control group. However, the phenotype to be studied should be decided a priori and its
severity graded, as this becomes useful when selecting cases for sequencing. Whenever
possible, a control population should be selected to maximise the chance that the control
horses would never manifest the phenotype and with consideration placed on the age of
disease onset, environmental risk factors and degree of relatedness between the control and
the affected populations. The utility of genetic tools is affected by the sample size available.
With moderate sized populations with easily discernible phenotypes and an autosomal
recessive mode of inheritance, an association between phenotype and a chromosomal locus
(loci) may be identified with the currently available techniques through a genome-wide
association study using the current Equine SNP BeadArrays. However, studies with small sample sizes, a
phenotype influenced by multiple risk factors, limited diagnostic accuracy or short LD in certain
breeds (e.g. Quarter Horse) can fail to detect significant associations owing to low statistical
power. Both DNA-seq and RNA-seq can be used concurrently to explore gene expression
and correlate with whole genome sequence variants. When designing an RNA-seq
experiment, it is necessary to factor in the fundamental aspects of sound experimental
design: replication, randomisation and blocking. Biological replicates are more important
than technical replicates in RNA-seq study design [50]. It is also imperative that the RNA
obtained in all cases is from a standardised site in the tissue of interest. Simultaneously
performing RNA-seq and whole genome DNA-seq on matching samples enhances the
power to detect biologically relevant variants in smaller sample sizes and should be
considered for investigating complex genetic traits.
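The three design principles named above (replication, randomisation and blocking) can be sketched for a small RNA-seq experiment as follows. This is a minimal illustration, assuming that the blocks are sequencing lanes or library preparation batches and that sample identifiers are hypothetical; it is not a prescription from the cited work.

```python
import random


def randomised_block_design(cases, controls, n_blocks, seed=0):
    """Spread cases and controls evenly over blocks, in random order.

    Blocking: every block receives a balanced share of each group, so that
    batch effects are not confounded with disease status.
    Randomisation: which sample goes to which block, and its position within
    the block, are decided by a seeded random number generator.
    """
    rng = random.Random(seed)
    groups = [list(cases), list(controls)]
    blocks = [[] for _ in range(n_blocks)]
    for group in groups:
        rng.shuffle(group)
        for i, sample in enumerate(group):
            blocks[i % n_blocks].append(sample)
    for block in blocks:
        rng.shuffle(block)
    return blocks


# Replication: four biological replicates (different horses) per group.
cases = [f"case_{i}" for i in range(1, 5)]
controls = [f"control_{i}" for i in range(1, 5)]

blocks = randomised_block_design(cases, controls, n_blocks=2)
for number, block in enumerate(blocks, start=1):
    print(f"block {number}: {block}")
```

Each block ends up with two cases and two controls, so a lane or batch effect cannot masquerade as a difference between affected and unaffected horses.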
Mendelian diseases and traits
Initial genetic mutations in horses were discovered through the use of comparative
genomics. For a certain disease, specific ‘candidate genes’ were investigated based on
equivalent diseases in man. In the horse, the genetic mutations for many diseases that have
genetic tests currently available, including hyperkalaemic periodic paralysis [51] and severe
combined immunodeficiency [52], were uncovered by evaluating candidate genes that had
been associated with similar diseases in man. With the sequencing and annotation of whole
genome maps, other diseases were discovered through whole genome linkage mapping
(hereditary equine regional dermal asthenia [53]), genome-wide association studies with
microsatellites (type I polysaccharide storage myopathy [54]) and genome-wide association
studies using SNP array technology (lavender foal syndrome [29]). At the time of
publication, 35 Mendelian diseases and traits have their key genetic mutations identified in
the horse, including the mutations encoding for coat colour loci (Supplementary Items 1 and
2). There are an additional 13 diseases or traits that appear to be inherited in a Mendelian
fashion but an underlying genetic mutation has not yet been identified or published
(Supplementary Item 3). An updated list of equine diseases can be found at the Online
Mendelian Inheritance for Animals webpage (http://omia.angis.org.au/home). With the
current technologies available through SNP-association mapping and next-generation
sequencing, we should expect to further our understanding of Mendelian traits and diseases.
Diseases with suspected heritable basis
There are many diseases in the horse with a suspected heritable basis for which a
genetic test is not currently available and the mode of inheritance is unclear (Supplementary
Item 4). Diseases and traits in this table include both those with strong evidence for a genetic
basis based on current research and those that have strong comparative correlates in other
species. Many of these more complex diseases and traits have strong environmental
influences and may be polygenic. Researchers studying these diseases have the opportunity
to utilise new technologies, including the equine SNP70 Beadchip, DNA-seq and RNA-seq,
to advance our understanding of genetic variants and gene expression.
Performance traits
Selection for performance traits has been most extensively studied in Warmblood horses
[55-57] and Thoroughbred racehorses [58-60]. In Warmbloods, based on the genetic
correlations between conformation, performance and radiographic health of the limbs [61],
research has been targeted at selecting breeding horses based on these multiple traits [55].
Heritabilities for showjumping were estimated at 0.39–0.61 in Hanoverian horses [62] and
0.12–0.28 for Swedish Warmblood horses [63]. Recently, a genome-wide association study
was performed for quantitative trait loci for showjumping abilities in Hanoverians [56]. This
study identified 6 QTL regions that contained genes previously identified as performance-related
genes in man, including PAPSS2 (3′-phosphoadenosine 5′-phosphosulfate synthase
2), MYL2 (myosin, light chain 2, regulatory, cardiac, slow), TRHR (thyrotropin-releasing
hormone receptor) and GABPA (GA binding protein transcription factor, α subunit 60 kDa)
[56].
In Thoroughbred racehorses, groups of genes associated with the control of substrate
utilisation, insulin signalling and muscle strength seem to be of the greatest importance to
performance [58-60,64]. Sequence variants associated with performance traits have been
reported in the genes MSTN (myostatin) [59,60,64,65], CKM (creatine kinase), COX4I2
(cytochrome c oxidase, subunit 4 isoform 2) [66] and PDK4 (pyruvate dehydrogenase kinase
isoenzyme 4, mitochondrial gene) [67]. Of these genes, the most extensively studied locus
has been the myostatin gene (MSTN, GDF-8). Myostatin is a secreted growth differentiation
factor that inhibits muscle differentiation and growth during myogenesis. Sequence and
structural variation has been discovered in the proximal upstream, downstream and
intronic sequences of the MSTN gene that are associated with optimum racing distance in
Thoroughbreds [59,60,65]. There are no variants identified within the coding sequence of this
gene; all of these associated variants are outside of exons. The variants include: 2 SNPs in
intron 1, a 227 bp insertion located 145 bp upstream of the transcriptional start site and four
3′ untranslated region SNPs [60]. A BLAST search identified the 227 bp insertion as a
horse-specific repetitive DNA sequence element (SINE) known as ERE-1 [60]. Of these
variants, one of the SNPs in intron 1 (g.66493737C>T; P = 5.24 × 10⁻¹³) and the SINE
insertion (P = 5.54 × 10⁻¹⁰) were highly associated with the quantitative trait of best race
distance in 165 samples [60]. A genetic test was made available, termed the Equinome
Speed Gene Test, to predict the type of distance best suited to a particular horse based on the
genotype at this intronic locus. Individuals homozygous for the ‘C’ allele (i.e. CC) appear to
compete best in faster, shorter-distance races, heterozygous horses (CT) are best at middle-distance
races and horses homozygous for the ‘T’ allele (i.e. TT) are best suited to longer-distance
races [64]. Short-distance races were considered to have a mean distance of 6.5 ±
1.5 furlongs, medium-distance 9.1 ± 2.3 furlongs and long-distance 11.0 ± 2.1 furlongs [64].
Many horse breeds demonstrate alternate gaits, including pace, regular rhythm ambling,
lateral ambling and diagonal ambling. The Icelandic horse has a characteristic gait, termed
the tölt, which is a regular ambling gait. A genome-wide association study was performed in
70 Icelandic horses that segregated by gait. Thirty of these horses were classified as 4-gaited
(walk, tölt, trot and gallop) and 40 were classified as 5-gaited (walk, tölt, trot, gallop
and pace) [31]. A significant association between the ability to pace and an SNP on
chromosome 23 was discovered. Subsequent re-sequencing of the region revealed a
homozygous haplotype block in the 5-gaited horses, which contained a family of doublesex
and mab-3 related transcription factors, DMRT1-3. Upon whole-genome re-sequencing, a
single base pair change in DMRT3 was discovered that introduces a premature stop
at codon 301. All 5-gaited Icelandic horses are homozygous for this nonsense mutation
while nongaited horses are homozygous wild-type. A high frequency of the DMRT3
mutation was found in horses bred for harness racing [31]. These researchers created Dmrt3-null
mice, which demonstrated that Dmrt3 is expressed in the spinal cord and is critical for
normal development of a coordinated locomotor network controlling limb movements [31]. A
recent study demonstrated worldwide distribution of the DMRT3 mutation, occurring in 68
out of the 141 breeds genotyped, most abundant in breeds classified as gaited [68].
Genetic testing
DNA tests can be divided into two categories: mutation tests and linked-marker or haplotype
tests. Mutation tests are based on assaying an actual mutation that causes disease, whereas
the linked-marker or haplotype test is based on an assay of a genomic region that is known
to be associated with the disease, but does not necessarily assay the actual mutation. Usually, haplotype tests
are offered instead of a mutation test where the functional mutation has not yet been
identified.
Mutations that cause disease appear in many different forms. A SNP can cause a disease
either by changing an amino acid (‘missense’ mutation), truncating the amino acid chain
(‘nonsense’ mutation), or altering expression or proper splicing. For example, a missense
mutation has been shown to cause type I polysaccharide storage myopathy [54]
(Supplementary Item 1). Insertions or deletions of a single base pair can cause mutations in
the coding sequence by altering the translational frame, which ultimately causes either
protein truncation or an elongated abnormal protein.
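The mutation classes described above can be illustrated with the standard genetic code. The sketch below classifies a single-base change within a codon as synonymous, missense or nonsense, and shows how a single-base deletion shifts the reading frame. The codons and the short coding sequence are hypothetical examples and do not represent the actual equine mutations.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


def translate(seq: str) -> str:
    """Translate from the first base until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(classify_snp("CGT", "CAT"))  # arginine to histidine
print(classify_snp("TCG", "TAG"))  # serine to stop
print(classify_snp("CTG", "CTA"))  # leucine to leucine

normal = "ATGGCCAAATGGTGA"           # hypothetical coding sequence
deletion = normal[:3] + normal[4:]   # one base deleted after the start codon
print(translate(normal), translate(deletion))
```

In the deletion example every codon after the start codon changes and the original stop codon is no longer read in frame, which corresponds to the elongated abnormal protein mentioned above.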
The basis for DNA testing is PCR. Primers can be designed specifically to amplify the DNA
fragment containing either the disease-causing allele or the normal allele. Direct sequencing
of a section of DNA can also be used to determine the animal’s genotype. Alternatively, the
PCR product can be digested with a restriction enzyme that cleaves the DNA at a particular
sequence of bases. To test for the mutation, a restriction enzyme is chosen that shows a
different cleavage pattern between the mutant and the normal forms of the DNA. Many
different methods are available to assay changes in DNA that lead to disease. Each company
that offers a test may choose a different type of assay for the same mutation.
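The restriction enzyme approach described above can be sketched as a comparison of fragment lengths. The example uses the EcoRI recognition sequence (GAATTC, cut after the first base) with hypothetical amplicon sequences; it is an illustration of the principle and not the assay used by any particular laboratory.

```python
def digest(amplicon: str, site: str, cut_offset: int) -> list[int]:
    """Fragment lengths after cutting at every occurrence of the recognition site."""
    cuts = []
    start = amplicon.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = amplicon.find(site, start + 1)
    bounds = [0] + cuts + [len(amplicon)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1  # EcoRI cuts between G and AATTC

# Hypothetical PCR products: the mutation (A to G) destroys the recognition site.
normal_allele = "ACGTACGT" + "GAATTC" + "TTGGCCAA"
mutant_allele = "ACGTACGT" + "GAGTTC" + "TTGGCCAA"

normal_fragments = digest(normal_allele, ECORI_SITE, ECORI_CUT_OFFSET)
mutant_fragments = digest(mutant_allele, ECORI_SITE, ECORI_CUT_OFFSET)
print(normal_fragments, mutant_fragments)
```

Under these assumptions a homozygous normal animal would show two short fragments, a homozygous mutant one uncut fragment, and a heterozygote all three.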
There are limits to all genetic testing. In mutation tests, the specific mutation being assayed
is the only factor being evaluated. An animal may have an alternative mutation in the same
gene or a mutation in a different gene that causes the same phenotype (phenocopy). It is
therefore correct to state that an animal has been ‘DNA tested negative’ for this specific
mutation rather than ‘DNA tested clear’ of the disease. Linked-marker tests have these same
and additional sources of error. In the case of linked-marker tests, recombination events
between the marker(s) and the true disease mutation can lead to false-positive and false-negative
results. The use of multiple markers that flank the gene of interest (haplotype test)
can increase the probability that a recombination event will be identified; and if one is
identified, the laboratory will know that the test is not valid for that individual.
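The logic of using flanking markers to detect recombination can be sketched as follows. This minimal example assumes one marker on each side of the gene, phased chromosomes and hypothetical allele names; it also cannot detect a double recombination that restores both flanking alleles, so it illustrates the reasoning and is not a validated test design.

```python
def interpret_haplotype(left: str, right: str, risk_left: str, risk_right: str) -> str:
    """Interpret one chromosome from the alleles at two markers flanking the gene."""
    left_is_risk = left == risk_left
    right_is_risk = right == risk_right
    if left_is_risk and right_is_risk:
        return "disease-associated haplotype"
    if not left_is_risk and not right_is_risk:
        return "normal haplotype"
    # Flanking markers disagree: a recombination occurred between them,
    # so the state of the gene itself cannot be inferred.
    return "recombinant: test not valid for this individual"


# Hypothetical risk alleles at the left and right flanking markers.
RISK_LEFT, RISK_RIGHT = "A", "T"

results = [
    interpret_haplotype("A", "T", RISK_LEFT, RISK_RIGHT),
    interpret_haplotype("G", "C", RISK_LEFT, RISK_RIGHT),
    interpret_haplotype("A", "C", RISK_LEFT, RISK_RIGHT),
]
print(results)
```

A single linked marker would have reported the third chromosome as disease-associated or normal depending on which marker was assayed, whereas the flanking pair flags it as uninformative.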
It is important to recognise that no authority, association or committee examines quality
control of DNA tests that are available in animals. Most tests are published in the scientific
literature, not as diagnostic tests, but as articles describing the discovery of the mutation.
Much of the research done to identify the mutations involved in the tests is performed at
universities and funded by granting agencies that have both financial and intellectual interest
in patenting the tests. Companies then license the rights to offer the tests. Veterinarians
should contact the laboratories to inquire about available genetic tests for horses and
determine if the laboratory maintains a license to run a particular test.
Conclusions
The past two decades have seen an explosion of research in the field of equine
genomics. With the creation of the original marker maps in the horse, subsequent
sequencing and annotation of the complete equine genome and the availability of genomic
tools to investigate specific traits and diseases, the study of equine genomics has rapidly
accelerated. Efforts are currently underway to improve upon the equine reference sequence
through the creation of EquCab3.0 and develop variant databases to expand our knowledge
of common variants in the equine genome. Undoubtedly, the next decade will continue to
see an increase in the number of available DNA tests for horses, in addition to an enhanced
understanding of specific traits and diseases at the molecular level.