
Genome Build (aka Reference Model)

Reference builds got a lot of press in 2018 because genetic genealogy companies were changing the build they report against, so we briefly explain here what a build is all about.

The terms (human) reference model, build, build type and reference build are synonymous. If you see genome build, build, or genome model on this site, it always refers to the Human Genome Reference Build / Model described here. With the introduction of WGS testing, this has become even more important to understand.

When the international Human Genome Project effort started in the late 1980s, it was understood that the results of mapping the nucleotides on each chromosome needed to be recorded and made available in a common way: that is, as a database of the human genome as it was understood at the time. Even though mammals share a large proportion of their DNA with one another, every human has slightly different DNA from every other human, and from every other creature. Those differences range from a single nucleotide base-pair change up to whole sequences of nucleotides being added, deleted or reordered in some other way. Between mammal species, whole chromosomes may even be added or missing. So the Human Genome Project created what it termed a reference build, which documents the then-known sequence of nucleotides (or base pairs) in a defined reference human.

The build has always focused on the genes in the DNA, documenting in depth their nucleotide sequences and, as best could be determined, where each gene sits in the larger chromosome strand. As that work finished, effort shifted to the intergenic (sometimes called junk or non-coding) areas of the DNA strand.

Non-coding areas comprise the majority of our DNA. This is where the largest and most drastic changes occur, so it is more difficult to fix in a reference build and has taken much more work to develop. These larger changes in the non-coding sections also make any count or specific position of a nucleotide in the strand difficult to determine: the count from any reference point, such as the ends of the strand or a telomere, is not stable. (Note: if a dramatic change occurs in a gene or coding region, the cell usually cannot survive. But major changes may occur in the non-coding region and the organism still survives. Hence the large variation in STR values, as most STRs are in the non-coding region.)

Fast forward to the end of 2013, when the latest major build was released by the Genome Reference Consortium: Human Genome Build 38 (GRCh38 for short). Most autosomal results from the first half of the 2010s were delivered as GRCh37 results or, early on, even as NCBI36 results (also known as hg19 and hg18, respectively). For example, FamilyTreeDNA had been delivering its yDNA SNP results in the context of an hg19 build and only recently (2018) updated to GRCh38. Autosomal results from microarray tests have been delivered consistently in build hg19 since 2012.

STR results are often given with named STR markers. SNP results are often given with named markers and also with an rsID number to identify them clearly. STR and SNP marker names are generally independent of the reference build that defines where the markers reside. Ditto for the rsID nomenclature. Only new markers, either not yet named or not yet fully defined, are identified by a locus point. In that case, their definition depends on the build. To do an apples-to-apples comparison of two different test results, you need to make sure the nomenclature is the same (if not the marker name, then a common underlying rsID; and if no rsID is known, then the same location in the same reference build).
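The comparison order described above (marker name, then rsID, then locus within the same build) can be sketched in a few lines of Python. This is only an illustration: the `Marker` record, the `same_marker` function and the positions used below are hypothetical and do not come from any vendor's file format.

```python
from typing import NamedTuple, Optional


class Marker(NamedTuple):
    """One reported variant. All field names here are illustrative assumptions."""
    name: Optional[str]   # marker name such as "L20", if one has been assigned
    rsid: Optional[str]   # rsID, if one is known
    build: str            # reference build the position is given in
    chrom: str
    pos: int


def same_marker(a: Marker, b: Marker) -> Optional[bool]:
    """Return True/False when the two records can be compared, None when they cannot."""
    # 1. Marker names do not depend on the build, so prefer them.
    if a.name and b.name:
        return a.name == b.name
    # 2. Otherwise fall back to a common rsID, also independent of the build.
    if a.rsid and b.rsid:
        return a.rsid == b.rsid
    # 3. A bare locus is only meaningful within one and the same build.
    if a.build == b.build:
        return (a.chrom, a.pos) == (b.chrom, b.pos)
    # Different builds and no shared identifier: positions must be converted first.
    return None


# Positions below are made up for illustration only.
old = Marker("L20", None, "GRCh37", "Y", 1000)
new = Marker("L20", None, "GRCh38", "Y", 900)
print(same_marker(old, new))  # named markers match regardless of build
```

Returning `None` for the last case is a deliberate choice in this sketch: two unnamed loci reported against different builds are neither equal nor different until one of them has been converted to the other build.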

It is interesting to note that the original Genome Build, at least the yDNA portion, was mostly based on an anonymous male submitter from the Buffalo / Rochester, New York area. He is within the clade R1b-L20. If this were left unchanged, others in that same clade would show virtually no yDNA SNPs that differ from the model. Luckily, the researchers building the phylogenetic trees were able to modify the reference model so SNPs placed on the tree are marked as derived (positive for change) from the developed root (or "Adam") there.1 Something similar is needed for the autosomes, with the additional caveat that the alternate contiguous regions would likely need to remain in the model to handle some (historically) distinct ethnic groups.

The Human Genome Build is, by definition, haploid, with the caveat that it includes both allosomes (the X and Y chromosomes). The tools that measure DNA for genetic genealogy capture the diploid genome and thus both copies of the autosomal chromosomes (and possibly both copies of the xDNA for biological females).

The reference model is most often delivered in a special FASTA file format known as a "final assembly" (or FA). So instead of segments out of a sequencer, the entries are segments representing each strand in the whole genome model. Gaps are filled in with "N"s, one for each specific base pair not yet defined, so the model always has a specific length in base pairs per chromosome. Technically, the mtDNA is often not considered a direct part of the model, but it is included in the analysis models used in tools.
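To make the role of the "N" placeholders concrete, here is a minimal Python sketch that reads FASTA-formatted text and reports, per sequence, the total length and the number of undefined ("N") positions. The function name `fasta_stats` and the toy sequence are made up for illustration; a real reference file is far larger and would normally be read from disk with a dedicated library.

```python
import io
from typing import Dict, Iterable, Tuple


def fasta_stats(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each sequence name to (total length, count of undefined 'N' bases)."""
    stats: Dict[str, Tuple[int, int]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'.
            name = line[1:].split()[0]
            stats[name] = (0, 0)
        elif name is not None:
            # Sequence line: gaps count toward the length, so positions stay fixed.
            length, gaps = stats[name]
            stats[name] = (length + len(line), gaps + line.upper().count("N"))
    return stats


# A made-up two-line sequence, not real genome data.
toy = io.StringIO(">chrToy made-up example\nNNNNACGT\nACGTNNAC\n")
print(fasta_stats(toy))  # {'chrToy': (16, 6)}
```

Because every gap still occupies one position per undefined base, the coordinates of the defined bases around it do not shift within a given build.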

Rows of equivalent nomenclature for build versions, followed by the various services using that build:

| Build Version | Company / Service using | Notes |
| --- | --- | --- |
| GRCh38 / hg38 | GEDMatch Genesis, 23andMe v5 (Sep 2017 and later), LivingDNA, FTDNA BigY-700 | Latest build; 1KGenome project model hs38 |
| GRCh37 / hg19 | FTDNA BigY, most atDNA results | It is a misnomer to call this hg37, although a popular model from the 1KGenome project is termed hs37d5 |
| NCBI36 / hg18 | GEDMatch Original, early FTDNA atDNA | |

When looking at raw data files, the positions of referenced variations are going to be dependent on the build used.

Note: We should clarify the naming for the reader. The hg reference model names are those generally released by the University of California, Santa Cruz (UCSC), which developed the original Genome Browser and analysis set during the Human Genome Project (HGP). GRCh (Genome Reference Consortium - human) is the name associated with later releases, made during the 1KGenome project by the Genome Reference Consortium and released by the European Bioinformatics Institute (EBI). In general, the two are equivalent except for how they name the chromosomes and, in some instances, which mtDNA model is included. For example, hg19 names the chromosomes "chr1", "chr2", and so on, and originally used the Yoruba mtDNA model. The GRCh37 and GRCh38 models name the chromosomes without the "chr" prefix and use the rCRS mtDNA model. These finer points are not covered here.
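As a small illustration of the chromosome naming difference, the Python sketch below converts between the two styles. The function names are hypothetical. The mapping of the mitochondrial name ("chrM" in the hg style, "MT" in the GRCh style) is an addition not stated in the text above, and renaming alone does nothing about which mtDNA model a file actually contains.

```python
def to_grch_style(chrom: str) -> str:
    """Drop a leading 'chr' prefix, e.g. 'chr1' -> '1'."""
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    # Assumption beyond the text above: the mitochondrial sequence is
    # commonly 'chrM' in hg-style files and 'MT' in GRCh-style files.
    return "MT" if chrom == "M" else chrom


def to_hg_style(chrom: str) -> str:
    """Add the 'chr' prefix, e.g. 'Y' -> 'chrY'."""
    if chrom.lower().startswith("chr"):
        return chrom  # already prefixed
    return "chrM" if chrom == "MT" else "chr" + chrom


for name in ["chr1", "chrX", "chrM"]:
    print(name, "->", to_grch_style(name))
```

Normalizing names this way only makes two files comparable by chromosome label; the positions within each chromosome still depend on the build, as noted earlier.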

1 Private communication with Thomas Krahn on 17 Nov 2018.

See Also


External References