A DNA Primer

This DNA Primer gives just enough information to understand what is tested, and why, so you can see how DNA testing is useful in genealogical studies. See the external references at the end for a more complete biological introduction to the Human Genome. Many terms on this page link to the glossary, which provides further detail and keeps the introduction here gentle. Feel free to bounce back and forth as an aid to your learning and understanding.

What is DNA?

Human Chromosomes

From basic K-12 biology education, most recall that DNA exists as Chromosomes inside the nucleus of a cell.
In Humans, there are 23 pairs of chromosomes in the nucleus (46 chromosomes in total). 23 come from your mother and 23 from your father.
22 of these chromosome pairs are known as Autosomes and are labelled 1 to 22, numbered roughly from the longest to the shortest. The 23rd and last pair is the Allosome pair, or sex chromosomes, as they determine the biological sex of the person. If you have two X chromosomes, you are female. If you have one X and one Y, you are male. In addition to this nuclear DNA, there are Mitochondria outside the nucleus but inside the cell body. Mitochondria also contain some primitive, almost bacteria-like, DNA. All of this DNA makes up the Human Genome and can be used in genetic genealogy to help uncover clues about your past.

Structure of DNA

Each of your 46 chromosomes, and the DNA in each mitochondrion, consists of a twisted strand of nucleotide pairs in the now familiar shape known as the double helix. Think of it like a ladder with long strands of licorice for the vertical support rails. The ladder is twisted by spinning one end of the rails (strands) while holding the other end fixed. Each rail is a chain of nucleotides. A nucleobase on one rail, paired with a nucleobase on the other as a base-pair, forms a rung connecting the rails. Think of the rungs like toothpicks connecting the licorice rails. There are only four (4) basic nucleobases. They go by the letter abbreviations A, G, T and C. The sequence of these four nucleobases along the rungs of the ladder is like the binary sequence of a stored program in a computer: taken in larger chunks and "executed", it defines a function and behavior. The "program" is simply expressed in a chemical language of nature instead of a binary machine code. Nucleobases are strictly matched when paired (A always with T, and G always with C), so if you know one side of a rung you know the other.
To fully sequence a chromosome is to determine the type and order of all nucleobases on one side of a DNA strand.
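The strict pairing of nucleobases can be shown with a minimal sketch. The function name other_rail is made up for this illustration, and the sketch reads the opposite rail rung by rung in the same direction, which is a simplification (laboratories conventionally report the opposite strand in reverse order).

```python
# Pairing rules for the four nucleobases: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def other_rail(rail: str) -> str:
    """Return the nucleobases on the opposite rail, rung by rung."""
    return "".join(PAIRS[base] for base in rail)


print(other_rail("ATGGCA"))  # TACCGT
```

Because the pairing is strict, applying the function twice returns the original rail, which is why sequencing one side of the ladder is enough.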
 

There are around 6.4 billion nucleobase pairs in the DNA of a human cell. The autosomes (1-22) hold the most, ranging from about 250 million down to about 50 million base-pairs on one "rail" of each chromosome, respectively. Mitochondria and the Y chromosome are the smallest (in general*), with mitochondria having just over 16 thousand base-pairs and the Y chromosome nearer 60 million. The X chromosome has about 155 million base-pairs. This is too much information to consider fully sequencing today, as the cost in both money and time is too great to be practical. So another mechanism is used to measure, test and compare our DNA. The solution lies in the fact that the DNA of all humans is about 99.9% identical. But that still leaves several million differences between any two of us. Before we cover the testing solution used in genetic genealogy, let's cover a few more terms.
* Chromosomes 21 and 22 actually are slightly shorter than the Y chromosome of 60 million base-pairs.
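As a rough check on how a percentage of shared DNA translates into a count of differing positions, the arithmetic can be sketched as below. The 6.4 billion total comes from the text above; the function name and the 99.9% identity figure used in the example are assumptions for illustration, and the result is an order-of-magnitude figure only.

```python
def differing_positions(total_base_pairs: int, identical_fraction: float) -> int:
    """Estimate how many positions differ for a given shared fraction."""
    return round(total_base_pairs * (1 - identical_fraction))


TOTAL_BASE_PAIRS = 6_400_000_000  # figure quoted in the text
ASSUMED_IDENTITY = 0.999          # assumption: a commonly quoted ballpark

print(f"{differing_positions(TOTAL_BASE_PAIRS, ASSUMED_IDENTITY):,}")  # 6,400,000
```

Even a very small fraction of a very large number leaves plenty of positions to compare, which is what makes marker testing workable.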

Around 98% of the DNA in a chromosome has historically been considered junk, or the inter-gene region. It is called junk only because its purpose is not yet known. Maybe it is simply filler between the more important segments known as genes. Genes are most often associated with inherited traits we can detect, such as physical characteristics or medical conditions. Genes are segments of DNA that code for molecules involved in the chemical reactions that govern the purpose and function of a particular cell. A gene or junk segment within a DNA strand consists of thousands to millions of nucleobase pairs in a specific order. So each DNA strand (that is, a single chromosome if in the nucleus) can be considered a bunch of gene and junk segments strung together. This distinction between coding genes and non-coding junk regions is not really important for genetic genealogy but is mentioned for a better understanding of these commonly heard terms.

There is a special type of cell division, termed meiosis, where instead of simply splitting and making an exact duplicate of a cell and its DNA, a sex cell (an egg or sperm) is created with only half the DNA. That is, a sex cell holds one strand of each chromosome pair along with the mitochondria. In a male, half the resulting cells get the X chromosome and half get the Y. The key point is that meiosis happens only once between each generation. Normal cells are constantly replicating and dying off. But once a sex cell is created, it stays as it is until it is used to make a new embryo or dies without replicating again. Meiosis, and thus the opportunity to introduce a change or error in the copy of the DNA, happens only once between a parent and the child. As a result, the DNA is fairly stable and common characteristics are likely passed down to future generations.
This limited occurrence of sexual reproduction (and therefore the opportunity for changes to be inserted) is key to the stability of DNA between individuals down different ancestral lines.

Markers in our DNA

So let's get back to the testing solution developed for the 6.4 billion base-pairs in DNA. Every once in a while, in either a gene or a junk segment, there is a hiccup during meiosis replication where one or more sequential base-pair values change. If the change is dramatic, especially in a gene coding area, the cell likely dies. If the change is very slight, say in only one nucleobase value in a junk area, it goes undetected and the cell lives on. A particular change may then be passed on for potentially tens of thousands of years and hundreds to thousands of generations.
Genetic genealogy testing is often just looking for these changed values of nucleobases, or markers.
These changes, when taken as a whole, make you somewhat unique. Some of these markers, or change areas, change more frequently than others. An analogy is the specification of the date and time**. The hour changes much less frequently than the seconds, and the date changes much less frequently than the hour. Some markers may be very noisy and change every generation or so. Others may have changed only once in the past millennia among all the peoples of the world alive today.
** The analogy is not meant to imply that marker changes happen with any regularity. Marker changes are random and sporadic.
Key in genetic genealogy development is (a) finding markers, (b) understanding their properties, (c) testing their values in individuals and (d) using that information, in conjunction with traditional records from the genealogical time frame, to understand a detectable difference between those with common ancestors.
The genealogical time frame is roughly the last 500 years. So we are talking about changes detectable in ten to twenty generations. Some variance is good. Too much or too little is of no help in using DNA testing to aid genealogy. Some markers are used by anthropologists to study ancient human populations. These can be important for first grouping the more frequently, or more recently, changed markers into more ancient ancestral populations, thus ordering us into an ancient to current-time ancestral, or phylogenetic, tree.

There are two main types of markers identified and used in genetic testing. Single Nucleotide Polymorphisms (SNP for short) are a type of marker that represents one (or possibly a very few) nucleobases that are different or changed. Sometimes the difference is an insertion or deletion of the marker value in the strand's sequence when compared to the DNA of others (or to the reference human genome model defined by the NIH). Most often the marker is just a change in a single nucleobase value. SNP markers (hereafter referred to simply as SNPs) tend to change very slowly, as even a small change of a single nucleobase value can be drastic and kill the cell, especially if it occurs in an important part of a gene's molecular function definition. The other type of marker, a Short Tandem Repeat (STR for short), is simply a noticed repeat of a defining nucleobase value (or group of values) multiple times, often into double digit counts. The repeat count most commonly changes by one during meiosis, either by the insertion of another copy of the value (set) or by the deletion of a copy from the chain in the sequence. So a segment of DNA may appear as TTTTT, or as ATGATGATGATG (a repeating ATG set). There are further variations on this definition of an STR marker (hereafter referred to simply as an STR) that are unimportant here. Most pronounce the SNP acronym as "SNiP", and a growing number in the genetic genealogy community pronounce the STR acronym as "STiR", so they can refer to them with ease in conversation.
Genetic genealogy testing companies are reporting on known marker values and not usually fully sequencing the DNA.
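The difference between the two marker types can be sketched in a few lines. This is a toy illustration on short made-up sequences, not how a testing laboratory actually reads markers; the function names and sample sequences are assumptions.

```python
def snp_positions(sequence_a: str, sequence_b: str) -> list[int]:
    """List the positions where two equal-length sequences differ (SNP-like)."""
    if len(sequence_a) != len(sequence_b):
        raise ValueError("sequences must be the same length")
    return [i for i, (a, b) in enumerate(zip(sequence_a, sequence_b)) if a != b]


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Count the longest run of back-to-back copies of motif (STR-like)."""
    if not motif:
        raise ValueError("motif must not be empty")
    best = 0
    for start in range(len(sequence)):
        count = 0
        position = start
        # Keep stepping one motif length at a time while the motif repeats.
        while sequence.startswith(motif, position):
            count += 1
            position += len(motif)
        best = max(best, count)
    return best


print(snp_positions("GATTACA", "GACTACA"))            # [2]
print(longest_repeat_run("CCATGATGATGATGTT", "ATG"))  # 4
print(longest_repeat_run("TTTTT", "T"))               # 5
```

An SNP result is a value at a position, while an STR result is a count, which is why the two kinds of marker are compared differently.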
Genetic genealogy companies often report on hundreds of thousands of these marker values across the autosomes (over 1.2 million values in reality, as they report the value on each chromosome in a pair). This translates to roughly one marker for every 5,500 base-pairs in the autosomes. The X chromosome is so similar to the autosomes, especially in women who have two in a pair, that it is often tested (if not also reported) as part of the autosome test process. Autosomal STRs are tested only for forensic and criminal investigation purposes and are not used in genetic genealogy. Therefore, there is currently no overlap with results that could be used or stored in a criminal or government database. Mitochondrial DNA is so short that it is often just fully sequenced (every value reported) instead of being checked for specific marker values. There are no known STRs in the short mitochondrial DNA. The Y chromosome is unique in many ways and is the basis for early genetic genealogy and surname studies like this one.

The birth of Genetic Genealogy

The Y chromosome has some unique properties that make it desirable for more detailed analysis. Companies will test anywhere from a thousand to in excess of 300,000 SNP values on its nearly 60 million base-pairs. The Y chromosome also has hundreds of identified STRs, and the Y STR test was the first and most important genetic genealogy test introduced; it enabled surname studies and the birth of the new field of genetic genealogy.
In genetic genealogy, STR markers are only really tested and reported on from the Y chromosome.
STRs are specifically chosen that change more often than SNPs. This gives more variance and thus more utility to the genetic genealogy community. But sometimes a test includes an STR that changes too often, and so just introduces noise or confusion into the analysis. A more refined genetic genealogy analyst will understand the differences between STRs that are too stable, just right, and too variable, and will know which changed values matter most to the analysis. Overall, as a 10,000 ft view simplification, an STR changes value once every 150 to 200 years (out of a random sample of 30 markers tested); that is, every 5 to 8 generations. The more STRs tested, the more variance is likely to be detected. The more changes detected, the more refinement into family branches the test can support. STR testing started with just 12 marker tests but now regularly goes to 111 markers, or even well over 700 with sequencing techniques.
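The rule of thumb above can be turned into a rough expected count of changes. This is only a sketch under stated assumptions: it reads the rule as "one change among 30 markers every 6 generations" (6 is picked from the 5 to 8 range), and it treats every marker as changing independently at the same constant rate, which real STRs do not do. The function names are made up for the illustration.

```python
def per_marker_rate(markers_in_sample: int, generations_per_change: float) -> float:
    """Chance that one marker changes in one generation, from the rule of thumb."""
    return 1 / (markers_in_sample * generations_per_change)


def expected_changes(markers_tested: int, generations: int, rate: float) -> float:
    """Average number of changes expected across all markers tested."""
    return markers_tested * generations * rate


# Assumption: one change per 30 markers every 6 generations.
RATE = per_marker_rate(30, 6)

for markers in (12, 111):
    changes = expected_changes(markers, 10, RATE)
    print(f"{markers} markers over 10 generations: about {changes:.1f} changes")
# 12 markers over 10 generations: about 0.7 changes
# 111 markers over 10 generations: about 6.2 changes
```

Under these assumptions a 12 marker test will often show no change at all over ten generations, while a 111 marker test is likely to show several, which is why testing more STRs helps separate family branches.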

SNP testing used to be more specialized and was relied on by anthropologists doing population studies on ancient groups. They identified singular changes that have lasted for hundreds of thousands of years and let them characterize population movement. Today, for genetic genealogy, more SNPs are chosen in order to find more diversity and to find changes that occurred more often than once in the history of humankind. Even so, the SNPs chosen still change rarely. So there is less variance among people in any one SNP value (but there are a lot, lot more SNPs). SNPs are tested on all the DNA, but they are most often used with, and are critical to, work on the autosomes and the study of nearer-term relatives.

STR value sets can converge back on themselves. That is, the value, represented as a repeat count, is just as likely to change up as it is down. So communities of people who strayed off from others in the population can develop STR value sets (termed haplotypes) that change and bring them back into agreement with some other group, either by changing back to their original values or by changing over thousands of years in a way that makes their pattern match a group that diverged at a much earlier "branch" point. For this reason, on the Y chromosome, SNPs are used first to group individuals, and then STRs within that group are used to refine the grouping into likely common ancestor lines within the genealogical time frame. Measuring STRs on their own is often not enough.
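Convergence is easier to see with numbers. The sketch below uses invented haplotypes of four repeat counts and a simple step-count distance (the sum of the differences at each marker); the haplotype values, the change lists and the function names are all assumptions made for the illustration.

```python
def apply_changes(haplotype: list[int], changes: list[tuple[int, int]]) -> list[int]:
    """Apply (marker_index, step) changes, where step is +1 or -1 repeats."""
    result = list(haplotype)
    for marker_index, step in changes:
        result[marker_index] += step
    return result


def genetic_distance(first: list[int], second: list[int]) -> int:
    """Sum of the repeat-count differences across all markers."""
    return sum(abs(a - b) for a, b in zip(first, second))


# Two distinct, invented ancestral haplotypes.
ancestor_a = [13, 24, 14, 11]
ancestor_b = [13, 24, 16, 11]

# Line A: marker 0 goes up and back down, marker 2 goes up (three events).
line_a = apply_changes(ancestor_a, [(0, +1), (0, -1), (2, +1)])
# Line B: marker 2 goes down once (one event).
line_b = apply_changes(ancestor_b, [(2, -1)])

print(genetic_distance(ancestor_a, line_a))  # 1, although three changes occurred
print(genetic_distance(line_a, line_b))      # 0, although the ancestors differ
```

The two descendant lines look identical on STRs alone, which is the situation that grouping by SNPs first is meant to catch.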

SNPs and STRs are named and defined by the DNA strand they exist in and a number indicating the distance into the DNA strand where they occur. SNPs are further described by one or more nucleotide values (both the expected "original" value and the possible "derived" value). Note that this could mean an inserted or deleted value as well. For STRs, the position is defined as the start of the sequence in the DNA strand. Instead of a changed (or inserted or deleted) value as in SNPs, STRs are defined or measured by a repetition count. A test company's "raw" results of your genealogical test simply list these SNPs or STRs and the values determined from your DNA sample. Remember, for autosomes, you are getting two SNP values, as you have two strands of each chromosome in a pair to test. This is important to understanding the half identical matching process used by tools. Markers are often known by shorter code names, frequently created by the person who discovered the marker. Due to concurrent discovery, sometimes multiple names are used for the same marker. The NIH and NIST both work to develop a common database and nomenclature for the markers. But the academic and genetic genealogy communities still tend to report on and identify with the original published name.
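Half identical matching on autosomal results can be sketched as follows. The sketch assumes each SNP result is a two-letter string holding the pair of values (for example "AG"), and it takes "half identical" to mean that two people share at least one of the two values at a position. The sample data and function names are invented, and real matching tools apply further thresholds not shown here.

```python
def half_identical(genotype_a: str, genotype_b: str) -> bool:
    """True when the two genotypes share at least one nucleobase value."""
    return bool(set(genotype_a) & set(genotype_b))


def longest_half_identical_run(person_a: list[str], person_b: list[str]) -> int:
    """Length of the longest stretch of consecutive half identical SNPs."""
    best = 0
    current = 0
    for genotype_a, genotype_b in zip(person_a, person_b):
        if half_identical(genotype_a, genotype_b):
            current += 1
            best = max(best, current)
        else:
            current = 0  # a mismatch ends the matching stretch
    return best


# Invented results for five neighbouring SNPs, in chromosome order.
person_a = ["AA", "AG", "CT", "GG", "TT"]
person_b = ["AG", "GG", "CC", "AA", "CT"]

print(longest_half_identical_run(person_a, person_b))  # 3
```

Long unbroken stretches of half identical SNPs are what suggest a segment inherited from a shared ancestor.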

Recombination

The autosomes have an additional important aspect that is almost unique to them. During meiosis replication, they do not simply separate cleanly into their individual strands, with each eventually going into a separate sex cell. Although only connected at a specific point (the centromere), the two chromosomes are wound up together like a super tightly wound rubber band. Ever seen a balsa wood airplane with a rubber-band powered propeller? One will often keep winding that rubber band (twisting it by turning the propeller) until near the breaking point to get as long a powered flight as possible. That rubber band (two strands, by the way) will "single" knot, then "double" knot, then "triple" and so on as you wind it. So does the chromosome. As the chromosome pairs separate from this tight ball during meiosis, the two strands of the same chromosome pair will actually cross over and recombine as they are pulled apart. Think of it like pulling a spaghetti noodle out of a boiling pot of water. One strand of the chromosome pair is a yellow noodle and the other a red one. As you pull the yellow strand, it will all of a sudden switch to being the red strand, then back to the yellow, and so on. So you have a multi-colored noodle in the end. There will be a matching, opposite multi-colored noodle left in the pot (red where the other is yellow and so on) that will end up in a separate sex cell. This recombination may happen in many places across each chromosome during each meiosis event. Where the cross-overs occur seems arbitrary. Luckily, there appear to be only 30 to 50 cross-overs across all the autosomes and the X each generation. This translates to maybe 1 to 3 cross-overs on the long Chromosome 1 and often none at all on Chromosome 22.
Cross-overs are frequent and random enough that very quickly, in just 10 generations or so, it is difficult to find any segment of DNA longer than around 10 million base-pairs still intact in any autosome or X chromosome.
This makes analysis of markers passed down through the generations on the autosomes and X effective for at most 10 generations, or about 200 years, and often much less than that.
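The noodle picture of cross-over can be written out as a small sketch. Each strand is a string of letters (Y for yellow, R for red), and the cross-over points are supplied by hand instead of being chosen at random; the function name and the toy strand length are assumptions for the illustration.

```python
def recombine(strand_a: str, strand_b: str, crossovers: list[int]) -> tuple[str, str]:
    """Swap between two strands at each cross-over point.

    Returns the recombined strand and its matching opposite.
    """
    if len(strand_a) != len(strand_b):
        raise ValueError("strands must be the same length")
    first, second = [], []
    current, other = strand_a, strand_b
    previous = 0
    for point in sorted(crossovers) + [len(strand_a)]:
        first.append(current[previous:point])
        second.append(other[previous:point])
        current, other = other, current  # switch noodles at the cross-over
        previous = point
    return "".join(first), "".join(second)


yellow = "Y" * 10
red = "R" * 10
print(recombine(yellow, red, [3, 7]))  # ('YYYRRRRYYY', 'RRRYYYYRRR')
```

Repeating this each generation, with a fresh partner strand every time, is what chops inherited segments into smaller and smaller pieces.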

This recombination is important to genetic diversity but really messes up tracking people down through the generations. Recombination leads to a finer mixing of the genes siblings inherit than would occur if whole strands were simply passed down. In females, because they have two X chromosomes, the X acts just like the autosomes during meiosis and participates fully in cross-over or recombination. The Y chromosome has some regions on the tips that participate in cross-over recombination with the X chromosome tips. This can only occur in fathers generating their sex cells, as they have one X and one Y. (These regions on the Y chromosome, and their SNP values when measured, are often simply reported as the second value in a male's X chromosome results. So males may have two values for an SNP on the X chromosome report even though they have only one X chromosome.) Otherwise, the Y chromosome does not participate in recombination, as there is only one strand. Ditto for the mitochondria. The combination of recombining autosomes with the single, non-recombining Y and mitochondria is all very messy and contrary to the simple Mendelian genetics you were taught in school. It generates much more diversity in the gene pool than originally expected and leads to greater statistical variance in measuring matching segments between individuals.
Mitochondria and the non-recombining majority of the Y chromosome are the only DNA that is very stable when passed down each generation, and they happen to pass down the Matriline and Patriline, respectively, pretty much intact.
Mitochondrial DNA exists only singly, outside the cell nucleus (although many mitochondria with the same DNA exist within the cell), and thus has nothing to recombine with. Mitochondria do not participate in meiosis, although they are replicated and exist in the sex cell. The mitochondria of the sperm are consumed (destroyed) as part of the sperm's process of fertilizing the egg. Only the egg's mitochondria remain. Thus, everyone's mitochondria are always inherited from their mother. Mitochondrial DNA is very short and has no STRs. There are very few SNPs to test as well. And it does not recombine. So mitochondria have limited use in genetic genealogy because they carry little change and little information (markers that may change). Because of these properties, mitochondria (and even the Y chromosome) have historically been used primarily in anthropological population studies of ancient humans. But the mitochondria are passed down from a mother to her children, and from the daughters to their children, and so on over many generations. This is the same as the Matrilineal line. The resulting value to genetic genealogy is that if you believe two individuals share an ancestor on their Matrilineal line, then their mitochondria should match (near) identically.

The non-recombining portion of the Y chromosome is passed down from father to son for the most part unchanged, similar to the mitochondria but on the Patrilineal line. But the Y chromosome is around 60 million base-pairs long, long enough to have hundreds of thousands of SNPs and hundreds of STRs scattered throughout. As surnames in western Europe and early documents on inheritance followed this Patrilineal line, the Y chromosome became the perfect vehicle for surname and patrilineal line research through many centuries. There is a small enough variance in STR values to group people who are very likely related in the last 1,000 years. And then, if enough markers are tested, it is even possible to find unique markers for more recent branches down the different descendant paths, reaching into the last one to two hundred years. That is the same time period in which autosomal testing becomes reliable for finding any relative with a common ancestor. Hence the explosion of both autosomal and Y chromosome testing to aid the genealogical search, and the rapid growth of the new field of genetic genealogy.
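The two single-line inheritance paths, mitochondria down the Matriline and the Y chromosome down the Patriline, can be traced in a pedigree with a short sketch. The pedigree below is invented, with each person mapped to a (mother, father) pair; the names and the function are assumptions for the illustration.

```python
# Invented pedigree: person -> (mother, father); None marks an unknown parent.
PEDIGREE = {
    "child": ("mother", "father"),
    "mother": ("maternal_grandmother", "maternal_grandfather"),
    "father": ("paternal_grandmother", "paternal_grandfather"),
    "maternal_grandmother": ("great_grandmother", None),
}


def line_of_descent(person: str, pedigree: dict, parent: str) -> list[str]:
    """Follow only mothers ("mother") or only fathers ("father") upward."""
    index = {"mother": 0, "father": 1}[parent]
    line = []
    current = pedigree.get(person, (None, None))[index]
    while current is not None:
        line.append(current)
        current = pedigree.get(current, (None, None))[index]
    return line


print(line_of_descent("child", PEDIGREE, "mother"))
# ['mother', 'maternal_grandmother', 'great_grandmother']
print(line_of_descent("child", PEDIGREE, "father"))
# ['father', 'paternal_grandfather']
```

Everyone on the first list should carry (near) identical mitochondrial DNA to the child, and every man on the second list should carry a (near) identical Y chromosome to a male child; all other ancestors contribute only through the recombining autosomes and X.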

External References