
Errors in Matching

Autosomal Matching Errors

Some misinterpret a wide variance from the expected match percentage as an error, but a wide variance from the expected value is normal. After the first generation (parent / child), the amount of Autosomal and X chromosome DNA you receive from each grandparent can vary widely. Mathematically, this variance is expressed as a large standard deviation around the average (expected) value. See our explanation of Autosomal matching in the Consanguinity glossary entry for a bit more information.

The variance arises because autosomes differ greatly in length and you only get one chromosome of each pair from each parent. Maybe your dad passed only his mother's chromosomes down to you? You might expect the resulting distribution to fall on a few discrete points, but it is actually filled in by the 30 to 50 crossover events that occur each generation in each set of Autosomal chromosomes a parent delivers to a child (and in the X delivered by the mother). These crossovers fill in what might otherwise be big jumps or gaps in the distribution caused by the very large chromosomes like #1 and #2.
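The smoothing effect of crossovers can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the chromosome lengths are rough values in megabases, crossovers are dropped uniformly along the genome (real recombination is not uniform), and the function names are ours. It measures what share of the autosomes one parent passes on came from a single grandparent, with and without crossovers.

```python
import random
import statistics

# Rough autosome lengths in megabases (chromosomes 1 to 22); approximations only.
CHROM_MBP = [248, 242, 198, 190, 181, 171, 159, 145, 138, 133, 135,
             133, 114, 107, 102, 90, 83, 80, 59, 64, 47, 51]

def grandparent_share(n_crossovers, rng):
    """Fraction of one parent's transmitted autosomes that came from one grandparent."""
    # Assumption: crossover points land uniformly along the whole genome.
    cuts = [[] for _ in CHROM_MBP]
    for _ in range(n_crossovers):
        idx = rng.choices(range(len(CHROM_MBP)), weights=CHROM_MBP)[0]
        cuts[idx].append(rng.uniform(0, CHROM_MBP[idx]))
    from_grandparent = 0.0
    for length, points in zip(CHROM_MBP, cuts):
        on_grandparent = rng.random() < 0.5   # which strand the copy starts on
        prev = 0.0
        for p in sorted(points) + [length]:
            if on_grandparent:
                from_grandparent += p - prev
            on_grandparent = not on_grandparent  # switch strands at each crossover
            prev = p
    return from_grandparent / sum(CHROM_MBP)

def spread(n_crossovers, trials=2000, seed=1):
    """Mean and standard deviation of the share over many simulated children."""
    rng = random.Random(seed)
    shares = [grandparent_share(n_crossovers, rng) for _ in range(trials)]
    return statistics.mean(shares), statistics.stdev(shares)

if __name__ == "__main__":
    print("no crossovers:", spread(0))
    print("40 crossovers:", spread(40))
```

In both cases the average share stays near one half of what the parent passes on, but the spread is noticeably wider when whole chromosomes are inherited without any crossovers.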

It is common for autosomal SNP results from testing companies to produce false matches. Some have attributed these false matches to Identical by State (IBS) versus Identical by Descent (IBD), but we believe that is an incorrect application of the existing nomenclature. These false matches mostly occur because the test gives unphased results. So let's look more closely at the testing process to understand why.

Autosomes (and X DNA for females) come as a chromosome pair with two strands. As a result, the data is reported back from the testing company with two values for each SNP on each chromosome: one for the "left" chromosome and one for the "right". But the values are un-ordered. When reported for one SNP, the first value may be for the left chromosome. In the next SNP reported on that chromosome, the first value may be for the right chromosome. And so on. The testing process cannot distinguish which SNP value is from which chromosome in the pair, and so the results are reported as un-ordered value pairs.

Matching algorithms are aggressive. They assume that if a match can be constructed using either value of a pair, then it is a match. This is one of the reasons not to reduce the matching sequence of SNPs much below 500. As you rely on fewer and fewer SNPs to determine a matching segment, you introduce a greater risk of getting an artificial match created by the mixing of values from the different strands. Any time a child shows a longer, overlapping matching segment than the parent, this is likely the cause. These artificial or false matches are not due to the underlying DNA in the person but to the testing and analysis process.
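A toy example may make this clearer. The sketch below is our own illustration, not any testing company's algorithm: it treats two un-ordered value pairs as compatible when they share at least one value, and compares that with matching the actual strands. The six-SNP strands are invented for the example.

```python
from itertools import product

def unordered(strand_a, strand_b):
    """Report each SNP as a sorted pair, which discards which strand a value came from."""
    return [tuple(sorted(pair)) for pair in zip(strand_a, strand_b)]

def compatible(g1, g2):
    """Two un-ordered pairs are compatible if they share at least one value."""
    return bool(set(g1) & set(g2))

def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for ok in flags:
        run = run + 1 if ok else 0
        best = max(best, run)
    return best

def unphased_match_length(t1, t2):
    """Longest run of SNPs where the un-ordered pairs are compatible."""
    return longest_run(compatible(a, b) for a, b in zip(t1, t2))

def phased_match_length(strands1, strands2):
    """Longest run of identical values over any pairing of one real strand from each tester."""
    return max(
        longest_run(x == y for x, y in zip(s1, s2))
        for s1, s2 in product(strands1, strands2)
    )

# Invented strands: tester 1 has one all-A and one all-G strand;
# tester 2 has two strands that alternate A and G.
tester1 = ("AAAAAA", "GGGGGG")
tester2 = ("AGAGAG", "GAGAGA")

if __name__ == "__main__":
    print(unphased_match_length(unordered(*tester1), unordered(*tester2)))  # 6
    print(phased_match_length(tester1, tester2))                            # 1
```

The un-ordered comparison reports all six SNPs as one matching run, yet no real strand of tester 1 agrees with a real strand of tester 2 for more than one SNP in a row. The "match" is assembled by switching strands at every SNP.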

Phasing is a way to try to remove some of these false matches. Phasing involves looking at a child's DNA test results alongside those of one or both parents. Since each child gets one autosome of each type from each parent, having the parent's values can help reconstruct the chromosome the child received from that parent. Once you have tagged the SNP value in each pair that came from that parent, all the remaining un-tagged values belong to the other strand of that chromosome pair. Hence, the pairs of SNP values are no longer un-ordered but fixed in sequence, and thus recreate the original two separate chromosomes. Matching is then more exact because you have fixed sequences of SNPs for a specific chromosome as they exist in the tester. When phased results are compared between two testers, the Total Match Length is often reduced (except in Full Siblings and similar situations) and the number of matching segments increased.
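The tagging step can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the procedure of any particular phasing tool: `phase_with_parent` is a hypothetical helper, and a SNP is left unresolved (`None`) whenever the one parent's values cannot decide which of the child's two values came from that parent.

```python
def phase_with_parent(child, parent):
    """Order each child SNP pair as (value from this parent, value from the other parent).

    Returns None for SNPs that this one parent cannot resolve.
    """
    phased = []
    for (c1, c2), (p1, p2) in zip(child, parent):
        if c1 == c2:
            # Child has the same value on both strands: order does not matter.
            phased.append((c1, c2))
        elif p1 == p2 and p1 in (c1, c2):
            # Parent could only have passed one value, so the other value
            # must have come from the other parent.
            other = c2 if c1 == p1 else c1
            phased.append((p1, other))
        else:
            # Both child and parent carry two different values (or the data
            # is inconsistent): this parent alone cannot settle the order.
            phased.append(None)
    return phased

# Invented un-ordered results for four SNPs.
child = [("A", "G"), ("C", "C"), ("T", "C"), ("A", "G")]
parent = [("A", "A"), ("C", "T"), ("C", "C"), ("G", "A")]

if __name__ == "__main__":
    print(phase_with_parent(child, parent))
    # [('A', 'G'), ('C', 'C'), ('C', 'T'), None]
```

The last SNP stays unresolved because both the child and the parent carry A and G there. Results from the second parent may settle such positions.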

At first blush, one would think you could phase a parent's chromosomes from the child's values in the same way. After all, who is to say which value set is the child's and which the parent's when presented to the tool? But the cross-over recombination that occurs in each meiosis cycle makes this impossible, so the reverse process does not work with just one child. With test results from enough siblings, a probable phased result for a parent may be determinable, especially if two siblings happened to get the opposite strands that went into creating the sex cells.

In reality, phasing a child with one parent yields only about 80% of the SNP values as fixed. Using both parents gets you over 90%. Errors in testing and in determining the results mean it is not an exact process over the half-million or so SNP results provided. Some testing companies purposely round their results so that the accuracy and precision of the RAW data are more in line. Retesting the same person with the same company, let alone a second company, will not yield 100% repeatable results. SNP testing is not an exact, full sequencing of the chromosome but more a determination of the likely sequences of SNP values in the sampled DNA.

yDNA STR Matching Errors

Matching errors (false-positive matches to someone you are not related to) can happen in yDNA STR testing also. Here the cause is not errors introduced by testing but the DNA replication process. That is, the test is an accurate representation of the DNA in each tester. The DNA itself is actually the same, so you need some understanding of why that can occur.

Matching errors are most common in the R Haplogroup branch, deep down in the tree, where 50% of White, European males fall and share similar SNP values as well as STR ones. For those who fit this mold, it is not uncommon to have a near-identical match at 37 markers and yet still not be related in any genealogical time frame. At the same time, two people known to be related in the nearer term can have 2-3 markers off by one from each other. So you ask, "How can this be?" You have to look at the tests and the values derived from the tests.

STR markers vary in their rate of change. If an STR is identical across the whole population and never seems to change, it is not a very good marker for telling much of anything. At the same time, if a marker has only two or three values and changes every generation, then it is not useful either. It is too noisy to be helpful. So an attempt is made to use only STR markers that change more often than the rarely changing ones but are not so noisy as to change too often. Some of the markers used are at the fringes of this middle range. So you need to understand which markers differ and how often they tend to change. That is, weight the differences in markers based on their likelihood to change. If the differences between two test results are in rarely changing markers, this indicates they are not likely a near term match. If the differences are mostly in the often changing markers, then they could still be a near term match.
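One way to picture this weighting is shown below. It is a sketch only: the marker names and mutation rates are invented placeholders rather than measured values, and scoring each step of difference by the negative log of the marker's rate is just one simple choice of weight, not an established standard.

```python
import math

# Hypothetical per-generation mutation rates; placeholders, not measured values.
RATES = {
    "SLOW_1": 0.0005,
    "SLOW_2": 0.001,
    "MID_1": 0.003,
    "FAST_1": 0.02,
    "FAST_2": 0.03,
}

def genetic_distance(a, b):
    """Plain count of repeat-count steps that differ between two testers."""
    return sum(abs(a[m] - b[m]) for m in a)

def weighted_distance(a, b, rates):
    """Steps of difference, each weighted more heavily the more rarely its marker changes."""
    return sum(abs(a[m] - b[m]) * -math.log10(rates[m]) for m in a)

# Invented repeat counts for a reference tester and two comparison testers.
base = {"SLOW_1": 13, "SLOW_2": 24, "MID_1": 14, "FAST_1": 36, "FAST_2": 38}
slow_diff = dict(base, SLOW_1=14, SLOW_2=23)   # differs on two slow markers
fast_diff = dict(base, FAST_1=37, FAST_2=37)   # differs on two fast markers

if __name__ == "__main__":
    for label, other in (("slow", slow_diff), ("fast", fast_diff)):
        print(label, genetic_distance(base, other),
              round(weighted_distance(base, other, RATES), 2))
```

Both comparison testers are two steps away from the reference, but the weighted score is much higher for the pair differing on the slow markers, which points to a more distant relationship.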

yDNA SNP markers should be used first to group STR testers, and STR results should then be compared within each group. You cannot rely on a predicted Haplogroup (or SNP values) derived from a subset of the STR markers. While that may work in some cases, it is far from exact. So get your yDNA SNP values tested to a reasonable depth to differentiate that way first. Only when the SNP markers match should you then compare the STR results.
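As a rough sketch of this two-step approach, the code below groups testers by a tested SNP label and only then computes STR differences within each group. The tester names, SNP labels, and repeat counts are all invented, and `compare_within_snp_groups` is a hypothetical helper rather than a feature of any existing tool.

```python
from collections import defaultdict
from itertools import combinations

def compare_within_snp_groups(testers):
    """Return STR step differences only for pairs of testers sharing the same SNP label."""
    groups = defaultdict(list)
    for name, snp, str_values in testers:
        groups[snp].append((name, str_values))
    results = {}
    for members in groups.values():
        for (n1, s1), (n2, s2) in combinations(members, 2):
            results[(n1, n2)] = sum(abs(x - y) for x, y in zip(s1, s2))
    return results

# Invented data: (tester name, tested SNP label, STR repeat counts).
testers = [
    ("Adam", "SNP-A", [13, 24, 14]),
    ("Bert", "SNP-A", [13, 25, 14]),
    ("Carl", "SNP-B", [13, 24, 14]),
]

if __name__ == "__main__":
    print(compare_within_snp_groups(testers))  # {('Adam', 'Bert'): 1}
```

Carl's STR values are identical to Adam's, but the pair is never compared because their SNP results put them in different groups.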

So why can different family lines (which likely have different SNP markers) have the same or similar STR values? Over the years, STR values that were diverging can collapse back to the same values as other diverging family lines. Remember, an STR count is just as likely to increase as to decrease in value. So an STR count could have diverged and then changed back, or diverged so far as to merge into a more ancient branch that had diverged as well. 67 or 111 markers are needed to start looking for more variance between lines that may have collapsed into each other. Or use FamilyTreeDNA's BigY with yFull analysis to extract over 450 STR marker values. But the expansion using BigY brings more variance, even between near term relatives. So the analysis is more complicated and not as clear cut for the average person.

See the Mutation Rate of STR Markers and Frequency of STR Marker values by Haplogroup charts for more information on the types of variation found across the general population for the most commonly used STR markers.