
Match Codes: Marking Autosomal Matches


A growing problem in the genetic genealogy community is uniquely identifying autosomal matches and their common ancestor(s). Only with precise identification can match patterns be analyzed accurately; for example, the average longest matching segment per degree of kinship. In practice, this could be the code to attach to a DNA tester and their respective match that fulfills the ISOGG definition of triangulation (original, not segment). Note that the code, as defined here, does not define the level of verification or GPS-level compliance. It is simply a computer-processable nomenclature for the relationship between DNA matches, as determined by the tester or match (or their respective kit manager).

In summary, with all extensions, the complete specification is:
nCmRD#AnnnXmmmD&Tnnn.n;Lnnn.n;Nnn
with an example being
4C1R#A032F064&T23.1;L18.9;N2
where we have chosen a patriline match that also has some atxDNA match. So let's dive into what this full coded form is all about.
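As a minimal sketch of how the three parts of the full form could be separated by a program, the following Python splits on the '#' and '&' markers. The function name `split_full_code` is our own illustration and not part of the specification; it assumes '#' and '&' never appear inside a field.

```python
def split_full_code(code: str):
    """Split a full code into (relationship code, match code, match data).

    A part that is absent is returned as None. Hypothetical helper,
    assuming '#' starts the match code and '&' starts the match data.
    """
    head, amp, data = code.partition("&")
    relationship, hash_mark, match = head.partition("#")
    return (
        relationship or None,
        "#" + match if hash_mark else None,
        "&" + data if amp else None,
    )
```

For the example above, this gives the relationship code 4C1R, the match code #A032F064, and the match data &T23.1;L18.9;N2.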

Match Code format (basic)

#AnnnXmmmD
where
  • A represents the match site (A for Ancestry, M for 23andMe, F for FamilyTreeDNA, H for MyHeritage, G for GEDMatch, L for Living DNA (when it becomes available), or N for Geneanet)
  • nnn is the Ahnentafel number of the most recent common ancestor up the pedigree from the first (primary) tester (always at least three digits; prepend leading zeros if needed)
  • X is F if Full-siblings are the last non-common ancestors, H if Half-siblings, or A for the ancestor special case described below
  • mmm is the Ahnentafel number of the most recent common ancestor up the pedigree from the secondary (the match) tester (always at least three digits; prepend leading zeros if needed)
  • D is an optional trailing letter, added only if known additional relationships exist between the two matching testers. For example, D if they are Double cousins, or M if for any other reason they share Multiple ancestors on different lines, such as when the MRCA couple are themselves related (e.g. cousins). Maybe use E if the testers are both from an endogamous population that exhibits excessive matching. The key is that you are specifying that the DNA match strength will differ from what the Ahnentafel numbers imply on their own.
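The basic format above is regular enough to be read with a single pattern. The following is only a sketch under our own assumptions: the names `BASIC_RE`, `MatchCode` and `parse_match_code` are hypothetical, and the trailing letter is limited to the D, M and E values suggested in the list.

```python
import re
from typing import NamedTuple, Optional

# '#', site letter, primary Ahnentafel number (3+ digits), F/H/A,
# match Ahnentafel number (3+ digits), optional trailing D/M/E flag.
BASIC_RE = re.compile(r"#([AMFHGLN])(\d{3,})([FHA])(\d{3,})([DME])?")


class MatchCode(NamedTuple):
    site: str            # where the match was found, e.g. 'A' for Ancestry
    primary: int         # MRCA number as seen from the primary tester
    kind: str            # 'F' full, 'H' half, 'A' ancestor case
    match: int           # MRCA number as seen from the match
    flag: Optional[str]  # 'D', 'M', 'E', or None


def parse_match_code(code: str) -> MatchCode:
    """Parse a basic match code such as '#A032F064'."""
    m = BASIC_RE.fullmatch(code)
    if m is None:
        raise ValueError(f"not a basic match code: {code!r}")
    site, primary, kind, match, flag = m.groups()
    return MatchCode(site, int(primary), kind, int(match), flag)
```

Storing the numbers as integers drops the leading zeros, so a tool writing codes back out would need to pad them to three digits again.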

That initial letter, given as A in the match code description above, records where the match is found, not where the individual testers were tested. This is key. You want to trace the match code back to the source match site where the match data is presented. For sites like Ancestry and 23andMe, you know both kits were tested at that same site. For FamilyTreeDNA and MyHeritage, you have no idea where a kit originated, as they allow transfers in. (Note: even at FTDNA, a tester may have been given a transfer Kit ID starting with B but then tested locally at FTDNA, or bought a kit for a yDNA test but then transferred in the autosomal result instead of doing FamilyFinder with them. So you do not really know the source of the test.) GEDMatch is the only site where it may be possible to determine the origin of each test kit, as the notes often identify it. But even then, people submit false / fake / made-up kits that came from other sources but were made to look like they came from one company or another. So it is best to simply indicate the source of the match information being captured and not try to be specific about the source of the test itself. This is a limitation of the match code and the match data it is attached to, but also a limitation of the underlying data one can see / get from the match sites.

For full siblings, the Ahnentafel numbers can be specified as either common parent. It does not matter. One can always get the other by adding or subtracting 1. Both are implied. Only for half siblings must both Ahnentafel numbers be even or odd and represent the same person (the shared, single parent that is the first common ancestor). For full siblings, if either number is odd, subtract one from the odd number and you have two even numbers that both represent the shared father. Or add one to the even number and you have the two odd numbers that both represent the shared mother. Full siblings is the only time when the two Ahnentafel numbers do not both have to be even or odd.
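The add-or-subtract-one rule can be written down directly. This is a small illustrative sketch; the function names are ours and it assumes the number refers to a parent or earlier ancestor (2 or higher), since 1 is the tester.

```python
def shared_father(n: int) -> int:
    """Even (father) number of the couple that Ahnentafel number n belongs to."""
    if n < 2:
        raise ValueError("expected an ancestor number of 2 or higher")
    return n - 1 if n % 2 else n


def shared_mother(n: int) -> int:
    """Odd (mother) number of the couple that Ahnentafel number n belongs to."""
    if n < 2:
        raise ValueError("expected an ancestor number of 2 or higher")
    return n if n % 2 else n + 1


def half_sibling_numbers_ok(primary: int, match: int) -> bool:
    """For an H code, both numbers must be even or both must be odd."""
    return primary % 2 == match % 2
```

So a full-sibling code written with 033 on one side names the same couple as one written with 032.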

The two testers that the match code applies to are implicit, as the code is expected to be specified in a notes or similar field in the context of a match. For example, if a primary tester S has a match list which contains a secondary tester T (the match), then the above Match Code would be entered in S's match-list entry for T. So the Ahnentafel number context is understood relative to those two ordered, implicit testers. Any extraction or use of the code should correspondingly include the two testers' identifying data.

We need two Ahnentafel numbers to represent the individual pedigree from each tester to the common ancestor(s): the primary S and the match T. The pedigree from each tester up to, including, and beyond the MRCA will always have different Ahnentafel numbers as the numbering is relative to the starting tester. It is simply a fluke when the Ahnentafel numbers match. (As they could for a full siblings match or always should for a half sibling match.)

DEPRECATED: The hardest thing to understand is likely that we use the Ahnentafel numbers of the siblings before reaching the common ancestor(s), or what we term here the last non-common ancestor as you go up from you in your pedigree. These siblings are unique. We do this as Full Siblings always share two parents and we do not want to have to specify the code for both parents (or the code for one and have to figure out which is the other). A vast majority of the matches will be via Full-siblings before the common ancestors. Full versus half-siblings at any other level does not matter, as the line through the "sibling" is singular. So, to avoid confusion and imprecise specification, we do not specify the one or two common ancestors but the child of the common ancestor leading to each match. When an F is specified, then both parents of that Ahnentafel number child are the common ancestors. Otherwise, if P or M, then the indicated single parent is the single common ancestor. Note: changed from specifying the last non-common ancestor to specifying the common ancestor for full siblings.

Could simply use H for half if we switch the Ahnentafel number to be the common ancestor. Then the numbers must match (both even or both odd) and specify the single parent that way.

Special Cases

Or, what turns out to be, the only use of the Ahnentafel number 001 and even our special case introduced here of 000.

What if one of the testers is the MRCA? That is, we are specifying a parent-child, grandparent-grandchild or similar relationship where one of the testers is the MRCA of the other. In that case, you enter the Ahnentafel number one (001) in the field for the tester that is the common ancestor. This is the only time that a tester is being specified by the Ahnentafel number, and that number will then always be 001. The other Ahnentafel number is then for the relationship of this tester to their descendant match specified as the other tester, as viewed from the other tester. Note that the concept of full sibling or half sibling, and that specification, does not apply in this case. As for half siblings though, the Ahnentafel number is specific to a single person in both cases. But, unlike for half siblings, you can have an even and an odd number that both represent the same, specific person. If a grandfather and his son's son (his grandson) are being tagged in a match list, the number for the grandfather is 001 to represent himself but the number for the grandson is 004, representing his paternal grandfather. Note one number is even and one odd even though they designate the same person. Use of 001 is a special case and an exception to the even / odd rule.

In this special case of an ancestor being one of the testers, we could introduce a new letter such as A instead of F or H. This is because the numbers may not BOTH be even or odd (as is true for F). But they both do not specify a specific, single ancestor (like in the case for H) either. So neither F nor H is necessarily appropriate when one of the testers is an ancestor of the other. But we do not need the third letter designation to help us here, as this is the only time when we will see a 001 as one of the two numbers. For robustness, we introduce the optional third letter designation of A between the numbers if you wish to use it. Note it could also be appropriate because a specification of 001A001 might be needed. This specifies that the match is between the person and themselves, such as when the person tested with two companies and appears on their own match list. We want to identify all matches with the code, so this would be appropriate.

What if the testers themselves are related as full or half-siblings? This is not really an exception to any of the normal rules, but let's think through it. For full siblings, specify either 002 or 003 for each Ahnentafel number of the MRCA, with an F in between. They share both parents, so the normal process applies as for any other match case. For half siblings, you specify both as either 002 or 003 with the H designation in between. If both specify 002, they share only a father; if 003, only a mother. In this one case, they both must use the same number.

The Match code can work for partial specifications. Let's say that someone knows, via segment triangulation, who the common ancestor is, but the other match has not done enough research to trace their pedigree to that ancestor. Then the known ancestor's Ahnentafel number can be given and the unknown one specified as all zeros (000) to represent Unknown. For internal processing, a default match code of #A000F000 can always be assumed with every match until a more refined value is specified. The match list company given by the first letter is still necessary. But, as before for 001A001, we could use an A in the middle instead of F.

Let's show examples for some of these special cases.

| Match Code | Relationship represented |
|---|---|
| #A001F001 | Self Match (two kits) |
| #A000F000 | Not yet known (unspecified match) |
| #G002H002 | Half-siblings, shared father |
| #M003H003 | Half-siblings, shared mother |
| #A001H002 | Father and matching child |
| #A003H001 | Child and matching Mother |
| #A001H005 | Paternal grandmother and matching grand-child |
| #A006H001 | Grand-child and matching Maternal grandfather |

Note that with these codes, you cannot tell the gender of the tester except in a very few, specific cases. The code is only capturing the relationship to the common ancestor. The gender of the tester is hopefully captured in some other field.
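The special cases above can be recognized from the two numbers alone. A minimal sketch follows, with a function name and return labels of our own choosing:

```python
from typing import Optional


def special_case(primary: int, match: int) -> Optional[str]:
    """Classify the 000 and 001 special cases from the two Ahnentafel numbers.

    Returns None when neither number is 000 or 001 (an ordinary match).
    """
    if primary == 0 or match == 0:
        return "unknown"              # 000 marks a side not yet determined
    if primary == 1 and match == 1:
        return "self"                 # same person on both kits
    if primary == 1:
        return "primary is ancestor"  # primary tester is the MRCA
    if match == 1:
        return "match is ancestor"    # matched tester is the MRCA
    return None
```

For example, the numbers in #A001H005 classify as the primary tester being the ancestor, and those in #A006H001 as the match being the ancestor.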

More examples

Without knowing the testers, you can determine the relationship between them from the Match code. If direct cousins (not m times removed), then the Ahnentafel numbers will be in the same generation (or power of two range). If once removed cousins, the generation implied will differ by one. If twice removed, by two. See the table below for how we determine the generation from the power of two range.

| Generations | Relationship | Shared Ancestors | Ahnentafel number range |
|---|---|---|---|
| 0 | Self | Self | 1 (2^0) |
| 1 | Siblings | Parents | 2-3 (2^1 to 2^2-1) |
| 2 | (1st) Cousins | (0x Great) Grandparents | 4-7 (2^2 to 2^3-1) |
| 3 | 2nd Cousins | (1x) Great Grandparents | 8-15 (2^3 to 2^4-1) |
| 4 | 3rd Cousins | 2x Great Grandparents | 16-31 (2^4 to 2^5-1) |
| 5 | 4th Cousins | 3x Great Grandparents | 32-63 (2^5 to 2^6-1) |
| 6 | 5th Cousins | 4x Great Grandparents | 64-127 (2^6 to 2^7-1) |
| 7 | 6th Cousins | 5x Great Grandparents | 128-255 (2^7 to 2^8-1) |
| 8 | 7th Cousins | 6x Great Grandparents | 256-511 (2^8 to 2^9-1) |
| 9 | 8th Cousins | 7x Great Grandparents | 512-1,023 (2^9 to 2^10-1) |
| ... | | | |
| n | (n-1)th Cousins | (n-2)x Great Grandparents | 2^n to 2^(n+1)-1 |

Note that the term half- is not often heard in relationship to distant cousins; only siblings. But it is important when determining the likely strength of DNA matching. If you understand that siblings are treated by computer programmers of genealogy software as 0th-cousins, then you see how half is simply included in the relationship description internally and is so included in the Match code.
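The generation column of the table can be computed rather than looked up: the generation is the exponent of the largest power of two that does not exceed the Ahnentafel number. A short sketch, with a function name of our own:

```python
def generation(ahnentafel: int) -> int:
    """Generations from the tester up to the given Ahnentafel number.

    1 -> 0 (self), 2-3 -> 1 (parents), 4-7 -> 2 (grandparents), and so on.
    """
    if ahnentafel < 1:
        raise ValueError("Ahnentafel numbers start at 1")
    # bit_length() - 1 is floor(log2(n)) for a positive integer.
    return ahnentafel.bit_length() - 1
```

The difference between the generations of the two numbers in a match code is then the "times removed" part of the relationship.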

We recently came across a "distant" (but strong) match that we did not think would appear with autosomal DNA testing. A match between two people who are 1st Cousins, 5x Removed. Yes, five times! Ends up the older gentleman had long delays in his pedigree before children and is also the last born child in each generation. The tested young child carried a segment of 15 cM, unaltered, down from his tested grandparent. Their code might look something like #A004F136.

The match code can be applied to yDNA testers and their match specification as well. For patriline matches, the Ahnentafel numbers for both sides will always be a power of two, except for the special case of a 001. You are always specifying the patriline ancestor.

Some patriline genealogies seem to reliably go back 500 years. 300 years often reaches 9 generations, which is where we start to exceed our standard 3 digits; that is, we start seeing numbers of 1,024 and higher. So we can either allow more than 3 digits in these cases, or recognize that the patriline values are always a power of 2 and thus could be simplified. Maybe present / ask for the power-of-2 designation only. So, instead of storing 256, one could store 8, representing 2^8, which equals 256. Note that the stored power also represents the number of generations to the MRCA. You can thus again represent most expected values within the 3 digit code. Even 2 digits is sufficient, as that covers an ancestor over 3,000 years back; much farther than you can accurately determine the number of generations for. Maybe when using the power of two designation attached to a patriline yDNA match, you can use a trailing Y to indicate this special case of specifying the Ahnentafel number.
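The conversion between a patriline Ahnentafel number and its power-of-two designation is sketched below. The function names are hypothetical, and since the exact placement of the suggested trailing Y is not settled above, the sketch covers only the number conversion.

```python
def patriline_power(ahnentafel: int) -> int:
    """Exponent k such that ahnentafel == 2**k; rejects non-patriline numbers."""
    if ahnentafel < 1 or ahnentafel & (ahnentafel - 1):
        raise ValueError(f"{ahnentafel} is not a power of two")
    return ahnentafel.bit_length() - 1


def patriline_number(power: int) -> int:
    """Inverse: the Ahnentafel number of the patriline ancestor `power` generations up."""
    if power < 0:
        raise ValueError("power must not be negative")
    return 2 ** power
```

With this, 256 is stored as 8 and 1,024 as 10, both of which fit easily in the three-digit field.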

Ahnentafel Numbers

Content to be mostly removed for a special Ahnentafel numbers glossary entry; once we find the bug in the Wiki code preventing its creation!
We use Ahnentafel numbers, as Jim Bartlett introduced in his coded examples, because they are the most compact and accurate. The downside is that most lay users cannot readily read off or determine the Ahnentafel number of any ancestor. Most genealogical tree tools can make them visible or allow a chart to be printed with the ancestors identified by their numbers. See the linked glossary entry for Ahnentafel numbers for a better description and a method to easily figure them out in your head. Here are some additional notes to think about:
  • For autosomal matches, we will rarely need to go beyond 3 digits. And one or two leading zeros are easier and quicker to visually recognize. Most Ahnentafel numbers will be one to two digits. So keeping the leading zeros in these cases leads to better overall readability and consistency in the length of the Match codes.
  • 4th cousins share 3x Great Grandparents. 7th cousins share 6x Great Grandparents. 3 digits will represent all 7th cousin matches and most 8th. The Ahnentafel numbers for the 7th cousin ancestors are 256 through 511. Only with 8th cousins do you get ancestor Ahnentafel numbers of 512 through 1,023 and thus need a fourth digit for the last 24 specified ancestors.
  • This fourth digit will also start to be needed in generational difference matches such as 6th cousin twice removed. Likely this is the more common occurrence for when a 4th digit is needed, as we tend to see those sticky, single segments with 4th through 6th cousin matches last for a few more generations. The key is that most autosomal matches will fall within the 3 digit convention of the specification. So, with leading zeros to pad to three digits, most Match Codes will appear similar and more readable.
  • The Patriline is represented by Ahnentafel numbers that are powers of 2. 2^0 or 1 is the tester, 2^1 or 2 is the father of the tester, 2^2 or 4 is the paternal grandfather, and so on. 2^n is the nth generation patriline father, with all other grandparents of that generation having Ahnentafel numbers of 2^n through 2^(n+1)-1. Similarly, Matriline Ahnentafel numbers are 2^(n+1)-1 for the nth generation.
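Because each father is twice the child's number and each mother is twice plus one, the binary digits of an Ahnentafel number spell out the path from the tester to that ancestor. The sketch below uses that standard property; the function names are ours.

```python
from typing import List


def ancestor_path(n: int) -> List[str]:
    """Steps from the tester up to ancestor n, e.g. 5 -> ['father', 'mother'].

    After the leading binary 1 (the tester), each 0 is a step to a
    father and each 1 a step to a mother.
    """
    if n < 1:
        raise ValueError("Ahnentafel numbers start at 1")
    return ["mother" if bit == "1" else "father" for bit in bin(n)[3:]]


def is_patriline(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (powers of two)."""
    return n >= 1 and n & (n - 1) == 0


def is_matriline(n: int) -> bool:
    """True for 1, 3, 7, 15, ... (one less than a power of two)."""
    return n >= 1 and n & (n + 1) == 0
```

Number 5 therefore reads as the father's mother (paternal grandmother) and 6 as the mother's father (maternal grandfather), matching the special-case examples earlier.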

Relationship Code (extension / addition)

While the match code above is specific, it is not as easily recognized and interpreted by a human reader. So some still want to use a relationship specification that may convey less information (not the exact pedigree) but more of the general relation. So, for more human processing, we introduce an additional, auxiliary specification of a relationship code.
nCmRD
where
  • nC is the nth cousin specification. n being a numeral 0-9 with the special case of a dash ('-') (see below). Expand to a second digit if necessary (no leading zeros)
  • mR is the optional mth removed specification giving the number of generations farther removed from the common ancestor one of the two in the relationship is. Optional because not needed if the m is zero. m thus being a numeral 1-9.
  • D is optional, like in the match code above, to indicate there are likely multiple relationships that can be specified (double cousins, etc.). Usually, you want to specify the closer / stronger relationship if there are multiple, which is also true for the match codes above.
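A possible way to read the relationship code programmatically is sketched here. The names are hypothetical, and we assume the trailing letter takes the same D, M or E values suggested for the match code, and that the dash form described in this section stands in for the cousin number.

```python
import re
from typing import Optional, Tuple

# Cousin number (or '-'), 'C', optional '<m>R', optional trailing flag.
REL_RE = re.compile(r"(-|0|[1-9]\d?)C(?:([1-9])R)?([DME])?")


def parse_relationship_code(code: str) -> Tuple[Optional[int], int, Optional[str]]:
    """Return (cousin degree or None for '-', times removed, flag or None)."""
    m = REL_RE.fullmatch(code)
    if m is None:
        raise ValueError(f"not a relationship code: {code!r}")
    cousin, removed, flag = m.groups()
    return (
        None if cousin == "-" else int(cousin),
        int(removed) if removed else 0,
        flag,
    )
```

A missing mR part is reported as zero times removed, in line with the field being optional only when m is zero.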

Note that this Relationship Code is independent of the order of the testers, whereas the match code implies an order of the testers being described. This relationship code can be reliably determined from the match code. If both exist, the match code always overrides.

For relationships involving siblings (of any type), they are 0th Cousins, 0th Removed or simply 0C. If a relationship between a descendant and ancestor is being specified, use a dash ('-') in place of the cousin number. So a grandparent relation to the grandchild is -C2R. Using a designation of zero (0) is not appropriate as they are not siblings. By extension, 0C1R is an Aunt / Uncle relationship to a Niece / Nephew. This, along with 0C for sibling, is likely not as recognized and used today. Note that 0C2R is a great-uncle whereas the grandparent (brother to great-uncle, let's say) is -C2R. Maybe there is something better than a '-' that others can come up with. In some sense, the number to be specified is a minus one (or "-1"). But that would be too confusing and so we thought just the dash without the number was best.

A fully specified match code can always be used to generate a relationship code, as specified here. But rarely can a relationship code be used to derive a match code. The match code is more expressive and accurate with the exact path of ancestors to the common ancestor. For example, a 4C1R as mentioned earlier could be either on a paternal or maternal side; either a paternal grandfather or paternal grandmother side, and so on. This path is exactly specified in a match code because of the Ahnentafel number.
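The derivation in that direction can be sketched as follows, using the generation implied by each Ahnentafel number. This is only our illustration of the rules given in this section: the function name is hypothetical, and the unknown (000) and self (001 with 001) cases are rejected because no relationship code is defined for them here.

```python
def relationship_code(primary: int, match: int, flag: str = "") -> str:
    """Derive a relationship code such as '4C1R' from the two Ahnentafel numbers."""
    if primary < 1 or match < 1:
        raise ValueError("an unknown (000) side cannot be converted")
    if primary == 1 and match == 1:
        raise ValueError("a self match has no relationship code defined")
    # Generation of each number: 1 -> 0, 2-3 -> 1, 4-7 -> 2, ...
    gen_primary = primary.bit_length() - 1
    gen_match = match.bit_length() - 1
    nearest = min(gen_primary, gen_match)
    removed = abs(gen_primary - gen_match)
    # Generation 0 means one tester is the ancestor: use the dash form.
    cousin = "-" if nearest == 0 else str(nearest - 1)
    code = f"{cousin}C"
    if removed:
        code += f"{removed}R"
    return code + flag
```

Applied to the earlier examples, the numbers in #A032F064 give 4C1R and those in #A004F136 give 1C5R.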

We have adopted both compact specifications in our Ancestry, 23andMe, MyHeritage, and GEDMatch Tier1 match field. In our case, we simply prepend the relationship code to the match code.

Match Data (extension / addition)

The actual segment summary match data is not always available or carried along when the notes field is manipulated or extracted. As this data is useful and should be processable, we propose a further extension to add it here. While we could use the hash '#' as a field separator, it is likely easier to introduce a distinct symbol so simple processing can easily distinguish the fields. So we introduce the use of the ampersand ('&').
&Tnnn.n;Lnnn.n;Nnn
where
  • T is the total of all the matching segments included, given in cM or, if followed by a %, as a percentage
  • L is the longest matching segment (in cM)
  • N is the number of matching segments
Each has a numeric parameter.

Note that this data is specific to the company given in the match code before it. So, for example, 23andMe includes X matching in their values whereas other companies do not. We have not given formal BNF syntax above. The fields are ordered but each is optional. Ancestry just started providing the longest matching segment and so we have to go back in and modify by inserting the field. Oddly, they determine L before applying Timber and determining T. So we are often seeing entries where N is 1, and L is greater than T.
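Since no formal syntax is given, the following is only one possible reading of the match data extension: fields are separated by ';', identified by their leading letter, and any of them may be absent. The function name and the dictionary keys are our own, and the sketch does not enforce the field order.

```python
from typing import Dict, Union


def parse_match_data(text: str) -> Dict[str, Union[int, float]]:
    """Parse a match data extension such as '&T23.1;L18.9;N2'."""
    if not text.startswith("&"):
        raise ValueError("match data must start with '&'")
    result: Dict[str, Union[int, float]] = {}
    for field in text[1:].split(";"):
        if not field:
            continue  # tolerate an empty field, e.g. a trailing ';'
        key, value = field[0], field[1:]
        if key == "T":
            # Total is in cM unless a trailing '%' marks a percentage.
            if value.endswith("%"):
                result["total_percent"] = float(value[:-1])
            else:
                result["total_cm"] = float(value)
        elif key == "L":
            result["longest_cm"] = float(value)
        elif key == "N":
            result["segments"] = int(value)
        else:
            raise ValueError(f"unknown field: {field!r}")
    return result
```

A consumer of this data should not assume the longest segment is at most the total, given the Ancestry behaviour noted above.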

Background and Goals

The nomenclature should be precise, accurate, computer process-able, human readable, human determinable, and unique for 5 sigma of the specification match cases.

Although we have had this idea on a back burner for 4 years or so, it has been spurred by Jim Bartlett's Ancestry Notes tags, and by our long-standing desire, since working on autosomal matching and seeing the grave deficiencies of the crowd-sourced data being collected, to get accurate, determinable data to make further studies on. Our ultimate goal is to get the concept adopted in GEDMatch and other Autosomal match tools as a specifiable field in match lists. This will not only help people analyzing and tracking their matches, but will also allow for more accurate research by third parties with access to the full database. For example, answering the question "Is there a better correlation with matching longest segments in the autosomes than with the total matching segment size?". Tools interpreting the fields can also auto-mark trees with DNA inheritance paths, paint chromosomes with ancestor designations, provide more refined sorting of matches, and more. With Ahnentafel numbers, not only can you determine if the matches are patriline or matriline (and thus likely share the yDNA or mtDNA, respectively), but also whether xDNA sharing is possible and how strong it is likely to be. The analysis possible with the match database annotated with match codes is far reaching and will do more to enhance scientific study than poorly collected and specified surveys. True, the relationships are being specified by people and error can be introduced there. But the likelihood of that error, especially undetectable error, is much less.