A Short Tandem Repeat (STR) is one of two genetic marker types (or variance types) used to measure how the DNA of one person differs from that of another; the other type is the SNP. With 99.9% of our DNA identical, the focus is on finding and tracking these differences. Currently, all defined STR markers are on chromosomes; none have been defined in the mitochondria. Over 700,000 STR markers have been identified to date, with over 15,000 in the yDNA alone. Genetic Genealogy uses only the STRs on the Y chromosome. STRs are also known as microsatellites.
An STR is a repeating pattern of 1 to 5 base pairs. Repeats of longer base-pair sequences go by other terms, such as minisatellites. The number of repetitions (the count) defines the genetic marker value. Examples are ...GTAAAAAAAAACG... for a count of 9 "A" repeats, and ...GCCATACATACATATG... for a count of 3 "CATA" repeats. Depicted in the figure here is the DYS385 STR, taken from the archived SMGF site from the 2000s.
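As a rough illustration of how a repeat count is read off a sequence, the sketch below finds the longest uninterrupted run of a given motif. The function name `longest_tandem_run` is hypothetical, and the approach (a plain regular-expression scan) is a simplification for illustration, not the method used by any testing lab.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    # Match one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# The two example patterns from the text: nine "A" repeats, three "CATA" repeats.
poly_a = "GT" + "A" * 9 + "CG"
print(longest_tandem_run(poly_a, "A"))
print(longest_tandem_run("GCCATACATACATATG", "CATA"))
```

A real caller also has to locate the marker within the chromosome (by its flanking primer sequences) before counting, which this sketch leaves out.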
STR markers in the yDNA are what we care about in Genetic Genealogy, and they play a primary role in DNA surname studies. The yDNA is passed down each generation relatively unchanged, which provides a useful Patriline identification technique that also happens to match surname inheritance in most European populations. STR markers, as used to date, vary more than SNP markers and so provide a quick, easy and cheap way to identify family branch members on the same Patriline within the genealogical time frame. STR markers are typically measured by an early testing technique termed Capillary Electrophoresis (CE).
A Haplotype is the collection of all the STR genetic marker counts a tester has; in our use here, the yDNA genetic markers. Very similar or identical Haplotypes between testers usually imply a common patriline ancestor, unless there has been a Convergence of STR values between different ancient lines. So yDNA SNP values should also be checked for a match before concluding that common haplotypes indicate the same family branch. If the testers belong to different haplogroups, then the STR values have converged. SNP values define a Haplogroup and are not to be confused with STR values and the Haplotype they define. In practice, you need concordance on both the haplogroup and the haplotype to declare a family branch match in yDNA.
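The two-part check described above (haplotype similarity plus haplogroup agreement) can be sketched as follows. Everything here is an assumption for illustration: the function names, the simple step-wise distance (sum of absolute count differences), the `max_distance` threshold, the exact-equality haplogroup test, and the marker values are hypothetical and do not reproduce any company's matching rules.

```python
def genetic_distance(hap_a, hap_b):
    """Sum of absolute repeat-count differences over markers both testers have."""
    shared = hap_a.keys() & hap_b.keys()
    return sum(abs(hap_a[m] - hap_b[m]) for m in shared)


def is_branch_match(hap_a, group_a, hap_b, group_b, max_distance=2):
    """Require the same haplogroup AND a small haplotype distance."""
    if group_a != group_b:
        # Similar STR values across different haplogroups indicate convergence.
        return False
    return genetic_distance(hap_a, hap_b) <= max_distance


# Hypothetical testers with illustrative values for four yDNA markers.
tester_1 = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
tester_2 = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 11}

print(genetic_distance(tester_1, tester_2))                         # 1
print(is_branch_match(tester_1, "R-L21", tester_2, "R-L21"))       # True
print(is_branch_match(tester_1, "R-L21", tester_2, "I-M253"))      # False
```

Comparing haplogroup labels for exact equality is the crudest possible test; in practice two testers may report SNPs at different depths of the same branch of the tree, which would need a tree-aware comparison.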
STR genetic markers on the autosomal chromosomes are used for forensic and criminal identification, NOT Genetic Genealogy. The US Government's CODIS database is a catalog of 12 to 20 STR markers on the autosomes. There is no overlap between Genetic Genealogy testing and criminal or government testing (at this time), although law enforcement does hire outside consultants to search the large genetic genealogy databases for clues in their investigations, especially for Jane or John Does.
Most STR genetic markers are in the intergenic (i.e. non-coding or possibly "junk DNA") regions, as are most base pairs in general.
By definition (or by design, when a genetic marker is found and chosen for inclusion), STR genetic markers change count infrequently. STR genetic markers that change randomly every generation are considered too noisy and are not used. A number of the yDNA STR genetic markers, commonly those numbered 26 to 37 in FTDNA's ordering, are on the fringe of usefulness. Frequently changing markers do not help track the surname or Patriline. STR genetic markers that rarely change (say, much less often than once every five thousand years) are considered too stable, like SNP genetic markers, and are not of much use either, because rarely changing values tend to be shared by too many people. So scientists look for just the right "frequency" of change in an STR genetic marker to make it useful for Genetic Genealogy purposes. The more a marker's value varies among the general population, the more valuable the marker. That is, STRs with counts ranging between 5 and 25 are much more useful than ones confined to a small range like 9 to 11. Once haplotype, haplogroup and genealogy matches are confirmed between multiple testers, slight variances in STR values can be used to differentiate family lines in the genealogical time frame.
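To make the "wide range versus narrow range" comparison concrete, here is a small sketch that ranks markers by the width of their observed count range. The function names, marker labels and counts are hypothetical, and range width is only one crude proxy for usefulness; it says nothing about how often a marker mutates.

```python
def value_spread(observed_counts):
    """Width of the observed range of repeat counts for one marker."""
    return max(observed_counts) - min(observed_counts)


def rank_by_spread(population):
    """Sort marker names from the widest to the narrowest observed range."""
    return sorted(
        population,
        key=lambda name: value_spread(population[name]),
        reverse=True,
    )


# Hypothetical counts observed across five testers for two made-up markers.
population = {
    "marker_narrow": [9, 10, 10, 11, 10],  # confined to 9..11
    "marker_wide": [5, 9, 14, 18, 25],     # spread over 5..25
}

print(rank_by_spread(population))
```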
STR genetic marker testing for personal genomics was introduced by FTDNA in 1999. FTDNA processes the STR markers in "panels", usually with 12 or so STRs in each. The STRs are labeled following the advice of the HUGO Gene Nomenclature Committee and the NIST Forensic Science Program office. The nomenclature now most often used is the prefix DYS followed by a number. When the middle letter Y is replaced by a number, the name defines an autosomal STR genetic marker. "S" means a single marker and location (unique). "F" is used for multicopy markers (think Family). "Z" and "M" are seen more rarely and identify complex markers.
Duplicate (multi-copy) markers exist because the primer sequences used to find them are not unique to a single instance of the repeat pattern being sought. Those markers tend to keep the DYS prefix, with letter or roman numeral 'i' suffixes added to uniquely identify the multiple values, as opposed to using the DYF identifier. Occasionally, a marker is known only by the original name given by the discoverer who published it.
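The naming convention above can be illustrated with a small parser that splits a marker name into its chromosome, type letter, number and copy suffix. This is a sketch under stated assumptions: the regular expression, the `MarkerName` type and the `parse_marker_name` function are hypothetical, and the pattern only covers the D-prefix form described here, so markers known by a discoverer's original name come back as `None`.

```python
import re
from typing import NamedTuple, Optional

# D, then Y or a chromosome number, then a type letter, a number, and any suffix.
_NAME = re.compile(r"^D(Y|\d{1,2})([SFZM])(\d+)(.*)$", re.IGNORECASE)


class MarkerName(NamedTuple):
    chromosome: str  # "Y", or an autosome number such as "3"
    kind: str        # "S" single, "F" multicopy, "Z"/"M" complex
    number: int
    suffix: str      # copy suffix such as "a", "b", "i", "ii"; may be empty


def parse_marker_name(name: str) -> Optional[MarkerName]:
    """Split a D-prefix STR name into parts, or return None if it does not fit."""
    match = _NAME.match(name.strip())
    if match is None:
        return None
    chromosome, kind, number, suffix = match.groups()
    return MarkerName(chromosome.upper(), kind.upper(), int(number), suffix)


print(parse_marker_name("DYS385a"))
print(parse_marker_name("D3S1358"))
print(parse_marker_name("YCAII"))  # does not follow the D-prefix form
```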
When an SNP occurs in the repeat area itself, it shortens the repeat pattern as reported. Some companies report this change as the repeat count before the SNP, followed by the count of base pairs after it that would otherwise be part of the repeat. For example, 10.2 means 10 repeats with 2 base pairs after the SNP that breaks the repeat. When the SNP occurs in the primer used to locate the STR, the STR cannot be found and is reported as a Null or zero value. These STR repeat counts have to be determined by other testing means.
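A reader of such reports has to handle three shapes of value: a plain count, a count with trailing base pairs (such as 10.2), and a null. The sketch below shows one way to normalize them; the function name `parse_str_value` and the choice to treat the strings "0" and "null" as a null result are assumptions, since each company's export format may differ.

```python
from typing import Optional, Tuple


def parse_str_value(reported: str) -> Optional[Tuple[int, int]]:
    """Return (whole repeats, extra base pairs), or None for a null marker.

    Raises ValueError for text that is not a number, including an empty string.
    """
    text = reported.strip().lower()
    if text in {"0", "null"}:
        # The marker could not be located, e.g. an SNP in the primer region.
        return None
    whole, _, partial = text.partition(".")
    return int(whole), (int(partial) if partial else 0)


print(parse_str_value("13"))    # (13, 0)
print(parse_str_value("10.2"))  # (10, 2)
print(parse_str_value("Null"))  # None
```

Keeping the two parts as separate integers avoids treating 10.2 as a decimal number, which it is not: the ".2" counts base pairs, not tenths of a repeat.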
Sometimes STRs are discussed in numerical sequence groups, like 1-12 or 1-37. This is simply the order in which FamilyTreeDNA (FTDNA) added them to its testing product (panels) and subsequently reports them. You need to identify the underlying STR name to get an accurate understanding of the marker and its value. Because FTDNA has been the predominant, long-term, and still operating tester, a marker's position in that order is sometimes used to identify it.
FTDNA started with a yDNA 12 marker panel, expanded to 25 markers in two panels, then 37 markers in three panels. FTDNA is now up to 111 markers in an estimated 9 to 10 panels (although it calls the y67 test results panel 4 and the y111 results panel 5). Its recent expansion by another 389 markers is termed panel 6, although panel testing is not used to derive those values. It expanded reporting by another 200 or so with the BigY-700 test. yFull reports over 800 STRs extracted from the BAM files of yDNA NGS tests, including the BigY files, and is able to extract around 90% of the STRs in FTDNA's base y111 test this way. See the LobSTR tool description.
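Since markers are often referred to by their position in FTDNA's reporting order, it can help to map a position back to the results panel it is reported in. The sketch below uses the boundaries given above (12, 25, 37, 67, 111) for results panels 1 to 5; the function name and the decision to stop at position 111 are assumptions for illustration.

```python
from bisect import bisect_left

# Last marker position reported in results panels 1..5 (y12, y25, y37, y67, y111).
PANEL_LAST_POSITION = [12, 25, 37, 67, 111]


def reporting_panel(position: int) -> int:
    """Return the results panel (1-5) that reports the marker at `position`."""
    if not 1 <= position <= PANEL_LAST_POSITION[-1]:
        raise ValueError("position must be between 1 and 111")
    # Index of the first panel whose last position is >= the given position.
    return bisect_left(PANEL_LAST_POSITION, position) + 1


print(reporting_panel(12))   # 1
print(reporting_panel(26))   # 3
print(reporting_panel(111))  # 5
```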
This technique of extracting values from NGS testing does not appear to be as accurate as the specific CE panel technique used by FTDNA. This is most true of long markers that extend beyond the short segments sequenced in BigY. We are analyzing the accuracy of yFull STR extractions in this project. It appears that the Compound (or duplicative or multi-copy) markers are the most difficult to determine from BAM file analysis, especially when they are longer than the short segment.
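One simple way to frame such an accuracy comparison is to line up the CE-derived and NGS-extracted values for the same tester and list the markers that disagree. The sketch below assumes both result sets are available as name-to-count mappings; the function name and all values are hypothetical and are not taken from the project's data.

```python
def concordance(ce_values, ngs_values):
    """Compare two result sets on their shared markers.

    Returns (fraction of shared markers that agree, list of discordant markers),
    or (None, []) when the two sets share no markers.
    """
    shared = sorted(ce_values.keys() & ngs_values.keys())
    if not shared:
        return None, []
    discordant = [m for m in shared if ce_values[m] != ngs_values[m]]
    return 1 - len(discordant) / len(shared), discordant


# Hypothetical results for one tester; only the multi-copy value differs.
ce = {"DYS393": 13, "DYS390": 24, "DYS385a": 11, "DYS385b": 14}
ngs = {"DYS393": 13, "DYS390": 24, "DYS385a": 11, "DYS385b": 15}

print(concordance(ce, ngs))
```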
Surname studies are just beginning to study the new, expanded yDNA STR values beyond the base y111 (both from BigY and yFull) to understand the value they add to family branch studies. The hope is that a few marker values across the 700+ available can be identified as "modal" for a family branch, allowing a custom subset of markers to be tested individually (e.g. by YSEQ) to determine membership. Currently, with our B10 group, the y12 panel happens to include enough markers to provide that utility. With others in OSM, even y111 does not provide enough values in the haplotype to narrow down to just family branch members. So we are looking to find a pseudo-y12 panel for those groups, to make it easier to quickly determine membership at a lower entry cost and test level. With the dropping prices of Sequencing and WGS testing, we may see a shift to simply "test everything" at once.
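The search for a small custom marker set can be sketched as two steps: compute the modal (most common) value of each marker within a family branch, then keep the markers whose branch modal differs from the modal of the wider group. The function names and all marker values below are hypothetical; this is an illustration of the idea, not the procedure used for the B10 or OSM groups.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most common value per marker across a list of name-to-count mappings.

    Ties are broken in favor of the value seen first.
    """
    markers = set().union(*haplotypes)
    modal = {}
    for marker in sorted(markers):
        counts = Counter(h[marker] for h in haplotypes if marker in h)
        modal[marker] = counts.most_common(1)[0][0]
    return modal


def distinguishing_markers(branch_modal, background_modal):
    """Markers whose branch modal value differs from the wider group's modal."""
    return [
        m for m in branch_modal
        if m in background_modal and branch_modal[m] != background_modal[m]
    ]


# Hypothetical branch members and a hypothetical wider-group modal.
branch = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    {"DYS393": 13, "DYS390": 25, "DYS19": 14},
]
background = {"DYS393": 13, "DYS390": 24, "DYS19": 14}

branch_modal = modal_haplotype(branch)
print(branch_modal)
print(distinguishing_markers(branch_modal, background))
```

With real data the candidate list would still need checking against testers outside the branch, since a single off-modal value can also arise by convergence.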
STRs are known as Simple Sequence Repeats (SSRs) in some communities and as Microsatellites in general. STRs are a form of the more general Copy-Number Variant (CNV) that is part of WGS results. CNV genetic markers tend to be analyzed using Copy-Number Analysis in microarray testing or, more commonly, with Capillary Electrophoresis (CE) Fragment Analysis. These are much simpler and more targeted than full WGS sequencing techniques.
External Resources
- Relative frequencies of various marker values and their rate of change: SMGF (archived), Y-Base (archived), Leo Little's Freq chart (archived), Dean McGee's STR Grouping, Kerchner-R1b
- Willems, T., et al., The Landscape of Human STR Variation, 2014, Genome Research journal
- Wikipedia pages on Microsatellites, Copy-number analysis
- STRBase at NIST
- Butler, J.M., et al., Addressing Y-chromosome short tandem repeat (Y-STR) allele nomenclature. (2008) Journal of Genetic Genealogy 4(2): 125-148
- HUGO Gene Nomenclature Committee (HGNC) website (a more specific link for the STR naming standard is not available from there)
- Nomenclature defined at FTDNA, Wikipedia, SMGF (archived), and ISOGG DYS (see also ISOGG Y-STR)
- YDNA Testing Chart by Michael L. Hébert comparing the overlap of STR markers tested at various companies