Duplicates | The H600 Project

Duplicates are an artifact associated with sequencing. Their presence leads to bias during variant calling, so efforts are often made to recognize and ignore them, though not by everyone. A true duplicate is the physical occurrence and reading of more than one copy of a particular fragment.

Duplicates occur for various reasons.

Amplification during library preparation. Library preparation is the process of taking a DNA sample and preparing it for use in the sequencer. The lab may use PCR techniques to increase the amount of DNA available for sequencing. This is most often done when only a small amount of sample is available, or when a particular region is being targeted (such as the exome in WES, or the Poz regions in the FTDNA BigY test). PCR copies not only introduce bias for the particular fragment that is duplicated; the PCR process may also introduce a change that is then copied further, compounding the bias.
Duplication during a vendor's sequencing process. For example, to increase "signal" strength, Illumina will do in-place, nearby duplication of a fragment after it has been attached to the flow cell, creating multiple nearby copies of the same fragment. Sometimes these copies bleed into neighboring "pixels" of the flow cell when imaged. Or the instrument may be slightly out of alignment and read a neighboring location on the flow cell in addition to the primary target pixel. This is the source of what is known as optical duplication.
Sample coincidence is a form of false-positive duplication. The random shearing of the original DNA to create fragments may produce two fragments that are identical (the same section of DNA from the same chromosome location, each starting at the same base pair). These are not true duplicates, but they usually cannot be distinguished from the other types. Sample coincidence is usually rare, but it can be a factor in very high read-depth sequencing such as WES.
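To get a rough feel for why sample coincidence grows with read depth, here is a minimal sketch. It assumes, purely for illustration, that every fragment start lands uniformly at random on one of a fixed number of possible start sites, and it ignores strand, fragment length, and coverage bias. The function name and parameters are hypothetical.

```python
def expected_coincident_fragments(num_fragments: int, num_start_sites: int) -> float:
    """Expected number of fragments that share a start site with an earlier fragment.

    Assumes each fragment start is drawn uniformly and independently from
    num_start_sites possible positions (a simplification of real shearing).
    """
    # Probability that a given start site is used by none of the fragments.
    p_unused = (1.0 - 1.0 / num_start_sites) ** num_fragments
    # Expected number of distinct start sites actually used.
    expected_distinct = num_start_sites * (1.0 - p_unused)
    # Every fragment beyond the first at a site would look like a duplicate.
    return num_fragments - expected_distinct


if __name__ == "__main__":
    sites = 10_000  # hypothetical number of possible start positions in a region
    for n in (100, 1_000, 10_000):
        frac = expected_coincident_fragments(n, sites) / n
        print(f"{n} fragments over {sites} sites: coincident fraction {frac:.4f}")
```

Under these assumptions the coincident fraction rises as more fragments are packed into the same region, which is consistent with coincidence mattering mostly at very high read depth.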

Duplicates are usually identified and, at minimum, marked if not removed, after the output of the sequencer is aligned to a reference model. They are identified by looking for fragments that share the same start base-pair location. The goal is to mark all occurrences other than the first encountered as duplicates and to exclude them from base or variant calling, avoiding the bias their presence would introduce.
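The marking step described above can be sketched in a few lines. This is a simplified illustration, not how any particular tool is implemented: the `AlignedFragment` record and `mark_duplicates` function are hypothetical, and the sketch keys only on chromosome and start position, whereas real tools typically apply additional criteria.

```python
from typing import List, NamedTuple


class AlignedFragment(NamedTuple):
    """Hypothetical minimal record for a fragment aligned to the reference."""
    name: str
    chrom: str
    start: int  # start base-pair location on the reference


def mark_duplicates(fragments: List[AlignedFragment]) -> List[bool]:
    """Return one flag per fragment: True if it should be marked as a duplicate.

    The first fragment seen at a given (chromosome, start) is kept;
    every later fragment at the same location is flagged.
    """
    seen = set()
    flags = []
    for frag in fragments:
        key = (frag.chrom, frag.start)
        flags.append(key in seen)  # duplicate only if location was seen before
        seen.add(key)
    return flags


if __name__ == "__main__":
    reads = [
        AlignedFragment("r1", "chrY", 1000),
        AlignedFragment("r2", "chrY", 1000),  # same start as r1
        AlignedFragment("r3", "chrY", 1005),
        AlignedFragment("r4", "chr1", 1000),  # same start, different chromosome
    ]
    for read, is_dup in zip(reads, mark_duplicates(reads)):
        print(read.name, "duplicate" if is_dup else "kept")
```

Flagging rather than deleting mirrors the "marked if not removed" approach: downstream base or variant calling can skip flagged fragments while the original data stays intact.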

Structures