
Sequencing File Formats

There are many differing file formats used to represent DNA test result data. Some are simple and unique to a particular company; they are covered in the Microarray File Formats page. Others are standards used in the medical and research industry, albeit with varying implementations that are not always compatible. We attempt to cover the salient points of the more standard formats used in sequencing bioinformatics here; specifically, what is returned with the 30x WGS testing that comprises the 3rd Wave in genetic genealogy testing.

Note that some DNA results are too simple to warrant a file format and are simply provided as a comma-separated list of values. yDNA STRs, yDNA named SNPs and mtDNA SNP variant calls are common examples, even though the underlying data may have originated from sequencing or microarray test techniques.

The bioinformatics processing pipeline for sequencing and microarray DNA testing can most simply be described as shown here:
[Figure: Sequencing Bioinformatics Pipeline]

It is these intermediate file formats in the sequencing pipeline that are covered here and in another document we developed (Section 3.3.1 in Bioinformatics for Newbies).

Note that many of these file formats are more like container formats, similar to modern movie streaming formats like MP4. So it is not enough to know the data is in a specific format; one must also know what stage of processing is contained in that format and thus what type of data. This is not always apparent from the content, the file name or its file name extension.

For most of these bioinformatic file types, the shell command htsfile (from the htslib / samtools program release) can be used to characterize a file: its type, version, rough content and compression method.
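As a quick, hedged sketch of running htsfile on a few files (the file names below are placeholders and the exact report wording will vary with file versions):

htsfile sample.bam sample.vcf.gz reads_R1.fastq.gz
# Typical output, one line per file, naming the format and compression:
#   sample.bam:         BAM version 1 compressed sequence data
#   sample.vcf.gz:      VCF version 4.2 BGZF-compressed variant calling data
#   reads_R1.fastq.gz:  FASTQ gzip-compressed sequence data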

The sequencing file formats started life as simple TSVs representing tables of data: text files that could be viewed in a text editor, captured into a word processing document or even a spreadsheet, and easily written, saved and exchanged. Originally, these file formats contained just small portions of DNA as extracted by various researchers, perhaps a single gene or similar. After the creation of the human reference genome and of lab instruments capable of performing whole exome / genome sequencing, the sizes of the files grew exponentially, beyond the capabilities of standard processors and disks of the time.

As the volume of the data grew, the formats had to evolve: compression was introduced, as well as more compact, binary forms. While many files can still be represented simply as TSVs, the varying sizes of the cell entries of the more complicated formats are so great that the human readability of the files is often lost. Not to mention the files have grown to millions, if not hundreds of millions, of text lines — far too large for any traditional text processing program, spreadsheet or other human-viewable form. A typical WGS test returns over 100 GB of compressed data.

In comparison, the Microarray File Formats are, for the most part, still simple TSV format files. In fact, they are really just annotated, "RAW" VCF files with many columns stripped; hence where the name RAW files comes from. RAW files can be processed by many of the same tools that process full sequencing results in similar formats, and may still be used for simple research results of just a single gene or small area.

A typical 30x WGS test result with over 90 gigabases of sequencing data in a TSV text file will have nearly a billion lines and be over 300 gigabytes in size. Thus the files need to be compressed to store, transfer and even manipulate during processing. Most microarray file formats are delivered in compressed form as well, even though they are considerably smaller (maybe 50 megabytes when uncompressed). Given most sequencing bioinformaticians were working on Unix machines, the gzip compression standard from that platform was adopted. But gzip alone is not enough, which has led to some confusion.

All the file formats covered here have now been defined to have a "header" in addition to the main data content body. As some formats only recently gained a header definition, not all files you encounter will have one; FASTA/FASTQ in particular. Much of the header content is free-form, human-readable comment, although some header records (such as the sequence and field definitions in SAM and VCF files) are structured and used by the tools. Header lines are marked by a leading character: the hash ('#') in VCF and the simple TSV formats (a convention originally introduced for line comments in Unix shell scripts back in the 1970s), and the at-sign ('@') in SAM.

Because of the circular nature of mtDNA, many bioinformatic software pipelines cannot handle sequencer reads of mtDNA where the sequences overlap the tail and head of the DNA strand. Special processing and handling may be required to properly read such data.

Early on in BigY's existence, FTDNA did not filter out non-yDNA results in their delivered BAM files. As those other results were not necessarily quality reads, they later started filtering them out before supplying BAM files to the customer. But you may still encounter one of those old BAM files with a very small amount of non-yDNA data in them. They are not WGS BAM files.

It is interesting to note that early sequencers put out gigabases of data over a few days and only had read lengths of 20-40 base-pairs. They would output the images of the flow cells in TIFF format (a hold-over from the facsimile days and formats), one image for each base-pair in the read length. Microarray testers do the same but with only a single image. Post-processing software was then used to perform the image analysis to extract base-pair read values with quality metrics. The "stacked" images were then used to create the read segments, which were then stripped of any tags that may have been added on. As this extra processing became complex but standardized by each vendor based on their tools' characteristics, the sequencer vendors now include this processing within the sequencers themselves. Hence the file format output to users is the already created, quality-annotated read segments in the FASTQ files.

Following the section on External References here, we will introduce and cover each of the file formats given in the sequencing flow chart above. This material is a subset (summary) of more extensive materials in our Bioinformatics Documents series we have written.

External References


Tabs of various file formats

The file formats are each described below, including examples of the uncompressed, human-readable TSV form underlying each. Each major file format is in a separate tabbed section, like tabbed file folders in a file drawer. Click a tab on a file format to see information about that format. Click the "no tab" button to the far right to put all the tabs inline; useful to print or save the page. Each file format tab is listed by its most common name, but there are variants of the format and names as well; these are described within each respective tab.



FASTQ format

Originally known as the FAST format, and more commonly now as FASTA, this is a special format capturing the data as now delivered by the sequencing lab equipment. The FASTA format holds the sequencer reads not yet mapped to a genome build, nor filtered much for quality control. They do have some instrument tagging stripped by the time they are delivered in the FASTA format. Mapping these raw sequencing read segment values to a reference genome build will result in a SAM / BAM file encapsulating these same read sequences but with additional information.

A FASTQ is the FASTA format with quality annotations added. FASTx files tend to be text, just as SAM files are, but due to their size they are most often found compressed. These files are not TSV format; instead, they have 2 to 4 lines per read segment stored. In a very simplistic view, the SAM / BAM format simply converts this 4-line format into 4 columns so there is a single line (or row) per read segment, and then adds some additional annotations (the alignment) to each row as additional columns of data.

SAM / BAM files without alignment information are sometimes created and used. They are very similar in information content (and format) to the FASTQ files mentioned here. Basically, both are containers for the short sequence reads of fast, full sequencers, with varying levels of information added.

Because these files are pre-alignment to a genome build, they are important to retain, as future build improvements can yield greatly improved mapping and thus reading of SNP values. Luckily, FTDNA must have retained the FASTQ files of their previous BigY customer runs, as they automatically provided a remapped BAM to BigY customers when they introduced the BigY-500 product, which changed the reference model used from HG19 to HG38.
@CL100119390L2C001R001_7/1
ACCTCAAGTGATCCGCCGCCTTGGTCTCCCAAAATGCTGGGATTACAGGCATGAGCAGCCCGGCTGACTCAGTGTAATCTTATTG
+
FF>GGFFFEGCFFFGFFFFFFEFFFFGGFFGFFGGFFGFFFEFFGGFFGFGGF=FFFFFFFBFFAF>FCFGFFF>EFFFFFEFG

The four lines of a FASTQ file are:
  1. The identification line that must start with an "@" character and then a name unique to the read segment in the file
  2. The actual read segment of nucleotide bases, consisting mainly of the A, T, C and G character values representing the bases read
  3. A separator line (starting with a '+') indicating that the next line holds the quality values for the bases read in the second line
  4. The actual quality segment providing a numeric (coded in ASCII text form) value for each corresponding base in the second line
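As a quick, hedged sketch (the file name is a placeholder), you can view one complete four-line record and count the read segments in a compressed FASTQ from the command line:

zcat reads_R1.fastq.gz | head -4                    # show the first four-line record
echo $(( $(zcat reads_R1.fastq.gz | wc -l) / 4 ))   # number of read segments in the file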

In the latest standard, a new header format has been adopted, but most tools have not yet been updated to process (or even ignore) it, so it is rarely seen. The read segment will often be near the target read length set by the lab equipment. Older runs may be 60, 70 or even 80 characters; newer segments are 100, 120 or even 150 characters (or bases) long. The last two of the four lines do not exist in FASTA files, only in FASTQ.

FASTA, FASTQ

FASTA files are the unaligned, short segment reads from the second-generation, massively parallel sequencers. FASTQ is the form that includes quality data along with the reads and is the most common form encountered. The FASTQ is a simple text format with four-line entries per read segment. The newly developed specification for the format is adding an information header to the file like already exists in SAM and VCF files. Sometimes the files are given a .fasta extension; other times, a .fa or even .fna extension, although these should really be reserved for the final assembly reference genome builds described in another tab.

Being an ASCII-text-defined file, it can be very large for a typical 30x average read depth, 90 gigabase result from a sequencer. As a result, the file is almost always compressed (and the tools understand and handle this). If compressed in the BGZF format (as used in BAM files), the files can be concatenated even in their compressed, binary form. Some vendors deliver separate FASTQ files from each lane in the sequencer; multiple lanes are used on a sample to build up to the 30x average read depth and 90 gigabases. Most often, compressed FASTQ files are given a .fastq.gz extension or a similar variation mentioned before.
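For example, here is a hedged sketch of merging such per-lane deliveries (the file names are placeholders); because gzip and BGZF streams can be concatenated, no decompression is needed:

cat lane1_R1.fastq.gz lane2_R1.fastq.gz > sample_R1.fastq.gz
cat lane1_R2.fastq.gz lane2_R2.fastq.gz > sample_R2.fastq.gz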

The most common encounter will be the output from a paired-end, short read sequencer, likely 100 to 200 base-pairs per read segment. The number of bases per read is fairly constant, with little variance, and is set by the lab during the tool run. So for paired-end reads, there will be two files, each with around 300 million read segments of 150 base-pairs. This represents around 90 gigabases. Assuming fully equal likelihood of reading any base-pair in the whole genome, with the 3.2 billion base-pair human genome, that 90 gigabases represents about 30 reads per base-pair (roughly). And hence the specs of 90 gigabases and average read depth of 30 (or 30x as often seen). Note that the allosome chromosomes (in a male) will have half as many reads; there is only one of each for every two autosomes. A FASTQ file takes around 2.3 to 2.5 bytes per base-pair, so 90 gigabases will be about 200 GB; hence the reason to keep them compressed. Unlike BAM files, FASTQs tend to be read serially through, so no index file or block compression is necessary. A 7x compression is typically achieved. The paired-end FASTQ files (which have the same number of gigabases but different values) are near identical in size before compression and usually within 10% of each other in size after compression.
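A minimal sketch of that arithmetic, using illustrative numbers rather than any particular vendor's exact figures:

reads=600000000        # 2 files x 300 million paired-end read segments
read_len=150           # bases per read segment
genome=3200000000      # ~3.2 billion base-pair human genome
echo "gigabases:     $(( reads * read_len / 1000000000 ))"   # ~90
echo "average depth: $(( reads * read_len / genome ))x"      # ~28x, quoted as "30x"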

There is no real index file for the FASTA / FASTQ because there is no real order to the entries. They are the sequencer read outputs with minimal processing done to them, and thus are not tagged with any information as to their source. Think of a FASTQ file as a box of the random pieces of a jigsaw puzzle. You could think of artificial ways to try and sort them (by majority color or lightness, by edge type, etc.) but nothing very useful to the future processing.



Binary Alignment Map (BAM) file format

A BAM file is simply a binary-encoded form of the Sequence Alignment Map (or SAM) TSV file. It is not generally readable by a human in a standard text format, even after uncompressing. Sequencing machines really put out an unaligned, unmapped sequence file (generally called FASTx, where different letters stand in for the 'x' depending on various factors). BAM is the popular format in use for sequencing result data today, as it provides the actual reads of DNA segments dumped out by the test equipment, but annotated with additional information such as where each read likely aligns to a standard human reference genome. Key is that the reads have already been initially interpreted, with additional data added. As such, unless a BAM was subsetted, it can be used to recreate the FASTQ files that one started with.

BAM, SAM, CRAM (BAI, CRAI)

BAM is the term used to describe a binary, more compact form of a SAM file. A Sequence Alignment Map (or SAM) file is a textual, pseudo-TSV format that is easier to view and sometimes process. So, in reality, it is more proper to use the term SAM file for the content of a BAM file. But as SAM files from sequencing tend to be around 400 gigabytes in their uncompressed form, one rarely ever encounters the uncompressed SAM file directly. So everyone simply uses the term BAM and the BAM format to describe the content.

The BAM file uses a modified form of the gzip compression format named BGZF: a block format that allows the large file to be compressed in smaller blocks that are simply packed together. When used with a BAM Index file (or BAI), this allows the BAM file to be accessed in the middle, a portion of it uncompressed and possibly updated, and then stored back in, all without uncompressing the whole file. More on the compression format(s) is given in another tab.
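A hedged sketch of that random access in practice, using the samtools commands mentioned below (file and region names are placeholders):

samtools index sample.bam                    # writes the companion sample.bam.bai index
samtools view sample.bam chr1:10000-10100    # uses the index to seek into just the needed BGZF blocks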
CL100119390L2C010R002_385185 163 1 10000 11 49M1I6M1I43M = 10069 165 TTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTTACCCTAACCCTAACCCTAAC EFFFDDEEFEEEEDEFCCDAAABDC@BEDDDEDEDEEEECDDECB<AEFF;A5DCEAEFFEDFFFFDEEFFDDF5EFFFCEFFFD RG:Z:1 AS:i:77 XS:i:84 NM:i:4

The above is a single line from a SAM file. The line has "wrapped" around due to the web-page formatting. It consists of Tab-Separated Value (TSV) columns of data. The first, tenth and eleventh columns should be recognizable from the example given in the FASTQ tab: they are the read segment name, the actual read segment base values, and the quality segment that corresponds to the read segment. Columns 3 and 4 represent some of the alignment information; here they indicate the segment is mapped to chromosome 1 starting at position 10,000 (in forward, positive-strand order). The other values are special codes and tags; we refer you to the standard to comprehend them.

The BGZF format includes a special EOF compressed block added to the end of the file. Because standard gzip tools can read and process the BGZF format, many users will accidentally uncompress the BAM file using a non-BGZF tool and then try to recompress it using that same tool. When doing this, the special, separate EOF block is lost, as well as the other per-block header information; the file is no longer block-compressed but simply one big, single block. So if you get an error that the BAM is missing the EOF marker or is of the wrong format, it is often because someone did not use a BGZF-aware program to process the file. Use the program htsfile to check which compression format was used on your BAM file. Often, uncompressing and then re-compressing using bgzip will solve the problem.
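A hedged sketch of that check-and-repair (file names are placeholders; samtools can usually still read the damaged file and rewrite it with proper BGZF framing):

htsfile suspect.bam                           # reports BGZF- versus plain gzip-compressed
samtools view -b -o repaired.bam suspect.bam  # rewrite as a properly BGZF-compressed BAM
samtools index repaired.bam                   # indexing now succeeds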

Extracting the SAM format from a compressed block of a BAM is similar to what happens with a DNA strand in the cell. Just as when the cell uses DNA to create proteins, only small sections of interest at a time are uncompressed, unwound and turned into a more readable form. In a similar manner, rarely is the whole BAM uncompressed into a SAM file and made available for perusing. SAM has only recently been more formally specified; before that it was a convention initially developed and defined by the GATK and SAMTools tools during the 1000 Genomes Project.

To make it easier to find information in a random-access way in the block-oriented, compressed BAM file, another file is defined called a BAM Index file, or BAI for short, often given an extension of .bai.

The CRAM file is a very specialized format that gets more compression out of the original SAM data. Unlike the BAM, which is a simple gzip variant of the plain-text SAM format, a CRAM is highly compressed by taking a column-oriented look at the aligned, per-base-pair data as well as compressing the runs of base-pair data in the row format. There is no hope of recovering data from a CRAM without the reference genome file used to create the BAM and then the CRAM. The CRAM file is often around 50% smaller than the original BAM; significant when you have BAM files in the 40 to 60 gigabyte range.
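A hedged sketch of converting between the two (file names, including the reference FASTA, are placeholders; note the same reference is required in both directions):

samtools view -C -T GRCh38.fa -o sample.cram sample.bam       # BAM to CRAM
samtools view -b -T GRCh38.fa -o roundtrip.bam sample.cram    # CRAM back to BAM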

SAMTools is the main tool used to manipulate and view SAM / BAM / CRAM files.

BWA, BWA-MEM2, Minimap2, SNAP and others are the alignment programs that take in FASTQ file(s) and create mapped, aligned BAM files. These are mostly focused on short-read sequencer output.
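A hedged sketch of a typical alignment step with one of those tools (thread counts and file names are placeholders; the reference must first have been indexed with 'bwa index'):

bwa mem -t 8 GRCh38.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -@ 8 -o sample.bam -       # align, then coordinate-sort into a BAM
samtools index sample.bam                      # build the .bai companion index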

Because a BAM file is very similar to a FASTQ in that it stores mainly the read segments, there is also a not-often-encountered Unaligned BAM (or uBAM) file format. The BAM takes the read segment and its quality value and puts them both into a single TSV row, along with alignment and other process information. A stripped BAM can be created without the alignment information columns and thus simply represent the FASTQ. There is usually no real benefit to this, so it is rarely encountered, but it is possible to find in practice.
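Related to this, a hedged sketch of recovering the original paired FASTQ files from a whole-genome BAM (file names are placeholders):

samtools sort -n -@ 4 -o name_sorted.bam sample.bam    # FASTQ extraction wants name-grouped reads
samtools fastq -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
    -0 /dev/null -s /dev/null name_sorted.bam          # write the read 1 and read 2 files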

WGS Extract is a tool that reads BAM files, performs variant calling, and then outputs microarray file format files. It will also process the BAM and output an mtDNA FASTA and a subset yDNA / mtDNA BAM file, useful for submitting to other tools in the community. YFull takes in the BAM to analyze the Y and mtDNA for haplogroup analysis and placement in a phylogenetic tree, and thus must perform variant calling itself.




Variant Call Format (VCF)

A VCF file is a format for sequencing data that historically only records the variations from the reference Genome Build used to create the BAM file it is derived from. Over 99% of any tested user's genome is the same as the reference, so this allows much more efficient storage of the data based just on the differences from the reference genome model. The format was originally developed by the 1000 Genomes Project but is now maintained along with the other HTSLib specifications.

#CHROM  POS    ID         REF  ALT  QUAL  FILTER  INFO                     FORMAT       NA00001         NA00002         NA00003
20      14370  rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2  GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:


Because a VCF typically only contains the derived variant values of SNPs and some INDELs, it is not considered as useful in this form for genetic genealogy; you often want to know what was read and tested in the genome, not just what was detected as different. Often, a version of a VCF file called a gVCF is therefore asked for and used. This file is in the same basic format but includes block areas defined as "all reference": tested, and with quality values for the block as a whole.

A RAW (simple) VCF file is the first file format post pile-up from a BAM that contains data organized by base-pair (as opposed to read segment), and after determining heterozygous versus homozygous values in diploid regions like the autosomes. Often it is created before the actual variant calling, where the values are compared against the reference genome and possibly filtered due to low quality metrics. This format would be more useful to genetic genealogy, especially when including the gVCF block definitions of the reference genome areas that were tested.

The VCF is a tab-separated, ASCII text file (TSV), similar to a comma-separated (CSV) general exchange file for spreadsheets. Unlike for microarray tests, a VCF file can be very large, into the gigabytes, so they are often compressed using BGZF. Like the SAM (whose compressed, binary form is termed a BAM), the VCF also has a binary counterpart, termed a BCF. Although most VCF files are compressed, you rarely see the extension .bcf used, similar to what was found for SAM versus BAM. Like for the BAM, an index file is often created, termed a Tabix index and given a file extension of .tbi after the corresponding VCF file name. Most VCF files are delivered with a .vcf.gz extension instead of a .bcf one.

As with all the formats, understanding the Genome Build version used to generate the data is crucial to interpreting the data delivered. This is especially true for formats like VCF that only report detected variation from the referenced Genome Build: not every SNP tested and its resultant value, only the variants. The microarray file format is a gross simplification of a VCF file and in fact mimics the early format before many additional columns were added. VCF tools can often read microarray RAW results. These variant file formats often do not clearly indicate the reference model the values are reported on (a VCF header may name it, but a microarray RAW file will not); one frequently has to guess, and can really only do that if rsID or similar annotations are included.

BCFTools and IGV / Genome Browser are some of the main tools developed to manipulate files in this format. Confusingly, SAMTools mostly only sees and processes BAM files, while BCFTools mostly only sees and processes VCF / BCF files.

There are many steps or stages to converting a BAM file of aligned sequencer reads into a variant-only file of base-pair differences. It is the intermediate stages that can generate the differing but related VCF files like RAW and gVCF. The GATK tools, or in particular bcftools mpileup, are the main methods of initially converting a BAM file of sequencer segments to final values. The pileup process converts the file from a segment-oriented one to a column, per-base-pair-oriented one. A variant caller then takes the per-base-pair value set from all the overlapping segments, determines the quality, and decides on the singular (homozygous) or dual (heterozygous) resulting value. Remember most of our DNA exists in the autosomes, where we have two similar chromosomes, each with its own value for each base-pair. The final step is to compare the per-base-pair value to the reference genome to determine if the value(s) are ancestral (that is, the same as the reference genome) or derived (and thus vary from the reference genome). When this variance analysis is done on a per-base-pair basis, it is called SNP variant calling. There are other variant types such as Insertion-Deletions (InDels), Copy-Number Variants (CNVs, of which STRs are a subset), and Structural Variants (SVs). After pileup and determining a singular / double value, but before comparing to the reference genome and quality filtering, the file can be saved as a RAW VCF. After variant calling, but before throwing out the ancestral values, it is called a gVCF file. Once filtered so only the variants are left, a standard VCF is produced.
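A hedged sketch of those steps with bcftools (file names, the reference, and options are placeholders; GATK has its own equivalent commands):

bcftools mpileup -Ou -f GRCh38.fa sample.bam \
    | bcftools call -m -v -Oz -o variants.vcf.gz   # -v keeps only variant sites; drop it to keep all sites
bcftools index -t variants.vcf.gz                  # writes the Tabix (.tbi) index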

It should be noted that once the BAM is processed to call variants, information is lost, and you cannot go backwards. A FASTQ to BAM conversion only adds information (the alignment to the reference genome) and generally does not trim anything out, so a BAM can be used to recreate the original FASTQ file as delivered from the sequencer tool pipeline. But going from a BAM to a VCF loses information (all the overlapping segment read values, as well as the ancestral value results and quality information). Thus creating a VCF is a one-way process. One should always save their BAM (possibly in the more compact CRAM format) along with the reference genome file; together, you can recreate all the other data. As one might expect, a final, filtered VCF is often 1/50th the size of the original BAM and reference genome files. Although there is only roughly one variant per thousand base-pairs, there is a lot more overhead per base-pair recorded in the VCF record.

DNA Kit Studio, from the Third Party Tools section, works on VCF files as well as RAW File Format files. Its main use here is to transform WGS VCF files into RAW File Format files that mimic the output from a microarray test, which can then be loaded into an autosomal segment analysis and match database. Genvue, Promethease, and Sequencing.com all read VCF files to create reports on the medical implications of variants found. But it should be reiterated that a VCF is not a good starting point for generating a microarray file format output; if anything, a gVCF is needed as the starting point, and a RAW VCF is even better. It is suspected that, because of the similarity in stage of information between a microarray file format and a RAW VCF (both tab-separated value formats, both containing values whether ancestral or derived, etc.), the microarray file formats are often called RAW files.




Tab-Separated Value (TSV)

TSV and the earlier-developed Comma-Separated Value (CSV) format are not unique to genetics or sequencing. But they are so important as an underlying feature of all the file formats described that we feel some space should be dedicated to them. These file formats are mainly outgrowths of early desktop spreadsheet file exchange formats dating back to the early 1980s, and their use here is really just an outgrowth of scientists hand-creating unformatted text files that they later wanted to exchange and use. TSV is also a very simple format for scripting languages like Python to accommodate.

TSV is really just trying to reliably capture what would appear to be a table in a text file. Using a fixed-width font, here is an example table one might have that you also want to be computer processable:
# rsid       chromosome  position  genotype
rs4477212    1           82154     AA
rs3094315    1           752566    AG
rs3131972    1           752721    AG
rs12124819   1           776546    AA
rs11240777   1           798959    AG
The example above is actually the start of a microarray file format. bcftools can take in TSV format files, as VCF is a complex form of them. In fact, bcftools can read and write microarray file formats, which are simple TSV format files. In the above example, we have inserted multiple tabs in some places to make it more readable and apparent as a table. Some tools allow multiple successive tabs between single fields; others interpret multiple successive tabs as specifying null (empty) fields.
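As a hedged sketch of that capability (file names and the sample label are placeholders), bcftools convert can turn a 23andMe-style microarray TSV into a VCF, given the matching reference FASTA:

bcftools convert --tsv2vcf microarray_raw.txt \
    -c ID,CHROM,POS,AA -s SampleName \
    -f GRCh38.fa -Oz -o microarray.vcf.gz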




BGZF

The sequencing tool developers built a block-oriented compression format named BGZF as an extension to gzip. A normal gzip program can actually uncompress this file format, as it is an extension of the gzip format using capabilities defined there. But a gzip program cannot recreate a BGZF-format compressed file. Add to that the fact the developers did not create a unique extension for this format (like, say, .bgz); as a result, most simply reuse .gz for these files. This all leads to much confusion. All the files covered here, including FA, BAM, VCF and FASTQ, are generally compressed and used in the BGZF file format.

To understand the BGZF format, think of how a ZIP file in Windows (originally created by PKZIP) can contain many files and folders in its compressed format. Unix's TAR, which creates a single file from many files (originally to write to a Tape ARchive), was often then piped to gzip to compress that single large file, achieving a similar end result to ZIP. But to get a single file out of a tar-then-gzip archive, you have to read through the whole file, unzipping along the way, until you come to the file of interest. PKZIP compresses each file individually, then uses a TAR-like action to sandwich the files together while putting a file name index at the head, the index containing the block count into the file for quick, independent access of any file.

For the sequencing BGZF format, it is not multiple files being compressed and sandwiched together. Instead, one very large file of sequence data is broken up into regular-sized blocks, each block gzip-compressed independently and then sandwiched together. If the file content in TSV form is sorted on one of the columns of data before compression, then one can look at what value is stored at the start of each block and create an index — much like the file index of PKZIP before. Using that index, a tool can jump into the middle of the large file to find just the block desired and then uncompress that single block alone. This is an important core concept of all sequencing file formats used today. The files ending in "i" for index are the indexing files for the BGZF-compressed files: .fai, .bai, etc. for FASTA, BAM, etc. Today, the sequencing file formats, which are ultimately TSV source files in the end, are almost always saved in the block-compressed format known as BGZF.

Even though the BAM, CRAM and VCF file formats are defined with this compression, many tools expect the compression state to be part of the file name. The BGZF creators did not promote a new extension (like .bgz). As BGZF was built as an extension to the gzip format, many simply adopted the gzip convention of adding a trailing .gz file name extension, which is really an overuse of the extension and incorrect. Some do not realize this issue and compress the sequencing file format using plain gzip, which the processing tools cannot fully handle as they often need the block index. Hence why it is useful to use the htsfile tool to check which compression method was used on a file, in addition to identifying what the content of the file is. FASTAs often appear with the extension .fasta.gz, as they are not required to be compressed in all uses. VCFs are normally required to be compressed, and so some tools require the extension .vcf.gz, but not all files may appear that way. Note, to confuse things further, there is also the BCF (the binary form of a VCF, just as BAM is the binary form of a SAM); a file with a .vcf.gz extension may occasionally turn out to be a BCF-format file.

While gzip-compressed files can sometimes be processed by the tools, they cannot be indexed, and most tools require the companion index to the compressed files. So it is important to always keep a BGZF-compressed format of your files, to always generate BGZF-compressed files as part of your processing and output, and to generate their corresponding index files as well.
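A hedged sketch of doing exactly that for a sorted VCF (file and region names are placeholders):

bgzip sorted.vcf                       # produces sorted.vcf.gz in BGZF form
tabix -p vcf sorted.vcf.gz             # produces the sorted.vcf.gz.tbi index
tabix sorted.vcf.gz chr1:1-100000      # jump straight to one region without reading the whole file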

CRAM has its own, special, custom form of compression that is built into the file format. While it is still ultimately a block-oriented, compressed container, the content of the blocks is not recognizable from the original SAM format. CRAM files can only be uncompressed by a special CRAM converter that also requires access to the reference genome final assembly FASTA file covered in another tab.



Final Assembly (FA) file format


The reference model Genome Build is often specified as a parameter to some of the tools; specifically, the aligner that takes in the raw read segments from a FASTQ file and matches (i.e. aligns) the reads to a reference model. This reference model is stored in a file format known as a Final Assembly. It is simply a FASTA format file with a special format for the sequence names. Commonly the file has a .fa or .fna extension, but sometimes simply .fasta. Generally a final assembly for human genome work consists of the 24 primary chromosomes, the mitochondrial model, and any additional contigs to represent special unplaced areas, patches or fixes that are known to exist.

The FASTA has a main index file (.fai), but often has multiple index files (for example, a .gzi index file as well). Most deliver the FA in plain gzip-compressed format, which cannot be easily indexed. But most tools require the reference to be BGZF-compressed (or uncompressed), with an index that indicates in which block a sequence (chromosome or similar) starts.
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
taaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
accctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaac
cctaacccaaccctaaccctaaccctaaccctaaccctaaccctaacccc


BWA and other aligners need a special index suitable for their aligning process, each with its own form. Often those index files are considerably larger than the original FA file, by 3-5 fold. A human reference genome final assembly file, when BGZF compressed, is often around 1 GB to represent the 3 billion plus bases contained within. Important to understand is that an FA is always haploid; it contains only one copy of each chromosome, unlike the paired chromosomes found inside a cell.
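As a hedged sketch of preparing a downloaded reference for use (file and region names are placeholders; whether an aligner accepts a compressed reference varies by tool):

zcat GRCh38_download.fa.gz | bgzip > GRCh38.fa.gz    # recompress the plain-gzip download as BGZF
samtools faidx GRCh38.fa.gz                          # writes the .fai and .gzi index files
samtools faidx GRCh38.fa.gz chr1:10001-10060         # pull one small region using those indexes
bwa index GRCh38.fa.gz                               # builds BWA's own, much larger index files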