There are many sources and pathways, other than the DNA test labs themselves, for creating result files in the microarray file format. Some are valid and useful. Others are often used incorrectly by naive users and so create bad data. There are even rogue operators who wish to mimic lab files or create false files not derived directly from lab results. With the introduction of WGS testing, and the need to create microarray file format results to load into existing genetic genealogy sites, this has become even more of an issue. So we propose a simple and elegant solution to the problem using commonly available techniques. It does not prevent bad data from being created and "signed". Rather, it guarantees the source of the data, so that bad sources can be identified and perhaps rejected outright. It could also enforce a bread-crumb trail, so to speak, if data is processed multiple times along the way.

Many have called for years for some method of secure data transfer to be enforced, often without understanding how the security field accomplishes this, or without proposing ways that would avoid disrupting the current practices, tools and methods of operating in the market. What we show here is a simple and elegant method that minimally changes the files from how they are created and used now, but adds verifiability to file content that comes from trusted suppliers. Tools that accept the data can thus more easily discern, and learn to trust, certain players.

Looking in our glossary here, one can find the description of the typical microarray file format. It consists of one to 20 lines of header information (often starting with a shell-like comment delimiter, the hash or '#'), followed by data in a tab-separated format. These are readable, ASCII format files. Keeping these files readable is key, as tools are already built to accept and read them as is, and users have grown accustomed to being able to open the files in a text editor and read their content. The general form proposed here is in wide use in many scenarios today: the HTTPS protocol, GPG for signed email messages, even portions of bitcoin transactions. The process to tag the files so their authenticity can be verified is as follows:

(1) Checksum the complete, existing standard text file, header and all. Use MD5, SHA-256 or whatever is chosen and deemed appropriate. (MD5 is no longer considered collision-resistant, so SHA-256 or similar is the safer choice where deliberate tampering is a concern.)
(2) Sign the checksum with a private key. Release the public key so that anyone can verify the signature on the checksum and thus verify the content.
(3) Add the checksum and signature to the header as a new line. The first line is likely easiest. Alternatively, use an existing line to make tools more compatible with the new format. Adding it to the end of the file could be best, but tools may not be expecting a header comment line after the end of the real data. Whatever is most compatible, allowing existing tool sets to work with signed and unsigned files without change, is best.
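
The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the comment-line labels, the function name tag_file and the choice of SHA-256 are ours, and the signing step is left as a caller-supplied function (for example, a wrapper around an OpenPGP tool) because the Python standard library has no OpenPGP support.

```python
import hashlib

# Hypothetical comment-line labels (an assumption for this sketch).
HASH_PREFIX = b"# SHA-256 Hash: "
SIG_PREFIX = b"# Signature: "


def tag_file(data: bytes, sign) -> bytes:
    """Return the file bytes with checksum and signature lines prepended.

    `sign` is a caller-supplied function that takes the checksum string and
    returns a one-line ASCII signature string.
    """
    # Step (1): checksum the complete, unmodified file, header and all.
    digest = hashlib.sha256(data).hexdigest().upper()
    # Step (2): sign the checksum (delegated to the caller's signing tool).
    signature = sign(digest)
    # Step (3): add both as new comment lines at the very start of the file.
    header = (HASH_PREFIX + digest.encode("ascii") + b"\n"
              + SIG_PREFIX + signature.encode("ascii") + b"\n")
    return header + data
```

Working on raw bytes rather than decoded text keeps the checksum independent of line-ending translation, which matters when the same file moves between operating systems.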

For tools that wish to check: find the signed checksum, verify the signature on the checksum with the public key, remove the checksum and signature from the header (in the same form they were added), and then recompute the checksum of the remaining file and compare it with the stored one. One could simplify this further by signing the whole file content, as the signature process itself creates a hash of the content and then signs that hash (along with some information like the date and time). But we thought the visible checksum, which can be checked without verifying the signature, would be nicer for the casual user, and somewhat independent of signing the document.
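
The content check (without the signature check) might look like the sketch below. It assumes the two tag lines are the first two lines of the file and use the hypothetical labels shown; the names split_tagged and content_matches are ours. Verifying the signature itself would still have to be done with an OpenPGP tool and the signer's public key.

```python
import hashlib

# Hypothetical comment-line labels (an assumption for this sketch).
HASH_PREFIX = b"# SHA-256 Hash: "
SIG_PREFIX = b"# Signature: "


def split_tagged(data: bytes):
    """Split a tagged file into (stored_checksum, signature, original_bytes)."""
    first, rest = data.split(b"\n", 1)
    second, original = rest.split(b"\n", 1)
    if not (first.startswith(HASH_PREFIX) and second.startswith(SIG_PREFIX)):
        raise ValueError("file does not start with the expected tag lines")
    stored = first[len(HASH_PREFIX):].decode("ascii").strip()
    signature = second[len(SIG_PREFIX):].decode("ascii").strip()
    return stored, signature, original


def content_matches(data: bytes) -> bool:
    """Recompute the checksum of the untagged content and compare it."""
    stored, _signature, original = split_tagged(data)
    computed = hashlib.sha256(original).hexdigest().upper()
    return stored.upper() == computed
```

A True result only shows that the content agrees with the stored checksum; it says nothing about who wrote that checksum until the signature is verified.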

Crucial to any use of public-key cryptography is having an authenticated or trusted way of obtaining the public key of a known entity. If not, a man-in-the-middle attack can occur, with someone masquerading as the key signer. Traditional key servers, like those used with PGP, can be utilized to obtain the public keys. Or maybe ISOGG can take it upon itself to verify the source and post the public keys for the various companies that might sign.

One could consider extending this to WGS result files such as FASTQ, BAM and VCF files, but this would be more problematic. Not all have headers (FASTQ has only had one added in a recently proposed change to the de facto format). The files are very large (tens of gigabytes for 30x WGS results). And they are often already binary and block compressed using the BGZF algorithm (which allows them to be indexed and selectively uncompressed, much like DNA during transcription of a gene). Checksumming the binary or uncompressed form of such a file can take a long time, both to do and to check. The checksum itself stays the same short, fixed length however large the file is, but the whole file has to be read to compute it, because you have to checksum the whole file to prevent tampering. Another main issue to resolve is how to add the signed checksum to an already checksummed and compressed file like a BAM. The BGZF format does define a trailing EOF block that consists of a small number of bytes. Maybe one could add the checksum after that and add it to the definition of that block in the standard. That is not necessarily compatible with existing tools, but it is an elegant solution. This use needs to be thought through a bit further.
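
For files of this size, the checksum would normally be computed by streaming the file in chunks rather than loading it into memory. A minimal sketch, assuming SHA-256 and a hypothetical helper name sha256_of_file:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays constant this way, so the cost is dominated by how fast the file can be read from disk.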

Note that the checksum is added separately, in a visible way, so that anyone can verify the open-text file has not been altered without having to find the right public key and verify the signature. Separately from signing, one could also encrypt the whole file. You would do this if you want to securely transmit a file to someone else so that only that person can open it. This is done daily in OpenPGP email applications. In that case, you use the intended recipient's public key to encrypt the file content. Then only the recipient, who holds the matching private key, can decrypt and see the content. Combined with signing and with hashes inside the file, this can be used to securely transmit a guaranteed-unaltered file between two parties. For example, a DNA test company could sign the file and encrypt it using the receiver's key; the match database receiver could then decrypt it with their private key and verify the signature and hash. Again, we discourage encryption and encourage retaining the open file format for the general use case, as is done today.
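
For the two-party case, the sign-and-encrypt exchange could be driven from Python by calling the GnuPG command line. The sketch below only builds and runs the commands; it assumes gpg is installed, that the sender's private key and the recipient's public key are already in the local keyring, and the function names and key identifiers are hypothetical.

```python
import subprocess


def sign_and_encrypt_cmd(src: str, dest: str, signer: str, recipient: str) -> list:
    """Build the gpg command a sender would run to sign and encrypt a file."""
    return ["gpg", "--armor",
            "--local-user", signer,      # key used to sign (sender's private key)
            "--recipient", recipient,    # key used to encrypt (receiver's public key)
            "--output", dest,
            "--sign", "--encrypt", src]


def decrypt_and_verify_cmd(src: str, dest: str) -> list:
    """Build the gpg command a receiver would run; gpg also checks the signature."""
    return ["gpg", "--output", dest, "--decrypt", src]


def run(cmd: list) -> None:
    """Run a gpg command and raise CalledProcessError if gpg reports failure."""
    subprocess.run(cmd, check=True)
```

A sender would call run(sign_and_encrypt_cmd(...)) and the receiver run(decrypt_and_verify_cmd(...)); the result is no longer an open text file, which is why this is not suggested for the general case.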

As an example, we show doing this for just the portion of the 23andMe data file shown here. Running the MD5 checksum gives the value AA032B168B9349E042C70486452295CF. The signature of just this checksum (as it would appear if dumped in a normal email message stream) is shown here:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

AA032B168B9349E042C70486452295CF
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEJGqkVk36UWco8TZBdX3RiueeSl0FAl5awY4ACgkQdX3Riuee
Sl0sHwf/T1Zpq6gFffZi85sE4KoTjT33GX9DbBbkm9HrvTCZm2AVdMeYr+XCzz9B
xd297DoGQhqTcmXmOsMK7JZjeaZerXV3loqNry/ngY7EspPselWdxq2YknwzZY8z
isIXSHSuft+PZhxWJ49qOliycLpZcUfc1A7fspsOgB51/gVNQO2dNs7gWoost4Lx
9/HRDxD94+Mhyw/45CM6MLJFt+jTw2NXcLSlvGsU5W8hmoMS++6NGSFsZwEgnIRj
3DtUkU2uhmBoyfTejXpHzJ5vyaXYXSx6PP1omcchPEaNIUD0qS9eMmTOt+IacsXy
rckFV1xK+HnNCExT9GZpHXT/OrmZXw==
=V1fb
-----END PGP SIGNATURE-----


From https://keys.gnupgp.net, you can get the public key used above to verify this signed checksum value. The new header is created with the signed checksum included and might look like the following. In this case, the checksum and signature are added to the start of the file as comments.

# MD5 Hash: AA032B168B9349E042C70486452295CF
# Signature: iQEzBAEBCAAdFiEEJGqkVk36UWco8TZBdX3RiueeSl0FAl5awY4ACgkQdX3RiueeSl0sHwf/T1Zpq6gFffZi85sE4KoTjT33GX9DbBbkm9HrvTCZm2AVdMeYr+XCzz9Bxd297DoGQhqTcmXmOsMK7JZjeaZerXV3loqNry/ngY7EspPselWdxq2YknwzZY8zisIXSHSuft+PZhxWJ49qOliycLpZcUfc1A7fspsOgB51/gVNQO2dNs7gWoost4Lx9/HRDxD94+Mhyw/45CM6MLJFt+jTw2NXcLSlvGsU5W8hmoMS++6NGSFsZwEgnIRj3DtUkU2uhmBoyfTejXpHzJ5vyaXYXSx6PP1omcchPEaNIUD0qS9eMmTOt+IacsXyrckFV1xK+HnNCExT9GZpHXT/OrmZXw===V1fb
# This data file generated by 23andMe at: Thu Dec 17 14:11:20 2015
#
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been 
.....
# rsid	chromosome	position	genotype
rs12564807	1	734462	AA
i3001395          MT      15530     --

Once you remove the added signature and checksum lines, you can generate the checksum for the original file and verify that the checksums match. This verifies the original file is unaltered (according to the stored checksum), even though it is plain, open text. But to verify the checksum itself has not been altered, you need to verify the signature, and thus verify that the signature was generated by the company you think signed the file. Desktop tools could also sign the files they generate, although it is a bit harder for them to keep their private key private when doing so. The resulting verification in GUI-like interfaces would report something like:
Signature created on Saturday, February 29, 2020 1:53:14 PM
With certificate: H600 Test  (757D D18A E79E 4A5D)
The signature is valid and the certificate's validity is ultimately trusted.
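
The single-line signature comment in the example header is simply the armored signature block with its line breaks removed. Before a standard OpenPGP tool can check it, the armor has to be rebuilt. A sketch of both directions, assuming the armor ends with a "=XXXX" checksum line as in the example and that re-wrapping at 64 characters is acceptable; the function names are hypothetical.

```python
BEGIN_SIG = "-----BEGIN PGP SIGNATURE-----"
END_SIG = "-----END PGP SIGNATURE-----"


def fold_signature(armored: str) -> str:
    """Join the body lines of an armored signature into a single line."""
    lines = [line.strip() for line in armored.strip().splitlines()]
    body = lines[lines.index(BEGIN_SIG) + 1:lines.index(END_SIG)]
    # Any armor headers (such as "Version:") end at the first blank line.
    if "" in body:
        body = body[body.index("") + 1:]
    return "".join(body)


def unfold_signature(folded: str, width: int = 64) -> str:
    """Rebuild an armored signature block from its single-line form."""
    data, crc = folded[:-5], folded[-5:]
    if not crc.startswith("="):
        raise ValueError("expected a trailing '=XXXX' armor checksum")
    wrapped = [data[i:i + width] for i in range(0, len(data), width)]
    return "\n".join([BEGIN_SIG, "", *wrapped, crc, END_SIG]) + "\n"
```

The rebuilt block, together with the checksum text it signs, can then be handed to the verification tool in the same form in which it was originally produced.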

Note that signing a "document" actually means encrypting the hash of that document, along with some other information, with the private key. We added the step of creating the hash checksum of the file first, and then signing that hash, so that the content could be checked against the hash without having to verify the signature on the hash. But in reality, someone could still alter the hash, so you really always need to verify the signature of the hash. One could therefore simply sign the WHOLE document and include the signature directly, a simplification that removes the MD5 (or similar) checksum from the visible file. This is what typical email protocols using signatures do. Here is what a typical "message" that signed the whole file (using our same example) might look like:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

# This data file generated by 23andMe at: Thu Dec 17 14:11:20 2015
#
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been 
# individually validated for accuracy. As such, this data is suitable only for research, 
# educational, and informational use and not for medical or other use.
# 
# Below is a text version of your data.  Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier 
# (an rsid or an internal id), its location on the reference human genome, and the 
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing 
# improvements in our ability to call genotypes. More information about these changes can be found at:
# https://www.23andme.com/you/download/revisions/
# 
# More information on reference human assembly build 37 (aka Annotation Release 104):
# http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606
#
# rsid	chromosome	position	genotype
rs12564807	1	734462	AA
i3001395          MT      15530     --
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEJGqkVk36UWco8TZBdX3RiueeSl0FAl5aw+sACgkQdX3Riuee
Sl2ZHggAsbkL5/VrDyVmnDXNnh70X9de61SqUdl3rLUDvGnEX80vv22sMyK4TWEA
VDgf6vs7/piFc6JyMbv5sLayGDKvPhP92XMHph7OJZs6LFJ46jYzyCz1TL+j6c1f
CwOpxghjBNDmjAyEbrJP9G+SbWK1aAkftDyh6VVlsZtjym6iF+izbthn1MNNsQah
yH0mWdo6A4r5OzndhN1acAe1UzsXlITJ2aNOGPQLeAxMWfArBXVzAXaMdOnH3DjP
N90ofBRXScyxkVXtgxEoPRX7fsCxs/+VrBMw8kUoov/JUCQgoH7YLHbZ12HhY4F3
gPP5Ov0gOP4BnSws9NjsA0pzabTKbw==
=pOEh
-----END PGP SIGNATURE-----

The signature could simply be added as a one-line comment somewhere in the file; it is often best placed on the first or last line so it can be easily stripped. Note that the files could still be subsequently compressed (e.g. zipped) and uncompressed as they are now without affecting this verification process.
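
If the whole-file signed "message" form were distributed as is, a tool that does not understand it would first need to recover the original file text. A sketch of that step, assuming the standard OpenPGP cleartext layout (armor headers, a blank line, the text, then the signature block); the function name is hypothetical and no signature is checked here.

```python
BEGIN_MSG = "-----BEGIN PGP SIGNED MESSAGE-----"
BEGIN_SIG = "-----BEGIN PGP SIGNATURE-----"


def extract_cleartext(message: str) -> str:
    """Return the original text carried inside a clear-signed message."""
    lines = message.splitlines()
    begin = lines.index(BEGIN_MSG)
    sig = lines.index(BEGIN_SIG)
    # Armor headers such as "Hash: SHA256" end at the first blank line.
    body_start = lines.index("", begin) + 1
    body = lines[body_start:sig]
    # Undo dash-escaping: signed lines that begin with "-" are sent as "- -...".
    body = [line[2:] if line.startswith("- ") else line for line in body]
    # A final newline is added for convenience; it is not part of the signed text.
    return "\n".join(body) + "\n"
```

This only unwraps the text; whether it is trustworthy still depends on verifying the signature block against the signer's public key.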

References