The standard reference genome of SARS 2 is known as Wuhan-Hu-1, and it was described by Wu et al. in a paper titled "A new coronavirus associated with human respiratory disease in China" (https://www.nature.com/articles/s41586-020-2008-3). There are currently three versions of Wuhan-Hu-1 at GenBank: MN908947.1 is 30,474 bases long and was published on January 12th 2020, MN908947.2 is 29,875 bases long and was published on January 14th 2020, and MN908947.3 is 29,903 bases long and was published on January 17th 2020 (https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3?report=girevhist). I'm not sure if the dates are UTC or local time.
The only differences between the three versions are at the beginning and end of the sequences:
$ brew install mafft
[...]
$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.'{1,2,3}>wuhu.fa
$ mafft --globalpair --clustalout --thread 4 wuhu.fa
[...]
CLUSTAL format alignment by MAFFT G-INS-1 (v7.490)

MN908947.1 cggtgacgcatacaaaacattcccaccataccttcccaggtaacaaaccaaccaactttc
MN908947.2 cggtgacgcatacaaaacattcccaccataccttcccaggtaacaaaccaaccaactttc
MN908947.3 -----------attaaaggtt-----tataccttcccaggtaacaaaccaaccaactttc
           *. *** .** .*********************************

[... 29,760 bases omitted from the middle of the alignment]

MN908947.1 aatgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttcctgg
MN908947.2 aatgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttctt-----
MN908947.3 aatgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttaggag
           *******************************************************

MN908947.1 gatcttgatttcaacagcctcttcctcatactattctcaacactactgtcagtgaacttc
MN908947.2 ------------------------------------------------------------
MN908947.3 aatgacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa---------------------

MN908947.1 taaaaatggattctgtgtttgaccgtaggaaacatttcagtattccttatctttagaaac
MN908947.2 ------------------------------------------------------------
MN908947.3 ------------------------------------------------------------

MN908947.1 ctctgcaactcaagtgtctctgccaaagagtcttcaaggtaatgtcaaataccataacat
MN908947.2 ------------------------------------------------------------
MN908947.3 ------------------------------------------------------------

MN908947.1 attttttaagtttttgtatacttcctaggaatatatgtcatagtatgtaaagaagatctg
MN908947.2 ------------------------------------------------------------
MN908947.3 ------------------------------------------------------------

MN908947.1 attcatgtgatctgtactgttaagaaacacttcaaaggaaaacaacctctcatagatctt
MN908947.2 ------------------------------------------------------------
MN908947.3 ------------------------------------------------------------

MN908947.1 caaatcctgttgttcttgagttggaatgtacaatattcaggtagagatgctggtacaaaa
MN908947.2 ------------------------------------------------------------
MN908947.3 ------------------------------------------------------------

MN908947.1 aaagactgaatcatcaggctattagaaagataaactaggcctctaacatagtaactttaa
MN908947.2 ------------------------------------------------------------
MN908947.3 ------------------------------------------------------------

MN908947.1 cagtttgtacttatacatattttcacattgaaatatagttttattcatgactttttttgt
MN908947.2 ------------------------------------------------------------
MN908947.3 ------------------------------------------------------------

MN908947.1 tttagcttctctgtcttccattatttcaagctgctaaaaattaaaaatatcctatagcaa
MN908947.2 ------------------------------------------------------------
MN908947.3 ------------------------------------------------------------

MN908947.1 agggctatggcatctttttgtaaaaataaggaaagcaaggttttttgataatc
MN908947.2 -----------------------------------------------------
MN908947.3 -----------------------------------------------------
The first and second versions started with the 27-base segment CGGTGACGCATACAAAACATTCCCACC, which was changed to ATTAAAGGTTT in the third version. If you remove the first three bases from CGGTGACGCATACAAAACATTCCCACC, then it's identical to a 24-base segment that's included near the end of all three versions of Wuhan-Hu-1:
$ seqkit locate -p tgacgcatacaaaacattcccacc wuhu.fa|column -t
seqID       patternName               pattern                   strand  start  end    matched
MN908947.1  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       4      27     tgacgcatacaaaacattcccacc
MN908947.1  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       29360  29383  tgacgcatacaaaacattcccacc
MN908947.2  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       4      27     tgacgcatacaaaacattcccacc
MN908947.2  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       29360  29383  tgacgcatacaaaacattcccacc
MN908947.3  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       29360  29383  tgacgcatacaaaacattcccacc
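The three-base trim is easy to reproduce with plain coreutils (a quick sanity check, no seqkit needed):

```shell
# dropping the first three bases of the 27-base 5' segment leaves the
# 24-base segment that seqkit locate found near the 3' end of all three versions
echo CGGTGACGCATACAAAACATTCCCACC | cut -c4-
# → TGACGCATACAAAACATTCCCACC
```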
I have found several similar assembly errors in viral sequences, where a short segment at either end of the sequence consists of another part of the genome or its reverse complement. For example, in a SARS 1 sequence titled "SARS coronavirus SoD" (AY461660.1), there's a 24-base segment at the 3' end which is the reverse complement of another 24-base segment near the 3' end:
$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=AY461660.1'>sod.fa
$ seqkit grep -nrp AY461660 sod.fa|seqkit locate -p cacattagggctcttccatatagg|column -t
seqID       pattern                   strand  start  end    matched
AY461660.1  cacattagggctcttccatatagg  +       29692  29715  cacattagggctcttccatatagg
AY461660.1  cacattagggctcttccatatagg  -       29629  29652  cacattagggctcttccatatagg
$ seqkit subseq -r 29629:29652 sod.fa # the match on the reverse strand is a reverse complement of the match on the positive strand
>AY461660.1 SARS coronavirus SoD, complete genome
CCTATATGGAAGAGCCCTAATGTG
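The reverse complement itself can also be checked without seqkit, using only `rev` and `tr`:

```shell
# reverse-complement the search pattern: reverse the string, then complement
# each base; the result equals the subsequence printed by seqkit subseq above
echo CACATTAGGGCTCTTCCATATAGG | rev | tr ACGT TGCA
# → CCTATATGGAAGAGCCCTAATGTG
```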
The assembly errors are easy to spot if you do a multiple sequence alignment of several viral sequences and look for inserts at the beginning or end of the alignment which are included in only one sequence but missing from all the others. Then you can do a BLAST search for the insert to find its origin.
The 3' end of the first version of Wuhan-Hu-1 accidentally included a segment of 618 bases of human DNA (https://twitter.com/ChrisDeZPhD/status/1290218272705531905). The first 20 of the 618 bases are actually included near the 3' end of the SARS 2 genome, and they're also the last 20 bases of the second version of Wuhan-Hu-1 (TGTGATTTTAATAGCTTCTT). You can do a BLAST search for the last 618 bases by going to https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn, pasting the sequence below into the big text field at the top, and pressing the "BLAST" button:
$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'|seqkit subseq -r -618:-1
>MN908947.1 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome
tgtgattttaatagcttcttcctgggatcttgatttcaacagcctcttcctcatactatt
ctcaacactactgtcagtgaacttctaaaaatggattctgtgtttgaccgtaggaaacat
ttcagtattccttatctttagaaacctctgcaactcaagtgtctctgccaaagagtcttc
aaggtaatgtcaaataccataacatattttttaagtttttgtatacttcctaggaatata
tgtcatagtatgtaaagaagatctgattcatgtgatctgtactgttaagaaacacttcaa
aggaaaacaacctctcatagatcttcaaatcctgttgttcttgagttggaatgtacaata
ttcaggtagagatgctggtacaaaaaaagactgaatcatcaggctattagaaagataaac
taggcctctaacatagtaactttaacagtttgtacttatacatattttcacattgaaata
tagttttattcatgactttttttgttttagcttctctgtcttccattatttcaagctgct
aaaaattaaaaatatcctatagcaaagggctatggcatctttttgtaaaaataaggaaag
caaggttttttgataatc
The best match is 616 bases long because it's missing the last 2 bases of the query, but 614 of its 616 bases are identical to a result titled "Human DNA sequence from clone RP11-173E24 on chromosome 1q31.1-31.3, complete sequence".
Another common assembly error is that a piece of host DNA or host rRNA is accidentally inserted at either end of the contig. For example, I noticed that at the beginning of the genome of the SARS-like bat virus Rs7907, there was an insertion which was missing from other SARS-like viruses:
$ curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1j-YFiMYG4DkVKSget2fYW-gaJDy6NCkW'
$ seqkit fx2tab sarslike.fa|grep -Ev 'Rs7931|Ra7909'|grep -C10 Rs7907|awk -F\\t '{print$1 FS substr($2,1,200)}'|seqkit tab2fx|mafft --globalpair --quiet -|seqkit fx2tab|awk -F\\t '{print $2 FS$1}'|cut -d, -f1|column -ts$'\t'
g----------------------------------------------------------------------ctttaactttttacaaatcccaggtagcaaaaccaaccaactct  MT040333.1 Pangolin coronavirus isolate PCoV_GX-P4L
g----------------------------------------------------------------------ctttaactttttacaaatcccaggtagcaaaaccaaccaactct  MT040335.1 Pangolin coronavirus isolate PCoV_GX-P5L
a----------------------------------------------------------------------ttaaaggtttatac-cttcccaggtagcaaaaccaaccaactct  MW532698.1 Pangolin coronavirus isolate GX_P2V
g----------------------------------------------------------------------ctttaactttttacaaatcccaggtagcaaaaccaaccaactct  MT040334.1 Pangolin coronavirus isolate PCoV_GX-P1E
----------------------------------------------------------------------------------------tcccaggtagcaaaaccaaccaactct  MT072864.1 Pangolin coronavirus isolate PCoV_GX-P2V
g----------------------------------------------------------------------ctttaactttttacaaatcccaggtagcaaaaccaaccaactct  MT040336.1 Pangolin coronavirus isolate PCoV_GX-P5E
a-----------------------------------------------------------------------------------c-cttcccaggtaacaaaaccaaccaac-ct  OL674075.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7905
-------------------------------------------------------------------------------------------------------------------  OL674081.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7952
--------------------------------------------------------------------------------------cttcccaggtaacaaaaccaaccaac-ct  OL674079.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7924
------------------------------------------------------------------------------------c-cttcccaggtaacaaaaccaaccaac-ct  OL674078.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7921
ggggattgcaattattccccatgaacgaggaattcccagtaagtgcgggtcataagcttgcgttgattaagtccctgcccttt---gtacacaccgcccaaaaccaaccaac-ct  OL674076.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7907
-------------------------------------------------------------------------------------------------------------------  OL674074.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7896
a----------------------------------------------------------------------ttaaaggtttttac-cttcccaggtaacaaaaccaaccgac-ct  MZ081380.1 Betacoronavirus sp. RsYN04 strain bat/Yunnan/RsYN04/2020
a----------------------------------------------------------------------ttaaaggtttttac-cttcccaggtaacaaaaccaaccgac-ct  MZ081376.1 Betacoronavirus sp. RmYN05 strain bat/Yunnan/RmYN05/2020
a----------------------------------------------------------------------ttaaaggtttttac-cttcccaggtaacaaaaccaaccgac-ct  MZ081378.1 Betacoronavirus sp. RmYN08 strain bat/Yunnan/RmYN08/2020
--------------------------------------------------------------------------------------ctacccagg---aaaagccaaccaac-ct  AY304488.1 SARS coronavirus SZ16
-------------------------------------------------------------------------------------------------------------------  OK017852.1 Sarbecovirus sp. isolate YN2020B
-------------------------------------------------------------------------------------------------------------------  OK017856.1 Sarbecovirus sp. isolate YN2020F
g------------------------------------------------------------------------------------------------------------------  OK017855.1 Sarbecovirus sp. isolate YN2020E
-------------------------------------------------------------------------------------------------------------------  OK017854.1 Sarbecovirus sp. isolate YN2020D
g------------------------------------------------------------------------------------------------------------------  OK017853.1 Sarbecovirus sp. isolate YN2020C
When I did a BLAST search for the first 150 bases of Rs7907, the first 96 bases were 100% identical to several sequences of mammalian DNA or rRNA, such as "PREDICTED: Pan troglodytes 18S ribosomal RNA (LOC129143085), rRNA" and "Delphinus delphis genome assembly, chromosome: 21".
When the middle part of an RNA sequence is known but it's not certain if the ends of the sequence are correct, the ends can be sequenced accurately using a method called RACE (rapid amplification of cDNA ends). RACE is also known as "one-sided PCR" or "anchored PCR", and it basically requires specifying only a single primer for each amplified segment instead of a forward and reverse primer like in regular PCR (https://en.wikipedia.org/wiki/Rapid_amplification_of_cDNA_ends).
In the third version of Wuhan-Hu-1, an additional 44 bases which were missing from the second version were added to the end. I believe it's because they used RACE to sequence the ends of the genome. The paper by Wu et al. said: "The genome sequence of this virus, as well as its termini, were determined and confirmed by reverse-transcription PCR (RT–PCR)10 and 5′/3′ rapid amplification of cDNA ends (RACE), respectively." (https://www.nature.com/articles/s41586-020-2008-3)
Supplementary Table 6 includes the primers they used to do RACE (https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2008-3/MediaObjects/41586_2020_2008_MOESM1_ESM.pdf):
D. Primers used in 5'/3' RACE

| Primer | Sequence | Position (product size) |
| --- | --- | --- |
| 5-GSP | CCACATGAGGGACAAGGACACCAAGTG | 573-599 (599 bp) |
| 5-GSPn | CATGACCATGAGGTGCAGTTCGAGC | 491-515 (515 bp) |
| 3-GSP | TGTCGCGCATTGGCATGGAAGTCACACC | 29212-29239 (688 bp) |
| 3-GSPn | CTCAAGCCTTACCGCAGAGACAGAAG | 29398-29423 (502 bp) |
In 2020, Steven Quay posted a tweet where he wrote: "Dr. Shi lied? Below is NEW RaTG13 sequence https://ncbi.nlm.nih.gov/nuccore/MN996532 with the missing 1-15 nt at 5' end. So I blasted 1-180 of this new sequence against her RNA-Seq for RaTG13 to find the missing 15 nt. But they are not there. So 1-15 nt appears FABRICATED by Dr. Shi & WIV." (https://twitter.com/quay_dr/status/1318041151211884552) Actually, the RACE sequences of RaTG13 were only added to the SRA in December 2021, after Quay's tweet, but one of them matched the segment of 15 nucleotides which was added to the start of the second version of RaTG13 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR16979454):
$ fastq-dump SRR16979454 # RACE sequences of RaTG13
$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN996532.'{1,2}|mafft ->ratg.fa # first and second version of RaTG13
[...]
$ Rscript -e 'm=t(as.matrix(Biostrings::readDNAStringSet("ratg.fa")));pick=which(rowSums(m!=m[,1])>0);write.table(cbind(m[pick,],pick),quote=F,col.names=F,row.names=F)'
- A 1
- T 2
- T 3
- A 4
- A 5
- A 6
- G 7
- G 8
- T 9
- T 10
- T 11
- A 12
- T 13
- A 14
- C 15
C T 2125
C A 2132
C A 6936
G A 7127
G T 18812
C G 18813
A - 29856
A - 29857
A - 29858
A - 29859
A - 29860
A - 29861
A - 29862
A - 29863
A - 29864
A - 29865
A - 29866
A - 29867
A - 29868
A - 29869
A - 29870
$ seqkit locate -irp attaaaggtttatac SRR16979454.fastq|column -ts$'\t' # search RACE sequences for the 15-base segment that was added to the start of the second version of RaTG13
seqID          patternName      pattern          strand  start  end  matched
SRR16979454.4  attaaaggtttatac  attaaaggtttatac  -       461    475  ATTAAAGGTTTATAC
As far as I know, the RACE sequences used by Wu et al. were never published, but I guess they matched a longer part of the poly(A) tail than the metagenomic reads, which only included a few bases of the tail.
The third version of Wuhan-Hu-1 has a 33-base poly(A) tail, but I think they sequenced the tail with RACE, because only a few bases of the tail are included in the metagenomic reads published at the SRA. The output below shows reads that aligned against the end of the third version of Wuhan-Hu-1. To save space, I only included one read per starting position, even though many positions have over ten aligned reads. The first column shows the starting position and the second column shows the number of mismatches relative to Wuhan-Hu-1:
$ brew install bowtie2 fastp samtools
[...]
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/081/SRR10971381/SRR10971381_{1,2}.fastq.gz
[...]
$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>sars2.fa
$ bowtie2-build sars2.fa{,}
[...]
$ fastp -i SRR10971381_1.fastq -I SRR10971381_2.fastq -o SRR10971381_1.trim.fq.gz -O SRR10971381_2.trim.fq.gz -l 70
[...]
$ bowtie2 -p3 -x sars2.fa -1 SRR10971381_1.trim.fq.gz -2 SRR10971381_2.trim.fq.gz --no-unal|samtools sort -@2 ->sorted58.bam
[...]
$ samtools view sorted58.bam|awk -F\\t '!a[$4]++{x=$4;y=$10;sub(/.*NM:i:/,"");print x,$1,y}'|tail -n30
29763 1 AATGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAG
29767 2 TTCAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
29769 8 GGCTGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCGTCTCTTTAGA
29771 4 GGTCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCAA
29772 2 TTTTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
29773 1 GTTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGA
29775 2 TATGGAGAGCTGCCTATATGGAAGAGCCCTAATGTATAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACA
29776 6 TTGGAGAGCTGCCTATATGGCAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGATGC
29777 2 ATGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAG
29780 1 AGTGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGG
29781 4 AAACTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCGCATGTGATTTTAATAGCTTCTTAGGAGAATAACA
29791 1 TGTGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
29855 8 CTGTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29857 8 AATAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29858 8 AAAAAAAAAACTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29859 6 AAAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29860 7 AAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29861 6 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29862 3 AAAAATCACAAAAAAAAAAAAAAAAAAAAAAAA
29863 4 AAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29864 4 AAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29865 1 AATGAAAAAAAAAAAAAAAAAAAAAAAAAAA
29866 3 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29867 4 AAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAA
29868 2 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29869 1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29870 1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29871 0 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29872 1 AATAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29873 1 AAAAAATAAAAAAAAAAAAAAAAAAAAAAAA
In the output above, there's a huge gap between positions 29,791 and 29,855 with no reads starting from the 63 positions in between. I think it's because the reads after the gap don't actually come from SARS 2 but from some other organisms.
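The size of that gap is just arithmetic:

```shell
# number of positions strictly between 29,791 and 29,855
echo $((29855 - 29791 - 1))
# → 63
```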
The last 52 bases of Wuhan-Hu-1 before the poly(A) tail are ATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGAC, which the reads above the gap match to within a couple of mismatches. But the first read below the gap starts with CTGTT, which has three nucleotide changes from ATGAC at the corresponding position in Wuhan-Hu-1.
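The mismatch count between the two 5-base segments can be verified with a throwaway awk one-liner:

```shell
# compare CTGTT (start of the first read below the gap) with ATGAC
# (the corresponding bases of Wuhan-Hu-1) position by position
awk 'BEGIN{a="CTGTT";b="ATGAC";for(i=1;i<=length(a);i++)if(substr(a,i,1)!=substr(b,i,1))d++;print d}'
# → 3
```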
Out of the reads whose starting position is 29,700 or above, there's only one read which has five A bases after the sequence ATGAC:
$ samtools view sorted58.bam|awk '$4>=29700'|cut -f10|grep -Eo ATGACA+|sort|uniq -c|sort -n
1 ATGACAAAAA
2 ATGACAAA
2 ATGACAAAA
8 ATGACA
Wu et al.'s paper in Nature was submitted a week before MN908947.2 was published at GenBank, so maybe at the time when they submitted the Nature paper, they had not yet noticed that they had accidentally included about 600 bases of human DNA at the end of MN908947.1. (I'm not sure which of these dates are UTC and which are local time.)
In Supplementary Table 8 of Wu et al. 2020, there's a list of the PCR primers they used for PCR-based sequencing, in order to verify the results of the metagenomic de novo assembly they did with MEGAHIT (https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2008-3/MediaObjects/41586_2020_2008_MOESM1_ESM.pdf). The last reverse primer is AAAATCACATGGGGATAGCACTACT, which is shown to end at position 29,864, so I guess by the time they designed the primers, they had already noticed that they had accidentally included a piece of human DNA at the end of MN908947.1, which was 30,474 bases long. The first forward primer is CCAGGTAACAAACCAACCAACTT, and it's shown to start at position 36, which matches the coordinates in the first and second versions of Wuhan-Hu-1 but not the third version, so I guess they designed the primers based on the second version. However, another part of Supplementary Table 8 says that the "primers for WHCV detection using qPCR" were "designed based on the whole genome of WHCV (MN908947.3)", and their coordinates match the third version and not the second version. So I guess they designed different sets of primers based on different versions of the genome.
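The coordinate difference follows directly from the different 5' ends. Taking the start of versions 1 and 2 and the start of version 3 as plain strings (copied from the alignment earlier), `grep -bo` prints the 0-based offset of the forward primer, so adding 1 gives the 1-based position:

```shell
# 5' end of MN908947.1/.2 (first string) and of MN908947.3 (second string)
for s in CGGTGACGCATACAAAACATTCCCACCATACCTTCCCAGGTAACAAACCAACCAACTTTC \
         ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTC; do
  echo $s | grep -bo CCAGGTAACAAACCAACCAACTT | awk -F: '{print $1+1}'
done
# → 36 (versions 1 and 2) and 20 (version 3)
```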
Below, I ran MEGAHIT on Wu et al.'s lung metagenomic reads. When I did the trimming with fastp, my longest contig was 29,802 bases long: it missed the last 102 bases of MN908947.3 and had an insertion of one base at the beginning. When I did the trimming with Trimmomatic, my longest contig was 29,875 bases long, and now it only missed the last 30 bases of MN908947.3, even though it still had the single-base insertion at the beginning:
brew install megahit -s # `-s` compiles from source because the bottle was compiled against GCC 9 so it gave the error `dyld: Library not loaded: /usr/local/opt/gcc/lib/gcc/9/libgomp.1.dylib`
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/081/SRR10971381/SRR10971381_{1,2}.fastq.gz # download from ENA because it's faster than downloading from SRA and doesn't require installing sratoolkit
fastp -i SRR10971381_1.fastq.gz -I SRR10971381_2.fastq.gz -o SRR10971381_1.trim.fq.gz -O SRR10971381_2.trim.fq.gz -l 70
megahit -1 SRR10971381_1.trim.fq.gz -2 SRR10971381_2.trim.fq.gz -o megahutrim
brew install trimmomatic
trimmomatic PE SRR10971381_{1,2}.fastq.gz SRR10971381_{1,2}.paired.fq.gz SRR10971381_{1,2}.unpaired.fq.gz AVGQUAL:20 HEADCROP:12 LEADING:3 TRAILING:3 MINLEN:75 -threads 4 # adapter trimming was omitted here
megahit -1 SRR10971381_1.paired.fq.gz -2 SRR10971381_2.paired.fq.gz -o megahutrimmo
The paper by Wu et al. says: "In total, we generated 56,565,928 sequence reads that were de novo-assembled and screened for potential aetiological agents. Of the 384,096 contigs assembled by Megahit [9], the longest (30,474 nucleotides (nt)) had a high abundance and was closely related to a bat SARS-like coronavirus (CoV) isolate - bat SL-CoVZC45 (GenBank accession number MG772933) - that had previously been sampled in China, with a nucleotide identity of 89.1% (Supplementary Tables 1, 2)." The figure of 56,565,928 is identical to the number of reads in the SRA run deposited by Wu et al. (https://www.ncbi.nlm.nih.gov/sra/?term=SRR10971381), and 30,474 is identical to the length of the first version of Wuhan-Hu-1 at GenBank.
However when I tried aligning the reads in Wu et al.'s metagenomic dataset against the first version of Wuhan-Hu-1, I didn't get a single read which aligned against the last 585 bases. The starting position of my last aligned read was 29,807 and it ended 590 positions before the end of MN908947.1:
$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'>wuhu1.fa
$ bowtie2-build wuhu1.fa{,}
[...]
$ bowtie2 -p3 -x wuhu1.fa -1 SRR10971381_1.fastq.gz -2 SRR10971381_2.fastq.gz --no-unal|samtools sort -@2 ->sorted59.bam
[...]
$ samtools view sorted59.bam|awk -F\\t '!a[$4]++{x=$4;y=$10;sub(/.*NM:i:/,"");print x,30473-x-length(y),$1,y}'|tail|column -t
29785  591  8   GGCTGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCGTCTCTTTAGA
29787  616  4   GGTCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCAA
29788  590  7   TTTTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
29789  591  5   GTTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGA
29791  585  11  TATGGAGAGCTGCCTATATGGAAGAGCCCTAATGTATAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACA
29792  591  7   TTGGAGAGCTGCCTATATGGCAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGATGC
29793  592  6   ATGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAG
29796  594  4   AGTGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGG
29797  585  12  AAACTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCGCATGTGATTTTAATAGCTTCTTAGGAGAATAACA
29807  590  6   TGTGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
In Wu et al.'s metagenomic dataset, about half of the reads have been masked entirely, so that they consist of only N bases:
$ seqkit grep -rsp '^N+$' SRR10971381_1.fastq.gz|seqkit head -n1
@SRR10971381.2 2 length=151
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
$ seqkit grep -rsp '^N+$' SRR10971381_1.fastq|seqkit stat # out of a total of about 28 million reads, about 14 million reads consist of only N letters
file  format  type  num_seqs    sum_len        min_len  avg_len  max_len
-     FASTQ   DNA   13,934,971  2,027,338,447  35       145.5    151
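Masking reads in place preserves both the read count and the read lengths, which would keep the total of 56,565,928 reads unchanged. A hypothetical sketch of such masking on a FASTQ stream (my guess at the mechanism, not a documented procedure):

```shell
# replace every base of the read with N and every quality score with '!'
# (Phred 0), the same pattern seen in the masked reads above
printf '@r1\nACGT\n+\nIIII\n' | awk 'NR%4==2{gsub(/./,"N")}NR%4==0{gsub(/./,"!")}{print}'
```

This prints the same four-line record with the sequence line as NNNN and the quality line as !!!!.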
So it could be that by the time Wu et al. submitted the reads to the SRA, they had noticed that their initial 30,474-base contig included a segment of human DNA at the end, and they replaced the reads which matched human DNA with reads consisting of only N letters. That would explain why the number of reads at the SRA is the same as the number of reads which produced their 30,474-base contig, even though the contig cannot be reproduced from the published reads. It could also be that by the time they noticed the assembly error in MN908947.1, they were no longer able to update the main text of their paper in Nature, which might explain why the error was accounted for in Supplementary Table 8 but not in the main text of the paper.
Wu et al. wrote that about 24 million reads out of about 57 million reads remained after human reads were filtered out:
Sequencing reads were first adaptor and quality trimmed using the Trimmomatic program[32]. The remaining 56,565,928 reads were assembled de novo using both Megahit (v.1.1.3)[9] and Trinity (v.2.5.1)[33] with default parameter settings. Megahit generated a total of 384,096 assembled contigs (size range of 200–30,474 nt), whereas Trinity generated 1,329,960 contigs with a size range of 201–11,760 nt. All of these assembled contigs were compared (using BLASTn and Diamond BLASTx) against the entire non-redundant (nr) nucleotide and protein databases, with e values set to 1 × 10−10 and 1 × 10−5, respectively. To identify possible aetiological agents present in the sequencing data, the abundance of the assembled contigs was first evaluated as the expected counts using the RSEM program[34] implemented in Trinity. Non-human reads (23,712,657 reads), generated by filtering host reads using the human genome (human release 32, GRCh38.p13, downloaded from Gencode) by Bowtie2[35], were used for the RSEM abundance assessment.
However, in the STAT analysis of the reads uploaded to the SRA, only about 0.1% of the reads match Homo sapiens, about 0.8% match simians, and about 4.6% match eukaryotes. The percentage of identified reads is also unusually low, only about 39%, because the half of the reads which consist of only N bases remain unidentified:
$ curl -s 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?cluster_name=public&acc=SRR10971381'>SRR10971381.stat
$ jq -r '.[].tax_totals.total as$tot|.[].tax_table|sort_by(-.total_count)[]|(100*.total_count/$tot|tostring)+";"+.org' SRR10971381.stat|egrep 'Eukaryota|Simiiformes|Homo sapiens'
4.591841930004224;Eukaryota
0.8129416704698984;Simiiformes
0.1192095708215023;Homo sapiens
$ jq -r '.[].tax_totals|100*.identified/.total' SRR10971381.stat
39.12950566284354
In order to compare the results to other human BALF samples from patients with pneumonia, I searched the SRA for bronchoalveolar pneumonia metagenomic human. In one run, which was part of a study titled "Bronchoalveolar Lavage Fluid Metagenomic Next-Generation Sequencing in non-severe and severe Pneumonia", about 86% of the reads were identified and 76% matched simians (https://www.ncbi.nlm.nih.gov/sra/?term=SRR22183690):
$ curl -s 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?cluster_name=public&acc=SRR22183690'>SRR22183690.stat
$ jq -r '.[].tax_totals.total as$tot|.[].tax_table|sort_by(-.total_count)[]|(100*.total_count/$tot|tostring)+";"+.org' SRR22183690.stat|egrep 'Eukaryota|Simiiformes|Homo sapiens'
81.79456158724525;Eukaryota
75.61440435123265;Simiiformes
12.85625010794863;Homo sapiens
$ jq -r '.[].tax_totals|100*.identified/.total' SRR22183690.stat
86.16966070371517
In an early metagenomic run for COVID titled "Total RNA sequencing of BALF (human reads removed)", only about 0.6% of reads matched simians. The run was submitted by Wuhan University and published on January 18th 2020, and I didn't find any SARS 2 run on the SRA with an earlier publication date (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA601736). About 13% of the reads were masked so that they consisted of only N bases, which lends further weight to the theory that a similar method was used to mask human reads in Wu et al.'s set of reads:
$ curl -s 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?cluster_name=public&acc=SRR10903402'>SRR10903402.stat
$ jq -r '.[].tax_totals.total as$tot|.[].tax_table|sort_by(-.total_count)[]|(100*.total_count/$tot|tostring)+";"+.org' *02.stat|egrep 'Eukaryota|Simiiformes|Homo sapiens'
1.435508516404756;Eukaryota
0.5964291097600983;Simiiformes
$ jq -r '.[].tax_totals|100*.identified/.total' SRR10903402.stat
71.32440955587016
$ seqkit grep -srp '^N*$' SRR10903402_1.fastq.gz|seqkit head -n1
@SRR10903402.1 1 length=151
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
$ seqkit seq -s SRR10903402_[12].fastq.gz|grep '^N*$'|wc -l
177770
$ seqkit seq -s SRR10903402_[12].fastq.gz|wc -l
1353388
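The 13% figure follows from the last two counts:

```shell
# percentage of reads that consist of only N bases
awk 'BEGIN{printf "%.1f\n", 100*177770/1353388}'
# → 13.1
```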
The NCBI's Sequence Read Archive has an "Analysis" tab which shows a breakdown of what percentage of reads is estimated to come from different organisms and taxonomic nodes. It was implemented using software called the SRA Taxonomy Analysis Tool (STAT), which searches the reads for 32-base k-mers that match a set of reference sequences. A database which contains the number of reads matching each taxonomic node for each SRA run is currently about 500 GB in size, but you can query it through BigQuery on the Google Cloud Platform: https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud-based-taxonomy-analysis-table/. I tried using BigQuery to search for the oldest sequencing runs which matched over 10,000 reads of SARS 2 (taxid 2697049):
select * from `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax where m.acc=tax.acc and tax_id=2697049 and total_count>10000 order by releasedate
I got 17 results from before 2020, but none of them had a single read which aligned against Wuhan-Hu-1 when I used Bowtie2 with the default settings. The oldest result I found which had reads that aligned against SARS 2 was submitted by Wuhan University and published on January 18th 2020 (in an unknown timezone) (https://www.ncbi.nlm.nih.gov/sra/?term=SRR10903402). When I ran MEGAHIT on the reads submitted by Wuhan University without doing any trimming, my longest contig was 17,699 bases long, and it was identical to positions 12197-29895 of Wuhan-Hu-1:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/002/SRR10903402/SRR10903402_{1,2}.fastq.gz # download from ENA because downloading from SRA is too slow
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>sars2.fa
megahit -1 SRR10903402_1.fastq.gz -2 SRR10903402_2.fastq.gz -o megawuhanjan
seqkit sort -lr megawuhanjan/final.contigs.fa|seqkit head -n1|cat - sars2.fa|mafft --clustalout -
When I trimmed the reads with fastp using the default settings apart from a longer minimum length, almost half of the read pairs had an adapter sequence trimmed, so the adapter sequences may have been what prevented assembling a longer contig earlier. This time my longest contig was 29,809 bases long, and it missed only the first 76 bases and the last 19 bases of Wuhan-Hu-1 (which demonstrates that it's common for MEGAHIT contigs to miss pieces from the start or end of the genome):
$ fastp -i SRR10903402_1.fastq.gz -I SRR10903402_2.fastq.gz -o SRR10903402_1.trim.fq.gz -O SRR10903402_2.trim.fq.gz
Read1 before filtering:
total reads: 676694
total bases: 101889418
[...]
Filtering result:
reads passed filter: 1126442
reads failed due to low quality: 193854
reads failed due to too many N: 0
reads failed due to too short: 33092
reads with adapter trimmed: 287572
bases trimmed due to adapters: 12910527
[...]
$ megahit -1 SRR10903402_1.trim.fq.gz -2 SRR10903402_2.trim.fq.gz -o megawuhanjan3
[...]
2023-05-20 14:22:39 - 425 contigs, total 246369 bp, min 286 bp, max 29809 bp, avg 579 bp, N50 540 bp
2023-05-20 14:22:39 - ALL DONE. Time elapsed: 72.962219 seconds
$ seqkit sort -lr megawuhanjan3/final.contigs.fa|seqkit head -n1|cat sars2.fa -|mafft ->temp.aln
[...]
$ Rscript --no-init-file -e 'm=t(as.matrix(Biostrings::readDNAStringSet("temp.aln")));pick=which(rowSums(m!=m[,1])>0);writeLines(unlist(lapply(split(pick,cummax(c(1,diff(pick)))),\(x)paste(x[1],tail(x,1),paste(apply(m[x,],2,paste,collapse=""),collapse=" ")))))'
1 76 ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACT ----------------------------------------------------------------------------
29885 29903 AAAAAAAAAAAAAAAAAAA T------------------
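The trimmed adapters come from read-through into the Illumina sequencing adapter, whose standard TruSeq prefix is AGATCGGAAGAGC. As a toy illustration of cutting a read at that adapter (this is not how fastp works internally; fastp detects adapters by overlap analysis of the read pair, which also catches adapters containing sequencing errors):

```shell
# Toy example only: cut a read at the first occurrence of the TruSeq adapter
# prefix AGATCGGAAGAGC (the read below is made up).
echo 'ACCTTCCCAGGTAACAAACCAAGATCGGAAGAGCACACGTCT' | sed 's/AGATCGGAAGAGC.*//'
# prints ACCTTCCCAGGTAACAAACCA
```

Adapter bases left in the reads look like mismatches or chimeric joins to an assembler, which is why trimming them can allow longer contigs.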
When I did variant calling on the reads submitted by Wuhan University using Wuhan-Hu-1 as the reference, I didn't get a single variant, as expected:
brew install bowtie2 samtools bcftools
bowtie2-build sars2.fa{,.index}
x=wuhanjan;bowtie2 -p3 -x sars2.fa.index -1 SRR10903402_1.trim.fq.gz -2 SRR10903402_2.trim.fq.gz --no-unal|samtools sort -@2 ->$x.bam
samtools index $x.bam;bcftools mpileup -f sars2.fa $x.bam|bcftools call -mv -Ob -o $x.bcf;bcftools index $x.bcf;bcftools consensus -f sars2.fa $x.bcf>$x.fa
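Since `bcftools call -mv` emits only variant sites, "not a single variant" just means the call step produced a file with no records after the header. As a minimal sketch of what an empty call set looks like in VCF text form (the toy VCF below is made up):

```shell
# A VCF with only header lines (starting with '#') contains zero variant calls.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > toy.vcf
grep -vc '^#' toy.vcf || true # grep exits non-zero when the count is 0, hence the || true
# prints 0
```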
Some of the reads aligned by Bowtie2 had a longer poly(A) tail than Wuhan-Hu-1, so I tried aligning the untrimmed reads against a version of Wuhan-Hu-1 with an extended poly(A) tail. I found one read where the poly(A) tail started 73 bases from the end of the read, even though it had a bunch of non-A bases near the end and only 48 contiguous A bases. But I also got another read with 47 contiguous A bases, another with 39, and another with 38. And many of the long segments of A bases are preceded by GAATGAC, which are also the last bases of Wuhan-Hu-1 before the poly(A) tail. So I guess the sample may have included SARS 2 genomes which actually had a poly(A) tail longer than 33 bases:
$ (cat sars2.fa;echo AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)>sars2a.fa
$ bowtie2-build sars2a.fa{,.index}
[...]
$ bowtie2 -x sars2a.fa.index -1 SRR10903402_1.fastq.gz -2 SRR10903402_2.fastq.gz --no-unal -p3|samtools sort -@2 ->sorted54.bam
[...]
$ samtools view sorted54.bam|cut -f4,10,17|tail -n8|column -t
29755  ACGTGTGCTCTTCCGATCTGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATTTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTTTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAA  NM:i:14
29756  AGTGTACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTTTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTTTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAATCAAACACA  NM:i:6
29757  GTATCAAGTAAAAAATGTTAGGGAGAATTGCCTAAATGGAAAAGCCATAATTTTTAAAAATAATTTTAGTAGTGTTATCCCCATGTAATTTTAAAAATTTCTTAGGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAG  NM:i:24
29759  GTACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAAAAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAACAAAAAAAAAAAATCAAACACAAA  NM:i:6
29761  ACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCTTAATGTGTAAAATTAATGTTAGGAGTGCTATCTCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAATGATATGAATAGCGTTGGTTA  NM:i:19
29762  AAGGAAAAAAAGCTAGGAAGAGATGCCTAAATGGAAAAAACAAAATTTTAAAAATTAATTTTAGGGGGGGTATCCCCATGTGATTTTAATAATTTTTTAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAA  NM:i:28
29765  TGAACAATGCTAGGGGGAGCTGCCTATATGGAAAAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAAAAGCTTCTTAGGAGAAAGACAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAATCAAACAC  NM:i:9
29773  GCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAACGGAAGAGCGGCGGAGAGGAAA  NM:i:15
29773  GCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGGAAAAGCAACGGA  NM:i:8
29793  TGGAAGAGCCCTAATGAGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAAAAGCAAGTGGAGGAAAAAAA  NM:i:11
$ samtools view sorted54.bam|cut -f10|tail -n100|grep -Eo A+|awk '{print length}'|sort -n|tail # longest consecutive segments of A bases among last 100 reads
26
27
27
27
28
34
38
39
47
48
The last read shown above has 48 consecutive A bases. Its reverse mate has 47 consecutive T bases, but at a different spot, so that there are only 4 bases before them, whereas in the forward read there are 25 bases after the segment of contiguous A bases:
$ seqkit grep -p $(samtools view sorted54.bam|tail -n1|cut -f1) SRR10903402_1.fastq.gz|seqkit seq -s
TTGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTAAAAAACTTTTAAAATCAATTGGGGTTAGCACTACTAAAATTAATTTTACAAATTTGGGCTTTTCCAAAATGGGAAAAACACACGTCTAAACCCCAG
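The mate's run of T bases is what you would expect from reverse-complementing a poly(A) tail. A quick sanity check with standard tools, using a made-up 17-base snippet consisting of the GAATGAC motif followed by ten A bases:

```shell
# Reverse complement with rev + tr: a trailing poly(A) run becomes a leading poly(T) run.
seq=GAATGACAAAAAAAAAA
printf '%s\n' "$seq" | rev | tr ACGT TGCA
# prints TTTTTTTTTTGTCATTC
```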
Some people have claimed that it is somehow problematic that the full poly(A) tail of Wuhan-Hu-1 cannot be reproduced from the metagenomic sequencing run of Wu et al. 2020. However, I have shown above that there is another early metagenomic sequencing run of SARS 2 whose reads do feature a long poly(A) tail.
In order to test whether the genome of SARS 2 can be assembled from other metagenomic datasets, I searched the SRA for human bronchoalveolar pneumonia: https://www.ncbi.nlm.nih.gov/sra/?term=human%20bronchoalveolar%20pneumonia.
I picked the first result, which was from a study titled "Bronchoalveolar Lavage Fluid Metagenomic Next-Generation Sequencing in non-severe and severe Pneumonia": https://www.ncbi.nlm.nih.gov/sra/?term=SRR22183690. When I ran MEGAHIT with the default settings, I got a total of about 500 contigs with a maximum length of about 8,000 bases. When I aligned the contigs against a file of about 15,000 virus reference sequences, only 14 contigs aligned, none of which matched SARS 2:
$ wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR221/090/SRR22183690/SRR22183690_{1,2}.fastq.gz
$ brew install megahit -s # `-s` compiles from source because the bottle was compiled against GCC 9 so it gave me the error `dyld: Library not loaded: /usr/local/opt/gcc/lib/gcc/9/libgomp.1.dylib`
[...]
$ megahit -1 SRR22183690_1.fastq.gz -2 SRR22183690_2.fastq.gz -o megapneumonia
[...]
$ seqkit stat megapneumonia/final.contigs.fa
file                            format  type  num_seqs  sum_len  min_len  avg_len  max_len
megapneumonia/final.contigs.fa  FASTA   DNA   508       230,544  213      453.8    8,451
$ brew install bowtie2 seqkit gnu-sed
[...]
$ wget -q https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
$ seqkit fx2tab viral.1.1.genomic.fna.gz|sed $'s/A*\t$//'|seqkit tab2fx>viral.fa
$ bowtie2-build viral.fa{,}
[...]
$ bowtie2 -p3 -x viral.fa -fU megapneumonia/final.contigs.fa --no-unal|samtools sort ->megapneumonia.bam
[...]
$ x=megapneumonia.bam;samtools coverage $x|awk \$4|cut -f1,3-6|(gsed -u '1s/$/\terr%\tname/;q';sort -rnk4|awk -F\\t -v OFS=\\t 'NR==FNR{a[$1]=$2;next}{print$0,sprintf("%.2f",a[$1])}' <(samtools view $x|awk -F\\t '{x=$3;n[x]++;len[x]+=length($10);sub(/.*NM:i:/,"");mis[x]+=$1}END{for(i in n)print i"\t"100*mis[i]/len[i]}') -|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print$0"\t"a[$1]}' <(seqkit seq -n viral.fa|gsed 's/ /\t/;s/, .*//') -)|column -ts$'\t'
#rname       endpos   numreads  covbases  coverage    err%  name
NC_022518.1  9472     7         2334      24.641      5.13  Human endogenous retrovirus K113 complete genome
NC_050152.1  101660   1         250       0.245918    0.00  Enterobacteria phage P7
NC_032111.1  163005   1         143       0.0877274   3.47  BeAn 58058 virus
NC_018464.1  927      1         87        9.38511     0.45  Shamonda virus N and NSs genes
NC_007610.1  131384   1         45        0.0342507   0.22  Listeria bacteriophage P100
NC_037667.1  2077288  1         36        0.00173303  0.00  Pandoravirus quercus
NC_021858.1  1908524  1         34        0.00178148  0.00  Pandoravirus dulcis
NC_011189.1  3418     1         34        0.994734    0.00  Mikania micrantha mosaic virus RNA2
The contigs which matched human endogenous retrovirus K113 probably just matched the human genome. When I did a BLAST search for the longest contig, it got a 98.9% identical match to both some human endogenous retroviruses and to human DNA. When I downloaded other metagenomic sequencing runs for human lung or BALF samples from SRA and aligned the reads against virus refseqs, human endogenous retrovirus K113 got the highest coverage for most runs, including SRR22183691, SRR22183705, SRR1553689, and SRR22183702.
In 2022, someone who called themselves "a mathematician from Hamburg, who would like to remain unknown" published a paper titled "Structural analysis of sequence data in virology: An elementary approach using SARS-CoV-2 as an example" (https://impfen-nein-danke.de/u/structural_analysis_of_sequence_data_in_virology.pdf, https://impfen-nein-danke.de/u/structural_analysis_of_sequence_data_in_virology_tables_and_figures.pdf). He wrote that when he aligned Wu et al.'s reads with BBMap against reference genomes of HIV, hepatitis delta, and measles, he was able to get the reads to cover almost the entire reference genomes, which Stefan Lanka's fans used as evidence for their view that the genomes of viruses are fake. But actually the reads which aligned against HIV, measles, and hepatitis delta were either short reads or they had a high percentage of mismatches, because he used very loose alignment criteria with BBMap which tolerated a high number of mismatches (k=8 minratio=0.1). And when the mathematician from Hamburg filtered his reads with reformat.sh, he also used a low required minimum length (37) and an extremely low required identity percentage (60%):
Basically, we mapped the paired-end reads (2x151 bp) with BBMap [23] to the reference sequences we considered (Tables and Figures: Table 3) using relatively unspecific settings. We then varied the minimum length (M1) and minimum (nucleotide) identity (M2) with reformat.sh to obtain corresponding subsets of the previously mapped sequences with appropriate quality. Increasing minimum length M1 or minimum nucleotide identity M2 thereby increases the significance of the respective mapping. Subsequently, we formed consensus sequences with the respective subsets of selected quality with respect to the selected reference. We set all bases with a quality lower than 20 to "N" (unknown). A quality of 20 means an error rate of 1% per nucleotide, which can be considered sufficient in the context of our analyses. Finally, the assessment of the agreement between reference and consensus sequences was performed using BWA [24], Samtools [25], and Tablet [26]. The ordered pair (M1; M2) = (37; 0.6) was just chosen to give error rates F1 and F2, respectively, of less than 10% for reference LC312715.1. The results of all calculations performed are shown in Tables and Figures: Table 4. The calculations show the highest significance for the choice of the ordered pair (37; 0.6), which can be seen from the highest error rates in each case. Comparable significance is provided by the ordered pairs (47; 0.50) and (25; 0.62). While the genome sequences associated with coronaviruses show error rates approximately above 10% for all ordered pairs considered (M1; M2), the error rates of the two sequences LC312715.1 (HIV) and NC_001653.2 (Hepatitis delta) are below 10% and decrease further for the ordered pairs (32; 0.60) and (30; 0.60).
Here you can see that the mode length of his reads which aligned against Wuhan-Hu-1 (MN908947.3) was around 150 bases, but the mode length of the reads which aligned against HIV (LC312715.1), hepatitis delta (NC_001653.2), measles (AF266291.1), and a randomly generated reference sequence (rnd_uniform) was around 35-40 bases:
In the figure above, Wuhan-Hu-1 (MN908947.3) has two peaks for the length of the aligned reads, where the first peak around length 37 looks almost as high as the second peak around length 150. However, the y-axis is cut off so you can't see the full height of the second peak. (Also, I don't understand why the plot above shows some reads longer than 151 bases aligning against Wuhan-Hu-1, because the maximum length of Wu et al.'s reads is 151 bases. Maybe the mathematician from Hamburg counted the gaps in the middle of the aligned reads as part of the reads, or maybe he calculated read lengths from the CIGAR strings by adding together all numbers in the CIGAR string without skipping deletions. His paper indicated that he ran BBMap with the maxindel=0 parameter, which doesn't allow indels, but he may have forgotten to include the parameter in the run where he used Wuhan-Hu-1 as a reference.)
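To illustrate the CIGAR hypothesis (my own speculation, with a made-up CIGAR string): summing every number in a CIGAR counts deleted reference bases as part of the read, whereas the read length is the sum of only the operations that consume read bases (M/I/S/=/X):

```shell
# Hypothetical illustration with a made-up CIGAR string: a 151-base read
# aligned across a 10-base deletion has CIGAR 100M10D51M. Summing every
# number gives 161, but the read itself is only 100+51=151 bases, because
# D (deletion) consumes reference bases, not read bases.
awk -v c=100M10D51M 'BEGIN{
  s=c; naive=0                      # naive: add every number in the CIGAR
  while(match(s,/[0-9]+/)){naive+=substr(s,RSTART,RLENGTH); s=substr(s,RSTART+RLENGTH+1)}
  s=c; correct=0                    # correct: only ops that consume read bases
  while(match(s,/[0-9]+[MIDNSHP=X]/)){
    n=substr(s,RSTART,RLENGTH-1); op=substr(s,RSTART+RLENGTH-1,1)
    if(op~/[MIS=X]/)correct+=n
    s=substr(s,RSTART+RLENGTH)
  }
  print "naive sum: "naive", read length: "correct
}'
```

For `100M10D51M` the naive sum is 161 while the true read length is 151, so the naive method would report reads longer than 151 bases.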
I tried aligning the reads from Wu et al. against Wuhan-Hu-1 by running mapPacBio.sh and reformat.sh with the same parameters as the mathematician from Hamburg, but I got about 11 times as many reads of length 151 as reads of length 37:
Next I tried aligning Wu et al.'s reads against other reference genomes, where I again ran BBMap and reformat.sh with the same parameters as the mathematician from Hamburg:
brew install --cask miniconda;conda init ${SHELL##*/}
conda install -c bioconda bbmap parallel-fastq-dump
brew install fastp
parallel-fastq-dump --gzip --threads 10 --split-files --sra-id SRR10971381 # `fastq-dump` is slow and `fasterq-dump` doesn't support gzip
fastp -i SRR10971381_1.fastq.gz -I SRR10971381_2.fastq.gz -o SRR10971381_1.trim.fq.gz -O SRR10971381_2.trim.fq.gz # trim adapter sequences, trim low-quality ends of reads, discard reads shorter than 15 bases, discard reads with 5 or more `N` letters
printf %s\\n MN908947.3\ sars2 LC312715.1\ hiv NC_001653.2\ hepatitis AF266291.1\ measles>refs
cat refs|while read l m;do curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=$l">ref.$m.fa;done
gshuf -rn30000<<<$'A\nC\nG\nT'|tr -d \\n|awk 'BEGIN{print">random"}1'>ref.random.fa
for x in $(cut -d\ -f2 refs) random;do mapPacBio.sh ref=ref.$x.fa in=SRR10971381_1.trim.fq.gz in2=SRR10971381_2.trim.fq.gz outm=$x.mapped.sam vslow k=8 maxindel=0 minratio=0.1;done
for x in $(cut -d\ -f2 refs) random;do reformat.sh in=$x.mapped.sam out=$x.filtered.sam minlength=37 idfilter=60 ow=t;done
for x in $(cut -d\ -f2 refs) random;do awk -F\\t '/NM:i:/{l=length($10);sub(/.*NM:i:/,"");sub("\t.*","");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f $x.filtered.sam|sed s/^/$x\ /;done>lengthcounts
The plot below shows the result of my BBMap alignment. If you compare SARS 2 to HIV, hepatitis, or the random sequence, they all have a similar mismatch rate and similar number of aligned reads at the lowest read lengths, but after around length 55, the mismatch rate of SARS 2 starts to drop until it reaches about 2% at length 151. However the mismatch rate of the other sequences remains at around 40% even for longer reads:
Hepatitis delta has an extremely short genome of only about 1,700 bases, so in the plot above it gets fewer aligned reads at lengths 37 to 60 than the longer genomes. I don't know why the difference isn't bigger, though, given that SARS 2 and the random 30-kb genome are almost 20 times longer than hepatitis delta. It might be because the longer sequences are more likely to include multimapped reads, which I omitted from this analysis, or because I ran BBMap with such loose criteria that even against a short reference genome, random short reads are likely to align somewhere.
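For comparison, if chance alignments scaled linearly with reference length, the 30-kb sequences should collect roughly 18 times as many spurious short reads as the roughly 1.7-kb hepatitis delta genome (my own back-of-the-envelope arithmetic):

```shell
# Naive expectation: chance alignments proportional to reference length.
awk 'BEGIN{printf "%.1f-fold\n", 30000/1700}'
# prints 17.6-fold
```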
(The number of mismatches is indicated by the NM tag in the SAM file format. For some reason BBMap doesn't include the NM tag for multimapped reads, i.e. reads that can map to two or more spots in the reference genome, which are indicated by a zero in the fifth (MAPQ) column of the SAM file. Therefore I omitted multimapped reads from the plot above in order to calculate the average number of mismatches correctly.)
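A minimal made-up SAM fragment shows why this matters: the second read below has MAPQ 0 in the fifth column and no NM tag, so a mismatch average has to count only the lines where the tag is present:

```shell
# Two fake SAM lines: r1 has NM:i:1, r2 (MAPQ 0, multimapped) has no NM tag.
printf 'r1\t0\tref\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:1\nr2\t0\tref\t99\t0\t5M\t*\t0\t0\tACGTA\tIIIII\n' |
awk -F'\t' '/NM:i:/{n++;sub(/.*NM:i:/,"");mis+=$0}END{print n" read(s) with NM, "mis" mismatch(es)"}'
```

This prints `1 read(s) with NM, 1 mismatch(es)`: the multimapped read contributes neither to the mismatch sum nor to the read count.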
Here's R code for generating the plot above:
library(ggplot2)

t=read.table("lengthcounts")
t2=rbind(t,expand.grid(V1=unique(t$V1),V2=unique(t$V2),V3=0,V4=NA))
t=t2[!duplicated(t2[,1:2]),]
group=factor(t$V1,levels=unique(t$V1))
color=c("black",hcl(c(0,210,120)+15,60,55),"gray60")
xbreak=10 # spacing of x-axis breaks
ybreak2=8
xstart=xbreak*floor(min(t$V2)/xbreak)
xend=xbreak*ceiling(max(t$V2)/xbreak)
yend2=48
zeroshift=1
log=log10(t$V3)+zeroshift
log[log==-Inf]=0
yend=ceiling(max(log))
ylabel=c(0,1,10,100,paste0("1e",3:ceiling(max(log-zeroshift))))
mismatch=t$V4*yend/yend2
label=c("SARS 2 (MN908947.3) (30kb)","HIV-1 (LC312715.1) (9kb)","Hepatitis delta (NC_001653.2) (2kb)","Measles (AF266291.1) (16kb)","Random sequence (30kb)")
ystep=yend/15
labels=data.frame(x=.25*(xend-xstart),y=seq(.3*yend,by=-ystep,length.out=length(label)),label)

ggplot(t,aes(x=V2,y=log,color=group))+
geom_line(linewidth=.3)+
geom_point(aes(y=mismatch),size=.2)+
geom_label(data=labels,aes(x=x,y=y,label=label),fill=alpha("white",.7),label.r=unit(0,"lines"),label.padding=unit(.04,"lines"),label.size=0,color=color,size=2.2,hjust=0)+
scale_x_continuous(limits=c(xstart,xend),breaks=seq(0,xend,xbreak),expand=c(0,0))+
scale_y_continuous(limits=c(0,yend),breaks=c(0,zeroshift+seq(0,5,1)),labels=ylabel,expand=expansion(mult=c(.02,0)),sec.axis=sec_axis(trans=~.*yend2/yend,breaks=seq(0,yend2,ybreak2),name="Percentage of mismatches (dots)"))+
labs(x="Length of read",y="Number of aligned reads (lines)")+
scale_color_manual(values=color)+
theme(
axis.text=element_text(size=6,color="black"),
axis.ticks=element_blank(),
axis.ticks.length=unit(0,"in"),
axis.title=element_text(size=8),
legend.position="none",
panel.background=element_rect(fill="white"),
panel.border=element_rect(color="gray85",fill=NA,linewidth=.4),
panel.grid.major=element_line(color="gray85",linewidth=.2),
panel.grid.minor=element_blank(),
plot.background=element_rect(fill="white"),
plot.margin=margin(.05,.15,.05,.1,"in")
)
ggsave("1.png",width=4,height=2.5)
The code below selects reads that aligned against SARS 2, that survived the length filtering done with reformat.sh, and that are not multimapped reads missing the NM tag. It shows that the average percentage of mismatches per aligned read was 37.5% for reads of length 37 but only 2.0% for reads of length 151. There were a total of 4,856 aligned reads of length 37 and 62,766 aligned reads of length 151:
$ awk -F\\t '/NM:i:/{l=length($10);sub(/.*NM:i:/,"");sub("\t.*","");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f sars2.filtered.sam|(gsed -u 4q;echo ...;tail -n4)
37 4856 37.5
38 5010 38.0
39 3896 37.8
40 4363 38.2
...
148 2316 2.3
149 6084 2.0
150 32431 2.2
151 62766 2.0
However in the case of the reads that got aligned against the HIV reference genome, there were only 14 aligned reads of length 151 and even those reads had 41% mismatches on average:
$ awk -F\\t '/NM:i:/{l=length($10);sub(/.*NM:i:/,"");sub("\t.*","");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f hiv.filtered.sam|(gsed -u 4q;echo ...;tail -n4)
37 4460 37.2
38 4471 37.5
39 3584 37.4
40 3843 37.8
...
148 2 41.2
149 3 40.7
150 16 41.1
151 14 40.6
In the last sentence of the following paragraph, I don't know if the mathematician from Hamburg meant to say 120,000 long reads instead of 120,000 short reads, because otherwise it doesn't make sense:
By excluding all mappable sequences longer than 100 nucleotides, essentially the approximately 120,000 reads associated with SARS-CoV-2 were removed. The coverage distribution of the remaining short sequences now appears random, analogous to Figure 13. Again, this correlates with high error rates R1 (29.90%) and R2 (29.96%). This indicates that no significant structure of reference MN908947.3 is included in the published sequences, except for the approximately 120,000 (Tables and Figures. Table 1) associated short reads.
He got about 120,000 mapped reads with an average length of around 146 when he used Bowtie2, but only about 60,000 reads with an average length of around 46 when he used BBMap with reformat.sh maxlength=100:
However, in my previous plot shown above, even at lengths 60-100 the reads that map to SARS 2 have a much lower mismatch rate than the reads that map to HIV, hepatitis delta, measles, or the random sequence. So you shouldn't just look at the number of mapped reads or the coverage but also at the mismatch rate.
When I tried aligning Wu et al.'s reads with Bowtie2, I got zero reads that aligned against HIV, hepatitis delta, measles, or the random sequence. For SARS 2 I got 121,779 aligned reads, which was identical to the figure reported by the Hamburg mathematician, because both of us did the trimming with fastp using the default parameters.
With Bowtie2, the shortest reads that aligned against SARS 2 were 31 bases long. There were fewer than a hundred reads at each length below 89, and their error rate was much lower than in the case of BBMap. However, after around length 100-120, the results of Bowtie2 start to converge with those of BBMap:
I ran Bowtie2 with the default parameters:
bowtie2-build MN908947.1.fa{,.index}
bowtie2 -p3 -x MN908947.1.fa.index -1 SRR10971381_1.trim.fq.gz -2 SRR10971381_2.trim.fq.gz --no-unal|samtools sort -@2 ->sorted30.bam
I created a random 30,000-base reference sequence and a thousand random reads for each length between 30 and 150. 37% of the reads of length 30 got aligned by BBMap, but I got zero aligned reads above length 70. The proportion of mismatches was around 40% for most read lengths but 59% for length 70. This time I didn't get any multimapped reads so I didn't have to exclude them:
$ gshuf -e A -e C -e G -e T -rn30000|printf %s\\n \>ref $(paste -sd\\0 -)>ref.fa
$ n=1000;for i in {30..150};do gshuf -e A -e C -e G -e T -rn$[i*n]|awk '{ORS=NR%i==0?RS:""}1' i=$i|paste <(seq $n|sed s/^/$i./) -;done|seqkit tab2fx>reads.fa
$ bbmap.sh ref=ref.fa in=reads.fa outm=reads.mapped.sam vslow k=8 maxindel=0 minratio=0.1
[...]
$ awk -F\\t '/^[^@]/{l=length($10);sub(/.*NM:i:/,"");sub(/\t.*/,"");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f reads.mapped.sam
30 372 40.6
31 312 40.8
32 230 42.1
33 196 41.0
34 180 41.5
35 137 41.6
36 104 41.3
37 89 41.2
38 73 41.5
39 53 41.5
40 41 39.9
41 29 43.5
42 24 40.3
43 24 41.4
44 15 43.2
45 14 40.8
46 13 41.3
47 10 41.7
48 6 41.7
49 7 42.6
50 3 40
51 3 43.8
52 5 43.8
53 3 40.9
54 5 41.5
55 2 42.7
56 3 42.3
57 1 40.4
70 1 58.6
When I filtered out reads with an identity percentage below 70, I was left with only 39 reads:
$ reformat.sh in=reads.mapped.sam out=reads.filtered.sam idfilter=70 ow=t
[...]
$ awk -F\\t '/^[^@]/{l=length($10);sub(/.*NM:i:/,"");sub(/\t.*/,"");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f reads.filtered.sam
30 17 28.6
31 11 28.4
32 2 23.4
33 2 27.3
34 5 28.2
35 1 28.6
38 1 28.9
In the code blocks above, I ran BBMap with the same parameters that were used by the mathematician from Hamburg (vslow k=8 maxindel=0 minratio=0.1). But when I ran BBMap with its default parameters, none of the random reads mapped to the random reference genome.
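A quick simulation of my own (independent of BBMap) suggests why the spurious alignments cluster around 40% mismatches: when a random read is placed at its best-matching offset among roughly 30,000 candidate positions in a random reference, the best placement matches far more than the 25% of bases you would expect for a single random placement, yet still far less than BBMap's default minid=0.76:

```shell
# My own simulation, not BBMap: find the best ungapped placement of one random
# 30-base read in a random 30,000-base reference and report its identity.
awk 'BEGIN{
  srand(1); b="ACGT"
  for(i=1;i<=30000;i++)ref=ref substr(b,int(rand()*4)+1,1)
  for(i=1;i<=30;i++)read=read substr(b,int(rand()*4)+1,1)
  best=0
  for(p=1;p+29<=length(ref);p++){
    m=0
    for(j=1;j<=30;j++)if(substr(ref,p+j-1,1)==substr(read,j,1))m++
    if(m>best)best=m
  }
  printf "best of %d placements: %d/30 bases match (%.0f%%)\n",length(ref)-29,best,100*best/30
}'
```

The best placement typically lands around 55-60% identity, which is below the default minid=0.76 but far above minratio=0.1, so only the very loose settings let such random reads map at all.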
When I googled for "minratio=0.1" "bbmap", I found the following usage example in the BBTools user guide (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/):
To map with super-high sensitivity (useful for very-low-quality data, or remote homologies):
mapPacBio.sh in=reads.fq out=mapped.sam vslow k=8 maxindel=200 minratio=0.1
The list of shell commands in the PDF by the Hamburg mathematician featured this BBMap command which was used to align the reads: mapPacBio.sh in=SRR10971381_1.fastq in2=SRR10971381_2.fastq outm=mapped.sam vslow k=8 maxindel=0 minratio=0.1. So maybe he copied the command from the BBTools user guide, because he even used the same output file name mapped.sam, he listed the parameters in the same order, and he used mapPacBio.sh instead of bbmap.sh.
The value of the k-mer length parameter can range from 8 to 15, where the default value is k=13 and lower values are more sensitive. So the mathematician from Hamburg used the most sensitive possible k-mer length.
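The effect of the k-mer length on seed specificity is easy to estimate with my own back-of-the-envelope arithmetic, assuming random sequence: a read of length L contains L-k+1 k-mers, and each k-mer is expected to occur about genome_size/4^k times in the reference by chance:

```shell
# Expected chance seed hits per 151-base read against a 30,000-base reference:
# (151-k+1) k-mers per read, each matching ~30000/4^k random positions.
awk 'BEGIN{
  g=30000; L=151
  for(k=8;k<=13;k+=5)
    printf "k=%d: ~%.2f chance seed hits per read\n", k, (L-k+1)*g/4^k
}'
```

With k=8 every read is expected to produce dozens of spurious seed hits (about 66 for a 30-kb reference), whereas with the default k=13 most reads produce none at all.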
The minratio parameter is no longer listed in the help message of BBMap, but I think that's because it has the same effect as setting minid on a different scale, and the developer of BBMap recommended using minid instead. The default value of minid is 0.76, which is approximately equal to minratio=0.56. The developer of BBMap wrote (https://www.biostars.org/p/376782/):
Sorry for the confusion; they set the same thing, just with different units. "minratio" is the internal number using the internal scoring system; if you use the "minid" flag it converts it to the equivalent minratio. So if you set "minid=0.76 " then it will be converted to approximately "minratio=0.56" and then the minratio variable will be set to 0.56. So there is no reason to set both of them. If you do set both of them, the last value will be used. I recommend minid over minratio because it is easier to read, but they accomplish the same thing.
"minid" is approximate since it adjusts minratio, or the ratio of the maximum possible internal score, but the internal score is not directly related to percent identity because of different penalties for substitutions, deletions, and insertions, which vary by length. If you need an exact %identity cutoff, use "idfilter=0.76" which will eliminate all alignments below 76% identity, as a postfilter. "minratio/minid" actually affect mapping, and prevent the creation of any alignments with an internal score below "minratio". As such, a high value of minid will make the program faster and a low value will make it slower. Whereas "idfilter" does not affect speed.
That said, I don't recommend using "idfilter" unless you have a good reason (like, a paper with a graph showing %identity cutoffs, for clarity; not for something like variant calling). "minid/minratio" are proportional to the internal score which should better model actual mutations or evolution, by trying to minimize the number of independant events rather than the edit distance; "idfilter=0.5", for example, will disallow a 100bp read from mapping with a 400bp deletion in the middle, since that would be only 20% identity. "minid=0.5" would allow this mapping because the internal score would actually still be quite high (maybe 0.9 of the maximum possible ratio).
The mathematician from Hamburg wrote:
Alignment with the nucleotide database on 05/12/2021 showed a high match (98.85%) with "Homo sapiens RNA, 45S pre-ribosomal N4 (RNA45SN4), ribosomal RNA" (GenBank: NR_146117.1, dated 04/07/2020). This observation contradicts the claim in [1] that ribosomal RNA depletion was performed and human sequence reads were filtered using the human reference genome (human release 32, GRCh38.p13). Of particular note here is the fact that the sequence NR_146117.1 was not published until after the publication of the SRR10971381 sequence library considered here.
But actually the date shown by the NCBI's website is the date when the entry was last modified, and the original publication date of NR_146117.1 is in 2017. You can see the original publication date in the revision history by clicking the small "GenBank" menu in the top left corner and then selecting "Revision History" (https://www.ncbi.nlm.nih.gov/nuccore/NR_146117.1?report=girevhist).