Assembly of Wuhan-Hu-1 and notes about a paper by a mathematician from Hamburg

Contents

Assembly of Wuhan-Hu-1

Different versions of Wuhan-Hu-1 at GenBank

The standard reference genome of SARS 2 is known as Wuhan-Hu-1, and it was described in a paper by Wu et al. titled "A new coronavirus associated with human respiratory disease in China" (https://www.nature.com/articles/s41586-020-2008-3). There are currently three versions of Wuhan-Hu-1 at GenBank: MN908947.1 is 30,474 bases long and was published on January 12th 2020, MN908947.2 is 29,875 bases long and was published on January 14th 2020, and MN908947.3 is 29,903 bases long and was published on January 17th 2020 (https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3?report=girevhist). I'm not sure if the dates are UTC or local time.

The only differences between the three versions are at the beginning and end of the sequences:

$ brew install mafft
[...]
$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.'{1,2,3}>wuhu.fa
$ mafft --globalpair --clustalout --thread 4 wuhu.fa
[...]
CLUSTAL format alignment by MAFFT G-INS-1 (v7.490)


MN908947.1      cggtgacgcatacaaaacattcccaccataccttcccaggtaacaaaccaaccaactttc
MN908947.2      cggtgacgcatacaaaacattcccaccataccttcccaggtaacaaaccaaccaactttc
MN908947.3      -----------attaaaggtt-----tataccttcccaggtaacaaaccaaccaactttc
                           *. *** .**     .*********************************

[... 29,760 bases omitted from the middle of the alignment]

MN908947.1      aatgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttcctgg
MN908947.2      aatgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttctt-----
MN908947.3      aatgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttaggag
                *******************************************************

MN908947.1      gatcttgatttcaacagcctcttcctcatactattctcaacactactgtcagtgaacttc
MN908947.2      ------------------------------------------------------------
MN908947.3      aatgacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa---------------------


MN908947.1      taaaaatggattctgtgtttgaccgtaggaaacatttcagtattccttatctttagaaac
MN908947.2      ------------------------------------------------------------
MN908947.3      ------------------------------------------------------------


MN908947.1      ctctgcaactcaagtgtctctgccaaagagtcttcaaggtaatgtcaaataccataacat
MN908947.2      ------------------------------------------------------------
MN908947.3      ------------------------------------------------------------


MN908947.1      attttttaagtttttgtatacttcctaggaatatatgtcatagtatgtaaagaagatctg
MN908947.2      ------------------------------------------------------------
MN908947.3      ------------------------------------------------------------


MN908947.1      attcatgtgatctgtactgttaagaaacacttcaaaggaaaacaacctctcatagatctt
MN908947.2      ------------------------------------------------------------
MN908947.3      ------------------------------------------------------------


MN908947.1      caaatcctgttgttcttgagttggaatgtacaatattcaggtagagatgctggtacaaaa
MN908947.2      ------------------------------------------------------------
MN908947.3      ------------------------------------------------------------


MN908947.1      aaagactgaatcatcaggctattagaaagataaactaggcctctaacatagtaactttaa
MN908947.2      ------------------------------------------------------------
MN908947.3      ------------------------------------------------------------


MN908947.1      cagtttgtacttatacatattttcacattgaaatatagttttattcatgactttttttgt
MN908947.2      ------------------------------------------------------------
MN908947.3      ------------------------------------------------------------


MN908947.1      tttagcttctctgtcttccattatttcaagctgctaaaaattaaaaatatcctatagcaa
MN908947.2      ------------------------------------------------------------
MN908947.3      ------------------------------------------------------------


MN908947.1      agggctatggcatctttttgtaaaaataaggaaagcaaggttttttgataatc
MN908947.2      -----------------------------------------------------
MN908947.3      -----------------------------------------------------

The first and second versions started with the 27-base segment CGGTGACGCATACAAAACATTCCCACC, which was changed to ATTAAAGGTTT in the third version. If you remove the first three bases from CGGTGACGCATACAAAACATTCCCACC, the remainder is identical to a 24-base segment that's included near the end of all three versions of Wuhan-Hu-1:

$ seqkit locate -p tgacgcatacaaaacattcccacc wuhu.fa|column -t
seqID       patternName               pattern                   strand  start  end    matched
MN908947.1  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       4      27     tgacgcatacaaaacattcccacc
MN908947.1  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       29360  29383  tgacgcatacaaaacattcccacc
MN908947.2  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       4      27     tgacgcatacaaaacattcccacc
MN908947.2  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       29360  29383  tgacgcatacaaaacattcccacc
MN908947.3  tgacgcatacaaaacattcccacc  tgacgcatacaaaacattcccacc  +       29360  29383  tgacgcatacaaaacattcccacc
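The prefix relationship itself can be checked without seqkit, using plain cut:

```shell
# Drop the first three bases of the 27-base 5' segment; what remains is the
# 24-base segment that seqkit found near the 3' end of all three versions.
echo CGGTGACGCATACAAAACATTCCCACC | cut -c4-
# prints TGACGCATACAAAACATTCCCACC
```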

I have found several similar assembly errors in viral sequences, where a short segment at either end of the sequence consists of another part of the genome or its reverse complement. For example, in a SARS 1 sequence titled "SARS coronavirus SoD" (AY461660.1), there's a 24-base segment at the 3' end which is the reverse complement of another 24-base segment slightly earlier, also near the 3' end:

$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=AY461660.1'>sod.fa
$ seqkit grep -nrp AY461660 sod.fa|seqkit locate -p cacattagggctcttccatatagg|column -t
seqID       pattern                   strand  start  end    matched
AY461660.1  cacattagggctcttccatatagg  +       29692  29715  cacattagggctcttccatatagg
AY461660.1  cacattagggctcttccatatagg  -       29629  29652  cacattagggctcttccatatagg
$ seqkit subseq -r 29629:29652 sod.fa # the match on the reverse strand is a reverse complement of the match on the positive strand
>AY461660.1 SARS coronavirus SoD, complete genome
CCTATATGGAAGAGCCCTAATGTG
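The reverse-complement relationship can also be verified with plain Unix tools, using rev to reverse the string and tr to swap the bases:

```shell
# Reverse-complement the subsequence at 29629:29652; the result should equal
# the plus-strand match cacattagggctcttccatatagg (upper-cased here).
echo CCTATATGGAAGAGCCCTAATGTG | rev | tr ACGT TGCA
# prints CACATTAGGGCTCTTCCATATAGG
```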

These assembly errors are easy to spot if you do a multiple sequence alignment of several viral sequences and look for inserts at the beginning or end of the alignment which are included in only one sequence but missing from all the others. Then you can do a BLAST search for the insert to find its origin.

The 3' end of the first version of Wuhan-Hu-1 accidentally included a segment of 618 bases of human DNA (https://twitter.com/ChrisDeZPhD/status/1290218272705531905). The first 20 of the 618 bases are actually included near the 3' end of the SARS 2 genome, and they're also the last 20 bases of the second version of Wuhan-Hu-1 (TGTGATTTTAATAGCTTCTT). You can do a BLAST search for the last 618 bases by going to https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn, pasting the sequence below into the big text field at the top, and pressing the "BLAST" button:

$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'|seqkit subseq -r -618:-1
>MN908947.1 Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome
tgtgattttaatagcttcttcctgggatcttgatttcaacagcctcttcctcatactatt
ctcaacactactgtcagtgaacttctaaaaatggattctgtgtttgaccgtaggaaacat
ttcagtattccttatctttagaaacctctgcaactcaagtgtctctgccaaagagtcttc
aaggtaatgtcaaataccataacatattttttaagtttttgtatacttcctaggaatata
tgtcatagtatgtaaagaagatctgattcatgtgatctgtactgttaagaaacacttcaa
aggaaaacaacctctcatagatcttcaaatcctgttgttcttgagttggaatgtacaata
ttcaggtagagatgctggtacaaaaaaagactgaatcatcaggctattagaaagataaac
taggcctctaacatagtaactttaacagtttgtacttatacatattttcacattgaaata
tagttttattcatgactttttttgttttagcttctctgtcttccattatttcaagctgct
aaaaattaaaaatatcctatagcaaagggctatggcatctttttgtaaaaataaggaaag
caaggttttttgataatc

The best match is 616 bases long because it's missing the last 2 bases of the query, but 614 out of 616 bases are identical with a result titled "Human DNA sequence from clone RP11-173E24 on chromosome 1q31.1-31.3, complete sequence".
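That works out to roughly 99.7% identity:

```shell
# Percent identity of the best BLAST hit: 614 identical bases out of 616 aligned.
awk 'BEGIN{printf "%.1f%%\n", 100*614/616}'
# prints 99.7%
```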

Another common assembly error is that a piece of host DNA or host rRNA is accidentally inserted at either end of the contig. For example, I noticed that at the beginning of the SARS-like bat virus Rs7907, there was an insertion which was missing from other SARS-like viruses:

$ curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1j-YFiMYG4DkVKSget2fYW-gaJDy6NCkW'
$ seqkit fx2tab sarslike.fa|grep -Ev 'Rs7931|Ra7909'|grep -C10 Rs7907|awk -F\\t '{print$1 FS substr($2,1,200)}'|seqkit tab2fx|mafft --globalpair --quiet -|seqkit fx2tab|awk -F\\t '{print $2 FS$1}'|cut -d, -f1|column -ts$'\t'
g----------------------------------------------------------------------ctttaactttttacaaatcccaggtagcaaaaccaaccaactct  MT040333.1 Pangolin coronavirus isolate PCoV_GX-P4L
g----------------------------------------------------------------------ctttaactttttacaaatcccaggtagcaaaaccaaccaactct  MT040335.1 Pangolin coronavirus isolate PCoV_GX-P5L
a----------------------------------------------------------------------ttaaaggtttatac-cttcccaggtagcaaaaccaaccaactct  MW532698.1 Pangolin coronavirus isolate GX_P2V
g----------------------------------------------------------------------ctttaactttttacaaatcccaggtagcaaaaccaaccaactct  MT040334.1 Pangolin coronavirus isolate PCoV_GX-P1E
----------------------------------------------------------------------------------------tcccaggtagcaaaaccaaccaactct  MT072864.1 Pangolin coronavirus isolate PCoV_GX-P2V
g----------------------------------------------------------------------ctttaactttttacaaatcccaggtagcaaaaccaaccaactct  MT040336.1 Pangolin coronavirus isolate PCoV_GX-P5E
a-----------------------------------------------------------------------------------c-cttcccaggtaacaaaaccaaccaac-ct  OL674075.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7905
-------------------------------------------------------------------------------------------------------------------  OL674081.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7952
--------------------------------------------------------------------------------------cttcccaggtaacaaaaccaaccaac-ct  OL674079.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7924
------------------------------------------------------------------------------------c-cttcccaggtaacaaaaccaaccaac-ct  OL674078.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7921
ggggattgcaattattccccatgaacgaggaattcccagtaagtgcgggtcataagcttgcgttgattaagtccctgcccttt---gtacacaccgcccaaaaccaaccaac-ct  OL674076.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7907
-------------------------------------------------------------------------------------------------------------------  OL674074.1 Severe acute respiratory syndrome-related coronavirus isolate Rs7896
a----------------------------------------------------------------------ttaaaggtttttac-cttcccaggtaacaaaaccaaccgac-ct  MZ081380.1 Betacoronavirus sp. RsYN04 strain bat/Yunnan/RsYN04/2020
a----------------------------------------------------------------------ttaaaggtttttac-cttcccaggtaacaaaaccaaccgac-ct  MZ081376.1 Betacoronavirus sp. RmYN05 strain bat/Yunnan/RmYN05/2020
a----------------------------------------------------------------------ttaaaggtttttac-cttcccaggtaacaaaaccaaccgac-ct  MZ081378.1 Betacoronavirus sp. RmYN08 strain bat/Yunnan/RmYN08/2020
--------------------------------------------------------------------------------------ctacccagg---aaaagccaaccaac-ct  AY304488.1 SARS coronavirus SZ16
-------------------------------------------------------------------------------------------------------------------  OK017852.1 Sarbecovirus sp. isolate YN2020B
-------------------------------------------------------------------------------------------------------------------  OK017856.1 Sarbecovirus sp. isolate YN2020F
g------------------------------------------------------------------------------------------------------------------  OK017855.1 Sarbecovirus sp. isolate YN2020E
-------------------------------------------------------------------------------------------------------------------  OK017854.1 Sarbecovirus sp. isolate YN2020D
g------------------------------------------------------------------------------------------------------------------  OK017853.1 Sarbecovirus sp. isolate YN2020C

When I did a BLAST search for the first 150 bases of Rs7907, the first 96 bases were 100% identical with several sequences of mammalian DNA or rRNA, like for example "PREDICTED: Pan troglodytes 18S ribosomal RNA (LOC129143085), rRNA" and "Delphinus delphis genome assembly, chromosome: 21".

Rapid amplification of cDNA ends (RACE)

When the middle part of an RNA sequence is known but it's not certain if the ends of the sequence are correct, the ends can be sequenced accurately using a method called RACE (rapid amplification of cDNA ends). RACE is also known as "one-sided PCR" or "anchored PCR", and it basically requires specifying only a single primer for each amplified segment instead of a forward and reverse primer like in regular PCR (https://en.wikipedia.org/wiki/Rapid_amplification_of_cDNA_ends).

In the third version of Wuhan-Hu-1, an additional 44 bases which were missing from the second version were added at the end. I believe it's because they used RACE to sequence the ends of the genome. The paper by Wu et al. said: "The genome sequence of this virus, as well as its termini, were determined and confirmed by reverse-transcription PCR (RT–PCR)10 and 5′/3′ rapid amplification of cDNA ends (RACE), respectively." (https://www.nature.com/articles/s41586-020-2008-3)
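If I've counted correctly, the length difference between the second and third versions is fully explained by two edits: 16 bases lost at the 5' end (the 27-base segment replaced by an 11-base one) and 44 bases gained at the 3' end:

```shell
# Length bookkeeping between MN908947.2 (29,875 bases) and MN908947.3 (29,903 bases).
echo $((29875 - 16 + 44))
# prints 29903
```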

Supplementary Table 6 includes the primers they used to do RACE (https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2008-3/MediaObjects/41586_2020_2008_MOESM1_ESM.pdf):

D. Primers used in 5'/3' RACE
5-GSP    CCACATGAGGGACAAGGACACCAAGTG    573-599      (599 bp)
5-GSPn   CATGACCATGAGGTGCAGTTCGAGC      491-515      (515 bp)
3-GSP    TGTCGCGCATTGGCATGGAAGTCACACC   29212-29239  (688 bp)
3-GSPn   CTCAAGCCTTACCGCAGAGACAGAAG     29398-29423  (502 bp)
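For the two 5' primers, the listed product size simply equals the primer's end coordinate, which makes sense if the 5' RACE product spans from the first base of the genome to the primer's 3' end (assuming I'm reading the table correctly):

```shell
# 5-GSP ends at 599 with a 599 bp product; 5-GSPn ends at 515 with a 515 bp product.
echo $((599 - 1 + 1)) $((515 - 1 + 1))
# prints 599 515
```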

In 2020, Steven Quay posted a tweet where he wrote: "Dr. Shi lied? Below is NEW RaTG13 sequence https://ncbi.nlm.nih.gov/nuccore/MN996532 with the missing 1-15 nt at 5' end. So I blasted 1-180 of this new sequence against her RNA-Seq for RaTG13 to find the missing 15 nt. But they are not there. So 1-15 nt appears FABRICATED by Dr. Shi & WIV." (https://twitter.com/quay_dr/status/1318041151211884552) Actually, the RACE sequences of RaTG13 were only added to the SRA in December 2021, after Quay's tweet, but one of them matched the segment of 15 nucleotides which was added to the start of the second version of RaTG13 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR16979454):

$ fastq-dump SRR16979454 # RACE sequences of RaTG13
$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN996532.'{1,2}|mafft ->ratg.fa # first and second version of RaTG13
[...]
$ Rscript -e 'm=t(as.matrix(Biostrings::readDNAStringSet("ratg.fa")));pick=which(rowSums(m!=m[,1])>0);write.table(cbind(m[pick,],pick),quote=F,col.names=F,row.names=F)'
- A 1
- T 2
- T 3
- A 4
- A 5
- A 6
- G 7
- G 8
- T 9
- T 10
- T 11
- A 12
- T 13
- A 14
- C 15
C T 2125
C A 2132
C A 6936
G A 7127
G T 18812
C G 18813
A - 29856
A - 29857
A - 29858
A - 29859
A - 29860
A - 29861
A - 29862
A - 29863
A - 29864
A - 29865
A - 29866
A - 29867
A - 29868
A - 29869
A - 29870
$ seqkit locate -irp attaaaggtttatac SRR16979454.fastq|column -ts$'\t' # search RACE sequences for the 15-base segment that was added to the start of the second version of RaTG13
seqID          patternName      pattern          strand  start  end  matched
SRR16979454.4  attaaaggtttatac  attaaaggtttatac  -       461    475  ATTAAAGGTTTATAC

As far as I know, the RACE sequences used by Wu et al. were never published, but I would guess they matched a longer part of the poly(A) tail than the metagenomic reads, which included only a few bases of the tail.

Missing poly(A) tail in metagenomic reads

The third version of Wuhan-Hu-1 has a 33-base poly(A) tail, but I think they sequenced the tail with RACE, because only a few bases of the tail are included in the metagenomic reads published at the SRA. The output below shows reads that aligned against the end of the third version of Wuhan-Hu-1. To save space, I only included one read per starting position even though many positions have over ten aligned reads. The first column shows the starting position and the second column shows the number of mismatches relative to Wuhan-Hu-1:

$ brew install bowtie2 fastp samtools
[...]
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/081/SRR10971381/SRR10971381_{1,2}.fastq.gz
[...]
$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>sars2.fa
$ bowtie2-build sars2.fa{,}
[...]
$ fastp -i SRR10971381_1.fastq.gz -I SRR10971381_2.fastq.gz -o SRR10971381_1.trim.fq.gz -O SRR10971381_2.trim.fq.gz -l 70
[...]
$ bowtie2 -p3 -x sars2.fa -1 SRR10971381_1.trim.fq.gz -2 SRR10971381_2.trim.fq.gz --no-unal|samtools sort -@2 ->sorted58.bam
[...]
$ samtools view sorted58.bam|awk -F\\t '!a[$4]++{x=$4;y=$10;sub(/.*NM:i:/,"");print x,$1,y}'|tail -n30
29763 1 AATGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAG
29767 2 TTCAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
29769 8 GGCTGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCGTCTCTTTAGA
29771 4 GGTCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCAA
29772 2 TTTTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
29773 1 GTTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGA
29775 2 TATGGAGAGCTGCCTATATGGAAGAGCCCTAATGTATAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACA
29776 6 TTGGAGAGCTGCCTATATGGCAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGATGC
29777 2 ATGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAG
29780 1 AGTGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGG
29781 4 AAACTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCGCATGTGATTTTAATAGCTTCTTAGGAGAATAACA
29791 1 TGTGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
29855 8 CTGTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29857 8 AATAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29858 8 AAAAAAAAAACTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29859 6 AAAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29860 7 AAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29861 6 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29862 3 AAAAATCACAAAAAAAAAAAAAAAAAAAAAAAA
29863 4 AAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29864 4 AAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29865 1 AATGAAAAAAAAAAAAAAAAAAAAAAAAAAA
29866 3 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29867 4 AAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAA
29868 2 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29869 1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29870 1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29871 0 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29872 1 AATAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
29873 1 AAAAAATAAAAAAAAAAAAAAAAAAAAAAAA

In the output above, there's a huge gap between positions 29,791 and 29,855 with no reads starting from the 63 positions in between. I think it's because the reads after the gap don't actually come from SARS 2 but from some other organisms.

The last 52 bases of Wuhan-Hu-1 before the poly(A) tail are ATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGAC, which is matched by the reads above the gap to within a couple of mismatches. But the first read below the gap starts with CTGTT, which has three nucleotide changes from ATGAC at the corresponding position in Wuhan-Hu-1.
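The three changes can be counted mechanically, for example with a small awk one-liner:

```shell
# Count positional mismatches between ATGAC (Wuhan-Hu-1) and CTGTT (the start
# of the first read below the gap).
awk 'BEGIN{a="ATGAC"; b="CTGTT"; for(i=1;i<=length(a);i++) if(substr(a,i,1)!=substr(b,i,1)) n++; print n}'
# prints 3
```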

Out of the reads whose starting position is 29,700 or above, there's only one read which has five A bases after the sequence ATGAC:

$ samtools view sorted58.bam|awk '$4>=29700'|cut -f10|grep -Eo ATGACA+|sort|uniq -c|sort -n
      1 ATGACAAAAA
      2 ATGACAAA
      2 ATGACAAAA
      8 ATGACA

Timeline of Wuhan-Hu-1

Wu et al.'s paper in Nature was submitted a week before MN908947.2 was published at GenBank, so maybe at the time when they submitted the Nature paper, they had not yet noticed that they accidentally included about 600 bases of human DNA at the end of MN908947.1:

(I'm not sure which of the dates above are UTC and which are local time.)

In Supplementary Table 8 of Wu et al. 2020, there's a list of the PCR primers they used to do PCR-based sequencing in order to verify the results of the metagenomic de-novo assembly they did with MEGAHIT (https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2008-3/MediaObjects/41586_2020_2008_MOESM1_ESM.pdf). The last reverse primer is AAAATCACATGGGGATAGCACTACT, which is shown to end at position 29,864, so I guess by the time they designed the primers, they had already noticed that they accidentally included a piece of human DNA at the end of the 30,474-base MN908947.1. The first forward primer is CCAGGTAACAAACCAACCAACTT, and it's shown to start at position 36, which matches the coordinates in the first and second versions of Wuhan-Hu-1 but not the third version, so I guess they designed the primers based on the second version. However, another part of Supplementary Table 8 says that the "primers for WHCV detection using qPCR" were "designed based on the whole genome of WHCV (MN908947.3)", and their coordinates match the coordinates in the third version and not the second version. So I guess they designed different sets of primers based on different versions of the genome.
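As a sanity check on that last reverse primer, its reverse complement reads in the plus-strand orientation and matches the genome sequence just before the poly(A) tail, which also appears in the aligned reads earlier:

```shell
# Reverse complement of the reverse primer AAAATCACATGGGGATAGCACTACT.
echo AAAATCACATGGGGATAGCACTACT | rev | tr ACGT TGCA
# prints AGTAGTGCTATCCCCATGTGATTTT
```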

Why is the longest MEGAHIT contig not reproducible?

Below I ran MEGAHIT for Wu et al.'s lung metagenomic reads. When I did trimming with fastp, my longest contig was 29,802 bases long and it missed the last 102 bases of MN908947.3 and had an insertion of one base at the beginning. When I did trimming with trimmomatic, my longest contig was 29,875 bases long and now it only missed the last 30 bases of MN908947.3 even though it still had the single-base insertion at the beginning:

brew install megahit -s # `-s` compiles from source because the bottle was compiled against GCC 9 so it gave the error `dyld: Library not loaded: /usr/local/opt/gcc/lib/gcc/9/libgomp.1.dylib`
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/081/SRR10971381/SRR10971381_{1,2}.fastq.gz # download from ENA because it's faster than downloading from SRA and doesn't require installing sratoolkit
fastp -i SRR10971381_1.fastq.gz -I SRR10971381_2.fastq.gz -o SRR10971381_1.trim.fq.gz -O SRR10971381_2.trim.fq.gz -l 70
megahit -1 SRR10971381_1.trim.fq.gz -2 SRR10971381_2.trim.fq.gz -o megahutrim
brew install trimmomatic
trimmomatic PE SRR10971381_{1,2}.fastq.gz SRR10971381_{1,2}.paired.fq.gz SRR10971381_{1,2}.unpaired.fq.gz AVGQUAL:20 HEADCROP:12 LEADING:3 TRAILING:3 MINLEN:75 -threads 4 # adapter trimming was omitted here
megahit -1 SRR10971381_1.paired.fq.gz -2 SRR10971381_2.paired.fq.gz -o megahutrimmo

The paper by Wu et al. says: "In total, we generated 56,565,928 sequence reads that were de novo-assembled and screened for potential aetiological agents. Of the 384,096 contigs assembled by Megahit [9], the longest (30,474 nucleotides (nt)) had a high abundance and was closely related to a bat SARS-like coronavirus (CoV) isolate - bat SL-CoVZC45 (GenBank accession number MG772933) - that had previously been sampled in China, with a nucleotide identity of 89.1% (Supplementary Tables 1, 2)." The figure of 56,565,928 is identical to the number of reads in the SRA run deposited by Wu et al. (https://www.ncbi.nlm.nih.gov/sra/?term=SRR10971381), and 30,474 is identical to the length of the first version of Wuhan-Hu-1 at GenBank.

However when I tried aligning the reads in Wu et al.'s metagenomic dataset against the first version of Wuhan-Hu-1, I didn't get a single read which aligned against the last 585 bases. The starting position of my last aligned read was 29,807 and it ended 590 positions before the end of MN908947.1:

$ curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'>wuhu1.fa
$ bowtie2-build wuhu1.fa{,}
[...]
$ bowtie2 -p3 -x wuhu1.fa -1 SRR10971381_1.fastq.gz -2 SRR10971381_2.fastq.gz --no-unal|samtools sort -@2 ->sorted59.bam
[...]
$ samtools view sorted59.bam|awk -F\\t '!a[$4]++{x=$4;y=$10;sub(/.*NM:i:/,"");print x,30473-x-length(y),$1,y}'|tail|column -t
29785  591  8   GGCTGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCGTCTCTTTAGA
29787  616  4   GGTCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCAA
29788  590  7   TTTTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
29789  591  5   GTTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGA
29791  585  11  TATGGAGAGCTGCCTATATGGAAGAGCCCTAATGTATAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACA
29792  591  7   TTGGAGAGCTGCCTATATGGCAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGATGC
29793  592  6   ATGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAG
29796  594  4   AGTGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGG
29797  585  12  AAACTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCGCATGTGATTTTAATAGCTTCTTAGGAGAATAACA
29807  590  6   TGTGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAA
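The second column above is computed as 30,473 minus the start position minus the read length (per the awk command), so for the last read:

```shell
# 30,473 - start 29,807 - read length 76 = 590, matching the second column above.
echo $((30473 - 29807 - 76))
# prints 590
```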

In Wu et al.'s metagenomic dataset, about half of the reads have been masked entirely so that they consist of only N bases:

$ seqkit grep -rsp '^N+$' SRR10971381_1.fastq.gz|seqkit head -n1
@SRR10971381.2 2 length=151
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
$ seqkit grep -rsp '^N+$' SRR10971381_1.fastq.gz|seqkit stat # out of a total of about 28 million reads, about 14 million reads consist of only N letters
file  format  type    num_seqs        sum_len  min_len  avg_len  max_len
-     FASTQ   DNA   13,934,971  2,027,338,447       35    145.5      151
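The same all-N check can be sketched with plain awk on a toy inline FASTQ (hypothetical reads r1 and r2), without seqkit:

```shell
# In FASTQ, every 2nd line of each 4-line record is the sequence;
# count the records whose sequence consists of only N bases.
printf '@r1\nNNNN\n+\n!!!!\n@r2\nACGT\n+\nFFFF\n' | awk 'NR%4==2 && /^N+$/ {n++} END{print n+0}'
# prints 1
```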

So it could be that by the time Wu et al. submitted the reads to the SRA, they had noticed that their initial 30,474-base contig included a segment of human DNA at the end, and they replaced the reads which matched human DNA with reads consisting of only N letters. That would explain why the number of reads at the SRA is the same as the number of reads which produced their 30,474-base contig, even though the contig cannot be reproduced from them. It could also be that by the point when they noticed the assembly error in MN908947.1, they were no longer able to update the main text of their paper in Nature, which might explain why the assembly error was accounted for in Supplementary Table 8 but not in the main text of the paper.

Wu et al. wrote that about 24 million reads out of about 57 million reads remained after human reads were filtered out:

Sequencing reads were first adaptor and quality trimmed using the Trimmomatic program[32]. The remaining 56,565,928 reads were assembled de novo using both Megahit (v.1.1.3)[9] and Trinity (v.2.5.1)[33] with default parameter settings. Megahit generated a total of 384,096 assembled contigs (size range of 200–30,474 nt), whereas Trinity generated 1,329,960 contigs with a size range of 201–11,760 nt. All of these assembled contigs were compared (using BLASTn and Diamond BLASTx) against the entire non-redundant (nr) nucleotide and protein databases, with e values set to 1 × 10−10 and 1 × 10−5, respectively. To identify possible aetiological agents present in the sequencing data, the abundance of the assembled contigs was first evaluated as the expected counts using the RSEM program[34] implemented in Trinity. Non-human reads (23,712,657 reads), generated by filtering host reads using the human genome (human release 32, GRCh38.p13, downloaded from Gencode) by Bowtie2[35], were used for the RSEM abundance assessment.

However, in the STAT analysis for the reads uploaded to the SRA, only about 0.1% of the reads match Homo sapiens, about 0.8% match simians, and about 4.6% match eukaryotes. There's also an unusually low percentage of identified reads, only 39%, because the roughly half of the reads that consist of only N bases are unidentified:

$ curl -s 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?cluster_name=public&acc=SRR10971381'>SRR10971381.stat
$ jq -r '.[].tax_totals.total as$tot|.[].tax_table|sort_by(-.total_count)[]|(100*.total_count/$tot|tostring)+";"+.org' SRR10971381.stat|egrep 'Eukaryota|Simiiformes|Homo sapiens'
4.591841930004224;Eukaryota
0.8129416704698984;Simiiformes
0.1192095708215023;Homo sapiens
$ jq -r '.[].tax_totals|100*.identified/.total' SRR10971381.stat
39.12950566284354

In order to compare the results to other human BALF samples from patients with pneumonia, I searched the SRA for "bronchoalveolar pneumonia metagenomic human". In one run which was part of a study titled "Bronchoalveolar Lavage Fluid Metagenomic Next-Generation Sequencing in non-severe and severe Pneumonia", about 86% of the reads were identified and 76% matched simians (https://www.ncbi.nlm.nih.gov/sra/?term=SRR22183690):

$ curl -s 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?cluster_name=public&acc=SRR22183690'>SRR22183690.stat
$ jq -r '.[].tax_totals.total as$tot|.[].tax_table|sort_by(-.total_count)[]|(100*.total_count/$tot|tostring)+";"+.org' SRR22183690.stat|egrep 'Eukaryota|Simiiformes|Homo sapiens'
81.79456158724525;Eukaryota
75.61440435123265;Simiiformes
12.85625010794863;Homo sapiens
$ jq -r '.[].tax_totals|100*.identified/.total' SRR22183690.stat
86.16966070371517

In an early metagenomic run for COVID titled "Total RNA sequencing of BALF (human reads removed)", only about 0.6% of reads matched simians. The run was submitted by Wuhan University and published on January 18th 2020, and I didn't find any SARS 2 run on the SRA with an earlier publication date (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA601736). About 13% of the reads were masked so that they consisted of only N bases, which lends further weight to the theory that a similar method was used to mask human reads in Wu et al.'s set of reads:

$ curl -s 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_taxonomy?cluster_name=public&acc=SRR10903402'>SRR10903402.stat
$ jq -r '.[].tax_totals.total as$tot|.[].tax_table|sort_by(-.total_count)[]|(100*.total_count/$tot|tostring)+";"+.org' *02.stat|egrep 'Eukaryota|Simiiformes|Homo sapiens'
1.435508516404756;Eukaryota
0.5964291097600983;Simiiformes
$ jq -r '.[].tax_totals|100*.identified/.total' SRR10903402.stat
71.32440955587016
$ seqkit grep -srp '^N*$' SRR10903402_1.fastq.gz|seqkit head -n1
@SRR10903402.1 1 length=151
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
$ seqkit seq -s SRR10903402_[12].fastq.gz|grep '^N*$'|wc -l
177770
$ seqkit seq -s SRR10903402_[12].fastq.gz|wc -l
1353388
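The two counts above imply about 13% all-N reads:

```shell
# 177,770 all-N reads out of 1,353,388 total reads across both files.
awk 'BEGIN{printf "%.1f%%\n", 100*177770/1353388}'
# prints 13.1%
```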

Comparison to another early metagenomic sequencing run of SARS 2

The NCBI's Sequence Read Archive has an "Analysis" tab which shows a breakdown of what percentage of reads are estimated to come from different organisms and taxonomic nodes. It was implemented using a tool called the SRA Taxonomy Analysis Tool (STAT), which searches the reads for 32-base k-mers that match a set of reference sequences. A database which contains the number of reads matching each taxonomic node for each SRA run is currently about 500 GB in size, but you can search it through BigQuery at the Google Cloud Platform: https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud-based-taxonomy-analysis-table/. I tried using BigQuery to search for the oldest sequencing runs which matched over 10,000 reads of SARS 2 (taxid 2697049):

select * from `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax where m.acc=tax.acc and tax_id=2697049 and total_count>10000 order by releasedate
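As an aside, the k-mer matching that STAT performs can be sketched in miniature with awk. This is a toy illustration only, with two made-up references and k=5 instead of 32, nothing like STAT's real index or database:

```shell
# Toy sketch of STAT-style k-mer classification: index every k-mer of each
# reference, then count which reference each k-mer of a read points to
out=$(awk -v k=5 'BEGIN{
  ref["virusA"]="ACGTACGTTT"; ref["virusB"]="GGGGGCCCCC"  # made-up references
  read="TTACGTACG"                                        # made-up read
  for (org in ref)                         # index every k-mer of each reference
    for (i=1; i<=length(ref[org])-k+1; i++)
      idx[substr(ref[org],i,k)]=org
  for (i=1; i<=length(read)-k+1; i++)      # look up every k-mer of the read
    if (substr(read,i,k) in idx) hits[idx[substr(read,i,k)]]++
  for (org in hits) print org, hits[org]   # k-mer hit counts per organism
}')
echo "$out"
```

STAT additionally maps hits onto the taxonomic tree and aggregates counts per node; this sketch only shows the exact-match lookup idea.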

I got 17 results from before 2020, but none of them had a single read which aligned against Wuhan-Hu-1 when I used Bowtie2 with the default settings. The oldest result I found which had reads that aligned against SARS 2 was submitted by Wuhan University and published on January 18th 2020 (in an unknown timezone) (https://www.ncbi.nlm.nih.gov/sra/?term=SRR10903402). When I ran MEGAHIT on the reads submitted by Wuhan University without doing any trimming, my longest contig was 17,699 bases long, and it was identical to positions 12197-29895 of Wuhan-Hu-1:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/002/SRR10903402/SRR10903402_{1,2}.fastq.gz # download from ENA because downloading from SRA is too slow
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>sars2.fa
megahit -1 SRR10903402_1.fastq.gz -2 SRR10903402_2.fastq.gz -o megawuhanjan
seqkit sort -lr megawuhanjan/final.contigs.fa|seqkit head -n1|cat - sars2.fa|mafft --clustalout -

When I trimmed the reads with fastp using the default settings apart from a longer minimum length, almost half of the read pairs got an adapter sequence trimmed, so the adapter sequences may be what previously prevented assembling a longer contig. This time my longest contig was 29,809 bases long, and it missed only the first 76 bases and last 19 bases of Wuhan-Hu-1 (which should demonstrate that it's common for MEGAHIT contigs to miss pieces from the start or end of the genome):

$ fastp -i SRR10903402_1.fastq.gz -I SRR10903402_2.fastq.gz -o SRR10903402_1.trim.fq.gz -O SRR10903402_2.trim.fq.gz
Read1 before filtering:
total reads: 676694
total bases: 101889418
[...]
Filtering result:
reads passed filter: 1126442
reads failed due to low quality: 193854
reads failed due to too many N: 0
reads failed due to too short: 33092
reads with adapter trimmed: 287572
bases trimmed due to adapters: 12910527
[...]
$ megahit -1 SRR10903402_1.trim.fq.gz -2 SRR10903402_2.trim.fq.gz -o megawuhanjan3
[...]
2023-05-20 14:22:39 - 425 contigs, total 246369 bp, min 286 bp, max 29809 bp, avg 579 bp, N50 540 bp
2023-05-20 14:22:39 - ALL DONE. Time elapsed: 72.962219 seconds
$ seqkit sort -lr megawuhanjan3/final.contigs.fa|seqkit head -n1|cat sars2.fa -|mafft ->temp.aln
[...]
$ Rscript --no-init-file -e 'm=t(as.matrix(Biostrings::readDNAStringSet("temp.aln")));pick=which(rowSums(m!=m[,1])>0);writeLines(unlist(lapply(split(pick,cummax(c(1,diff(pick)))),\(x)paste(x[1],tail(x,1),paste(apply(m[x,],2,paste,collapse=""),collapse=" ")))))'
1 76 ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACT ----------------------------------------------------------------------------
29885 29903 AAAAAAAAAAAAAAAAAAA T------------------

When I did variant calling on the reads submitted by Wuhan University using Wuhan-Hu-1 as the reference, I got zero variants, as expected:

brew install bowtie2 samtools bcftools
bowtie2-build sars2.fa{,.index}
x=wuhanjan;bowtie2 -p3 -x sars2.fa.index -1 SRR10903402_1.trim.fq.gz -2 SRR10903402_2.trim.fq.gz --no-unal|samtools sort -@2 ->$x.bam
samtools index $x.bam;bcftools mpileup -f sars2.fa $x.bam|bcftools call -mv -Ob -o $x.bcf;bcftools index $x.bcf;bcftools consensus -f sars2.fa $x.bcf>$x.fa

Some of the reads aligned by Bowtie2 had a longer poly(A) tail than Wuhan-Hu-1, so I tried aligning the untrimmed reads against a version of Wuhan-Hu-1 with an extended poly(A) tail. I found one read where the poly(A) tail started 73 bases from the end of the read, even though it had a bunch of non-A bases near the end and only 48 contiguous A bases. But I also got another read with 47 contiguous A bases, another with 39, and another with 38. And many of the long segments of A bases are preceded by GAATGAC, which are also the last bases of Wuhan-Hu-1 before the poly(A) tail. So I guess the reads may have included samples of SARS 2 which actually had a poly(A) tail longer than 33 bases:

$ (cat sars2.fa;echo AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)>sars2a.fa
$ bowtie2-build sars2a.fa{,.index}
[...]
$ bowtie2 -x sars2a.fa.index -1 SRR10903402_1.fastq.gz -2 SRR10903402_2.fastq.gz --no-unal -p3|samtools sort -@2 ->sorted54.bam
[...]
$ samtools view sorted54.bam|cut -f4,10,17|tail -n8|column -t
29755  ACGTGTGCTCTTCCGATCTGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATTTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTTTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAA  NM:i:14
29756  AGTGTACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTTTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTTTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAATCAAACACA   NM:i:6
29757  GTATCAAGTAAAAAATGTTAGGGAGAATTGCCTAAATGGAAAAGCCATAATTTTTAAAAATAATTTTAGTAGTGTTATCCCCATGTAATTTTAAAAATTTCTTAGGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAG   NM:i:24
29759  GTACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAAAAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAACAAAAAAAAAAAATCAAACACAAA   NM:i:6
29761  ACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCTTAATGTGTAAAATTAATGTTAGGAGTGCTATCTCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAATGATATGAATAGCGTTGGTTA  NM:i:19
29762  AAGGAAAAAAAGCTAGGAAGAGATGCCTAAATGGAAAAAACAAAATTTTAAAAATTAATTTTAGGGGGGGTATCCCCATGTGATTTTAATAATTTTTTAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAA  NM:i:28
29765  TGAACAATGCTAGGGGGAGCTGCCTATATGGAAAAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAAAAGCTTCTTAGGAGAAAGACAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAATCAAACAC      NM:i:9
29773  GCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAACGGAAGAGCGGCGGAGAGGAAA  NM:i:15
29773  GCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGGAAAAGCAACGGA  NM:i:8
29793  TGGAAGAGCCCTAATGAGTAAAATTAATTTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAAAAGCAAGTGGAGGAAAAAAA  NM:i:11
$ samtools view sorted54.bam|cut -f10|tail -n100|grep -Eo A+|awk '{print length}'|sort -n|tail # longest consecutive segments of A bases among last 100 reads
26
27
27
27
28
34
38
39
47
48

The last read shown above has 48 consecutive A bases. Its reverse pair has 47 consecutive T bases, but they're at a different spot: there's only 4 bases before the run of T bases, whereas in the forward pair there's 25 bases after the segment of contiguous A bases:

$ seqkit grep -p $(samtools view sorted54.bam|tail -n1|cut -f1) SRR10903402_1.fastq.gz |seqkit seq -s
TTGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTAAAAAACTTTTAAAATCAATTGGGGTTAGCACTACTAAAATTAATTTTACAAATTTGGGCTTTTCCAAAATGGGAAAAACACACGTCTAAACCCCAG
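The run of T bases in the read above can be checked with plain shell tools: reverse-complementing the read turns the run of T bases into the corresponding run of A bases.

```shell
# Reverse-complement the read shown above (rev reverses, tr complements) and
# compare the longest T run to the longest A run in the reverse complement
r=TTGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTAAAAAACTTTTAAAATCAATTGGGGTTAGCACTACTAAAATTAATTTTACAAATTTGGGCTTTTCCAAAATGGGAAAAACACACGTCTAAACCCCAG
rc=$(echo "$r" | rev | tr ACGT TGCA)
maxt=$(echo "$r" | grep -oE 'T+' | awk '{if (length>m) m=length} END{print m}')
maxa=$(echo "$rc" | grep -oE 'A+' | awk '{if (length>m) m=length} END{print m}')
echo "longest T run: $maxt, longest A run in reverse complement: $maxa"
```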

Some people were saying that it was somehow problematic that the full poly(A) tail of Wuhan-Hu-1 could not be reproduced from the metagenomic sequencing run of Wu et al. 2020. However, I have shown above that there's another early metagenomic sequencing run of SARS 2 whose reads actually feature a long poly(A) tail.

Other lung metagenomes can be used as controls for MEGAHIT assembly

In order to test if the genome of SARS 2 can be assembled from other metagenomic datasets, I searched the SRA for human bronchoalveolar pneumonia: https://www.ncbi.nlm.nih.gov/sra/?term=human%20bronchoalveolar%20pneumonia.

I picked the first result, which was from a study titled "Bronchoalveolar Lavage Fluid Metagenomic Next-Generation Sequencing in non-severe and severe Pneumonia": https://www.ncbi.nlm.nih.gov/sra/?term=SRR22183690. When I ran MEGAHIT with the default settings, I got a total of about 500 contigs with a maximum length of about 8,500 bases. When I aligned the contigs against a file of about 15,000 virus reference sequences, I got only 14 aligned contigs, none of which matched SARS 2:

$ wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR221/090/SRR22183690/SRR22183690_{1,2}.fastq.gz
$ brew install megahit -s # `-s` compiles from source because the bottle was compiled against GCC 9 so it gave me the error `dyld: Library not loaded: /usr/local/opt/gcc/lib/gcc/9/libgomp.1.dylib`
[...]
$ megahit -1 SRR22183690_1.fastq.gz -2 SRR22183690_2.fastq.gz -o megapneumonia
[...]
$ seqkit stat megapneumonia/final.contigs.fa
file                            format  type  num_seqs  sum_len  min_len  avg_len  max_len
megapneumonia/final.contigs.fa  FASTA   DNA        508  230,544      213    453.8    8,451
$ brew install bowtie2 seqkit gnu-sed
[...]
$ wget -q https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
$ seqkit fx2tab viral.1.1.genomic.fna.gz|sed $'s/A*\t$//'|seqkit tab2fx>viral.fa
$ bowtie2-build viral.fa{,}
[...]
$ bowtie2 -p3 -x viral.fa -fU megapneumonia/final.contigs.fa --no-unal|samtools sort ->megapneumonia.bam
[...]
$ x=megapneumonia.bam;samtools coverage $x|awk \$4|cut -f1,3-6|(gsed -u '1s/$/\terr%\tname/;q';sort -rnk4|awk -F\\t -v OFS=\\t 'NR==FNR{a[$1]=$2;next}{print$0,sprintf("%.2f",a[$1])}' <(samtools view $x|awk -F\\t '{x=$3;n[x]++;len[x]+=length($10);sub(/.*NM:i:/,"");mis[x]+=$1}END{for(i in n)print i"\t"100*mis[i]/len[i]}') -|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print$0"\t"a[$1]}' <(seqkit seq -n viral.fa|gsed 's/ /\t/;s/, .*//') -)|column -ts$'\t'
#rname       endpos   numreads  covbases  coverage    err%  name
NC_022518.1  9472     7         2334      24.641      5.13  Human endogenous retrovirus K113 complete genome
NC_050152.1  101660   1         250       0.245918    0.00  Enterobacteria phage P7
NC_032111.1  163005   1         143       0.0877274   3.47  BeAn 58058 virus
NC_018464.1  927      1         87        9.38511     0.45  Shamonda virus N and NSs genes
NC_007610.1  131384   1         45        0.0342507   0.22  Listeria bacteriophage P100
NC_037667.1  2077288  1         36        0.00173303  0.00  Pandoravirus quercus
NC_021858.1  1908524  1         34        0.00178148  0.00  Pandoravirus dulcis
NC_011189.1  3418     1         34        0.994734    0.00  Mikania micrantha mosaic virus RNA2

The contigs which matched human endogenous retrovirus K113 probably just matched the human genome. When I did a BLAST search for the longest contig, it got a 98.9% identical match both to some human endogenous retroviruses and to human DNA. When I downloaded other metagenomic sequencing runs for human lung or BALF samples from SRA and aligned the reads against virus refseqs, human endogenous retrovirus K113 got the highest coverage for most runs, including SRR22183691, SRR22183705, SRR1553689, and SRR22183702.

Comments about a paper by a mathematician from Hamburg

Distribution of read lengths and error rates in BBMap alignment

In 2022, someone who called themselves "a mathematician from Hamburg, who would like to remain unknown" published a paper titled "Structural analysis of sequence data in virology: An elementary approach using SARS-CoV-2 as an example" (https://impfen-nein-danke.de/u/structural_analysis_of_sequence_data_in_virology.pdf, https://impfen-nein-danke.de/u/structural_analysis_of_sequence_data_in_virology_tables_and_figures.pdf). He wrote that when he aligned Wu et al.'s reads with BBMap against reference genomes of HIV, hepatitis delta, and measles, he was able to get the reads to cover almost the entire reference genomes, which Stefan Lanka's fans used as evidence for their view that the genomes of viruses are fake. But actually the reads which aligned against HIV, measles, and hepatitis delta were either short or had a high percentage of mismatches, because he used very loose alignment criteria with BBMap which tolerated a high number of mismatches (k=8 minratio=0.1). When the mathematician from Hamburg filtered his reads with reformat.sh, he also used a low required minimum length (37) and an extremely low required identity percentage (60%):

Basically, we mapped the paired-end reads (2x151 bp) with BBMap [23] to the reference sequences we considered (Tables and Figures: Table 3) using relatively unspecific settings. We then varied the minimum length (M1) and minimum (nucleotide) identity (M2) with reformat.sh to obtain corresponding subsets of the previously mapped sequences with appropriate quality. Increasing minimum length M1 or minimum nucleotide identity M2 thereby increases the significance of the respective mapping. Subsequently, we formed consensus sequences with the respective subsets of selected quality with respect to the selected reference. We set all bases with a quality lower than 20 to "N" (unknown). A quality of 20 means an error rate of 1% per nucleotide, which can be considered sufficient in the context of our analyses. Finally, the assessment of the agreement between reference and consensus sequences was performed using BWA [24], Samtools [25], and Tablet [26]. The ordered pair (M1; M2) = (37; 0.6) was just chosen to give error rates F1 and F2, respectively, of less than 10% for reference LC312715.1. The results of all calculations performed are shown in Tables and Figures: Table 4. The calculations show the highest significance for the choice of the ordered pair (37; 0.6), which can be seen from the highest error rates in each case. Comparable significance is provided by the ordered pairs (47; 0.50) and (25; 0.62). While the genome sequences associated with coronaviruses show error rates approximately above 10% for all ordered pairs considered (M1; M2), the error rates of the two sequences LC312715.1 (HIV) and NC_001653.2 (Hepatitis delta) are below 10% and decrease further for the ordered pairs (32; 0.60) and (30; 0.60).

Here you can see that the mode length of his reads which aligned against Wuhan-Hu-1 (MN908947.3) was around 150 bases, but the mode length of the reads which aligned against HIV (LC312715.1), hepatitis delta (NC_001653.2), measles (AF266291.1), and a randomly generated reference sequence (rnd_uniform) was around 35-40 bases:

In the figure above, Wuhan-Hu-1 (MN908947.3) has two peaks for the length of the aligned reads, where the first peak around length 37 looks almost as high as the second peak around length 150. However the y-axis is cut off so you can't see the full height of the second peak. (Also I don't understand why the plot above shows that there were some reads longer than 151 bases which aligned against Wuhan-Hu-1, because the maximum length of Wu et al.'s reads is 151 bases. Maybe the mathematician from Hamburg counted the gaps in the middle of the aligned reads as part of the reads, or maybe he calculated read lengths based on the CIGAR strings so that he added together all numbers in the CIGAR string without skipping deletions. His paper indicated that he ran BBMap with the maxindel=0 parameter which doesn't allow indels, but it might be that he forgot to include the parameter in the run where he used Wuhan-Hu-1 as a reference.)
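One way to get apparent read lengths above 151 is indeed to sum every CIGAR operation length, because deletions consume reference bases that are not present in the read. Here's a toy check with a made-up CIGAR string (not one from the actual data):

```shell
# Compare summing all CIGAR operation lengths against counting only the
# operations that consume read bases (M, I, S, =, X per the SAM spec)
cigar=100M5D51M
out=$(echo "$cigar" | awk '{
  all=0; seqlen=0
  while (match($0, /[0-9]+[MIDNSHP=X]/)) {
    op=substr($0, RSTART+RLENGTH-1, 1)    # operation letter
    len=substr($0, RSTART, RLENGTH-1)+0   # operation length
    all+=len
    if (op ~ /[MIS=X]/) seqlen+=len       # operations that consume read bases
    $0=substr($0, RSTART+RLENGTH)
  }
  print "sum of all operation lengths: " all
  print "read bases actually consumed: " seqlen
}')
echo "$out"
```

With a 5-base deletion, summing every number gives 156 even though the read itself is only 151 bases long.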

I tried aligning the reads from Wu et al. against Wuhan-Hu-1, running mapPacBio.sh and reformat.sh with the same parameters as the mathematician from Hamburg, but I got about 11 times as many reads of length 151 as reads of length 37:

Next I tried aligning Wu et al.'s reads against other reference genomes, where I again ran BBMap and reformat.sh with the same parameters as the mathematician from Hamburg:

brew install --cask miniconda;conda init ${SHELL##*/}
conda install -c bioconda bbmap parallel-fastq-dump
brew install fastp
parallel-fastq-dump --gzip --threads 10 --split-files --sra-id SRR10971381 # `fastq-dump` is slow and `fasterq-dump` doesn't support gzip
fastp -i SRR10971381_1.fastq.gz -I SRR10971381_2.fastq.gz -o SRR10971381_1.trim.fq.gz -O SRR10971381_2.trim.fq.gz # trim adapter sequences, trim low-quality ends of reads, discard reads shorter than 15 bases, discard reads with 5 or more `N` letters
printf %s\\n MN908947.3\ sars2 LC312715.1\ hiv NC_001653.2\ hepatitis AF266291.1\ measles>refs
cat refs|while read l m;do curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=$l">ref.$m.fa;done
gshuf -rn30000<<<$'A\nC\nG\nT'|tr -d \\n|awk 'BEGIN{print">random"}1'>ref.random.fa
for x in $(cut -d\  -f2 refs) random;do mapPacBio.sh ref=ref.$x.fa in=SRR10971381_1.trim.fq.gz in2=SRR10971381_2.trim.fq.gz outm=$x.mapped.sam vslow k=8 maxindel=0 minratio=0.1;done
for x in $(cut -d\  -f2 refs) random;do reformat.sh in=$x.mapped.sam out=$x.filtered.sam minlength=37 idfilter=60 ow=t;done
for x in $(cut -d\  -f2 refs) random;do awk -F\\t '/NM:i:/{l=length($10);sub(/.*NM:i:/,"");sub("\t.*","");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f $x.filtered.sam|sed s/^/$x\ /;done>lengthcounts

The plot below shows the result of my BBMap alignment. If you compare SARS 2 to HIV, hepatitis, or the random sequence, they all have a similar mismatch rate and similar number of aligned reads at the lowest read lengths, but after around length 55, the mismatch rate of SARS 2 starts to drop until it reaches about 2% at length 151. However the mismatch rate of the other sequences remains at around 40% even for longer reads:

Hepatitis delta has an extremely short genome of only about 1,700 bases, so in the plot above, hepatitis delta gets fewer aligned reads at lengths 37 to 60 than the longer genomes. I don't know why the difference between hepatitis delta and the longer genomes isn't bigger, though, given that SARS 2 and the random 30 kb genome are almost 20 times longer than hepatitis delta. It might be because the longer sequences are more likely to include multimapped reads, which I omitted from this analysis, or because I ran BBMap with such loose criteria that even for a short reference genome, random short reads are likely to get aligned against some part of the genome.
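The last possibility can be supported with a back-of-the-envelope estimate. This is my own naive random-sequence model, not BBMap's actual seeding logic: the expected number of chance exact 8-mer seed hits between one 37-base read and a genome is (number of 8-mers in the read) times (number of 8-mer positions in the genome) divided by 4^8.

```shell
# Expected chance exact 8-mer seed hits per 37-base read, under a crude model
# where every 8-mer is an independent uniform random sequence
out=$(awk 'BEGIN{
  k=8; p=1/4^k                     # probability two random 8-mers are identical
  readlen=37; kmers=readlen-k+1    # 30 overlapping 8-mers in a 37-base read
  printf "hepatitis delta (about 1700 bases): %.3f\n", kmers*(1700-k+1)*p
  printf "random 30 kb genome: %.3f\n", kmers*(30000-k+1)*p
}')
echo "$out"
```

So with k=8 even a random 37-base read is expected to share a seed with a 30 kb genome more than a dozen times over, and close to once even with a genome as short as hepatitis delta's.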

(The number of mismatches is indicated by the NM tag in the SAM file format. For some reason BBMap doesn't include the NM tag for multimapped reads, which are reads that can map to two or more spots in the reference genome and which are indicated by a zero in the fifth column of the SAM file. Therefore I omitted multimapped reads from the plot above in order to calculate the average number of mismatches correctly.)

Here's R code for generating the plot above:

library(ggplot2)

t=read.table("lengthcounts")
t2=rbind(t,expand.grid(V1=unique(t$V1),V2=unique(t$V2),V3=0,V4=NA))
t=t2[!duplicated(t2[,1:2]),]

group=factor(t$V1,levels=unique(t$V1))

color=c("black",hcl(c(0,210,120)+15,60,55),"gray60")

xbreak=10 # x-axis tick interval
ybreak2=8
xstart=xbreak*floor(min(t$V2)/xbreak)
xend=xbreak*ceiling(max(t$V2)/xbreak)
yend2=48

zeroshift=1
log=log10(t$V3)+zeroshift
log[log==-Inf]=0
yend=ceiling(max(log))
ylabel=c(0,1,10,100,paste0("1e",3:ceiling(max(log-zeroshift))))

mismatch=t$V4*yend/yend2

label=c("SARS 2 (MN908947.3) (30kb)","HIV-1 (LC312715.1) (9kb)","Hepatitis delta (NC_001653.2) (2kb)","Measles (AF266291.1) (16kb)","Random sequence (30kb)")
ystep=yend/15
labels=data.frame(x=.25*(xend-xstart),y=seq(.3*yend,by=-ystep,length.out=length(label)),label)

ggplot(t,aes(x=V2,y=log,color=group))+
geom_line(linewidth=.3)+
geom_point(aes(y=mismatch),size=.2)+
geom_label(data=labels,aes(x=x,y=y,label=label),fill=alpha("white",.7),label.r=unit(0,"lines"),label.padding=unit(.04,"lines"),label.size=0,color=color,size=2.2,hjust=0)+
scale_x_continuous(limits=c(xstart,xend),breaks=seq(0,xend,xbreak),expand=c(0,0))+
scale_y_continuous(limits=c(0,yend),breaks=c(0,zeroshift+seq(0,5,1)),labels=ylabel,expand=expansion(mult=c(.02,0)),sec.axis=sec_axis(trans=~.*yend2/yend,breaks=seq(0,yend2,ybreak2),name="Percentage of mismatches (dots)"))+
labs(x="Length of read",y="Number of aligned reads (lines)")+
scale_color_manual(values=color)+
theme(
  axis.text=element_text(size=6,color="black"),
  axis.ticks=element_blank(),
  axis.ticks.length=unit(0,"in"),
  axis.title=element_text(size=8),
  legend.position="none",
  panel.background=element_rect(fill="white"),
  panel.border=element_rect(color="gray85",fill=NA,linewidth=.4),
  panel.grid.major=element_line(color="gray85",linewidth=.2),
  panel.grid.minor=element_blank(),
  plot.background=element_rect(fill="white"),
  plot.margin=margin(.05,.15,.05,.1,"in")
)

ggsave("1.png",width=4,height=2.5)

The code below selects reads that aligned against SARS 2, that survived the length and identity filtering done with reformat.sh, and that are not multimapped reads (which would be missing the NM tag). It then shows that the average mismatch rate per aligned read was 37.5% for reads of length 37 and 2.0% for reads of length 151. There were a total of 4,856 aligned reads of length 37 and 62,766 aligned reads of length 151:

$ awk -F\\t '/NM:i:/{l=length($10);sub(/.*NM:i:/,"");sub("\t.*","");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f sars2.filtered.sam|(gsed -u 4q;echo ...;tail -n4)
37 4856 37.5
38 5010 38.0
39 3896 37.8
40 4363 38.2
...
148 2316 2.3
149 6084 2.0
150 32431 2.2
151 62766 2.0

However in the case of the reads that got aligned against the HIV reference genome, there were only 14 aligned reads of length 151 and even those reads had 41% mismatches on average:

$ awk -F\\t '/NM:i:/{l=length($10);sub(/.*NM:i:/,"");sub("\t.*","");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f hiv.filtered.sam|(gsed -u 4q;echo ...;tail -n4)
37 4460 37.2
38 4471 37.5
39 3584 37.4
40 3843 37.8
...
148 2 41.2
149 3 40.7
150 16 41.1
151 14 40.6

In the last sentence of the following paragraph, I don't know if the mathematician from Hamburg meant to say 120,000 long reads instead of 120,000 short reads, because otherwise it doesn't make sense:

By excluding all mappable sequences longer than 100 nucleotides, essentially the approximately 120,000 reads associated with SARS-CoV-2 were removed. The coverage distribution of the remaining short sequences now appears random, analogous to Figure 13. Again, this correlates with high error rates R1 (29.90%) and R2 (29.96%). This indicates that no significant structure of reference MN908947.3 is included in the published sequences, except for the approximately 120,000 (Tables and Figures. Table 1) associated short reads.

He got about 120,000 mapped reads with average length around 146 when he used Bowtie2, but he got about 60,000 reads with average length around 46 when he used BBMap with reformat.sh maxlength=100:

However in my previous plot shown above, even at lengths 60-100, the reads that map to SARS 2 have a much lower mismatch rate than the reads that map to HIV, hepatitis delta, measles, or the random sequence. So you shouldn't just look at the number of mapped reads or the coverage but also at the mismatch rate.

Comparison of Bowtie2 and BBMap

When I tried aligning Wu et al.'s reads with Bowtie2, I got zero reads that aligned against HIV, hepatitis delta, measles, or the random sequence. I got 121,779 aligned reads for SARS 2, which was identical to the figure reported by the Hamburg mathematician, because both of us did trimming with fastp using the default parameters.

With Bowtie2, the shortest reads that aligned against SARS 2 were 31 bases long. There were fewer than a hundred reads at each length below 89, and their error rate was much lower than in the case of BBMap. However after around length 100-120, the results of Bowtie2 start to converge with those of BBMap:

I ran Bowtie2 with the default parameters:

bowtie2-build MN908947.1.fa{,.index}
bowtie2 -p3 -x MN908947.1.fa.index -1 SRR10971381_1.trim.fq.gz -2 SRR10971381_2.trim.fq.gz --no-unal|samtools sort -@2 ->sorted30.bam

Aligning random reads with BBMap

I created a random 30,000-base reference sequence and a thousand random reads for each length between 30 and 150. 37% of the reads of length 30 got aligned by BBMap, but I got zero aligned reads above length 70. The proportion of mismatches was around 40% for most read lengths but 59% for length 70. This time I didn't get any multimapped reads so I didn't have to exclude them:

$ gshuf -e A -e C -e G -e T -rn30000|printf %s\\n \>ref $(paste -sd\\0 -)>ref.fa
$ n=1000;for i in {30..150};do gshuf -e A -e C -e G -e T -rn$[i*n]|awk '{ORS=NR%i==0?RS:""}1' i=$i|paste <(seq $n|sed s/^/$i./) -;done|seqkit tab2fx>reads.fa
$ bbmap.sh ref=ref.fa in=reads.fa outm=reads.mapped.sam vslow k=8 maxindel=0 minratio=0.1
[...]
$ awk -F\\t '/^[^@]/{l=length($10);sub(/.*NM:i:/,"");sub(/\t.*/,"");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f reads.mapped.sam
30 372 40.6
31 312 40.8
32 230 42.1
33 196 41.0
34 180 41.5
35 137 41.6
36 104 41.3
37 89 41.2
38 73 41.5
39 53 41.5
40 41 39.9
41 29 43.5
42 24 40.3
43 24 41.4
44 15 43.2
45 14 40.8
46 13 41.3
47 10 41.7
48 6 41.7
49 7 42.6
50 3 40
51 3 43.8
52 5 43.8
53 3 40.9
54 5 41.5
55 2 42.7
56 3 42.3
57 1 40.4
70 1 58.6

When I filtered out reads with an identity percentage below 70, I was left with only 39 reads:

$ reformat.sh in=reads.mapped.sam out=reads.filtered.sam idfilter=70 ow=t
[...]
$ awk -F\\t '/^[^@]/{l=length($10);sub(/.*NM:i:/,"");sub(/\t.*/,"");a[l]+=$0;n[l]++}END{for(i in a)print i,n[i],100*a[i]/n[i]/i}' OFMT=%.1f reads.filtered.sam
30 17 28.6
31 11 28.4
32 2 23.4
33 2 27.3
34 5 28.2
35 1 28.6
38 1 28.9

In the code blocks above, I ran BBMap with the same parameters which were used by the mathematician from Hamburg (vslow k=8 maxindel=0 minratio=0.1). But when I ran BBMap with the default parameters, zero of the random reads mapped to the random reference genome.

Parameters used with BBMap

When I googled for "minratio=0.1" "bbmap", I found the following usage example in the BBTools user guide (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/):

To map with super-high sensitivity (useful for very-low-quality data, or remote homologies):

mapPacBio.sh in=reads.fq out=mapped.sam vslow k=8 maxindel=200 minratio=0.1

The list of shell commands in the PDF by the Hamburg mathematician featured this BBMap command which was used to align the reads: mapPacBio.sh in=SRR10971381_1.fastq in2=SRR10971381_2.fastq outm=mapped.sam vslow k=8 maxindel=0 minratio=0.1. So maybe he copied the command from the BBTools user guide, because he even used the same output file name mapped.sam, he listed the parameters in the same order, and he used mapPacBio.sh instead of bbmap.sh.

The value of the k-mer length can range from 8 to 15, where the default is k=13 and lower values are more sensitive. So the mathematician from Hamburg used the most sensitive possible k-mer length.
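The effect of the k-mer length on sensitivity can be illustrated with a rough probability model of my own (not BBMap's internals): the chance that a highly diverged read still contains at least one error-free k-mer seed drops sharply as k grows.

```shell
# P(at least one exact k-mer seed survives) for a 37-base read that differs
# from the reference at 40% of positions, treating k-mers as independent
# (an approximation; overlapping k-mers are not truly independent)
out=$(awk 'BEGIN{
  L=37; d=0.40                          # read length and per-base divergence
  ks[1]=8; ks[2]=13
  for (i=1; i<=2; i++) {
    k=ks[i]; kmers=L-k+1
    p=1-(1-(1-d)^k)^kmers               # chance some k-mer is error-free
    printf "k=%d: %.3f\n", k, p
  }
}')
echo "$out"
```

Under this crude model, about 40% of 37-base reads with 40% divergence would still produce at least one exact 8-mer seed, while with the default k=13 almost none would.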

The minratio parameter is no longer listed in the help message of BBMap, but I think it's because it has the same effect as setting minid but on a different scale, and the developer of BBMap recommended using minid instead. The default value for minid is 0.76 which is approximately equal to minratio=0.56. The developer of BBMap wrote (https://www.biostars.org/p/376782/):

Sorry for the confusion; they set the same thing, just with different units. "minratio" is the internal number using the internal scoring system; if you use the "minid" flag it converts it to the equivalent minratio. So if you set "minid=0.76 " then it will be converted to approximately "minratio=0.56" and then the minratio variable will be set to 0.56. So there is no reason to set both of them. If you do set both of them, the last value will be used. I recommend minid over minratio because it is easier to read, but they accomplish the same thing.

"minid" is approximate since it adjusts minratio, or the ratio of the maximum possible internal score, but the internal score is not directly related to percent identity because of different penalties for substitutions, deletions, and insertions, which vary by length. If you need an exact %identity cutoff, use "idfilter=0.76" which will eliminate all alignments below 76% identity, as a postfilter. "minratio/minid" actually affect mapping, and prevent the creation of any alignments with an internal score below "minratio". As such, a high value of minid will make the program faster and a low value will make it slower. Whereas "idfilter" does not affect speed.

That said, I don't recommend using "idfilter" unless you have a good reason (like, a paper with a graph showing %identity cutoffs, for clarity; not for something like variant calling). "minid/minratio" are proportional to the internal score which should better model actual mutations or evolution, by trying to minimize the number of independant events rather than the edit distance; "idfilter=0.5", for example, will disallow a 100bp read from mapping with a 400bp deletion in the middle, since that would be only 20% identity. "minid=0.5" would allow this mapping because the internal score would actually still be quite high (maybe 0.9 of the maximum possible ratio).

Modification date at GenBank confused with publication date

The mathematician from Hamburg wrote:

Alignment with the nucleotide database on 05/12/2021 showed a high match (98.85%) with "Homo sapiens RNA, 45S pre-ribosomal N4 (RNA45SN4), ribosomal RNA" (GenBank: NR_146117.1, dated 04/07/2020). This observation contradicts the claim in [1] that ribosomal RNA depletion was performed and human sequence reads were filtered using the human reference genome (human release 32, GRCh38.p13). Of particular note here is the fact that the sequence NR_146117.1 was not published until after the publication of the SRR10971381 sequence library considered here.

But actually the date shown by the NCBI's website is the date when the entry was last modified, and the original publication date of NR_146117.1 is in 2017. You can see the original publication date in the revision history by clicking the small "GenBank" menu in the top left corner and then selecting "Revision History" (https://www.ncbi.nlm.nih.gov/nuccore/NR_146117.1?report=girevhist).