This pucing tour is depressing but, nonetheless I pfind it pfascinating. Thank you for all you are doing. Miss being able to see soothing kitty pics on my feed to help me cope. When the world is melting down, there's nothing like a cat to reassure you life is still purrty great.
Ben from USMortality failed to reproduce the genome of Wuhan-Hu-1 using MEGAHIT, but I think it might be because he didn't do any quality control or trimming or merging of the reads: https://usmortality.substack.com/p/sars-cov-2-genome-assembly, https://github.com/USMortality/Megahit-SARS-CoV-2/blob/master/megahit.sh. Are you able to modify his shell script so it produces the correct result?
He didn’t trim the reads, and as a result his assembly had a poly(A) tail of a different length.
Actually, Ben's longest contig was only 29802 bases long, so it was 101 bases shorter than the third version of Wuhan-Hu-1 at GenBank: it had one extra base at the beginning and was missing the last 102 bases. I got the same result when I ran MEGAHIT without Trimmomatic, but after I added Trimmomatic to my pipeline, my longest contig was missing only the last 30 bases of Wuhan-Hu-1: https://usmortality.substack.com/p/sars-cov-2-genome-assembly/comment/13659949. I probably used the wrong parameters with Trimmomatic, though, or my pipeline is still missing other steps.
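The coordinate arithmetic in that comparison is easy to get wrong by one, so here is a minimal sketch that checks it. The reference length is not quoted directly above; it is inferred from the stated figures (29802 + 101), and the function name is purely illustrative.

```python
def contig_length(ref_len: int, extra_at_start: int, missing_at_end: int) -> int:
    """Length of a contig that matches the reference except for
    extra bases at the start and missing bases at the end."""
    return ref_len + extra_at_start - missing_at_end


# Inferred from the comment: the contig is 101 bases shorter than the reference.
REF_LEN = 29802 + 101

# One extra base at the beginning, last 102 bases missing.
print(contig_length(REF_LEN, extra_at_start=1, missing_at_end=102))

# Assumption: with Trimmomatic the contig has no extra leading base
# and only the last 30 bases are missing.
print(contig_length(REF_LEN, extra_at_start=0, missing_at_end=30))
```

The first call reproduces the 29802-base contig: one extra base and 102 missing bases give a net difference of 101.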
That’s normal variation in the poly(A) tract of the genome.
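One way to check whether two assemblies differ only in the poly(A) tract is to compare them after stripping the trailing run of A's. This is a rough sketch with made-up toy sequences and hypothetical helper names. It also counts any genomic A's that sit directly before the tail as part of the tail.

```python
def poly_a_tail_length(seq: str) -> int:
    """Number of consecutive A's at the 3' end of the sequence."""
    seq = seq.strip().upper()
    return len(seq) - len(seq.rstrip("A"))


def strip_poly_a(seq: str) -> str:
    """Sequence with the trailing run of A's removed."""
    return seq.strip().upper().rstrip("A")


def same_apart_from_tail(a: str, b: str) -> bool:
    """True if the two sequences are identical once the poly(A) tails are removed."""
    return strip_poly_a(a) == strip_poly_a(b)


# Toy sequences, not real SARS-CoV-2 data.
contig = "ACGTTGACAAAA"
reference = "ACGTTGACAAAAAAA"

print(poly_a_tail_length(contig), poly_a_tail_length(reference))
print(same_apart_from_tail(contig, reference))
```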
This is a classic example of a person doing something for the first time and thinking they are an expert.
Dunning-Kruger move.
Nothing burger. When genomics experts explained this to him, he went conspiracy instead of learning.
Ben,
You blocked me on Twitter and you block anyone who even liked my comments as well.
Why do you expect me to answer your questions over here?
Not taking that revisionist history on this platform, Ben.
You jumped onto a thread about this work casting shade on the PCR used.
I simply stated your known and public doubts about genomics and you call that slander.
Goodbye.
I already solved the mystery in this comment I posted to your Substack: https://usmortality.substack.com/p/sars-cov-2-genome-assembly/comment/13407456. Or actually the mystery was solved by ChrisDeZPhD in this Twitter thread and I just repeated what he said: https://twitter.com/ChrisDeZPhD/status/1290218272705531905.
In the second version of Wuhan-Hu-1 at GenBank, the last 20 bases (which are also included in the first and third versions but not at the very end) are TGTGATTTTAATAGCTTCTT, which is identical to a segment of human chromosome 1, and the extra 598 nucleotides after the 20-base segment in the first version also match human chromosome 1.
If you do a BLAST search for the last 618 nucleotides of the first version, it's 99.68% identical to a result titled "Human DNA sequence from clone RP11-173E24 on chromosome 1q31.1-31.3, complete sequence". You can simply copy positions 29856 to 30473 from here: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.1. Then paste it here and press the BLAST button: https://blast.ncbi.nlm.nih.gov/Blast.cgi.
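If you would rather extract the segment from a downloaded FASTA file than copy it by hand, the positions above are 1-based and inclusive, so they need converting to a Python slice. A minimal sketch, using hypothetical helper names and a toy record rather than the real sequence:

```python
def read_fasta_sequence(text: str) -> str:
    """Concatenate the sequence lines of a single-record FASTA string."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">"))


def slice_1based(seq: str, start: int, end: int) -> str:
    """Return positions start..end (1-based, inclusive), as shown on GenBank."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("positions out of range")
    return seq[start - 1:end]


# Positions quoted in the comment: the span is 618 bases long,
# i.e. the 20-base segment plus the 598 extra nucleotides.
START, END = 29856, 30473
print(END - START + 1)

# Toy record to show the indexing; not real sequence data.
toy = ">toy\nACGTACGT\nTTGGCCAA\n"
seq = read_fasta_sequence(toy)
print(slice_1based(seq, 9, 12))
```

With the real first-version record loaded into `seq`, `slice_1based(seq, START, END)` would give the 618 bases to paste into BLAST.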
You can align the three different versions of Wuhan-Hu-1 like this: `brew install mafft;curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.'{1,2,3}|mafft --clustalout -`.
Another paper that did de novo assembly of an early SARS 2 sample was Ren et al.: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7147275/. They wrote: "Quality control processes included removal of low-complexity reads by bbduk (entropy = 0.7, entropy-window = 50, entropy k = 5; version: January 25, 2018),[11] adapter trimming, low-quality reads removal, short reads removal by Trimmomatic (adapter: TruSeq3-SE.fa:2:30:6, LEADING: 3, TRAILING: 3, SLIDING WINDOW: 4:10, MINLEN: 70, version: 0.36),[12] host removal by bmtagger (using human genome GRCh38 and yh-specific sequences as reference),[13] and ribosomal reads removal by SortMeRNA (version: 2.1b).[14]" The first version of Wuhan-Hu-1 may have included the 618-base segment of human DNA at the end because they didn't remove human reads before running MEGAHIT, or at least the methods section of the Wu et al. paper doesn't mention removing human reads.
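To give a feel for what the quality settings quoted above do to a single read, here is a simplified sketch of the LEADING, TRAILING, SLIDINGWINDOW and MINLEN steps, applied in that order. It is an illustration of the idea only, not Trimmomatic's actual implementation: adapter clipping is left out, the exact cut position of the real sliding-window step may differ, and all function names are made up.

```python
from typing import List, Optional, Tuple


def sliding_window_keep(quals: List[int], window: int, min_avg: float) -> int:
    """Number of bases to keep: cut at the start of the first window
    whose average quality falls below min_avg (simplified)."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            return i
    return len(quals)


def trim_read(seq: str, quals: List[int], leading: int = 3, trailing: int = 3,
              window: int = 4, min_avg: float = 10,
              minlen: int = 70) -> Optional[Tuple[str, List[int]]]:
    """Apply the four steps in order; return None if the read ends up too short."""
    start = 0
    while start < len(quals) and quals[start] < leading:  # LEADING
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < trailing:      # TRAILING
        end -= 1
    seq, quals = seq[start:end], quals[start:end]
    keep = sliding_window_keep(quals, window, min_avg)    # SLIDINGWINDOW
    seq, quals = seq[:keep], quals[:keep]
    if len(seq) < minlen:                                 # MINLEN
        return None
    return seq, quals


# Toy reads with made-up quality scores.
bases = "ACGTACGTAC" * 9                        # 90 bases
good_middle = [2] + [30] * 80 + [2] * 9         # bad ends, good middle
dip_in_middle = [30] * 40 + [5] * 10 + [30] * 40

kept = trim_read(bases, good_middle)
print(len(kept[0]) if kept else None)
print(trim_read(bases, dip_in_middle))
```

The first read loses only its low-quality ends and survives. The second is cut at the quality dip, falls below the 70-base minimum, and is dropped entirely.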
Thanks for digging into that. If it's a lung metagenome, that can happen.
Also need to be careful to look for index hopping on various Illumina platforms.