Lately, we have been exploring the use of Oxford Nanopore sequencing to enable a 24-hour turnaround time on pathogen detection in the cannabis and food testing space.

We have played with these tools for years, but lately they have matured to a point where it is hard to justify installing any other sequencer first. The barrier to entry for Nanopore sequencing ($5K-$10K) is an order of magnitude lower than for any other sequencer.

The informatics is more challenging but it comes with the unique benefits of long reads and methylation detection.

One key aspect of this technology is that it reads single molecules of DNA. Most other sequencers have to amplify DNA in order to see it. Illumina looks at thousands of copies of a DNA molecule, or a cluster of DNA seeded from the clonal amplification of a single molecule. SOLiD used 50,000 copies of a DNA molecule that were seeded from a single molecule onto a magnetic bead and clonally amplified via emulsion PCR. Helicos attempted to directly read single molecules on arrays but required large optical tables and highly sensitive cameras while struggling to get more than 30bp reads. All 3 of these fluorescent sequencing approaches attempted to sequence millions to billions of DNA molecules in parallel, 1 base at a time, and they required that all molecules advance synchronously in the sequencing process.

Read lengths were limited by what you could PCR onto an array (100bp-1000bp) and by the efficiency of their respective sequencing-by-synthesis chemistries. Illumina won this read length battle with bidirectional 300bp reads, or 600 serial chemistry reactions. To get 300 serial chemistry reactions to complete with accuracy, you need near-perfect completion at each step. Imagine a chemistry that is 99.99% complete at each step. After 300 serial steps, (99.99%)^300 or ~97% of the strands are still in phase, which leaves ~3% of your 1000 DNA strands out of phase, or 1-3 bases out of synchrony with the leading strands.
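The phasing arithmetic above is easy to check. Here is a minimal sketch, assuming each strand independently completes each cycle with the same probability and never recovers once it falls behind (a simplification; the function name is my own, not from any vendor software):

```python
def in_phase_fraction(step_efficiency: float, cycles: int) -> float:
    """Fraction of strands in a cluster that completed every single cycle.

    Assumes each cycle succeeds independently with probability
    step_efficiency, so the survivors after N cycles are efficiency ** N.
    """
    return step_efficiency ** cycles


if __name__ == "__main__":
    for cycles in (100, 300, 600):
        frac = in_phase_fraction(0.9999, cycles)
        print(f"{cycles} cycles: {frac:.2%} in phase, {1 - frac:.2%} out of phase")
```

Dropping the per-step efficiency even slightly (try 0.999) shows how quickly the out-of-phase fraction grows with read length.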

These 2 details are what limit ensemble sequencers like Illumina, Ion Torrent/454, SOLiD, and Complete Genomics/MGI: over what length can you faithfully and uniformly amplify billions of diverse molecules, and how efficient is each step in your serial synthesis of DNA?

When you turn to sequencing single molecules, a different real time approach is taken. Let the molecules rip. You don’t try to start and stop DNA polymerization at each step so you can read it and synchronize all billion clusters or beads on an array. You build a real time detector that can keep up with the rate of synthesis on single molecules. This is optically very hard to do and as a result some electrical sensing methods have emerged as the real powerhouses of 3rd generation sequencing.

Why are optics hard at this scale?

Because the molecules are smaller than the diffraction limit of light. Light below 200nm wavelength is very damaging to DNA, and individual bases of DNA are angstroms apart, so even a single-photon laser can't resolve one base from another at the single molecule level. Zero mode waveguides are needed to create zeptoliter-scale observation volumes with light-based systems (Pacific Biosciences). These illuminate such a small volume that if you keep the fluorescent bases very dilute, you will see one stall in the observation volume as the polymerase incorporates it. The process of incorporation has to liberate the dye so that the next base is not obscured by the first. These systems use triphosphate bases where the dye is on the gamma phosphate, which is liberated upon DNA incorporation.

An alternative to using light is to measure picoamps of current as DNA is threaded through a very small pore. As the DNA is threaded through such a pore, the current changes based on the sequence context that is in the pore. Both of these techniques can keep up with the rate of polymerases (1 to 400+ bp/sec).

Only Pacific Biosciences and Oxford Nanopore are capable of real time (asynchronous) single molecule sequencing, where the polymerases or motor proteins are allowed to rip at 1 to 400 bases per second and each molecule proceeds at its own rate, through its own detector. Then you have to make thousands to millions of detectors that can do this in parallel to get the throughput up and make it cost effective.

All single molecule methods suffer from poor signal to noise. It's just very difficult to read molecules at 100 bp/sec and get everything right. There are some clever inventions in place to error correct these data.

Pacific Biosciences relies on fluorescence and as a result requires a $600K-$1M optical system. It sequences circular molecules so it can lap the molecules many (20-30) times to improve the accuracy. It also slowed the polymerases down to 1-3 bases/sec and has earned the title of the most accurate DNA sequencer alive today. It can read 100Kb molecules (10-40Kb is the sweet spot), but the megabase reads are really only seen with the Nanopores.

Since most single molecule sequencing error is random, reading the molecules multiple times has a compounding improvement in accuracy.
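To see why repeated passes compound, here is a rough sketch of a per-base majority vote. It assumes errors are independent between passes and, pessimistically, that every wrong call votes for the same wrong base. Real consensus callers are more sophisticated, so treat this as an upper-bound illustration with a function name of my own choosing:

```python
from math import comb


def consensus_error(per_pass_error: float, passes: int) -> float:
    """Probability that a simple majority vote calls a base wrong.

    Sums the binomial probability that more than half of the passes
    are in error. Intended for odd pass counts (no ties).
    """
    majority = passes // 2 + 1
    return sum(
        comb(passes, k)
        * per_pass_error ** k
        * (1 - per_pass_error) ** (passes - k)
        for k in range(majority, passes + 1)
    )


if __name__ == "__main__":
    for passes in (1, 3, 5, 9, 21):
        print(f"{passes:>2} passes: consensus error {consensus_error(0.10, passes):.2e}")
```

With a 10% per-pass error, 3 passes already bring the per-base error down to 2.8%, and it keeps falling with every additional pair of passes.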

Nanopores (ONT) can only read single strands of DNA threaded through a pore, so the circular consensus trick doesn't work for them. Instead they rely on the DNA being double stranded, on peeling that duplex into 2 strands, and on the fact that the strand most likely to be read through the pore next is the reverse complement, as it is most proximal to the pore once the 1st strand is threaded through. They employ some clever tricks by surrounding pores with tether DNA that can hybridize to a Y adaptor you ligate onto your DNA library. This helps to keep the opposite strand nearby so it does not float away once the first strand is threaded through.

They (ONT) also attach an ATP-driven motor protein to the Y adaptor (Pink-below). This protein unwinds dsDNA into single stranded DNA much like a helicase, but its rate of unwinding is governed by ATP concentration. This is a very important governor, as you want the DNA to go through the pore (Aqua) at a constant rate and not just rip through under electrophoretic conditions.

This symphony of electric chemistry can have some error modes. You need both ends of your dsDNA to get these Y adaptors ligated on (see below). You need 4 ligation events to occur, 2 on each strand. 50% of the time you can get a single adaptor and only get a simplex read out of the pore. If only 3 of the 4 ends ligate, you also get a simplex read.

If your DNA has a nick in it, you'll get a full length 1st strand read and a partial read on the 2nd strand. Reducing or repairing nicks in DNA purification becomes critical.

When all 4 ends get ligated appropriately, you get duplex reads, which jump the accuracy by an order of magnitude (Q20→Q30 scores). Q20 = 1% error rate. Q30 = 0.1% error rate.
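The Q scores quoted above follow the standard Phred definition, Q = -10 * log10(error probability). A small sketch of the conversion in both directions (the function names are mine):

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score into a per-base error probability."""
    return 10 ** (-q / 10)


def error_to_phred(error_probability: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    return -10 * math.log10(error_probability)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: {phred_to_error(q):.4%} error rate")
```

Every 10 points on the Q scale is one order of magnitude in error rate, which is why the simplex-to-duplex jump from Q20 to Q30 matters so much.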

The breakthrough in this approach is that the accuracy is independent of the length of the molecules. You can get perfect 40Kb reads.

The signal processing of this system is more expensive than the sequencer. You need high powered GPUs to get the most accurate data and these can cost $6K (more than the sequencer).

With all this hard work, you get another novel detail…

The electrical signals can see the methylation status of the DNA. Other methods can get partial pictures of methylation (see B, C, and D), but they all require different methods for different types of methylation, whereas ONT just sees them all.

This week, I discussed this topic with Steve Kirsch.

A critical difference between Process 1 and Process 2 in the Pfizer trial was how they synthesized the DNA.

Process 1 used PCR to amplify the Spike region from a plasmid. The process of PCR erases the methylation status of DNA.

The figure below is from a Nature Biotech paper we published, where we described a way to amplify DNA and put a random methylation code back into the DNA during PCR to evade most gene patents. The patent law is odd. DNA has to be modified by man in order to be patentable, and Justice Clarence Thomas ruled that PCR or cDNA synthesis qualified as such a modification because it altered the DNA. This is an off-target rabbit hole for now, but the figure helps people visualize how the act of PCR erases the natural methylation state of DNA. We developed methods to incorporate methylC during PCR, but this process is random and doesn't reflect the methylation code that exists in the native DNA. It's much like the use of N1-methyl-pseudouridine in modRNA synthesis, where every U gets replaced. Here every C gets replaced, or some % of them if you use a blend of dCTP and dmCTP.

How is Process 1 different than Process 2?

Process 2 used plasmid DNA from E. coli, and if you use the wrong E. coli strain to manufacture the plasmids, you get Dam and Dcm methylation of the plasmid DNA. Dam and Dcm are methyltransferases in E. coli that methylate the A in GmATC and the 2nd C in CmCWGG sequences. W is the IUPAC code for A or T, so both CCTGG and CCAGG get methylated by Dcm.
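For readers who want to locate these sites in a plasmid sequence themselves, here is a minimal sketch that expands the IUPAC W code and reports the 0-based forward-strand position of the methylated base in each motif. The toy sequence and function name are made up for illustration, and only the forward strand is scanned:

```python
import re

# Only the IUPAC codes needed for GATC and CCWGG; W means A or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]"}


def find_sites(sequence: str, motif: str, mod_offset: int) -> list[int]:
    """Return 0-based positions of the modified base for each motif hit.

    mod_offset is the index of the methylated base inside the motif,
    e.g. 1 for the A in GATC and 1 for the 2nd C in CCWGG.
    """
    pattern = "".join(IUPAC[base] for base in motif)
    # A lookahead lets overlapping occurrences be reported as well.
    regex = re.compile(f"(?=({pattern}))")
    return [m.start() + mod_offset for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    toy = "TTGATCAACCAGGTTCCTGGA"  # hypothetical sequence, not the plasmid
    print("Dam (GATC) sites:", find_sites(toy, "GATC", 1))
    print("Dcm (CCWGG) sites:", find_sites(toy, "CCWGG", 1))
```

Because W is expanded to a character class, a single pass here covers both CCAGG and CCTGG.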

My most recent preprint uses these Nanopore sequencers to measure the methylation status of the plasmid DNA and we can see it is indeed methylated at GATC and CCWGG sites.

The CCWGG signal was a little trickier to pull out of the data, as that degenerate W base required 2 passes through modkit. This is a methylation software tool that doesn't run well on macOS, so you really need a Docker container to run it.

The version I used required you run the CCAGG and CCTGG motifs through the software separately and then compile the results. The Yellow Tick marks in the middle are where these motifs exist in the plasmid sequence and their % methylation is captured by the heatmap.
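Compiling the two passes can be as simple as merging the per-site calls by position. The sketch below assumes each pass has already been reduced to (position, percent methylated) pairs. That is a hypothetical simplified layout for illustration, not the actual file format the tool writes, and the numbers in the usage example are invented:

```python
def merge_motif_calls(*call_sets: list[tuple[int, float]]) -> list[tuple[int, float]]:
    """Merge per-site methylation calls from separate motif runs.

    Each call set is a list of (position, percent_methylated) pairs.
    A position should belong to only one motif, so duplicates are
    treated as an error rather than silently overwritten.
    """
    merged: dict[int, float] = {}
    for calls in call_sets:
        for position, percent in calls:
            if position in merged:
                raise ValueError(f"position {position} reported by two runs")
            merged[position] = percent
    return sorted(merged.items())


if __name__ == "__main__":
    ccagg_calls = [(9, 80.0), (120, 77.5)]   # invented example values
    cctgg_calls = [(16, 75.0)]               # invented example values
    for position, percent in merge_motif_calls(ccagg_calls, cctgg_calls):
        print(position, percent)
```

Sorting by position keeps the merged calls in plasmid order, which is what you need to line them up against the motif tick marks.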

You’ll also notice the scale on the m6A heat stripe is about 2X the scale I have in my preprint. That is a reflection of my initial pass on this only looking at 1 strand. GATC is palindromic so the methyl A on the opposite strand is actually 1 base offset (under the T) and I under-counted the methylation by a factor of 2 in the first pass.
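The one-base offset is easier to see in code. For a GATC starting at forward-strand position i (0-based), the top-strand A sits at i+1 and the bottom-strand A sits opposite the T at i+2. A small sketch, with a function name of my own and a made-up example sequence:

```python
def gatc_m6a_positions(sequence: str) -> list[tuple[int, str]]:
    """List every methylatable A in GATC sites on both strands.

    Positions are 0-based forward-strand coordinates. The '+' entry is
    the A of GATC on the top strand; the '-' entry is the A on the
    bottom strand, which pairs with the T one base downstream.
    """
    seq = sequence.upper()
    sites = []
    start = seq.find("GATC")
    while start != -1:
        sites.append((start + 1, "+"))  # A on the top strand
        sites.append((start + 2, "-"))  # A on the bottom strand, under the T
        start = seq.find("GATC", start + 1)
    return sites


if __name__ == "__main__":
    print(gatc_m6a_positions("TTGATCAA"))
```

Counting only the "+" entries reports half of the methylatable adenines, which is the factor of 2 described above.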

Once you have modkit installed, you can feed the methylation motif you are looking for and it will extract the methylation data from the bam file.

This will end up in a bed file, as seen below, with % methylation in column 10.

These data demonstrate we have both m6A GA*TC methylation and 5mC CC*WGG methylation in the Pfizer Process 2. This is materially different DNA than what was in Process 1, as the methylation signals on the plasmid DNA are more stimulatory to cGAS-STING.

Here are some SUP Duplex reads that hit the Pfizer Vaccine.

Here are some SUP simplex reads that hit the Pfizer Vaccine.

Longest one is 4,054 bases!

This is confirmed with BLAST, with the ends trimmed a bit. The search is against Didier Raoult's submission in NCBI.

Conclusions

The Pfizer change from Process 1 to Process 2 left in materially different DNA than what was tested in the trial. There are several papers in the literature listed above that demonstrate bacterial methylation signals are more inflammatory than DNA derived from PCR. You can read more about this in my recent preprint. The more vials we sequence, the more we discover longer and longer DNA contaminants, suggesting full length 7.8Kb molecules will soon be found. Having documented their failure to linearize the plasmid DNA in the preprint below, it's possible small amounts of full length circular plasmids exist in some vials. These are replication competent plasmids that will make more of themselves once in a mammalian cell, much like a virus with a self-amplifying spike sequence and a higher capacity to integrate (as it's DNA).

This may explain the spike persistence seen in various studies.