Wednesday, April 1, 2026

I'm convinced! Illumina Protein Prep might be a game changer!


Brazenly borrowed from this whitepaper. 

If I have a super power as a person or a scientist, it is that I'm very okay with being wrong. It helps that it happens all the time, and that I have friends and a domestic partner who are way way way smarter than me. I'm used to being the dumbest person in the room, and I can just discover that I'm wrong.

And boy - was I wrong about this new Illumina Protein Prep thing. 

I thought it was just a repackaging of SomaScan, a product that has had the strangest propensity for avoiding the very simple experiment that would make me stop making fun of it. After a decade I was starting to think that 1) They were doing it just to get on my nerves or 2) They had done it - and aptamer off binding could not be used to estimate a protein concentration in a complex mixture in any meaningful way (translation - it doesn't work). 

But Illumina has been killing it for years and years! We have petabytes full of Illumina short read sequencing data all over the world. Sure, you could argue they missed the long read sequencing bandwagon and that is a little weird. But a behemoth of an organization like that has the money and the people to avoid becoming complacent.

So when Illumina acquired whatever SomaScan had changed their name to that month, you had to think "wait. maybe there IS something to it!" 

And here I sit while tuning a TOF after a power outage that caused me to miss the last day of a conference. Embarrassed and corrected.

A problem with aptamers is that they are only linear within an EXTREMELY narrow dynamic range. If sample A has x target and sample B has 2x target, you can basically see that difference. If sample C has 10x target, you're probably okay, but you're at the end of the dynamic range. If sample D has 1,000,000,000x more protein, you get about the same value as sample C. More on that and other problems with aptamers here. 
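
Here's a toy sketch of what that plateau looks like. To be clear, the simple one-site saturation curve and the `kd` and `max_signal` values are my assumptions for illustration, not measured aptamer behavior and not anything from a vendor.

```python
def aptamer_signal(conc, kd=1.0, max_signal=100.0):
    """Toy one-site binding curve (assumed): signal plateaus once conc >> kd."""
    return max_signal * conc / (kd + conc)

# Hypothetical samples, expressed as multiples of some baseline amount x
samples = {"A (1x)": 1, "B (2x)": 2, "C (10x)": 10, "D (1e9 x)": 1e9}
signals = {name: aptamer_signal(conc) for name, conc in samples.items()}

for name, signal in signals.items():
    print(f"{name:10s} signal = {signal:6.2f}")
```

In this toy model A and B are easy to tell apart, C is already crowding the ceiling, and D lands within a few percent of C despite being a hundred million times more concentrated.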

This new product is so much more than the original product it was based on - because after you have your aptamer readout you NOW do NGS sequencing on tags on those aptamers. And then you do the quantification off of the NGS readout! By counting the reads! And we all know that there is no better way of doing quantification than counting things. And if there is, it's probably counting an indirect measurement of an indirect measurement. Wait. Didn't we do something like that before? 

Okay, but that doesn't fix the linear dynamic range issue of the original measurement. But now you've got rock solid absolutely amazing quan on those narrow measurements, right? 

And this is where I change my mind about this whole thing! 



This group took a good hard look at precision and accuracy in a pile of different ways to do RNASeq, with a special emphasis on low input techniques like scSeq and scNSeq, but lots of work on the bulk as well.

The CVs ARE AMAZING.

Less than 1! Across the board! Okay, fancy mass spec people, tell me how many times you've reported a CV <1 across an entire dataset. I'd love to say that I only report out proteins with less than 10% CV, but we use a 20% CV cutoff.

Oh...fuuuuuuuuuuuuuuuck..... they mean CV%, right? Not CV 1 = 100%??

Oh. So...a CV of 1 is a CV% of 100%. Right. So I'm going to puke. Hey! And the new TIMSTOF water pumps reset their temperature after a power outage. That's cool. So..I have more time since I have to set my water cooler temperature to the temp written on it in sharpie (25C) and I assume wait for this thing to re-equilibrate...
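
For anyone else who needs the sanity check, here is the arithmetic as a minimal sketch. The replicate values are made up, and I'm assuming the usual definition (sample standard deviation divided by the mean).

```python
import statistics

def cv(values):
    """Coefficient of variation as a fraction: sample SD / mean."""
    return statistics.stdev(values) / statistics.mean(values)

def cv_percent(values):
    """The same quantity expressed as a percentage."""
    return 100.0 * cv(values)

replicates = [90.0, 100.0, 110.0]  # made-up replicate intensities

print(f"CV  = {cv(replicates):.2f}")
print(f"CV% = {cv_percent(replicates):.0f}")
```

Those replicates have an SD of 10 on a mean of 100, so the CV is 0.1 and the CV% is 10. A CV of 1 would mean the SD is as large as the mean itself.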

Okay, so maybe we need to look at these numbers a little more. 


It's hard to see, but there is a red line at a CV of 0.1, or a CV% of 10. As you might notice, they don't often get very close to that number. Now, we could argue this is cheating. The maximum number of cells analyzed in each study was used to generate a pseudobulk metric. So this is averaging thousands or tens of thousands of cells. What we need is - yeah! 

This paper - 


Which features a super duper method for improving RNASeq reproducibility in measurements! 
And - ACROSS A GENE they get to 


Around 22 to 24 CV%. Ouch. 

This is where it gets way weirder. My TOF is finally back so I need to go do work, but do you think they're attaching a huge gene to each aptamer? Or do you think they're attaching a single short oligo? I'm no expert, but I suspect it's the latter and this is like global proteomics CV% on a single peptide compared to across a protein. The numbers get better when you've got a higher sequence coverage.
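
The coverage point is easy to convince yourself of with a quick simulation. This is only a sketch: I'm assuming each individual measurement is independent, normally distributed, and has a 25% CV, all of which are invented numbers and not anything from the papers above.

```python
import random
import statistics

def cv_of_average(n_measurements, single_cv=0.25, n_replicates=2000, seed=0):
    """Simulate the CV of an average over n independent noisy measurements."""
    rng = random.Random(seed)  # fixed seed so reruns give the same numbers
    true_value = 100.0         # arbitrary true abundance
    averages = []
    for _ in range(n_replicates):
        draws = [rng.gauss(true_value, single_cv * true_value)
                 for _ in range(n_measurements)]
        averages.append(statistics.mean(draws))
    return statistics.stdev(averages) / statistics.mean(averages)

results = {n: cv_of_average(n) for n in (1, 4, 16)}
for n, value in results.items():
    print(f"n = {n:2d}  CV% ~ {100 * value:.1f}")
```

Under those assumptions the CV shrinks roughly with the square root of the number of measurements you average, which is why a value built from many peptides (or many reads across a whole gene, or thousands of pseudobulked cells) looks so much tighter than one built from a single short sequence.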

I'll be honest, I started out this post as an April Fool's joke, but it turned out that I learned a lot. 

I'm not going to change the title, though. I think that this product will change the game and I don't think it's going to be in a great way. On paper this product looks like it will still not be able to return quantitative protein values, and it looks like when it does, the variability in metrics will be worse than the product it is based on due to the difficulty in reproducing the output data consistently. 

We'll see, though. If you are using this product, or have access to it and you want to do the easy and obvious experiment to show me I'm wrong and this works, please reach out. In the meantime I'll still tell every conference audience and every classroom I'm in front of that there is zero evidence that this stuff can quantify a protein.