Tuesday, May 13, 2025

PFly - Is this the missing link in LCMS proteomics deep learning models?




Okay - so this one has bugged me (and a lot of other people) for a long time - we can do a pretty great job now of predicting peptide fragmentation (unless PTMs are involved, which rules out the vast majority of them). Supposedly we can do a solid job of predicting peptide elution patterns (exclusively for C-18 reversed phase chromatography). 

What has been missing is predicting what peptides from each protein will actually ionize (or fly). 

This has been tried before, btw -


- however, as is often the case in academic software development, many of these lead to 404 errors or software that only runs in Windows 95 - or....well....they aren't very good. 

I'm a little sad to say this, but when I did my normal round of sending around a paper that I'd just found and was reading at lunch yesterday, the responses were universally ...skeptical at best.... but maybe this is finally it! 

Introducing pFLY! (I read it at lunch yesterday and it's faded in my mind a little, but I'm just about 99.0% sure that the p stands for Pug) 


??

Friday, May 9, 2025

Benchmarking SILAC workflows in 2025! Wait. What?




Okay - for just a second I thought I'd mistakenly scheduled this article to post about 10 years in the future, but apparently this really is new.


Mandatory - 


For any of you scientists out there who aren't getting up in the morning complaining about your joints, SILAC was something we used to do in proteomics all the time. We did it to the point that it was called the "gold standard for proteomics quantification" - and not just by the companies that sold us the heavy labeled amino acids, which caused every plate of cells to cost $183,000. At the time that was a big deal, because doing good comprehensive proteomics on an Orbitrap XL in 2009, when I was doing it, required 15 plates of cells. If you did a great job, 3 weeks of run time would get you 3,000 proteins quantified. Please note that some of these numbers are exaggerated.

Anyway, you'd grow cells passage after passage in media with heavy lysine and arginine until you were pretty sure that all your proteins had heavy labels. Then you'd pretend that cells were way too dumb for something like 18 passages in very strange isotopically labeled amino acids to have any possible phenotypic effects. Then you'd take cells grown without it, treat one set with drug and one with DMSO, then pretend that DMSO has no phenotypic effects. Then you'd lyse your cells and mix your light and heavy proteins or digested peptides (I forget, I last used it for a paper in 2010? 2009?) and run them. At the MS1 level you'd see your heavy/light pairs and quantify off those.
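If you've never seen it, here's a minimal sketch of that MS1-level arithmetic, assuming the standard 13C6,15N2-lysine and 13C6,15N4-arginine labels. The peptide and the intensities are made up for illustration.

```python
# Minimal sketch of SILAC MS1 quantification, assuming the standard heavy labels:
# 13C6,15N2-lysine (+8.014199 Da) and 13C6,15N4-arginine (+10.008269 Da).
# Every K and R in the peptide carries a label. The example values are made up.

HEAVY_SHIFT = {"K": 8.014199, "R": 10.008269}  # Da per labeled residue

def heavy_mz_offset(peptide: str, charge: int) -> float:
    """m/z offset between the heavy and light precursors of a peptide at a given charge."""
    mass_shift = sum(HEAVY_SHIFT.get(aa, 0.0) for aa in peptide)
    return mass_shift / charge

def silac_ratio(heavy_intensity: float, light_intensity: float) -> float:
    """Heavy/light ratio from the MS1 intensities of the pair."""
    return heavy_intensity / light_intensity

# A 2+ tryptic peptide ending in K sits ~4.007 m/z above its light partner.
peptide, z = "LVNELTEFAK", 2
print(f"heavy-light m/z offset: {heavy_mz_offset(peptide, z):.4f}")
print(f"H/L ratio: {silac_ratio(3.2e6, 1.6e6):.2f}")  # ~2.0 = ~2x more in the heavy condition
```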

There were drawbacks, some of which could probably be inferred from my description of the method above, but a lot? at least some? good science came from it. I can't think of any off the top of my head, but you've probably heard my philosophy that it's best to ignore everything in proteomics before 2017 - and this technique was largely gone by then. 

However - if you did have some reason to do SILAC in 2025 - I bet you'd wonder what could and should process the data! And here you go! 

Silliness aside, I've never considered doing SILAC DIA. 

Oh yeah, you can do some really cool stuff with SILAC by introducing it and then changing the media. That can provide measurements of protein half-life and protein turnover and things like that. There are reasons. Just don't use it for pairwise drug treatment stuff. There are much better ways to do those things now! 

Thursday, May 8, 2025

Top-down proteomics of Alzheimer's tissues reveals entirely new targets!

 


I've got a lot to do today, but this new study is a jaw-dropper.


Quantitative top-down (intact protein!) proteomics. Of over 100 individuals (I think over 1,000 samples?!?!?), with multiple proteoform-level alterations that appear differential in Alzheimer's? I will come back to this, but you should absolutely check this out. 

Wednesday, May 7, 2025

Lessons learned migrating proteomics to HPC (high performance computing) environments!

 

Sometimes this blog is just what I learn as I'm going through learning something for myself - and this is clearly one of those posts. 

One thing that was not emphasized nearly as well as it could have been during my interviews at Pitt was the absolutely amazing, world class High Performance Computational/Computer/Cluster (HPC) framework that we have. 

It took a little work and me bugging colleagues with dumb questions, but I've got some workflows going that need a lot of firepower! 

Namely things like FragPipe open search - and R packages that almost inevitably require ludicrous amounts of RAM. 



Things I've learned so far: 

1) The time to get my data to the HPC can be a bottleneck worth considering. My TT Ultra2 is generating around 160GB of data/day right now - around 1.5GB per single cell and closer to 4GB for libraries and QC samples, which seems to average out pretty close to 160GB. Transferring 1 day of files to the HPC seems to take around 1-2 hours. Not a big deal, but something to consider if you're the person prepping samples, running the instruments, writing the grants and papers, writing blogposts and picking your kid up on time from daycare every day. Worth planning those transfers out (there's a quick back-of-the-envelope sketch after this list). 

2) NOT ALL PROGRAMS YOU USE WORK IN LINUX. FragPipe, SearchGUI/PeptideShaker, and MaxQuant are all very, very pretty in Linux. Honestly, they look nicer and probably run better than in Windows. DIA-NN will run in Linux, but you do lose the GUI and have to go command line. What you can do, though, is set up your runs in the GUI and then export those from DIA-NN. Maybe I'll show that later. 

3) You may need to have good estimates of your time usage. In my case I currently get a 50,000 core hour allotment. If I am just doing 80 FragPipe runs, I need to think about: cores I need x number of hours I need those cores. I can't request more than 128 cores simultaneously right now (for some reason, yesterday I could only request 64 with FragPipe - I should check). But if I need 128 cores - do I need those for 10 hours? If so, that's 1,280 core hours I will blow through (see the sketch after this list). 

Since MSFragger itself is ultra-fast, but match between runs and MS1 ion extraction are less fast and use fewer maximum cores per file, there isn't much of a difference for a small dataset over just using 32 cores. The bottleneck steps aren't the ones that keep scaling up forever with more cores.

4) Things that are RAM dependent may be WAY WAY FASTER. I think we scale to 8GB of RAM/core on our base clusters here, so 32 cores gives me 256 GB of RAM! If your program is fast enough at reading/writing to offset a lack of RAM, or can use every bit of RAM around to maximum effect, those things can be much, much faster.

5) Processes that are dependent on per-core speed may be slower. For a test, I gave FragPipe 22 14 cores on a desktop in my lab and 14 cores on the HPC, with the same 2 LFQ files. Unsurprisingly, you can really crank up the GHz on desktop PCs, whereas it makes sense to have lower per-core clock speeds when you have 10,000 cores sitting around. 

6) You probably need help with all installations and upgrades. Most of us are used to that by now, though. I can upgrade my lab PCs to FragPipe 23 today; I need to put in a service request to have someone upgrade me on the HPC. 

7) You may have to wait in line. I tried to set up some FragPipe runs before bed and requested the HPC allotments, then dozed off in my chair waiting for my turn. When I woke up, the clock had already started ticking. I wasn't using my cores, but I had blocked them so no one else could use them, so they did count against me. 
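Since items 1, 3 and 4 above are really just arithmetic, here's the back-of-the-envelope version in Python. The numbers are the ones from this post; the 30 MB/s effective transfer rate is just a value I picked because it reproduces the 1-2 hour transfers I'm seeing, and the function names are mine.

```python
# Back-of-the-envelope HPC planning using the numbers from this post.
# Nothing cluster-specific here - just the arithmetic from items 1, 3, and 4.

def transfer_hours(gb_per_day: float, mbytes_per_sec: float) -> float:
    """Hours to move one day of raw files at a given effective throughput (MB/s)."""
    return (gb_per_day * 1024) / mbytes_per_sec / 3600

def core_hours(cores: int, wall_hours: float) -> float:
    """Core hours burned by one job = cores requested x hours you hold them."""
    return cores * wall_hours

def ram_available_gb(cores: int, gb_per_core: float = 8.0) -> float:
    """RAM that comes along with a core request (8 GB/core on our base nodes)."""
    return cores * gb_per_core

daily_gb = 160                      # ~160 GB/day off the TT Ultra2
print(f"~{transfer_hours(daily_gb, 30):.1f} h to transfer a day of data at 30 MB/s")
print(f"{core_hours(128, 10):,.0f} core hours for a 128-core, 10 h FragPipe job")
print(f"{ram_available_gb(32):.0f} GB of RAM with a 32-core request")

allotment = 50_000                  # my current core-hour allotment
print(f"~{allotment / core_hours(128, 10):.0f} such jobs before the allotment is gone")
```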

I'll probably add to this later - but I highly recommend this recent study out of Harvard, which has been my go-to guide.

Also this study trying to address the LFQ bottlenecks! 

Monday, May 5, 2025

6,500 proteins per (LARGE) mouse cell (oocyte) and studying age related changes!

 


I wasn't sure if I liked this study or not, but then I got to an interesting and very counterintuitive observation, and then to the biology, and decided that I did. 


It's not the first single cell oocyte paper we've seen, and it should be noted that they are quite big cells. These authors estimated them at about 2 nanograms of protein, which seems right based on what I remember from another study. 

One thing that I find really surprising here is that - unlike previous studies - this group tried the reduced volume of 384 well plates and found autosampler vials more reproducible. I'm stumped on this one. This is contrary to everything I've seen and to what Matzinger et al. found, and is frankly just counterintuitive across the board. 

The surface area of an autosampler vial is huge compared to the bottom of a 384 well plate. I do find it a complete pain in the neck to calibrate some autosamplers for accurately picking up out of 384 well plates, but I don't know how much that plays in here. Also, some glass binds fewer peptides than some plastics. Insert shrug. 

That aside, the authors put one oocyte per vial with the CellenOne and then add digestion mix. Incubate and inject. 60 min run-to-run on a 50um x 20cm column, running diaPASEF with a 166ms ramp time. 

Data analysis was in SpectroNaut. 

Okay, and the reason this is escaping the drafts folder is because the biology is really cool. They look at both artificial (handling) and natural (aging linked) conditions and how they affect single oocytes. There are a lot of people out there who care about how those things (probably not in mice, but maybe?) change throughout the aging process! 

Editors make statement on proteomics transparency AND a video for how to make your data available!

 


I wonder if this was inspired by some of the same things that I was just complaining about? 

Okay, so rather than just complain about it, I also went crowdsourcing to find resources - and here is a 4 minute video showing you how to make your data publicly available on PRIDE! 



Sunday, May 4, 2025

Use single cell proteomics (SCP) to add biological relevance to single cell sequencing (scSeq) data!

 


Transcript abundance tells you what a cell wants to do.

Peptide/protein abundance tells you what the cell is actually doing.

You can get measurements of the transcripts of tens of thousands of cells with a few hours of effort - pass the work off and the reports come back in a few days.

Each single cell proteome is a lot slower and a lot more expensive, but worth it for the whole... biological relevance... thing.... 

What if you could do a scSeq on tons and tons of cells - and single cell proteomics (SCP) on a small number to correct all that scSeq data? Would you be downloading that as fast as you possibly could? 



Saturday, May 3, 2025

New data analysis strategy in SpectroNaut leverages diagonalPASEF features!

 

I've been on the fence about diagonalPASEF, but I guess when my SpectroNaut license goes live it's probably time to try it. 

I legitimately don't know who came up with diagonalPASEF - there were too many cool methods too fast for me to even try them. But it almost looks like 3 groups (all European...of course....) all had very similar ideas. But on my new instrument it's just a button, so Imma just push it and see what happens.

The bummer is that I do have to take my source off and calibrate the instrument with the ESI source - which I haven't done since it was installed (you can do good mass and TIMS calibration now without the ESI source, but you do need to sensitivity tune and/or quad tune with it on for diagonalPASEF).

But this is legitimately smart looking.


Whoa! I went to the new DIA-Neural Network website (Aptila.bio) to see if it has anything about support for this mode that I should read, and found something I didn't know was publicly shared! Also, I'm not sure yet whether diagonal is supported, but it does look like I can lie to DIA-NN and say that it is SLICE-PASEF. We'll see! 

Y'all, this ASMS is going to be sooooo crazy. Despite the lack of Europeans and the fact that none of us in the US have any money to do science.....




Friday, May 2, 2025

Illumina protein prep! For when you truly do not care how much of each protein is in your sample!

 


Everyone, I think it is time to admit that the biologists have different opinions about what is important in proteomics. And maybe we're the ones who are wrong. This field originated largely in analytical chemistry, where they drilled accuracy and precision into us. Sure, there are reasons for accurate protein measurements, like when you're in clinical chemistry, and maybe my time in those dark basement labs ruined my brain into thinking that when we measure a protein we actually really want to know how much of that protein is there. 

The biologists want to detect a protein and they want to be able to say that in condition 1 vs condition 2 one of those conditions might possibly maybe have more protein. They don't care at all how much more protein. And - again - I'm the one here who is probably wrong. Hannah did this phenomenal thesis project in my lab where she worked out the nanomolar concentrations of 7k or 8k proteins at the blood brain barrier. We were operating under the assumption that absolute concentrations have value. Like - if you're doing medical imaging you know that proteins below xxnM just can't be visualized with any of today's technology. Don't try. And maybe that's just one outlier where we absolutely have to know the protein concentration.

Maybe the other clinical assays, like CRP and troponin and ALT/AST ratios are also outliers. Sure - whether you're going to get a wire jabbed into a blood vessel might be determined by the absolute amount of troponin in your blood right now as compared to 30 minutes ago. But it really appears that for the vast majority of new people in proteomics they want to know - is there probably less protein here and more protein there? 

So if you really just want to detect proteins and you truly do not care how much of the protein is around - and you've got a lot of money - do I have a technology to show you! 

SOMASCAN COUPLED TO NEXT NEXT GEN SEQUENCING! It's called Illumina Protein Prep! 

For real, it's a real thing. 

First of all - let's look at what aptamers are and what they do. I'll back way up because some people looked at me like I was out of my mind when I talked about the proteomics assay with the lowest quantitative dynamic range.

Let's start with this review from ancient history (2010). Don't worry, this is a physical limitation of protein-oligonucleotide interactions. Not much has changed, but there are more modern references below. 


Aptamers are oligonucleotides and it's really really cool that one of the cheapest and easiest molecular reagents to make in a customized way can bind proteins at all! Not joking. That's cool stuff. And despite what companies will charge you, they are pennies to manufacture. 

The binding, however, is governed by the dissociation constant (or association constant), and it only behaves linearly over an extremely narrow dynamic range. 


This was taken from the review above. Please note the fluorescence intensity of the blank. In this solo interaction of one aptamer vs. one protein we see a relative increase in aptamer binding from 10nM to 150nM. At 150nM of protein, however, you no longer get a linear response. Lots of reasons for this and I haven't taught stoichiometry since....let's go with a long time.....and I don't want to get into it. 

Imagine you have patient A and patient B, and one has 5nM of IgE in their blood. Well....that's probably about where the blank is, so you get a zero. What happens if you have 1,000nM of IgE? Well...you probably register at about 150nM, maybe a little bit more? Again, maybe you do not actually care in any way whether you have 150nM or 10,000nM? Maybe you're just weird for wanting to know.

What's important here, though, is that each aptamer is like this. It is designed for a very specific protein and each one has its own binding and dissociation constants. It's also important to know that in a complex solution, you're dumping in (in the case of Illumina Protein Prep) about 10,000 of these different aptamers! It is very, very likely that the roughly 1-order linear quantitative dynamic range shown in this figure for an isolated 1 vs. 1 system gets perturbed and doesn't hold up quite as well as the above.

Edit - 5/3/2025, because I'm self-conscious about the crazy number of views this 20 minutes of typing has gotten in a single day. 

This is how a pile of aptamer measurements works:

True concentration of protein X - 0 nM - Aptamer readout - not zero

True concentration of protein X - 5nM - Aptamer readout - same as the blank

True concentration of protein X - 10nM - Aptamer readout - 2x the blank

True concentration of protein X - 20nM - Aptamer readout - 2x the 10nM readout - This is good! You're in your dynamic range! 

True concentration of protein X - 50nM - Aptamer readout - 3x the 10nM readout - It's still higher, but you've already left that little window where your aptamer binding corresponds linearly to the amount of protein (your linear quantitative dynamic range). 

True concentration of protein X - 100nM - Aptamer readout - 4x the 10nM readout - It's still higher, but you now need fancy math to have some way of estimating how much of the protein is there based on the aptamer binding response. 

True concentration of protein X - 1000nM - Aptamer readout - about 5x the 10nM readout - You've maxed out your concentration range and all you know is that you've got more than 100nM.

True concentration of protein X - 10000nM - Aptamer readout - about 5x the 10nM readout - same as above. 

This is important because, as you'll see in the very last panel, it is pretty common in mass spectrometry to get a linear concentration/signal increase across this ENTIRE range. 
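If you want to see that flattening for yourself, here's a toy sketch assuming a simple one-site (Langmuir-type) binding model. The Kd, background and scaling values are invented to roughly mimic the numbers above - they aren't from any real aptamer - and the "MS" column is just an idealized linear response thrown in for contrast.

```python
# Toy one-site binding model (fraction bound = C / (Kd + C)) to show why an
# aptamer readout flattens out. Kd, background and scale are invented to roughly
# mimic the numbers in the post - not real aptamer data.

def aptamer_readout(conc_nM: float, kd_nM: float = 40.0,
                    background: float = 1.0, scale: float = 5.0) -> float:
    """Simulated readout: blank-level background plus a saturating binding term."""
    fraction_bound = conc_nM / (kd_nM + conc_nM)
    return background + scale * fraction_bound

def ms_readout(conc_nM: float, slope: float = 0.1) -> float:
    """Idealized mass spec response: linear across the whole range."""
    return slope * conc_nM

for c in [0, 5, 10, 20, 50, 100, 1000, 10000]:
    print(f"{c:>6} nM   aptamer {aptamer_readout(c):5.2f}   MS {ms_readout(c):8.1f}")

# The aptamer column crawls from ~1 to ~6 and has basically maxed out by the time
# you hit the micromolar range, while the idealized MS column keeps scaling with
# concentration.
```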

So - in aptamer measurements - 

A) You almost always see a signal whether or not there is any protein there at all. So....when someone tells you they can detect 1,000 or 10,000 or 100,000 proteins in a sample you need to keep in mind that that is simply how many aptamers they put into the mixture. That doesn't mean they actually detect that number of proteins. They love to mix those terms up. And maybe you see a measurement for each protein aptamer. That does not necessarily mean protein detection. 

B) You can trust that signal corresponds to how much protein is present in only a very narrow concentration range. 

C) Above that 10x concentration range the value you see has no relationship AT ALL to the amount of protein present. You've simply maxed out. 

End 5/3/2025 edits

Again - the figures and review above are old - what can we do in 1 vs 1 relationships in 2025? Here is what I'd consider the high water mark today.


I should go get some lunch, and if you want to read this you should, because it is a solid advance in aptamer binding measurements - that last word is key, because aptamer binding itself is not going to change. These things are limited by physics and chemistry and they simply won't change. Yes, you can select for more efficient aptamers for your protein, but you aren't going to change the fundamentals of dissociation constants and still maintain the proteins in a state in which they can be measured. 

How did they do? Pretty darned good! About 1 order!  


To be fair this study is focused on measuring aptamer binding over a course of time in a single molecule context. This isn't about extending the linear dynamic range of protein measurements. There are things out there about that. In some techniques what they do is have one aptamer that is good at one concentration and another that is better at others. Then you combine the measurements of both to get a better range. There is a preprint somewhere, but I've spent too much time on this.

So....imagine my disappointment - knowing that I couldn't talk about what I knew regarding an Illumina - SomaLogic partnership (I just assume I'm under NDA with every proteomics company in the world now and I just don't share anything until I can Google search it) - when I discovered that this does not appear to be what they did? 

They appear to simply throw in the requirement to own a NovaSeq 6000 or NovaSeq X system to generate - get this - data on up to 384 samples per WEEK, which is 1/3 the speed of O-link? And even slower than mass spectrometry? 

And if you're new here and aren't familiar with the quantitative dynamic range of mass spectrometry - here is the first thing I found searching my desktop. It's a Sciex app note, but this isn't extraordinary data. I can show you real data like this all day. It's actually surprising because normally you think vendor app notes are going to be crazy unachievable data and this is just very normal. 


You spike 0.1 ng/mL of this peptide into rat plasma - you can see it. If you put in 5ng/mL you get a peak that is 6e4 tall. If you put in 500ng/mL (100x more) you basically get a peak that is 100x taller. So...if you want to know how much protein is in your sample, you always have mass spectrometry to fall back on! 
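To put that in code form - anchoring on the ~6e4 peak height at 5 ng/mL from the app note as I read it, and assuming the response simply stays linear across the range:

```python
# Quick check of what a linear response implies, anchored on ~6e4 counts at 5 ng/mL.
# Everything else just assumes the response stays linear.

anchor_conc_ng_ml = 5.0
anchor_height = 6e4
slope = anchor_height / anchor_conc_ng_ml   # counts per ng/mL under linearity

for conc in [0.1, 5, 50, 500]:
    print(f"{conc:>6} ng/mL  ->  expected peak height ~{slope * conc:,.0f}")

# 500 ng/mL (100x the anchor) comes out ~100x taller - which is basically what the
# real data shows, and exactly what the aptamer readout above cannot do.
```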

Thursday, May 1, 2025

Deep multi-omics from blood spots - real life actionable and translatable methods!


There are a lot of increasingly complex ways to measure proteomics in blood. You can use 100 different nanoparticles or mix in aptamers or use double antibody arrays - all things that can be easily translated to the clinic as long as you're willing to pay 

for your next blood test! 

So...what if you took a step back? Maybe 10? And used actual clinically available material? And then what if you fully embraced heresy and used HPLC methods that someone in a clinic would actually be successful doing? 

Sounds like science fiction? If so, you should check this out.