Friday, November 22, 2019
What technology should you use for the highest coverage of single cell proteomics?
Let's go with....ALL OF THEM!
Wednesday, November 20, 2019
Okay -- I now have a PrecisionFDA account, I've just uploaded data to it, and I'm trying to reduce some spectra on it with RIDAR -- and I have no idea why this thing exists, but I like it.
Disclaimer: Since this is an HHS US government thing it might be for US people only? But...I'm in Japan right now and I logged right in, even with the two-factor authentication thing.
You can go to PrecisionFDA here.
What do I know about it?
Well...they are the ones responsible for the CPTAC challenge to identify mislabeled samples -- so they're clearly the good guys. That was cool, even if the start and end of the challenge deadline made it clear they didn't expect anyone interested in participating to have a job. Scientists tend to be busy people, yo. If you want them to volunteer for stuff and they see they have to start and complete it in like 3 weeks, no one is going to take you up on it.
What else do I know about them? I just got a free account and uploaded data to their cluster. There are a bunch of tools there already but it currently looks like all dumb genome and transcriptome stuff, but if someone is going to let me run my tools on their power bill they're cool people in my book.
Tuesday, November 19, 2019
About darned time!
Just accepted at JPR -- the first (as far as I'm aware -- please correct me if I'm wrong) study showing the use of TMTPro (previously TMT 16-plex)!
Quick summary of my rapid readthrough:
1) This group typically uses an NCE of 38 for TMT 10/11-plex reagents; they use 32 for TMTPro (please keep in mind that the proper HCD NCE can vary from system to system, and there are ways to calibrate for that now). The important part is that the HCD is lower/closer to what we use for unlabeled peptides! This is particularly good for those of us still using MS2 for TMT. The authors describe the use of both MS2 and SPS MS3 on their instrument. (And -- in my hands an HCD of 32 on an Orbitrap Fusion lines up pretty close to an NCE of 27 on a Q Exactive -- again, it varies from instrument to instrument, but this all sounds right to me!)
2) The larger tag makes the peptides a bit more hydrophobic (elute later) but it is a shift of a few minutes that can be easily adjusted for
3) When comparing the number of peptide/protein IDs directly, TMTPro results in a few percent fewer identifications -- but you get 5 extra samples done simultaneously, so I still call that a win.
Concise, well-executed little study that will deserve the thousands of citations it will get for being the first one to press.
And -- for those of us dying to get our hands on a TMTPro dataset -- all these files have been deposited! (PRIDE PXD014750) I'm filling out the form on PRIDE now to have the files released for public download.
Monday, November 18, 2019
A long time ago I was in a relatively serious car accident. My recovery cost me two weeks of classes and I learned that concussions are seriously no fun at all. However, if you gave me an option of going through that again or migrating all my computers to Windows 10....I'd need time to think about it.
Unfortunately, like the rest of us, I have no choice at all. Windows 7 support is ending and our field is intrinsically tied to this Cortana-and-Bing-infused catastrophe. Or is it? What is still missing?
Sure -- the instruments need to run on a corporate operating system, but there are an increasing number of options for the data processing that don't involve someone running an ad to try to sell you stuff while looking for your stuff on your hard drive.
(If you do run into an ad inside your computer, this tutorial will help. This appears disabled in Enterprise versions, but who knows for how long?) I removed Cortana from the registry manually, and on the next update, there she is, helpfully taking me to a place to purchase Kanye's new album every time I type the exact name of an Excel spreadsheet into the search bar.
I should sleep more. This is getting out of hand.
Certain bioinformaticians in our field have been leading the charge away from Windows for quite some time and my obsession with learning how to follow them is filling the pages of this increasingly strange blog these days. And ThermoRAWFileParser couldn't have come at a better time!
I'm working on installing ProteoWizard on our cluster now, and as far as I can tell there is still considerable extra functionality in it that I should definitely get up there as well -- but this new tool has some really cool advantages of its own, including the direct production of JSON metadata files. And, in a head-to-head with msConvert, the new tool's mzML files appear to hold up just fine -- they actually resulted in more total peptide IDs!
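For anyone else pointing ThermoRAWFileParser at a cluster, here's a minimal sketch of wrapping it from Python. Note that the flag values (-f=1 for mzML, -m=0 for JSON metadata) are from my reading of the tool's help text, so double-check them against your version before trusting this:

```python
import subprocess

def raw_to_mzml_cmd(raw_path, out_dir, mono=True):
    """Build a ThermoRAWFileParser command line.

    Flag meanings as I understand them (verify with --help):
      -f=1  write mzML
      -m=0  write a JSON metadata file alongside the spectra
    """
    cmd = ["ThermoRawFileParser.exe",
           "-i=" + raw_path,   # input .raw file
           "-o=" + out_dir,    # output directory
           "-f=1",             # output format: mzML
           "-m=0"]             # metadata format: JSON
    if mono:  # on Linux/macOS the .NET executable runs under mono
        cmd.insert(0, "mono")
    return cmd

# To actually run it (requires the parser on your PATH):
# subprocess.run(raw_to_mzml_cmd("hela_200ng.raw", "converted/"), check=True)
```

The nice part of building the argument list in a function is that your cluster job scripts can loop it over a whole directory of RAW files.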
Sunday, November 17, 2019
Ummm....okay...so this is open access and it addresses one of the biggest (and scariest) elephants in the room. I hate to keep drawing attention to it, but with 40+ peer-reviewed studies on forensic proteomics in 2019 already, we need to start talking about this.
Anyone in the world can go to ProteomeXchange and download data from one of the repository partners like PRIDE or MASSIVE. If there is personally identifiable information in there, do we need to be thinking about this?
This thoughtful paper addresses these questions and (IMAHMFO) properly describes them as "dilemmas".
With genetics we have to be extremely cautious about how the data is anonymized -- and we need explicit disclosure agreements and fancy government forms for the release of genetic data, with descriptions of the potential consequences. I think I've been told that there are people at Hopkins whose whole job is this -- informing patients of their rights when they're participating in big genetics studies.
If you can track single amino acid variants specific to individual people in something as benign as hair, it doesn't seem all that hard to imagine that you could definitely identify a person (and stuff about them) from a plasma proteome, right? Maybe y'all on the biology side are already dealing with this stuff and I should just get out of the noisy room more? I hope so!
Saturday, November 16, 2019
If you need to catch up on a ton of those genetics terms and techniques you've heard people mumbling about, there might not be a better or more interesting way than this new documentary.
CRISPR stuff? Check!
GeneDrive stuff? Check!
Some...interesting....looking "Biohacker" guys saying reasonably accurate science things and then injecting themselves with stuff? Check!
Friday, November 15, 2019
I think I just successfully convinced @SpecInformatics to throw in on a study where we try to do ALL THE PROTEOMICS THINGS on High Performance Computing.
I just got 100,000 core hours for free, and I was told that if I could come up with a valid excuse I could probably have another 300,000 hours to use in the next 365 days.
CompOmics FTW! The amazing people at the UVA HPC were easily able to set up an Anaconda module and -- BOOM -- SearchGUI.
Interesting thing I forgot 2 Linux boxes ago -- or honestly never knew -- while SearchGUI installs your 10 search engines with the Windows install package, they might not automatically install in the Linux versions. Okay -- but this can be a huge advantage.
Wait -- you know about SearchGUI. I ramble about it all the time. Okay -- if you don't -- SearchGUI is this amazing idea from a bunch of smart Belgian(?) people who said, "Wow, there are a lot of amazing search engines out there for free, but most of them are a pain in the arse to set up and use, so people using one aren't going to have the energy to set up the others. Can we fix this? Oh...and choosing just one is dumb....let's fix that too!" And you get -- 10 engines you've heard of -- in a super easy interface!
(I was only running with decoy search off because I was trying to troubleshoot something odd.)
It's an amazing bit of convenience and power that you can get here. I can't recommend it enough. I even started making tutorial videos for it a couple years ago and forgot about it completely. Maybe I'll finish them later! My calendar says there is some free time coming up in August of 2024.
Can you imagine how much work it would be for this group to keep up with the improvements to each of these engines? They do a great job, but the awesome Comet engine has had at least 2 updates since ASMS 2019, which I'm convinced was yesterday.
I don't know how to do it yet, but it looks like I can just get going with the newest version! Success!!
Right now I've just got Novor and DirecTag going -- because if you've got 100,000 computational core hours and you don't go after de novo first you probably don't need it. I always need de novo!
How long does this HPC need to run a Novor + DirecTag search on a human HeLa MGF file from 200 ng on a QE HF? (I've got ProteoWizard; I just have to get it set up properly so it will accept .RAW and .d.)
About 60 seconds for both. Interestingly, at 3 AM it is about 40% faster than at 1 PM....
If you've got an HPC on your campus -- go talk to the nice people that run it -- and see if it can be an asset for you! My next plan -- MAXQUANT -- because --
MaxQuant isn't just for Windows anymore!!!
Thursday, November 14, 2019
The arguments are building up for why you need this.
If you're also thinking "...wait...remind me what Galaxy is again...? I know I saw a talk from that really cool guy from Minnesota (Pratik)"
Galaxy is a flexible interface for linking all sorts of tools on super computer thingies. GalaxyP is the proteomics version. You can have someone smart build you a GalaxyP instance on your supercomputer thing -- but there is a cooler way of doing this -- you can just borrow time on someone else's!
GalaxyProject.EU has workflows built in that you can use AND they have loads of tutorial stuff so you aren't starting alone on that terrifying project.
You can directly access all this stuff here.
Tuesday, November 12, 2019
This article isn't brand new, but I just stumbled across it and really appreciated the perspective on it. It's open and available here.
1) How do you get funding to set up and run a core outside of where most of them are?
2) What challenges would you face if you packed up and decided to go there? Yo...the 24 hours to pump down your Orbitrap after every brown out....that sounds like a blast, right?
3) And this is the absolute best part of the article -- the opportunities! Yes, there is all sorts of great basic science you can do with baker's yeast. But there are diseases the World Health Organization lists as serious killers that I've never even heard of, and I bet almost no proteomics or metabolomics has ever been done on them. There is such an opportunity to do good and have an impact that we can't possibly ignore the development of biological mass spec in the developing world.
Yeah, you could argue that you could just send more samples here, but have you gotten human samples from Africa before? I have, and I wish I'd known about this new technique that helps you tell how many freeze/thaws your samples have been through! When your samples are coming thousands of miles there is a very good chance that some valuable data may be lost, particularly in molecules that might not be as structurally robust.
Monday, November 11, 2019
How we deal with "missing values" may always be controversial, and I'm going to assume that no level of improvement in mass spectrometry engineering is going to be able to fix this. Sure, we can get better coverage, but sometimes that peptide just isn't going to be there -- maybe because it's got a single amino acid variant (SAAV), or maybe because it's got a post-translational modification in patient/condition A that is not present at all in B.
At some level, though, we've got a tough decision to make. Do you reeeeeeaaaallllly want to divide by zero? Or do you want to throw out that whole peptide measurement in your downstream analysis pipeline? It often makes sense to impute a value for the peptide or molecule that you can't see in your extracted chromatogram.
ProtRank may not be the ultimate solution (...cause...realistically there may not be one universal solution...), but it's a different take on this old problem. You can read about it in this new open article.
ProtRank is assembled in Python and is available at github here.
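To make the "don't impute, rank" idea concrete, here's a toy sketch in plain Python. This is just my illustration of a rank-based treatment of missing values, NOT the actual ProtRank code -- get that from the GitHub repo:

```python
import math

def rank_changes(intensities):
    """Toy rank-based handling of missing values.

    intensities: {peptide: (control, treated)}, None = not detected.
    Peptides quantified in both conditions are ranked by log2 fold
    change; peptides that appear or vanish go to the extremes of the
    ranking instead of being given a made-up imputed intensity.
    """
    scored = []
    for pep, (c, t) in intensities.items():
        if c is None and t is None:
            continue                          # nothing to say here
        elif c is None:
            scored.append((pep, math.inf))    # appeared: top of ranking
        elif t is None:
            scored.append((pep, -math.inf))   # vanished: bottom of ranking
        else:
            scored.append((pep, math.log2(t / c)))
    # sorted from most down-regulated to most up-regulated
    return [pep for pep, score in sorted(scored, key=lambda x: x[1])]

demo = {"pepA": (1000.0, 8000.0),   # 3-fold up
        "pepB": (5000.0, None),     # disappears on treatment
        "pepC": (None, 2000.0),     # appears on treatment
        "pepD": (4000.0, 4000.0)}   # unchanged
print(rank_changes(demo))  # ['pepB', 'pepD', 'pepA', 'pepC']
```

The point is that "pepB" and "pepC" still contribute to the downstream ranking without anyone having to invent an intensity for a peak that was never measured.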
This study is interesting in its examination of some extreme dataset models, and it looks at the biases typical imputation methods cause in them. One place where it is really scary to impute is phosphoproteomics. A lot of phosphorylation sites change to such an extent that they exceed the linear dynamic range of the instruments (I don't fall into the school of thought that there are truly 100% on/off switches; I think it's different bi-stability cliffs -- I almost threw in some references here, but I really should go to work). Do you impute here?
Want to talk about a nightmare dataset? They look at phosphoproteomic shifts in IRRADIATED CELLS. DNA damage repair functions by phosphorylating everything it can to stop processes that would make the radiation damage worse. The increases in phosphorylation are probably as big as you can get. Imputing some values shifts the data to the point that you lose a lot of the known phosphorylation changes. Whoops.
How much better does ProtRank do? In part, we'll have to wait and see. It is applied in a big biological study that is in preparation. This paper is the introduction and the logic behind the code, and a nice way to say "download me!" So...
What great timing. I was just whining about how I can't make Perseus do something that seems really simple in my head -- BOOM! 4 new Perseus videos!
You can access the MaxQuant Summer School videos on the YouTube page here.
I'm personally going to start with video T4. Because I suspect I'm missing something important right at the beginning in my dumb pipeline.
Sunday, November 10, 2019
Do you have 4-5 weeks?
Do you need to get an absolute understanding of the rates of protein turn-over IN A LIVING ANIMAL SYSTEM?
This isn't the first technique for protein turnover measurements. This may be, however, the most complete picture that we've been able to get.
If your strengths aren't exactly centered in the wet-lab aspect of proteomics, does this look a little bit like a nightmare? Yes. I can confirm. However, it's only the first 70 steps of the protocol that will negatively affect my already erratic sleep patterns -- at step 71 we get to the data processing....yeah...it's MATLAB...but it's already all done for you!
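If you just want ballpark numbers before committing to the protocol, the textbook first-order turnover model is a two-liner. To be clear, this is the generic kinetics math, not the paper's MATLAB pipeline:

```python
import math

def fraction_old(k, t):
    """First-order turnover: fraction of the pre-existing (unlabeled)
    protein pool remaining after t days at degradation rate k (per day)."""
    return math.exp(-k * t)

def half_life(k):
    """Half-life (days) for a first-order degradation rate k (per day)."""
    return math.log(2) / k

# A protein turning over at k = 0.1 per day:
print(round(half_life(0.1), 2))         # ~6.93 days
print(round(fraction_old(0.1, 28), 3))  # old pool left after a 4-week label
```

That 4-5 week window in the protocol makes sense in this light: after ~4 half-lives of a typical protein, there's very little unlabeled material left to confound the measurement.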
Recently there has been an explosion of new evidence that proteomics has value in forensics analysis. While it's obvious that this is a great thing -- I'd also argue that it might be kind of a scary thing as well. Could you, for example, determine every sample I've ever prepared in my life from the RAW data by identifying a specific keratin peptide variant that is unique to the majestic Pugs that I've dedicated my life to rescuing and protecting from a world that isn't nearly good enough for them?
This new study from NIST suggests that -- yes -- this is possible AND it can even be used for the identification of human genetic variants (which you could effectively argue might be an application of this technology that would be slightly more widely applicable...I guess....)
I'd like to point out a technical detail in this study that is really cool. They did in-gel digests of these hair samples. The gels were stained with SimplyBlue SafeStain and then scanned.
Why'd they scan the gels? To determine where to cut them so that the protein loads would be equivalent!
Should I know about this? Why haven't we all always done this when using SDS-PAGE to fractionate our proteins? We could break out the scanners and the Windows XP software that is up on a shelf somewhere from the days of 2D-gels and make them easily do this, right?
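The idea is simple enough to sketch: integrate the stain intensity down the lane and place your cuts where the cumulative signal crosses equal fractions of the total. A back-of-the-envelope version with toy numbers (this is my illustration, not their actual image analysis):

```python
def equal_load_cuts(profile, n_slices):
    """Pick cut positions along a lane's densitometry profile so each
    gel slice carries roughly the same integrated stain intensity.

    profile: pixel intensities down the lane, from the gel scan
    Returns the pixel index after which to make each cut.
    """
    total = sum(profile)
    cuts, running = [], 0.0
    target = total / n_slices
    for i, px in enumerate(profile):
        running += px
        # cut each time the running total crosses the next equal share
        if len(cuts) < n_slices - 1 and running >= target * (len(cuts) + 1):
            cuts.append(i)
    return cuts

# A lane with a big band near the top: the top slices come out thinner.
lane = [90, 90, 90, 10, 10, 10, 10, 10, 10, 10, 10, 10]
print(equal_load_cuts(lane, 3))  # [1, 2]
```

With a uniform lane the cuts come out evenly spaced; with a dominant band they crowd around it, which is exactly the behavior you want for equal-load slices.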
Back to the study -- they use all sorts of different extraction conditions and protocols and that is a big part of the study -- developing the methods to do this, but I'm obviously going to focus on the data -- and this is reaaaally cool.
They're starting with a standard and well-characterized hair sample (cause you can obviously get standard hair material(?)) and they use MSPepSearch to analyze the peptides from the digested hair. 40% of the peptides don't match anything in the NIST human spectral library database. 40%!!
In my mind there are 2 main causes for this, and my first guess would be:
1) The default button in MaxQuant and other software that ignores the common lab contaminants. I'm sure I've mentioned before my difficulty in studying phosphorylations in keratins because the software just hid them by default -- geez -- that was almost a decade ago.... My layout for PD 2.4 is still set to hide wool, Pug, and trypsin peptides.
2) Is it individual variation? Could it be THAT prevalent? That would be nuts, right?
The authors deploy NIST Hybrid Search to answer this question. If you haven't tried this, you should. FAST and accurate identification of delta shifted spectra against spectral libraries. I feel like I've given away too much stuff in this great paper already. It is NIST, so the paper is open access.
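The core trick of a hybrid/open search is interpreting the precursor delta mass between a spectrum and its library match. Here's a toy sketch with a hypothetical mini-table of common deltas -- the real tool matches fragments against the full spectral library, obviously, and these Unimod-style monoisotopic masses are just the handful I keep in my head:

```python
# Hypothetical mini-table of common precursor mass shifts (Da).
KNOWN_DELTAS = {79.96633: "phosphorylation",
                15.99491: "oxidation",
                42.01057: "acetylation",
                0.98402:  "deamidation"}

def explain_delta(observed_mass, library_mass, tol=0.01):
    """Name the precursor mass shift between an observed peptide mass and
    its spectral library match, the way an open/hybrid search would."""
    delta = observed_mass - library_mass
    for mass, name in KNOWN_DELTAS.items():
        if abs(delta - mass) <= tol:
            return name
    return "unexplained shift of %.4f Da" % delta

# A peptide that matches its library spectrum but sits ~79.97 Da heavier:
print(explain_delta(879.3263, 799.3600))  # phosphorylation
```

So a "40% unmatched" pool can start shrinking fast once the search engine is allowed to say "that's the library peptide, plus a shift" instead of demanding an exact precursor match.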
Friday, November 8, 2019
Talk about a surprise! I am cranking on this cool dataset for a talented young biologist and I thought -- what the heck -- I haven't put anything into STRING in so long I'm not even sure if it is still supported and ---
The output is just stunning -- and reeeeeeeaaaaaaaaly helpful for his model. Almost all the pieces fall right into place for this phenotype....obviously results will vary depending on your model, coverage, etc. Dr. JJ Park did the proteomics on these samples on an HF-X and the data is as good as I've ever seen, so that doesn't hurt at all.
I suggest that if you put some data into String in 2013....
....and blocked the site on your browser so it would never happen again that you consider a revisit. This isn't the same thing at all anymore.
The one that is live today is v11 and the improvements are detailed in this great paper from earlier this year.
It's not just me being out of the loop, either; v11 is a substantial upgrade. Not only does the number of organisms covered double and the libraries it references increase markedly in this release, but this is the first version that allows the upload of complete genome/proteome-sized datasets. In fact, it gives you all sorts of warnings if you attempt to upload just the proteins that you've determined are significant. By default it wants to take all your data.
100% recommended you check it out!
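If you'd rather script it than click, STRING also exposes a REST API. The endpoint and parameter names below are my best recollection of the STRING API docs, so verify them against the current documentation before building anything on this:

```python
from urllib.parse import urlencode

def string_network_url(genes, species=9606):
    """Build a STRING REST query URL for a network lookup.

    Endpoint and parameter names ("network", "identifiers", "species")
    are assumptions from memory -- check the live STRING API docs.
    The identifier list is joined with %0d per the STRING convention.
    """
    base = "https://string-db.org/api/tsv/network"
    params = {"identifiers": "%0d".join(genes),  # %0d-separated gene list
              "species": species}                # NCBI taxon, 9606 = human
    return base + "?" + urlencode(params, safe="%")

url = string_network_url(["TP53", "MDM2", "CDKN1A"])
print(url)
```

Handy if you want the interaction table as TSV for downstream processing instead of the (admittedly gorgeous) web graphic.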
Thursday, November 7, 2019
I've mentioned Picky on this blog before, but I don't think it's possible to bring enough exposure to some tools and Picky is definitely one of them.
You can read about it at Nature methods here.
But that's probably not what you actually want to do. What you actually want to do is go here:
And use this awesome Shiny app to just build your method of choice. Look, you can build your own awesome PRM or SRM targeted experiment. You can think really hard about cycle time on your D30 vs. your D20 system, or flip a coin to decide whether you should use static dwell times or set a maximum cycle time on your newest triple quad. Or...you can focus on your experimental design and data output and just design your targeted experiment with....
The clip I took from the one I'm building is chosen because I need to target an alternative sequence isoform. Which I could definitely do this morning -- OR -- I could just press the button on Picky.....
...and magically open the door to selective targeting of Proteoforms!
All jokes aside, if there is an easier way in this world of dealing with targeting alternative protein isoforms, send me an email so I start using it!
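For scale, the precursor math behind a hand-built inclusion list is simple enough to sketch. This is a generic monoisotopic m/z calculator using the standard residue masses -- nothing Picky-specific, and it ignores modifications entirely:

```python
# Monoisotopic amino acid residue masses (Da), standard values.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
           "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
           "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
           "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER, PROTON = 18.01056, 1.00728

def precursor_mz(peptide, charge=2):
    """Monoisotopic [M+zH]z+ m/z for an unmodified peptide sequence."""
    neutral = sum(RESIDUE[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge

print("%.3f" % precursor_mz("PEPTIDE", 2))  # ~400.687
```

Which is also exactly why isoform-specific targeting is hard to do by hand: the work isn't the arithmetic, it's picking the peptides unique to your proteoform of interest -- the part Picky does for you.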