Friday, January 31, 2020

PCR + Mass spectrometry for Coronavirus detection!


This study is a couple of years old, but it highlights a really clever way of detecting pathogens -- amplifying DNA and then doing mass spectrometry on the amplicons.

Advantages:
-You can start with virtually no DNA and make a ton of it
-MALDI-TOF is fast when you've got a ton of samples to screen
-MALDI instruments, while maybe not very common in research environments, are increasingly common in clinical labs. (The big question, though, is flexibility -- I think a lot of these are locked down to performing one specific assay -- but, still, those places would have staff with the technical expertise to prep the samples and run the instruments.)
-If you pick primers well, you'd be resistant to mutations in the viral strains

Cons:
-PCR takes time. Is it faster than it was?
-It takes some people a long time to make primers and to verify they don't cross-react. (I've heard tools have gotten better)
-MALDI is almost always connected to low power TOF devices (sensitivity, resolution and accuracy typically suffer)
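The mass readout itself is simple enough to sketch. Here is a hedged toy example (the residue masses are approximate average masses for nucleotides within an ssDNA chain, and the sequences are invented, not any real primer set or amplicon):

```python
# Rough sketch: estimate the average mass of one strand of a PCR amplicon from
# its base composition -- the quantity a MALDI-TOF (or FTICR) readout of
# amplicons actually distinguishes. Approximate average residue masses; the
# +18.02 restores the terminal H/OH of the chain.
RESIDUE_AVG = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
WATER = 18.02

def amplicon_strand_mass(seq: str) -> float:
    """Approximate average mass (Da) of one single strand of an amplicon."""
    return sum(RESIDUE_AVG[b] for b in seq.upper()) + WATER

# A single base substitution shifts the strand mass by a resolvable amount --
# an A->G swap adds ~16 Da -- which is why well-chosen primers flanking a
# variable region let you type strains by amplicon mass alone.
print(round(amplicon_strand_mass("ATCG") - amplicon_strand_mass("ATCA"), 2))  # -> 16.0
```

The point of the example: you never sequence anything, you just weigh the amplicons and match the masses against the expected base compositions.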

Check out this alternative technique that would alleviate some of this --



Same general idea -- 15 years ago -- this group amplified their virus DNA (SARS-coronavirus) and then did FTICR....are there a few thousand smaller, faster FTMS devices around the world right now? If you've already done PCR would you even need to couple it to HPLC? FIA-MS on an Orbitrap?

Sorry if you're tired of hearing about these viruses, but I'm motivated to read/write about it

On the topic of the 2019-nCoV (Wuhan) coronavirus -- check out this beautiful resource from NextStrain!

If you are working on this from a proteomics/mass spec/clinical detection perspective and want to talk, please reach out (orsburn@vt.edu). I'd be happy to lend a hand developing/troubleshooting assays or by (much more usefully) connecting you to people who could be of great assistance.

And -- while I'm sleepy rambling -- check this out -- ModPred (www.modpred.org) thinks palmitoylation is a dominant PTM -- if you are developing peptide specific assays -- I'd skip that entire protein terminus.



Thursday, January 30, 2020

MaxQuant.Live 1.2 is live!


I'm a little behind on this and don't have actual data to compare yet, but MaxQuant.Live 1.2 is up for download.

Important factor for many of us -- it still appears to rely on the same Foundation and Tune 2.9 (no 2020 mandatory upgrades if you were already using it in 2019).

The interface looks a little cleaner, but if you were hoping for some magical new data acquisition method to appear in the App store, you'll be disappointed.

Given how active the developers have been on the awesome MaxQuant.Live Google group discussion forum, I think we're looking at a great piece of software that still hasn't been utilized to its potential, but has had some minor bugs ironed out.



APEX Proteomics applied to Stress Granules Provides insight into ALS progression!

A great way to see trends in proteomics is to go to ProteomeXchange and see what everyone is uploading.

The word "spatial" is completely blowing up. I think there are a couple versions of the modern "spatial" techniques. APEX, however, might be the best example. There is a great overview of this technique on the Krogan lab website that you can view here (I don't want to steal their nice image!)

Want the 6 18 second description/reminder?
1) You make a version of your protein of interest with a peroxidase on it and then you put it into your biological system and let it interact with all its friends.
2) You add some biotin-phenols that float around doing (presumably) no harm or alterations to your system
3) !!SURPRISE!! your system by adding hydrogen peroxide!
4) Nothing special happens to any other proteins (except normal H2O2 effects, I guess), but your protein of interest has that peroxidase on it -- it reacts with the biotin phenols around it, which causes all its local protein friends to be labeled with biotin!
5) Pull down the biotin labeled proteins and now you know all the proteins in the general area of your protein of interest. Cool, right?

(It is a bit more complex than this. You don't want the APEX fusion being expressed all the time, for example; you can wait and activate it when your cells get to the right stage -- you also need to quench the reaction, but now I need to change how many seconds this took again...)
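The downstream data analysis is less exotic than the labeling. A minimal sketch of the comparison step, with invented protein names and intensities (not the pipeline from any particular paper): compare the biotin-enriched pulldown against a control pulldown and keep proteins that clear an enrichment cutoff.

```python
import math

# Hedged sketch: after the APEX pulldown you compare intensities against a
# control IP (e.g. no-H2O2 or free-APEX) and keep proteins whose log2
# enrichment ratio clears a cutoff. All values below are made up.
fusion  = {"G3BP1": 8.0e7, "TIA1": 3.2e7, "TUBB": 1.1e6}   # APEX-fusion IP
control = {"G3BP1": 1.0e6, "TIA1": 9.0e5, "TUBB": 1.0e6}   # control IP

def enriched(fusion, control, log2_cutoff=2.0, floor=1.0):
    hits = {}
    for prot, inten in fusion.items():
        # floor keeps the ratio defined when a protein is absent in the control
        ratio = math.log2((inten + floor) / (control.get(prot, floor) + floor))
        if ratio >= log2_cutoff:
            hits[prot] = round(ratio, 2)
    return hits

print(enriched(fusion, control))  # tubulin (sticky background) drops out
```

The stress-granule proteins survive the cutoff; the generic sticky background does not. That filtering is the whole trick.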

However, if you want to see it in action in a medical context -- check out this new preprint!

 
This group used multiple APEX fusions to study stress granules (SGs), which according to the authors "form in response to a variety of cellular stresses by phase-separation of proteins associated with non-translating mRNAs" (yes, stolen from the first sentence of the abstract).

Unlike a lot of spatial labeling techniques -- this one requires nothing special from the mass spectrometer. The biotin labeled proteins are pulled down and digested. This group used a QE Plus for part of the study and it looks like they upgraded to a QE HF at some point. The LC separation was a long (3 hour) gradient on 50cm columns. A comparable cycle time is maintained between the two instruments by using 17,500 resolution for MS/MS on the Plus and 30,000 resolution on the HF. The data processing here is MaxQuant and it looks very typical.

With this system set up, the experiment gets more complex, with the addition of ALS linked dipeptides that show alterations in the stress granules and the SUMOylation -- which is where the biology goes beyond me.

What I do get:
1) A fantastic application of this powerful new technique
2) A method that demonstrates that I could definitely do my part of this.
3) Important new insight into the progression of a protein disease?

Wednesday, January 29, 2020

Skyline for small molecules/metabolomics and Skyline 20.1!




Skyline has had support for small molecules and metabolites for years now -- but I still have a lot of trouble setting it up and have to bug smarter (and typically younger...) people for help a lot. What I could use is a step-by-step protocol and template files I can download.

While I'm on the Skyline topic -- I just got this great email overnight -- Skyline 20.1 is up. 

It does require a manual download and install (which you can download here) -- but the Skyline team hasn't forgotten about proteomics.

Edit: On my laptop that has the Windows 10 disease, I did have to manually remove the last install of Skyline and reboot to install 20.1.

I've trimmed the email to remove any mention of command line Skyline and stats words I'm unfamiliar with. And highlighted my favorite parts. 

MSFragger spectral libraries!! Pull out the dark proteome and then quantify it?!?  For the biopharma groups that are finding multi attribute monitoring the most cost-effective way forward? Supported! 


Improvements since Skyline 19.1 include:
  • Prosit spectrum and iRT prediction support directly integrated into the UI
    • Building libraries for targeted peptides in a document through Peptide Settings - Library - Build button.
    • Prosit spectrum prediction viewing in the Spectrum Match plot with new right-click menus, including mirror plotting
    • Settings in Tools > Options > Prosit
  • Support for spectral library building from MS Fragger pepXML search results
  • Support for diaPASEF!
    • We have run this with 2 separate 3-organism datasets through the LFQBench statistical assessment and that works.
  • Improved ddaPASEF and initial prmPASEF support.
  • Performance gains in importing Agilent and Waters IMS data as much as 2x or more.
  • Parallel file import with proteomewide DIA in the UI or by default on the command-line has performance similar to what was previously only available from the command-line using --import-process-count. Choose "Many" on your next import or just ignore threading the next time you import from the command-line.
  • Optimized spectrum memory handling for instrument vendors with .NET data reader libraries, benefitting Agilent, SCIEX, and Thermo
  • A new "Consistency" tab in the Refine > Advanced form, supporting CV and q value cut-offs
  • New checkbox for Refine > Advanced - Results tab Max precursor peak only
  • Support for Multiple Attribute Model (MAM) grouping with Peptide.AttributeGroupID and PeptideResults.AttributeAreaProportion
  • Added File.SampleID and .SerialNumber (of the instrument) as fields in Document Grid custom reports
  • Transitions Settings - Full-Scan - MS/MS filtering has been extended to apply to all non-MS1 spectra (e.g. MS3) as long as the MS1-level precursor matches the target precursor m/z
  • The redundant library filtering phase of spectral library building is around 20x faster
  • Improved iRT calibration UI making it easy to create new sets of standards based on existing sets that can be used in spectral library building and the Import Peptide Search wizard
  • More iRT improvements including more intelligent use of 80+ CiRT peptides when CiRT is chosen during library building
  • New right-click > Quantitative menu item for changing the Quantitative property on transitions in the Targets view
  • TIC and BPC now come from raw data files and do not need to be extracted from MS1 spectra, which has performance benefits for MS1 filtering
  • New global "QC" transitions have been added such as the pressure trace
  • Calibration curve fixes to make ImCal (Isotopolog Calibration Curves) work
  • New "Calculated" annotations have been added which support storing Skyline calculated values in annotations for future use with AutoQC
  • Support KEGG IDs as molecular identifiers in small molecule targets.
  • Improved support for D used in chemical formulae in place of the Skyline default H'
  • Added support for Thermo Exploris and Eclipse instruments
  • Support for opening .skyp files downloaded directly from Panorama

Tuesday, January 28, 2020

Publicly available (unpublished?) proteomic, metabolomic and lipidomic (MERS-CoV) coronavirus data!

Wow. Do I ever love ProteomeXchange!

Skip my reading and go to MASSIVE and get proteomic data from cancer cells infected with a coronavirus and -- if you're into that sort of thing -- you can get metabolomic data here and lipidomic data here! 


The RAW files currently heating my apartment may not have been published yet, but they are publicly available. I just contacted the uploader, but I'm moving fast because this data is 1) awesome and 2) pertinent

The Wuhan coronavirus (2019-nCoV) has a very close neighbor (possibly the closest, according to my rough pBLAST of the entire translated sequence, though that may just be a consequence of it emerging more recently, as sequencing technology has gotten cheaper and more common -- leading to more data) called MERS-CoV (here is the entry from UniProt), or Middle East respiratory syndrome-related coronavirus.

The experiment is 9 files from Calu cells (appears to be an immortalized human cancer cell line) infected with the virus and 3 files from "Mock" (presumably uninfected).

The files were acquired on an Orbitrap Velos in "high/low" mode (120k resolution MS1 and CID ion trap fragmentation). The files appear to originate from PNNL, where it is rumored they know a thing or two about running mass spectrometers.

MetaMorpheus recalibration shows the MS1 is spot on, something like -1ppm off actual when compared against human and -- get this -- I can get >70% coverage of the main capsule protein from the virus in the virus infected proteomes. This is really cool because that protein is well conserved between the two (by pBLAST score, anyway).
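For what it's worth, a coverage number like that boils down to a calculation like this -- map the identified peptides back onto the protein sequence and count covered residues. The protein and peptides below are invented stand-ins, not the actual viral sequence:

```python
# Toy sketch of how a ">70% coverage" figure gets computed. The sequence and
# peptides are made up for illustration.
def sequence_coverage(protein: str, peptides) -> float:
    """Fraction of residues covered by at least one identified peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:                       # mark every occurrence
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)

# 11 of 16 residues covered by two (hypothetical) tryptic peptides
print(sequence_coverage("MKTAYIAKQRQISFVK", ["TAYIAK", "ISFVK"]))
```

Overlapping peptides only count each residue once, which is why coverage is a more honest number than raw PSM counts.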

Update: More fast moving science!! Just because the pBLAST scores line up, it doesn't mean that the peptides do -- check this out!



Again -- big disclaimer -- this is a mass spectrometrist's blog. I know very little about viruses and am just interested in this topic!


Single cell RNASeq + Plasma Proteomics + Machine Learning!



You should check out this new preprint here! 


What a great week or 10 days for proteomics. Holy cow. January was kind of laggy and then -- BOOOOOOOOOOOM --!!

Okay -- in yet another entry that is going into a file called "January 2020 papers you must read!!" -- which -- is too many words for this cursed Windows 10 thing --


(BERNIE BOMB!)

Back to the paper -- if you just read the abstract, the phrasing will make you think that our friends at the Max Planck jumped on the ScoPE-MS electric Porsche into the future, but inside you'll find more standard plasma profiling (which looks to me a lot like the clinical proteomics proof-of-concept work we've seen from the Mann lab -- high fractionation, rapid HF runs [relatively affordable instrument!!] for individual patients, and MBR). You can read my rambling about one of my favorites of these recent studies here.

Couple that to high throughput single cell transcriptomics, then use machine learning to link the plasma proteome features to the single cell transcripts across 31 clinically derived factors from these patients, and -- it looks like the future to me, but it appears they took the Tesla.
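To make the linking step concrete, here is a minimal, hedged sketch of its weakest possible version: correlating one plasma-protein feature against one clinical factor across patients. The real study fits much fancier models; every number below is invented.

```python
# Plain-Python Pearson correlation: the simplest possible way to "link" a
# plasma proteome feature to a clinical factor across patients. Toy data.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Made-up numbers: one protein's intensity vs a clinical score, 6 patients.
protein = [1.0, 2.1, 2.9, 4.2, 5.1, 6.0]
score   = [10,  13,  15,  19,  22,  24]
print(round(pearson(protein, score), 3))
```

Scale that to thousands of protein features, thousands of single-cell transcript clusters, and 31 clinical factors, and you can see why they reached for machine learning instead.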


...which...of course, that is a thing, right?

Since I'm still rambling -- this preprint was posted in medRxiv, which has some great disclaimers.




Monday, January 27, 2020

Predicting human life span with deep plasma proteomics??


...and the 2020 Grammy for most eye grabbing title goes to....

...this brand new study that is the first or second thing I get to once I'm safely behind the publisher's financial security of a university library paywall....

To be clear, I haven't read this and I'll probably doubly verify the QC/QA checks on my baloney detectors before I do. But if you think there is a force on earth that can keep me from reading this today --


...I mean...besides the paywall....$8.99....

Sunday, January 26, 2020

Wuhan Coronavirus (2019-nCoV) Complete Protein FASTA download


Edit 2/10/2020: UniProt has resources up. These are better. You can check them all out here!

I was looking for a complete protein FASTA database for the Wuhan coronavirus and came up empty.

The NCBI database was just updated yesterday (direct link here) so I pulled the newest sequences and just assembled them into a single file.

You can download the complete protein FASTA from this Google drive link here.

Hit me up if you have any issues with it.

Image above is from this preprint which was updated on 1/22 after it was ORIGINALLY POSTED ON 1/21!! This is how fast science can be, people!



Yikes -- okay, well I guess the way that blew up I wasn't the only person looking for it.

Disclaimers: I'm a loud mouthed mass spectrometrist who knows very little about viruses. I just put all the sequences NCBI translated into one file so the common proteomics software on my computers will accept it.

An Encyclopedia with Quantitative Proteomics of 375 cancer cell lines!?!?!?


Ummm.....whoa....I'm just going to leave this here. This is far too large of a resource for me to tackle on a Sunday morning.

Here is an overview and a lot of links to/around/about the study -- including a query-able SQL database in case you're not sure where to put 4,000 Fusion RAW files....

Correction: It's only 500 or so files. Multiplexin'

And here is a short paper about it....




Saturday, January 25, 2020

EPIFANY -- A smart and fast method for protein inference!!!



This new study in press at JPR is critically important for shotgun proteomics and smarter people than me (which means just about everyone) should really take a good look at this and 1) verify it is as good as it looks and 2) see about integrating the source code or primary logic into all sorts of other tools. (An earlier draft was also made available through bioRxiv.)



Okay -- so -- shotgun proteomics is really really good at one thing -- making

Peptide
Spectral
Match(es) -- (PSM)s.

And if we're working with the best proteins in the whole entire world then each and every one of those PSMs is unique to 1 particular protein and when we identify that PSM and quantify it in a sample we have proven that particular protein is there and we can even get quantification estimates / measurements on that one protein from that one PSM. (I need a word count on sentences).

However -- from an evolutionary perspective it doesn't make a ton of sense for each protein to have developed in isolation with no relation to any other protein. So...a lot of PSMs could be derived from more than one protein. And if you only identify PSMs that could originate in more than one protein, what do you do?

You INFER the protein identity.
How do you do that?
Well -- probably by a set of mostly arbitrary rules that were chosen because....we had to do something...and it's a great idea if we keep them to ourselves...because they don't reflect well on us or our field.

The best one? When you've got equal evidence, it's probably the biggest protein in your FASTA database.... (Some tools use the highest percent coverage instead, but then you'll get all weirded out because if UniProt contains your full length variant and 4 alternative "fragment of" protein sequences, you'll only ever see the fragments, and then you'll be afraid your lysis method broke off all your C-termini...which...you can't rule out.) See...it doesn't sound great when you say it out loud. I hate explaining it when I can tell people are paying attention. I go ahead and get the idea of a "razor" peptide out of the way next, because it's better to get two things that damage your credibility out of the way at the same time, and then you can spend the rest of your talk or lecture trying to gain it back.
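The maximum-parsimony option quoted below (a la PIA) is, at heart, a set-cover problem: find a small set of proteins that explains every identified peptide. Exact set cover is NP-hard, so here is the standard greedy sketch -- with ties broken arbitrarily, which is exactly the kind of arbitrary rule I'm complaining about. Protein names and peptide IDs are invented.

```python
# Greedy parsimony-style protein inference: repeatedly pick the protein that
# explains the most still-unexplained peptides. A sketch, not any tool's
# actual implementation.
def greedy_parsimony(protein_to_peptides):
    unexplained = set().union(*protein_to_peptides.values())
    chosen = []
    while unexplained:
        # protein explaining the most remaining peptides wins (ties: arbitrary)
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & unexplained))
        chosen.append(best)
        unexplained -= protein_to_peptides[best]
    return chosen

proteins = {
    "ALBU_full":  {"pep1", "pep2", "pep3"},
    "ALBU_frag":  {"pep2", "pep3"},   # "fragment of" entry, only shared peptides
    "KRT_contam": {"pep4"},
}
print(greedy_parsimony(proteins))  # -> ['ALBU_full', 'KRT_contam']
```

Note how the "fragment of" entry silently disappears -- which is the behavior you want here, but the greedy rule gives you no probability attached to that decision, which is the gap the fully probabilistic methods (and EPIFANY) are filling.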

I'm oversimplifying a complex and varied environment of protein informatics software here. It isn't all this way. From the paper:

"Some methods tackle this problem by either ignoring shared peptides (Percolator 7,8), employing maximum parsimony principles and finding a minimal set of proteins explaining found peptides or PSMs (PIA4 ), iteratively distributing its evidence among all parents (ProteinProphet 9 ) or incorporating the evidence in a fully probabilistic manner (Fido 10, MSBayesPro11, MIPGEM12)"

The best way to do this? An exhaustive recent analysis of the iPRG 2016 dataset (the big ABRF study that comes up a lot) showed that the full probabilistic models are the way to go. More statistics, FTW!

However -- I've only used Fido, but it required a whole lot more processing time/power than even Percolating a large dataset. And this study suggests it's not just Fido...it's a brute force approach that, in the end, may not be realistic.

EPIFANY uses some fancy statistics to achieve the same (better?) inference results, but uses alternative logic (loopy belief propagation) that massively reduces the data processing load.

Full disclaimer -- I'm still trying to figure out how to use it because it runs in KNIME and I might be too dumb for it.  I just found this cool KNIME cheatsheet thing -- with this and the full pipeline and all data available here I'm hoping to work my way through it.  [Hooooly cow. You can run it from command line....how did I miss that!?!? ]

However -- the evidence here is solid that this is a better way to infer protein identifications. The authors test it against multiple datasets including the iPRG and use all sorts of ways to infer the protein identities and EPIFANY is the best -- or close enough -- and finishes in a reasonable time.

And -- look -- even if it didn't work any better at all, wouldn't it be better for us to use the tools that at least tried to use intelligent statistics to infer our protein identities? Grant review boards are grumpy by design. We don't need to give them excuses to fund more transcriptomics.


Thursday, January 23, 2020

Determine if your methionine oxidation is from biology or an artifact!


Okay -- so, despite all appearances, methionine oxidation (Met-Ox) is actually a really important thing. Before I get distracted, you should check out this really smart way of studying whether it is a biological Met-Ox or a sample prep Met-Ox artifact here.



This is an aside, but -- holy cow -- the first 11 papers I tried to find to prove this from home were all locked behind paywalls. I had to go back to this 1997 PNAS paper for something that was open access.

Are you a US citizen and do you think that if your tax dollars funded some research then those results should have to be openly accessible to you? If so, check out this thing some guy set up....


Here is a direct access link to this petition.

With that out of the way -- back to Met-Ox. For real -- this is important. It can be used as a metric for ROS scavenging, has long been thought to be impaired in a lot of diseases, and may even be a generic metric of aging. It just turns out that we don't have a great way of determining what is real Met-Ox and what is an artifact of the myriad ways our field extracts and digests proteins. And now we do! If it looks like Met-Ox might be playing a key role in your biology, you can get some heavy labeled hydrogen peroxide (and -- ouch -- it is surprisingly expensive, at least at the first suggestion Google had for purchasing it) and find out for sure!
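The logic of the heavy-peroxide trick can be sketched in a few lines -- assuming (my reading, not a quote from the paper) that prep-induced oxidation in the presence of spiked 18O-labeled peroxide carries 18O while genuine in vivo oxidation carries 16O, the ~2 Da difference in the modification mass tells the two apart:

```python
# Hedged sketch: classify an observed Met oxidation mass shift by whether it
# matches ordinary 16O oxidation or oxidation from spiked-in 18O-labeled
# peroxide. Monoisotopic mass deltas.
OX_16O = 15.9949   # +O (16O): candidate biological oxidation
OX_18O = 17.9992   # +O (18O): oxidation acquired during prep with heavy peroxide

def classify_metox(observed_shift: float, tol: float = 0.02) -> str:
    if abs(observed_shift - OX_16O) <= tol:
        return "biological (16O)"
    if abs(observed_shift - OX_18O) <= tol:
        return "prep artifact (18O)"
    return "unassigned"

print(classify_metox(15.995))   # within tolerance of the 16O shift
print(classify_metox(18.001))   # within tolerance of the 18O shift
```

A 2 Da split is trivial to resolve on any Orbitrap, which is why this scheme needs no special hardware at all.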

Wednesday, January 22, 2020

BioPlex Update Preprint -- 5,500 New Protein Interactomes -- in a new cell line!


Ummm....so on a scale of 1 to BioPlex -- how big is your big proteomics data? Holy cow. You know, sometimes when you don't hear about these huge proteomics undertakings it's easy to think "maybe they thought the first 10,000 human protein interactomes were enough..."

NOPE. BioPlex is alive and well and providing human protein protein interaction data at a pace that doesn't quite seem possible.

Proof? Check out this new preprint!


Not familiar with BioPlex?  It is a bulldozer type approach to human protein interactions. Instead of doing something complicated and elegant -- why not just synthesize every open reading frame in humans and do an expert level immunoprecipitation -- mass spectrometry experiment on them. Yeah -- every one! BioPlex 3.0 showed about half the theoretical human proteome. For real.

It is a project so big and ambitious that it is easy to forget about. How do you take this another step forward? Well -- you throw in some different cell types. And instead of looking at a few interactomes, you look at a few THOUSAND interactomes.

What on earth do you do with all that data? Besides make the most intimidating plots of all time (which you can do online at the BioPlex Explorer, here), well -- this might be the biggest of the big data for proteomics right now. Did you need an excuse to buy that TensorFlow laptop and take that online course that keeps popping up on that sidebar you can't seem to block anymore on Reddit? To really explore this -- we're going to need those artificial learning machine things -- OR

-- the BioPlex explorer is surprisingly powerful and intuitive!

Check this out -- I've got a protein that is strongly dysregulated in a bunch of samples by both transcript and by proteome. It seems important, but it's been confusing. I'll just put that into the BioPlex explorer -- BOOM -- visualizations of protein-protein interactions!


Okay -- so no surprise to me -- this thing has a ton of direct interacting partners. One thing that is cool and new here is how different this family of interactors is between the BioPlex 3.0 and the new HCT interactome.

If I didn't know what this protein did, BioPlex provides that information, and the data is all directly exportable in several formats -- and links directly to AMIGO (which was undergoing maintenance stupid early in the morning when I was writing this).



Around these very practical resources, the preprint draws some very impressive conclusions regarding the human interactome -- and -- let's just say that the interactome doesn't shift on a small scale. The interactome appears to shift on a completely global scale. Which...has some definite ramifications, right?

How many times do you get an IP-MS (AE-MS) comparison that is a pulldown from cell line A and cell line B? Hopefully the main characteristic of that cell line -- say, a homozygous KRAS weirdo terminus in B vs. wild type in A -- is what is driving the change in protein-protein interactions for your bait. But...if the entire interactome has globally shifted? How does that change your results and confound your downstream interpretation? Way too big picture for me, but something we need to keep in the back of our minds. Biology is complicated...

TL/DR: BioPlex is growing and is a shining example of what proteomics can be. Send this paper to every biologist you know. My guess is that it's going to be in a big journal pretty soon.

Tuesday, January 21, 2020

CyTOF data on single cells for 281 cancer patients with long term clinical data!


Somewhere around you, possibly within walking distance (depending on the relative funding level and decision-making skills of your administrators), is probably a big grey and orange box like the thing above. I'd be comfortable betting you $1.14 that it probably isn't doing anything right this second. This box is called a CyTOF and -- I swear -- it has all sorts of promise, but it's a bit of technology that is seeking a real application. And I'm going to jump on every paper I see that suggests we may have finally found it.

Imagine that you're doing flow cytometry. You've tagged a couple of proteins on your cells with antibodies, and those antibodies have a dye on them. The instrument measures the intensity of these dyes as the cells go through and you basically get single cell data on the intensity of those two proteins in each single cell. Now upgrade that idea and replace the dyes with metal isotope tags that you can see with a mass spec. The cells go through, get ionized, and the mass spec measures the tags for each cell! Great idea, right?
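What the instrument hands you reduces to a cells-by-markers intensity matrix that you "gate" by thresholding, same as fluorescent flow. A toy sketch (4 cells, 3 of the ~35 tags, made-up marker names and numbers):

```python
# Toy CyTOF-style output: per-cell intensities for a few metal-tagged markers.
# Gating is just successive thresholding on this matrix. All values invented.
cells = [
    {"CD3": 120.0, "CD19": 2.0,  "CD45": 300.0},
    {"CD3": 3.0,   "CD19": 95.0, "CD45": 280.0},
    {"CD3": 1.0,   "CD19": 1.5,  "CD45": 2.0},   # debris / empty event
    {"CD3": 140.0, "CD19": 4.0,  "CD45": 310.0},
]

def gate(cells, marker, cutoff):
    """Keep only events at or above the cutoff for a given marker."""
    return [c for c in cells if c[marker] >= cutoff]

live = gate(cells, "CD45", 50.0)      # real cells first
t_cells = gate(live, "CD3", 50.0)     # then CD3-positive events
print(len(live), len(t_cells))        # -> 3 2
```

With 35 markers the gating trees get much deeper, but the data structure never gets more complicated than this.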

{Deleted a lot of me making fun of how ridiculously underpowered the TOF on the back end of the CyTOF is, prior to hitting the publish button. However, because this technology has so much promise, it's a little soul crushing they didn't partner with someone -- anyone -- to make a better detector. There are mass specs you can carry around that have higher resolution.}

Okay -- but you are still looking at a bunch of proteins across a bunch of cells. With the right experimental design maybe you can get past the limitations on the back end -- and for real -- maybe this is it!!


This group uses 35 protein tags -- which is a lot for a flow instrument -- and a smart experimental design and a really large cohort -- and they end up running over 700 samples and collecting data on the 35 proteins across the SINGLE CELLS from these patients. Right?!?!? Yeah -- it's only 35 markers -- but this is a ton of smart, and then they can correlate their findings at the single cell level with clinical data and -- get this -- they have long term recovery data for 280 of these patients!

This is how you utilize a CyTOF. The problem is going to be access to data that is this powerful for every institution that has invested in these things -- but -- it's a start and this is an awesome study.

Monday, January 20, 2020

Announcing the First Ever News In Proteomics MineAthon (Challenge)!


I have been working on yet another crazy idea off and on for a month or two and it's now almost (like 18%) fully organized.

I'll stand by these words all day. Proteomics hardware is just about mature. Yeah, we'll get some cooler stuff down the road, but until we figure out how to fix our informatics problem -- who cares if you get 3% more peptide IDs or 10% more spectra? Most of the tools people are using are only converting a tiny percentage of spectra into biological findings. There is much more to be gained with smarter data processing than even applying phase constraint over a wider mass range. In the most popular data processing pipelines, people aren't even looking for PTMs, because it's still really hard to do.

SO....Let's see where we are right now!

Do you think your data processing pipeline is the best for finding important biological changes and PTMs? Want to prove it, participate in some cool human research, be on a cool paper, give a world-wide webcast talk and maybe even get a trophy and definitely get the chance to talk some smack to your peers?  Yes?

Time to sign up for the --- 


FIRST ANNUAL (News In) PROTEOMICS (Research?) DATA MINEATHON!!

(EDIT: I was just told an "athon" means you do it now. This is a "challenge" since we do it over an extended time period)

(...echo...echo...echo...) 


How's it work? 

You register by sending an email to lcmsmethods@gmail.com on or before we start mining data! Let's put a deadline of February 13th  16th 2020 to start. I'll make a list with your name and contact info on it and definitely will not lose that list. This is important to me.

On February 13th 16th you and anyone else who has signed up (honestly, maybe just you) will be provided the link to download a relatively large label free human proteomics data set (the one I like is 66 Q Exactive single shot files, but we're looking for the most important and under-mined set of data we can find, and I can't swear it'll be that one). I want to keep it realistic for today's human studies by using a real and awesome human study.

You have until March 31st to turn in your results (I like long deadlines. I figure most of you people have jobs and classes and stuff and probably like decently long deadlines as well).

The goal will be to find the most important differences between patient and control samples with a specific focus on those pesky PTMs!

Why would you do this? 

No reason, to be honest. I'm just too lazy to do it myself and I'm crowdsourcing so I don't have to.  Wait! That's not right! There are reasons!

1) Bragging rights. There will be a real winner to this contest, as well as some top candidates based on some of these criteria by our not-yet-chosen judges:
A) Most PTMs
B) Best evidence of said PTMs
C) Best presentation of said PTMs
D) Most useful PTMs
E) Metrics for the quantitative changes of said PTMs.

Remember when we got dumb trophies for everything? "You ran around the playground without falling down more than twice? Have a trophy!"   Then you never ever get a trophy ever again? That's dumb.  I think we should get an awesome trophy for this. I'll find a trophy store. Not even joking.

2) FAME!! Are you familiar with GenomeWeb? It's a big deal for people that do science business stuff.  The top candidates, chosen by our impartial and-not-yet-selected judges, will be allowed (if they're interested) to present their analysis and their results via a live streamed webinar on GenomeWeb. I've talked to them and they didn't say no.  I don't think anyone actually said yes, but they were totally cool about it and they're altogether great people.

3) A paper!  Yo, we're going to try and find the most important and under-mined set of files that we can. Then we're going to mine the crap out of it and try to show what today's proteomics can really do! And we're going to showcase the ever loving shit out of the fact that it's 2020 and proteomics isn't just hardware.

I think I'm going to even put this in for at least a poster or a talk or two somewhere so I can talk about how amazing you and your solution are. Somehow I gave like 10 invited talks last year. I hope I'm not dumb enough to do that many this year, but I'll totally get you and your results and solution as much exposure as I can (which I can't swear will help you in any way. I think I get invited to talk places just so people can find out if I'm as strange in person as I appear in writing and, if you are short a qualified proteomics speaker, you can always try me, I clearly love talking about this stuff)

Who is eligible? 

Everyone! We don't care if you wrote your own pipeline or if you've just kluged (is that a word?) together a bunch of different tools into something semi-feasible that totally works for you, even though you've never been able to explain it to anyone else well enough that they could do it (although...to be honest...that might not be ideal, but I'll work with it!). I don't care what time zone you are in (we'll just adjust the webinar accordingly, and I'll ship the trophy wherever). Although if you are somewhere really cool I seriously might come deliver it myself. Again, this is important to me.

Disclaimers:

There aren't any. A lot of my favorite people I've ever met have been responsible for the software that I use every day. I clearly have my biases and my favorite tools, but that's why I'm going to get some impartial judges. I'd like to just be the hype man.

If no one enters? 

That's okay, too! I really wanted to write something in this box today and I'm going to run the same dataset through every tool I have on my PCs and I'll announce a winning software and I'll be very glad that I put the deadlines so far in the future! Your solution just won't have a chance if I don't know how to use it. Probably your solution isn't very good anyway. Poop head.

Sunday, January 19, 2020

ShinyGO! A beautiful, simple and powerful online data interpretation tool.


I didn't want to write about this one until I got these stupid manuscript edits out the door, because I needed ShinyGO and I didn't want anyone else slowing it down.

"Minor" edits done! Tool sharing time! You can read about ShinyGO here.


Don't feel like reading? You can play with ShinyGO online here!


Are there lots of tools like this out there hiding on the web? Yup! There sure are, but this honestly might be the easiest way to dig through a bunch of different resources all at once. You will need to get your protein list to universal gene identifiers or something similar (it'll translate several different types) and then you can start doing all sorts of analyses. For the figure above, I let ShinyGO select the closest related organism (which ended up being Scumbag Arabidopsis) and I flipped through different databases until I got some visualizations that made my data make some sense (it turned out to be a nice visualization of KEGG resources with pathway representation scaled). I found that the best way to get what I wanted was to take the up- and down-regulated proteins separately, create my network for each, and then compare them. Worked for this model organism!

I love the fact that I can move my network nodes around and then export the image with them in that place. If you don't like that particular visualization you can export the Edges and Nodes and import them into your tool of choice.
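If you'd rather script the list prep than wrangle it by hand, the up/down split is trivial to automate. Here's a minimal Python sketch -- the column names ("gene", "log2fc", "adj_pval") and cutoffs are my own assumptions for illustration, not anything ShinyGO requires; it just wants the gene lists:

```python
import csv
import io

def split_gene_lists(csv_text, fc_cut=1.0, p_cut=0.05):
    """Split a quant table into up/down-regulated gene lists,
    ready to paste into ShinyGO one list at a time."""
    up, down = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # skip anything that isn't significant
        if float(row["adj_pval"]) > p_cut:
            continue
        fc = float(row["log2fc"])
        if fc >= fc_cut:
            up.append(row["gene"])
        elif fc <= -fc_cut:
            down.append(row["gene"])
    return up, down
```

Run each list through ShinyGO separately, build the two networks, then compare -- same idea as above, just reproducible.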

At long last -- A Guidance Document for HLA Peptides!



Twenty-something years of analyzing HLA Class I/II antigens with mass spectrometry and we need to face facts -- mass spectrometry of these things still sucks. Important? Yes. Super incredibly important. But mobile protons are NOT a fun thing for us to work with as our exclusive charge acceptors. Our technologies work best with doubly charged peptides that end in lysines or arginines.

But LCMS is the only thing that has ever worked at all for these molecules, so we're stuck with it. What we need is a set of guidelines to at least reference -- and here is the first one I've seen!


Honestly, I expected this to have maybe 50 different authors on it as some sort of overarching consensus from a big meeting on the topic to sort it out. And that might make me a little more comfortable, but I tell you what, this document is not bad at all. You know why?

This group has actually validated some HLA peptides and successfully utilized them! This isn't a big hypothetical piece. This is the next stage beyond where most of us have been going. I came away thinking it was really sobering. On our side, we're pressured to find more and more of these peptide identifications, and hitting so many hundred or thousand is the only metric we have. We don't need to find 1,500 mediocre peptides. We need to find the one really good one that differs in the cell type you want to target. This isn't a long read, and if you're doing these kinds of studies, I 100% recommend it.

Saturday, January 18, 2020

Making publication ready annotated spectra with IPSA and PD (or any other tool)!

IGNORE MY WRITING. Make beautiful MS/MS spectra easily online by pushing this hideous big button. 


Lookin' to make beautiful spectra for your poster or publication? Just push that big button!! 

This might, yet again, be a post mostly for me, because I can't seem to remember the name of this tool and I keep going to Google Scholar and looking through 2019 papers from the Coon lab. And that isn't exactly a one-paper-a-year sort of lab. And since I'm already typing I'm going to show you how to use this awesome tool.

You can read about it in MCP here.

If you need to do something fancy, beyond what the online IPSA tool can do, you can download the whole thing on Github here and manipulate it (or run it locally in your own web browser on your offline computer).

I'm going to go through this from a Proteome Discoverer-centric application using IMP-PD 2.1 (the free PD version that you can get here). Sweet! Here is a great tutorial for installing it that I hadn't happened upon before...as well as a tool I haven't checked out yet!

One thing Proteome Discoverer has never done (and -- honestly -- most, if not all, software packages haven't either) is make images that your editor and reviewers won't make fun of. There are some people out there with the kind of exploited-student free time that has let them remake all sorts of spectra in things like Adobe Illustrator, which puts anyone who can't afford the financial or time costs at a disadvantage. Illustrator images can be so pretty you can hardly look at them.


Compare that to the output from your normal tool of choice. Functional? Yes. Pretty? Probably not (and the ones that are pretty, like Scaffold, often don't contain all the information you want).

IPSA fixes that! You'll need a couple of things first.

1) Your peptide sequence
2) The exact mass of your post-translational modification, if applicable
3) A clean spectrum to work from

I'll assume you have both 1 and 2. For number 3, I'm going to use the free version of Proteome Discoverer, version 2.1, because of two nodes that are compatible with it: the IMP-MS2 Spectrum Processor and the Spectrum Grouper.


I think the IMP-MS2 Spectrum Processor has been integrated into MSAmanda 2.0 for later versions of PD, but this is how I'm doing things. (PD 2.1, last I checked, was compatible with the largest group of second-party nodes, and I'll always keep a version or two installed on everything just in case I need something neat that I don't have in later versions. I strongly advise you take the 15 minutes when you get a new computer and do the same!)

The MS2 Spectrum Processor will deconvolute your MS/MS fragments to all singly charged. BOOM! Much simpler spectra. It won't work well, or at all(?), with low resolution spectra, but it works perfectly on higher resolution ones. You can also deisotope. I do, just so it isn't so hard to look at everything. It reduces your spectra to the monoisotopic peaks alone.
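For the curious, the charge-deconvolution arithmetic itself is simple. This is just the standard charge-reduction formula as a Python sketch, not the IMP node's actual code:

```python
PROTON = 1.007276466  # proton mass in Da

def to_singly_charged_mz(mz, z):
    """Collapse a fragment observed at m/z with charge z down to its
    singly charged (M+H)+ equivalent."""
    # neutral mass M = mz*z - z*PROTON; report it with one proton back on
    return mz * z - (z - 1) * PROTON
```

A +2 fragment observed at m/z 500.5, for example, collapses to a singly charged peak near m/z 999.99 -- which is why the processed spectra look so much cleaner.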

The Spectrum Grouper finds MS/MS spectra (if you have them) that are clearly duplicates -- even if you fragmented the +2 and +3 versions -- and puts them together if you select grouping on "singly charged mass." To be perfectly honest, I'm not 100% sure I know what this is doing. I thought I did, but I can't be 100% sure I know what "grouping" means in this context. Meh. I'll investigate later.

If you're trying to annotate PTM MS/MS spectra, definitely throw ptmRS or AScore (at least similar enough to consider together) into the pipeline.

Run this and make some stupendous identifications!

From the PSM tab in PD you can double-click on your peptide of interest and bring up a nice and informative MS/MS spectrum that your reviewers will make fun of. If you right-click on it, you can select "Copy Points". This takes all the MS/MS fragments, minus the annotations, and produces a 3-column text output.


Next, go to Excel or your Excel-like program of choice and paste the data into it. It'll look something like this. (Please note, the examples don't match in the images above/below, because I'm lazy.)


I did this so I could ditch the scan headers, which will confuse IPSA. 

All IPSA wants is the MS/MS fragments and intensities. I highlight the cells in columns A & B (rows? I'll live my entire life without ever truly knowing which is which -- thanks, dyslexia, you're the best!) and copy with Ctrl+C, or whatever you Macintosh people use.
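If the Excel detour annoys you, a few lines of Python will do the same header-stripping. This is my own throwaway sketch, not anything from the IPSA authors -- it just keeps lines that start with two numbers and drops everything else (scan headers, blanks):

```python
def clean_peak_list(text):
    """Reduce a pasted 'Copy Points' export to (m/z, intensity) pairs,
    dropping any scan-header lines that aren't two numbers."""
    peaks = []
    for line in text.splitlines():
        parts = line.replace(",", "\t").split()
        try:
            mz, inten = float(parts[0]), float(parts[1])
        except (ValueError, IndexError):
            continue  # header, blank, or otherwise non-numeric -- skip it
        peaks.append((mz, inten))
    return peaks
```

Paste the result straight into the big obvious IPSA box.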

Hit the big button at the top of this blog post and go to IPSA. 

Click to expand the image below. This tool is amazingly straight-forward, but I'm still going to number things. 
1) Ctrl+V your cells into the big obvious box! 


2) Copy/paste your peptide sequence in. In PD 2.1 it is easiest to do this from the very top line of the Peptide Summary (where you found the "Copy Points" button a couple of images up from this one).

3) Put in the charge state of your peptide.

4) Determine the charge states of the fragments you want to see. I've, in error, selected 2 here, which doesn't make sense in this workflow, but it did make sense with a low resolution spectrum I couldn't deconvolute. Keep in mind that the IMP tool isn't perfect and may not catch every MS/MS fragment if the charge state is particularly high. Worth toying with if you've got an unexplained fragment.

5) Select the fragment ions you want to see, and whether you want to see neutral losses. I only use those if there are a lot of unmatched MS/MS peaks.

6) Use a reasonable fragment tolerance in Da or ppm. The matching threshold is how low, relative to the base peak, a fragment can be and still get labeled. It defaults to zero, which might be too messy. Putting in a 5 means you won't label stuff below 5% of the base peak.

Okay -- so this is reeeeeaaaly cool. If you don't like where your labels are, height wise, you can move them. See that y5 for example? Just click and drag it up so you can clearly read it. Then when you....

7) Generate your SVG -- it keeps them that way! I'd also recommend exporting the data so that you have this output. It makes a handy CSV with the same title (your peptide sequence) as the SVG and saves it in the same place.
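The two boxes in step 6 boil down to two comparisons. Here's a sketch of that logic in Python -- my own illustration of the arithmetic, not IPSA's source -- with hypothetical defaults of 10 ppm and 5% of base peak:

```python
def should_label(obs_mz, theo_mz, intensity, base_peak,
                 tol_ppm=10.0, min_rel_pct=5.0):
    """Label a fragment only if it matches the theoretical m/z within
    tol_ppm AND is at least min_rel_pct percent of the base peak."""
    ppm_err = abs(obs_mz - theo_mz) / theo_mz * 1e6
    rel_pct = intensity / base_peak * 100.0
    return ppm_err <= tol_ppm and rel_pct >= min_rel_pct
```

Tighten tol_ppm for high resolution data, and raise min_rel_pct if your annotated spectrum looks like a hedgehog.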

Maybe you're done!  However, if you want to make changes, NOW it's time to break out Powerpoint or Illustrator.

Illustrator will directly import the SVG and allow you to manipulate it, if you know how to use it right. (I don't, and I made my figures much worse.) Powerpoint (at least my 365 version) will also directly import the SVG, and then I can make changes to it.

If you want to make changes, like adding in some text, you aren't done yet. IPSA uses 3 nonstandard fonts that you probably don't have installed. At the top of the page there is a button that says "Download fonts". Do that, unzip them, and then type "Fonts" into your Windows search bar thing.

In Windows 10 (booooooooo!) it'll look like this.


Predictably, it won't work quite right, but if you click/drag/drop enough times and say the exact right combination of profanities, it will eventually recognize the fonts you installed. Powerpoint may not immediately add the new fonts to the font bar, but if you type the name of the font in the box it will update.

The spectra are annotated in OpenSans. On my screen, to match it exactly with a spectrum taking up a full slide, my b/y ions are OpenSans 16. This may not be universal. The other 2 fonts are for the text around the beautiful spectra.

Why am I adding stuff? I'm just putting in the exact mass of the ions that most clearly illustrate where the NAGs are located in my MS/MS spectra, which is never ever ever on tyrosine.

Now that you've got your spectra in, add the right arrows and colors, and save the image however you want (probably .TIFF, since the image isn't compressed). And then I'm done typing!


Friday, January 17, 2020

Context specific FDR for top down proteomics!


On the bridge of the Starship Northwestern, Captain Kelleher and his crew are exploring the farthest reaches of proteomics, going where no lab has gone before.

I just had the worst idea ever -- and -- of course -- there is an internet tool where you can take anyone's picture and "Trek" it.


(...sorry...)

What started this ramble? Well, while we're here on earth still struggling with accurate estimation of FDR for linear and slightly modified peptides, on the Starship Northwestern they're beaming down tools for accurate estimation of FDR for freaking intact proteoforms!


How are you currently assessing FDR for proteoforms? I'll tell you how I am. I'm not. I'm so pumped that I've identified a few dozen proteins from fragmenting their intact forms that I'm just popping them into my list. And -- I'd wager that is what basically everyone is doing outside of the 4 labs that do top down proteomics each and every day. And if you've got an exact mass and some sequence information and you can check your 24 proteins, that is probably even okay.

However, if you're really getting hundreds/thousands of IDs? You need a real way to estimate these, and this great new tool (which is freely available on Github here) provides a real starting point for these calculations. And it turns out that context is very, very important.
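For reference, the garden-variety global target-decoy estimate that most of us lean on for peptides looks like the sketch below. To be clear, this is the generic calculation for illustration only, NOT the context-specific method from this paper -- their whole point is that one global number like this misleads you for proteoforms:

```python
def estimate_fdr(scores, is_decoy, threshold):
    """Classic global target-decoy FDR at a score threshold:
    (decoy hits passing) / (target hits passing)."""
    targets = sum(1 for s, d in zip(scores, is_decoy)
                  if s >= threshold and not d)
    decoys = sum(1 for s, d in zip(scores, is_decoy)
                 if s >= threshold and d)
    return decoys / targets if targets else 0.0
```

If one decoy proteoform sneaks above your cutoff alongside three targets, you're sitting at roughly 33% FDR -- which is exactly why tools that model the context properly matter.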

The authors pressure-test this tool using a true known sample and by reanalyzing some previously published materials to show that, for today's top down proteomics, both on earth and out there where they're exploring, this is the way to engage...


....your results.