Monday, June 29, 2015

Is "generation loss" hurting your results?

I thought this was one isolated event but I may have just run into this issue again and did some investigation this morning.  I don't have a full wrap on it but I wanted to get the idea out to as many people as possible.

Here is the issue in a nutshell.  A few weeks ago I visited a lab that was getting bad results here and there.  This was a lab of the very best sort.  People with more experience than me, with great instrumentation, flawless methods and chromatography and thorough quality control at every level.  But their processed proteomic data looked bad in a semi-random sense.  The data coming off the many acquisition computers was automatically transferred to a big storage server maintained by the University (awesome, right?!?) and then the data could be transferred to the processing PCs.  It turns out that somewhere between the two transfers HOLES got poked in their data.  No joke.  After the transfer, the .RAW files would have spectra that had nothing in them.  Not a darned thing.  And this messed up data processing.  The program would see these errored spectra and flip out a little, sometimes jumping many MS/MS spectra and not getting a thing out of them.

Again, I assumed this was an isolated incident...then it possibly reared its head again while I was on vacation last week.  So it's time for my first sober post in a while!

I'm no expert on this, but I've read several Wikipedia articles on it today.  And I think that we're looking at something called "generation loss" (or something similar).  You can read about it here.  In a nutshell...

...sorry.  Sober post, I swear (though I just realized how long it's been since I last saw that great movie)!

Anyway...back on topic...when we compress and/or transfer data we lose stuff sometimes.  It's like the old Xerox effect.  You can only Xerox a document a certain number of times before it's junk.  Data compression is one issue (I'll come back to it) but data transfer is another.  There are many ways to transfer data from one place to another and sometimes a system has to decide between two things -- speed vs. quality (there is something related here.)

A few years ago we all started moving away from one data transfer mechanism called FTP (though FTP sites are still around).  FTP is super fast, and relatively easy to use but it has no data correction native to the format.  (Supporting evidence here).  So I would have someone send me a 1GB Orbi Elite file and maybe it would get there intact...and maybe it wouldn't...  FTP can be encoded with extra security features that include autocorrection but better data transfer mechanisms exist.  What was interesting, though, was that most of the time if I had an FTP transfer error I simply couldn't open the file.  I don't actually have a good tool to determine if some spectra were missing, though.  Again, there are tons of ways to transfer data but I think from the equations in the link above that there is an inverse relationship between speed and quality, particularly when data correction algorithms are used.
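One generic way to at least catch a damaged transfer (not something I've tested on the files above, just a minimal sketch) is to compute a checksum of the file on the acquisition PC and again on the processing PC, then compare the two. The function names here are made up for illustration, and this only tells you THAT a file changed in transit, not which spectra were hit.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that multi-GB files never have to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, copied_path):
    """True when both files have identical bytes (same SHA-256 digest)."""
    return sha256_of(source_path) == sha256_of(copied_path)
```

In practice you would run `sha256_of` on the original, send the digest along with the file, and recompute it at the destination. If the two digests differ, the copy is not byte-identical and should be transferred again.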

Right.  So that makes sense, right?  So we should transfer slower and get better data quality.  Even a Fusion only maxes out around 1GB/hour, right?  And most PCs these days have gigabit ethernet connections so that should be no problem.  However, what if you had a ton of systems transferring this much data?  And what if the data coming off the Fusion or Q Exactive was relatively small in comparison to the other data coming through?  Then you've got some tough decisions to make.
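To put rough numbers on that, here is the back-of-the-envelope arithmetic as a small sketch. It assumes decimal gigabytes and an ideal 1000 Mbit/s link with no overhead, so treat the output as an order of magnitude only. At 1GB/hour a single instrument needs roughly 2.2 Mbit/s, a tiny fraction of a gigabit link.

```python
def gb_per_hour_to_mbit_per_s(gb_per_hour):
    """Convert a rate in decimal gigabytes per hour to megabits per second."""
    bits_per_hour = gb_per_hour * 8e9       # 1 GB = 8e9 bits (decimal GB assumed)
    return bits_per_hour / 3600 / 1e6       # per second, then bits -> Mbit


def link_fraction_used(gb_per_hour, link_mbit_per_s=1000.0):
    """Fraction of an ideal link (default: gigabit) that this rate would occupy."""
    return gb_per_hour_to_mbit_per_s(gb_per_hour) / link_mbit_per_s
```

So one instrument is nowhere near saturating the connection on its own. The squeeze only shows up when lots of systems share the same pipe with much bigger data streams.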

I think this is related, though.  DNA/RNA sequencers generate much more data than we do and at a much faster rate.  And they've been doing it the whole time.  Integrated into these sequencing technologies has been (and sure has to have been) data compression and transfer mechanisms (some related info here).  There is no alternative for them.  This data has to be compressed in some way.  When you are getting terabytes of data per day from a HiSeq platform you need to do something with it.  

This is where I need to speculate a little.  What if you are in a big institution and you have shared resources with a genomics core?  Would these mechanisms be automatic?  Would the central storage server run at higher speeds that would cause issues with data fidelity? Would they use some level of data compression to control storage of files above a minimum size?  I don't know.  What I do know is that 2x in the last month or so I've seen data that had lost quality.  Hopefully it's a coincidence and not a pattern.

The next question, of course, is how do our universal formats like mzML and such deal with compression and transfer?

Sorry I don't have great answers here.  Definitely curious if you guys with actual knowledge on these topics can weigh in here!

Saturday, June 27, 2015

Interesting update on the peer review process

Stumbled onto an interesting perspective on where the peer review process is currently. It focuses on the $. You can check it out here.   I don't know how we could possibly improve this system but it is an interesting bit of journalism.

Friday, June 26, 2015

Also! PD 2.0 workshop in NYC!!!

Hey!!! So I've lost some emails (or had trouble sending them) but there is now also a Proteome Discoverer 2.0 workshop in NYC!

It's here on Wednesday, July 8th!  I think y'all oughta be there at 9:30 and I'll start running my mouth at 10am!  If you want to come, register with:

Marriott Courtyard New York
Manhattan/Midtown East
866 Third Avenue between 52nd and 53rd Street
New York, NY 10022

Hey, my fellow Baltimorons! Let's sit down and dig through PD 2.0!

Howdy, hon!  Want to sit down with me and get to the nitty gritty in Proteome Discoverer 2.0?  You can register here.  

I recommend that you bring a laptop and download a copy of the demo version (or a tablet or whatever that can remote login to your full PD 2.0 version).  And bring questions!

Thursday, June 25, 2015

LC-MS/MS analysis of a missense mutation that changes glycosylation!

This is a fascinating study from Ehwang Song et al. that shows something I'd never considered could happen:  a simple missense mutation (only one amino acid changes) that results in a new glycosylation site.  The protein in question is a clinically used biomarker for prostate cancer called PSA.  While studying this marker previously this team ran into some glycopeptides that shouldn't exist.  So they pulled out the stops and studied it in depth.

What they ended up with is a brand new glycosylation site that can occur in the protein when one missense mutation occurs linked to kallikrein mutations.  What do I get from this paper?  A ton of terms that are all new to me, and a fantastic appreciation of how little we seem to know about even things that we've studied the crap out of.

Wednesday, June 24, 2015

Reproducibility in peptide prep, FTW!

So, I've been somewhat unabashed (context correct? WV grammar?) regarding my love of reproducible digestion techniques.  Look.  I understand. I know the one time that you did that experiment and you accidentally used 37mM AmBiC that you got 3 extra peptides and one of them was the peptide you critically needed.  Seriously, I'm not joking, I understand.  I know I've got a thing where I don't like to alkylate my peptides when someone is using the printer. It never seems like things work out when it's running. I don't know.

But the truth of the matter is that we HAVE to get onto the same protocols. The survival of our field is at stake here.  According to the numbers I'm looking at in the next tab the NIH/DOD/NSF gave $12 to the next gen sequencing people for every $1 they gave us and the biggest reason (wanna guess?) seems to be the fact that we can't reproduce what we are doing.  We can't.  And it's 'cause we're probably the most stubborn field in all of science right now (random insert, lol).

We know without a doubt that proteomics can be reproducible (check out Sue Abbatiello's paper here!). But the fact that we all stick to dumb things, like George using 16 hour digestions at RT while I do 30 minutes at 50C, and the fact that George and I have a 10% overlap in our data, makes us seem just plain silly and not a great bet for giving funding to.

But there are alternatives.  Ones that are easy and awesome.  The first one (one that I'm kind of in love with) is Perfinity.  Digestion kits that rapidly provide reproducible proteomics data.  A new alternative is the SMART digestion kits from Thermo.  Let's be honest here.  If you knew the sample inside and out, could you probably alter some parameter and get a few more peptides out of your sample? Probably. But could a researcher on another continent get exactly the same results you did?  No. Kits like these are the future of the field and the sooner we hop on these bandwagons the sooner we're gonna show the next gen sequencing people that DNA and RNA are only a fraction of the answers that they need.

Sorry if this is all preachy!  I know as well as you do that proteomics is the future, but we need to work out some of our demons before we can convince the old guys who dole out the money that it is!

Monday, June 22, 2015

A quest for missing proteins! An update on where the awesome C-HPP is now

After all the press surrounding the first human proteome drafts last year, it might be easy to forget about the chromosome centric human proteome project (C-HPP).  This HUGE, multinational project represents years of work into getting a full picture of the human proteome.

This picture (and several out there like it, some that are more updated) has always made me happy.

Science sees your international boundaries and this is exactly what it thinks of them!

If you are interested in exactly where this project is now, you can check this update in JPR for 2015 (open access, w00t!)

Sunday, June 21, 2015

How to extract only your modified proteins out of Proteome Discoverer 2.0 results

Hey guys!  Have you been working with Proteome Discoverer 2.0 and ran into a specific question?  Maybe I can figure it out.  This week I popped in on a lab that had a couple of great questions.

The first one is:  How do I extract only proteins with my PTM of interest?  I figured other people might be interested in the answer so I made this video.

Keep the questions coming! I can't guarantee a great turn-around time, but I can sure give it a try.  Honestly, I'm still learning this software, just like the rest of you guys but I don't have specific questions to try out and force me to learn new things.

This video will be added to the PD 2.0 videos over there -->

Saturday, June 20, 2015

How does your method change your phosphoproteomics output?

This is a really nice analysis.  It stems from the fact that we're a field of rebels and no one will use the exact same techniques as anyone else.  The paper in question is this one from Evgeny Kanshin et al., and it investigates the effects of different harvesting techniques on phosphoproteomics output.

The good news? Phosphoproteomics seems pretty tough.  Using different buffers and methods doesn't seem to change things all that much.  I'd bet $10 though if we did a PCA analysis of the observations from different methods we'd see some level of batch effects, but the fact they didn't see completely different things is super encouraging!

An interesting (unexpected!) observation is that the use of some phosphate buffers appears to bias the results by causing some new activations.

Friday, June 19, 2015

What's the biggest database you can load into Proteome Discoverer 2.0?

This question came up the other day: how big of a FASTA database can I load into PD 2.0?  After asking around, no one seemed to know, so I used the FASTA Tools to create various sized databases. It was very scientific. I'd type a bunch of things to parse on like "coli" and "strain" and "L" until I got FASTAs of various lengths parsed out of my 13GB Swiss-Prot/TrEMBL database.

The biggest one that would go?  5.0GB.  5.8GB crashed out.  So...there you go!

P.S. This doesn't seem to be RAM limited (like Excel or something similar).  I've got plenty of free RAM to support 5.8GB.

EDIT 7/3/15: I sent this 5.8GB database to the programmers in Bremen and someone successfully loaded it.  This does appear to be a resource issue with my personal PC.  I'm going to clear some space (y'all quit sending me all this data! just kidding, keep it coming) and try this titration again.
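If you want to build test databases of different sizes without PD's FASTA Tools, the keyword parsing described above can be sketched in a few lines. This is a hypothetical standalone stand-in, not what the FASTA Tools do internally: it keeps every record whose header line contains the keyword (case-insensitive) and streams line by line, so a 13GB input never has to sit in RAM.

```python
def filter_fasta(lines, keyword):
    """Yield the lines of every FASTA record whose header contains `keyword`.

    `lines` can be any iterable of text lines, such as an open file handle.
    The match is a case-insensitive substring test on the '>' header line.
    """
    needle = keyword.lower()
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; decide whether to keep it.
            keep = needle in line.lower()
        if keep:
            yield line
```

To use it, open the big FASTA for reading and a new file for writing, then pass `filter_fasta(big_file, "coli")` to the output file's `writelines`. A broader keyword gives you a bigger database.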

Thursday, June 18, 2015

SourceForge is now listed as malicious by Google

If you've been following along in the tech/programming sector, you know we've been seeing some pretty bad stuff come out of SourceForge lately.

According to a Redditor with the handle "arromatic" many popular programs have been taken, packed full of MalWare and then re-listed on SourceForge. Unsuspecting people download the fake version and then their PC turns into a mess.

A lot of Proteomics freeware and opensourceware has been listed on SourceForge. We're a pretty small subset of the population so you'd think we wouldn't be a target, but I've run into a few odd things when trying to hunt down cool programs.

Sorry if this is a silly blog post but I thought I'd share some tips that we could follow, primarily: Make sure the link makes sense.

For example...if you Google "MSFileReader" the top hit is a weird link ("something").  MSFileReader is a product of Thermo Fisher Scientific.  You get it (and virtually all other Thermo software, including Demo versions of proteomics software) at this link: 

This is a piece of software that is popular enough that it can be diverted for nasty reasons.  Not to say that the link that is the top hit is malicious, but why take that chance?

For OpenSource or FreeWare produced by the great programmers in our field, always go to their website first or to the publication and follow that link exactly.  If they posted their software to SourceForge, skip past this red screen on Google.  SourceForge itself isn't bad, neither are the programmers that post their work. The con-men taking advantage of good programs are the problem. Avoiding them takes just a little extra effort, but it's worth it.  'Cause processing data is a whole lot faster if the PC isn't all crammed full of spyware and malware!

Wednesday, June 17, 2015

Reproduce the awesome new Nature mass-tolerance search with Proteome Discoverer and Byonic

For about 6 hours my PC has sounded like a jet engine is taking off.  Why?  Because I HAVE to see if I can replicate the findings in the Nature paper from Chick et al. with Proteome Discoverer.

Here is how I set it up.

Is it perfect?  Of course not.  But it's a start.  Yes, you have to have Byonic for this, or at least something with a "delta M" or in this case "wildcard search" option.

Now...I did have my Byonic node in PD 2.0 set to "Heavy" CPU usage.  When I got back from visiting a customer this afternoon the processor was pegged at 100% usage.  I'd have put a screenshot in but I didn't have enough processing power to actually take one (and I appear to have misplaced my phone....somebody want to call me?)  So I crashed the whole thing out and went into Byonic settings and changed them to custom and decided to only use 6 of my physical CPU cores.  This gave me the usage picture above (all 6 cores are pegged out!)

I also realized that I'd never get to the end of the search today with the full Uniprot database.  (And I actually have to do work with this PC at some point! Collaborators are waiting for data!) so I created a custom database that only contains human kinases.

The data file in question is a 200ng injection of the Pierce HeLa digest run on a 2 hour gradient on somebody's Q Exactive Plus instrument (separated on a 25cm column, I think).

As suggested by Joel Chick in the comments of the last blog entry I dropped the number of missed cleavages (in this case, I decided to use 0, again, I've got to do some work with this thing!)

So...How'd it do?

Not terrible.  It knocked the search out in less than 1.5 hours.  It came back with a total of 875 PSMs.

I extracted the PSM data to Excel and then removed anything that was a Cys +57.  This left me with 169 wild card PSM matches.  If I follow the paper, I would line these up by mass and bin them (histogram Excel 365 function, FTW!) but I'm too lazy for that.
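For anyone less lazy than me, the line-up-and-bin step could be sketched like this outside of Excel. This is only my rough reading of the idea, not the paper's actual procedure: the function name and the 1 Da default bin width are assumptions, and the input is assumed to be a plain list of wildcard delta masses in Da pulled from the PSM export.

```python
import math
from collections import Counter


def bin_mass_deltas(delta_masses, bin_width=1.0):
    """Count delta masses per bin, keyed by the nearest bin center.

    With bin_width=1.0 a delta of 79.966 lands in the bin centered at 80.0.
    Returns a dict {bin_center: count}, sorted by bin center.
    """
    counts = Counter()
    for delta in delta_masses:
        # Round to the nearest multiple of bin_width.
        center = math.floor(delta / bin_width + 0.5) * bin_width
        counts[center] += 1
    return dict(sorted(counts.items()))
```

Bins that pile up at the mass of a known modification (around +80 for phosphorylation, around +16 for oxidation, and so on) are the ones worth chasing. Bins with a count of one or two are more likely to be noise.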

I was pretty much just going to answer this question:  can I perform a +500Da mass-tolerant search with Proteome Discoverer 2.0?  Yes, if I have Byonic.  Will it be painful with a full organism FASTA?  Probably, but it IS definitely possible.  The second question is, did I get new stuff?  You bet your sweet peppy!

Tuesday, June 16, 2015

What are all those other MS/MS spectra we didn't identify?!?!?

Okay. I'm thinking that this week just produced proteomics paper of the year.  This is nuts.  Deep breaths, Ben...

The paper in question?  Joel M. Chick et al., out of Steve Gygi's lab.  It's called: A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides.

The above is a pie chart I recently started a Proteome Discoverer 2.0 workshop with.  This is the number of MS/MS scans that I get in a typical Fusion run with 250ng of HeLa.  The orange section is the number of MS/MS scans that I turn into peptides using a normal SequestHT run.  Through the workshop I use tools like Preview and Byonic and gradually chip away at the blue section, finding the identity of thousands more of those MS/MS fragmentation events.

According to Chick's new paper, I'm just scratching the surface!  They went in and addressed the unmatched spectra by using HUGE windows.  They opened it up 500 Da.  500Da!  I used 10ppm.  Even in Byonic searches when I go full out wild-card I only open up the tolerance by 80Da and that takes all day.
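To get a feel for how much bigger that window is, here is a quick sketch of the arithmetic. The 2000 Da peptide mass is just a hypothetical example value I picked. At 10ppm a 2000 Da peptide has a tolerance of 0.02 Da, so a 500 Da window is on the order of 25,000 times wider, and every one of those extra candidates has to be scored.

```python
def ppm_to_da(mass_da, ppm):
    """Convert a ppm tolerance into an absolute tolerance in Da at a given mass."""
    return mass_da * ppm / 1e6


def window_ratio(mass_da, narrow_ppm, wide_da):
    """How many times wider a fixed Da window is than a ppm window at this mass."""
    return wide_da / ppm_to_da(mass_da, narrow_ppm)
```

Calling `window_ratio(2000.0, 10, 500.0)` gives the comparison above. The ratio gets even bigger for smaller peptides, because a ppm window shrinks with mass while the 500 Da window does not.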

What this study comes back with is almost 50% more peptides.  They bin the mass discrepancies out and it's PTMs.  All over the place, PTMs!  The FDR is possibly a little high but this study just uncovered what a massive fraction of unidentified MS/MS events are.  Holy cow.

I can't wait to try this.  Unfortunately, even with my power PC, I don't know if I have the power to finish a search like this.  Interestingly, I had heard a rumor that OmicsPCs had built something crazy for someone at Harvard...something crazy enough that could maybe run a search with a 500 Da MS1 tolerance?  (Dave? Care to comment? I'd sure like to beta test...)  I digress...

Definitely definitely check this paper out.  It looks like the answer to SO many questions (like maybe why the Gygi lab is so famous?!?)  This is a game changer!

Monday, June 15, 2015

SAPH-ire: Structural analysis of PTM hotspots!

Continuing theme alert!  What do we do with all of this proteomics PTM data?  Maybe we run it through this awesome new program from Henry Dewhurst et al., that they call SAPH-ire, or Structural Analysis of Ptm Hotspots.

What does it do?  Well, it is a way of looking at the post translational modifications you've uncovered in a quantitative way and grouping them to give you an idea of what they may be telling you.  Do I fully get it? Honestly, not yet.  But it seems really really smart and I think I'm on the edge of getting it.  Hopefully another coffee and it'll all fall into place.

I think this kind of thinking is critical, though.  Say I do a Byonic search and I end up with my protein group of interest and in it there were 10 phosphorylations, 16 lysine acetylations and 30-odd other PTMs.  How do I get anything out of that?  If there was a way of clustering to say that "under these conditions you found more of these modifications" that's a big step forward, right?  And what if you could tell "under my conditions (drug treatment or whatever) that it wasn't a change in a certain individual PTM, instead it was that an area of the protein got PTMylated in some way..."  Maybe we're focusing too much on the individual modification rather than the fact that the region is active.

Okay. I'm another coffee in and maybe it's just the caffeine euphoria thing but I think I seriously love this paper even though I'm going to need to keep thinking on it.  Check out this figure I stole from the Supplemental:

These are PTMs searched for in their proteins of interest (G proteins!) plotted against one another.  These are the protein "hotspots" that they focus on.  There are regions of these proteins that are active, but not all in the same way, or in a simple way.  We've got multiple PTMs occurring in combination in these areas and I've never seen anything that could provide this information before.

P.S., they validate the heck out of this thing.  I'm going to revisit this because I'm starting to lean toward thinking it's positively brilliant.  Now, I just need to get my hands on it...

Friday, June 12, 2015

PhoSigNet: Can we make sense of all that phospho data?

You set up this elegant phosphoproteomics project.  You dual step enriched, you used a brilliant LC-MS method with neutral loss triggered ETD and now you have this fantastic list of 163 phosphorylation sites that are up-regulated by your drug.  What the heck do you do with it now?

You can convert it to gene identifiers if you are in mouse or human and give it to Ingenuity Pathways Analysis (IPA) and maybe that'll get you something.  But that is gene-level data and it's probably not going to help.  You can dig through PhosphoSite manually looking at the 163 sites that you found (or the 91 of them that have been annotated).

PhoSigNet is the ambitious endeavor of Menghuan Zhang et al. to help us with this process.  They have taken the data from PhosphoSite and CanProVar and other databases and ended up with 200k phosphorylation sites (almost 12k of which have been validated in one way or another!)

Through an algorithm that they call ExpCluster you can upload lists of your quantified phosphorylation data and it'll try to figure it out.  Unfortunately, I'd have to dig pretty deep into my old hard drives to find a list of data to feed it (and I need to go to work!)

You can check out this awesome looking resource here.

Thursday, June 11, 2015

New proteomics discussion board?

I think the list of things I want to blog about right now is expanding at nearly the same rate as the amyloid plaques in my brain!

This is a quick one I can check off the PostIt note(s!). A postdoc in our field (anonymous?) has proposed a new central proteomics forum.  I know we have a few here and there (ABRF, sharedproteomics, BRIMS) but they are kind of disjointed. This one is hosted by StackExchange. It has a friendly interface and can be directly linked to your Google or LinkedIn or other similar accounts.

StackExchange is big enough as a discussion board in general that it can be supported technically.  There is nothing on the site at the present but it might be another good way to get connected to people with the expertise to help you when you're stuck.

You can check it out here!

Wednesday, June 10, 2015

PIA -- Open source protein inference

Until top-down proteomics really reaches its potential most of us are going to be doing shotgun proteomics.  And it has this nasty drawback.  We can make a peptide-spectral match (PSM) but sometimes that PSM can be linked to multiple proteins.  Figuring out which one it came from can be hard, if not impossible.
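To make the problem concrete, here is a toy sketch. The peptide sequences and protein names are invented, and this is NOT how any real inference tool works, just the bookkeeping step underneath the problem: peptides that map to exactly one protein are easy, and the shared ones are where the inference headache starts.

```python
def split_unique_and_shared(peptide_to_proteins):
    """Split peptides into those that map to one protein and those that map to several.

    `peptide_to_proteins` maps a peptide sequence to an iterable of protein
    accessions it could have come from.
    Returns (unique, shared): unique maps peptide -> its single protein,
    shared maps peptide -> sorted list of candidate proteins.
    """
    unique, shared = {}, {}
    for peptide, proteins in peptide_to_proteins.items():
        candidates = sorted(set(proteins))
        if len(candidates) == 1:
            unique[peptide] = candidates[0]
        else:
            shared[peptide] = candidates
    return unique, shared
```

For example, if a made-up peptide "LSEPAELTDAVK" maps only to "PROT_A" but "VATVSLPR" maps to both "PROT_A" and "PROT_B", the first is evidence for PROT_A alone while the second can't tell the two proteins apart on its own.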

We've seen some good new tools lately such as FidoCT.  PIA is a new one and it's open to everybody and a bunch of search engines.  You can read about it in this new paper in JPR here.

Some of the highlights are that it's got an easy interface and it can take data from just about anything.  I'm toying around with the web interface now and it's got data from MS-GF+ and X!Tandem (and Mascot? I forget now, but I think I saw it, but I don't want to hit the "back" button on it) and some other programs already up there for me to mess with.

The number of metrics it uses to support its inference is a little overwhelming.  This is obviously a powerful piece of software.  The web interface seems to be more of a demonstration of the power it has and the true software is downloadable and fully scalable from single PC usage to use on servers and clusters.

You can check out the PIA website here.

Tuesday, June 9, 2015

Imaging MS of the scariest thing I've ever heard of

I am a huge fan of imaging mass spectrometry.  I've never really done it, with the exception of a training course in Bremen but I'll take any opportunity to visit people who are doing it and I'm a big fanboy of the research, so I know some about it.

In a fantastic use of this technology, Brian Flatley et al. took a look at a cancer biomarker called S100A4 and its distribution through tissue of something terrifying that I had no idea existed until I downloaded this paper.  (P.S. There were no tissues of this type in the huge cancer genomics project I did years of QC analysis on).

Working with histologists who could identify the cell types by microscopic evaluation they could break these tissues into distinct areas and then look at outliers that changed quantitatively in different slide areas.

It really is just a stunning paper visually with some great work clearly lined up. I hope with work at this level that this disease (and all cancers in general, of course!)  will soon be something that none of us ever hear about again.

Monday, June 8, 2015

Just when you thought glycoproteomics couldn't get any more complicated....

..this study happens.

Well, maybe I'm exaggerating. I think we're all getting a good feel for just how crazy complicated and essential our understanding of glycosylation is to biological processes.  But this is a great study to underline this fact as well as to demonstrate (to me, at least) a different way of visualizing glycosylation.

In the study they took mouse livers, enriched for glycosylations at the protein level, digested everything and did dual stage HCD + product ion triggered ETD on an Orbitrap Velos.  The resulting data was searched with Protein Prospector using "an iterative searching strategy" described in this paper I haven't read (yet!).  Essentially it looks like they are doing a deltaM style search with a 2kDa modification window.  I'm assuming the iterative part is a data reduction step.  However, I'm thinking if I sucked it up and gave up the bandwidth I could pull this off with Byonic or ProsightPC.

The output is this crazy awesome histogram with glycosylation masses versus frequency.  What a nice way to summarize this data!

The paper goes further and they use some advanced versions of gene ontology stuff to figure out that the patterns of glycosylation are organelle and tissue specific. 

Sunday, June 7, 2015

High resolution discovery ported to a clinical assay for prostate cancer recurrence!

This study by Claire Tonry et al. is an awesome example of porting a biomarker discovery project directly into a clinical assay.

In this study they followed patients who had treatments which pushed prostate cancer into remission.  They followed the course after this and identified biomarkers from the unfortunate guys whose cancer recurred.  Their comparison gave them 65 sweet new protein biomarkers that they were able to seamlessly convert into a beautiful new clinical assay for MRM.  At 65 biomarkers, they could have used the discovery Q Exactive as the validation instrument (we're not seeing this enough yet!) but I'm certainly not going to complain.

This is a great study that shows how we can use these technologies complementarily (apparently..not a word...) to benefit patients.

Wednesday, June 3, 2015

MCP Rules again!!!!

How excited am I about this?  So excited that I'd make a bad illustration of the MCP logo and celebratory fireworks?  Yup!

Wait. What are you excited about again?  Only about this statement in MCP!!!  It is called "On Credibility, Clarity and Compliance" (I really wanted to throw in another funny C word but I couldn't come up with one.  Google suggested "cacodemonomania" which is the fear that you are possessed but this is far too cool and serious).

I've always had a soft spot in my head for MCP. And not just because now that I'm on the dark side I still get access to most articles they publish.  I have appreciated MCP because of the focus on quality.  I was stunned, annoyed and possibly a little angry when MCP had to drop some of their requirements for publication, particularly the requirement that all RAW data was publicly accessible.  Seriously, though, what do we do with all this data?  We filled the venerable and powerful Tranche servers until they hardly worked. I keep just dropping more hard drives into my home PC to keep up with all the data that just the people I work with in my day job generate.

But you know what?  MCP has the straight up intestinal fortitude to say "you know what? Yes, the data is getting bigger and it's going to be increasingly hard to make public, but in order to make sure the stuff in this journal is AWESOME we are going to make it required to make that data available." Is it harder for the authors?  Yup.  Is it harder for everybody at MCP?  Of course.  Is it the right decision? Hell yes it is.

Serious serious kudos to the editors for this.

Tuesday, June 2, 2015

ExpTimer -- automated alarms for all your benchwork

Yes, I have a list of 90 things to write about from ASMS, but this is really cool and I've got 2 minutes.

ExpTimer allows you to type your whole experimental protocol into your timer and get an automatic alarm when it's time to (for example) add your iodoacetamide or something.

Nifty, right?!?  You can download it here.   And read about it here.

Shoutout to @PastelBio for the legwork!

Monday, June 1, 2015

Matrix assisted ionization for Orbitraps

If you are at ASMS and want to know an awesome booth you should look in on, I suggest the one with the great big "No MALDI" sign.

What they have is matrix assisted ionization that doesn't use a laser and connects right up to any vendor's mass spec.

If you aren't at ASMS you should check this out.  Here is their flier (sorry for the potato quality):