Now I have a protein list...what next?

Gustopheles is also confused! Where's my treat, hooman?

Do you have a big ol' protein list from your awesome proteomics core lab and are a little confused about what to do next? This list can't be comprehensive, but maybe it will help. Please note, I've been doing proteomics for a long time, so the bottom of this page is old stuff. I'll put a warning on it.

Have you checked out the Analyst Suite from Monash University? Great place to start! 


If you have protein IDs in one column and relative quan for all your different samples in other columns, you probably want to start with LFQ-Analyst. It's very intuitive to use. 

Don't like Australians? Honestly, that's weird, but whatever. There are other options, particularly if your data was generated by something called "MaxQuant". 


or AMICA - which - update! can accept FragPipe and MaxQuant data directly - https://bioapps.maxperutzlabs.ac.at/app/amica






EVERYTHING BELOW THIS LINE IS SUPER OLD AS OF DECEMBER 2024 WHEN I'M UPDATING THIS. 

I'm going to treat "plus/minus experiments" (now you see the protein, now you don't) and protein lists with quantification data the same way. Plus/minus is great because your list is small. Big protein lists are tricky because you'll have a bunch of stuff at 1:1.1 that you don't care about. You should use a statistically valid method to decide what fold-change is significant. As a first pass, though, you could start by ditching everything that isn't up- or down-regulated by at least 2-fold.
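If you'd rather do that first-pass cut in a script than in a spreadsheet, here is a minimal sketch. It assumes you already have one treated/control ratio per protein. The function name, the example IDs, and the ratios are all made up for illustration, and this is only the crude 2-fold filter, not a substitute for real statistics.

```python
import math

def filter_by_fold_change(ratios, min_fold=2.0):
    """Keep proteins whose ratio is at least min_fold up or down.

    ratios: dict mapping a protein ID to a treated/control ratio.
    Returns a dict mapping the surviving IDs to their log2 fold change.
    """
    cutoff = math.log2(min_fold)
    kept = {}
    for protein, ratio in ratios.items():
        if ratio <= 0:
            # Zero or negative ratios cannot be log-transformed;
            # plus/minus hits need to be handled separately.
            continue
        log2_fc = math.log2(ratio)
        if abs(log2_fc) >= cutoff:
            kept[protein] = log2_fc
    return kept

# Hypothetical ratios, for illustration only
example = {"P1": 1.1, "P2": 2.5, "P3": 0.4, "P4": 0.9}
print(sorted(filter_by_fold_change(example)))
```

Working in log2 space means a 2-fold increase and a 2-fold decrease sit the same distance from zero, so one cutoff handles both directions.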

Please note: I'm mostly going to be using tools developed for genomics. You might need to convert your protein list into something resembling a gene list first. I did a little tutorial a while back here. More info can be found below as well.

First pass --  Hijack some ontology tools.


Panther is a great place to start.  It is free. It is fast. It is a genomics tool, however, and it will expect your list reduced down to a universal gene identifier. If your data was searched against a UniProt/SwissProt-style FASTA, the gene name is embedded in the protein description (the GN= field) and you can parse it out.
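Here is one way the parsing could look, as a small sketch. It assumes your descriptions follow the UniProt header layout, where the gene name sits in a "GN=" field. The function name is mine and the header is just an example entry. Some entries carry no GN= field at all, so the function returns None for those.

```python
import re

# UniProt-style descriptions carry the gene name as "GN=<name>"
GENE_NAME = re.compile(r"\bGN=(\S+)")

def gene_from_header(header):
    """Return the gene name from a UniProt-style description, or None."""
    match = GENE_NAME.search(header)
    return match.group(1) if match else None

# Example header in the UniProt/SwissProt style
header = (">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
          "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4")
print(gene_from_header(header))
```

Run it over your whole description column, drop the None results and the duplicates, and you have something resembling a gene list.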


If the universal gene identifier isn't there, you can go to DAVID from the NIAID.  DAVID is very flexible in the identifiers it accepts as input. It also has a function that converts your list into other identifier types.

This appears to no longer exist...just scratching it out until I see if it just got lost. 

You could upload the data directly from the protein list into the Thermo Fisher Cloud.  Registration is free and you can upload up to 10GB of data into your free cloud account. The function you'll be looking for is Pathway OverRepresentation. The output looks like a weird side view of a mohawk...
...longer bars mean more significant pathways!


GOrilla is a powerful ontology enrichment decoder tool. No registration, multiple organisms, and it is killer fast. These developers put specific effort into the visualization. It isn't as sophisticated as full-out pathway analysis, but the data coming out of the GOrilla is very logical and you have an easy pull down for your significance cutoffs!
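If you are wondering what these enrichment tools are calculating under the hood, the general idea can be sketched with a plain hypergeometric test. To be clear, this is the generic textbook version, not the exact statistic that GOrilla, DAVID, or Panther report (each has its own variant and its own multiple-testing correction). The function and argument names are made up for illustration.

```python
from math import comb

def enrichment_p_value(background, annotated, selected, overlap):
    """Hypergeometric tail probability for one ontology term.

    background: number of genes in the whole background set
    annotated:  number of background genes carrying the term
    selected:   number of genes in your regulated list
    overlap:    number of regulated genes carrying the term
    Returns P(seeing at least `overlap` annotated genes by chance).
    """
    upper = min(annotated, selected)
    total = comb(background, selected)
    tail = sum(
        comb(annotated, k) * comb(background - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 5 of 10 background genes carry the term,
# and all 5 of the selected genes carry it.
print(enrichment_p_value(background=10, annotated=5, selected=5, overlap=5))
```

A small value means the term shows up in your list more often than chance would predict. With hundreds of terms tested at once, the raw values need correcting for multiple testing, which the tools above do for you.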


GSEA is a free resource provided by the Broad Institute. GSEA has been in use since 2003, is constantly updated and improved and has been cited in hundreds (thousands?) of papers. You need to register to use it, but only because they try to keep track of their users to shape further development of this awesome resource.


Deeper stuff -- PATHWAY ANALYSIS!



I'll be honest -- I really didn't want to start with Ingenuity Pathway Analysis (IPA), but I had to. IPA is the gold standard. Meticulously curated, continually improving, and plain old amazing software. I can't tell you how many labs I've visited in my life that didn't know they had access to this program. Most big schools and research institutions subscribe to IPA. The way I understand it is that they buy a certain number of "seats" which allows a certain number of people to use the software per unit time. If a lot of people register and there aren't many seats, you might have to do your data processing when no one else is working. (Disclaimer: I don't know how the whole seat/license thing works, that is between your institution and IPA)


I don't have extensive experience with BioNSI yet, but I can tell you that I've liked what I've seen so far!  You can download the software here.   It works within Cytoscape, which is killer software, and I see it used more in the literature all the time. Honestly, these pathways are a lot better looking than the ones you get from IPA.



BioCyc? What? That is a metabolomics program, right? It also has gene network analysis!  If you are coming from metabolomics and are familiar with the interface -- you don't have to learn anything at all -- drop that "gene" list in and run with the program you know. At this point, you could argue that BioCyc is clearly stronger in the small molecule realm -- but that doesn't make it bad!


ProteinCenter is a product of Thermo Scientific. Unlike every other tool I've mentioned here, this one is centered on data at the protein level rather than the gene level. It is commercial software that you'll have to pay for -- and it is unlikely your genomics counterparts have some spare keys they aren't using.  It takes a lot of the normal proteomics problems (like redundancy) into account and handles them better than the other programs do, and you can tell with a little use that it is built for the data you are feeding it.


I missed STRING on the first go through!  Aesthetically, this is definitely one of the best downstream analysis programs. Check out this example!


STRING isn't just pretty, it is a fantastic resource with fully downloadable interaction networks, proteins grouped into COGs, and great scoring algorithms that give you a metric for the strength of your observations (or hypothesis).
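If you would rather pull a network with a script than through the web page, STRING also has a REST API. The sketch below only builds the request URL. It assumes the "network" endpoint and the "identifiers" and "species" parameters as I understand them from the STRING API documentation, so please check the current docs before relying on it. The function name and the gene names are just for illustration.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"

def string_network_url(identifiers, species=9606, output_format="tsv"):
    """Build a STRING 'network' request URL for a list of identifiers.

    species is an NCBI taxonomy ID (9606 is human).
    Multiple identifiers are joined with a carriage return,
    which urlencode turns into %0D.
    """
    params = {
        "identifiers": "\r".join(identifiers),
        "species": species,
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"

print(string_network_url(["TP53", "MDM2"]))
```

Paste the printed URL into a browser, or fetch it with your favorite HTTP library, and you should get the interaction table back as tab-separated text.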


This last one could be a page all alone. FunRich, clusterProfiler, SubPathwayMiner, KOBAS, Reactome, and ToPASeq are all nice packages that work within the R environment and can be adapted in one way or another to look at differential protein lists and help you make sense of them.

EDIT: 9/17/17


I just ran into ClustVis and it is remarkable. It is a Shiny App that runs a number of very powerful R scripts on your data for clustering and visualization. Easy and seriously powerful. You can check it out here.



Again, this list isn't meant to be comprehensive -- and I'll probably add more later, but I hope you might find this useful!



6 comments:

  1. This is great.
    I just transitioned to the second year of my PhD in proteomics; the first year was all about getting lists of proteins.
    Not long ago I realised that technical mass spec, however great it sounds, probably won't contribute to more than a quarter of my PhD thesis. Thanks for sharing this, and please share more stuff keeping beginners in mind.
    We have to learn a lot :-)

    1. Glad you found it useful! I've gotten some great suggestions already of things I ought to add.

  2. This is a great list! Another resource I like to use is http://string-db.org/ to see possible protein-protein interaction networks and analysis.
    -Shannon

    1. Thanks, Shannon! This is a great resource I'll add.

  3. Ben, you are a life saver. As an orthopaedic surgeon scientist with a huge list of peptides, an inundated bioinformatician and directionless with the sheer number of options, I'm very grateful to you for this blog. For a translational researcher it is truly a great resource.

  4. Try FunRich. It's a free download. Nice visualizations. http://www.funrich.org/
