I swear, I think this will be the last self-promotional post for a while. I'm just so insanely unbelievably relieved to get this off my desk and out for review. It's been a long 5 or 6 months of work on this project. Some people had seen the first preprint, but that was, in large part, just me letting the world know that I'd started doing work in this brand new field where we're allowed to do research now in the U.S.
Want to check it out? www.CannabisDraftMap.org
Oh and here is the new preprint.
It's been almost impossible in the US to do research on Cannabis plants because they've been illegal. In January or something, the federal government passed "the hemp act" and -- BOOM! People can start growing these plants, with restrictions that differ from state to state, although there is still massive confusion about what is allowed federally versus in each state. A lot of universities now have research greenhouses!
Turns out -- no one has EVER done modern proteomics on these plants! There were some 2D-gel based studies a while back and some MALDI stuff here and there. All of a sudden I'd ended up in a position, mostly by chance, where I was one of the first people who could get access to research material. Yeah! The little research I've done in my life has been mainly incremental biology or technical things. How awesome would it be to do the first comprehensive proteome on an organism people have heard of? A bunch of my friends were on the Johns Hopkins version of the Human Proteome Draft. What they did is way more important. What we did was funnier.
Everything was going really well. It turned out to be super easy to get the proteins out and the peptides digested at high efficiency by just freezing the plant material, smacking it with a hammer and then doing FASP. We did it all match-between-runs style: combine and fractionate the samples to build a library (2 hour runs on the HF-X), then run each individual sample separately with match between runs for quan.
Then it hit us -- NO ONE HAD EVEN COMPLETED THE GENOME OF THE PLANT. Ugh. People had done some sequencing and deposited the data. So we 6 frame translated 3 (surprisingly bad) genomes, combined the data, did proteomics and went to ASMS bound and determined to figure out how to make an annotated FASTA file from genomic data.
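If you haven't run into 6 frame translation before, here is a minimal sketch of the idea in plain Python. This is not the pipeline we used, just an illustration: the function names and the `min_len` cutoff are made up for the example, and it uses the standard genetic code. Each DNA sequence is read in three offsets on the forward strand and three on the reverse complement, and every stretch between stop codons becomes a candidate protein.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; codons with unknown bases become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_orfs(dna, min_len=30):
    """Return (strand, frame, peptide) for every stop-to-stop stretch.

    min_len is an illustrative cutoff (in residues), not a value from the post.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for piece in translate(seq[frame:]).split("*"):
                if len(piece) >= min_len:
                    orfs.append((strand, frame, piece))
    return orfs
```

Calling `six_frame_orfs(contig)` on every contig of a draft genome is how you end up with tens of millions of theoretical sequences, most of which are never expressed.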
What we learned in Atlanta:
1) Everybody wants to know how to easily combine genomics and proteomics.
2) 5 -- maybe 6 people in our field can actually do it
3) It involves using genomics tools and genomics data to filter and QC and align the data.
A) I don't know how to do
C) I don't actually want to know how to do, because -- for real -- next gen genomics data is crappy. You acquire 100x coverage over your genomes because 99% of it is crap. For real, I'm not making that up. There was a great paper to this effect years ago. I thought MacCoss was on it, but I can't find it right now... Imagine an ion trap running at 500 Hz and what that data quality would be like (yes, I made up that number). Sure, there is real data in there, but you could also say anything you wanted by lining up your hypothesis to the noise. That's why bioinformaticians and clusters are so important for next gen. You need power and experience to tell the real stuff from crap (mostly by looking for the most repeats of the evidence, particularly in short read sequencing where you might never have 100% overlap of your repeats).
I have >40M theoretical protein sequences from the next gen stuff.
Only like 400,000 non-redundant ones have matches to my high resolution MS/MS. There's my filter! Throw away 39.6M theoretical sequences!
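The filter itself is conceptually simple. Here is a rough sketch, assuming you already have a list of confidently identified peptide sequences from the search and the theoretical sequences as (header, sequence) pairs. The function name and data layout are hypothetical, and the naive substring check below would need a proper index to cope with 40M sequences, so treat it as the idea rather than the implementation.

```python
def filter_by_peptide_evidence(candidates, peptides):
    """Keep non-redundant theoretical sequences supported by MS/MS evidence.

    candidates: iterable of (header, sequence) pairs, e.g. 6 frame translations.
    peptides:   iterable of peptide strings identified from the MS/MS data.
    """
    peptides = set(peptides)
    seen = set()   # sequences already kept, so duplicates are dropped
    kept = []
    for header, seq in candidates:
        if seq in seen:
            continue
        # Naive check: does any identified peptide occur in this sequence?
        if any(pep in seq for pep in peptides):
            seen.add(seq)
            kept.append((header, seq))
    return kept
```

Everything that never picks up a peptide match simply drops out, which is where the 39.6M discarded sequences come from.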
All the stuff that is in the red circle above is also stuff I didn't know how to do, but learned how!
What do we get?
The first ever annotated protein FASTA for Cannabis plants. BOOM!
Then we could use that FASTA with all the normal tools to finish the project right.
I submit for your amusement my favorite protein bioinformatic flowchart of all time.
What did we learn?
1) Lysine acetylation is highly involved in the production of the chemicals people seem to care about in cannabis plants. Definitely the terpenes, possibly the cannabinoids (there is a noisy spectrum in the supplemental that needs to be verified later).
2) The proteins may be glycosylated everywhere, but we need to work on it more because it looks like the second sugar in every chain is not one of the ones I know about.
3) The flowers of the plant make hundreds of unknown small molecules (there are way way more than 20 variants on the normal cannabinoids. There are hundreds!)
All MS/MS spectra are available at the site in MGF form.
We have created a Skyline spectral library containing all the PSMs.
We've also created a 40GB file containing every spectral annotation.
While we were doing this another group released some Orbi Velos based proteomics on Cannabis flowers (paper link). Since they used only the 300 or so proteins available on UniProt for the plant, they only identified maybe 180 proteins? Using our FASTA we can re-search their data and come up with around 2k-3k (more what you'd expect from their experimental design).
Oh yeah! And we made an online tool that will tell you what chromosome each protein we ID'ed came from (unfortunately the chromosomes come from short-read sequencing, so it's not the most comprehensive, but I've got an idea to fix that). We've also done some fun things like align our new proteins to ones that have Swiss-Model 3D models. Oh, and we did a little proof of concept trying to figure out how to identify fake Cannabis products using rapid metabolomics. People are making all sorts of counterfeit "vape cartridges" here in the US that have made people seriously ill. Maybe metabolomics can help distinguish the really sophisticated counterfeits from the real thing.
The protein FASTA can be downloaded on the site, as well as all our metabolites with our hypotheses for the identities of said metabolites. There are also two sneak peeks into cool new informatic software that Conor has been developing around his 70 hour/week job and final year of classes.
The first is a tool that can pull out spectra that contain diagnostic ions. For example, if you're interested in lysine acetylation, the ProteomeTools project showed that these peptides commonly produce a strong diagnostic ion at m/z 126.0913. Conor's script can just pull out and count all the spectra that have those. The second is a tool kit for easy correlation analysis between metabolites, transcripts and proteins if you have quantitative data on all of them. Both of these are Python tools, and bundling them into a more user-friendly interface is an area of focus.
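To give a feel for what a diagnostic ion scan involves, here is a minimal sketch in plain Python. To be clear, this is not Conor's script: the function names, the 10 ppm tolerance and the 5% relative intensity cutoff are all assumptions picked for illustration. It reads spectra from MGF text and reports the titles of spectra that contain a peak near the target m/z.

```python
def read_mgf(lines):
    """Yield (title, [(mz, intensity), ...]) for each BEGIN IONS/END IONS block."""
    title, peaks, in_block = None, [], False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            title, peaks, in_block = None, [], True
        elif line == "END IONS":
            if in_block:
                yield title, peaks
            in_block = False
        elif in_block and line:
            if line.startswith("TITLE="):
                title = line[len("TITLE="):]
            elif line[0].isdigit():
                # Peak lines are "mz intensity"; headers such as PEPMASS= are skipped.
                parts = line.split()
                if len(parts) >= 2:
                    peaks.append((float(parts[0]), float(parts[1])))


def has_diagnostic_ion(peaks, target_mz=126.0913, tol_ppm=10.0,
                       min_rel_intensity=0.05):
    """True if a peak lies within tol_ppm of target_mz and is intense enough.

    The tolerance and intensity cutoff are illustrative defaults only.
    """
    if not peaks:
        return False
    base_peak = max(intensity for _, intensity in peaks)
    tol = target_mz * tol_ppm / 1e6
    return any(
        abs(mz - target_mz) <= tol and intensity >= min_rel_intensity * base_peak
        for mz, intensity in peaks
    )


def find_diagnostic_spectra(lines, **kwargs):
    """Return the titles of all spectra containing the diagnostic ion."""
    return [
        title for title, peaks in read_mgf(lines)
        if has_diagnostic_ion(peaks, **kwargs)
    ]
```

Passing an open file handle for one of the MGF files as `lines` and taking `len()` of the result gives the count of candidate acetyl-lysine spectra. A different modification only needs a different `target_mz`.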
Final note (warning?) if you create a cool Protein informatics tool and you don't create a cool icon for it, I may have to do it myself.