STOP. IGNORE THE FLOWCHART ABOVE. These are bioinformatics people, they think this stuff is mandatory. I assume their conferences all have contests where the winner makes the flowchart most likely to make someone in another field throw up.
Again - don't look at it - 'cause this is legitimately important.
You know how the genomics people have been doing things for years with illustrious-sounding titles like "The 1,000 Human Genome Project"? Particularly when a lot of those things kicked off and the technology was more expensive, these things absorbed HUGE amounts of research dollars. The goals were to understand how human genomes vary across us - as a species.
And they did these things and they kept the results 100% secret from everyone forever.
I guess that's not true, but - to me, as a human proteomics researcher - they have been less than useless. Yay, you did a bunch of stuff. Who does that help? Not me or anyone I know. Even researchers I know who focus on health disparities can't get usable data out of these things.
UNTIL NOW.
What these awesome, though flowchart-loving, people did was dig into these top secret genomic databases and they assessed -
-you won't believe it -
Protein level changes across human populations! This is where it gets important.
How many peptide level variants could there possibly be in 1,000 genomes? 12? 15?
Try 54,679! Don't believe me? Here is a completely not illegally taken screenshot. Don't sue me!
Almost FIFTY-FIVE THOUSAND PEPTIDE VARIANTS?!?
How many are you looking for in your data? One? Yeah, me too. I mean, unless we're doing deep cancer genomics and then we search for 2 million. Why not normal variants?!?
Okay - are you thinking - "big deal, I probably need to spend the next 10 days downloading kludgey Python scripts written by proteomics people and finding out that my Docker thing is from 2017? How on earth does this help me?"
And this is where this is super legit.
Go here. https://zenodo.org/records/12671302
Download this -
Use 7-zip or something to unzip it twice. (I don't know, it's right there with the flowchart competition: bioinformatics people have contests to see who can zip things the most times. Bonus - as in here - instead of naming each zip .zip you can name them weird things.) The first thing you unzip is a .gz, which gives you a .tar, and you unzip that too - and you'll get the whole reason I've written this entire thing -
You get a FASTA FILE that represents common peptide level variants that appear in human beings across our population!
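If you'd rather skip the double unzip, Python's standard library can peel both layers (.gz and .tar) in one go. This is just a minimal sketch: the archive name in the usage comment is hypothetical, so swap in whatever the Zenodo download is actually called on your machine.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, out_dir):
    """Unpack a gzip-compressed tar archive and return the extracted file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "r:gz" handles the gzip layer and the tar layer together.
    with tarfile.open(archive_path, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.isfile()]
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: use the safer "data" extraction filter.
            tar.extractall(out_dir, members=members, filter="data")
        else:
            tar.extractall(out_dir, members=members)
    return [out_dir / m.name for m in members]


# Usage (hypothetical file name, not taken from the Zenodo record):
# for path in extract_archive("variant_peptides.tar.gz", "unpacked"):
#     print(path)
```

Only unpack archives from sources you trust, since tar files can contain odd paths.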
Yeah, it's pretty big: 104 MB and 157k entries. But you're now covering a much larger share of normal human genetic variation!
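Want to sanity check that your download matches those numbers before you point a search engine at it? Here is a quick sketch that counts header lines (the ones starting with ">") and reports the file size. It assumes a plain-text, uncompressed FASTA, and the path in the usage comment is hypothetical.

```python
from pathlib import Path


def fasta_summary(fasta_path):
    """Return (entry count, total residues, file size in MB) for a FASTA file."""
    entries = 0
    residues = 0
    with open(fasta_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                # Each header line marks one sequence entry.
                entries += 1
            else:
                residues += len(line.strip())
    size_mb = Path(fasta_path).stat().st_size / 1_000_000
    return entries, residues, size_mb


# Usage (hypothetical path):
# n, aa, mb = fasta_summary("unpacked/variants.fasta")
# print(f"{n} entries, {aa} residues, {mb:.0f} MB")
```

If the entry count is wildly off from 157k, you probably grabbed one of the other files from the record.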
100% check out the paper. They did other smart stuff and there are other files (possibly superior ones, depending on your application).
If you're using FragPipe (you should be!) check out this advice from Alexey!
And check out this additional resource from his team here!