To facilitate genome-based representation and analysis of proteomics data, we developed a new bioinformatics framework, which also includes two R packages, for proteomics and proteogenomics research.

Mass spectrometry-based shotgun proteomics technology has undergone rapid development during the past decade. Recent studies have shown deep proteome coverage, with the identification of more than 10,000 proteins (1–5). Moreover, large-scale integrative proteogenomic studies have started to harness the complementary advantages of the proteomics and genomics technologies (6–8). To facilitate the exchange and sharing of the rapidly growing body of proteomics data, the Human Proteome Organization Proteomics Standards Initiative has defined community standards for data representation, including standard data formats for reporting peptide and protein identification results (9). However, although peptide and protein identification relies primarily on protein databases derived from the reference genome sequence, the genomic locations of identified peptides are not reported by popular mass spectrometry data analysis software. This limits genome-based interpretation and analysis of proteomics data and hinders effective proteogenomic data integration.

First, without knowing the genomic locations of the identified peptides, some important questions are left unanswered. For example, peptides that map to multiple proteins introduce ambiguity in protein inference. Those mapping to the same genomic locus can benefit from gene-level instead of protein-level inference; however, it is unclear how many and which peptides map to multiple proteins derived from the same genomic locus. As another example, exon-exon junction peptides are important for understanding alternative splicing and protein isoform complexity, but it is difficult to determine how many and which peptides span more than one exon with existing data formats.
Furthermore, although a major goal in proteomics is to achieve comprehensive coverage of the coding genome, calculating the sequence coverage ratio for the whole coding genome is cumbersome with existing data formats.

Second, with proteins serving as the data organization unit in a data analysis report, it is difficult to perform data integration across multiple proteomics studies. Studies may use different reference protein databases with inconsistent protein annotations for database searching; therefore, data integration usually requires re-searching the raw data against a common reference database. In addition, although gene-centric reports are required by many downstream pathway and network analysis tools, additional effort is required to derive them from protein-centric reports. Moreover, it remains difficult to communicate proteomics data to the genomics community. The difficulty of integrating a protein-centric report with data generated from genomics or transcriptomics analyses is a barrier to proteogenomic analysis. As proteogenomics is rapidly becoming an attractive and important research field (10–13), it is critical to have a new data format and supporting tools that enable seamless integration across proteomics, genomics, and transcriptomics data.

Recently, several software tools have been published to facilitate the visualization of peptides in genome browsers, including iPiG (14), CAPER (15), and PG Nexus (16), among others (17–19). These tools address a critical need for genome browser-based visualization of proteomics data. However, a genome-based representation of proteomics data introduces novel data analysis and interpretation opportunities that go beyond visualization, and these opportunities have barely been explored.
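
The genome-based questions raised above (which peptides span an exon-exon junction, and how much of the coding genome is covered) reduce to a coordinate mapping from protein positions to genomic positions. The following Python fragment is only a minimal sketch of that idea, not the method of any tool cited here. It assumes a forward-strand transcript whose coding exons are given as 1-based inclusive genomic intervals; the function names `peptide_to_genome` and `coding_coverage` and the example coordinates are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

# A coding exon segment: 1-based inclusive genomic (start, end).
# Assumption: forward strand, exons listed in transcript order.
Interval = Tuple[int, int]


def peptide_to_genome(cds_exons: List[Interval],
                      pep_start_aa: int,
                      pep_len_aa: int) -> List[Interval]:
    """Map a peptide (1-based start in the protein, length in residues)
    to the genomic blocks that encode it."""
    # Peptide span in coding-sequence coordinates (0-based, half-open).
    start = (pep_start_aa - 1) * 3
    end = start + pep_len_aa * 3

    blocks: List[Interval] = []
    offset = 0  # coding bases consumed by the exons seen so far
    for ex_start, ex_end in cds_exons:
        ex_len = ex_end - ex_start + 1
        lo = max(start, offset)
        hi = min(end, offset + ex_len)
        if lo < hi:  # the peptide overlaps this exon
            blocks.append((ex_start + lo - offset,
                           ex_start + hi - offset - 1))
        offset += ex_len
    return blocks


def coding_coverage(cds_exons: List[Interval],
                    peptide_blocks: List[List[Interval]]) -> float:
    """Fraction of coding bases covered by at least one peptide."""
    covered = set()
    for blocks in peptide_blocks:
        for b_start, b_end in blocks:
            covered.update(range(b_start, b_end + 1))
    total = sum(e - s + 1 for s, e in cds_exons)
    return len(covered) / total


# Hypothetical transcript: two coding exons of 30 and 60 bases.
exons = [(101, 130), (201, 260)]

junction_pep = peptide_to_genome(exons, pep_start_aa=8, pep_len_aa=5)
single_exon_pep = peptide_to_genome(exons, pep_start_aa=1, pep_len_aa=5)

print(junction_pep)           # [(122, 130), (201, 206)]
print(len(junction_pep) > 1)  # True: the peptide spans an exon-exon junction
print(single_exon_pep)        # [(101, 115)]
print(coding_coverage(exons, [junction_pep, single_exon_pep]))
```

A peptide that maps to more than one genomic block spans a junction, and peptides from different protein entries that map to identical blocks can be grouped at the gene level. Reverse-strand transcripts, untranslated regions, and reading-frame offsets are deliberately left out of this sketch.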
In a recent study, the sequence alignment/map (SAM) format developed in the next-generation sequencing field was used in the tool PG Nexus (16) to allow covisualization of proteomic data with genomes and transcriptomes. However, although a primary goal of the SAM format is to provide a well-defined interface between sequence alignment and downstream analyses (20), this important feature has not been exploited in PG Nexus. Moreover, there has been no attempt to incorporate proteomics-specific information into the SAM format. To provide an integrated solution to facilitate genome-based representation and analysis of proteomics data, we developed a new framework built around the protein BAM format, which builds upon the success of the SAM format and its compressed binary version, BAM.