The concept of using the immune system to battle cancer is attractive and has been studied extensively. Somatic mutations can give rise to neoantigens, against which patients can be vaccinated1,2,3,4,5,6,7,8,9,10,11. Recent evidence suggests that frame shift mutations, which result in peptides that are completely new to the body, can be highly immunogenic12,13,14,15. The immune response to neoantigen vaccination, including the possible predictive value of epitope selection, has been studied in great detail8,13,16,17,18,19,20,21, and there is no doubt about the promise of neoantigen-directed immunotherapy. The quest for common antigens, however, has been disappointing, since virtually all mutations are private. One can derive algorithms that predict likely good epitopes, but every case is still different. Here we report that frame shift mutations, although also mostly unique among patients and tumors, nevertheless converge on neo open reading frame peptides (NOPs) in their translation products, resulting in common neoantigens in large groups of cancer patients.

We have analyzed 10,186 cancer genomes from 33 tumor types of the TCGA (The Cancer Genome Atlas22) and focused on the 143,444 frame shift mutations represented in this cohort (see Table S1). Translation of these mutations after re-annotation to a RefSeq annotation, starting in the protein reading frame, can lead to 70,439 unique peptides that are 10 or more amino acids in length (a cut-off we have set at a size sufficient to shape a distinct epitope in the context of MHC; Fig. 1a). The list of genes most commonly represented in the cohort and containing such frame shift mutations is headed nearly exclusively by tumor driver genes, such as NF1, RB and BRCA2 (Fig. 1b and Table S2), whose whole or partial loss of function apparently contributes to tumorigenesis. Note that, a priori, a frame shift mutation is expected to result in loss of gene function more often than a random SNV, and more independently of its precise position. In conclusion, NOPs initiated from a frame shift mutation and of a significant size are prevalent in tumors, and are enriched in cancer driver genes.

Figure 1


Frame shift initiated translation in the TCGA (n = 10,186) cohort is of sufficient size for immune presentation. (a) Peptide length distribution of frame shift mutation initiated translation up to the first encountered stop codon. Dark shades are unique FS mutations, light shade indicates the total sum (unique FS multiplied by number of patients containing that FS). (b) Gene distribution of peptides with length 10 or longer and encountered in up to 10 patients.

Alignment of the translated NOP products onto the protein sequence reveals that a wide array of different frame shift mutations translate into a common downstream stretch of neo open reading frame peptides (‘NOPs’), as dictated by the −1 and +1 alternative reading frames. While we initially screened for NOPs of ten or more amino acids, their open reading frame in the out-of-frame genome often extends far beyond that search window. As a result, we see (Fig. 2) that hundreds of different frame shift mutations, all at different sites in the gene, nevertheless converge on only a handful of NOPs. Similar patterns are found in other common driver genes (Supplementary Fig. S1).
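The convergence described above can be illustrated with a minimal sketch. It assumes a made-up toy coding sequence (`TOY_CDS`) and hypothetical helper names (`translate`, `nop_after_deletion`); none of these come from the analysis pipeline used for the TCGA data, and only single-base deletions are modeled.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Hypothetical 16-codon coding sequence, invented for this illustration only.
TOY_CDS = "ATGGCAGGAACTGAAGGTCCCAGATTGGCGCAACGGTACCTAAGCTGA"


def translate(seq):
    """Translate seq codon by codon until the first stop codon or the end."""
    peptide = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def nop_after_deletion(cds, pos, n_deleted=1):
    """Peptide read from the first affected codon after deleting bases at 0-based pos."""
    mutated = cds[:pos] + cds[pos + n_deleted:]
    first_codon = pos // 3  # first codon whose sequence can change
    return translate(mutated[first_codon * 3:])


# Two different single-base deletions end up in the same alternative frame.
nop_early = nop_after_deletion(TOY_CDS, 3)  # "QELKVPDWRNGT"
nop_late = nop_after_deletion(TOY_CDS, 9)   # "LKVPDWRNGT"
```

In this toy case the peptide from the later deletion is a suffix of the peptide from the earlier one, which is the pattern Fig. 2 shows at scale: the mutation site differs, but the downstream out-of-frame peptide is shared.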

Figure 2


Neo open reading frame peptides (TCGA cohort) converge on common peptide sequences. Graphical representation in an isoform of TP53, where amino acids are colored distinctly. (a) Somatic single nucleotide variants. (b) Positions of frame shift mutations on the −1 and the +1 frame. (c) Amino acid sequence of TP53. (d) Peptide (10 aa) library (n = 1,000) selection. Peptides belonging to the −1 or +1 frame are separated vertically. (e,f) pNOPs for the different frames, followed by all encountered frame shift mutations (rows), translated to a stop codon (lines) and colored by amino acid.

Figure 2 illustrates that the precise location of a frame shift does not seem to matter much. The more or less straight slope of the series of mutations found in these 10,186 tumors indicates that the position of the frame shift is not relevant for the biological effect (presumably reduction or loss of gene function), as long as translation stalls in the gene before the downstream remainder of the protein is expressed.

As can also be seen in Fig. 2, all frame shift mutations alter the reading frame to one of the two alternative frames. Therefore, for potential immunogenicity the relevant information is the sequence of the alternative ORFs and, more precisely, the encoded peptide sequence between two stop codons. We term these peptides ‘proto Neo Open Reading Frame peptides’ or pNOPs, and generated a full list of all thus defined out-of-frame protein-encoding regions in the human genome of 10 amino acids or longer. We refer to the total sum of all Neo-ORFs as the Neo-ORFeome. The Neo-ORFeome contains all the peptide potential that the human genome can generate after simple frame shift mutations. The size of the Neo-ORFeome is 25.6 Mb.
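The definition of pNOPs lends itself to a short sketch: read a coding sequence in its two alternative frames, cut the translation at every stop codon, and keep the stretches of 10 amino acids or longer. The code below is an illustration under stated assumptions, not the procedure used to build the Neo-ORFeome. `TOY_CDS`, `translate_frame` and `proto_nops` are hypothetical names, and, unlike the strict "between two stop codons" definition, this simplified version also keeps stretches that run into the start or end of the toy sequence.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Hypothetical 16-codon coding sequence, invented for this illustration only.
TOY_CDS = "ATGGCAGGAACTGAAGGTCCCAGATTGGCGCAACGGTACCTAAGCTGA"


def translate_frame(seq, offset):
    """Translate seq starting at offset, writing '*' for every stop codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3))


def proto_nops(cds, min_len=10):
    """Return {frame offset: peptides} for the two out-of-frame readings of cds."""
    result = {}
    for offset in (1, 2):  # the two alternative reading frames
        segments = translate_frame(cds, offset).split("*")
        # Assumption: stretches touching either end of the sequence are kept too.
        result[offset] = [s for s in segments if len(s) >= min_len]
    return result


toy_pnops = proto_nops(TOY_CDS)
# {1: ["WQELKVPDWRNGT"], 2: ["RSQIGATVPKL"]}
```

Because the list depends only on the reference sequence and not on any observed mutation, it can be computed once for every annotated transcript in advance of seeing a tumor.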

To investigate whether nonsense-mediated decay would eliminate frame shift mRNAs, we turned to a public repository containing read coverage for a large collection of cell lines (CCLE). We processed the data in a similar fashion as for the TCGA, identified the locations of frame shifts and subsequently found that, in line with the previous literature23,24,25, at least a large proportion of expressed genes also contained the frame shift mutation within the expressed mRNAs (Supplementary Fig. S2). Thus, at the mRNA level, NOPs can be detected in RNAseq data.

We next investigated how the number of patients relates to the number of NOPs. We sorted 10-mer peptides from NOPs by the number of new patients that contain the queried peptide. Assessed per tumor type, frame shift mutations in genes with very low to absent mRNA expression were removed to avoid overestimation. Of note, NOP sequences are sometimes also encountered in the normal ORFeome, presumably as a result of naturally occurring isoforms (e.g. Supplementary Fig. S3). These peptides were also excluded.

Taking into account the rules described above, and demanding that the addition of a peptide adds at least 1 new patient, we can create a library of possible ‘vaccines’ that is optimally geared towards covering the TCGA cohort, a cohort large enough to be representative of future patients, as the data presented here also suggest (Table S3). Using this strategy, 30% of all patients can be covered with a fixed collection of only 1,244 peptides of length 10 (Fig. 3). Since tumors will regularly have more than 1 frame shift mutation, one can use a ‘cocktail’ of different NOPs to optimally attack a tumor. Indeed, given a library of 1,244 peptides, 27% of the covered TCGA patients contain 2 or more ‘vaccine’ candidates. We ran the pNOPs through the NetMHC 4.0 algorithm to predict MHC class I binding for HLA-A0201 (Table S3). 56% of the library pNOPs (with frame shifts covering 2 or more TCGA patients) contain sequences that are considered to have weak to strong binding properties. This is highly enriched compared to all pNOPs for which no or one TCGA frame shift mutation has been encountered (33%; Fisher’s exact test, p < 2.2 × 10^-16). In conclusion, a limited pool of vaccines selected for optimal patient inclusion covers a large proportion of patients.
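The selection rule described above can be sketched as a greedy procedure. This is one possible reading of the text, in which the peptide adding the most not-yet-covered patients is picked at each step and selection stops once no peptide adds a new patient. The function name `build_library`, the alphabetical tie-break and the toy peptide-to-patient mapping are assumptions made for illustration; the data are invented and not taken from the TCGA.

```python
def build_library(peptide_to_patients, max_size=None):
    """Greedily pick peptides, each time taking the one that adds the most new patients."""
    covered = set()
    library = []
    remaining = {pep: set(pts) for pep, pts in peptide_to_patients.items()}
    while remaining and (max_size is None or len(library) < max_size):
        # Largest gain first; ties are broken alphabetically (an assumption)
        # so that the result is deterministic.
        best = min(remaining, key=lambda pep: (-len(remaining[pep] - covered), pep))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining peptide adds a new patient
        covered |= gain
        library.append(best)
    return library, covered


# Invented toy data: 10-mer peptide -> patients whose NOPs contain it.
toy_hits = {
    "AAAAAAAAAA": {"p1", "p2", "p3"},
    "CCCCCCCCCC": {"p2", "p3"},
    "DDDDDDDDDD": {"p3", "p4"},
    "EEEEEEEEEE": {"p5"},
}
toy_library, toy_covered = build_library(toy_hits)
# toy_library == ["AAAAAAAAAA", "DDDDDDDDDD", "EEEEEEEEEE"]

# Patients who could receive a 'cocktail' of two or more library peptides.
toy_multi = {p for p in toy_covered
             if sum(p in toy_hits[pep] for pep in toy_library) >= 2}
```

In the toy example the second peptide is never selected because all of its patients are already covered by the first, which is why the library stays small relative to the number of distinct peptides.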

Figure 3


A recurrent peptide selection procedure can generate a ‘fixed’ library to cover up to 50% of the TCGA cohort. The graph depicts the number of unique patients accommodated by a growing library of 10-mer peptides, picked in descending order of the number of patients with that sequence in their NOPs. A peptide is only added if it adds a new patient from the cohort. Light shade depicts the number of patients containing the peptide that was included.

Strikingly, using only 6 genes (TP53, ARID1A, KMT2D, GATA3, APC, PTEN), already 10% of the complete TCGA cohort is covered (Supplementary Table S4). Separating this by tumor type, we find that for some cancers, such as Pheochromocytoma and Paraganglioma (PCPG) or Thyroid carcinoma (THCA), the hit rate is low, while for others up to 39% can be covered even with only 10 genes (Colon adenocarcinoma (COAD) using 60 peptides, Uterine Corpus Endometrial Carcinoma (UCEC) using 90 peptides; Fig. 4 and Table S4). At saturation (using all peptides encountered more than once) 50% of TCGA is covered, and more than 70% can be achieved for specific cancer types (COAD, UCEC and Lung squamous cell carcinoma (LUSC): 72%, 73% and 73%, respectively). As could be expected, these roughly follow the mutational load in the respective cancer types (Table S1). In addition, some frame shifted genes are highly enriched in specific tumor types (e.g. VHL, GATA3; Supplementary Fig. S4). We conclude that at saturating peptide coverage, using only a very limited set of genes, a large cohort of patients can be provided with off-the-shelf vaccines.

Figure 4


For some cancers up to 70% of patients contain a recurrent NOP. Ratio of patients in the TCGA cohort, separated by tumor type, that could be ‘helped’ using optimally selected peptides for the genes encountered most often within a cancer. Coloring represents the ratio using 1, 2, ..., 10 genes, or using all encountered genes (lightest shade).

To validate the presence of NOPs, we used the targeted sequencing data on 10,129 patients from the MSK-IMPACT cohort26. For the 341–410 genes assessed in this cohort, we obtained strikingly similar results in terms of genes frequently affected by frame shifts and the NOPs that they create (Supplementary Fig. S5). Even within this limited set of genes, 86% of the library peptides (in genes targeted by MSK-IMPACT) were encountered in the patient set.

Since some cancers, like glioblastoma or pancreatic cancer, show survival expectancies after diagnosis measured in months rather than years (e.g. see27), it is of crucial importance to move as much of the workload and timeline as possible to the moment before diagnosis. Since whole exome sequencing after biopsy currently technically takes days, the scan of a resulting sequence against a public database describing these NOPs takes seconds, and the shipment of a peptide of choice takes days, a vaccination can theoretically be done within days and practically within a few weeks after biopsy. This makes it attractive to generate a stored and quality-controlled peptide vaccine library based on the data presented here, possibly with replicates stored at several locations in the world. Synthesis in advance will, by economies of scale, reduce costs, allow for proper regulatory oversight and quality certification, and save the patient time, thus improving their chances. Since the vaccine repertoire is so limited, there is potential for a steep ‘learning curve’ to recognize the best vaccines. This approach will not replace other therapies, but will be an additional option in the treatment repertoire. These advantages of scale also apply to other means of vaccination against these common neoantigens, by RNA- or DNA-based approaches (e.g.28), or recombinant bacteria (e.g.29).
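The database scan mentioned above amounts to a lookup of every 10-mer window of a patient's NOP in the pre-synthesized library. A minimal sketch follows, assuming the library is available as a set of peptide strings; the function name `matching_library_peptides` and the example sequences are hypothetical and do not describe an existing public database or its interface.

```python
def matching_library_peptides(nop, library, k=10):
    """Return library k-mers found in a patient's NOP, in order of first occurrence."""
    hits = []
    for i in range(len(nop) - k + 1):
        kmer = nop[i:i + k]
        if kmer in library and kmer not in hits:
            hits.append(kmer)
    return hits


# Invented example: a two-peptide library and one patient-derived NOP.
toy_library = {"QELKVPDWRN", "LKVPDWRNGT"}
toy_patient_nop = "QELKVPDWRNGT"
toy_matches = matching_library_peptides(toy_patient_nop, toy_library)
# toy_matches == ["QELKVPDWRN", "LKVPDWRNGT"]
```

Each window is checked against a hash set, so the cost grows linearly with the length of the patient's NOPs and is independent of the library size, consistent with a scan that takes seconds.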

It has not escaped our notice that the recognition of common antigens (resulting from different mutations) also suggests an approach for neoantigen application of the powerful CAR-T therapy (for a recent review see30, and references therein), where the T-cells are directed not against cell-type specific antigens (such as CD19 or CD20), but against a tumor specific neoantigen. For example, once one functional T-cell against any of the common p53 NOPs (Fig. 2) is identified, the recognition domains can be engineered into T-cells for any future patient with such a NOP, and the constructs could similarly be deposited in an off-the-shelf library.

In this work, we have shown that diverse frame shift mutations can give rise to common neo open reading frame peptides, suitable as pre-synthesized vaccines. This discovery, potentially combined with checkpoint inhibition, holds promise for future treatment strategies to help instruct our own immune system to defeat cancer.
