Cell lines were purchased from ATCC and were not formally authenticated, but confirmation of expected gene expression patterns were performed for RNA-seq and eCLIP experiments. Cell lines were routinely tested for mycoplasma contamination (MycoAlert, Lonza).
RNA-binding protein annotations and domains
RBPs were chosen from a previously described list of 1,072 known RBPs, proteins containing RNA-binding domains, and proteins characterized as being associated with polyadenylated RNA, based on the availability of high-quality antibodies9. Annotation of RBP function was performed by integration of published literature, with manual inspection of references for less well-established annotations. Annotation of RNA-binding domain presence was determined by UniProt Domain Descriptions, and a database of cell-essential genes was obtained from published high-throughput CRISPR screening efforts37.
Antibodies for eCLIP were pre-screened using a set of defined metrics9. A ‘biosample’ of HepG2 or K562 cells was defined as a batch of cells starting from a single unfrozen stock, passaged for less than 30 days under standard ENCODE reference conditions, and validated for high viability and non-confluence at the time of crosslinking. All cells within a biosample were pooled and UV crosslinked on ice at 400 mJoules/cm2 with 254 nm radiation. The biosample was then split into 20-million-cell aliquots for eCLIP experiments.
eCLIP experiments were performed as previously described in a detailed standard operating procedure10, which is provided as associated documentation with each eCLIP experiment on the ENCODE portal (https://www.encodeproject.org/documents/fa2a3246-6039-46ba-b960-17fe06e7876a/@@download/attachment/CLIP_SOP_v1.0.pdf). In brief, 20 million crosslinked cells were lysed and sonicated, followed by treatment with RNase I (Thermo Fisher) to fragment RNA. Antibodies were pre-coupled to species-specific (anti-rabbit IgG or anti-mouse IgG) Dynabeads (Thermo Fisher), added to lysate, and incubated overnight at 4 °C. Prior to IP washes, 2% of sample was removed to serve as the paired input sample. For IP samples, high- and low-salt washes were performed, after which RNA was dephosphorylated with FastAP (Thermo Fisher) and T4 PNK (NEB) at low pH, and a 3′ RNA adaptor was ligated with T4 RNA ligase (NEB). Ten per cent of IP and input samples were run on an analytical PAGE Bis-Tris protein gel, transferred to PVDF membrane, blocked in 5% dry milk in TBST, incubated with the same primary antibody used for IP (typically at 1:4,000 dilution), washed, incubated with secondary HRP-conjugated species-specific TrueBlot antibody (Rockland), and visualized with standard enhanced chemiluminescence imaging to validate successful IP. Ninety per cent of IP and input samples were run on an analytical PAGE Bis-Tris protein gel and transferred to nitrocellulose membranes, after which the region from the protein size to 75 kDa above protein size was excised from the membrane, treated with proteinase K (NEB) to release RNA, and concentrated by column purification (Zymo). Input samples were then dephosphorylated with FastAP (Thermo Fisher) and T4 PNK (NEB) at low pH, and a 3′ RNA adaptor was ligated with T4 RNA ligase (NEB) to synchronize with IP samples. Reverse transcription was then performed with AffinityScript (Agilent), followed by ExoSAP-IT (Affymetrix) treatment to remove unincorporated primer. RNA was then degraded by alkaline hydrolysis, and a 3′ DNA adaptor was ligated with T4 RNA ligase (NEB). qPCR was then used to determine the required amplification, followed by PCR with Q5 (NEB) and gel electrophoresis to size-select the final library. Libraries were sequenced on the HiSeq 2000, 2500, or 4000 platform (Illumina). Each ENCODE eCLIP experiment consisted of IP from two independent biosamples, along with one paired size-matched input (sampled from one of the two IP lysates before IP washes).
Experimental quality control
eCLIP experiments for the ENCODE project were performed using two biological replicates, paired with a size-matched input control subsampled from one of the two replicate samples (Extended Data Fig. 1a). Prior to sequencing, two metrics were used for assessing the quality of eCLIP experiments: successful IP of the desired RBP, and successful library generation and sequencing.
Successful IP of the targeted RBP was assayed by IP-western blot analysis. This prerequisite first requires the identification of a RBP-specific IP-grade antibody, which was previously addressed by screening over 700 antibodies to identify 438 ‘IP-grade’ antibodies against 365 RBPs in K562 cells9. Using these and other RBP antibodies validated by the RNA community, 488 eCLIP experiments were performed in K562 and HepG2 cell lines, yielding successful IP during the eCLIP procedure for 400 (82%). Fifty-one out of 270 (19%) and 37 out of 218 (17%) experiments gave failed IP-western blot results in K562 or HepG2 cells, respectively, indicating either potential sensitivity to enzymatic steps and additional buffer exchanges performed during the eCLIP procedure, or a lack of expression in HepG2 cells (Extended Data Fig. 1b, c). IP-western blot images are provided for each ENCODE eCLIP experiment as part of the antibody metadata available at https://www.encodeproject.org.
Failure to obtain high-quality amplified libraries from both replicates can indicate a failed experiment, lack of RNA binding, or lack of RBP–RNA crosslinking. First, 15 experiments (4%) that generated adaptor-only sequencing libraries in either replicate were abandoned. Next, an extrapolated PCR cycles required (eCT) metric was used to quantify library yield10. The previous eCT metric using twofold amplification per PCR cycle was modified to an accurate eCT (a-eCT) using 1.84-fold amplification per cycle on the basis of analysis of the eCLIP data resource (Supplementary Text, Supplementary Fig. 9). Thirty-six experiments that showed lower a-eCT than the average of IgG control experiments and showed no significant binding in low-depth sequencing were abandoned, leaving 349 data sets for analysis (Extended Data Fig. 1c).
Data processing and peak identification
Processing of raw eCLIP sequencing data is complex, as adaptor sequences, double-adaptor ligation products, retrotransposable elements and other multi-copy sequences, PCR duplicates, and underlying differences in RNA abundances all contribute to false negatives and false positives at both the read mapping and peak identification stages. To address these issues, a rigorous standard eCLIP processing and analysis pipeline was developed and previously published10 and is provided (including description of steps as well as commands run) as a ‘Pipeline protocol’ attached to each eCLIP data set available on the ENCODE website at https://www.encodeproject.org/documents/3b1b2762-269a-4978-902e-0e1f91615782/@@download/attachment/eCLIP_analysisSOP_v2.0.pdf (Supplementary Fig. 10a). See Supplementary Text for additional details.
To identify reproducible and significantly enriched peaks across biological replicates, a modified IDR method was used (Supplementary Text, Supplementary Fig. 10). Unless otherwise noted, the final set of reproducible and significant peaks was identified by requiring that the replicate-merged peak meet an IDR cutoff of 0.01 as well as P ≤ 0.001 and fold enrichment ≥8 (using the geometric mean of log2(fold enrichment) and –log10(P) between the two biological replicates). Finally, 57 ‘blacklist’ regions were identified that were common artefacts across multiple data sets and lacked normal peak shapes (manual inspection indicated these often contain either adaptor sequences or tRNA fragments; Supplementary Data 11). IDR peaks that overlapped these blacklist regions were removed to yield the final set of reproducible peaks used in all analyses in this manuscript (unless otherwise indicated) (Supplementary Data 4).
Annotation of peaks was based on overlap with GENCODE v19 transcripts. If a peak overlapped multiple annotation types within a single annotated gene (across one or several isoform annotations), the peak annotation was chosen in the following priority order: tRNA, miRNA, miRNA-proximal (within 500 nt), CDS, 3′UTR, 5′UTR, 5′ splice site (within 100 nt of exon), 3′ splice site (within 100 nt of exon), proximal intron (within 500 nt of splice site region), distal intron (further than 500 nt from the splice site region), followed by noncoding exonic. If the peak overlapped multiple gene annotations, the final annotation was chosen as follows: tRNA, miRNA, CDS, 3′UTR, 5′UTR, miRNA-proximal, noncoding exonic, 5′ splice site, 3′ splice site, proximal intron, distal intron. To identify RBP clusters, the fraction of peaks annotated to each class out of the total number of peaks was calculated, and hierarchical clustering was performed in MATLAB (2018a) using correlation distance and average linkage. Clusters were obtained by cutting the tree at six clusters (chosen by comparing the sum of squared error between each data set and the mean of all data sets within the cluster containing that data set, which showed a leveling off after six clusters; Extended Data Fig. 2a).
Quantification of eCLIP signal at multi-copy and other repetitive elements
A separate pipeline was developed to quantify enrichment for retrotransposable and other multi-copy elements. A database of multicopy elements was generated, including 5,606 transcripts obtained from GENCODE v19 covering 34 abundant non-coding RNAs including rRNA, snRNA, and vault RNAs as well as their pseudogenes, 606 tRNA transcripts obtained from GtRNAdb (including versions with both genome flanking sequences and including the canonical CCA tail)38, 705 human repetitive elements obtained from the RepBase database (v. 18.05)39, 501 60mer sequences containing simple repeats of all 1 to 6-nt kmers, and the rRNA precursor transcript NR_046235.1 obtained from GenBank. Each transcript was assigned to one of 185 families of multi-copy elements (for example, RNA18S, Alu, antisense Alu, simple repeat, and so on). Within each family, transcripts were given a priority value, with primary transcripts prioritized over pseudogenes.
Post-adaptor trimming paired-end sequencing reads were mapped to this repetitive element database using bowtie2 (v. 2.2.6) with options ‘-q–sensitive -a -p 3–no-mixed –reorder’ to output all mappings. Read mappings were then processed according to the following rules. First, for each read pair only mappings with the lowest mismatch score (fewest mismatches and insertions or deletions) were kept. Next, for equally scoring mappings within a repeat family described above, the mapping to the transcript with the highest priority was identified as the ‘primary’ match. Only read pairs that mapped to a single repeat family were considered, whereas read pairs that mapped with equal scores to multiple repeat families were discarded from quantification at this stage. Mapping to the reverse strand of a transcript was considered distinct from forward strand mapping, such that each family paired with a separate antisense family composed of the same transcripts with the same priority order (except for simple repeats, which were all combined into one family).
Next, repeat mappings were integrated with unique genomic mappings identified from the standard eCLIP processing pipeline (described above) as follows. If a read pair mapped both uniquely to the genome and to a repetitive element, the mapping scores were compared; if the unique genome mapping was more than two mismatches per read (24 alignment score for the read pair) better than to the repeat element, the unique genomic mapping was used; otherwise, it was discarded and only the repeat mapping was kept. Next, PCR duplicate removal was performed (similar to the standard eCLIP processing pipeline) by comparing all read pairs based on their mapping start and stop position (either within the genome or within the mapped primary repeat) and unique molecular identifier sequence, removing all but one read pair for read pairs that shared these three values. Finally, the number of post PCR-duplicate removal read pairs mapping to each multi-copy family was counted in both IP and paired input sample and normalized for sequencing depth (counting post-PCR duplicate read pairs from both unique genomic mapping and repeat mapping). In addition, to better quantify signal to RepBase elements, RepeatMasker-identified repetitive elements in the hg19 genome were obtained from the UCSC Genome Browser. Element counts for RepBase elements were determined as the sum of repeat family-mapped read pairs plus uniquely genome-mapped read pairs that overlapped RepeatMasked RepBase elements. After removing repeat-mapping elements, the remaining reads were grouped and quantified on the basis of transcript region annotations (CDS, 3′UTR, 5′UTR, proximal or distal intronic, non-coding exonic, intergenic, or antisense to GENCODE transcripts). Significance was determined by Fisher’s exact test, or Pearson’s χ2 test where appropriate.
To summarize overall eCLIP signal, a relative information content metric was applied. The relative information content of each element in each replicate was calculated as pi × log2(pi/qi), where pi and qi are the fraction of total reads in IP and input, respectively, that map to element i. A merged relative information for both replicates was calculated by defining pi as the average fraction of total reads between the two biological replicates. To cluster data sets, dimensionality reduction was performed on element-relative information from the combination of both replicates using the t-SNE algorithm in MATLAB (2018a) with correlation distance, ‘exact’ algorithm, and perplexity = 10. To identify clusters, clustering was performed in using the DBSCAN (v1.0) MATLAB package, with options epsilon = 3 and MinPts = 2.
Quantification of eCLIP signal at region level
For analyses that used binding considered at the level of regions (for example, 3′UTR, CDS, or proximal intronic), read density was counted for the indicated region for both IP and paired input, and significance was determined by Fisher’s exact test (or Yates’s χ2 test if all observed and expected values were above 5). Only regions with at least 10 reads in one of IP or input, and where at least 10 reads would be expected in the comparison data set given the total number of usable reads, were considered, and significant regions were defined as those with fold enrichment ≥4 and P ≤ 0.00001.
Individual RBPs were depleted from HepG2 or K562 cells by either RNA interference (RNAi) or CRISPR-mediated gene disruption. RNAi was performed by transducing cells with lentiviruses expressing shRNAs (TRC collection) targeting an RBP followed by puromycin selection for 5 days. CRISPR-mediated gene disruption was performed by transfecting cells with a plasmid expressing Cas9 and a guide RNA (gRNA) targeting an RBP, followed by puromycin selection for 5 days. In each case, knockdowns were performed in biological duplicate along with a pair of control knockdowns using a scrambled shRNA or gRNA. Protein was extracted from half of each sample and used to confirm knockdown of the target RBP by western blotting. RNA was extracted from half of each sample and used to perform qRT–PCR to confirm knockdown of the targeted RBP transcript. We strived to obtain a knockdown efficiency of the target protein and/or RNA of at least 50% compared to the scrambled control, and for the knockdown efficiency to be within 10% between replicates. We used the extracted RNA to prepare RNA-seq libraries with the Illumina Tru-seq stranded mRNA library preparation kit. Paired-end 100-bp reads were generated from the RNA-seq libraries to an average depth of 63 million reads per replicate, and a minimum of 20 million reads per replicate, on an Illumina HiSeq 2500.
Primary data processing
Reads were aligned to both GRCh37 using the GENCODE v19 annotations and GRCh38 using the GENCODE v24 annotations using both TopHat version 2.0.840 with Bowtie2 version 2.1.041, and STAR version 2.4.042. All analyses described in this manuscript used the GRCh37/GENCODE v19 alignments, but the GRCh38/GENCODE v24 alignments are also available at the ENCODE portal. In all cases, alignments were performed against the male reference genome sequence for HepG2 cells or the female reference genome for K562 cells and simultaneously to the ERCC spike-in sequences. The command line parameters for the TopHat alignments were: -a 8 -m 0–min-intron-length 20–max-intron-length 1000000–read-edit-dist 4–read-mismatches 4 -g 20–no-novel-juncs–no-discordant–no-mixed. In some rare cases, TopHat 2.0.8 misassigned some reads to both strands or did not assign reads to either strand. To correct these errors, we used a custom script, tophat_bam_xsA_tag_fix.pl, to properly assign the SAM flag values. Gene expression levels were quantified using RSEM (v1.2.23)43 and Cufflinks (v2.0.2)44. Only samples with a Pearson correlation coefficient on FPKM values of 0.9 or greater between replicates were used for further analysis. Samples with a correlation below 0.9 were repeated. We used the custom script (makewigglefromBAM-NH.py) to convert the single .bam alignment files into plus or minus strand and unique and multi-mapped .bam files, and then converted the intermediate .bam files into bigwig files. A single, final .bam file was generated for each RNA-seq sample by merging the .bam files that contained the aligned read with the one that contained the unmapped reads. The merged .bam and bigwig files were submitted to the ENCODE Data Coordination Center (https://www.encodeproject.org/). In total, 237 HepG2 knockdown experiments (223 shRNA and 14 CRISPR) and 235 K562 knockdown experiments (217 shRNA and 18 CRISPR) were used for further analysis.
Gene expression quantification
Salmon (v1.1.0)45 was used with the –gcBias option to normalize for local GC content and quantify transcript abundance. Transcripts were then merged to genes using tximport (v1.14.2)46, after which CQN (v1.32.0)47 was used to normalize for gene-level GC content and length biases. DESeq2 (v1.26.0)48 was then used to quantify differential expression, with differentially expressed (DE) genes defined as those with a P value < 0.05 and adjusted P (Padj) < 0.05.
For the purposes of simplifying the analysis, we considered significant differential expression to be strong if |log2(fold-change)| ≥ 2, moderate when 1< |log2(fold-change)| < 2, and weak when |log2(fold-change)| ≤ 1.
Differential alternative splicing (AS) events were analysed using rMATS (v 3.2.1.beta)49. The knockdown replicate bam files and their control replicate bam files with the Gencode v19 annotation file were analysed using rMATS, to report five types of the differential AS events: SE (skipped exon), MXE (mutually exclusive exons), A3SS (alternative 3′ splice site), A5SS (alternative 5′ splice site) and RI (retained intron). Events with |inclusion level difference| > 0.05, P < 0.05 and FDR < 0.05 were identified as significantly differentially expressed AS events.
MISO (mixture of isoforms; v misopy-0.5.2)50 was used to detect differentially processed tandem 3′ UTR events (alternatively poly(A) site usage). Four pairwise comparisons between the two knockdown samples and two controls were run using compare-miso: KD-rep1 versus CN-rep1, KD-rep1 versus CN-rep2, KD-rep2 versus CN-rep1 and KD-rep2 versus CN-rep2. Significant tandem 3′ UTR events were identified if abs(Bayes factor) ≥5 and P < 0.05 on both the more_reads(KD-rep1, KD-rep2) versus fewer_reads(CN-rep1, CN-rep2) comparison and the fewer_reads(KD-rep1, KD-rep2) versus more_reads(CN-rep1, CN-rep2) comparison.
For the purposes of simplifying the analysis, we considered significant differential alternative splicing levels to be strong if |ΔΨ| ≥ 30%, moderate when 15% ≤ |ΔΨ| < 30%, and weak if 5% < |ΔΨ| < 15%.
Batch normalization of RBP KD–RNA-seq data
Batch effects are common in large data sets and must be corrected and accounted for51. To correct for batch effects, for each batch of experiments performed on a given day, the same scrambled shRNA or gRNA was used as a non-specific control alongside a batch of experimental shRNAs or gRNAs that targeted a set of RBPs. This provided a consistent, non-specific control experiment in every batch that could be used to normalize data downstream. In addition to biological controls, if a given batch of biological samples was too large to make all the RNA-seq libraries in parallel, libraries were made from the non-specific control RNA samples in each subset of libraries made from a given biological batch. Analyses that compared eCLIP peaks with gene expression or alternative splicing changes in RNA-seq upon RBP knockdown used changes identified relative to these within-batch paired controls. However, to enable further integrated analyses, additional batch correction was performed (Supplementary Text, Supplementary Fig. 7).
RBNS experiments were performed as indicated in the protocol included on each experiment at the ENCODE portal. In brief, randomized RNA oligonucleotides (20 or 40 nt) flanked by constant adaptor sequences were synthesized and incubated with an SBP-tagged recombinant RBP (consisting minimally of all annotated RNA-binding domains) at several concentrations (typically five, ranging from 5 to 1,300 nM). RNA–protein complexes were isolated with streptavidin-conjugated affinity resin and eluted RNA was prepared for deep sequencing, resulting in 10–20 million reads per RBP pulldown concentration with a similar number of input reads sequenced per in vitro transcription reaction.
RBNS kmer enrichments (R values) were calculated as the frequency of each kmer in the pulldown library reads divided by its frequency in the input library; enrichments from the pulldown library with the highest individual kmer R value were used for each RBP. The mean and s.d. of R values were calculated across all kmers for a given k to calculate the RBNS Z-score for each kmer.
RBNS motif logos were made using the following iterative procedure for k = 5: the most enriched 5mer was given a weight equal to its excess enrichment over the input library (= R – 1), and all occurrences of that 5mer were masked in both the pulldown and input libraries to eliminate subsequent counting of lower-affinity ‘shadow’ 5mers (for example, GGGGA, shifted by 1 from GGGGG). All enrichments were then recalculated on the masked read sets to obtain the most enriched 5mer and its corresponding weight, with this process continuing until the enrichment Z-score (calculated from the original R values) was less than 3. All 5mers determined from this procedure were aligned to minimize mismatches to the most enriched 5mer, with a new motif initiated if the number of mismatches plus offsets exceeded two. The frequencies of each nucleotide in the position weight matrix, as well as the overall percentage of each motif, were determined from the weights of the individual aligned 5mers that went into that motif; empty unaligned positions before or after each aligned 5mer were assigned pseudocounts of 25% of each nucleotide, and outermost positions of the motif logo were trimmed if they had >75% unaligned positions. To improve the robustness of the motif logos, the pulldown and input read sets were each divided in half and the above procedure was performed independently on each half; only 5mers identified in corresponding motif logos from both halves were included in the alignments to make the final motif logo. In Fig. 3a, only the top RBNS motif logo is shown if there were multiple logos (all motifs displayed on the ENCODE portal within the ‘Documents’ box for each experiment).
Immunofluorescence, microscopy imaging and data processing
HepG2 cells were seeded in poly-l-lysine-coated 96-well clear bottom plates (Corning; plate number 3882 half-area microplates), at a concentration of 2,000 cells per well in DMEM + 10% FBS. After 72 h in standard growth conditions (37 °C and 5% CO2), cells were fixed with 3.7% formaldehyde, permeabilized in PBS + 0.5% Triton X-100 and blocked in PBS + 0.2% Tween-20 + 2% BSA (PBTB), all conducted for 20 min at room temperature. Primary antibodies directed against specific RBPs (all rabbit antibodies) and marker proteins were subsequently applied to the cells at a final concentration of 2 μg/ml in PBTB and incubated overnight at 4 °C. The cells were next washed three times for 10 min each in PBST and incubated with secondary antibodies (Alexa647 donkey anti-rabbit and Alexa488 donkey anti-mouse, both diluted 1:500 in PBTB) for 90 min at room temperature. After three PBTB washes, the cells were counterstained with DAPI for 5 min, washed three times in PBS and stored in PBS at 4 °C. Subcellular marker antibodies and dilutions used were as follows: rat anti-α-tubulin, MCA78G, 1:200 (Serotec, Bio-Rad); mouse anti-CD63, ab8219, 1:200 (Abcam); mouse anti-coilin, GTX11822, 1:100 (GeneTex); mouse anti-DCP1a, sc100706, 1:200 (Santa Cruz Biotechnology); mouse anti-fibrillarin, ab4566, 1:200 dilution (Abcam); mouse anti-GM130, #610822, 1:200 (Becton Dickinson); mouse anti-KDEL, ENZSPA827D, 1:200 (Enzo Life Sciences); mouse anti-phosphotyrosine, #9411S, 1:200 (NEB); mouse anti-PML, sc-966, 1:50 (Santa Cruz Biotechnology); mouse anti-SC35, GTX11826, 1:200 (GeneTex). For staining with Mitotracker (Molecular Probes, M22426), cells were incubated with 100 nM dye in tissue culture medium for 45 min at 37 °C before fixation. For staining with phalloidin (Sigma, P5282), cells were incubated with 50 μg/ml of phalloidin for 20 min before DAPI staining.
Imaging was conducted on an ImageXpress Micro high content screening system (Molecular Devices). For each RBP–marker combination, 10–20 high-resolution images were acquired in the DAPI, FITC and Cy5 channels, using a 40× objective. Automated laser-based auto-focusing and auto-exposure functions were used for sample imaging, with exposure times ranging from 250 to 3,000 ms, 100 to 500 ms and 50 to 100 ms for RBP, marker and DAPI channels, respectively. Raw unprocessed greyscale images from individual channels were acquired as high-resolution TIF files of 726 kb each. An in-house MATLAB script was developed to batch normalize image intensity values and add blue, green or red colours to the respective channels, which were subsequently merged as colour JPEG files. The final images were uploaded on a server accessible through the RBP Image Database website. A MySQL relational database (version 5.1.73) was implemented, along with a MyISAM storage engine, to store the images, data annotations and characteristics. A controlled vocabulary of descriptors was devised to document RBP subcellular localization features.
Image analysis to quantify nuclear:cytoplasmic staining ratios, or to assess the degree of RBP targeting to punctate subcellular structures (for example, Cajal bodies, nuclear speckles, nuceloli, Golgi and P-bodies), was conducted using ‘Granularity’, ‘Colocalization’ and ‘Multi Wavelength Cell Scoring’ analysis modules from the MetaXpress v3.1 software (Molecular Devices), according to the manufacturer’s recommendations. For localization categories including microtubules, actin, cell cortex, ER, focal adhesions, mitochondria and mitotic apparatus, manual localization grading was conducted by ranking candidate RBPS as strongly or weakly co-localized with respective protein markers. The Circos plot of localization co-occurrance (Extended Data Fig. 10a) was generated by drawing one line between every pair of categories for each RBP that shared both localization annotations. Nuclear annotations are indicated in purple, cytoplasmic in red, and lines between nuclear and cytoplasmic annotations are indicated in yellow.
Chromatin IP was implemented according to the ChIP Protocol optimized for RNA-binding proteins (https://www.encodeproject.org/documents/e8a2fef1-580b-45ad-b29c-fffc3d527202/@@download/attachment/ChIP-seq_Protocol_for_RNA-Binding_Proteins_ENCODE_Fu_lab_RuiXiao.pdf). In brief, before coupling with RBP antibodies, magnetic beads were equilibrated by washing with ChIP dilution buffer and blocked with glycogen, BSA and tRNA in ChIP dilution buffer. Between ten million and twenty million HepG2 and K562 cells were crosslinked in 1% formaldehyde diluted in 1× PBS for 20 min and then quenched by adding glycine. Cell nuclei were extracted by resuspending the cell pellet with cell lysis buffer with occasional inversion. Nucleus pellets resuspended in nuclear lysis buffer were sonicated with a Branson Sonifier cell disruptor. Ninety-five per cent of nuclear lysate was diluted to a final concentration of 1% triton X-100, 0.1% sodium deoxycholate and 1× proteinase inhibitor cocktail and was subjected to IP with antibody-coupled beads; the other 5% of nuclear lysate was used as input chromatin. Stringent washes were performed before elution. Input and immunoprecipitated chromatin DNAs were recovered by decrosslinking, RNase A digestion, proteinase K treatment, phenol/chloroform extraction and precipitation with ethanol. Library construction was performed using the ChIP–seq Sample Prep Kit (Illumina). DNA libraries between 200 and 400 bp were gel-purified, quantified with Qubit and sequenced on the Illumina HiSeq 2000/2500. All RBP ChIP–seq experiments were performed in duplicate. Antibodies used in RBP ChIP–seq experiments were validated by IP and shRNA or CRISPR knockdown according to ENCODE RBP antibody characterization guidelines.
RBP ChIP–seq data sets used in this study were processed by the ENCODE Data Coordinating Center with the same uniform processing pipelines described previously for transcription factor ChIP–seq (https://www.encodeproject.org/chip-seq/transcription_factor/). After removal of low-quality and PCR duplicate reads, peaks were identified with SPP and reproducible peaks across biological replicates were identified with the IDR pipeline to yield two sets (optimal and conservative) of peaks at an IDR threshold of 0.0552.
Saturation analysis of eCLIP and KD–RNA-seq data was performed by randomly shuffling the order of data sets 100 times, subsampling 1 through all data sets, and calculating the desired metrics. Gene level saturation analysis of RBP binding was calculated first by taking all unique genes that were bound by an IDR filtered peak in an eCLIP experiment. Then, each eCLIP experiment was iteratively added to the previous experiment, counting only unique genes in any experiment. Saturation analysis of differentially expressed genes from KD–RNA-seq was performed similarly, based on differentially expressed genes identified with DESeq2. Genes were identified as differentially expressed if they had a Padj of <0.05 between knockdown and control. Alternative versions of this analysis used all genes (Extended Data Fig. 2g), only genes with TPM >1 in HepG2 and K562 cells (Supplementary Fig. 13a), or only genes with TPM >1 in either HepG2 or K562 cells (Supplementary Fig. 13b), using average gene-level expression from two rRNA-depleted RNA-seq experiments in HepG2 (ENCODE accession ENCFF533XPJ, ENCFF321JIT) and K562 cells (ENCFF286GLL, ENCFF986DBN). The set of differentially expressed and bound genes was determined by taking all genes that were differentially expressed upon RBP KD that contained at least one IDR-filtered peak in the corresponding eCLIP experiment in the same cell type.
Differentially spliced events were defined as those with P < 0.05, FDR < 0.1, and change in per cent spliced in (|ΔΨ|) > 0.05 from rMATS analysis (described above). The number of unique events was defined as the number of non-overlapping events obtained upon combining all experiments for a given sampling. A differentially spliced event was considered bound if for any RBP in which the event was differentially included upon KD, there was an eCLIP peak for the same RBP in the same cell type between the start of the upstream flanking exon and the end of the downstream flanking exon for skipped exons and mutually exclusive exons, the start of the upstream flanking exon and end of the common exon region for A3SS, the start of the common exon and end of the common exon region for A5SS, and the start of the upstream and stop of the downstream exons for retained introns.
To perform saturation of transcript regions, the highest-expressed transcript for each gene was first identified using transcript-level quantification from the same rRNA-depleted RNA-seq experiments described above. The following regions were then identified: the entire unspliced transcript (pre-mRNA), all exons (exon), 5′ UTR, CDS, 3′UTR, all introns (intron), 100-nt intronic regions flanking the 5′ and 3′ splice sites (splice site), proximal intronic regions extending from 100 nt to 500 nt from the 5′ and 3′ splice site (proximal intron), and distal intronic regions extending from 500 nt and beyond from the 5′ and 3′ splice sites. Saturation calculations were then performed as described above for all genes (Supplementary Fig. 13c, e–g) or only genes with TPM > 1 in both K562 and HepG2 cells (Extended Data Fig. 2i, Supplementary Fig. 13d), and plotted as either the total number of bases covered (Supplementary Fig. 13c, d), or the fraction of covered bases divided by the total number of bases in that annotation across all genes (Extended Data Fig. 2i).
The fold-increase in bases covered was calculated by dividing the number of bases covered in a subsampling of n + 1 data sets divided by the number covered in subsampling n data sets. Analysis of the fold-increase between one and two data sets (Supplementary Fig. 13f) was determined by first taking all 73 RBPs profiled in both HepG2 and K562 cells, and calculating the fold-increase in covered bases by considering 146 comparisons including HepG2 followed by K562 and K562 followed by HepG2. Then, for each of the 146 comparisons, 10 other random data sets were chosen from the same cell type, and for each of the 10, the fold-increase in covered bases from adding that data set to the first was calculated.
To compare the fold-increase between profiling new RBPs in additional cell lines (Supplementary Fig. 13g), eCLIP data sets profiling RBFOX2, IGF2BP1, IGF2BP2, and IGF2BP3 in H9 human embryonic stem cells were obtained from the Gene Expression Omnibus (GSE78509)53, and added as the 224th data set. These were compared against profiling a new RBP in K562 or HepG2 cells (calculated by adding each of the 150 profiled RBPs as the 222nd (if it was profiled in both cell types) or 223rd (if it was profiled in only one cell type) data sets for other RBPs), or a profiled RBP done in second cell type (calculated by sampling 222 data sets and adding the 223rd).
Preservation of RBP regulation across cell types
To consider binding across cell types, first the highest-expressed transcript for each gene was identified using transcript-level quantification from the same rRNA-depleted RNA-seq experiments described above and used as representative for that gene. Next, genes were categorized on the basis of the relative expression difference between K562 and HepG2 cells: unchanged (fold-difference ≤ 1.2), weakly (1.2 < fold-difference ≤ 2), moderately (2 < fold-difference ≤ 5) or strongly (fold-difference > 5) differential (each of which required expression TPM ≥ 1 in both K562 and HepG2 cells), cell type-specific genes (TPM < 0.1 in one cell type and TPM ≥ 1 in the other), or other (containing all other genes in GENCODE v19). Peaks were then categorized on the basis of the expression change of their associated gene (Supplementary Fig. 13h).
Analysis of preservation of binding across cell types was considered in three ways. First, for each peak identified in one cell type, the fold enrichment for that region in the other cell type was calculated and considered for each gene type (Fig. 2d). For further analyses, two groups of peaks were then identified: those that were ≥4-fold enriched in the other cell type, and those that were not enriched in the other cell type. The fraction of peaks associated with a gene class that were either ≥4-fold or not enriched were then considered for each gene class separately (Fig. 2e). Second, the set of peaks that were ≥4-fold enriched (and the set not enriched) was compiled across all genes, and the fraction associated with each gene class were then reported (Extended Data Fig. 2k). Finally, peak overlap between cell types (Extended Data Fig. 2j) was calculated by determining the fraction of IDR peaks identified in one cell type that overlap (requiring at least 1 nt overlap) IDR peaks identified in the second cell type. For all comparisons, significance between groups was determined by two-sided Kolmogorov–Smirnov test.
Motif comparisons between RBNS and eCLIP
eCLIP 5mer and 6mer Z-scores (in Fig. 3b and elsewhere) were calculated as previously described54. In brief, peaks and a shuffled background set of peaks that preserved the region of binding (3′UTR, 5′UTR, CDS, exon, proximal and distal intron) were generated. EMBOSS compseq55 was used on these two peak sets and the Z-scores of the difference between real and background 5mer and 6mer frequencies were calculated.
To produce eCLIP logos in a similar manner for comparison with RBNS logos, an analogous procedure was carried out on the eCLIP peak sequences (for this analysis, eCLIP peaks with at least twofold enrichment were used): the two halves of the RBNS pulldown read set were replaced with the two eCLIP replicate peak sequence sets (each peak was extended 50 nt upstream of its 5′ end, as some RBPs have motif enrichments symmetrically around or only upstream of the peak starts), and the input RBNS sequences were replaced by random regions within the same gene as each peak that preserved peak length and transcript region (5′ and 3′ UTR peaks were chosen randomly within that region; intronic and CDS peaks were shuffled to a position within the same gene that preserved the peak start’s distance to the closest intron–exon boundary to match sequence biases resulting from CDS and splicing constraints). The enrichment Z-score threshold for 5mers included in eCLIP logos was 2.8, as this threshold produced eCLIP logos containing the most similar number of 5mers to that of the Z ≥ 3 5mer RBNS logos. Each eCLIP motif logo was filtered to include only 5mers that occurred in both of the corresponding eCLIP replicate logos. eCLIP motif logos were made separately for all eCLIP peaks, only 3′UTR peaks, only CDS peaks, and only intronic peaks, with the eCLIP logo of those 4 (or 8 if CLIP was performed in both cell types) with the highest similarity score to the RBNS logo shown in Fig. 3a, where the similarity score was the same as previously described to cluster RBNS logos (eCLIP logos for all transcript regions shown in Extended Data Fig. 3e). To determine the significance of overlap between RBNS and eCLIP, a hypergeometric test was performed with 5mers in all RBNS logos, eCLIP logo 5mers (for peaks in the region with highest similarity score to the RBNS logo), and 5mers in their intersection, relative to the background of all 1,024 5mers; overlap was deemed significant if P < 0.05. The top ‘eCLIP-only’ logo in each region was the highest eCLIP logo, if any, comprised of 5mers that had no overlap with any RBNS Z ≥ 3 5mers (always using at least the top ten RBNS 5mers if there were fewer than 10 with Z ≥ 3).
All eCLIP/RBNS comparisons were for the same RBP with the following exceptions in which the eCLIP RBP was compared to a closely related RBNS protein: KHDRBS2 eCLIP versus KHDRBS1 RBNS; PABPN1 eCLIP versus PABPN1L RBNS; PTBP1 eCLIP versus PTBP3 RBNS; PUM2 eCLIP versus PUM1 RBNS; and RBM15 eCLIP versus RBM15B RBNS.
Splicing regulatory effects of RBNS+ and RBNS– eCLIP peaks
To assess the splicing regulatory effects of RBNS+ and RBNS– eCLIP peaks for Fig. 3c, only rMATS skipped exons with a Ψ between 0.05 and 0.95 in at least one of the controls or KDs were considered for each RBP. Each eCLIP peak (extended 50 nt 5′ of the peak start) was first checked for whether it overlapped the SE, and if not then for whether it overlapped the upstream or downstream flanking 250 nt. To compare the magnitude of splicing changes upon KD for eCLIP+ versus eCLIP– skipped exons while minimizing the confounding factors of different wild-type host gene expression level and skipped exon Ψ values among these two sets of skipped exons, we created a matched set of eCLIP– skipped exons by selecting for each eCLIP+ skipped exon a skipped exon in the same decile of wild-type gene expression and wild-type Ψ for each corresponding skipped exon with an eCLIP peak. A cumulative distribution function of the ΔΨ changes upon KD was compared for the eCLIP+ versus eCLIP– skipped exons in each of the six skipped exon (SE) direction–eCLIP region combinations ([included, excluded SE] × [peak over SE, upstream intron, downstream intron]), with significance P < 0.05 for a one-sided Wilcoxon rank-sum test that |ΔΨ|SE, peak > |ΔΨ|SE, no peak. If the eCLIP+ versus eCLIP– comparison was significant, the eCLIP peaks were divided into those that did and did not contain the top RBNS 5mer. The ΔΨ values for all RBPs in each of the six skipped exon direction–eCLIP regions were combined for comparison in Fig. 3c; see Extended Data Figure 4c for RBPs that were significant in each region (12 included and 4 excluded upon KD, upstream intron eCLIP peak; 11 included and 2 excluded upon KD, skipped exon eCLIP peak; 7 included and 7 excluded upon KD, downstream intron eCLIP peak). To assess eCLIP peaks with or without the top ‘eCLIP-only’ kmer, the top 5mer from the aforementioned ‘eCLIP-only’logo was used from the first region with an eCLIP-only logo among: all peaks; CDS peaks; intron peaks; and 3′UTR peaks (the more highly enriched 5mer if eCLIP was performed in both cell types). The resulting ‘eCLIP-only’ 5mers for Extended Data Fig. 4d were: CELF1 (CUCUC), EIF4G2 (GUGUG), EWSR1 (CGCGG); FUBP3 (UUGUU); FUS (GUGUG); HNRNPC (GUCGC); HNRNPK (UCCCC); HNRNPL (none); IGF2BP1 (GUGUG); IGF2BP2 (CGCCG); KHDRBS2: (none); KHSRP (none); PABPN1L (CGCGG); PCBP2 (CGGCG); PTBP3 (GAAGA); PUM2 (UUUUU); RBFOX2 (GGGGG); RBM22 (GGUAA); SFPQ (UCCGG); SRSF5 (CGGCG); SRSF9 (CUGGA); TAF15 (AGGGA); TARDBP (GAAGA); TIA1 (CGCCG); TRA2A (GAGGG).
Overlaps between RBP binding and gene expression perturbation upon KD–RNA-seq
To increase sensitivity for gene expression analysis, significant binding was determined at the level of transcript regions (including 5′UTR, CDS, 3′UTR, and introns) instead of using peaks. To identify significant enrichment between binding and expression changes, genes with significantly enriched eCLIP signal at regions (P ≤ 0.00001 and log2(fold enrichment) ≥ 4, as described above) were overlapped with the set of genes with significantly altered expression in KD–RNA-seq (Padj < 0.05 between knockdown and control from DEseq analysis). Enrichment was calculated separately for knockdown-increased and knockdown-decreased genes, with significance determined by Fisher’s exact test (or Yates’s χ2 test if all observed and expected values were above 5). Comparisons with either knockdown-increased or knockdown-decreased genes from KD–RNA-seq were performed only if more than 10 genes showed significant changes. To avoid biases due to RNA abundance, for each comparison of a region type with each eCLIP data set, a background set of genes was created by identifying all genes for which the region type (5′UTR, CDS, 3′UTR) had at least 10 reads in one of IP or input; at least 10 reads would be expected in the opposite (IP or input) data set given the total number of usable reads. For cumulative distribution plots, genes were separated on the basis of their eCLIP fold enrichment in IP versus input for the indicated transcript region.
RBP binding correlation with knockdown-perturbed splicing (splicing maps)
RBP binding or splicing maps were generated using eCLIP-normalized (reads per million) read densities overlapped with alternatively spliced (AS) regions from rMATS JunctionCountsOnly files from the same cell type using the RBP-Maps methodology24 (Supplementary Text, Supplementary Fig. 14). Analyses described used only events with rMATS P < 0.05, FDR < 0.1, and |ΔΨ| > 0.05 in knockdown versus control RNA-seq.
Correlation between splicing maps was defined as the Pearson correlation (R) between a vector that contained both included-upon knockdown and excluded-upon knockdown RBP-responsive event eCLIP enrichment for each RBP. If an RBP had fewer than the minimum required number of events (100 for skipped exons or 50 for alternative 5′ or 3′ splice site events) for either knockdown-included or knockdown-excluded events, the correlation was calculated only using the other event type.
To generate cross-RBP splicing maps, the above approach was modified by taking the set of differentially included (or excluded) skipped exons identified in knockdown of RBP A and calculating the eCLIP splicing map separately for every other RBP within the same binding class (determined in Fig. 2a) as RBP A, including the normalization against a background of eCLIP signal for native skipped exon events (as shown for HNRNPC knockdown-included, RBFOX2 knockdown-excluded, and TIA1 knockdown-included skipped exons in Extended Data Fig. 8b, Fig. 5d, and Extended Data Fig. 8e, respectively). The average across all RBPs was then used to calculate the average cross-RBP enrichment (Extended Data Fig. 8a).
To calculate the number of RBPs bound per exon, the set of spliceosomal RBPs was taken from manual annotation of RBP functions (described above and listed in Supplementary Data 1). The number of reproducible (IDR) peaks at each position relative to splice sites was summed across all RBPs and divided by the total number of skipped or constitutive exons.
Comparison of DNA- and RNA-binding properties of RBPs
For integrative analyses, DNaseI HS data (http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeOpenChromSynth), histone modifications by ChIP–seq from ENCODE or the Broad Institute (http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeBroadHistone) and eCLIP–seq data from ENCODE (https://www.encodeproject.org) were downloaded and compared with RBP ChIP–seq data.
To explore the possibility that some RBP chromatin association events might be coupled with their direct RNA-binding activities in cells, RNA binding peaks were compared with DNA binding signals as assayed by ChIP–seq to quantify enrichment. Only eCLIP peaks in gene body regions (excluding promoter and terminator regions, defined as the 1 kb surrounding regions of TSS and TTS) were considered. ChIP–seq signals were calculated for each eCLIP peak along with surrounding regions that are ten times the length of eCLIP peak on each side. Wilcoxon rank-sum tests were then performed to see whether ChIP–seq signals were enriched at the middle regions relative to the flanking regions.
To see whether those differentially expressed genes after RBP knockdown were enriched in RBP binding at chromatin level, equal numbers of genes with similar expression level either with or without binding to the TSS region were randomly sampled, the number of differentially expressed genes after knockdown of the RBP were counted (fold-change >1.5 or <2/3, Padj < 0.05 by DESeq2), and one-tailed Fisher’s exact tests were then performed to test the dependence of RBP binding and differential expression. Odds ratio was defined as (a/b)/(c/d), where a is the number of genes with RBP ChIP–seq peaks and differential expression (or splicing) upon RBP knockdown, b is the number of genes with RBP ChIP–seq peaks but no differential expression, c is the number of genes without ChIP–seq peaks but with differential expression, and d is the number of genes without ChIP–seq peaks or differential expression. The above procedure was performed 100 times to give the distribution of the odds ratio, and a significant dependence was defined as when the null hypothesis was rejected at level of 0.05 at least 95 times. The correlation between RBP association and genes with regulated alternative splicing events (A3SS, A5SS, RI, MXE and skipped exon events) was investigated similarly.
Analysis of RBP regulatory features in subcellular space
Localization annotations and calculation of nuclear versus cytoplasmic ratio were generated from immunofluorescence imaging as described above. ‘Nuclear RBPs’ were defined as those with a nuclear:cytoplasmic ratio ≥ 2, and ‘cytoplasmic RBPs’ were defined as those with a nuclear:cytoplasmic ratio ≤ 0.5. Spliced reads were defined as reads that mapped across an annotated GENCODE v19 splice junction (extending at least ten bases into each exon) and unspliced reads were defined as reads that overlapped an exon–intron junction (extending at least ten bases into both the exon and intron regions). Significance between groups was determined by Wilcoxon rank-sum test. Prediction of RNA secondary structure was performed using the RNAfold webserver (http://rna.tbi.univie.ac.at//cgi-bin/RNAWebSuite/RNAfold.cgi)56 with default parameters. Shown is the MFE secondary structure prediction.
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.