Generation of RNA spike-ins

Synthetic RNA spike-ins (from custom-made DNA sequences corresponding to ERCC-00043, ERCC-00170, ERCC-00136, ERCC-00145, ERCC-00092, and ERCC-00002, Ambion) were generated as in ref. 1. Briefly, three 4-thiouridin (4sU)-labelled and three unlabelled RNA spike-ins were in vitro transcribed using the MEGAscript T7 Transcription Kit (Invitrogen) according to the manufacturer’s instructions. Except, in vitro transcription of 4sU-labelled spike-ins was carried out substituting 10% of UTP with 4-thio-UTP (Jena Bioscience). Spike-ins were purified individually using RNAClean XP beads (Beckman Coulter) according to the manufacturer’s instructions and quantified using the Qubit Fluorometer (Invitrogen). Quality was assessed using the Bioanalyzer 2100 (Agilent) and a final spike-in pool was generated containing equal amounts of all six synthetic spike-in RNAs.

Cell culture and stimulation

Cells were grown in RPMI 1640 medium (Gibco) supplemented with 10% foetal bovine serum (Sigma) and 1% Penicillin/Streptomycin (HyClone) at 37 °C under 5% CO2. Jurkat cells (E6.1 clone) were acquired from ATCC (TIB-152), K562 cells from DSMZ (ACC-10). K562 cells were authenticated at the DSMZ Identification Service according to standards for STR profiling (ASN-0002). Jurkat cells were purchased from ATCC shortly before use and grown at low passage numbers. Cells were routinely tested for mycoplasma contamination (MycoAlert, Lonza). On the day of the experiment, Jurkat cells were seeded to a density of 1*106 cells/ml and stimulated with 50 ng/ml phorbol 12-myristate 13-acetate (PMA) and 1 μM ionomycin (Sigma) for 30 min at 37 °C under 5% CO2.

Bulk RNA alkylation

K562 cells were labelled with 500 μM 4sU (Sigma) at 37 °C under 5% CO2 for 3 h. Total RNA was isolated using TRIzol (Life Technologies) according to the manufacturer’s instructions. RNA spike-in mix was added during RNA isolation. For bulk alkylation, 5 μg of RNA was alkylated as described in Herzog et al.8. Briefly, total RNA resuspended in 1mM DTT was treated with iodoacetamide (IAA) at 50 °C for 15 min in a final alkylation reaction of 50 mM sodium phosphate buffer pH 8.0, 10 mM IAA, and 50% DMSO. The reaction was quenched by adding DTT (20 mM final concentration) and RNA was purified using isopropanol precipitation. RNA was analysed using a Bioanalyzer 2100 system (Agilent). Low-input alkylation was performed as described below (NASC-seq).

TT-seq

Two TT-seq experiments were performed in biological duplicates, as described previously1. Briefly, Jurkat cells were treated for 15 or 30 min with solvent control (DMSO, Sigma) or with PMA and ionomycin. During the last 5 min of each time point cells were labelled in media with 500 μM 4sU (Sigma) at 37 °C under 5% CO2. Cells were harvested by centrifugation at 1400×g for 2 min. Total RNA was extracted using TRIzol (Life Technologies) according to the manufacturer’s instructions under the addition of spike-ins. RNAs were sonicated using in a Bioruptor Plus instrument (Diagenode). 4sU-labelled RNA was purified from 300 μg total fragmented RNA. Separation of labelled RNA was achieved with streptavidin beads (Miltenyi Biotec). Prior to library preparation, 4sU-labelled RNA treated with DNase (Qiagen), purified (miRNeasy Micro Kit, Qiagen), and quantified. Strand-specific libraries were prepared with the Ovation Universal RNA-Seq System (NuGEN). The size-selected and pre-amplified fragments were analysed on a Bioanalyzer 2100 (Agilent). Samples were sequenced on an Illumina NextSeq 500 instrument. Data analysis was performed essentially as in Michel et al. 15. Briefly, paired-end 75 bp reads were mapped with STAR26 (version 2.6.0c) to the hg38 (GRCh38) genome assembly (Human Genome Reference Consortium). Gene expression fold-changes upon T-cell stimulation for each time point were calculated using the R/Bioconductor implementation of DESeq227 setting lfcThreshold = 1. Differentially expressed genes were identified applying a P-value cutoff of 0.05 comparing sample to solvent control measurements. For comparison of NASC-seq data with TT-seq data, we calculated the fraction of new RNA (Figs. S1e and S2c) by taking the sum of reads from newly synthesised RNA divided by the sum of new and old reads over all cells, thereby creating an in silico bulk for better comparison with bulk TT-seq data. K562 TT-seq data was taken from Schwalb et al. 1 (GSE75792).

NASC-seq

Cells were labelled in medium with 4sU (Sigma), washed with cold PBS and sorted into lysis buffer (3 μl, 166 mM sodium phosphate pH 8.0, RNase inhibitor, spike-in RNAs) in wells of PCR plates. Plates were frozen at −80 °C until use. Streptavidin beads (MyOne Dynabeads Strepavidin C1) were washed twice with buffer 1 (0.1 M NaOH, 0.05 M NaCl), twice with buffer 2 (0.1 M NaOH), and once with 2x B&W buffer (2 M NaCl, 10 mM Tris–HCl pH7.4, 1 mM EDTA) before the binding reaction (1x B&W, 50 μM oligo-dT). Beads were incubated with agitation at room temperature for 15 min and washed twice with 1x B&W buffer. Cells were lysed at 80 °C for 3 min and beads were added to the cells in a volume of 2 μl. RNA was bound during 20 min of incubation at room temperature on a thermoshaker (Eppendorf ThermoMixer C). 5 μl of alkylation mix (20 mM IAA in DMSO) was added for a final alkylation reaction of 50 mM sodium phosphate pH 8.0, 10 mM IAA, 50% DMSO. After 15 min at 50 °C, the alkylation reaction was stopped by adding STOP solution (2x Superscript II buffer, 0.3% Tween 20, 60 mM DTT), incubating for 5 min on a magnet, and removing the supernatant from the beads containing the alkylated RNA. RT and the remaining library preparation were performed according to a modified version of Smart-seq212. The modifications included the removal of the inactivation step from the RT thermocycling programme, performing the RT on a thermoshaker (Eppendorf ThermoMixer C) and the use of custom barcoded primers for the indexing and amplification after Nextera tagmentation. The resulting libraries were sequenced on a NextSeq500 instrument (Illumina) using either single-end (75-cycle) or paired-end (2 × 150-cycle) sequencing strategies.

Computational pipeline

After sequencing, the resulting bcl files were demultiplexed to fastq files with bcl2fastq (Illumina). Nextera adapters were trimmed with TrimGalore28 (v 0.4.5). All paired-end (2 × 150-cycle) data was then aligned to the hg38 human genome using STAR26 (v 2.5). Early NASC-seq data was sequenced using a single-end (75-cycle) sequencing strategy and was aligned to hg19. For all paired-end data we removed duplicates using the MarkDuplicates command in Picard (v 2.17.6). We then annotated the gene each reads maps to using FeatureCounts29 (v 1.6.2). We further annotated all mismatches to the reference genome within each read with the location of the mismatch for the T–C or A–G mismatches depending on the strand of the annotated gene (plus and negative, respectively). We identified mismatches in positions which appear in a high frequency over many cells and marked those positions as single nucleotide variants (SNV) to be ignored. We re-annotated the mismatches for each read with the SNV positions ignored and avoided double counting in overlapping paired end reads. Cells with low numbers of mapping reads were removed; for Jurkat cell experiments, each cell was required to have 600,000 reads mapping to features, while for K562 experiments, cells were required to have 300,000 reads mapping to features.

We then estimated the probability of a given position being converted in a new read (pc) based on the background probability of a mismatch due to error (pe) for each cell. We estimated pe by calculating the mean fraction of C–T and G–A mismatches in the given cell, since we concluded that these mismatches agree with the T–C and A–G mismatches in unlabelled cells. To estimate pc we implemented the Expectation-Maximisation algorithm described in Jürges et al. 14. For each gene in each cell, we estimated the proportion of new reads, πg using the binomial mixture model described in GRAND-SLAM14 (see below). We only ran the estimation if the gene had a minimum of 16 reads mapped to ensure reliable estimates. If the gene had more than 1000 mapped reads, we subsampled down to 1000 reads to shorten runtime.

All code for the processing of NASC-seq data and the implementation of the binomial mixture model is available on GitHub (https://github.com/sandberg-lab/NASC-seq). To facilitate adoption of our computational approach we have prepared an Amazon machine image that is preloaded with our pipeline as well as example data (see GitHub for more details).

Estimating the proportion newly transcribed transcripts

To estimate the proportion of new reads for each gene, πg, we implemented the binomial mixture model described in Jürges et al. 14. In the mixture model, each mismatch is either due to a conversion with probability pc or an error with probability pe. The probability of y positions having mismatches in a read containing n positions which may be converted is

$$P\left( {y;p_{\mathrm{{e}}},p_{\mathrm{{c}}},n,\pi _{\mathrm{{g}}}} \right) = \left( {1 – \pi _{\mathrm{{g}}}} \right)B\left( {k,n,p_{\mathrm{{e}}}} \right) + \pi _{\mathrm{{g}}}B\left( {k,n,p_{\mathrm{{c}}}} \right)$$

(1)

where \(B\left( {k,n,p} \right)\) is the binomial probability mass function.

We estimated πg by building a generative model in the STAN modelling language. We use a beta prior for πg with hyperparameters α and β

$$\pi _{\mathrm{{g}}}\sim {\mathrm{{Beta}}}\left( {\alpha ,\beta } \right)$$

(2)

and estimate πg by maximising the log-likelihood

$$\mathop {\sum }\limits_i {\mathrm{ln}}\left( {P\left( {y_{\mathrm{{i}}};p_{\mathrm{{e}}},p_{\mathrm{{c}}},n_{\mathrm{{i}}},\pi _{\mathrm{{g}}}} \right)} \right)$$

(3)

where each index i indicates a read for that gene. The hyperparameters were log-transformed and both initialised at 0, while πg was initialised at 0.5. We will then estimate the mean of the beta distribution, which cannot be 0 or 1 by definition. The mode is therefore more appropriate, which we can calculate by

$$\pi _{{\mathrm{{g}}}\;{\mathrm{mode}}} = \frac{{\alpha – 1}}{{\alpha + \beta – 2}}$$

(4)

for \(\alpha ,\;\beta > 1\). The mode is 1 if α > 1, β < 1 and 0 if α < 1, β > 1. The other possible cases do not occur in our estimation procedure.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source