Chemicals and reagents

Unless stated otherwise, all chemicals and reagents were obtained from Sigma Aldrich (Buchs, Switzerland). Enzymes were obtained from New England Biolabs (Ipswich, MA, USA). Oligonucleotides (Supplementary Table 1), custom duplex DNA adapters, synthetic genes, and gene fragments were obtained from Integrated DNA Technologies (Leuven, Belgium).

Cultivation of E. coli

E. coli strains were routinely cultivated in lysogeny broth (LB) supplemented, where appropriate, with 10 g L−1 d-glucose, 50 mg L−1 kanamycin, 50 mg L−1 streptomycin, and 15 g L−1 agar. Rhamnose-utilization deficiency was assessed by cultivating strains in defined mineral medium61 supplemented with 0.1 g L−1 l-leucine, 0.03 g L−1 l-isoleucine, 0.15 g L−1 l-valine, and 10 g L−1 of either d-glucose or l-rhamnose as the major carbon source. Shake flask cultures (LB) were inoculated from monoclonal pre-cultures to an initial OD600 of 0.05 and cultivated in a shaking incubator (37 °C, 200 rpm). Expression was induced at an OD600 of ~0.5 by addition of 2 g L−1 l-rhamnose to the cultures. Microtiter plate cultivations were performed in sterile 96-well plates (flat bottom Nunclon™ Delta Surface, ThermoFisher Scientific, Waltham, MA, USA) containing 200 μL LB per well. Wells were inoculated from monoclonal pre-cultures to an initial OD600 of 0.05 and plates were incubated in an Infinite® M1000 PRO plate reader (Tecan Group, Männedorf, Switzerland) at 37 °C without lid (orbital shaking mode, 6 mm amplitude).

Construction of plasmids

Plasmids used are listed in Supplementary Table 2. All plasmids constructed in the course of this study are based on pSEVA291 (ref. 62). Inserts were generated from synthetic genes or gene fragments and introduced into the vector backbone by conventional restriction-ligation cloning. Maps and sequences of plasmids created in this study can be found in the Supplementary Information (Supplementary Figs. 19–23). A detailed description of library construction procedures is provided below.

Assessment of Bxb1 recombination

Bxb1-mediated inversion of the discriminator was assessed by direct fluorescence measurement, by counting of red and non-fluorescent colonies after retransformation of the isolated plasmid, and by Sanger sequencing of the discriminator region. For direct measurement of mCherry fluorescence, 1 mL samples were collected from shake flask cultivations, spun down in a tabletop centrifuge (1 min, 8000 rcf), and pellets were re-suspended in 1 mL of ice-cold, sterile-filtered phosphate-buffered saline (PBS). To ensure full chromophore maturation, samples were incubated at 4 °C overnight before measurement in an Infinite® M1000 PRO plate reader (Tecan Group, Männedorf, Switzerland) in 96-well plates (flat bottom Nunclon™ Delta Surface, ThermoFisher Scientific, 200 μL per well). Cell-specific mCherry fluorescence was determined by dividing the red fluorescence signal (excitation at λEx = 587 nm, emission at λEm = 610 nm) by the OD600 of each sample. For microtiter plate cultivations, fluorescence of mCherry and sfGFP (λEx = 485 nm, λEm = 535 nm) was measured directly in the culture broth and normalized by OD600. To assess the dynamics of Bxb1-mediated recombination directly on the DNA level, plasmid DNA was extracted from shake flask culture samples and the state of the discriminator was determined by Sanger sequencing. Furthermore, 50 ng of extracted plasmid DNA were used to re-transform E. coli TOP10, and the transformation mixture was plated on LB agar containing 50 mg L−1 kanamycin and 10 g L−1 d-glucose to shut down transcription from PRha. After overnight incubation at 37 °C, plates were stored at 4 °C for maturation of the mCherry chromophore. Afterwards, colonies (at least 275) were manually counted to determine the ratio of clones that had received a plasmid copy of pASPIre1 with a flipped (red colonies) or unflipped (white colonies) discriminator upon transformation.

Construction of knockout strains

All E. coli strains used in this study are listed in Supplementary Table 2. Knockout of the genomic rhaA gene (l-rhamnose isomerase) in the parent strain E. coli TOP10 was achieved using the method described by Datsenko and Wanner63. Primers 1 and 2 were used to generate the required linear DNA fragment containing the kanamycin resistance gene (kanR) from plasmid pKD13 (ref. 63) flanked by sequences homologous to the target locus.

Library generation

RBS libraries were generated via PCR on template plasmid pASPIre3 using forward primer 3 and degenerate reverse primers 4, 5, 6, and 7 to diversify the respective RBS region. The resulting PCR products were digested with PstI and SacI (37 °C, 4 h) and ligated into pASPIre3 pre-treated with the same restriction enzymes. The ligation mixtures were purified and used for electroporation of E. coli TOP10 ΔrhaA. Transformants were plated on several plates of LB agar (10 g L−1 d-glucose, 50 mg L−1 kanamycin, 50 mg L−1 streptomycin) in order to estimate library size by colony counting and to facilitate pooling of the different libraries in defined ratios. After overnight incubation (37 °C), 5 mL of LB were added to the plates and colonies were scraped off. The resulting cell suspensions were pooled and sterile glycerol was added to a final concentration of 15% (v/v). Finally, the OD600 of the cell suspensions was determined before library stocks were snap-frozen and stored at −80 °C until further use.

Library cultivation and sampling

Library stocks were thawed on ice and used to inoculate 600 mL of pre-warmed LB (50 mg L−1 kanamycin) to an initial OD600 of 0.05 in 5 L baffled cultivation flasks. Cultures were incubated (37 °C, 200 rpm) until an OD600 of ~0.5 was reached, and 2 g L−1 l-rhamnose were added to induce Bxb1-mediated recombination. Samples taken throughout the cultivation were immediately mixed with an excess of ice-cold, sterile PBS for rapid cooling and then centrifuged (10 min, 4000 rcf, 4 °C), and cell pellets were frozen on dry ice. Plasmid DNA was subsequently extracted using a commercial kit (ZymoPURE Miniprep Kit, Zymo Research) and stored at −20 °C until further use.

NGS sample preparation

Plasmid DNA isolated from culture samples was digested with NcoI and SacI (37 °C, 4 h). Afterwards, target fragments (308 bp) containing both the RBS and the attP/R site were purified by agarose gel electrophoresis, and sample-specific combinations of customized, indexed DNA duplexes (Supplementary Table 3) were ligated to the sticky ends of the target fragment. For the PCR-amplified sample (Supplementary Fig. 4), NGS fragments were generated using primers 8 and 9 to specifically amplify the target region and add the overhangs required for Illumina sequencing to both ends. The resulting linear DNA fragments were purified by agarose gel electrophoresis, and the concentration of the target was determined by capillary electrophoresis (12-capillary Fragment Analyzer, Advanced Analytical/Agilent). Afterwards, indexed samples were pooled according to their determined concentrations to achieve equal molarity across all samples, and the pooled sample was subjected to NGS.
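
As an illustration of the equimolar pooling step, the short Python sketch below converts hypothetical Fragment Analyzer mass concentrations into molarities for the 308-bp fragment and derives pooling volumes; the sample names and readings are invented for demonstration and are not values from this study.

```python
# Illustrative sketch (not from the original study): equimolar pooling of
# indexed NGS samples based on measured fragment concentrations.

FRAGMENT_BP = 308            # target fragment length after NcoI/SacI digestion
MW_PER_BP = 650.0            # average molecular weight of one dsDNA bp (g/mol)

def ng_per_ul_to_nM(conc_ng_ul: float, length_bp: int = FRAGMENT_BP) -> float:
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    return conc_ng_ul * 1e6 / (length_bp * MW_PER_BP)

# Hypothetical Fragment Analyzer readings for three indexed samples (ng/uL)
samples = {"S1": 4.2, "S2": 2.8, "S3": 6.1}
molarities = {name: ng_per_ul_to_nM(c) for name, c in samples.items()}

# Pool so that each sample contributes the same number of moles as the least
# concentrated sample at a fixed reference volume.
ref_vol_ul = 10.0
limiting = min(molarities, key=molarities.get)
target_moles = molarities[limiting] * ref_vol_ul   # nM * uL

for name, nm in molarities.items():
    vol = target_moles / nm
    print(f"{name}: {nm:6.2f} nM -> pool {vol:5.2f} uL")
```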

NGS

NGS was performed using an Illumina NextSeq 500 platform and a High Output kit v2.5 (75 cycles, PE 33/51) using ~20% genomic PhiX library as spike-in to increase sequence diversity. Primary data analyses were done with Illumina RTA version 2.4.11 and bcl2fastq v2.20.0.422.

Computational scripts and data sets

An annotated script for the processing of NGS data (see below) as well as the pre-processed data sets used in this study are available under: github.com/JeschekLab/uASPIre. A detailed description of the ML models is provided in the separate ML Annex, and code describing how to define and fit the SAPIENs model as well as the resulting parameters of the fitted model can be obtained under: github.com/BorgwardtLab/SAPIENs.

Processing of NGS data

The algorithms for processing of NGS data for this project were written in Bash and Python and are available under: http://github.com/JeschekLab/uASPIre. Briefly, forward and reverse reads retrieved from fastq files were paired, and all reads with more than six consecutive unidentified nucleotides were removed. Afterwards, target fragments were selected by a 10-bp constant region (GAGCTCGCAT, max. 3 mismatches) and sequences from different samples were deconvoluted by their unique combination of two 6-bp indices (Supplementary Table 3). Next, the discriminator state was determined by searching for the presence of an attP or attR site corresponding to the sequences GGGTTTGTACCGTACAC or GCCCGGATGATCCTGAC, respectively (max. 3 mismatches). RBS sequences were determined by retrieving the 17 nucleotides upstream of the bxb1 start codon. Finally, variants with mismatches in the bxb1 CDS in more than 8% of reads were removed to exclude off-target mutations.
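
The following minimal Python sketch illustrates this read-classification logic; it is not the published pipeline (github.com/JeschekLab/uASPIre), and read pairing, index deconvolution, and file handling are omitted. The function names and the simplified start-codon lookup are our own.

```python
# Minimal sketch of the read-classification logic described above.

CONSTANT = "GAGCTCGCAT"           # 10-bp constant region used to select fragments
ATT_P    = "GGGTTTGTACCGTACAC"    # discriminator unflipped (attP)
ATT_R    = "GCCCGGATGATCCTGAC"    # discriminator flipped (attR)
MAX_MM   = 3

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def find_with_mismatches(read: str, motif: str, max_mm: int = MAX_MM):
    """Return the first position where `motif` occurs with <= max_mm mismatches."""
    for i in range(len(read) - len(motif) + 1):
        if hamming(read[i:i + len(motif)], motif) <= max_mm:
            return i
    return None

def classify_read(read: str):
    """Classify one merged read; returns (rbs_sequence, state) or None."""
    if "N" * 7 in read:                       # drop reads with >6 consecutive Ns
        return None
    if find_with_mismatches(read, CONSTANT) is None:
        return None                           # not a target fragment
    if find_with_mismatches(read, ATT_P) is not None:
        state = "unflipped"
    elif find_with_mismatches(read, ATT_R) is not None:
        state = "flipped"
    else:
        return None
    atg = read.find("ATG")                    # bxb1 start codon (simplified lookup)
    if atg < 17:
        return None
    return read[atg - 17:atg], state          # 17-nt RBS upstream of start codon
```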

Internal-standard RBSs and recording of calibration curves

The internal-standard RBSs used in this study are listed in Supplementary Table 4. RBSs R1–R22 were selected from the proof-of-concept library with the goal of spanning the entire range of observed RBS activities. First, kinetic profiles were cropped at 720 min and only RBSs with at least 100 reads per sample were used, which resulted in a set of high-quality profiles for ~9500 RBSs. Afterwards, profiles were grouped according to their dynamic behavior by k-medoid clustering52 (k = 25), and the RBS corresponding to the centre of each cluster was selected as a representative internal-standard RBS (Supplementary Fig. 24). In addition, three weak (R23–R25) and four strong (R26–R29) RBSs handpicked from the initial library as well as two strong RBSs (R30, R31) designed using the RBS calculator41 were included. These RBSs were individually introduced into pASPIre3 by conventional cloning procedures to obtain derivatives that carry the respective RBS sequence controlling Bxb1-sfGFP translation. Activity of these RBSs was assessed by recording the cell-specific fluorescence of three biological replicates of each variant in individual shake flask cultivations. 100 mL of pre-warmed LB (50 mg L−1 kanamycin) in 1 L baffled shake flasks were inoculated from overnight pre-cultures to an initial OD600 of 0.05. Bxb1-sfGFP expression was induced at an OD600 of ~0.5 by adding 2 g L−1 l-rhamnose. Samples taken throughout the cultivation were immediately mixed with an excess of ice-cold, sterile PBS for rapid cooling and then centrifuged (10 min, 4000 rcf, 4 °C). Cell pellets were re-suspended in PBS and suspensions were stored overnight in microcentrifuge tubes at 4 °C for sfGFP maturation. Fluorescence (λEx = 485 nm, λEm = 535 nm) and OD600 were measured in technical triplicates in 96-well plates (Corning 96-well Clear Bottom Black Polystyrol, 200 µL per well) in a TECAN Infinite® M1000 PRO plate reader. Curves of cell-specific sfGFP fluorescence were obtained by normalizing the blanked fluorescence signal by the blanked OD600 measurements and subtracting the cell-specific background fluorescence of an sfGFP-less variant (empty vector control), which was included in every cultivation batch. Furthermore, a dilution series of fluorescein was included in every 96-well plate to compensate for variations of the fluorescence readout over time.

Correlation of Bxb1 recombination with Bxb1-sfGFP levels

In order to convert Bxb1-catalyzed discriminator flipping into cellular Bxb1 concentrations, we compared the recorded cellular fluorescence profiles for the 31 internal-standard RBSs with their corresponding flipping profiles as recorded by NGS. To this end, we sought to (i) establish a combination of summary statistics that exhibit a high degree of correlation between the two measured quantities across the entire range of RBS strengths, (ii) identify the best (potentially non-linear) fit between the two summary statistics, and (iii) ensure that a high degree of diversity is maintained for the representation of the discriminator flipping across the entire set of sequences in the data set. We used integral-based (i.e. area under the curve) summary statistics for the flipping profiles and slope-based representations (i.e. slope of the linear fit) for the fluorescence profiles (Supplementary Fig. 12a, b). For the flipping statistics, we also quantified the diversity of each representation by estimating the differential entropy64 of its probability density (Supplementary Fig. 12c). For each type of summary statistic, we additionally treated the time ranges over which the fluorescence and flipping summaries are computed as additional hyperparameters to be optimized. Moreover, for each pair of candidate summary statistics, we evaluated linear, log-linear, and generalized logistic fits. We quantified the quality of each pair of summary statistics using the R2 of the fit as evaluated by leave-one-out cross-validation on the pool of 31 internal-standard RBSs, in order to compensate for potential effects of overfitting in the analysis. Moreover, the standard deviation of each summary statistic for fluorescence was computed for all internal-standard RBSs relying on the three biological replicates. Further details on the evaluation of summary statistics are provided in the ML Annex.
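
A simplified Python sketch of two such summary statistics and the leave-one-out R2 scoring is shown below; the log-linear fit shown is one of the candidate fits named above, the function names are illustrative, and the integral-based flipping summary would be computed analogously (e.g. with np.trapz).

```python
# Sketch under simplifying assumptions: slope-based fluorescence summary and
# leave-one-out R2 of a log-linear fit between paired summary statistics.

import numpy as np

def fluorescence_slope(t, fluo):
    """Slope-based summary: slope of a linear fit to cell-specific fluorescence."""
    return np.polyfit(t, fluo, 1)[0]

def flipping_auc(t, frac_flipped):
    """Integral-based summary: normalized area under the flipping profile."""
    return np.trapz(frac_flipped, t) / (t[-1] - t[0])

def loo_r2(x, y):
    """Leave-one-out R2 of a log-linear fit y ~ a*log(x) + b."""
    preds = np.empty_like(y, dtype=float)
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        a, b = np.polyfit(np.log(x[mask]), y[mask], 1)
        preds[i] = a * np.log(x[i]) + b
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```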

Optimization of sampling time points

Sampling times were optimized using the high-quality kinetic profiles obtained from the proof-of-concept RBS library (see previous section for the definition of high-quality profiles). To avoid biases towards the initial sampling schedule, we first represented the profile p of each RBS by an approximation with a logistic function \(\hat p\) imputed at 5 min intervals, which was fixed as the minimal time difference between two samples (Supplementary Fig. 25a). In cases where logistic approximation was not possible (i.e. failed parameter optimization), an exponential decay function was used instead. Formally, RBS i is represented by its approximation \(\hat p^i = (\hat p^i_0,\hat p^i_5, \ldots ,\hat p^i_{720})\). Afterwards, optimal sampling times were greedily selected while fixing the first and last samples at 0 and 720 min after induction. The goal of the optimization was to find a set S of optimal sampling times (initially S = (0, 720)) that allows reconstruction of \(\hat p^i\) such that a linear approximation using only time points in S is as close to \(\hat p^i\) as possible. Given the set of possible sampling times T = {0, 5, 10, …, 720} and the subset I of RBS profiles on which the sampling times should be inferred, the greedy optimization finds the next optimal sampling time point s* from T as follows:

$$s^{\ast} = {\rm{argmin}}_{s\, \in \,T\backslash S}\;\frac{1}{\left| T \right|}\mathop{\sum}\limits_{i \in I} \mathop{\sum}\limits_{t \in T} \left| \hat p_t^i - \hat l_t^{i,\,S \cup \{s\}} \right|, \tag{1}$$

where \(\hat l^{i,\,S \cup \{s\}}\) corresponds to the linear approximation of \(\hat p^i\) using only sampling times in \(S \cup \{s\}\) (Supplementary Fig. 25b). In other words, s* is the time point that (i) is not yet part of the sampling schedule S and (ii) results in the smallest cumulative reconstruction error over all RBS profiles in I. Subsequently, S is augmented by s*, and Eq. (1) is evaluated again to find the next optimal s*, until S contains the desired number of sampling times. Finally, the quality of the optimal sampling schedule for every RBS i is evaluated by computing the approximation error ri between the observed profile pi and its linear interpolation at the optimal time points S, termed li,S (Supplementary Fig. 25c):

$$r^i = \frac{1}{|T|}\mathop{\sum}\limits_{t \in T} \left| p_t^i - l_t^{i,S} \right|. \tag{2}$$

This optimization was performed for different sets of RBS profiles I: we first sorted RBSs by their observed strength (i.e. difference in the fraction of flipped discriminators between the first and last sample) and optimized sampling for the top 5%, 10%, 25%, 50%, and 100% of profiles. This strategy was chosen to compensate for the strong bias towards weak RBSs in the initial library. Afterwards, we computed the cumulative approximation error on the entire library (top 100%) for the sampling schedules optimized on the different subsets. We found that for seven or more samples the difference in approximation error between subsets became indistinguishable, and we hence chose the sampling times inferred on the top 10% of profiles for the following experiments.
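
A direct, unoptimized Python sketch of the greedy selection in Eqs. (1) and (2) is shown below, assuming profiles have already been imputed on the 5-min grid T; names and data layout are illustrative.

```python
# Greedy sampling-time selection (Eqs. 1 and 2) on imputed profiles.

import numpy as np

T = np.arange(0, 725, 5)          # candidate sampling times (min)

def reconstruction_error(profiles, schedule):
    """Mean absolute reconstruction error of linear interpolation at `schedule`."""
    s = np.sort(np.asarray(schedule))
    err = 0.0
    for p in profiles:            # profiles: array of shape (n_rbs, len(T))
        l = np.interp(T, s, p[np.searchsorted(T, s)])   # linear approximation
        err += np.abs(p - l).sum()
    return err / len(T)

def greedy_schedule(profiles, n_samples):
    """Greedily add time points minimizing the cumulative error (Eq. 1)."""
    S = [0, 720]                  # first and last samples are fixed
    while len(S) < n_samples:
        candidates = [s for s in T if s not in S]
        best = min(candidates,
                   key=lambda s: reconstruction_error(profiles, S + [s]))
        S.append(int(best))
    return sorted(S)
```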

Optimization of NGS loading

To increase the throughput of uASPIre, we analyzed the data from the proof-of-concept (poc) experiment, which contained kinetic data of ~10,000 RBS variants. We sought to estimate an optimal number of variants to be loaded into NGS in order to retrieve a maximized number of variants with high-quality data (i.e. above different minimal read-count thresholds θ). For this simulation, we assumed that the limiting factor is the NGS throughput and that the maximal number of valid reads (i.e. reads that pass the quality constraints of the pre-processing pipeline) retrieved by NGS is constant across experiments under the same experimental conditions. This simulation is based on the idea that increasing the number of RBS variants reduces the coverage and vice versa, as the maximum number of valid reads is constant. We further assumed that the distribution of read counts follows a log-normal distribution and that its variance is independent of the coverage. The proof-of-concept data set is composed of ~2 × 10⁸ valid reads, which are spread among nt = 18 time points and npoc = 10,427 variants with an average coverage of cov ~1000 reads per variant per time point. If the coverage of the proof-of-concept data set is reduced by a factor of rc > 1, and the number of time points by a factor of rt > 1, the total number of variants that could be loaded into NGS without loss would be ninput(rc, rt) = npoc × rc × rt, by conservation of the maximal number of valid reads. However, out of these ninput(rc, rt) variants, only noutput(θ, rc, rt) < ninput(rc, rt) would pass the quality control as enforced by the minimal read threshold θ. To simulate the effect of the minimal read threshold, we downsampled the read counts of the proof-of-concept data set by a factor rc and applied the minimal read threshold θ, resulting in a number of above-threshold variants nsimul(θ, rc) < npoc. The estimated final number of variants is therefore noutput(θ, rc, rt) = nsimul(θ, rc) × rc × rt. Figure 2e illustrates the estimation of the number of variants for rt = nt/9, several minimal read thresholds θ, and several downsampling factors rc.
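
The following Python sketch illustrates the simulation logic under the stated assumptions; the log-normally distributed read counts are a stand-in for the real per-variant counts of the proof-of-concept data set.

```python
# Sketch of the loading simulation: downsample coverage by r_c, apply the
# minimal read threshold theta, and rescale by conservation of total reads.

import numpy as np

rng = np.random.default_rng(0)
n_poc, n_t = 10_427, 18
# stand-in for the real per-variant read counts (mean coverage ~1000)
read_counts = rng.lognormal(mean=np.log(1000), sigma=1.0, size=n_poc)

def estimated_output(theta, r_c, r_t):
    """Estimated number of above-threshold variants after reducing coverage
    by r_c and the number of time points by r_t."""
    downsampled = read_counts / r_c
    n_simul = np.sum(downsampled >= theta)   # variants passing the threshold
    return n_simul * r_c * r_t               # conservation of total valid reads

print(estimated_output(theta=20, r_c=10, r_t=n_t / 9))
```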

RBS library design

Initial efforts to train a convolutional neural network (CNN)51 on the proof-of-concept data set resulted in a systematic underestimation of RBS strength, in particular for strong RBSs. This is likely due to the library being skewed towards weak sequences as a result of the full randomization of the 17 bases upstream of the Bxb1 start codon (Supplementary Fig. 7). To overcome this, three libraries (High1–3) presumably enriched in moderate-to-strong RBSs were designed in silico based on the proof-of-concept data set and added to a fully randomized library (N17). Libraries High1 and High2 were designed using position probability matrices (PPMs), 2D matrices in which each element represents the proportion of times a nucleotide occurs at a given position in the sequence. To this end, RBSs from the proof-of-concept data set were grouped into 10 linearly distributed bins according to a proxy for the normalized integral of their flipping profile (IFPtrz, contained in [0, 1]), and a PPM was computed for each bin. The IFPtrz was computed using the trapezoidal rule on the flipping profiles. Degenerate RBS sequences for High1 and High2 were designed with the goal of obtaining PPMs that most closely resemble (minimal mean-squared error) the PPMs of the highest and second-highest bin, respectively. Library High3 was designed using a genetic algorithm on the basis of predictions from an initial CNN trained on the proof-of-concept data set. The RBS sequences from the three highest IFPtrz bins were randomly mutated for 200 iterations (1–2 mutations per sequence and iteration). Only sequences for which the predicted IFPtrz was increased by the mutations were propagated to the next iteration. At the end of this process, we calculated the PPM of the resulting pool of sequences with high predicted IFPtrz, randomly sampled 20,000 sequences from this PPM, and computed the predicted IFPtrz distribution for this sub-sample. Finally, the degenerate RBS sequence of High3 was obtained by greedily minimizing the Kolmogorov–Smirnov distance between the predicted IFPtrz distribution of the sub-sample and the predicted IFPtrz distribution of the respective degenerate candidate RBS sequence. For further details regarding the computation of the IFPtrz, the CNN, and the genetic algorithm, please refer to the ML Annex.
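
For illustration, the sketch below computes a PPM from a set of 17-nt sequences and greedily picks, per position, the IUPAC degeneracy code whose uniform base mixture is closest (minimal mean-squared error) to the target PPM column; this per-position heuristic is a simplification of the actual design procedure and the function names are our own.

```python
# Illustrative PPM computation and per-position degenerate-code selection.

import numpy as np

BASES = "ACGT"
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def ppm(sequences):
    """PPM over equal-length sequences: fraction of each base at each position."""
    counts = np.zeros((len(sequences[0]), 4))
    for seq in sequences:
        for pos, base in enumerate(seq):
            counts[pos, BASES.index(base)] += 1
    return counts / len(sequences)

def degenerate_design(target_ppm):
    """Pick, per position, the IUPAC code whose uniform base mixture has
    minimal mean-squared error against the target PPM column."""
    design = []
    for col in target_ppm:
        best, best_err = None, np.inf
        for code, bases in IUPAC.items():
            mix = np.array([1 / len(bases) if b in bases else 0.0 for b in BASES])
            err = np.mean((mix - col) ** 2)
            if err < best_err:
                best, best_err = code, err
        design.append(best)
    return "".join(design)
```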

Normalization of biological replicates

In order to facilitate comparison of biological replicates, we capitalized on the 31 internal-standard RBSs. These serve as internal references spanning a large range of RBS activities and make it possible to compensate for potential batch effects and other systematic biases between replicates. Formally, for each of the 31 internal-standard RBSs, we denote by x and y the measured normalized integral of the flipping profile (IFP) for the biological replicate to be normalized and for the reference replicate, respectively. We fit either a polynomial function of degree two, \(f:\left[ {0,\,1} \right] \to {\Bbb R}\) with f(x) = I + Ax + Bx², or its inverse f(x) = g⁻¹(x) with g(z) = I + Az + Bz², such that the mean-squared error between f(x) and y is minimized across the 31 measurement pairs. Moreover, we impose the following constraints on the parameters of f: first, RBSs that show no activity in one replicate should remain inactive in the other replicates (f(0) = 0). Second, RBSs whose discriminators are entirely flipped before induction in one replicate should exhibit the same behavior in the other replicates (f(1) = 1). Third, the ranking of RBSs according to their strength should be preserved across replicates (f is monotonically non-decreasing in [0, 1]). It should be noted that, empirically, these assumptions appear to hold across the three biological replicates in this study. Imposing the first two constraints reduces the number of free parameters of the polynomial function from three to one, resulting in the family of functions parametrized by A: f(x) = Ax + (1 − A)x². Moreover, the third constraint translates into the following bounds on the free parameter A: 0 ≤ A ≤ 2. This procedure was carried out for each pair of biological replicates. The quality of the resulting fits was then evaluated on the full data sets, excluding the 31 internal-standard RBSs that were used to optimize A.
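
Because the two equality constraints leave a single free parameter, the constrained least-squares fit has a simple closed form; the sketch below implements it under the assumption of an ordinary least-squares objective over the 31 IFP pairs (the inverse-fit branch is omitted).

```python
# Closed-form fit of f(x) = A*x + (1 - A)*x**2 with the bound 0 <= A <= 2.
# x: IFPs of the replicate to normalize; y: IFPs of the reference replicate.

import numpy as np

def fit_replicate_normalization(x, y):
    """Least-squares estimate of A with f(0)=0, f(1)=1, monotonicity enforced."""
    u = x - x ** 2                       # f(x) = x**2 + A*u after constraints
    A = np.sum(u * (y - x ** 2)) / np.sum(u ** 2)
    return float(np.clip(A, 0.0, 2.0))   # monotonicity bound 0 <= A <= 2

def normalize(x, A):
    """Apply the fitted normalization to IFP values of a replicate."""
    return A * x + (1.0 - A) * x ** 2
```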

Machine learning core model

We fitted the flipping profile of each RBS with a generalized logistic function (ML Annex), integrated the fitted kinetic curve between 0 and 480 min after induction, and normalized the integral by dividing by 480 min. The resulting normalized integral value (range between 0 and 1; IFP0–480 min) was used as a descriptor of RBS behavior and was selected as an exemplary target for prediction since it exhibits high correlation with cellular Bxb1-sfGFP levels and a high diversity across the RBS libraries (Supplementary Fig. 12). Initially, we defined a set of preliminary candidate deep-learning architectures for a predictive model according to standard practices12,65. These included convolutional neural networks (CNNs) with and without residual blocks, as well as multilayer perceptrons. These architectures were assessed as part of the hyperparameter selection process, which indicated superior performance of the CNN with residual blocks (ResNet)49,50 for this particular application, resulting in a model with three residual blocks of two convolutional layers and two sets of two fully connected layers. We applied three main variations to the ResNet model in order to improve predictive accuracy and additionally provide a measure of predictive uncertainty. First, we chose the negative log-likelihood, which is a proper scoring rule, as the training criterion to achieve better uncertainty estimates56. The predicted IFP0–480 min was modeled using a beta distribution, as it provides a flexible distribution with support in the interval [0, 1]. Second, the last two fully connected layers in the network were modified to output two values instead of one, thereby allowing the two shape parameters of the predictive beta distribution to be parametrized independently for each input sequence. Equivalently, as the first two moments of the beta distribution are functions of the shape parameters, we were able to retrieve the mean and the standard deviation of the predictive distribution for each input sequence. Third, we used an ensemble of N = 2 × 5 ResNet models56, each trained separately with a different random initialization of network parameters, a different random order of training sequences during stochastic gradient-based optimization, and different architecture and optimizer hyperparameters. This third variation helped increase predictive accuracy and capture epistemic uncertainty. The final model, SAPIENs, is an ensemble composed of five ResNet models with three residual blocks of two convolutional layers, composed of 64 filters of sizes 9 and 1, respectively, followed by two sets of two fully connected layers with 64 and 1 units, respectively (weight decay parameter: 10−6, learning rate: 0.01), and five ResNet models with three residual blocks of two convolutional layers, composed of 512 filters of sizes 10 and 1, respectively, followed by two sets of two fully connected layers with 64 and 1 units, respectively (weight decay parameter: 10−6, learning rate: 0.001). In all cases, we kept a held-out test set and split the remaining data set into a training and a validation set while keeping the same proportion of strong RBSs, as defined by the 15th percentile of the IFP0–480 min distribution. We used softplus activation functions for the two output layers to ensure positive shape parameters, and batch-normalization66 followed by LeakyReLU activation functions between each layer. For optimization, we used the Adam optimizer67. The model was implemented in Keras with the Tensorflow68 backend.
All hyperparameters (number of filters and layers, filter sizes, number of units of the fully connected layers, weight decay, learning rate, batch size) were selected by random search69 on the basis of their performance on the validation set. Additional details about the neural network can be found in the ML Annex.
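
A hedged Keras/TensorFlow sketch of a single ensemble member with a beta-distributed output is shown below, following the first configuration described above (64 filters of sizes 9 and 1, two heads of 64 and 1 units with softplus outputs, beta negative log-likelihood); the published SAPIENs code (github.com/BorgwardtLab/SAPIENs) may differ in detail, and the input projection layer is our simplification to make the skip connections dimensionally valid.

```python
# Sketch of one ResNet ensemble member with a beta-distributed output.

import tensorflow as tf
from tensorflow.keras import layers, Model

def residual_block(x, filters, width):
    skip = x
    x = layers.Conv1D(filters, width, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Conv1D(filters, 1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, skip])
    return layers.LeakyReLU()(x)

def shape_head(x):
    # one set of two fully connected layers (64 then 1 units) per shape parameter
    h = layers.Dense(64)(x)
    h = layers.BatchNormalization()(h)
    h = layers.LeakyReLU()(h)
    return layers.Dense(1, activation="softplus")(h)   # positive shape parameter

def build_member(seq_len=17, filters=64, width=9):
    inp = layers.Input(shape=(seq_len, 4))              # one-hot encoded RBS
    x = layers.Conv1D(filters, 1, padding="same")(inp)  # project to filter dim
    for _ in range(3):                                  # three residual blocks
        x = residual_block(x, filters, width)
    x = layers.Flatten()(x)
    alpha, beta = shape_head(x), shape_head(x)
    return Model(inp, layers.Concatenate()([alpha, beta]))

def beta_nll(y_true, y_pred):
    """Negative log-likelihood of the measured IFP under Beta(alpha, beta)."""
    alpha, beta = y_pred[:, 0], y_pred[:, 1]
    y = tf.clip_by_value(tf.squeeze(y_true), 1e-6, 1.0 - 1e-6)
    log_b = (tf.math.lgamma(alpha) + tf.math.lgamma(beta)
             - tf.math.lgamma(alpha + beta))
    return -((alpha - 1.0) * tf.math.log(y)
             + (beta - 1.0) * tf.math.log(1.0 - y) - log_b)

model = build_member()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss=beta_nll)
```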

Uncertainty estimation

The measured IFP0–480 min for each RBS was modeled as a draw from a beta distribution. The mean and variance of this distribution estimated by the ResNet model (see above) correspond to the predicted IFP0–480 min value and an indication of the aleatoric uncertainty of the prediction, respectively. To complement this aleatoric estimate with an estimate of epistemic uncertainty, we first used an ensemble of N = 5 ResNet models with identical architecture and optimizer hyperparameters but different random parameter initializations and orderings of the input sequences. The uncertainty estimate is therefore given by the standard deviation of the mixture of N = 5 beta distributions (Supplementary Fig. 15a–c). Furthermore, we extended this ensemble strategy at a later stage by also including M different configurations of the higher-level hyperparameters, such as architecture and optimizer hyperparameters, with five ResNet models per configuration, resulting in a total of N = M × 5 ResNet models in the ensemble. Finally, M = 2 configurations were chosen as a trade-off between predictive performance and computational complexity (Supplementary Fig. 15d). The reliability diagram for this final ResNet ensemble (SAPIENs, N = 2 × 5) showed well-calibrated uncertainty estimates (Fig. 4f), indicating that the uncertainty of each predicted target value is properly accounted for. This is confirmed by the fact that the mean absolute error is positively correlated with the predicted standard deviations (Supplementary Fig. 15e). Both results suggest that the predicted standard deviations can be used as scores to evaluate the quality of each individual prediction.
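
Given the per-model shape parameters, the mean and standard deviation of the mixture follow from the law of total variance, as in this short numpy sketch (equal mixture weights assumed; function name is ours).

```python
# Mean and standard deviation of an equally weighted mixture of beta
# distributions parametrized by per-model shape parameters (alpha_i, beta_i).

import numpy as np

def mixture_mean_std(alphas, betas):
    alphas, betas = np.asarray(alphas, float), np.asarray(betas, float)
    means = alphas / (alphas + betas)                       # per-model means
    variances = (alphas * betas /
                 ((alphas + betas) ** 2 * (alphas + betas + 1)))
    mix_mean = means.mean()
    # law of total variance: E[var] (aleatoric) + var(E) (epistemic)
    mix_var = variances.mean() + ((means - mix_mean) ** 2).mean()
    return mix_mean, np.sqrt(mix_var)
```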

Minimal read number threshold

A minimal threshold for the number of NGS reads per RBS was determined as a quality control criterion for both training and test sets. Increasing this threshold is expected to trade off two opposite effects: it increases the average quality of the data, leading to a decrease in the underlying aleatoric uncertainty, but at the same time reduces the data set size available for training, which generally lowers predictive performance. To this end, we first defined six filtered data sets obtained by keeping only sequences with at least 10, 15, 20, 30, 40, or 50 reads per sampling time point. Then, we randomly split each filtered data set into training, validation and test sets as described above and made sure that for each split the high-quality training, validation and test sets were contained in the lower-quality training, validation and test sets, respectively. Moreover, a test set was held out for the following prediction experiments. In order to identify an optimal lower read-count threshold, we trained a single ResNet model for 150 epochs. We performed a random search over hyperparameters69 (see above), used the same 150 sets of hyperparameters for each filtered training data set, and determined the coefficient of determination on the validation set. Hence, the minimal threshold was effectively treated as a hyperparameter. This analysis indicated that a minimal threshold of 20 reads per time point was optimal for predictive performance, which saturated for lower thresholds despite the increase in overall data set size (Supplementary Fig. 9a). We kept this training/validation/test split (Split0) for the following prediction experiments. Finally, we confirmed that these conclusions were not an artifact of the random split of the original data set by repeating this analysis using five different training, validation and test-set splits (Supplementary Fig. 9b).

Evaluation and benchmarking of the prediction model

Using Split0, we evaluated our model in more detail. Importantly, the test set had not been used in any of the previous experiments in order to avoid overfitting. First, we used random search to select the best combination among 150 sets of hyperparameters on the validation set (see above), trained SAPIENs for 300 epochs, and applied an early stopping criterion on the validation set to avoid overfitting by selecting the epoch with the best validation R2 (Fig. 4c). To compare our single ResNet models and SAPIENs to different available ML approaches, we trained different models (Fig. 4b, e) on the training set and tuned their hyperparameters by optimizing predictive performance on a subset of the validation set of Split0. The single ResNet and SAPIENs models were trained for a maximum of 150 epochs, using early stopping. A total of 100 randomly generated model architectures with 1–3 residual blocks were considered. Hyperparameters tuned for the other models were the regularization strength for Ridge Regression52, the number of neighbors K for k-Nearest Neighbors53, the number of trees for Random Forests54, and the maximum depth and learning rate for Gradient Tree Boosting55; the latter also benefited from early stopping on the validation set. The impact of the training set size on predictive performance (Fig. 4e) was evaluated by training the different models on smaller data sets, while ensuring that the training and validation sets were contained in the training and validation sets of the higher-sample-size experiments (i.e. nested training and validation sets). Hyperparameters for all models were optimized independently for each training set size on the corresponding validation set. The effect of adding designed sub-libraries to increase the fraction of stronger RBSs in the bulk library (Fig. 3e) was further analyzed to evaluate a potential gain in predictive performance for intermediate and strong sequences (Supplementary Fig. 14). To this end, we performed cross-analyses with the fully degenerate sub-library (N17) and the bulk library (N17+High1-3). We trained on N17 and predicted on unseen subsets of N17 and N17+High1-3, and trained on N17+High1-3 and predicted on unseen subsets of N17 and N17+High1-3 (Supplementary Fig. 14a). In another set of analyses, we omitted each of the enriched sub-libraries during training by moving them to the test sets and evaluated the corresponding effect (Supplementary Fig. 14b). For computational reasons, we trained a single ResNet model for 300 epochs with early stopping on the validation set. Hyperparameters were tuned independently for each data set and selected from 150 random configurations on the corresponding validation set. All analyses were done with the same training and validation set sizes. Comparative analyses were performed with the same test set.
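
An illustrative scikit-learn setup for such a baseline comparison is sketched below; the hyperparameter values shown are placeholders, not the tuned values from the study, and the flattened one-hot featurization is an assumption.

```python
# Illustrative baseline comparison on flattened one-hot RBS features.

from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score

baselines = {
    "Ridge": Ridge(alpha=1.0),                          # tune regularization strength
    "kNN": KNeighborsRegressor(n_neighbors=10),         # tune number of neighbors K
    "RandomForest": RandomForestRegressor(n_estimators=500),  # tune number of trees
    "GTB": GradientBoostingRegressor(max_depth=3, learning_rate=0.1),
}

def evaluate(X_train, y_train, X_test, y_test):
    """Fit each baseline and report its R2 on the held-out test set."""
    return {name: r2_score(y_test, est.fit(X_train, y_train).predict(X_test))
            for name, est in baselines.items()}
```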

Evaluation of sequence motifs and model interpretation

We analyzed the fully degenerate sub-library (N17) in order to measure the impact of the position of known motifs of influence on RBS activity, such as start codons (AUG, UUG, GUG) or the consensus Shine–Dalgarno sequence (AGGAGG and subsequences). To this end, for each position and each group of RBSs presenting the motif of interest at that position, we calculated simple statistics (median, interquartile range, 20th/80th percentiles) on the target IFP0–480 min of the sequences in the group (Fig. 5a, b; Supplementary Fig. 17c, d). We excluded from these groups RBSs that contained at least one start codon other than the one at the position of interest. We also analyzed the filters of the first convolutional layer (excluding the first skip connection) of a ResNet model of the ensemble chosen at random (Fig. 5c). To this end, the effect of each filter was evaluated by calculating Pearson's correlation coefficient between the filter activations at each position and the flipping integral for all sequences in the test set. Consequently, each filter is represented by a vector of correlations of size 17, which corresponds to the number of positions at which the filter influence is estimated. Finally, the filter representations were clustered into 12 groups using complete-linkage clustering with the Hamming distance as the underlying metric, in order to group filters of similar influence. The integrated gradients57 method assigns attribution scores to each base and position by computing the linear path integral between the sequence of interest and a baseline sequence chosen a priori. The attribution scores measure the effect of individual bases on the predicted IFP0–480 min relative to the baseline. We applied the integrated gradients method to SAPIENs and chose a "blank" one-hot encoded sequence as a neutral baseline (i.e. an all-zeros array). We first used a dimension-reduction method, t-distributed stochastic neighbor embedding (tSNE), to visualize how sequences behave in a low-dimensional space (perplexity = 12, early exaggeration = 30) (Fig. 5d). We also averaged the attribution scores of all sequences in the test set, per base and per position, to better understand which positions and bases contribute to high or low RBS activity (Fig. 5e). Finally, in order to account for non-linearities between positions and to understand the drivers of very strong or very weak sequences, we selected the top 5% and the bottom 5% of sequences in the test set after removing outlier sequences, and clustered each pool into five clusters with k-means according to their attribution-score profiles. The medoids of these five clusters are displayed for the strong (Fig. 5f) and weak RBSs (Supplementary Fig. 18). For in silico evolution, we selected the weakest (respectively strongest) sequence in the test set and aimed to mutate it progressively towards a sequence with the maximum (respectively minimum) attainable RBS activity as predicted by SAPIENs (Fig. 5g, h). To do so, we considered all sequences that could result from applying one or two mutations to the current sequence and kept the strongest (respectively weakest) one in each round, until no candidate exhibited a change in predicted IFP0–480 min in the desired direction.
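
A minimal sketch of integrated gradients with an all-zeros baseline is shown below, applied to a Keras model that outputs beta shape parameters (as above); the Riemann-sum approximation and the `steps` value are standard choices, not taken from the study.

```python
# Integrated gradients for a one-hot sequence with a "blank" baseline.

import numpy as np
import tensorflow as tf

def integrated_gradients(model, seq_onehot, steps=50):
    """Per-base/per-position attributions relative to an all-zeros baseline."""
    baseline = np.zeros_like(seq_onehot)
    alphas = np.linspace(0.0, 1.0, steps)
    path = np.stack([baseline + a * (seq_onehot - baseline) for a in alphas])
    path = tf.convert_to_tensor(path, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(path)
        preds = model(path)                               # (steps, 2): alpha, beta
        mean = preds[:, 0] / (preds[:, 0] + preds[:, 1])  # predicted IFP mean
    grads = tape.gradient(mean, path)                     # (steps, 17, 4)
    avg_grads = grads.numpy().mean(axis=0)                # Riemann-sum average
    return (seq_onehot - baseline) * avg_grads            # (17, 4) attributions
```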

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
