In this contribution we present our workflow for model prediction in E. coli fed-batch production processes using R. The major challenges in this context are the fragmentary understanding of bioprocesses and the severely limited real-time access to process variables related to product quality and quantity. Data driven modeling of process variables in combination with model predictive process control concepts represent a potential solution to these problems. In R the statistical techniques best qualified for bioprocess data analysis and modeling are readily available.
In a benchmark study the performance of a number of machine learning methods is evaluated, i.e., random forest, neural networks, partial least squares and structured additive regression models. For that purpose a series of recombinant E. coli fed-batch production processes with varying cultivation conditions employing a comprehensive on- and offline process monitoring platform was conducted. The prediction of cell dry mass and recombinant protein based on online available process parameters and two-dimensional multi-wavelength fluorescence spectroscopy is investigated. Parameter optimization and model validation are performed in the framework of a leave-one-fermentation-out cross validation. Computations are performed using among others the R packages robfilter, boost, nnet, randomForest, pls and caret. The results clearly argue for a combined approach: neural networks as modeling technique and random forest as variable selection tool.