# Computer-based work (mostly statistics and programming)

**Guestimating GHG break-even point for biomass gasification**

Biomass gas generation from wood, manure, compost and the like produces CH4 under anaerobic conditions. Since renewable resources are used, biomass gas is considered sustainable. However, CH4 is a potent greenhouse gas, with a global warming potential 84 times that of CO2 over 20 years (https://en.wikipedia.org/wiki/Greenhouse_gas), and all biomass gasifiers leak. In industrial settings, leakage is under 5% (https://www.umweltbundesamt.de/themen/biogasanlagen-muessen-sicherer-emissionsaermer). In developing countries, particularly with self-made biomass generators and manual methane transport (https://www.deutschlandfunk.de/mini-biogasanlagen-fuer-afrika-wirtschaftsfoerderung-statt.1773.de.html?dram:article_id=459738), leakage can easily be expected to reach 20-30%. The aim of this project is to compute the break-even point of biomass gasification, given the difference in global warming potential between CO2 and CH4: how much leakage is acceptable before gasification causes more problems than it solves? The key points are to (a) establish a transparent derivation of the balance; and (b) consider different time horizons of GHG activity of the two gases.
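As a first, deliberately crude sketch of such a balance, one could assume that every kilogram of CH4 that is captured and burned replaces one kilogram of fossil CH4, that CO2 from burning biogenic CH4 counts as climate-neutral, and that nothing would be emitted without the gasifier. These assumptions, the function names, and the 100-year GWP of 28 (an assumed value taken from IPCC AR5) are not part of the project description; only the 20-year GWP of 84 is quoted above.

```python
# Back-of-the-envelope break-even leakage for biomass gasification.
# Assumptions (illustrative, not from the project description): burned gas
# replaces fossil CH4 one-to-one, biogenic CO2 is counted as climate-neutral,
# and there are no emissions in the absence of the gasifier.

CO2_PER_CH4 = 44.0 / 16.0  # kg CO2 released per kg CH4 burned (molar masses)


def break_even_leakage(gwp_ch4):
    """Leakage fraction f at which warming from leaked CH4 equals the
    fossil CO2 avoided by the burned share: f * GWP = (1 - f) * 44/16."""
    return CO2_PER_CH4 / (gwp_ch4 + CO2_PER_CH4)


def net_benefit(leak_fraction, gwp_ch4):
    """Net avoided emissions in kg CO2-eq per kg CH4 produced.
    Negative values mean the gasifier does more harm than good."""
    avoided = (1.0 - leak_fraction) * CO2_PER_CH4
    leaked = leak_fraction * gwp_ch4
    return avoided - leaked


if __name__ == "__main__":
    # 84 is the 20-year GWP quoted in the text; 28 is an assumed 100-year value.
    for horizon, gwp in [(20, 84.0), (100, 28.0)]:
        f = break_even_leakage(gwp)
        print(f"{horizon}-year horizon: break-even leakage = {f:.1%}")
```

Under these assumptions the break-even leakage is roughly 3% for the 20-year horizon and roughly 9% for the 100-year horizon. A real derivation would have to replace each assumption with a defensible counterfactual, which is the actual work of the project.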

**Suitable as:** MSc project, in collaboration with Prof. Stefan Pauliuk (Industrial Ecology).

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**An Individual-Based Model of African elephant demography**

In his PhD thesis, Severin Hauenstein developed a population model for the African elephant. It describes survival and fecundity as a function of elephant size, rather than age (hence called an “integral projection model”, IPM). It thereby accommodates environmentally driven variation in growth rates, e.g. slower growth during droughts. The carrying capacity, and hence the density-dependence of demographic rates, is also integrated in this model. So far, this is a so-called “mean-field approach”, in which no consideration is given to variability between individuals: given their size, all individuals have the same model parameters.

An alternative approach to modelling population dynamics, “individual-based models” (IBMs, a.k.a. agent-based models), allows representing variability among individuals. This is relevant only when the feature of interest, e.g. population size or population growth rate, is a non-linear function of the model parameters. That is the case in this demographic model. What is unclear is how much the representation of individual variability will affect model predictions.

The idea of this project is hence to re-implement the demographic model as an IBM, and compare its simulations with those of the original. One advantage of the IBM is that it is relatively easy to add further details and features. One disadvantage is that an IBM is much slower and hence more time-consuming to run repeatedly, although this should not matter for such a simple model.
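To give a feel for what such an IBM looks like in code, here is a minimal sketch of a size-structured population with density-dependent fecundity. It is not the model from the thesis: the survival curve, growth rule and all parameter values are hypothetical placeholders chosen only to make the example run.

```python
import math
import random


def survival_prob(size):
    """Hypothetical logistic survival curve: larger animals survive better."""
    return 0.98 / (1.0 + math.exp(-3.0 * (size - 0.5)))


def simulate_ibm(years, growth_sd, seed=1, n_start=200, capacity=1000):
    """Minimal size-structured IBM; returns the population size per year.

    growth_sd = 0 gives identical growth rates for all individuals;
    growth_sd > 0 adds between-individual variability in growth.
    """
    rng = random.Random(seed)

    def new_individual(size):
        # Each individual carries its own (non-negative) growth rate.
        return {"size": size, "growth": max(0.0, rng.gauss(0.2, growth_sd))}

    pop = [new_individual(rng.uniform(1.0, 4.0)) for _ in range(n_start)]
    trajectory = []
    for _ in range(years):
        # Density dependence: fecundity declines linearly towards capacity.
        fecundity = 0.25 * max(0.0, 1.0 - len(pop) / capacity)
        next_pop = []
        for ind in pop:
            if rng.random() > survival_prob(ind["size"]):
                continue  # individual dies this year
            # Growth slows as size approaches an asymptote of 4 units.
            ind["size"] += ind["growth"] * (1.0 - ind["size"] / 4.0)
            next_pop.append(ind)
            # Only sufficiently large individuals reproduce.
            if ind["size"] > 2.0 and rng.random() < fecundity:
                next_pop.append(new_individual(1.0))
        pop = next_pop
        trajectory.append(len(pop))
    return trajectory


if __name__ == "__main__":
    print(simulate_ibm(50, growth_sd=0.0)[-1])
    print(simulate_ibm(50, growth_sd=0.1)[-1])
```

Running the simulation with `growth_sd=0.0` and with `growth_sd > 0` over many seeds is one simple way to ask how much individual variability changes the predicted trajectory.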

**Suitable as:** MSc project, requiring an interest in programming, preferably in Python (or Julia), NetLogo or C/C++ or, if need be, in R.

As a BSc project, various model improvements can be considered, e.g. integrating the currently separate calving model into the IPM, or a more GIS-related look-up functionality for extracting model parameters from the African Elephant Demographic Database.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:** Boult, V. L., Quaife, T., Fishlock, V., Moss, C. J., Lee, P. C., & Sibly, R. M. (2018). Individual-based modelling of elephant population dynamics using remote sensing to estimate food availability. *Ecological Modelling*, *387*, 187–195. https://doi.org/10.1016/j.ecolmodel.2018.09.010

**How does overdispersion of count data (non-independent events) affect quantitative network analysis?**

Network analysis is a popular tool for understanding the complexity of ecosystems with respect to species interactions, for example those between plants and their pollinators. Quantitative networks are supposed to be more meaningful for ecosystem functions and more robust to sampling effects. However, many methods for quantitative networks assume that network data (interaction frequencies) are based on independent events. Just as in regular Poisson regression, this assumption may often be violated: multiple visits by the same individual, social behavior or spatiotemporal heterogeneity may lead to non-independence of interaction events, potentially strongly influencing network patterns and compromising inference. Such effects are particularly severe for counts from pollen or fecal analysis, which are thus often not analysed in a fully quantitative way. This project has the potential to challenge conclusions of hundreds of published research papers.

**Methods:** This thesis will explore the influence of this effect on the estimation of specialization and on the significance of patterns inferred from null models. It will combine:

- data simulation using statistical models or (optionally) simple process-based models

- analysis of existing datasets (for which e.g. number of individuals interacting can be compared to the number of visits)

- exploration of solutions to the problem (e.g. log-transformation, using prevalence instead of fully quantitative data, hierarchical models, or self-developed methods that explicitly account for overdispersion)
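The data-simulation step could start from something like the following sketch, written in Python for illustration although the project itself is R-based. It draws visit counts for one pollinator on five equally preferred plants, either as independent Poisson events or as gamma-Poisson (negative binomial) counts, and compares the Shannon diversity of partners. The function names, the dispersion value and the use of Shannon diversity as a stand-in for a specialization index are assumptions made for this example.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's multiplication method
    (adequate for the small means used here)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_counts(rng, means, dispersion=None):
    """Simulate one row of interaction counts from expected counts.

    dispersion=None gives independent (Poisson) events; a finite value k
    gives gamma-Poisson counts with variance mu + mu**2 / k.
    """
    counts = []
    for mu in means:
        if dispersion is not None:
            # Random multiplier with mean 1 and variance 1 / k.
            mu = mu * rng.gammavariate(dispersion, 1.0 / dispersion)
        counts.append(poisson(rng, mu))
    return counts


def shannon(counts):
    """Shannon diversity of interaction partners (0 if no visits)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def mean_partner_diversity(dispersion, n_rep=500, seed=42):
    """Average partner diversity of a true generalist over n_rep samples."""
    rng = random.Random(seed)
    means = [5.0] * 5  # one pollinator, five equally preferred plants
    total = sum(shannon(simulate_counts(rng, means, dispersion))
                for _ in range(n_rep))
    return total / n_rep


if __name__ == "__main__":
    print("Poisson:          ", round(mean_partner_diversity(None), 3))
    print("Negative binomial:", round(mean_partner_diversity(0.5), 3))
```

Because the true preferences are identical in both cases, any drop in observed diversity under the negative binomial setting is produced by non-independence alone, which is the kind of bias the thesis would quantify.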

**Suitable as:** BSc or MSc thesis project

**Requirements:** strong dedication to working with R; basic programming and statistics skills in R

**Time:** can start anytime.

**Contact: Dr. Jochen Fründ**, jochen.fruend@biom.uni-freiburg.de, 0761/203-3747

**Automatising statistical analyses**

Why does every data set require the analyst to start over with all the things she has learned during her studies? Surely much of this can be automatised!

Apart from attempts to produce human-readable output from statistical analyses, efforts to automatise even simple analyses have not made it onto the market. But some parts of a statistical analysis can surely be automatised, in a supportive way. For example, after fitting a model, model diagnostics should be relatively straightforward to carry out and report automatically. The same goes for comparing the fitted model with some hyperflexible algorithm to see whether the model could be improved in principle, or for automatic proposals for the type of distribution to use, for dealing with correlated predictors, or for plotting main effects.

Here is your chance to have a go! In addition to the fun of inventing and implementing algorithms to automatically do something, you will realise why some things are not yet automatised.

This project has many potential dimensions. It could focus on traditional model diagnostics, or on automatised plotting, or on comparisons of GLMs with machine learning approaches to improve model structure, or ...

If you prefer, you can look at this project differently, in the context of "analyst degrees of freedom". The idea is that in any statistical analysis the analyst faces many decisions. Some are influential, others less so. As a consequence, the final p-value of a hypothesis test may be as reported, or may be distorted by the choices made. Implementing an "automatic statistician" as an interactive pipeline allows us to go through all combinations of decisions, in a factorial design, and evaluate which steps have large (bad) and which have small (good) effects on the correctness (nominal coverage) of the final p-value.
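The factorial design over analyst decisions can be sketched in a few lines, shown here in Python for illustration although the project itself would be implemented in R. The decision names and options are purely hypothetical, and `run_analysis` stands for whatever analysis function the project would eventually provide.

```python
from itertools import product

# Hypothetical analyst decisions; the options are illustrative only.
DECISIONS = {
    "outliers": ["keep", "remove"],
    "transform": ["none", "log", "sqrt"],
    "family": ["gaussian", "poisson", "negative binomial"],
    "selection": ["full model", "stepwise AIC"],
}


def all_pipelines(decisions):
    """Return every combination of choices as a list of dicts (full factorial)."""
    names = list(decisions)
    return [dict(zip(names, combo)) for combo in product(*decisions.values())]


def evaluate(pipelines, run_analysis):
    """Apply a user-supplied analysis function to every pipeline.

    run_analysis takes one pipeline dict and returns e.g. a p-value.
    """
    return [(pipeline, run_analysis(pipeline)) for pipeline in pipelines]


if __name__ == "__main__":
    pipelines = all_pipelines(DECISIONS)
    print(len(pipelines), "pipelines, e.g.", pipelines[0])
```

Applying `evaluate` to many simulated data sets with a known true effect would then show which decisions shift the distribution of p-values most.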

**Suitable as:** BSc/MSc project

**Requirements:** Willingness to engage in R programming and abstract thinking. Frustration tolerance to error messages.

**Time:** The project can start anytime.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Identify underlying processes in multi-state environmental data using exploratory statistics and deep learning**

In an environmental system, system states are causally linked in complex ways. For example, soil moisture affects sap flow and photosynthesis, but more rain does not mean more sap flow. Such non-linear interrelationships can be represented, in principle, by deep neural networks. Since the monitored data comprise drivers (radiation, rainfall) as well as responses (sap flow, soil moisture), and since the relevant processes act at potentially very different time scales (minutes to weeks), it is unclear (a) what potential deep learning offers, and (b) how to efficiently construct such networks for maximal information gain. The aim would then be to inspect the represented relationships in order to improve our understanding of the system.

In a first step, data will be simulated using an ecosystem model (e.g. Landscape-DNDC or similar), so as to be sure that the linkages between processes and scales are known.

Two approaches seem to be interesting starting points: autoencoder (AE) and reservoir computing. AE is akin to a non-linear PCA and tries to reduce the dimensionality of the data by finding a simple, if non-linear, representation. It consists of an encoder and a decoder step, where the first leads to a latent description, while the latter links this back to the data. Copula?

Reservoir computing (e.g. echo state networks), in contrast, targets dynamic systems and works by representing the input (including lagged versions of the input) in a fixed but large set of possible interactions (the reservoir). Being “fixed” means here that weights are assigned randomly. Only the output (or rather the “readout” layer) is then linked to the response variable through linear regression.
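The echo state idea can be sketched with nothing but the Python standard library. The example below is a toy under stated assumptions: a reservoir of 20 units, row-wise scaling of the random weights (a simple stand-in for the usual spectral-radius scaling), a ridge-regularised linear readout, and a sine wave as the only input. All names and settings are illustrative and unrelated to the CAOS data.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def echo_state_fit(inputs, targets, n_res=20, ridge=1e-4, seed=0):
    """Fit an echo state network; returns (fitted values, readout weights)."""
    rng = random.Random(seed)
    # Fixed random reservoir. Each row is scaled to an absolute sum of 0.9,
    # which bounds the spectral radius below 1.
    w = []
    for _ in range(n_res):
        row = [rng.uniform(-1, 1) for _ in range(n_res)]
        scale = 0.9 / sum(abs(v) for v in row)
        w.append([v * scale for v in row])
    w_in = [rng.uniform(-1, 1) for _ in range(n_res)]

    # Drive the reservoir with the input and record its states.
    x = [0.0] * n_res
    states = []
    for u in inputs:
        x = [math.tanh(sum(w[i][j] * x[j] for j in range(n_res)) + w_in[i] * u)
             for i in range(n_res)]
        states.append([1.0] + x)  # leading 1.0 acts as the intercept

    # Only the readout is trained: ridge regression via the normal equations.
    p = n_res + 1
    xtx = [[sum(s[i] * s[j] for s in states) + (ridge if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(s[i] * y for s, y in zip(states, targets)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bk * v for bk, v in zip(beta, s)) for s in states]
    return fitted, beta


def one_step_sine_demo():
    """One-step-ahead fit of a sine wave; returns (MSE, target variance)."""
    series = [math.sin(0.2 * t) for t in range(301)]
    inputs, targets = series[:-1], series[1:]
    fitted, _ = echo_state_fit(inputs, targets)
    mse = sum((f - y) ** 2 for f, y in zip(fitted, targets)) / len(targets)
    mean = sum(targets) / len(targets)
    var = sum((y - mean) ** 2 for y in targets) / len(targets)
    return mse, var


if __name__ == "__main__":
    print(one_step_sine_demo())
```

The point of the design is visible in `echo_state_fit`: the reservoir weights are drawn once and never updated, so fitting reduces to a single linear regression on the recorded states.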

Regrettably, it is unclear which approach is particularly suitable for the problem at hand, and to what extent the combined fitting of several system states actually confers an advantage over separate state-wise modelling (i.e. building a model for each Y separately using some ML algorithm).

Data are provided by the CAOS project (hydrology): multiple years of 12 system states at 40 sites, recorded at hourly intervals. (Also WSL data for only 1 year, or anything coming up from EcoSense.)

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Process-integration into neural networks: using 3-PGN for PROFOUND**

Neural networks are all the rage. They require representative data, however, i.e. data that describe the underlying processes well. For many environmental systems, we have a rather good process understanding, particularly in forest growth and forest C-fluxes, but also in hydrology. In this case, it would be silly to ignore this knowledge when fitting a flashy neural network to observed data.

This project shall implement and compare different ways to integrate a process model into neural networks. The basic approach has been implemented and tested for C-fluxes in a boreal forest and a simple ecophysiological model. The next step is to use a somewhat more flexible forest growth model (3-PGN), which in principle also represents mixed stands, N-dynamics and management.

**Suitable as:** MSc project, requires interest in “deep learning” and Python. Python code and data are available for the previous process model.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:** Willard, J., Jia, X., Xu, S., Steinbach, M., & Kumar, V. (2021). Integrating scientific knowledge with machine learning for engineering and environmental systems. *ArXiv*, *2003.04919 [physics, stat]*. http://arxiv.org/abs/2003.04919

**State-space model for tree-ring growth**

Analysis of tree-ring width is a very standardised statistical approach, but it is neither intuitive, nor would it be what I would do based on how we teach GLMs and mixed-effect models.

Actually, this kind of data is surprisingly messy: it features temporal autocorrelation, non-linear growth that depends on both age and the previous year’s growth, and environmental/stand conditions around the trees.

The approach would thus be to (1) analyse some data in the way “everybody” does, and (2) compare that to an incrementally more complicated analysis more in line with non-linear state-space models. Ideally, and depending on skills and progress, data should be simulated with a specific growth model in mind, and then both approaches should be compared as to whether they recover the parameters used.
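The simulate-and-recover idea can be illustrated with a deliberately simple linear state-space model, in which latent growth follows an AR(1) process and ring width is measured with error. The model form and all parameter values are assumptions made for this sketch; the project would use a more realistic, non-linear growth model.

```python
import random


def simulate_ring_widths(n_years, growth_ar=0.7, mean_growth=2.0,
                         process_sd=0.3, obs_sd=0.3, seed=0):
    """Simulate latent growth x_t and observed ring widths y_t.

    State equation:       x_t = mu + phi * (x_{t-1} - mu) + process noise
    Observation equation: y_t = x_t + measurement noise
    """
    rng = random.Random(seed)
    x = mean_growth
    latent, observed = [], []
    for _ in range(n_years):
        x = mean_growth + growth_ar * (x - mean_growth) + rng.gauss(0, process_sd)
        latent.append(x)
        observed.append(x + rng.gauss(0, obs_sd))
    return latent, observed


def lag1_slope(series):
    """Ordinary least-squares slope of series[t] on series[t-1]."""
    prev, curr = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    sxy = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    return sxy / sxx


if __name__ == "__main__":
    latent, observed = simulate_ring_widths(5000)
    print("slope from latent growth: ", round(lag1_slope(latent), 3))
    print("slope from observed width:", round(lag1_slope(observed), 3))
```

A naive lag-1 regression on the observed widths ignores measurement noise and therefore underestimates the dependence on the previous year's growth, whereas the same regression on the latent states recovers it. A state-space model is meant to separate the two noise sources explicitly.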

Data will be available from international data bases, but also from the Forest Growth & Dendrochronology lab.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:**

Bowman, D. M. J. S., Brienen, R. J. W., Gloor, E., Phillips, O. L., & Prior, L. D. (2013). Detecting trends in tree growth: Not so simple. *Trends in Plant Science*, *18*(1), 11–17. https://doi.org/10.1016/j.tplants.2012.08.005

Lundqvist, S.-O., Seifert, S., Grahn, T., Olsson, L., García-Gil, M. R., Karlsson, B., & Seifert, T. (2018). Age and weather effects on between and within ring variations of number, width and coarseness of tracheids and radial growth of young Norway spruce. *European Journal of Forest Research*, *137*(5), 719–743. https://doi.org/10.1007/s10342-018-1136-x

Schofield, M. R., Barker, R. J., Gelman, A., Cook, E. R., & Briffa, K. R. (2016). A model-based approach to climate reconstruction using tree-ring data. *Journal of the American Statistical Association*, *111*(513), 93–106. https://doi.org/10.1080/01621459.2015.1110524

Zhao, S., Pederson, N., D’Orangeville, L., HilleRisLambers, J., Boose, E., Penone, C., Bauer, B., Jiang, Y., & Manzanedo, R. D. (2019). The International Tree-Ring Data Bank (ITRDB) revisited: Data availability and global ecological representativity. *Journal of Biogeography*, *46*(2), 355–368. https://doi.org/10.1111/jbi.13488

**The impact of diel vertical migration on ocean carbon flux**

The daily (“diel”) vertical migration (DVM) of zooplankton (and accompanying fish) from the sea surface to the deep dark during the day is the largest movement of biomass on Earth. It is triggered by light, but the evolutionary cause is for zooplankton to avoid being consumed by their visually hunting predators. The consequence of DVM is that phytoplankton can reproduce largely unharmed during the day, thus assimilating more CO2 than if it were constantly grazed upon. Thus, it seems that DVM is not only optimizing survival of zooplankton, but also maximizing energy import into the pelagic sea. Or is it?

This theoretical ecology study aims at producing a simple predator-prey model to allow investigating the consequences of (a) switching on/off DVM, and (b) comparing tropical and polar regions of obviously very different day/night lengths. In polar regions, no DVM is observed: is this still maximizing energy import?

Models on DVM exist in the literature, but they are largely integrated into complex biogeochemical models of the ocean. This is not the aim of this “strategic” model, which should be parameterized for some processes (e.g. photosynthetic rate, foraging efficiency, migration rate), but aims to identify whether there is a detectable effect of DVM on C-fluxes.
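A starting point for such a strategic model might look like the sketch below: a two-compartment phytoplankton-zooplankton system, integrated with a simple Euler scheme, in which grazing is switched off during daylight when DVM is active. The equations and every parameter value are illustrative assumptions, not taken from the literature cited here.

```python
def simulate_dvm(dvm, day_fraction=0.5, days=60, dt=0.01):
    """Return cumulative carbon fixed by phytoplankton over `days`.

    P: phytoplankton, Z: zooplankton (arbitrary carbon units).
    With dvm=True zooplankton graze at the surface only at night;
    with dvm=False they graze around the clock.
    day_fraction is the share of each 24 h with daylight.
    """
    r, k = 1.0, 10.0   # max. phytoplankton growth rate (1/day), capacity
    g, h = 0.8, 2.0    # max. grazing rate (1/day), half-saturation constant
    e, m = 0.3, 0.1    # assimilation efficiency, zooplankton mortality (1/day)
    p, z = 1.0, 0.5    # initial states
    fixed = 0.0
    steps = int(round(days / dt))
    for step in range(steps):
        time_of_day = (step * dt) % 1.0
        daylight = time_of_day < day_fraction
        grazing_on = (not daylight) if dvm else True
        # Photosynthesis only in daylight, logistic self-limitation.
        production = r * p * (1.0 - p / k) if daylight else 0.0
        # Saturating (type II) grazing while zooplankton are at the surface.
        grazing = g * p * z / (h + p) if grazing_on else 0.0
        p = max(p + dt * (production - grazing), 0.0)
        z = max(z + dt * (e * grazing - m * z), 0.0)
        fixed += dt * production
    return fixed


if __name__ == "__main__":
    for label, frac in [("tropical-like", 0.5), ("polar-summer-like", 0.9)]:
        with_dvm = simulate_dvm(True, day_fraction=frac)
        without = simulate_dvm(False, day_fraction=frac)
        print(f"{label}: DVM {with_dvm:.1f}, no DVM {without:.1f}")
```

Comparing the two switches across day lengths gives a first impression of whether DVM changes total carbon fixation in such a minimal system; whether any difference survives realistic parameterization is the open question.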

**Suitable as:** BSc or MSc project, requiring interest in programming either differential or difference equations in R or Python.

**Contact:** **Carsten Dormann**, carsten.dormann@biom.uni-freiburg.de

**Literature:** Stock, C., & Dunne, J. (2010). Controls on the ratio of mesozooplankton production to primary production in marine ecosystems. Deep Sea Research Part I: Oceanographic Research Papers, 57(1), 95–112. https://doi.org/10.1016/j.dsr.2009.10.006

**The scale of tree diversity effects: develop flexible neighborhood analysis and test it with multiple ecosystem functions**

**Topic:** Enhancing tree diversity is a key concept to improve functioning and resilience of forests. Tree diversity experiments have been established worldwide to provide robust evidence for such effects, but often the effects are relatively weak. Tree diversity effects can be understood within the framework of associational (neighborhood) effects, but it is generally unknown how far the influence of a tree on the trees in its surrounding reaches in terms of various ecosystem functions / processes. Analyses of tree diversity experiments so far have used either diversity at the plot level or diversity of the direct neighbor trees as predictors.

This thesis will implement and test a novel, more mechanistic approach for identifying and understanding tree diversity effects, by calculating a distance-weighted neighborhood diversity. Existing datasets for various “functions” (herbivory, tree growth, etc.) are available for testing the method. In addition to providing a more powerful tool for identifying diversity effects, it will help to understand the scale of diversity effects for different processes (possibly even beyond plot boundaries).
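One possible form of a distance-weighted neighborhood diversity is sketched below, in Python for illustration although the thesis would work in R: each neighbor contributes to its species' weight according to an exponential distance kernel, and Shannon diversity is computed from the weighted proportions. The kernel shape, the `scale` parameter and the function name are assumptions for this example, not the method to be developed.

```python
import math


def neighborhood_diversity(focal, trees, scale):
    """Distance-weighted Shannon diversity around a focal tree.

    focal: (x, y) of the focal tree.
    trees: list of (x, y, species) for all other trees.
    scale: distance (same units as the coordinates) at which a neighbor's
           weight has dropped to 1/e (exponential kernel, an assumption).
    """
    weights = {}
    for x, y, species in trees:
        distance = math.hypot(x - focal[0], y - focal[1])
        weights[species] = weights.get(species, 0.0) + math.exp(-distance / scale)
    total = sum(weights.values())
    if total == 0.0:
        return 0.0
    return -sum((w / total) * math.log(w / total)
                for w in weights.values() if w > 0.0)


if __name__ == "__main__":
    neighbors = [(1.0, 0.0, "oak"), (0.0, 1.5, "beech"), (6.0, 0.0, "pine")]
    for scale in (0.5, 2.0, 10.0):
        print(scale, round(neighborhood_diversity((0.0, 0.0), neighbors, scale), 3))
```

Fitting a response such as herbivory or growth against this index for a range of `scale` values, and comparing model fit, would be one way to estimate how far diversity effects reach for a given process.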

**Suitable as:** MSc project.

**Requirements:** Ability and willingness to think conceptually about biodiversity effects. Experience and aptitude to work with R and data management.

**Time:** Start anytime

**Contact: Dr. Jochen Fründ**, jochen.fruend@biom.uni-freiburg.de, 0761/203-3747 (Biometry and Environmental System Analysis, Faculty Env. and Nat. Res.)