1. Yannick Baraud Toward a universal estimator

Abstract: We shall discuss the problem of designing an estimator of an unknown quantity based on the observation of independent data. In statistics, these estimators are generally derived from classical strategies such as likelihood maximization, least squares minimization, or Bayesian methods, and for many years statisticians have introduced quality criteria to evaluate their performance. We shall see that, although these estimation strategies are widely used and popular among users of statistics, they have serious drawbacks and may lead to poor-quality estimators even in very simple situations, that is, situations where alternative strategies would greatly outperform them. This raises a question: would it be possible to design an estimation strategy that suffers from none of these weaknesses and leads to an estimator whose performance is optimal in all estimation problems?
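One standard illustration of the kind of weakness the abstract mentions, and not necessarily the one used in the talk, is the sensitivity of the sample mean (the maximum likelihood estimator of location under a Gaussian model) to a single corrupted observation. The sketch below uses made-up numbers and compares it with the sample median as an assumed alternative.

```python
from statistics import mean, median

# Hypothetical measurements of a quantity whose true value is about 10.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]

# The same sample with one grossly corrupted observation added.
contaminated = clean + [1000.0]

# How far each estimate moves when the outlier is introduced.
mean_shift = abs(mean(contaminated) - mean(clean))
median_shift = abs(median(contaminated) - median(clean))

print(f"shift of the mean:   {mean_shift:.2f}")
print(f"shift of the median: {median_shift:.2f}")
```

The mean moves by more than a hundred units, while the median barely changes. The median is used here only as a simple point of comparison, not as the estimator proposed in the talk.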

  1. Benjamin Holcblat Some thoughts about econometrics and potential interdisciplinary collaborations

Abstract: The talk is divided into two parts. In the first part, we briefly outline the specificities of econometrics with respect to statistics and machine learning, as well as the differences between structural econometrics and reduced-form econometrics. In the second part, types of potential interdisciplinary collaborations are presented. A research project on the empirical saddlepoint (ESP) approximation exemplifies these potential collaborations. The ESP approximation has been successfully used to approximate the finite-sample distribution of solutions to empirical moment conditions (also known as estimating equations in statistics). However, many of its basic mathematical properties have not been established. Under general assumptions, we prove the existence of the ESP approximation estimand, namely the intensity distribution of the solutions. Then, we establish (global) consistency and asymptotic normality of the ESP approximation, and we show that these asymptotic properties are robust to the presence of multiple solutions (i.e., finite underidentification). A byproduct of the theoretical investigation is a result of independent interest regarding the measurability of solutions to empirical moment conditions. Computational aspects of the ESP approximation are also touched upon.

  1. Jun Pang Graph Neural Networks: Recent Developments and Applications

Abstract: Graphs are a ubiquitous data structure that models objects and their relationships, such as social networks, biological protein-protein networks, and recommendation systems. Learning node embeddings from a large graph has proved to be a useful approach for a wide range of network analysis tasks, including link prediction and node and graph classification. Graph neural networks (GNNs) open a new path to deep learning on graph-structured data, and have become one of the most popular paradigms for learning and exploiting node embeddings. This talk will provide a brief introduction to recent developments in GNNs and review a few new applications of GNNs.
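As a rough illustration of how node embeddings can be computed from graph structure, the sketch below performs a single neighbourhood-averaging step, the aggregation idea that many GNN layers build on. It deliberately omits the learned weight matrix and nonlinearity of a real GNN layer; the function name `mean_aggregate`, the toy graph, and the feature values are all assumptions made for this example.

```python
from typing import Dict, List


def mean_aggregate(
    adj: Dict[int, List[int]], feats: Dict[int, List[float]]
) -> Dict[int, List[float]]:
    """Replace each node's features by the average over itself and its neighbours."""
    out: Dict[int, List[float]] = {}
    for node, neighbours in adj.items():
        group = [node] + list(neighbours)  # include the node itself
        dim = len(feats[node])
        out[node] = [
            sum(feats[n][d] for n in group) / len(group) for d in range(dim)
        ]
    return out


# Hypothetical toy graph: a path 0 - 1 - 2, with 2-dimensional node features.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

embeddings = mean_aggregate(adj, feats)
print(embeddings)
```

Applying the step repeatedly lets information travel further along the graph, which is why stacked layers can capture larger neighbourhoods.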

  1. Thomas Sauter Contextualization of Molecular Network Models and their Applications to Cancer Biology

Abstract: Mathematical modelling of molecular networks allows for the discovery of knowledge at the system level. However, existing modelling tools are often computation-heavy and do not offer intuitive ways to explore the model, to test hypotheses or to interpret the results biologically.

We have developed computational approaches to contextualize logical models of regulatory networks, as well as constraint-based genome-scale models of metabolic networks, with biological measurements. These approaches are based on a probabilistic description of rule-based regulatory interactions between the different molecules and on linear programming for the constraint-based metabolic models, respectively. The resulting MATLAB toolboxes allow networks to be built and contextualized automatically and efficiently, and include a pipeline for parameter analysis, knockouts, and easy and fast model investigation. The contextualized models then provide qualitative and quantitative information about the network and suggest hypotheses about biological processes.

Applications include the model-guided re-sensitization of mutBRAF melanoma cells that are resistant to TRAIL receptor-targeted agonist treatment, as well as metabolic-model-based drug repositioning for selectively targeting colon cancer.

  1. Emma L. Schymanski Digital Detective Work: Connecting Cheminformatics, Mass Spectrometry and our Environment

Abstract: The environment and the chemicals to which we are exposed are incredibly complex, with around 100 million chemicals in the largest open chemical databases and over 70,000 in household use alone. Detectable molecules in complex samples can now be captured using high resolution mass spectrometry (HRMS), which provides a “snapshot” of all chemicals present in a sample and allows for retrospective data analysis through digital archiving. However, scientists cannot yet identify most of the tens of thousands of features in each sample, leading to critical bottlenecks in identification and data interpretation. This gives rise to the need for “digital detective work”. Identifying unknowns remains extremely time consuming and, in many cases, a matter of luck. Prioritizing efforts to find significant metabolites or potentially toxic substances responsible for observed biological effects is key, and involves reconciling highly complex samples with expert knowledge and careful validation. This talk will cover European, US and worldwide community initiatives to help connect knowledge on chemistry and toxicity with environmental observations – from compound databases to spectral libraries and retrospective screening. It will touch on the challenges of standardized structure representations, data curation, deposition and communication between resources, with a focus on recent activities we have undertaken to extend both PubChem (https://pubchem.ncbi.nlm.nih.gov/) and MetFrag (https://msbi.ipb-halle.de/MetFrag/) for use in exposomics studies, using examples from the group. It will show how interdisciplinary efforts and data sharing can facilitate research in metabolomics, exposomics and beyond. Finally, the data science challenges we face will be covered, to stimulate further discussion at this internal UL data science workshop.

  1. Anupam Sengupta Untangling high dimensional data to discern physical ecology of microbes

Abstract: Microorganisms such as bacteria, archaea, fungi and algae are indispensable in holding our ecosystem in balance, all the more so amid rapidly shifting climatic patterns. Microbes interface, exchange and communicate with their local surroundings: from simple to complex fluids, from compliant to rigid surfaces, microbes inhabit a plethora of environments spanning vastly different micro-structures, dynamics, and internal energies. Yet we currently lack a biophysical framework that could explain, generalize, and crucially, predict the ifs, the hows, and the whys of microbe-environment interactions. In my team, we aim to fill this gap by interfacing soft matter physics and fluid mechanics with microbiology and genetic engineering, leading to high dimensional datasets which we harness to predict the behaviour and physiology of diverse species (and combinations thereof) under a range of ecologically relevant parameters. Our results indicate that biological activity, combined with microbial traits at individual, population and community scales, elicits emergent properties that underpin microbial behaviour and physiology, which in turn feed back to biological activity. This sets the stage for leveraging the high dimensional datasets to train ML-based algorithms to discern novel mechanistic insights, and to predict interactions and interrelations in multi-species consortia, a capability that has been lacking thus far. In this talk, I will discuss our current ML strategies to tackle these datasets, not just to sift effectively through the high volume of imaging, behavioural and biomolecular data, but also to reveal the full scale of mechano-genetic implications that could ultimately lead us to the principles of microbial ecology across scales.

  1. Martin Theobald Current Topics and Trends in Big Data Analytics

Abstract: The talk presents a brief overview of current topics and trends in Big Data management and analytics. Specifically, it outlines (a) the use of various matrix decomposition techniques for building a recommender system and for analyzing latent topics in large text collections; (b) the training of regression models, decision trees and random forests for predictive data analysis; (c) the analysis of geospatial and temporal data; and (d) financial risk estimation via a form of Monte Carlo simulation. All applications and use-cases are developed on top of Apache Spark and can be deployed directly on any distributed backend such as the University’s HPC platform.
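To give a feel for item (d), the sketch below estimates a one-period value-at-risk by Monte Carlo simulation. It is a minimal single-machine illustration in plain Python, not the Spark-based application from the talk; the normal return model, the parameter values, and the function name `monte_carlo_var` are all assumptions made for this example.

```python
import random


def monte_carlo_var(
    mu: float, sigma: float, n_trials: int, level: float, seed: int = 0
) -> float:
    """Estimate value-at-risk as the loss exceeded in (1 - level) of simulated trials.

    Returns are drawn from an assumed normal distribution with mean `mu`
    and standard deviation `sigma`.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    returns = sorted(rng.gauss(mu, sigma) for _ in range(n_trials))
    idx = int((1.0 - level) * n_trials)  # index of the lower-tail quantile
    return -returns[idx]  # report the loss as a positive number


# Hypothetical portfolio: zero mean return, 2% volatility per period.
var_95 = monte_carlo_var(mu=0.0, sigma=0.02, n_trials=20000, level=0.95)
print(f"Estimated 95% VaR: {var_95:.4f}")
```

Because the trials are independent, the simulation splits naturally across workers, which is what makes this kind of estimate a good fit for a distributed engine.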

  1. Alexandre Tkatchenko Machine Learning in Physics and Chemistry

Abstract: To be added

  1. Gautam Tripathi Integrated likelihood based inference for nonlinear panel data models with unobserved effects

Abstract: Panel data models with fixed effects are widely used by economists and other social scientists to capture the effects of unobserved individual heterogeneity. In this paper, we propose a new integrated-likelihood-based approach for estimating panel data models when the unobserved individual effects enter the model nonlinearly. Unlike existing integrated likelihoods in the literature, the one we propose is closer to a “genuine” likelihood. Although the statistical theory for the proposed estimator is developed in an asymptotic setting where the number of individuals and the number of time periods both approach infinity, results from a simulation study suggest that our methodology can work very well even in moderately sized panels of short duration, in both static and dynamic models.