A selection of my research papers. Click a title to access the paper, or expand the abstract for more details.
Usman Anwar*, Tim Bakker*, Dana Kianfar, Cristina Pinneri, Christos Louizos
* Equal contribution
arXiv preprint, 2026. Under submission at ICML, 2026
Abstract
Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between the CoT and the output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: the information gap, which measures how much of the information available in the CoT the monitor can extract, and the elicitation error, which measures how well the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives.
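The necessary condition above is stated in terms of mutual information. As a minimal sketch (not the paper's estimator or notation), here is a plug-in estimate of the mutual information between two discrete variables from paired samples, illustrating the quantity a CoT must share with the attribute of interest for any monitor to beat chance:

```python
from collections import Counter
from math import log2

def plugin_mutual_information(xs, ys):
    """Plug-in (empirical) estimate of I(X; Y) in bits from paired samples.

    If I(CoT; attribute) = 0, no monitor reading the CoT can detect the
    attribute better than chance; positive MI is necessary but, per the
    paper, not sufficient for monitorability.
    """
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint * n * n / (px * py) = p(x, y) / (p(x) * p(y))
        mi += p_joint * log2(p_joint * n * n / (px[x] * py[y]))
    return mi

# Fully dependent binary variables -> 1 bit; independent -> 0 bits.
print(plugin_mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(plugin_mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```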
Corrado Rainone*, Tim Bakker*, Roland Memisevic
* Equal contribution
arXiv preprint, 2025
Abstract
Recent advances have established a new machine learning paradigm based on scaling up compute at inference time as well as at training time. In this line of work, a combination of Supervised Fine-Tuning (SFT) on synthetic demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is used to train Large Language Models to expend extra compute during inference in the form of "thoughts" expressed in natural language. In this paper, we propose to instead format these tokens as a multi-turn interaction trace with a stateful tool.
Tim Bakker
PhD thesis, University of Amsterdam, 2024
Abstract
This thesis consists of two parts. Part 1 develops reinforcement learning methods for adaptive sensing in MRI acceleration, enabling patient-specific scanning strategies that reduce scan times. Part 2 explores machine learning approaches for active learning, where models guide their own training by selecting which examples to learn from. Additional work addresses simulator optimization through policy-guided surrogate model training.
Fabio Valerio Massoli*, Tim Bakker*, Thomas Hehn, Tribhuvanesh Orekondy, Arash Behboodi
* Equal contribution
arXiv preprint, 2024
Abstract
This paper introduces a novel method for solving classes of similar black-box optimization problems. We learn an active learning policy that guides the training of a differentiable surrogate and then use the surrogate's gradients to optimize the simulation parameters with gradient descent. After training the policy, downstream optimization of problems involving black-box simulators requires up to ~90% fewer expensive simulator calls compared to baselines.
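The core surrogate-gradient idea can be sketched in a few lines (this is a toy illustration, not the paper's method: the simulator, quadratic surrogate, and fixed query grid are all hypothetical stand-ins; in the paper a learned policy chooses the queries):

```python
import numpy as np

def expensive_simulator(x):
    # Stand-in for a black-box simulator (hypothetical toy objective).
    return (x - 2.0) ** 2 + 1.0

# 1) Query the simulator at a few points.
xs = np.linspace(-4.0, 6.0, 9)
ys = np.array([expensive_simulator(x) for x in xs])

# 2) Fit a differentiable surrogate (here a quadratic, fit by least squares).
coeffs = np.polyfit(xs, ys, deg=2)  # [a, b, c] for a*x^2 + b*x + c

# 3) Optimize the surrogate with gradient descent instead of making
#    further simulator calls.
x = 0.0
for _ in range(200):
    grad = 2 * coeffs[0] * x + coeffs[1]  # d/dx of the surrogate
    x -= 0.1 * grad

print(round(x, 3))  # near the true optimum x* = 2.0
```

The point of the sketch: once the surrogate is accurate near the optimum, every gradient step is free, whereas each simulator call is expensive.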
Teodora Pandeva, Tim Bakker, Christian A. Naesseth, Patrick Forré
TMLR, 2024. ICLR 2025 J2C Track
Abstract
We propose E-C2ST, a deep classifier two-sample test for high-dimensional data based on E-values. Unlike the more standard p-value-based tests, E-value-based tests have finite-sample type I error guarantees, making them appropriate tools for statistical testing in practice. Our proposed E-C2ST combines ideas from existing work on split likelihood ratio tests and predictive independence testing. The resulting E-values can be used both for standard statistical testing on a fixed dataset and for anytime testing in streaming data settings. We demonstrate the utility of E-C2ST on simulated and real data. In all experiments, we observe that, as expected, E-C2ST's type I error stays substantially below the chosen significance level, while the p-value-based baseline methods regularly fail to control type I error appropriately. While E-C2ST has reduced power compared to the baseline methods, its power empirically converges to one as dataset size increases in most settings. We further propose an adjusted E-value-based test in the anytime testing framework that has increased power while still retaining the finite-sample type I error guarantees.
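The anytime-testing mechanism behind E-values can be illustrated with a toy e-process (this is the generic construction, not E-C2ST itself; the coin-flip null and the p = 0.75 alternative are invented for illustration):

```python
import random

def run_eprocess(p_true, alpha=0.05, max_n=10_000, seed=0):
    """Toy e-process for H0: coin is fair (p = 0.5), using the likelihood
    ratio against a fixed alternative p = 0.75 as the per-step E-factor."""
    rng = random.Random(seed)
    e = 1.0
    for n in range(1, max_n + 1):
        flip = rng.random() < p_true
        # Multiply in the likelihood ratio of this observation: alt / null.
        e *= (0.75 if flip else 0.25) / 0.5
        # By Ville's inequality, P(sup_n E_n >= 1/alpha) <= alpha under H0,
        # so stopping and rejecting the first time E crosses the threshold
        # keeps the type I error below alpha at any data-dependent stopping
        # time -- the "anytime" property.
        if e >= 1 / alpha:
            return n  # rejected H0 after n samples
    return None  # never rejected

print(run_eprocess(p_true=0.75))  # rejects under the alternative
```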
Tim Bakker, Herke van Hoof, Max Welling
ECML, 2023
Abstract
Pool-based active learning (AL) is a promising technique for increasing the data-efficiency of machine learning models. However, surveys show that the performance of recent AL methods is very sensitive to the choice of dataset and training setting, making them unsuitable for general application. To tackle this problem, the field of Learning Active Learning (LAL) proposes to learn the active learning strategy itself, allowing it to adapt to the given setting. In this work, we propose a novel LAL method for classification that exploits symmetry and independence properties of the active learning problem with an Attentive Conditional Neural Process model. Our approach is based on learning from a myopic oracle, which gives our model the ability to adapt to non-standard objectives, such as those that do not weight the error on all data points equally. We experimentally verify that our Neural Process model outperforms a variety of baselines in these settings. Finally, our experiments show that our model exhibits a tendency towards improved stability under changing datasets. However, performance is sensitive to the choice of classifier, and more work is necessary to reduce the performance gap with the myopic oracle and to improve scalability. We present our work as a proof of concept for LAL on non-standard objectives and hope our analysis and modelling considerations inspire future LAL work.
Tim Bakker, Matthew Muckley, Adriana Romero-Soriano, Michal Drozdzal, Luis Pineda
MIDL, 2022
Abstract
Most current approaches to undersampled multi-coil MRI reconstruction focus on learning the reconstruction model for a fixed, equidistant acquisition trajectory. In this paper, we study the problem of jointly learning the reconstruction model and the acquisition policy. To this end, we extend the End-to-End Variational Network with learnable acquisition policies that can adapt to different data points. We validate our model on a coil-compressed version of the large-scale undersampled multi-coil fastMRI dataset at two undersampling factors, 4x and 8x. Our experiments show on-par performance with learnable non-adaptive and handcrafted equidistant strategies at 4x acceleration, and an improvement of more than 2% in SSIM at 8x, suggesting that potentially-adaptive k-space acquisition trajectories can improve reconstructed image quality at larger acceleration factors. Perhaps surprisingly, however, our best-performing policies learn to be explicitly non-adaptive.
Sierk Kanis, Laurens Samson, Daan Bloembergen, Tim Bakker
arXiv preprint, 2021. UrbComp Best Paper Award runner-up
Abstract
In this paper we revisit some of the fundamental premises of a reinforcement learning (RL) approach to self-learning traffic lights. We propose RLight, a combination of choices that offers robust performance and good generalization to unseen traffic flows. Our main contributions are threefold. First, our lightweight, cluster-aware state representation leads to improved performance. Second, we reformulate the Markov Decision Process (MDP) such that it skips redundant timesteps of yellow light, speeding up learning by 30%. Third, we investigate the action space and provide insight into the difference in performance between acyclic and cyclic phase transitions. Additionally, we provide insights into the generalization of the methods to unseen traffic. Evaluations on the real-world Hangzhou traffic dataset show that RLight outperforms state-of-the-art rule-based and deep reinforcement learning algorithms, demonstrating the potential of RL-based methods to improve urban traffic flows.
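The timestep-skipping reformulation can be sketched as an environment wrapper (an illustrative toy, not RLight's code: the 20-step environment, reward of -1 per step, and 3-step yellow phase are all invented for the example):

```python
class ToyTrafficEnv:
    """Hypothetical minimal environment: 20 timesteps, reward -1 per step.
    Action 1 switches the signal phase, triggering a yellow light."""
    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        phase_changed = action == 1
        return self.t, -1.0, self.t >= 20, phase_changed

class SkipYellowWrapper:
    """Fast-forwards through the fixed yellow phase after a phase change,
    accumulating reward, so each agent decision covers several env steps
    and the agent never wastes decisions on redundant yellow timesteps."""
    def __init__(self, env, yellow_steps=3):
        self.env, self.yellow_steps = env, yellow_steps

    def step(self, action):
        obs, reward, done, phase_changed = self.env.step(action)
        if phase_changed:
            for _ in range(self.yellow_steps):
                if done:
                    break
                obs, r, done, _ = self.env.step(None)  # no-op during yellow
                reward += r
        return obs, reward, done

env = ToyTrafficEnv()
wrapped = SkipYellowWrapper(env)
decisions, done = 0, False
while not done:
    action = decisions % 2  # alternate keep / switch for the demo
    _, _, done = wrapped.step(action)
    decisions += 1

print(decisions, env.t)  # 8 agent decisions cover 20 environment steps
```

Because yellow-phase steps are handled inside the wrapper, the agent's trajectories become shorter, which is the source of the learning speed-up described above.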
Tim Bakker, Herke van Hoof, Max Welling
NeurIPS, 2020. Spotlight
Abstract
In today's clinical practice, magnetic resonance imaging (MRI) is routinely accelerated through subsampling of the associated Fourier domain. Currently, the construction of these subsampling strategies, known as experimental design, relies primarily on heuristics. We propose to learn experimental design strategies for accelerated MRI with policy gradient methods. Unexpectedly, our experiments show that a simple greedy approximation of the objective leads to solutions nearly on par with the more general non-greedy approach. We offer a partial explanation for this phenomenon, rooted in the greater variance of the non-greedy objective's gradient estimates, and experimentally verify that this variance hampers non-greedy models in adapting their policies to individual MR images. We empirically show that this adaptivity is key to improving subsampling designs.
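The policy-gradient idea for learned subsampling can be sketched on a toy problem (a generic REINFORCE illustration, not the paper's MRI setup or its greedy/non-greedy comparison; the eight "measurements" and their payoff values are invented):

```python
import math
import random

# Eight candidate "measurements" with hidden informativeness; the policy
# learns which ones are worth acquiring.
random.seed(0)
VALUES = [0.1, 0.9, 0.2, 0.8, 0.1, 0.7, 0.2, 0.1]  # hypothetical payoffs
logits = [0.0] * 8
LR, K, baseline = 0.1, 3, 0.0

def softmax(ls):
    m = max(ls)
    exps = [math.exp(l - m) for l in ls]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(5000):
    probs = softmax(logits)
    # Sample K measurements (with replacement, for simplicity).
    picks = random.choices(range(8), weights=probs, k=K)
    reward = sum(VALUES[i] for i in picks)
    advantage = reward - baseline           # baseline reduces gradient variance
    baseline += 0.05 * (reward - baseline)  # running-average baseline
    for a in picks:  # REINFORCE: grad log pi(a) = onehot(a) - probs
        for j in range(8):
            logits[j] += LR * advantage * ((1.0 if j == a else 0.0) - probs[j])

probs = softmax(logits)
expected = sum(p * v for p, v in zip(probs, VALUES))
print(expected > sum(VALUES) / 8)  # learned policy beats uniform sampling
```

The running-average baseline here is one standard variance-reduction device; the abstract's observation is precisely that gradient-estimate variance is what separates the greedy and non-greedy variants of this kind of training.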