Tim Bakker

Publications

A selection of my research papers. Click a title to access the paper, or expand the abstract for more details.

Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

Usman Anwar*, Tim Bakker*, Dana Kianfar, Cristina Pinneri, Christos Louizos

* Equal contribution

arXiv preprint, 2026. Under submission to ICML 2026

Abstract

Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between the CoT and the output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: the information gap, which measures the extent to which the monitor can extract the information available in the CoT, and the elicitation error, which measures the extent to which the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives.
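
The necessary-but-not-sufficient point can be illustrated with a toy construction of my own (not the paper's setup): a two-token CoT whose XOR determines the attribute carries a full bit of information, yet a monitor that only reads the first token extracts none of it, a simple instance of an information gap.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) between two discrete
    variables, computed from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Toy CoT: two binary tokens; the attribute of interest is their XOR.
samples = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)]
full_trace = [(cot, y) for cot, y in samples]      # MI = 1 bit
first_token = [(cot[0], y) for cot, y in samples]  # MI = 0 bits
```

A monitor with access to the full trace could in principle recover the attribute perfectly; one restricted to the first token cannot do better than chance, even though the trace as a whole is fully informative.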

Replacing thinking with tool usage enables reasoning in small language models

Corrado Rainone*, Tim Bakker*, Roland Memisevic

* Equal contribution

arXiv preprint, 2025

Abstract

Recent advances have established a new machine learning paradigm based on scaling up compute at inference time as well as at training time. In that line of work, a combination of Supervised Fine-Tuning (SFT) on synthetic demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is used to train Large Language Models to expend extra compute during inference in the form of "thoughts" expressed in natural language. In this paper, we propose to instead format these reasoning tokens as a multi-turn interaction trace with a stateful tool.
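
A minimal sketch of the stateful-tool idea, with a hypothetical persistent-namespace tool standing in for whatever tool the paper uses; the actual tool and trace format may differ:

```python
class StatefulTool:
    """A toy stateful tool: a persistent namespace the model interacts
    with over multiple turns, standing in for free-form 'thinking'
    tokens. (Illustrative only, not the paper's tool.)"""
    def __init__(self):
        self.state = {}

    def __call__(self, code):
        exec(code, {}, self.state)  # state persists across turns
        return dict(self.state)

# A multi-turn interaction trace: each turn is a tool call plus the
# resulting tool state, rather than a span of natural-language thought.
tool = StatefulTool()
trace = [(turn, tool(turn)) for turn in
         ["x = 12 * 7", "y = x + 16", "answer = y // 10"]]
# the final answer is read off the tool's state: tool.state["answer"]
```

The key design point is that intermediate computation lives in the tool's state rather than in generated text, so a small model need not carry the working memory inside its own context tokens.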

Learning adaptive sensing and active learning

Tim Bakker

PhD thesis, University of Amsterdam, 2024

Abstract

This thesis consists of two parts. Part 1 develops reinforcement learning methods for adaptive sensing in MRI acceleration, enabling patient-specific scanning strategies that reduce scan times. Part 2 explores machine learning approaches for active learning, where models guide their own training by selecting which examples to learn from. Additional work addresses simulator optimization through policy-guided surrogate model training.

PDF

Simulating, Fast and Slow: Learning Policies for Black-Box Optimization

Fabio Valerio Massoli*, Tim Bakker*, Thomas Hehn, Tribhuvanesh Orekondy, Arash Behboodi

* Equal contribution

arXiv preprint, 2024

Abstract

This paper introduces a novel method for solving classes of similar black-box optimization problems by learning an active learning policy that guides a differentiable surrogate's training and uses the surrogate's gradients to optimize the simulation parameters with gradient descent. After training the policy, downstream optimization of problems involving black-box simulators requires up to ~90% fewer expensive simulator calls compared to baselines.
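
The surrogate-gradient step can be sketched with a toy example (my own construction; the paper additionally learns a policy that decides where to query the simulator): query an "expensive" black box at a few points, fit a differentiable surrogate, then run gradient descent on the surrogate instead of the simulator.

```python
def expensive_simulator(x):
    # stand-in for a costly black-box simulator call
    return (x - 3.0) ** 2

# Query the simulator at three points and fit an exact quadratic
# surrogate a*x^2 + b*x + c through them (3 simulator calls total).
f0 = expensive_simulator(0.0)
f1 = expensive_simulator(1.0)
f2 = expensive_simulator(2.0)
a = (f2 - 2 * f1 + f0) / 2
b = f1 - f0 - a

# The surrogate is differentiable, so we can optimize it with plain
# gradient descent, with zero further simulator calls.
x, lr = 0.0, 0.3
for _ in range(50):
    x -= lr * (2 * a * x + b)  # analytic gradient of the surrogate
# x now sits at the simulator's minimizer (3.0)
```

In this toy case the surrogate is exact, so the descent lands on the true minimizer; in practice the surrogate is approximate and the saving comes from replacing most simulator queries with cheap surrogate gradients.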

E-Valuating Classifier Two-Sample Tests

Teodora Pandeva, Tim Bakker, Christian A. Naesseth, Patrick Forré

TMLR, 2024. ICLR 2025 Journal-to-Conference Track

Abstract

We propose E-C2ST, a deep classifier two-sample test for high-dimensional data based on E-values. Unlike the more standard p-value-based tests, E-value-based tests have finite-sample type I error guarantees, making them appropriate tools for statistical testing in practice. Our proposed E-C2ST combines ideas from existing work on split likelihood ratio tests and predictive independence testing. The resulting E-values can be used both for standard statistical testing on a fixed dataset and for anytime testing in streaming-data settings. We demonstrate the utility of E-C2ST on simulated and real data. In all experiments, we observe that, as expected, E-C2ST's type I error stays substantially below the chosen significance level, while the p-value-based baseline methods regularly fail to control the type I error appropriately. While E-C2ST has reduced power compared to the baseline methods, its power empirically converges to one as the dataset size increases in most settings. We further propose an adjusted E-value-based test in the anytime testing framework that has increased power while still retaining the finite-sample type I error guarantees.
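
A simplified sketch of why E-values admit anytime testing (a generic split-LRT-style construction for illustration, not the paper's exact E-C2ST estimator): a classifier fit on a held-out split predicts which of the two samples each point came from, and the product of likelihood ratios against the chance rate forms an e-value that can be multiplied across batches.

```python
def batch_evalue(labels, probs):
    """Split-LRT-style e-value for a two-sample test: `probs` are a
    classifier's predicted probabilities (fit on a separate split) that
    each point belongs to sample 1; under H0 the true rate is 0.5."""
    e = 1.0
    for y, p in zip(labels, probs):
        e *= (p if y == 1 else 1.0 - p) / 0.5
    return e

def anytime_test(batches, alpha=0.05):
    """Multiply e-values across batches. By Ville's inequality the
    running product exceeds 1/alpha with probability <= alpha under
    H0, so we may monitor the stream and stop at any time."""
    running = 1.0
    for labels, probs in batches:
        running *= batch_evalue(labels, probs)
        if running >= 1.0 / alpha:
            return True, running  # reject H0, anytime-valid
    return False, running
```

This is the mechanism behind the finite-sample guarantee: the stopping rule can depend arbitrarily on the data seen so far without inflating the type I error.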

Learning objective-specific active learning strategies with Attentive Neural Processes

Tim Bakker, Herke van Hoof, Max Welling

ECML, 2023

Abstract

Pool-based active learning (AL) is a promising technique for increasing the data-efficiency of machine learning models. However, surveys show that the performance of recent AL methods is very sensitive to the choice of dataset and training setting, making them unsuitable for general application. To tackle this problem, the field of Learning Active Learning (LAL) proposes learning the active learning strategy itself, allowing it to adapt to the given setting. In this work, we propose a novel LAL method for classification that exploits symmetry and independence properties of the active learning problem with an Attentive Conditional Neural Process model. Our approach is based on learning from a myopic oracle, which gives our model the ability to adapt to non-standard objectives, such as those that do not weight the error on all data points equally. We experimentally verify that our Neural Process model outperforms a variety of baselines in these settings. Finally, our experiments show that our model exhibits a tendency towards improved stability across changing datasets. However, performance is sensitive to the choice of classifier, and more work is necessary to reduce the performance gap with the myopic oracle and to improve scalability. We present our work as a proof-of-concept for LAL on non-standard objectives and hope our analysis and modelling considerations inspire future LAL work.
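
The myopic-oracle idea can be sketched with a deliberately tiny setup (a hypothetical 1-nearest-neighbor learner on 1D data, my own toy rather than the paper's model): for each pool candidate, reveal its true label, retrain, and keep the candidate that most reduces a possibly non-uniformly weighted validation error.

```python
def one_nn_predict(train, x):
    # a deliberately simple learner: label of the nearest labeled point
    return min(train, key=lambda t: abs(t[0] - x))[1]

def weighted_error(train, val, weights):
    # a non-standard objective: validation points need not count equally
    wrong = sum(w for (x, y), w in zip(val, weights)
                if one_nn_predict(train, x) != y)
    return wrong / sum(weights)

def myopic_oracle_pick(labeled, pool, val, weights):
    # the oracle "cheats": it reveals each candidate's true label,
    # retrains, and keeps the candidate that most reduces the objective
    return min(pool,
               key=lambda cand: weighted_error(labeled + [cand], val, weights))

labeled = [(0.0, 0)]
pool = [(4.0, 0), (6.0, 1), (9.0, 1)]
val, weights = [(4.4, 0), (5.5, 1)], [1.0, 1.0]
best = myopic_oracle_pick(labeled, pool, val, weights)
```

The oracle itself is not deployable (it needs the candidates' labels), which is why a model is trained to imitate its choices from observable information only; changing `weights` changes which candidate the oracle prefers, which is what lets the learned strategy target non-standard objectives.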

On learning adaptive acquisition policies for undersampled multi-coil MRI reconstruction

Tim Bakker, Matthew Muckley, Adriana Romero-Soriano, Michal Drozdzal, Luis Pineda

MIDL, 2022

Abstract

Most current approaches to undersampled multi-coil MRI reconstruction focus on learning the reconstruction model for a fixed, equidistant acquisition trajectory. In this paper, we study the problem of joint learning of the reconstruction model together with acquisition policies. To this end, we extend the End-to-End Variational Network with learnable acquisition policies that can adapt to different data points. We validate our model on a coil-compressed version of the large-scale undersampled multi-coil fastMRI dataset using two undersampling factors, 4x and 8x. Our experiments show performance on par with the learnable non-adaptive and handcrafted equidistant strategies at 4x, and an observed improvement of more than 2% in SSIM at 8x acceleration, suggesting that potentially-adaptive k-space acquisition trajectories can improve reconstructed image quality for larger acceleration factors. However, and perhaps surprisingly, our best-performing policies learn to be explicitly non-adaptive.

Back to Basics: Deep Reinforcement Learning in Traffic Signal Control

Sierk Kanis, Laurens Samson, Daan Bloembergen, Tim Bakker

arXiv preprint, 2021. UrbComp Best Paper Award runner-up

Abstract

In this paper we revisit some of the fundamental premises of a reinforcement learning (RL) approach to self-learning traffic lights. We propose RLight, a combination of choices that offers robust performance and good generalization to unseen traffic flows. Our main contributions are threefold: (1) our lightweight and cluster-aware state representation leads to improved performance; (2) we reformulate the Markov Decision Process (MDP) so that it skips redundant yellow-light timesteps, speeding up learning by 30%; and (3) we investigate the action space and provide insight into the difference in performance between acyclic and cyclic phase transitions. Additionally, we provide insights into the generalization of the methods to unseen traffic. Evaluations using the real-world Hangzhou traffic dataset show that RLight outperforms state-of-the-art rule-based and deep reinforcement learning algorithms, demonstrating the potential of RL-based methods to improve urban traffic flows.
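
The MDP reformulation that skips yellow-light timesteps can be sketched as an environment wrapper (a toy illustration under assumed names; the paper's exact MDP and environment interface may differ): one wrapped step applies the agent's action and then auto-advances through the fixed yellow interval, accumulating rewards, so the agent never makes a decision at a timestep where no decision is possible.

```python
class SkipYellowWrapper:
    """Wraps a traffic-signal environment so the agent never decides
    during the fixed yellow interval: one wrapped step applies the
    action, then auto-advances the yellow timesteps, summing rewards.
    (A sketch of the idea only.)"""
    def __init__(self, env, yellow_steps=3):
        self.env, self.yellow_steps = env, yellow_steps

    def step(self, action):
        state, reward, done = self.env.step(action)
        for _ in range(self.yellow_steps):
            if done:
                break
            state, r, done = self.env.step(None)  # no-op while yellow
            reward += r
        return state, reward, done

class ToyEnv:
    """A stand-in environment: 10 timesteps, reward 1 per step."""
    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 10

env = SkipYellowWrapper(ToyEnv())
done, decisions = False, 0
while not done:
    _, _, done = env.step("keep-phase")
    decisions += 1
# 10 environment timesteps are consumed in only 3 agent decisions
```

Collapsing forced timesteps this way shortens the effective horizon the agent must learn over, which is one plausible mechanism for the reported speed-up in learning.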

Experimental design for MRI by greedy policy search

Tim Bakker, Herke van Hoof, Max Welling

NeurIPS, 2020. Spotlight

Abstract

In today's clinical practice, magnetic resonance imaging (MRI) is routinely accelerated through subsampling of the associated Fourier domain. Currently, the construction of these subsampling strategies, known as experimental design, relies primarily on heuristics. We propose to learn experimental design strategies for accelerated MRI with policy gradient methods. Unexpectedly, our experiments show that a simple greedy approximation of the objective leads to solutions nearly on par with the more general non-greedy approach. We offer a partial explanation for this phenomenon rooted in greater variance in the non-greedy objective's gradient estimates, and experimentally verify that this variance hampers non-greedy models in adapting their policies to individual MR images. We empirically show that this adaptivity is key to improving subsampling designs.