Niklas Penzel, M.Sc.
Address: Computer Vision Group
Department of Mathematics and Computer Science
Friedrich Schiller University of Jena
Ernst-Abbe-Platz 2
07743 Jena
Germany
Phone: +49 (0) 3641 9 46335
E-mail: niklas (dot) penzel (at) uni-jena (dot) de
Room: 1224
Links: Google Scholar
Curriculum Vitae
Since Dec. 2020 | Research Associate at the Computer Vision Group, Friedrich Schiller University Jena
2020 | Master Thesis: “The Bias Uncertainty Sampling introduces into an Active Learning System”
2018-2020 | M.Sc. in Computer Science at the Friedrich Schiller University Jena
2018 | Bachelor Thesis: “Lebenslanges Lernen von Klassifikationssystemen ohne Vorwissen und mit intelligenter Datenhaltung” (Lifelong Learning of Classification Systems without Prior Knowledge and with Intelligent Data Management)
2015-2018 | B.Sc. in Computer Science at the Friedrich Schiller University Jena
Research Interests
- Explainable AI
- Analyzing Model Training
- Lifelong Learning
- Deep Learning
- Super Resolution
Supervised Theses
- Phillip Rothenbeck: “SIR-based modelling of COVID-19 pandemic using PINNs”. Bachelor thesis, 2024. (joint supervision with Sai Karthikeya Vemuri)
- Maria Gogolev: “Comparing and Modifying Distributions of Latent Diffusion Models to Impose Image Properties”. Master thesis, 2024. (joint supervision with Sven Sickert and Tim Büchner)
- Konstantin Roppel: “Model Feature Attribution for Single Images using Conditional Independence Tests”. Master thesis, 2024. (joint supervision with Jan Blunk)
- Jan Blunk: “Steering Feature Usage During Neural Network Model Training”. Master thesis, 2023. (joint supervision with Paul Bodesheim)
- Tristan Piater: “Self-Attention Mechanisms for the Classification of Dermoscopic Images”. Bachelor thesis, 2022. (joint supervision with Gideon Stein)
- Maria Gogolev: “Continual fine-tuning with intelligent rehearsal selection”. Bachelor thesis, 2022. (joint supervision with Julia Böhlke)
Publications
2025
Niklas Penzel, Gideon Stein, Joachim Denzler:
Change Penalized Tuning to Reduce Pre-trained Biases.
Communications in Computer and Information Science. 2025. (in press)
[bibtex] [abstract]
Due to the data-centric approach of modern machine learning, biases present in the training data are frequently learned by deep models. It is often necessary to collect new data and retrain the models from scratch to remedy these issues, which can be expensive in critical areas such as medicine. We investigate whether it is possible to fix pre-trained model behavior using very few unbiased examples. We show that we can improve performance by tuning the models while penalizing parameter changes, thereby keeping pre-trained knowledge while simultaneously correcting the harmful behavior. Toward this goal, we first tune a zero-initialized copy of the frozen pre-trained network using strong parameter norms. Second, we introduce an early stopping scheme to modify baselines and reduce overfitting. Our approaches lead to improvements on four datasets common in the debiasing and domain shift literature. We especially see benefits in an iterative setting, where new samples are added continuously. Hence, we demonstrate the effectiveness of tuning while penalizing change to fix pre-trained models without retraining from scratch.
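The core mechanism can be pictured with a short, hedged sketch: fine-tuning while penalizing the squared distance to the pre-trained weights, which is one way to view tuning a zero-initialized additive copy of the frozen network. All names (`tune_with_change_penalty`, `loader`, `loss_fn`) are illustrative assumptions, not the paper's code.

```python
import torch

def tune_with_change_penalty(model, loader, loss_fn, lam=1.0, lr=1e-4, epochs=5):
    """Fine-tune while penalizing deviation from the pre-trained weights.

    With theta = theta0 + delta and delta initialized to zero, a strong
    norm on the change keeps the delta small, preserving prior knowledge.
    """
    # Snapshot the pre-trained parameters; the penalty acts on the change.
    theta0 = {n: p.detach().clone() for n, p in model.named_parameters()}
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            task_loss = loss_fn(model(x), y)
            # Squared L2 norm of the parameter change.
            change = sum(((p - theta0[n]) ** 2).sum()
                         for n, p in model.named_parameters())
            (task_loss + lam * change).backward()
            opt.step()
    return model
```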
Tristan Piater, Niklas Penzel, Gideon Stein, Joachim Denzler:
Self-Attention for Medical Imaging - On the need for evaluations beyond mere benchmarking.
Communications in Computer and Information Science. 2025. (in press)
[bibtex] [abstract]
A considerable amount of research has been dedicated to creating systems that aid medical professionals in labor-intensive early screening tasks, which, to date, often leverage convolutional deep-learning architectures. Recently, several studies have explored the application of self-attention mechanisms in the field of computer vision. These studies frequently demonstrate empirical improvements over traditional, fully convolutional approaches across a range of datasets and tasks. To assess this trend for medical imaging, we enhance two commonly used convolutional architectures with various self-attention mechanisms and evaluate them on two distinct medical datasets. We compare these enhanced architectures with similarly sized convolutional and attention-based baselines and rigorously assess performance gains through statistical evaluation. Furthermore, we investigate how the inclusion of self-attention influences the features learned by these models by assessing global and local explanations of model behavior. Contrary to our expectations, after performing an appropriate hyperparameter search, self-attention-enhanced architectures show no significant improvements in balanced accuracy compared to the evaluated baselines. Further, we find that relevant global features like dermoscopic structures in skin lesion images are not properly learned by any architecture. Finally, by assessing local explanations, we find that the inherent interpretability of self-attention mechanisms does not provide additional insights. Out-of-the-box model-agnostic approaches can provide explanations that are similarly or even more faithful to the actual model behavior. We conclude that simply integrating attention mechanisms is unlikely to lead to a consistent increase in performance compared to fully convolutional methods in medical imaging applications.
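As a rough illustration of what enhancing a convolutional architecture with self-attention can look like, the sketch below inserts multi-head self-attention over the spatial positions of a CNN feature map. This is a generic PyTorch block assumed for illustration; the paper evaluates several concrete variants.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention over the spatial positions of a CNN feature map.

    Can be dropped in after a convolutional stage; `channels` must be
    divisible by `heads`. Illustrative, not the authors' exact design.
    """

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)     # (B, H*W, C) token sequence
        out, _ = self.attn(seq, seq, seq)      # global spatial attention
        seq = self.norm(seq + out)             # residual connection + norm
        return seq.transpose(1, 2).reshape(b, c, h, w)
```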
2024
Niklas Penzel, Gideon Stein, Joachim Denzler:
Reducing Bias in Pre-trained Models by Tuning while Penalizing Change.
International Conference on Computer Vision Theory and Applications (VISAPP). Pages 90-101. 2024.
[bibtex] [web] [doi] [abstract]
Deep models trained on large amounts of data often incorporate implicit biases present during training time. If later such a bias is discovered during inference or deployment, it is often necessary to acquire new data and retrain the model. This behavior is especially problematic in critical areas such as autonomous driving or medical decision-making. In these scenarios, new data is often expensive and hard to come by. In this work, we present a method based on change penalization that takes a pre-trained model and adapts the weights to mitigate a previously detected bias. We achieve this by tuning a zero-initialized copy of a frozen pre-trained network. Our method needs very few examples that contradict the bias, in extreme cases only a single one, to increase performance. Additionally, we propose an early stopping criterion to modify baselines and reduce overfitting. We evaluate our approach on a well-known bias in skin lesion classification and three other datasets from the domain shift literature. We find that our approach works especially well with very few images. Simple fine-tuning combined with our early stopping also leads to performance benefits for a larger number of tuning samples.
Tim Büchner, Niklas Penzel, Orlando Guntinas-Lichius, Joachim Denzler:
Facing Asymmetry - Uncovering the Causal Link between Facial Symmetry and Expression Classifiers using Synthetic Interventions.
Asian Conference on Computer Vision (ACCV). 2024. (accepted)
[bibtex] [pdf] [abstract]
Understanding expressions is vital for deciphering human behavior, and nowadays, end-to-end trained black box models achieve high performance. Due to the black-box nature of these models, it is unclear how they behave when applied out-of-distribution. Specifically, these models show decreased performance for unilateral facial palsy patients. We hypothesize that one crucial factor guiding the internal decision rules is facial symmetry. In this work, we use insights from causal reasoning to investigate this hypothesis. After deriving a structural causal model, we develop a synthetic interventional framework. This approach allows us to analyze how facial symmetry impacts a network's output behavior while keeping other factors fixed. All 17 investigated expression classifiers significantly lower their output activations for reduced symmetry. This result is congruent with observed behavior on real-world data from healthy subjects and facial palsy patients. As such, our investigation serves as a case study for identifying causal factors that influence the behavior of black-box models.
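One way to picture such a synthetic intervention: blend each face with its horizontally symmetrized version and track the classifier's output as symmetry increases. The blending scheme and names below are illustrative assumptions, not the paper's exact framework.

```python
import numpy as np

def symmetry_intervention(img, model, alphas=np.linspace(0.0, 1.0, 11)):
    """Intervene on facial symmetry and record the classifier's response.

    img: (H, W, C) float array; model: callable returning an output score.
    alpha = 0 leaves the face unchanged, alpha = 1 makes it perfectly
    symmetric (average of the image and its horizontal mirror).
    """
    mirrored = img[:, ::-1, :]                 # flip along the width axis
    symmetric = 0.5 * (img + mirrored)         # perfectly symmetric face
    activations = []
    for a in alphas:
        intervened = (1.0 - a) * img + a * symmetric
        activations.append(model(intervened))  # output at this symmetry level
    return alphas, activations
```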
Tim Büchner, Niklas Penzel, Orlando Guntinas-Lichius, Joachim Denzler:
The Power of Properties: Uncovering the Influential Factors in Emotion Classification.
International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI). 2024.
[bibtex] [web] [doi] [abstract]
Facial expression-based human emotion recognition is a critical research area in psychology and medicine. State-of-the-art classification performance is only reached by end-to-end trained neural networks. Nevertheless, such black-box models lack transparency in their decision-making processes, prompting efforts to ascertain the rules that underlie classifiers’ decisions. Analyzing single inputs alone fails to expose systematic learned biases. These biases can be characterized as facial properties summarizing abstract information like age or medical conditions. Therefore, understanding a model’s prediction behavior requires an analysis rooted in causality along such selected properties. We demonstrate that up to 91.25% of classifier output behavior changes are statistically significant concerning basic properties. Among those are age, gender, and facial symmetry. Furthermore, the medical usage of surface electromyography significantly influences emotion prediction. We introduce a workflow to evaluate explicit properties and their impact. These insights might help medical professionals select and apply classifiers regarding their specialized data and properties.
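A minimal sketch of one step in such a property-evaluation workflow: split the classifier outputs by a property (say, a hypothetical age threshold) and test whether the two output distributions differ significantly. The Mann-Whitney U test is an illustrative choice here, not necessarily the statistic used in the paper.

```python
import numpy as np
from scipy import stats

def property_impact(outputs, prop_values, threshold):
    """Test whether a facial property significantly shifts model outputs.

    outputs: (N,) array of classifier scores; prop_values: (N,) property
    realizations (e.g., age). A small p-value indicates the property is
    associated with a change in output behavior.
    """
    group_low = outputs[prop_values <= threshold]
    group_high = outputs[prop_values > threshold]
    stat, p_value = stats.mannwhitneyu(group_low, group_high)
    return p_value
```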
Tristan Piater, Niklas Penzel, Gideon Stein, Joachim Denzler:
When Medical Imaging Met Self-Attention: A Love Story That Didn’t Quite Work Out.
International Conference on Computer Vision Theory and Applications (VISAPP). Pages 149-158. 2024.
[bibtex] [web] [doi] [abstract]
A substantial body of research has focused on developing systems that assist medical professionals during labor-intensive early screening processes, many based on convolutional deep-learning architectures. Recently, multiple studies explored the application of so-called self-attention mechanisms in the vision domain. These studies often report empirical improvements over fully convolutional approaches on various datasets and tasks. To evaluate this trend for medical imaging, we extend two widely adopted convolutional architectures with different self-attention variants on two different medical datasets. With this, we aim to specifically evaluate the possible advantages of additional self-attention. We compare our models with similarly sized convolutional and attention-based baselines and evaluate performance gains statistically. Additionally, we investigate how including such layers changes the features learned by these models during the training. Following a hyperparameter search, and contrary to our expectations, we observe no significant improvement in balanced accuracy over fully convolutional models. We also find that important features, such as dermoscopic structures in skin lesion images, are still not learned by employing self-attention. Finally, analyzing local explanations, we confirm biased feature usage. We conclude that merely incorporating attention is insufficient to surpass the performance of existing fully convolutional methods.
2023
Jan Blunk, Niklas Penzel, Paul Bodesheim, Joachim Denzler:
Beyond Debiasing: Actively Steering Feature Selection via Loss Regularization.
DAGM German Conference on Pattern Recognition (DAGM-GCPR). 2023.
[bibtex] [pdf] [abstract]
It is common for domain experts like physicians in medical studies to examine features for their reliability with respect to a specific domain task. When introducing machine learning, a common expectation is that machine learning models use the same features as human experts to solve a task but that is not always the case. Moreover, datasets often contain features that are known from domain knowledge to generalize badly to the real world, referred to as biases. Current debiasing methods only remove such influences. To additionally integrate the domain knowledge about well-established features into the training of a model, their relevance should be increased. We present a method that permits the manipulation of the relevance of features by actively steering the model's feature selection during the training process. That is, it allows both the discouragement of biases and encouragement of well-established features to incorporate domain knowledge about the feature reliability. We model our objectives for actively steering the feature selection process as a constrained optimization problem, which we implement via a loss regularization that is based on batch-wise feature attributions. We evaluate our approach on a novel synthetic regression dataset and a computer vision dataset. We observe that it successfully steers the features a model selects during the training process. This is a strong indicator that our method can be used to integrate domain knowledge about well-established features into a model.
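The constrained objective can be pictured as a loss regularizer on batch-wise attributions: push the attribution of a known bias feature toward zero while keeping a trusted feature's attribution above a target. The gradient-based attribution and all names below are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def steered_loss(model, x, y, task_loss_fn, bias_idx, good_idx,
                 lam_bias=1.0, lam_good=1.0, tau=0.1):
    """Task loss plus a regularizer on batch-wise feature attributions.

    Discourage a bias feature (attribution pushed toward zero) and
    encourage a well-established feature (attribution pushed above tau).
    """
    x = x.clone().requires_grad_(True)
    out = model(x)
    task_loss = task_loss_fn(out, y)
    # Simple gradient attributions w.r.t. the inputs, averaged per batch.
    attr = torch.autograd.grad(out.sum(), x, create_graph=True)[0]
    bias_rel = attr[:, bias_idx].abs().mean()   # relevance of the bias feature
    good_rel = attr[:, good_idx].abs().mean()   # relevance of the trusted feature
    return task_loss + lam_bias * bias_rel + lam_good * torch.relu(tau - good_rel)
```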
Niklas Penzel, Jana Kierdorf, Ribana Roscher, Joachim Denzler:
Analyzing the Behavior of Cauliflower Harvest-Readiness Models by Investigating Feature Relevances.
ICCV Workshop on Computer Vision in Plant Phenotyping and Agriculture (CVPPA). Pages 572-581. 2023.
[bibtex] [pdf] [abstract]
The performance of a machine learning model is characterized by its ability to accurately represent the input-output relationship and its behavior on unseen data. A prerequisite for high performance is that causal relationships of features with the model outcome are correctly represented. This work analyses the causal relationships by investigating the relevance of features in machine learning models using conditional independence tests. For this, an attribution method based on Pearl's causality framework is employed. Our presented approach analyzes two data-driven models designed for the harvest-readiness prediction of cauliflower plants: one base model and one model where the decision process is adjusted based on local explanations. Additionally, we propose a visualization technique inspired by Partial Dependence Plots to gain further insights into the model behavior. The experiments presented in this paper find that both models learn task-relevant features during fine-tuning when compared to the ImageNet pre-trained weights. However, both models differ in their feature relevance, specifically in whether they utilize the image recording date. The experiments further show that our approach reveals that the adjusted model reduces the trends for the observed biases. Furthermore, the adjusted model maintains the desired behavior for the semantically meaningful feature of cauliflower head diameter, predicting higher harvest-readiness scores for higher feature realizations, which is consistent with existing domain knowledge. The proposed investigation approach can be applied to other domain-specific tasks to aid practitioners in evaluating model choices.
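The proposed visualization follows the spirit of Partial Dependence Plots; a minimal stand-in (all names illustrative) fixes one feature, such as head diameter, at each grid value and averages the model's harvest-readiness score:

```python
import numpy as np

def partial_dependence(model, X, feat_idx, num_points=20):
    """Partial-dependence-style curve for one tabular feature.

    X: (N, D) feature matrix; model: callable mapping X to scores.
    Returns grid values and the mean model score after intervening on
    the chosen feature.
    """
    grid = np.linspace(X[:, feat_idx].min(), X[:, feat_idx].max(), num_points)
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feat_idx] = v              # set the feature for all samples
        curve.append(model(X_mod).mean())   # average predicted readiness
    return grid, np.asarray(curve)
```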
Niklas Penzel, Joachim Denzler:
Interpreting Art by Leveraging Pre-Trained Models.
International Conference on Machine Vision and Applications (MVA). Pages 1-6. 2023.
[bibtex] [doi] [abstract]
In many domains, so-called foundation models were recently proposed. These models are trained on immense amounts of data resulting in impressive performances on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining these models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models. Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.
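One plausible way to combine the two pre-trained models, sketched under assumptions (the mapping network and prefix length are illustrative; the paper's exact architecture may differ): project the CLIP image embedding to a sequence of prefix embeddings that condition GPT-2.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, GPT2LMHeadModel

class ClipToGPT2(nn.Module):
    """Bridge a CLIP image encoder to GPT-2 via learned prefix embeddings.

    A linear layer maps the CLIP image embedding to `prefix_len` pseudo-token
    embeddings that are prepended to the GPT-2 input (a simplified sketch of
    combining the two pre-trained models).
    """

    def __init__(self, prefix_len=8):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        d = self.gpt2.config.n_embd
        self.map = nn.Linear(self.clip.config.projection_dim, prefix_len * d)
        self.prefix_len, self.d = prefix_len, d

    def forward(self, pixel_values, input_ids):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        prefix = self.map(img).view(-1, self.prefix_len, self.d)
        tokens = self.gpt2.transformer.wte(input_ids)   # token embeddings
        embeds = torch.cat([prefix, tokens], dim=1)     # prefix conditions GPT-2
        return self.gpt2(inputs_embeds=embeds).logits
```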
2022
Niklas Penzel, Christian Reimers, Paul Bodesheim, Joachim Denzler:
Investigating Neural Network Training on a Feature Level using Conditional Independence.
ECCV Workshop on Causality in Vision (ECCV-WS). Pages 383-399. 2022.
[bibtex] [pdf] [doi] [abstract]
There are still open questions about how the learned representations of deep models change during the training process. Understanding this process could aid in validating the training. Towards this goal, previous works analyze the training in the mutual information plane. We use a different approach and base our analysis on a method built on Reichenbach’s common cause principle. Using this method, we test whether the model utilizes information contained in human-defined features. Given such a set of features, we investigate how the relative feature usage changes throughout the training process. We analyze multiple networks training on different tasks, including melanoma classification as a real-world application. We find that over the training, models concentrate on features containing information relevant to the task. This concentration is a form of representation compression. Crucially, we also find that the selected features can differ between training from scratch and fine-tuning a pre-trained network.
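The underlying test reduces "does the model use this feature?" to a conditional independence question. A linear-Gaussian stand-in (the paper's tests are more general) regresses the conditioning variable out of both the feature and the model output, then checks whether the residuals still correlate:

```python
import numpy as np
from scipy import stats

def partial_corr_ci_test(feature, prediction, condition):
    """Conditional independence check via partial correlation.

    feature, prediction, condition: (N,) arrays. A small p-value suggests
    the feature and the model output remain dependent given the
    conditioning variable, i.e., the model plausibly uses the feature.
    """
    def residuals(a, z):
        Z = np.column_stack([np.ones_like(z), z])
        beta, *_ = np.linalg.lstsq(Z, a, rcond=None)
        return a - Z @ beta

    r_feat = residuals(feature, condition)
    r_pred = residuals(prediction, condition)
    rho, p_value = stats.pearsonr(r_feat, r_pred)
    return rho, p_value
```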
2021
Christian Reimers, Niklas Penzel, Paul Bodesheim, Jakob Runge, Joachim Denzler:
Conditional Dependence Tests Reveal the Usage of ABCD Rule Features and Bias Variables in Automatic Skin Lesion Classification.
CVPR ISIC Skin Image Analysis Workshop (CVPR-WS). Pages 1810-1819. 2021.
[bibtex] [pdf] [web] [abstract]
Skin cancer is the most common form of cancer, and melanoma is the leading cause of cancer-related deaths. To improve the chances of survival, early detection of melanoma is crucial. Automated systems for classifying skin lesions can assist with initial analysis. However, if we expect people to entrust their well-being to an automatic classification algorithm, it is important to ensure that the algorithm makes medically sound decisions. We investigate this question by testing whether two state-of-the-art models use the features defined in the dermoscopic ABCD rule or whether they rely on biases. We use a method that frames supervised learning as a structural causal model, thus reducing the question whether a feature is used to a conditional dependence test. We show that this conditional dependence method yields meaningful results on data from the ISIC archive. Furthermore, we find that the selected models incorporate asymmetry, border and dermoscopic structures in their decisions but not color. Finally, we show that the same classifiers also use bias features such as the patient's age, skin color or the existence of colorful patches.
Niklas Penzel, Christian Reimers, Clemens-Alexander Brust, Joachim Denzler:
Investigating the Consistency of Uncertainty Sampling in Deep Active Learning.
DAGM German Conference on Pattern Recognition (DAGM-GCPR). Pages 159-173. 2021.
[bibtex] [pdf] [web] [doi] [abstract]
Uncertainty sampling is a widely used active learning strategy to select unlabeled examples for annotation. However, previous work hints at weaknesses of uncertainty sampling when combined with deep learning, where the amount of data is even more significant. To investigate these problems, we analyze the properties of the latent statistical estimators of uncertainty sampling in simple scenarios. We prove that uncertainty sampling converges towards some decision boundary. Additionally, we show that it can be inconsistent, leading to incorrect estimates of the optimal latent boundary. The inconsistency depends on the latent class distribution, more specifically on the class overlap. Further, we empirically analyze the variance of the decision boundary and find that the performance of uncertainty sampling is also connected to the overlap of the class regions. We argue that our findings could be the first step towards explaining the poor performance of uncertainty sampling combined with deep models.
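For reference, the strategy under analysis can be stated in a few lines; this entropy-based variant is one common instantiation (margin and least-confidence scores work analogously):

```python
import numpy as np

def uncertainty_sampling(probs, k=1):
    """Select the k unlabeled examples the model is least certain about.

    probs: (N, C) array of predicted class probabilities. Returns the
    indices of the k samples with the highest predictive entropy.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:]   # most uncertain samples last
```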