Ihab Asaad, M.Sc.

Address: Computer Vision Group
Department of Mathematics and Computer Science
Friedrich Schiller University of Jena
Ernst-Abbe-Platz 2
07743 Jena
Germany
Phone: +49 (0) 3641 9 46335
E-mail: ihab (dot) asaad (at) uni-jena (dot) de
Room: 1224
Links: GitHub
Curriculum Vitae
since 2023 | PhD Student
Project: “Sensorized Surgery: Optically guided precision surgery by real-time AI-interpreted multimodal imaging with continuous sensory feedback.”
Computer Vision Group, Friedrich Schiller University Jena
2022 – 2023 | M.Sc. Signal and Image Processing Methods and Applications
Master Thesis: “Self-supervised Learning of Speech Representations, Application to Speech Inpainting”
Grenoble Institute of Technology, Phelma, France
2019 – 2022 | M.Sc. Control in Technical Systems
Master Thesis: “Development of a Control System for an Unmanned Aerial Vehicle of the Bicopter Type”
Bauman Moscow State Technical University, Russia
2013 – 2018 | B.Sc. Electronic Systems Engineering
Higher Institute for Applied Sciences and Technology, Syria
Research Interests
- Medical Imaging and AI
- Human-Computer Interaction
Publications
2025
Ihab Asaad, Maha Shadaydeh, Joachim Denzler:
Gradient Extrapolation for Debiased Representation Learning.
2025.
[bibtex] [pdf] [doi] [abstract]
Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When these unintended associations between non-target attributes and target labels are absent in the test data, they lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlation to define the target gradient as the linear extrapolation of the two gradients computed from each batch's loss. It is demonstrated that the extrapolated gradient, if directed toward the gradient of the batch with the smaller amount of spurious correlation, can guide the training process toward learning a debiased model. GERNE can serve as a general framework for debiasing, with methods such as ERM, reweighting, and resampling shown as special cases. Theoretical upper and lower bounds on the extrapolation factor are derived to ensure convergence. By adjusting this factor, GERNE can be adapted to maximize either the Group-Balanced Accuracy (GBA) or the Worst-Group Accuracy. The proposed approach is validated on five vision benchmarks and one NLP benchmark, demonstrating competitive and often superior performance compared to state-of-the-art baseline methods.
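To illustrate the core idea of the abstract above, here is a minimal sketch of a gradient extrapolation step. This is not the paper's implementation: the function name, the exact parameterization of the extrapolation factor, and the toy gradients are all assumptions made for illustration.

```python
import numpy as np

def extrapolated_gradient(g_biased, g_less_biased, c):
    """Linear extrapolation of two batch gradients (illustrative sketch only).

    g_biased:      gradient from the batch with more spurious correlation
    g_less_biased: gradient from the batch with less spurious correlation
    c:             extrapolation factor; c = 0 recovers g_biased, c = 1
                   recovers g_less_biased, and c > 1 extrapolates beyond
                   the less-biased gradient (the debiasing direction
                   described in the abstract).
    """
    return g_biased + c * (g_less_biased - g_biased)

# Toy demo: two gradient vectors computed from two batches' losses.
g1 = np.array([1.0, 0.0])  # batch with more spurious correlation
g2 = np.array([0.0, 1.0])  # batch with less spurious correlation
target_grad = extrapolated_gradient(g1, g2, c=1.5)
```

The key point is that with `c > 1` the update direction moves past the less-biased batch's gradient, rather than merely interpolating between the two, which is what distinguishes extrapolation from simple reweighting.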
2024
Ihab Asaad, Maxime Jacquelin, Olivier Perrotin, Laurent Girin, Thomas Hueber:
Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting.
arXiv preprint arXiv:2405.20101. 2024.
[bibtex] [pdf] [web] [abstract]
Most speech self-supervised learning (SSL) models are trained with a pretext task that consists of predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). Learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context, i.e., fulfilling a downstream task that is very similar to the pretext task. To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input: freezing one and fine-tuning the other, and vice versa. The performance of both approaches was assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., with the position of the mask known or unknown, respectively), using different objective metrics and a perceptual evaluation. Results show that while both solutions can correctly reconstruct signal portions of up to 200 ms (and even 400 ms in some cases), fine-tuning the SSL encoder yields more accurate signal reconstruction in the single-speaker setting, whereas freezing it (and training the neural vocoder instead) is the better strategy when dealing with multi-speaker data.
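The freeze-one, fine-tune-the-other strategy from the abstract can be sketched as follows. This is an illustration only: the tiny `nn.Sequential` stacks stand in for HuBERT and HiFiGAN (the real models are not reproduced here), and the layer sizes are arbitrary assumptions.

```python
import torch.nn as nn

# Placeholder networks standing in for the real models (assumed shapes):
# "encoder" plays the role of the HuBERT SSL encoder,
# "vocoder" plays the role of the HiFiGAN neural vocoder.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
vocoder = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 1))

def freeze(module: nn.Module) -> None:
    # Excluding parameters from gradient computation freezes the module.
    for p in module.parameters():
        p.requires_grad = False

# Strategy reported as better for multi-speaker data: freeze the SSL
# encoder and train only the vocoder on its representations.
freeze(encoder)

# The optimizer would then receive only the still-trainable parameters.
trainable = [p for p in vocoder.parameters() if p.requires_grad]
```

The opposite strategy (the one the abstract reports as better in the single-speaker setting) is obtained symmetrically: `freeze(vocoder)` and fine-tune the encoder instead.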