Michal Štefánik

Researcher in Artificial Intelligence & Natural Language Processing

stefanik.m@mail.muni.cz

Welcome to my website! I am a Postdoctoral Researcher in the Language Technology group at the University of Helsinki.

My research focuses on the robustness of language models, including a reliable evaluation of models' generalization but also methods to improve models' reliability by curating better data or improving existing training methods. My research, presented at highly selective *ACL conferences, delivers better models for low-resource applications, instructional models with more human-like reasoning, or question-answering systems robust to prediction shortcuts. My research was recognized by several awards (see below) and recently by Summa Cum Laude and Dean's Award for my dissertation thesis.

I am a proud founder and leader of TransformersClub™, a platform for supporting students in pursuing their research ideas, with our members awarded several international prizes and papers in top-tier NLP/AI conferences.

Over the last six years, I have also led the delivery of many industrial applications of cutting-edge language technologies in Gauss Algorithmic, including multilingual language generation or entity recognition, all the way from the research ideas to the scalable deployments, today serving the fascinating NLP technologies to thousands of users.

Do not hesitate to connect or reach out if our research interests align!

Research

Below you can find references to my most recent publications. An overview of my research can now also be found in my dissertation 📄

2024

Self-Training Language Models in Arithmetic Reasoning

Marek Kadlčík, Michal Štefánik (equal)

In Proceedings of EMNLP 2024: Findings track & ICLR 2024 LLMAgents.

Language models achieve impressive results in tasks involving complex multistep reasoning, but scaling these capabilities further traditionally requires expensive collection of more annotated data. In this work, we explore the potential of improving the capabilities of language models using automated feedback based on the validity of their predictions in arithmetic reasoning (self-training).

We find that models can substantially improve in both single-round (offline) and online self-training. In the offline setting, supervised methods are able to deliver gains comparable to preference optimization, but in online self-training, preference optimization shows to largely outperform supervised training thanks to superior stability and robustness on unseen types of problems.

Concept-aware Data Construction Improves In-context Learning of Language Models

Michal Štefánik, Marek Kadlčík, Petr Sojka

In Proceedings of ACL 2024: Findings track & ICLR 2024 Understanding of Foundation Models · 📄 Blog post.

Previous work curating in-context learning models assumes that ICL emerges from vast over-parametrization or the scale of multitask training. However, theoretical work attributes ICL emergence to specific properties of training data and creates functional in-context learners in small-scale, synthetic settings.

Inspired by previous theories, we propose a method to construct training data requiring analogical reasoning. We show the surprising efficiency of training on such data, with resulting models significantly outperforming classic instruction tuning in 41 and 45 out of 60 tasks, while performing on par with models trained on over 1600 tasks while using only two (2) QA tasks. Our analyses attribute these improvements to enhanced robustness of our models towards shortcuts of previous ICL models and improved ability to benefit from informative demonstrations.

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

Lukáš Mikula*, Michal Štefánik*, Marek Petrovič, Petr Sojka (*equal)

In Proceedings of EACL 2024: Main track · 🥇 Best submission in NAACL DADC: Better training data challenge

In this paper, we challenge the commonly-used evaluation of robustness of language models through the lens of their out-of-distribution performance. We propose a metric to measure the reliance of model's performance on a specific feature and measure the reliance on a set of features identified as prediction shortcuts.

We find that the same prediction shortcuts are exposed in all assessed QA datasets, whereas some shortcuts identified in SQuAD are even more impactful in other datasets. This means that counterintuitively, a model relying on shortcut more might reach better OOD performance. Finally, we survey three existing debiasing methods and find that indeed, the OOD gains of these methods come in hand with increased, rather than decreased reliance on prediction shortcuts.

Conference talk ⬇️️ & Announcement of results in NAACL DADC ⬇️️

2023

Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems

Marek Kadlčík*, Michal Štefánik*, Ondřej Sotolář, Vlastimil Martinek (*equal)

In Proceedings of EMNLP 2023: Main track

Language models are notoriously inclined to make factual errors in tasks requiring arithmetic reasoning. To enable language models to circumvent this deficiency and offload a critical computation to a symbolic system, we create a collection of Calc-X datasets that demonstrate the appropriate use of a calculator in reasoning chains.

We survey and unify several existing chain-of-thought datasets into a proposed novel format, resulting in a standard collection of over 300,000 samples requiring arithmetic reasoning. Finally, we use the new collection to train open-source calculator-assisted language models and show that models trained on Calc-X almost double the accuracy of generating correct results compared to baselines, and can outperform larger models of previous work, such as Toolformer, or Llama 2. We make all Calc-X datasets and models publicly available.

Link to Marek's EMNLP talk 🎥

Can In-context Learners Learn a Reasoning Concept from Demonstrations?

Michal Štefánik, Marek Kadlčík

In Proceedings of ACL 2023 Natural Language Reasoning (NLRSE) workshop · 🥇 Best paper award

Recent work analysing the functioning of in-context learners show that instead of learning new associations from the input, models largely rely on their pre-trained knowledge, such as the sentiment of the labels.

We argue that evaluations using a randomly-chosen demonstrations can not disentangle models' reliance on pre-trained knowledge from the ability to learn new functional relations, as most of such demonstrations can provide only limited information. Hence, we propose to evaluate in-context learners with demonstrations sharing a specific, informative reasoning concept with predicted sample. We find that most of the recent in-context learners can not consistently benefit from the demonstrated concepts, irrespective of the model size. However, some models are more sensitive to concepts than others, such as T0 models, which can benefit from concepts in 7 of 8 evaluation scenarios.

Soft Alignment Objectives for Robust Adaptation of Language Generation

Michal Štefánik, Marek Kadlčík, Petr Sojka

In Proceedings of the 61th Annual Meeting of the ACL (ACL 2023): Main track · 📄 Blog post

Fine-tuning of pre-trained generative language models weakens their ability to generalize, making the open-ended deployment of these models prone to errors like hallucinations or degradations of output text quality.

In this work, we show that adapting the models with modeling the ambiguity of prediction can largely avoid over 95% of the loss caused by traditional fine-tuning.

Conference talk ⬇️️

Resources and Few-shot Learners for In-context Learning in Slavic Languages

Michal Štefánik, Marek Kadlčík, Piotr Gramacki, Petr Sojka

In Proceedings of EACL SlavicNLP 2023 workshop · 🥇 Best paper award

In this work, we collect the infrastructure necessary for training and evaluation of ICL in a selection of Slavic languages: Czech, Polish, and Russian. We link a diverse set of datasets and cast these into a unified instructional format through a set of transformations and newly-crafted templates written purely in target languages.Using the newly-curated dataset, we evaluate a set of the most recent in-context learners and compare their results to the supervised baselines. Finally, we train, evaluate and publish a set of in-context learning models.

We find that the massive multitask training can be outperformed by single-task training in the target language, uncovering the potential for specializing in-context learners to the language(s) of their application.

2022

Applications of deep language models for reflective writings

Jan Nehyba, Michal Štefánik (equal)

In Education and Information Technologies (IF 5.5)

Social sciences exhibit many cognitively complex and highly qualified problems, whose resolution relies on often subjective expert judgements. One of such problems that we study in this work is a reflection analysis in the writings of student teachers.

We perform a variety of experiments on how to efficiently address data collection for applications exhibiting a great level of annotators' subjectivity. Additionally, we demonstrate a great potential in a cross-lingual transfer of multilingual language models in the analysis of reflective writing. We make our resulting datasets and models freely available.

Methods for Estimating and Improving Robustness of Language Models

Michal Štefánik

In Proceedings of NAACL 2022: Student Research Workshop

Despite their outstanding performance, large language models (LLMs) suffer notorious flaws related to their preference for simple, surface-level textual relations over full semantic complexity of the problem. This proposal investigates a common denominator of this problem in their weak ability to generalise outside the training domain.

We survey diverse research directions providing estimations of model generalisation ability and find that incorporating some of these measures in the training objectives leads to enhanced distributional robustness of neural models. Based on these findings, we present future research directions towards enhancing the robustness of LLMs.

Conference talk ⬇️️

Adaptor: Objective-Centric Adaptation Library for Language Models

Michal Štefánik, Vít Novotný, Nikola Groverová, Petr Sojka

In Proceedings of the 60th Annual Meeting of the ACL: Demonstrations

This paper introduces Adaptor library, transposing the traditional model-centric approach composed of pre-training + fine-tuning steps to objective-centric approach, composing the training from applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation in multitask training, custom objectives development, dynamic training curricula, or domain adaptation and demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.

Introduction talk ⬇️️

When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting

Vít Novotný, Michal Štefánik, Eniafe Festus Ayetiran, Petr Sojka, Radim Řehůřek

In JUCS: The Journal of Universal Computer Science (IF 1.3)

We propose a constrained positional model, which adapts the sparse attention mechanism from neural machine translation to improve the speed of the positional model. Our constrained model outperforms the positional model on language modelling and trains twice as fast.

2021

RegEMT: Regressive Ensemble for Machine Translation Quality Evaluation

Michal Štefánik, Vít Novotný, Petr Sojka

In Proceedings of the Sixth Conference on Machine Translation (WMT) · 🥇 Best-performing reference-free metric

This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop.

In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settings, we also demonstrate that an ensemble approach is well-applicable to unseen languages. Furthermore, we identify a strong reference-free baseline that consistently outperforms the commonly-used BLEU and METEOR measures and significantly improves our ensemble’s performance.

Ensembling Ten Math Information Retrieval Systems: MIRMU and MSM @ ARQMath 2021

Vít Novotný, Michal Štefánik, Dávid Lupták, Martin Geletka, Petr Sojka

In CLEF Lab on Answer Retrieval for Questions on Math (ARQMath@CLEF 2021) · 🥇 Best automatic CQA system

ARQMath Community Question Answering (CQA) competition challenges open-domain QA systems to find answers to a set of yet non-answered questions on Math Stack Exchange, potentially providing users with responses to all new but answerable questions.

In our submission, we create an ensemble of ten “weak” individual systems that we let vote to provide answers to unseen questions. Our submission, a best-performing among all the automated systems, shows that such ensemble approach can be more robust on unseen questions than single-model neural systems.

Experience

NLP Scientist - Team Lead

Gauss Algorithmic

Delivering the most recent NLP technologies from whiteboards to real-world users.

• Research and productisation of language technologies for diverse specialized applications, involving language generation, classification, entity recognition, search or recommendation.

• On-premise deployments of generative language models: compute optimisation, scalability, high availability assurance.

• Coordinating a diverse team of tenure NLP scientists, software developers, operations engineers, business developers, and copywriters.

• Communication with clients of diverse cultural backgrounds, public talks, and hands-on workshops at tech conferences and companies.

Case Studies

• We developed a paraphraser that can transcribe technical documentation to a more familiar language, enabling easier comprehension and searchability by visitors. Our solution enabled our customer to attract 12% more visitors to their websites.

• We delivered machine translation models specialized for chat conversations across languages of Southeast Asia. Our on-premise deployment enables our customer to save 10k+ USD per month while delivering translation quality comparable to Google Translate.

See our blogs, NLP case studies, HuggingFace models or open-source projects.

March 2021 - February 2025

Technical lead

Grant Project: Intelligent Back Office

Coordination of the research team of 5-6 researchers within a larger, multi-organization grant project focused on improving document processing quality for specialised domain(s).

Within this project, we (1) enhanced the quality of domain-specific OCR text extraction utilising semi-supervised language modelling and (2) improved the methods to utilise positional information of document segments for a more accurate identification of the named entities.

September 2021 - May 2023

Deep Learning Engineer

Gauss Algorithmic

Creating the prototypes of NLP applications for specific use-cases, in multilingual classification, named entity recognition, or language generation.

• Development of custom architectures and data pipelines.

• Streamlining the experimentation, setting up practices in model and data versioning, reproducibility, integration.

• Implementation of the essential infrastructure around Transformer-based models (classification, generation) in the early Transformers era.

Implementations in Tensorflow and later PyTorch, deployments using CI, Docker & Kubernetes in AWS and Google Cloud.

June 2018 - March 2021

Junior Software Developer

Red Hat Software

Enhancing the company search engines with semantic text representations and classification, within the Searchisko & other open-source projects.

Application of pivotal semantic representation technologies such as Word2Vec, and FastText. Implementations in Python & Java, deployments in OpenShift.

March 2015 - June 2018

Stream Processing Developer

CSIRT: Cyber Security Incident Response Team of Masaryk Univerity

Research & development of intrusion detection methods in scalable, streaming paradigms.

Implementations of unsupervised and semi-supervised Machine Learning methods in Apache Spark & Python. Most of our research is open-sourced in Stream4Flow project.

March 2016 - September 2017

Talks

Robust Language Models One Step at a Time

Kempelen Institute of Information Technologies, June 2025

Invited talk covering recent findings and potential future directions towards more robust Language Models
Presentation slides with links

Why we Should Care about Uncertainty when Training Language Models

Speech Seminar, Brno University of Technology
Invited talk showing new connections of uncertainty in training and resulting qualities of language models

Presentation slides with links

Reliable Language Models and Where to Find Them

Language Technology seminar, University of Helsinki, March 2025

Lightening talk covering most recent work towards improving LMs reliability
Presentation slides with links

Fantastically Robust Language Models and Where to Find Them

Invited talks: Informatics Colloquium at Faculty of Informatics, Masaryk University
Allegro NLP Seminar, Online

Presentation slides: 20min version & 1h version.

Language Models in Autonomous Systems

Community talk: DataMesh Brno, September 2024 (Offline)

Informal talk sharing our experience with deployments of custom models, including applications of autonomous decision agents.
Presentation slides (in Slovak)

Training for Single Correct Prediction Makes Your Model More Fragile

HumanAligned.ai Summer school in Prague, July 2024 (Offline)

A short teaser talk for our Soft Alignment Objectives presented at ACL 2023.
Presentation slides , Medium blog post

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of QA Models

EACL 2024 Main conference talk in Malta, March 2024

A 12 min overview of our work exploring the limits of evaluating robustness of language models in Question answering.

Robustness of Language Models and Perspectives of Modularization

Language Technology seminar, University of Helsinki, March 2024

A presentation of my talk covering recent methods for improving robustness of language models and the role of modularization.
Presentation slides with links to the literature.

Soft Alignment Objectives for Robust Adaptation of Language Generation

ACL 2023 conference in Toronto, Canada, July 2023

A 6 min exposition of our ACL paper proposing more robust training objectives for language generation.

Learning to Learn: Hands-on Tutorial in Using and Improving Few-shot Language Models

Workshop on Machine Learning Prague conference, in Prague, Czechia, June 2023

Organisation of the workshop covering most recent topics with hands-on tutorials in in-context learning.
Workshop github & Presentation slides with links to the literature.

Resources and Few-shot Learners for In-context Learning in Slavic Languages

EACL 2023: Slavic NLP workshop in Dubrovnik, Croatia, May 2023 · 🥇 Best paper award

A 10 min overview of our work in building and evaluating in-context learning in Slavic languages.

Learning to Learn: Training Language Models to Understand Tasks from Few Examples

Community talk: Machine Learning MeetUp Brno, October 2022 (Offline)

Informal presentation of our methodology and trained models for Few-shot in Czech and other Non-English languages.
Medium Blog & Event & Presentation with links & HuggingFace models.

Mitigating Biases of QA Models by Simple Resampling Methods

2022 Annual Conference of the NAACL: DADC Workshop, July 2022

Results announcement and recording of Supersamplers' team presentation, live on NAACL 2022 DADC workshop in Seattle.

Methods for Estimating and Improving Robustness of Language Models

NAACL 2022: SRW workshop, July 2022

A 5min talk covering my thesis proposal presented on NAACL 2022 SRW in Seattle.

Adaptor: Objective-Centric Adaptation Library for Language Models

60th Annual Meeting of the ACL, May 2022

Blitz 2min introduction video of Adaptor library presented on ACL 2022 in Dublin.

Other Activities

Supervisor

Transformers Club & others

I have supervised numerous bachelor's and master's theses on topics closely related to the robustness of language models. For the more ambitious students, I organize the weekly meetings of TransformersClub™, a shared platform for pursuing creative ideas, peer reviewing and organizing teams to address unsolved research challenges. Many of these initiatives end up as a research paper in major ML/NLP venues.

The theses that emerged from the Club or I have supervised:

Michal Spiegel (2024-): Making Language Models Robust Algorithmic Reasoners.
Josef Kuchař (2024-): Visually-grounded SVG Code Generation.
Marek Kadlčík (2022-2024): Improving Arithmetical Reasoning of Language Models: 📄 EMNLP 23 paper, 📄 EMNLP 24 paper, 📄 ICLR LLMAgents paper, 🏆 a Dean's award, 🏆 a Rector's award,
Šárka Ščavnická (2022-2024): Document Understanding through Visual Question Answering.
Dávid Meluš (2022-2024): Enhancing Quality of Optical Character Recognition with specialized language models.
Lukáš Mikula (2021-2023): Think Twice Before You Answer: Mitigating Biases of Question Answering Models; 📄 EACL-Main paper, 🥇 1st place in NAACL DADC shared task Track 2, 🏆 a Dean's award.
Marek Petrovič (2021-2022): One Bit at a Time: Impact of Quantisation on Neural Machine Translation; 🥉 3rd place in SVOC competition, 🏆 a Dean's award.
Matej Meško (2022): Evaluation of a Supervised Approach to Information Retrieval
Martin Geletka (2020-2021): Speeding up inference time of neural machine translation; 🎖 Nomination for a Dean's award.
Petr Mička (2021): Utilisation of language representations for Information Retrieval

Autumn 2020 - now

Lecturer

Course: Introduction to Information Retrieval

Over the five years, I have given practicals for 100+ MSc students on the essentials of text representations, transformers and machine learning applied in indexing and search.

I have also proposed and led the redesign of the course to a Kaggle-like competition based on a newly-created, easy-to-use benchmark framework. End-of-semester surveys found that after this redesign, 31.5% of more students agree with that the course has an enriching value for them. The results of the competition also allow us to easily reach out to the best students with further research opportunities.

Springs 2019 - 2025

Area Chair & Reviewer

*ACL Conferences, ACL Rolling Review & other

I give back to the community by volunteering for reviewing, proofreading and providing feedback to other researchers whenever it can help. Feel free to get in touch if you're interested!

Away From Keyboard

Research can be fun, but there's a lot of other good stuff to do. Whenever I can, I take my shoes 🥾, a bike 🚲 or skis ⛷ and head towards mountains ⛰☀️. I love spending time breathing fresh air with unlimited open space ☁ above my head. These (and any other) activities feel best when combined with a taste of a filtered brew ☕.

Please get in touch if your coffee is undrinkably sour for you. I'll be interested to know your source.