Michal Štefánik

Researcher in Natural Language Processing


Welcome to my webpage! I am a final-year PhD researcher at the Faculty of Informatics of Masaryk University and an NLP team lead at Gauss Algorithmic.

My research interests revolve around the robustness of language models. Among other areas, this closely relates to domain adaptation, generalization, quality estimation, and robust low-resource applications.

Apart from my own research, I am the founder of the student Transformers Club, whose alumni have received several Dean's awards, an SVOC prize, and first place in the NAACL DADC competition.

Over the last five years, I have also contributed to and led the delivery of industrial applications of recent NLP research in multilingual language generation and entity recognition, all the way from research ideas to scalable, containerized deployments that now serve NLP technologies to end users daily.


Below you can find references to my most recent publications.


Concept-aware Training Improves In-context Learning of Language Models

Under review, available here.

Previous work curating in-context learning models assumes that ICL emerges from vast over-parametrization or the scale of multitask training. However, recent theoretical work attributes ICL emergence to specific properties of training data and creates functional in-context learners in small-scale, synthetic settings. Inspired by these findings, we propose a Concept-aware Training (CoAT) method, constructing training scenarios such that it is beneficial for the LM to capture the analogical reasoning concepts. We find that CoAT's data sampling consistently improves models' ICL on unseen reasoning tasks, enabling in-context learners trained with CoAT on only two QA datasets to perform comparably to models trained on over 1,600 tasks. Our analyses attribute some of CoAT's empirical improvements to an enhanced ability to benefit from natural concepts in demonstrations and to the mitigation of models' over-reliance on their learned semantic priors.
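As an illustration, the core of concept-aware demonstration sampling can be sketched as preferring demonstrations annotated with a reasoning concept shared with the predicted sample. The data format and function names below are hypothetical, not the paper's implementation:

```python
def sample_demonstrations(pool, target_concepts, k=2):
    """Prefer demonstrations sharing at least one reasoning concept
    with the predicted sample; fall back to the rest of the pool."""
    sharing = [ex for ex in pool if ex["concepts"] & target_concepts]
    others = [ex for ex in pool if not (ex["concepts"] & target_concepts)]
    return (sharing + others)[:k]


# Toy pool of QA examples, each tagged with the concepts it exercises.
pool = [
    {"q": "What is 2 + 3?", "concepts": {"addition"}},
    {"q": "Who wrote Hamlet?", "concepts": {"authorship"}},
    {"q": "What is 10 + 7?", "concepts": {"addition"}},
]
demos = sample_demonstrations(pool, target_concepts={"addition"}, k=2)
```

Here, both selected demonstrations exercise the same concept as the target, so the model is rewarded for capturing the shared analogical relation rather than surface features.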

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

Lukáš Mikula*, Michal Štefánik*, Marek Petrovič, Petr Sojka (*equal)
In Proceedings of EACL 2024: Main track · 🥇 Best submission in NAACL DADC: Better training data challenge

We propose a framework for measuring the scale of models' reliance on any identified spurious feature, and measure the size of such reliance for some previously reported features while uncovering several new ones. We assess robustness towards a large set of known and newly found prediction biases for a variety of pre-trained models and state-of-the-art debiasing methods in Question Answering (QA) and compare it to a resampling baseline. We find that (i) the observed OOD gains of debiasing methods cannot be explained by mitigation or enlargement of the addressed bias, and subsequently show that (ii) the biases are vastly shared among QA datasets. Our findings motivate future work to refine the reports of LLMs' robustness to the level of specific spurious correlations.
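A minimal proxy for this kind of measurement — far simpler than the paper's full framework — is the accuracy gap between samples where a spurious feature agrees with the label and samples where it conflicts with it:

```python
def bias_reliance(predictions, labels, biased_feature):
    """Accuracy gap between bias-aligned and bias-conflicting samples.

    A large positive gap suggests the model leans on the spurious
    feature.  Both subsets are assumed non-empty in this sketch.
    """
    def _acc(hits):
        return sum(hits) / len(hits)

    aligned = [p == l for p, l, b in zip(predictions, labels, biased_feature) if b == l]
    conflicting = [p == l for p, l, b in zip(predictions, labels, biased_feature) if b != l]
    return _acc(aligned) - _acc(conflicting)
```

A model that is perfect whenever the bias points to the right answer, but always wrong when it does not, would score a gap of 1.0 — full reliance on the shortcut.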

Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems

Marek Kadlčík*, Michal Štefánik*, Ondřej Sotolář, Vlastimil Martinek (*equal)

Language models are notoriously inclined to make factual errors in tasks requiring arithmetic reasoning. To enable language models to circumvent this deficiency and offload critical computation to a symbolic system, we create a collection of Calc-X datasets that demonstrate the appropriate use of a calculator in reasoning chains. We survey and unify several existing chain-of-thought datasets into a proposed novel format, resulting in a standard collection of over 300,000 samples requiring arithmetic reasoning. Finally, we use the new collection to train open-source calculator-assisted language models and show that models trained on Calc-X almost double the accuracy of generating correct results compared to baselines. We make all Calc-X datasets and models publicly available.
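A minimal sketch of the offloading idea: the model's reasoning chain marks arithmetic sub-expressions for an external calculator, and the host evaluates them symbolically. The `<gadget>…</gadget>` markup and function names here are illustrative assumptions; the actual Calc-X format may differ:

```python
import ast
import operator
import re

# Hypothetical markup for calculator calls inside a reasoning chain.
CALL_RE = re.compile(r"<gadget>(.*?)</gadget>")

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Safely evaluate an arithmetic AST (numbers and + - * / only)."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")

def offload_to_calculator(chain: str) -> str:
    """Append the symbolic result after each calculator call in a chain."""
    def run(match):
        expr = match.group(1)
        result = _eval(ast.parse(expr, mode="eval").body)
        return f"<gadget>{expr}</gadget><output>{result}</output>"
    return CALL_RE.sub(run, chain)
```

During training on such data, the model learns to emit the call and then condition on the calculator's `<output>` instead of hallucinating the arithmetic itself.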

Can In-context Learners Learn a Reasoning Concept from Demonstrations?

Michal Štefánik, Marek Kadlčík

Recent work analysing the functioning of in-context learners show that instead of learning new associations from the input, models largely rely on their pre-trained knowledge, such as the sentiment of the labels. We argue that the in-context learning evaluations using a random demonstrations can not disentangle models' reliance on such features, as random demonstrations rarely present functional relations useful for prediction. Hence, we propose to evaluate in-context learners with demonstrations sharing with predicted sample a specific, informative reasoning concept, which we extract from human explanations. We find that most of the recent in-context learners can not consistently benefit from the demonstrated concepts, irrespective of the model size. However, some models are more sensitive to concepts than others, such as T0 models, which can benefit from concepts in 7 of 8 evaluation scenarios.

Soft Alignment Objectives for Robust Adaptation of Language Generation

Michal Štefánik, Marek Kadlčík, Petr Sojka

Traditional adaptation by continued in-domain training weakens the model's ability to generalize to other domains, making the open-ended deployment of these models prone to errors. This work introduces two novel objectives grounded in the semantic similarity of the generated hypothesis to the reference. We show that (1) grounding the training signal in semantic similarity can mitigate the catastrophic forgetting of domain adaptation, while (2) in many cases improving performance on the adapted domain, (3) with negligible additional compute costs. In a broader sense, our objectives, grounded in a soft token-level alignment, pioneer the exploration of the middle ground between efficient but narrow exact-match token-level objectives and expressive but computationally and resource-intensive sentence-level objectives.
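The intuition of grounding a training signal in soft token-level alignment can be sketched as scoring each hypothesis token by its best-matching reference token in embedding space. This toy loss is illustrative only, not the paper's actual objective:

```python
import math

def cosine(u, v):
    """Cosine similarity of two (non-zero) embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_alignment_score(hyp_embs, ref_embs):
    """Mean best-match similarity of each hypothesis token to the reference."""
    return sum(max(cosine(h, r) for r in ref_embs) for h in hyp_embs) / len(hyp_embs)

def soft_alignment_loss(hyp_embs, ref_embs):
    # Semantically close hypotheses get a small loss even when
    # their tokens do not exactly match the reference.
    return 1.0 - soft_alignment_score(hyp_embs, ref_embs)
```

Unlike an exact-match objective, a paraphrase whose token embeddings sit close to the reference incurs only a small penalty here, which is what softens the training signal.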

Conference talk ⬇️

Resources and Few-shot Learners for In-context Learning in Slavic Languages

Michal Štefánik, Marek Kadlčík, Piotr Gramacki, Petr Sojka
In Proceedings of EACL SlavicNLP 2023 workshop · 🥇 Best paper award

In this work, we collect the infrastructure necessary for training and evaluation of ICL in a selection of Slavic languages: Czech, Polish, and Russian. We link a diverse set of datasets and cast these into a unified instructional format through a set of transformations and newly crafted templates written purely in the target languages. Using the newly curated dataset, we evaluate a set of the most recent in-context learners and compare their results to the supervised baselines. Finally, we train, evaluate, and publish a set of in-context learning models. We find that massive multitask training can be outperformed by single-task training in the target language, uncovering the potential for specializing in-context learners to the language(s) of their application.


Applications of deep language models for reflective writings

Jan Nehyba*, Michal Štefánik* (*equal)

Social sciences exhibit many cognitively complex problems requiring high qualification, whose resolution relies on often-subjective expert judgements. One such problem, which we study in this work, is reflection analysis in the writings of student teachers. We perform a variety of experiments on how to efficiently address data collection for applications exhibiting a high level of annotator subjectivity. Additionally, we demonstrate the great potential of cross-lingual transfer of multilingual language models in the analysis of reflective writing. We make our resulting datasets and models freely available.

Methods for Estimating and Improving Robustness of Language Models

Michal Štefánik

Despite their outstanding performance, large language models (LLMs) suffer notorious flaws related to their preference for simple, surface-level textual relations over the full semantic complexity of the problem. This proposal investigates a common denominator of this problem: their weak ability to generalise outside of the training domain. We survey diverse research directions providing estimates of model generalisation ability and find that incorporating some of these measures into training objectives leads to enhanced distributional robustness of neural models. Based on these findings, we present future research directions towards enhancing the robustness of LLMs.

Conference talk ⬇️

Adaptor: Objective-Centric Adaptation Library for Language Models

Michal Štefánik, Vít Novotný, Nikola Groverová, Petr Sojka

This paper introduces the Adaptor library, which transposes the traditional model-centric approach composed of pre-training and fine-tuning steps into an objective-centric approach, composing training from applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation, including multitask training, custom objective development, dynamic training curricula, and domain adaptation, and demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.
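A hypothetical sketch of the objective-centric idea — not the Adaptor API itself — where each objective owns its data and loss computation, and a sampling schedule interleaves objectives per training step:

```python
import random

class Objective:
    """Stand-in for a training objective bundling its own data and loss."""

    def __init__(self, name, batches):
        self.name = name
        self.batches = list(batches)

    def loss(self, step):
        # Cycle through this objective's own batches; the mean is a
        # placeholder for a real per-objective loss computation.
        batch = self.batches[step % len(self.batches)]
        return sum(batch) / len(batch)

def train(objectives, steps, seed=0):
    """Interleave objectives per step via a (here: uniform) schedule."""
    rng = random.Random(seed)
    log = []
    for step in range(steps):
        objective = rng.choice(objectives)
        log.append((objective.name, objective.loss(step)))
    return log
```

The point of the design is that adding a new objective (e.g., a domain-adaptation term) only means appending another `Objective` to the list, rather than rewriting the training loop.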

Introduction talk ⬇️

When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting

Vít Novotný, Michal Štefánik, Eniafe Festus Ayetiran, Petr Sojka, Radim Řehůřek

We propose a constrained positional model, which adapts the sparse attention mechanism from neural machine translation to improve the speed of the positional model. Our constrained model outperforms the positional model on language modelling and trains twice as fast.


RegEMT: Regressive Ensemble for Machine Translation Quality Evaluation

Michal Štefánik, Vít Novotný, Petr Sojka

This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop. In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settings, we also demonstrate that an ensemble approach is well-applicable to unseen languages. Furthermore, we identify a strong reference-free baseline that consistently outperforms the commonly-used BLEU and METEOR measures and significantly improves our ensemble’s performance.
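The regressive-ensemble idea can be illustrated with a toy example: fit a simple regressor from each metric's scores to human judgements, then combine the per-metric predictions. The paper's ensemble is more involved; this is only a sketch with hypothetical helper names:

```python
def fit_linear(xs, ys):
    """Closed-form least squares for y ≈ a*x + b over one metric feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return a, mean_y - a * mean_x

def ensemble_predict(metric_scores, fits):
    """Average the per-metric regressors' quality predictions."""
    preds = [a * x + b for x, (a, b) in zip(metric_scores, fits)]
    return sum(preds) / len(preds)
```

Regressing each metric onto human scores first puts all metrics on a common scale, so averaging them is meaningful even when their raw ranges differ wildly.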

Ensembling Ten Math Information Retrieval Systems: MIRMU and MSM @ ARQMath 2021

Vít Novotný, Michal Štefánik, Dávid Lupták, Martin Geletka, Petr Sojka

The ARQMath Community Question Answering (CQA) competition challenges open-domain QA systems to find answers to a set of as-yet-unanswered questions on Math Stack Exchange, potentially providing users with responses to all new but answerable questions. In our submission, we create an ensemble of ten “weak” individual systems that vote to provide answers to unseen questions. Our submission, the best-performing among all automated systems, shows that such an ensemble approach can be more robust on unseen questions than single-model neural systems.
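The voting step can be sketched as rank-weighted voting akin to reciprocal-rank fusion; the submission's actual voting scheme may differ from this simplification:

```python
from collections import Counter

def ensemble_rank(system_rankings, top_k=3):
    """Fuse ranked answer lists from several retrieval systems.

    Each answer earns 1/(rank+1) from every system that retrieved it,
    so answers ranked highly by many weak systems rise to the top.
    """
    votes = Counter()
    for ranking in system_rankings:
        for rank, answer_id in enumerate(ranking):
            votes[answer_id] += 1.0 / (rank + 1)
    return [answer for answer, _ in votes.most_common(top_k)]
```

An answer that several systems place near the top beats one that a single system ranks first, which is exactly the robustness-through-agreement effect the ensemble relies on.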


NLP Scientist - Team Lead

Gauss Algorithmic
Guiding applications of the most recent NLP research from whiteboards to users.

Creating ideas that widen the applicability of NLP to novel areas, while coordinating delivery within a team of three NLP scientists and multiple software developers, operations engineers, business developers, and copywriters.

See our blogs, case studies, demos or open-source projects.
March 2021 - now

Technical lead

Intelligent Back Office: Grant Project

Coordination of a research team of 5–6 researchers within a larger, multi-organization grant project focused on improving document-processing quality in specialised domains.

Our research objectives are (a) to enhance the quality of domain-specific OCR text extraction utilising semi-supervised language modelling and (b) to utilise relative positional information of the document segments for a more accurate extraction of the named entities.

January 2022 - May 2023

NLP Scientist

Gauss Algorithmic
Creating prototypes of NLP applications for specific use cases in multilingual classification, named entity recognition, and language generation.

Responsibilities ranging from communicating the customers' expectations to stable deployments of containerized applications.

Implementations using Python & PyTorch, deployments using CI, Docker & Kubernetes in AWS and Google Cloud.

June 2018 - March 2021

Junior Software Developer

Red Hat Software
Enhancing the company's search engines with semantic text representations and classification, within Searchisko and other open-source projects.

Application of NLP technologies such as Word2Vec and FastText. Implementations in Python & Java, deployments in OpenShift.

March 2015 - June 2018

Stream Processing Developer

CSIRT: Cyber Security Team of Masaryk University
Research & development of intrusion detection methods in scalable, streaming paradigms.

Implementations of unsupervised and semi-supervised methods, Apache Spark & Python. Open-sourced in Stream4Flow project.
March 2016 - September 2017


Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

EACL 2024 Main conference talk in Malta, March 2024

A 12-minute overview of our work exploring the limits of evaluating the robustness of language models in Question Answering.

Robustness of Language Models and Perspectives of Modularization

Language Technology seminar, University of Helsinki, March 2024

A talk covering recent methods for improving the robustness of language models and the role of modularization.
Presentation slides with links to the literature.


Soft Alignment Objectives for Robust Adaptation of Language Generation

ACL 2023 conference in Toronto, Canada, July 2023

A 6 min exposition of our ACL paper proposing more robust training objectives for language generation.

Learning to Learn: Hands-on Tutorial in Using and Improving Few-shot Language Models

Workshop at the Machine Learning Prague conference, Prague, Czechia, June 2023

Organisation of a workshop covering recent topics in in-context learning, with hands-on tutorials.
Workshop github & Presentation slides with links to the literature.


Resources and Few-shot Learners for In-context Learning in Slavic Languages

EACL 2023: Slavic NLP workshop in Dubrovnik, Croatia, May 2023 · 🥇 Best paper award

A 10 min overview of our work in building and evaluating in-context learning in Slavic languages.

Learning to Learn: Training Language Models to Understand Tasks from Few Examples

Community talk: Machine Learning MeetUp Brno, October 2022 (Offline)

Informal presentation of our methodology and trained models for few-shot learning in Czech and other non-English languages.
Medium Blog & Event & Presentation with links & HuggingFace models.


Mitigating Biases of QA Models by Simple Resampling Methods

2022 Annual Conference of the NAACL: DADC Workshop, July 2022

Results announcement and recording of the Supersamplers team presentation, live at the NAACL 2022 DADC workshop in Seattle.

Methods for Estimating and Improving Robustness of Language Models

NAACL 2022: SRW workshop, July 2022

A 5-minute talk covering my thesis proposal, presented at the NAACL 2022 SRW in Seattle.

Adaptor: Objective-Centric Adaptation Library for Language Models

60th Annual Meeting of the ACL, May 2022

A blitz 2-minute introduction video of the Adaptor library, presented at ACL 2022 in Dublin.

Other Activities


Course: Introduction to Information Retrieval

Over the past three years, I have given practicals for 100+ master's students on the essentials of text representations, indexing methods, and machine learning approaches to full-text search.

I have also initiated the redesign of the course into a semester-long competition based on an easy-to-use evaluation framework. According to semester student surveys, the competition model enhanced students' engagement and interest in NLP topics. The competition results also allow us to reach out to the best students with an offer of further research cooperation.

Springs 2019 - now


Transformers Club & others

I have supervised numerous bachelor's and master's theses on topics closely related to the robustness of language models. For the more ambitious students, I organize weekly meetings of the Transformers Club, a shared platform for pursuing creative ideas, peer reviewing, and organizing teams to address more complex problems.

Theses that emerged from the Club and/or that I have supervised:

Autumn 2020 - now


*ACL Conferences, ACL Rolling Review & others

I give back to the community by volunteering for reviewing, proofreading and providing feedback to other researchers whenever it can help. Feel free to get in touch if you're interested!

Away From Keyboard

Reading the most recent pre-prints is fun, but there's a lot of other great stuff to see and do.

Whenever possible, I love spending time moving with unlimited open space ☁ ⬆ above, best done on a bike 🚲 or skis ⛷, with far mountain views ⛰☀️ and a taste of a filtered brew ☕ .