Michal Štefánik

Researcher in Artificial Intelligence & Natural Language Processing

stefanik.m@mail.muni.cz

Welcome to my webpage! I am a last-year PhD Researcher at the Faculty of Informatics of Masaryk University and an NLP Team Lead at Gauss Algorithmic.

My research focuses on the robustness of language models. This includes reliable evaluation of models' generalization, but also methods to improve models' robustness by curating better data or improving existing training methods. My work enables the creation of models that perform better in low-resource settings, the training of instructional models with more human-like reasoning, and the construction of question-answering systems robust to prediction shortcuts.

I am a proud founder of TransformersClub™, a platform for supporting students in pursuing research ideas, whose alumni have received international awards and published papers in top-tier NLP/AI conferences.

Over the last six years, I've also contributed to and led the delivery of many industrial applications of cutting-edge language technologies, including multilingual language generation and entity recognition, all the way from research ideas to scalable deployments that now serve these fascinating NLP technologies to thousands of users.

I expect to obtain my PhD at the end of 2024 and am looking for a new research affiliation afterwards!


Research

Below you can find references to my most recent publications. An overview of my research can also be found in my dissertation.


2024

Self-Training Language Models in Arithmetic Reasoning

Marek Kadlčík, Michal Štefánik (equal)
Presented at the ICLR 2024 LLMAgents workshop; a newer version to appear in Findings of EMNLP 2024.

Language models achieve impressive results in tasks involving complex multistep reasoning, but scaling these capabilities further traditionally requires expensive collection of more annotated data. In this work, we explore the potential of improving the capabilities of language models using automated feedback based on the validity of their predictions in arithmetic reasoning (self-training).

We find that models can substantially improve in both single-round (offline) and online self-training. In the offline setting, supervised methods deliver gains comparable to preference optimization, but in online self-training, preference optimization largely outperforms supervised training thanks to its superior stability and robustness on unseen types of problems.
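The filtering step at the core of offline self-training can be sketched in a few lines. This is a minimal, illustrative version (the `generate` callable stands in for sampling from a language model, and all names are hypothetical, not taken from the paper's codebase): sample several solutions per problem, keep only those whose final answer matches the gold one, and reuse them as new training data.

```python
def is_correct(prediction: str, gold_answer: str) -> bool:
    """Automated feedback: compare the predicted final answer to the gold one."""
    return prediction.strip() == gold_answer.strip()

def self_training_round(problems, generate, k: int = 4):
    """Sample k solutions per problem; keep those ending in a correct answer.

    problems: dicts with 'question' and 'answer' keys.
    generate: callable returning a (reasoning_chain, final_answer) pair.
    """
    new_training_data = []
    for problem in problems:
        for _ in range(k):
            chain, answer = generate(problem["question"])
            if is_correct(answer, problem["answer"]):
                new_training_data.append({"question": problem["question"],
                                          "solution": chain})
    # The collected set can then be used for supervised fine-tuning, or paired
    # with incorrect samples for preference optimization.
    return new_training_data
```

The same filter supports both settings described above: run it once for offline self-training, or repeatedly on the continually updated model for online self-training.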

Concept-aware Data Construction Improves In-context Learning of Language Models

Michal Štefánik, Marek Kadlčík, Petr Sojka

Previous work curating in-context learning models assumes that ICL emerges from vast over-parametrization or the scale of multitask training. However, theoretical work attributes ICL emergence to specific properties of training data and creates functional in-context learners in small-scale, synthetic settings.

We propose a method to construct training data that upsamples these properties. We show the superior efficiency of training on such data, with resulting models significantly outperforming traditional instruction tuning on 41 and 45 out of 60 tasks, and performing comparably to previous models trained on over 1,600 tasks while using only two QA tasks. Our analyses attribute these improvements to an enhanced ability of our in-context learners to benefit from latent reasoning concepts presented in demonstrations, and to a mitigation of models' over-reliance on their learned semantic priors.

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

Lukáš Mikula*, Michal Štefánik*, Marek Petrovič, Petr Sojka (*equal)
In Proceedings of EACL 2024: Main track · 🥇 Best submission in NAACL DADC: Better training data challenge

In this paper, we challenge the commonly-used evaluation of the robustness of language models through the lens of their out-of-distribution performance. We propose a metric to measure the reliance of a model's performance on a specific feature, and apply it to a set of features identified as prediction shortcuts.

We find that the same prediction shortcuts are exposed in all assessed QA datasets, and some shortcuts identified in SQuAD are even more impactful in other datasets. Counterintuitively, this means that a model relying more on a shortcut may reach better OOD performance. Finally, we survey three existing debiasing methods and find that their OOD gains indeed come hand in hand with increased, rather than decreased, reliance on fragile prediction shortcuts.
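One simplified way to quantify reliance on a single feature (a toy reduction of the idea above, not the paper's exact metric) is to split evaluation examples by whether the candidate shortcut alone points to the correct answer, and compare model accuracy on the two splits:

```python
def reliance(examples, predict, shortcut_feature):
    """Accuracy gap between examples where a shortcut agrees with the label
    and examples where it does not. A large positive gap suggests the model
    leans on the shortcut rather than the task semantics.

    examples: dicts with 'input' and 'label'.
    predict: model's prediction function.
    shortcut_feature: label implied by the shortcut alone.
    """
    agree, disagree = [], []
    for ex in examples:
        target = agree if shortcut_feature(ex["input"]) == ex["label"] else disagree
        target.append(ex)

    def accuracy(split):
        if not split:
            return 0.0
        return sum(predict(ex["input"]) == ex["label"] for ex in split) / len(split)

    return accuracy(agree) - accuracy(disagree)
```

A model that perfectly follows the shortcut scores a gap of 1.0; a model that ignores it scores close to 0.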

Conference talk ⬇️ & Announcement of results in NAACL DADC ⬇️

2023

Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems

Marek Kadlčík*, Michal Štefánik*, Ondřej Sotolář, Vlastimil Martinek (*equal)

Language models are notoriously inclined to make factual errors in tasks requiring arithmetic reasoning. To enable language models to circumvent this deficiency and offload a critical computation to a symbolic system, we create a collection of Calc-X datasets that demonstrate the appropriate use of a calculator in reasoning chains.

We survey and unify several existing chain-of-thought datasets into a proposed novel format, resulting in a standard collection of over 300,000 samples requiring arithmetic reasoning. Finally, we use the new collection to train open-source calculator-assisted language models and show that models trained on Calc-X almost double the accuracy of generating correct results compared to baselines, and can outperform larger models of previous work, such as Toolformer or Llama 2. We make all Calc-X datasets and models publicly available.
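The calculator interaction can be illustrated with a small sketch: whenever the generated reasoning chain contains a calculator call, decoding pauses, a symbolic system evaluates the expression, and the result is injected back into the context. The `<gadget>`/`<output>` markup below is a simplification for illustration, not necessarily the exact Calc-X format, and `eval` merely stands in for a proper computer-algebra backend:

```python
import re

def resolve_calls(chain: str) -> str:
    """Replace each <gadget>expr</gadget> call in a reasoning chain with the
    call followed by an <output>result</output> tag, mimicking how a symbolic
    system would answer the model's calculator requests."""
    def evaluate(match: "re.Match") -> str:
        expr = match.group(1)
        # Toy stand-in for a symbolic backend such as sympy; builtins disabled.
        result = eval(expr, {"__builtins__": {}})
        return f"<gadget>{expr}</gadget><output>{result}</output>"
    return re.sub(r"<gadget>(.*?)</gadget>", evaluate, chain)
```

In the trained models, the `<output>` content is produced by the external tool, so the model only needs to learn when to ask for a computation, not how to perform it.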

Link to Marek's EMNLP talk 🎥

Can In-context Learners Learn a Reasoning Concept from Demonstrations?

Michal Štefánik, Marek Kadlčík

Recent work analysing the functioning of in-context learners shows that instead of learning new associations from the input, models largely rely on their pre-trained knowledge, such as the sentiment of the labels.

We argue that evaluations using randomly chosen demonstrations cannot disentangle models' reliance on pre-trained knowledge from their ability to learn new functional relations, as most such demonstrations provide only limited information. Hence, we propose to evaluate in-context learners with demonstrations sharing a specific, informative reasoning concept with the predicted sample. We find that most recent in-context learners cannot consistently benefit from the demonstrated concepts, irrespective of the model size. However, some models are more sensitive to concepts than others; for instance, T0 models can benefit from concepts in 7 of 8 evaluation scenarios.

Soft Alignment Objectives for Robust Adaptation of Language Generation

Michal Štefánik, Marek Kadlčík, Petr Sojka

Fine-tuning of pre-trained generative language models weakens their ability to generalize, making the open-ended deployment of these models prone to errors like hallucinations or degradations of output text quality.

In this work, we show that adapting the models while modeling the ambiguity of prediction avoids over 95% of the generalization loss caused by traditional fine-tuning.
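The underlying intuition can be illustrated with a toy soft-target loss (a simplification for exposition, not the paper's exact objective): instead of penalizing every deviation from a single one-hot reference token, the loss is computed against a softened target distribution that grants some credit to plausible alternatives.

```python
import math

def soft_cross_entropy(predicted_probs, soft_targets):
    """Cross-entropy against a soft target distribution over the vocabulary,
    instead of the usual one-hot reference token. Entries with zero target
    mass are skipped, so plausible alternatives are not fully penalized."""
    return -sum(t * math.log(p)
                for p, t in zip(predicted_probs, soft_targets) if t > 0)
```

With a one-hot target this reduces to standard cross-entropy; spreading target mass over near-synonymous tokens softens the penalty for acceptable paraphrases.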

Conference talk ⬇️

Resources and Few-shot Learners for In-context Learning in Slavic Languages

Michal Štefánik, Marek Kadlčík, Piotr Gramacki, Petr Sojka
In Proceedings of EACL SlavicNLP 2023 workshop · 🥇 Best paper award

In this work, we collect the infrastructure necessary for training and evaluation of ICL in a selection of Slavic languages: Czech, Polish, and Russian. We link a diverse set of datasets and cast these into a unified instructional format through a set of transformations and newly-crafted templates written purely in the target languages. Using the newly-curated dataset, we evaluate a set of the most recent in-context learners and compare their results to supervised baselines. Finally, we train, evaluate, and publish a set of in-context learning models.

We find that the massive multitask training can be outperformed by single-task training in the target language, uncovering the potential for specializing in-context learners to the language(s) of their application.

2022

Applications of deep language models for reflective writings

Jan Nehyba, Michal Štefánik (equal)

Social sciences exhibit many cognitively complex and highly qualified problems whose resolution relies on often subjective expert judgements. One such problem, which we study in this work, is reflection analysis in the writings of student teachers.

We perform a variety of experiments on how to efficiently address data collection for applications exhibiting a great level of annotator subjectivity. Additionally, we demonstrate the great potential of cross-lingual transfer of multilingual language models in the analysis of reflective writing. We make our resulting datasets and models freely available.

Methods for Estimating and Improving Robustness of Language Models

Michal Štefánik

Despite their outstanding performance, large language models (LLMs) suffer notorious flaws related to their preference for simple, surface-level textual relations over the full semantic complexity of the problem. This proposal investigates a common denominator of this problem: their weak ability to generalise outside the training domain.

We survey diverse research directions providing estimations of model generalisation ability and find that incorporating some of these measures in the training objectives leads to enhanced distributional robustness of neural models. Based on these findings, we present future research directions towards enhancing the robustness of LLMs.

Conference talk ⬇️

Adaptor: Objective-Centric Adaptation Library for Language Models

Michal Štefánik, Vít Novotný, Nikola Groverová, Petr Sojka

This paper introduces the Adaptor library, transposing the traditional model-centric approach composed of pre-training and fine-tuning steps into an objective-centric approach, composing the training from applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation, including multitask training, custom objective development, dynamic training curricula, and domain adaptation, and demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.
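The objective-centric idea can be sketched roughly as follows. Note that the names below are hypothetical illustrations, not the actual Adaptor API: each objective owns its data stream and loss, while a schedule decides which objective drives each training step on the shared model.

```python
class Objective:
    """An objective bundles its own data iterator and loss function,
    decoupling 'what is trained' from 'which model is trained'."""

    def __init__(self, name, batches, loss_fn):
        self.name = name
        self.batches = iter(batches)
        self.loss_fn = loss_fn

    def step(self, model):
        """Compute this objective's loss on its next batch."""
        return self.loss_fn(model, next(self.batches))

def train(model, objectives, steps):
    """A simple round-robin schedule alternating between objectives;
    richer curricula would just swap out this selection rule."""
    losses = []
    for i in range(steps):
        objective = objectives[i % len(objectives)]
        losses.append((objective.name, objective.step(model)))
    return losses
```

Multitask training, curricula, and domain adaptation then differ only in which objectives are registered and how the schedule interleaves them.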

Introduction talk ⬇️

When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting

Vít Novotný, Michal Štefánik, Eniafe Festus Ayetiran, Petr Sojka, Radim Řehůřek

We propose a constrained positional model, which adapts the sparse attention mechanism from neural machine translation to improve the speed of the positional model. Our constrained model outperforms the positional model on language modelling and trains twice as fast.

2021

RegEMT: Regressive Ensemble for Machine Translation Quality Evaluation

Michal Štefánik, Vít Novotný, Petr Sojka

This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop.

In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settings, we also demonstrate that the ensemble approach is readily applicable to unseen languages. Furthermore, we identify a strong reference-free baseline that consistently outperforms the commonly-used BLEU and METEOR measures and significantly improves our ensemble's performance.
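The regressive-ensemble idea reduces to fitting a regressor that maps per-segment scores from several base metrics onto human quality judgements. The sketch below uses a tiny gradient-descent least-squares fit in pure Python as a stand-in for the real regressor; all names are illustrative:

```python
def fit_weights(features, targets, lr=0.05, epochs=2000):
    """Fit linear weights mapping base-metric scores to human scores via SGD.

    features: list of per-segment score vectors (one value per base metric).
    targets: corresponding human quality judgements (e.g. MQM scores).
    """
    n_metrics = len(features[0])
    w, b = [0.0] * n_metrics, 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            error = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
            b -= lr * error
    return w, b

def ensemble_score(w, b, x):
    """Combined quality estimate for one segment's metric scores."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

In practice, a regularized regressor from a standard ML library would replace this hand-rolled fit; the point is only that the ensemble learns how much to trust each base metric.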

Ensembling Ten Math Information Retrieval Systems: MIRMU and MSM @ ARQMath 2021

Vít Novotný, Michal Štefánik, Dávid Lupták, Martin Geletka, Petr Sojka

The ARQMath Community Question Answering (CQA) competition challenges open-domain QA systems to find answers to a set of as-yet unanswered questions on Math Stack Exchange, potentially providing users with responses to all new but answerable questions.

In our submission, we create an ensemble of ten "weak" individual systems that we let vote to provide answers to unseen questions. Our submission, the best-performing among all automated systems, shows that such an ensemble approach can be more robust on unseen questions than single-model neural systems.


Experience

NLP Scientist - Team Lead

Gauss Algorithmic
Delivering the most recent NLP technologies from whiteboards to real-world users.

• Research and productisation of language technologies for diverse specialized applications, involving language generation, classification, entity recognition, search, and recommendation.

• On-premise deployments of generative language models: compute optimisation, scalability, high availability assurance.

• Coordinating a diverse team of senior NLP scientists, software developers, operations engineers, business developers, and copywriters.

• Communication with clients of diverse cultural backgrounds, public talks, and hands-on workshops at tech conferences and companies.


Case Studies

• We developed a paraphraser that can rewrite technical documentation in more familiar language, enabling easier comprehension and searchability for visitors. Our solution enabled our customer to attract 12% more visitors to their websites.

• We delivered machine translation models specialized for chat conversations across languages of Southeast Asia. Our on-premise deployment enables our customer to save 10k+ USD per month while delivering translation quality comparable to Google Translate.

See our blogs, NLP case studies, HuggingFace models or open-source projects.
March 2021 - now

Technical lead

Grant Project: Intelligent Back Office

Coordination of a research team of 5-6 researchers within a larger, multi-organization grant project focused on improving document-processing quality for specialised domains.

Within this project, we (1) enhanced the quality of domain-specific OCR text extraction utilising semi-supervised language modelling and (2) improved methods for utilising the positional information of document segments for more accurate identification of named entities.


September 2021 - May 2023

Deep Learning Engineer

Gauss Algorithmic
Creating prototypes of NLP applications for specific use cases in multilingual classification, named entity recognition, and language generation.

• Development of custom architectures and data pipelines.

• Streamlining experimentation; establishing practices in model and data versioning, reproducibility, and integration.

• Implementation of the essential infrastructure around Transformer-based models (classification, generation) in the early Transformers era.

Implementations in TensorFlow and later PyTorch; deployments using CI, Docker & Kubernetes on AWS and Google Cloud.


June 2018 - March 2021

Junior Software Developer

Red Hat Software
Enhancing the company's search engines with semantic text representations and classification, within Searchisko & other open-source projects.

Application of pivotal semantic representation technologies such as Word2Vec and FastText. Implementations in Python & Java, deployments in OpenShift.


March 2015 - June 2018

Stream Processing Developer

CSIRT: Cyber Security Incident Response Team of Masaryk University
Research & development of intrusion detection methods in scalable, streaming paradigms.

Implementations of unsupervised and semi-supervised machine learning methods in Apache Spark & Python. Most of our research is open-sourced in the Stream4Flow project.
March 2016 - September 2017

Talks

Fantastically Robust Language Models and Where to Find Them



Invited talks: Informatics Colloquium at Faculty of Informatics, Masaryk University & Allegro NLP Seminar, Online.
Presentation slides: 20min version & 1h version.


Language Models in Autonomous Systems

Community talk: DataMesh Brno, September 2024 (Offline)

Informal talk sharing our experience with deployments of custom models, including applications of autonomous decision agents.
Presentation slides (in Slovak)


Training for Single Correct Prediction Makes Your Model More Fragile

HumanAligned.ai Summer school in Prague, July 2024 (Offline)

A short teaser talk for our Soft Alignment Objectives presented at ACL 2023.
Presentation slides & Medium blog post


Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of QA Models

EACL 2024 Main conference talk in Malta, March 2024

A 12-minute overview of our work exploring the limits of evaluating the robustness of language models in question answering.



Robustness of Language Models and Perspectives of Modularization

Language Technology seminar, University of Helsinki, March 2024

A talk covering recent methods for improving the robustness of language models and the role of modularization.
Presentation slides with links to the literature.


Soft Alignment Objectives for Robust Adaptation of Language Generation

ACL 2023 conference in Toronto, Canada, July 2023

A 6 min exposition of our ACL paper proposing more robust training objectives for language generation.

See also a blogpost on the topic



Learning to Learn: Hands-on Tutorial in Using and Improving Few-shot Language Models

Workshop on Machine Learning Prague conference, in Prague, Czechia, June 2023

Organisation of a workshop covering the most recent topics, with hands-on tutorials in in-context learning.
Workshop GitHub & Presentation slides with links to the literature.


Resources and Few-shot Learners for In-context Learning in Slavic Languages

EACL 2023: Slavic NLP workshop in Dubrovnik, Croatia, May 2023 · 🥇 Best paper award

A 10 min overview of our work in building and evaluating in-context learning in Slavic languages.



Learning to Learn: Training Language Models to Understand Tasks from Few Examples

Community talk: Machine Learning MeetUp Brno, October 2022 (Offline)

Informal presentation of our methodology and trained models for few-shot learning in Czech and other non-English languages.
Medium Blog & Event & Presentation with links & HuggingFace models.


Mitigating Biases of QA Models by Simple Resampling Methods

2022 Annual Conference of the NAACL: DADC Workshop, July 2022

Results announcement and recording of the Supersamplers team's presentation, live at the NAACL 2022 DADC workshop in Seattle.



Methods for Estimating and Improving Robustness of Language Models

NAACL 2022: SRW workshop, July 2022

A 5-minute talk covering my thesis proposal, presented at the NAACL 2022 SRW in Seattle.



Adaptor: Objective-Centric Adaptation Library for Language Models

60th Annual Meeting of the ACL, May 2022

A blitz 2-minute introduction video of the Adaptor library, presented at ACL 2022 in Dublin.




Other Activities

Lecturer

Course: Introduction to Information Retrieval

Over the past five years, I have given practicals to 100+ MSc students on the essentials of text representations, transformers, and machine learning applied to indexing and search.

I have also proposed and led the redesign of the course into an extracurricular competition based on a newly-created, easy-to-use benchmark framework. End-of-semester surveys report that after the introduction of this competition, 31.5% more students agree with the statement that "the course has an educational value and enriches me" compared to previous years. The results of the competition also allow us to easily identify and reach out to the best students with further research opportunities in the lab.


Springs 2019 - now

Supervisor

Transformers Club & others

I have supervised numerous bachelor's and master's theses on topics closely related to the robustness of language models. For the more ambitious students, I organize weekly meetings of TransformersClub™, a shared platform for pursuing creative ideas, peer review, and organizing teams to address more complex problems. Many of these initiatives end up as research papers in major ML/NLP venues.

Theses that have emerged from the Club or that I have supervised:



Autumn 2020 - now

Reviewer

*ACL Conferences, ACL Rolling Review & others

I give back to the community by volunteering for reviewing, proofreading and providing feedback to other researchers whenever it can help. Feel free to get in touch if you're interested!



Away From Keyboard




Research can be fun, but there's a lot of other good stuff to do. Whenever I can, I take my shoes 🥾, a bike 🚲 or skis ⛷ and head towards mountains ⛰☀️. I love spending time breathing fresh air with unlimited open space ☁ above my head. These (and any other) activities feel best when combined with a taste of a filtered brew ☕.

Please get in touch if your coffee tastes undrinkably sour to you. I'll be interested to know your source.