Michal Štefánik

Researcher in Natural Language Processing

stefanik.m@mail.muni.cz

Welcome to my webpage! I am a final-year PhD researcher at the Faculty of Informatics of Masaryk University and an NLP team lead at Gauss Algorithmic.

My research interests centre on the robustness of language models. This includes reliable evaluation of models' generalization, as well as methods to improve robustness, for instance by creating better training data or improving existing training methods. My research enables better domain adaptation, more data-efficient low-resource training (e.g. in machine translation), and the creation of question-answering systems robust to prediction shortcuts.

I am also the founder of the student Transformers Club, whose alumni have received several Dean's awards and prizes from international competitions and shared tasks.

Over the last five years, I have also contributed to and led the delivery of industrial applications of recent NLP research, from multilingual language generation to entity recognition, taking them all the way from research ideas to scalable, containerized deployments that now serve NLP technologies to end users.

I expect to receive my PhD around October 2024 and will be looking for a new research affiliation thereafter.


Research

Below you can find references to my most recent publications.


2024

Self-Training Language Models in Arithmetic Reasoning

Marek Kadlčík, Michal Štefánik (equal)
Presented at the ICLR 2024 LLMAgents workshop; a newer version is under review.

Language models achieve impressive results in tasks involving complex multistep reasoning, but scaling these capabilities further traditionally requires expensive collection of more annotated data. In this work, we explore the potential of improving the capabilities of language models using automated feedback based on the validity of their predictions in arithmetic reasoning (self-training).

We find that models can substantially improve in both single-round (offline) and online self-training. In the offline setting, supervised methods deliver gains comparable to preference optimization, but in online self-training, preference optimization largely outperforms supervised training thanks to its superior stability and robustness on unseen types of problems.
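The self-training loop described above can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the paper's implementation: the "model" is a noisy function proposing candidate answers, and the automated feedback is a programmatic check of the final result.

```python
import random

random.seed(0)

def sample_candidates(problem, n=8):
    # Toy stand-in for a language model proposing candidate answers
    # to an addition problem; it sometimes makes arithmetic errors.
    a, b = map(int, problem.split("+"))
    return [str(a + b + random.choice([-1, 0, 0, 0, 1])) for _ in range(n)]

def is_valid(problem, prediction):
    # Automated feedback: compare the prediction against the
    # programmatically computable ground-truth result.
    a, b = map(int, problem.split("+"))
    return prediction == str(a + b)

def self_training_round(problems):
    # Keep only (problem, prediction) pairs that pass the validity
    # check; these become the training data for the next round.
    dataset = []
    for problem in problems:
        for prediction in sample_candidates(problem):
            if is_valid(problem, prediction):
                dataset.append((problem, prediction))
                break
    return dataset

data = self_training_round(["1+2", "10+5", "7+8"])
```

In the actual work, the retained pairs are used either for another round of supervised training (offline) or for continual preference-based updates (online).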

Concept-aware Data Construction Improves In-context Learning of Language Models

Michal Štefánik, Marek Kadlčík, Petr Sojka
To appear in Findings of ACL 2024 · Pre-print · 📄 Blog post.

Previous work curating in-context learning models assumes that ICL emerges from vast over-parametrization or the scale of multitask training. However, theoretical work attributes ICL emergence to specific properties of training data and creates functional in-context learners in small-scale, synthetic settings.

We propose a method to construct training data that upsamples these properties. We show the superior efficiency of training on such data, with resulting models significantly outperforming traditional instruction tuning on 41 and 45 of 60 tasks, and performing comparably to previous models trained on over 1,600 tasks despite using only two QA tasks. Our analyses attribute these improvements to the enhanced ability of our in-context learners to benefit from latent reasoning concepts presented in demonstrations, and to the mitigation of the models' over-reliance on their learned semantic priors.

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

Lukáš Mikula*, Michal Štefánik*, Marek Petrovič, Petr Sojka (*equal)
In Proceedings of EACL 2024: Main track · 🥇 Best submission in NAACL DADC: Better training data challenge

In this paper, we challenge the commonly used evaluation of language models' robustness through the lens of their out-of-distribution performance. We propose a metric to measure the reliance of a model's performance on a specific feature, and we measure the reliance on a set of features identified as prediction shortcuts.

We find that the same prediction shortcuts are exposed in all assessed QA datasets, and some shortcuts identified in SQuAD are even more impactful in other datasets. This means that, counterintuitively, a model relying more heavily on a shortcut might reach better OOD performance. Finally, we survey three existing debiasing methods and find that the OOD gains of these methods indeed come hand in hand with increased, rather than decreased, reliance on fragile prediction shortcuts.
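The intuition behind a reliance measure can be sketched in a few lines. This is a hypothetical, heavily simplified formulation (not the paper's metric): compare accuracy on samples where the shortcut feature points at the correct answer against samples where it does not; a large gap signals reliance.

```python
def shortcut_reliance(examples):
    # Split evaluated samples by whether the shortcut feature agrees
    # with the gold answer, then compare accuracies on both groups.
    aligned = [e for e in examples if e["shortcut_hits"]]
    misaligned = [e for e in examples if not e["shortcut_hits"]]
    accuracy = lambda xs: sum(e["correct"] for e in xs) / len(xs)
    return accuracy(aligned) - accuracy(misaligned)

# Synthetic evaluation records for illustration only.
evaluations = [
    {"shortcut_hits": True,  "correct": True},
    {"shortcut_hits": True,  "correct": True},
    {"shortcut_hits": False, "correct": True},
    {"shortcut_hits": False, "correct": False},
]
gap = shortcut_reliance(evaluations)  # 1.0 - 0.5
```

A gap near zero would suggest the model's decisions do not hinge on that feature; a large positive gap suggests the shortcut is doing much of the work.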

Conference talk ⬇️ & Announcement of results in NAACL DADC ⬇️

2023

Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems

Marek Kadlčík*, Michal Štefánik*, Ondřej Sotolář, Vlastimil Martinek (*equal)

Language models are notoriously inclined to make factual errors in tasks requiring arithmetic reasoning. To enable language models to circumvent this deficiency and offload a critical computation to a symbolic system, we create a collection of Calc-X datasets that demonstrate the appropriate use of a calculator in reasoning chains.

We survey and unify several existing chain-of-thought datasets into a proposed novel format, resulting in a standard collection of over 300,000 samples requiring arithmetic reasoning. Finally, we use the new collection to train open-source calculator-assisted language models and show that models trained on Calc-X almost double the accuracy of generating correct results compared to baselines, and can outperform larger models from previous work, such as Toolformer or Llama 2. We make all Calc-X datasets and models publicly available.
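The offloading idea can be sketched minimally: the model emits a marked-up calculator call inside its reasoning chain, and the host intercepts and evaluates it. The `<calc>` tag below is illustrative only, not the exact Calc-X markup.

```python
import re

def resolve_calculator_calls(chain):
    # Replace each calculator call in a generated reasoning chain
    # with its evaluated result, appended as an <result> tag.
    def evaluate(match):
        expr = match.group(1)
        # Offload the arithmetic to the host system instead of trusting
        # the language model's own computation. (A proper safe expression
        # parser should be used in practice instead of eval.)
        result = eval(expr, {"__builtins__": {}})
        return f"<calc>{expr}</calc><result>{result}</result>"
    return re.sub(r"<calc>(.*?)</calc>", evaluate, chain)

chain = "The order costs <calc>12*37</calc> dollars in total."
resolved = resolve_calculator_calls(chain)
```

During training, the demonstrations teach the model *when* to emit such calls; at inference, generation pauses at each call and continues from the injected result.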

Link to Marek's EMNLP talk 🎥

Can In-context Learners Learn a Reasoning Concept from Demonstrations?

Michal Štefánik, Marek Kadlčík

Recent work analysing the functioning of in-context learners shows that instead of learning new associations from the input, models largely rely on their pre-trained knowledge, such as the sentiment of the labels.

We argue that evaluations using randomly chosen demonstrations cannot disentangle models' reliance on pre-trained knowledge from their ability to learn new functional relations, as most such demonstrations provide only limited information. Hence, we propose to evaluate in-context learners with demonstrations sharing a specific, informative reasoning concept with the predicted sample. We find that most recent in-context learners cannot consistently benefit from the demonstrated concepts, irrespective of model size. However, some models are more sensitive to concepts than others; for instance, T0 models can benefit from concepts in 7 of 8 evaluation scenarios.

Soft Alignment Objectives for Robust Adaptation of Language Generation

Michal Štefánik, Marek Kadlčík, Petr Sojka

Fine-tuning of pre-trained generative language models weakens their ability to generalize, making the open-ended deployment of these models prone to errors like hallucinations or degradation of output text quality.

In this work, we show that adapting the models while modeling the ambiguity of prediction can avoid over 95% of the loss caused by traditional fine-tuning.
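One way to model prediction ambiguity is to soften the one-hot training target toward other plausible tokens. The sketch below is a simplified illustration of that general idea under assumed similarity scores, not the paper's exact objective.

```python
def soft_targets(gold_token, similar_tokens, alpha=0.1):
    # Distribute a small fraction (alpha) of the probability mass from
    # the one-hot gold token to plausible alternatives, proportionally
    # to their (assumed, precomputed) similarity scores.
    total = sum(similar_tokens.values())
    targets = {tok: alpha * sim / total for tok, sim in similar_tokens.items()}
    targets[gold_token] = targets.get(gold_token, 0.0) + (1.0 - alpha)
    return targets

# Hypothetical example: two alternatives judged equally plausible.
targets = soft_targets("cat", {"kitten": 1.0, "feline": 1.0})
```

Training against such soft distributions penalizes the model less for reasonable alternative continuations than standard cross-entropy against a single gold token.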

Conference talk ⬇️

Resources and Few-shot Learners for In-context Learning in Slavic Languages

Michal Štefánik, Marek Kadlčík, Piotr Gramacki, Petr Sojka
In Proceedings of EACL SlavicNLP 2023 workshop · 🥇 Best paper award

In this work, we collect the infrastructure necessary for training and evaluation of ICL in a selection of Slavic languages: Czech, Polish, and Russian. We link a diverse set of datasets and cast these into a unified instructional format through a set of transformations and newly crafted templates written purely in the target languages. Using the newly curated dataset, we evaluate a set of the most recent in-context learners and compare their results to supervised baselines. Finally, we train, evaluate and publish a set of in-context learning models.

We find that massive multitask training can be outperformed by single-task training in the target language, uncovering the potential for specializing in-context learners to the language(s) of their application.

2022

Applications of deep language models for reflective writings

Jan Nehyba, Michal Štefánik (equal)

Social sciences exhibit many cognitively complex and highly qualified problems, whose resolution relies on often subjective expert judgements. One such problem, which we study in this work, is reflection analysis in the writings of student teachers.

We perform a variety of experiments on how to efficiently address data collection for applications exhibiting a great level of annotators' subjectivity. Additionally, we demonstrate the great potential of cross-lingual transfer of multilingual language models in the analysis of reflective writing. We make our resulting datasets and models freely available.

Methods for Estimating and Improving Robustness of Language Models

Michal Štefánik

Despite their outstanding performance, large language models (LLMs) suffer from notorious flaws related to their preference for simple, surface-level textual relations over the full semantic complexity of the problem. This proposal identifies a common denominator of these flaws in the models' weak ability to generalise outside the training domain.

We survey diverse research directions providing estimations of model generalisation ability and find that incorporating some of these measures in the training objectives leads to enhanced distributional robustness of neural models. Based on these findings, we present future research directions towards enhancing the robustness of LLMs.

Conference talk ⬇️

Adaptor: Objective-Centric Adaptation Library for Language Models

Michal Štefánik, Vít Novotný, Nikola Groverová, Petr Sojka

This paper introduces the Adaptor library, which transposes the traditional model-centric approach of pre-training + fine-tuning steps into an objective-centric approach, composing the training from applications of selected objectives. We survey research directions that can benefit from enhanced objective-centric experimentation in multitask training, custom objectives development, dynamic training curricula, or domain adaptation, and demonstrate the practical applicability of Adaptor in selected unsupervised domain adaptation scenarios.

Introduction talk ⬇️

When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting

Vít Novotný, Michal Štefánik, Eniafe Festus Ayetiran, Petr Sojka, Radim Řehůřek

We propose a constrained positional model, which adapts the sparse attention mechanism from neural machine translation to improve the speed of the positional model. Our constrained model outperforms the positional model on language modelling and trains twice as fast.

2021

RegEMT: Regressive Ensemble for Machine Translation Quality Evaluation

Michal Štefánik, Vít Novotný, Petr Sojka

This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using its correlation with expert-based MQM scores from the WMT 2021 Metrics workshop.

In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settings, we also demonstrate that an ensemble approach is well-applicable to unseen languages. Furthermore, we identify a strong reference-free baseline that consistently outperforms the commonly-used BLEU and METEOR measures and significantly improves our ensemble’s performance.

Ensembling Ten Math Information Retrieval Systems: MIRMU and MSM @ ARQMath 2021

Vít Novotný, Michal Štefánik, Dávid Lupták, Martin Geletka, Petr Sojka

The ARQMath Community Question Answering (CQA) competition challenges open-domain QA systems to find answers to a set of as-yet-unanswered questions on Math Stack Exchange, potentially providing users with responses to all new but answerable questions.

In our submission, we create an ensemble of ten "weak" individual systems that vote to provide answers to unseen questions. Our submission, the best-performing among all automated systems, shows that such an ensemble approach can be more robust on unseen questions than single-model neural systems.
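The voting idea can be reduced to a minimal sketch: each "weak" system nominates its top candidate, and the ensemble returns the most-supported answer. This is a simplified majority vote for illustration; the submitted system combines full ranked lists.

```python
from collections import Counter

def ensemble_vote(candidate_answers):
    # Count how many individual systems nominated each candidate
    # answer and return the winner together with its vote count.
    votes = Counter(candidate_answers)
    answer, count = votes.most_common(1)[0]
    return answer, count

# Hypothetical top picks of ten individual retrieval systems
# for one unseen question (answer IDs are made up).
systems_top_picks = ["A17", "A17", "B03", "A17", "C42",
                     "B03", "A17", "D11", "A17", "B03"]
winner, support = ensemble_vote(systems_top_picks)
```

Because a wrong answer rarely wins the vote across many diverse systems, the ensemble degrades more gracefully on unseen questions than any single model.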


Experience

NLP Scientist - Team Lead

Gauss Algorithmic
Guiding the most recent NLP research applications from whiteboards to the users.

Creating ideas that widen the applicability of NLP to novel areas, while also coordinating delivery within a team of three NLP scientists and multiple software developers, operations engineers, business developers, and copywriters.

See our blogs, case studies, demos or open-source projects.
March 2021 - now

Technical lead

Intelligent Back Office: Grant Project

Coordination of a research team of 5–6 researchers within a larger, multi-organization grant project focused on improving document-processing quality for specialised domain(s).

Our research objectives are (a) to enhance the quality of domain-specific OCR text extraction utilising semi-supervised language modelling and (b) to utilise relative positional information of the document segments for a more accurate extraction of the named entities.


January 2022 - May 2023

NLP Scientist

Gauss Algorithmic
Creating the prototypes of NLP applications for specific use-cases, in multilingual classification, named entity recognition, or language generation.

Responsibilities ranged from communicating customers' expectations to stable deployments of containerized applications.

Implementations using Python & PyTorch, deployments using CI, Docker & Kubernetes in AWS and Google Cloud.


June 2018 - March 2021

Junior Software Developer

Red Hat Software
Enhancing the company's search engines with semantic text representations and classification within Searchisko and other open-source projects.

Application of NLP technologies such as Word2Vec and FastText. Implementations in Python & Java, deployments in OpenShift.


March 2015 - June 2018

Stream Processing Developer

CSIRT: Cyber Security Team of Masaryk University
Research & development of intrusion detection methods in scalable, streaming paradigms.

Implementations of unsupervised and semi-supervised methods, Apache Spark & Python. Open-sourced in Stream4Flow project.
March 2016 - September 2017

Talks

Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models

EACL 2024 Main conference talk in Malta, March 2024

A 12-minute overview of our work exploring the limits of evaluating the robustness of language models in question answering.



Robustness of Language Models and Perspectives of Modularization

Language Technology seminar, University of Helsinki, March 2024

A talk covering recent methods for improving the robustness of language models and the role of modularization.
Presentation slides with links to the literature.


Soft Alignment Objectives for Robust Adaptation of Language Generation

ACL 2023 conference in Toronto, Canada, July 2023

A 6-minute exposition of our ACL paper proposing more robust training objectives for language generation.

There is also a blog post on the topic.



Learning to Learn: Hands-on Tutorial in Using and Improving Few-shot Language Models

Workshop on Machine Learning Prague conference, in Prague, Czechia, June 2023

Organisation of a workshop covering recent topics in in-context learning, with hands-on tutorials.
Workshop github & Presentation slides with links to the literature.


Resources and Few-shot Learners for In-context Learning in Slavic Languages

EACL 2023: Slavic NLP workshop in Dubrovnik, Croatia, May 2023 · 🥇 Best paper award

A 10-minute overview of our work on building and evaluating in-context learning in Slavic languages.



Learning to Learn: Training Language Models to Understand Tasks from Few Examples

Community talk: Machine Learning MeetUp Brno, October 2022 (Offline)

Informal presentation of our methodology and trained models for few-shot learning in Czech and other non-English languages.
Medium Blog & Event & Presentation with links & HuggingFace models.


Mitigating Biases of QA Models by Simple Resampling Methods

2022 Annual Conference of the NAACL: DADC Workshop, July 2022

Results announcement and recording of the Supersamplers team presentation, live at the NAACL 2022 DADC workshop in Seattle.



Methods for Estimating and Improving Robustness of Language Models

NAACL 2022: SRW workshop, July 2022

A 5-minute talk covering my thesis proposal, presented at the NAACL 2022 SRW in Seattle.



Adaptor: Objective-Centric Adaptation Library for Language Models

60th Annual Meeting of the ACL, May 2022

A blitz 2-minute introduction video of the Adaptor library, presented at ACL 2022 in Dublin.




Other Activities

Lecturer

Course: Introduction to Information Retrieval

Over the past three years, I have given practicals for 100+ master's students on the essentials of text representations, indexing methods, and machine-learning approaches to full-text search.

I have also initiated the redesign of the course into a semester-long competition based on an easy-to-use evaluation framework. According to semestral student surveys, the competition model enhanced the students' engagement and interest in NLP topics. The competition results also allow us to reach out to the best students with an offer of further research cooperation.


Springs 2019 - now

Supervisor

Transformers Club & others

I have supervised numerous bachelor's and master's theses on topics closely related to the robustness of language models. For the more ambitious students, I organize weekly meetings of the Transformers Club, a shared platform for pursuing creative ideas, peer review, and organizing teams to address more complex problems. Many of these initiatives end up as research papers in major ML/NLP venues.

Theses that emerged from the Club or that I have supervised:



Autumn 2020 - now

Reviewer

*ACL Conferences, ACL Rolling Review & others

I give back to the community by volunteering for reviewing, proofreading and providing feedback to other researchers whenever it can help. Feel free to get in touch if you're interested!



Away From Keyboard




Reading the most recent pre-prints is fun, but there's a lot of other great stuff to see and do.

Whenever possible, I love spending time moving with unlimited open space ☁ ⬆ above, best done on a bike 🚲 or skis ⛷, with far mountain views ⛰☀️ and a taste of a filtered brew ☕ .