Robot Holmes and The MLington Murder Mysteries
07-19, 16:05–16:50 (Europe/Prague), Terrace 2A

We will follow master detective Robot Holmes as he sets out to solve one of his hardest cases so far: a series of mysterious murders in the city of MLington. The traces lead him to the Vision-Language part of town, which was a quiet, tranquil place with few incidents until recently. For the past few months, the neighbourhood has been growing rapidly, and careless benchmark leaders have been dropping dead at an alarming rate.

Robot Holmes sets out to find the cause of this new development. Along the way, he will gather intel on some of the most notorious new citizens of the Vision-Language neighbourhood and find out what makes them tick.

Join detective Robot Holmes on an adventure through the streets of the vibrant city of MLington. Together, we will find out who is behind a series of mysterious murders in the bustling neighbourhood of Vision-Language Village (ViLaVi).

ViLaVi has seen a steady influx of posh new Vision-Language models, which are not only rapidly expanding the district but also raising the quality of the services offered in its streets.

On his journey, Robot Holmes will compile an overview of the tasks these new models excel at and find out what makes them so good at them. He will also gather details on some of the most successful Vision-Language models and eventually find out who or what is behind the series of murders.

By the end of our journey, you will have a better overview of the rapidly expanding Vision-Language neighbourhood and will understand the most important inner workings of Vision-Language models such as CLIP, OWL-ViT, and BLIP. You will also know how to run them yourself in a few lines of code with the transformers library by Hugging Face.
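
As a small taste of those inner workings, here is a minimal sketch of the scoring step behind CLIP-style zero-shot classification: the image embedding is compared with one text embedding per candidate label via cosine similarity, and the scaled similarities are turned into probabilities with a softmax. The embeddings, the labels, the function names and the scale of 100.0 are all assumptions made up for illustration; in practice the embeddings would come from the model's image and text encoders, for example loaded through the transformers library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(image_embedding, text_embeddings, scale=100.0):
    """Softmax over scaled cosine similarities, one score per label.

    `scale` is an assumed value for this sketch, not read from a model.
    """
    logits = {
        label: scale * cosine(image_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    # Subtract the largest logit before exponentiating for numerical stability.
    max_logit = max(logits.values())
    exps = {label: math.exp(logit - max_logit) for label, logit in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Toy, hand-made embeddings (hypothetical; a real model produces these).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

scores = zero_shot_scores(image_embedding, text_embeddings)
best_label = max(scores, key=scores.get)
print(best_label, scores)
```

The label whose text embedding points in the direction closest to the image embedding receives the highest score, which is why no task-specific training is needed to add a new candidate label.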

The talk is for everyone who is interested in Vision-Language models and wants to gain some first insights into how they work. People who already have profound knowledge of Vision-Language models are welcome as well, as they have probably never seen them presented as a crime story in a strangely familiar city.

I'm Johannes, a Data Scientist at celebrate company by day and an AI storyteller by night.

After gaining research experience at Fraunhofer Fokus Institute and tinkering with sensor setups for autonomous vehicles, I decided to get more hands-on and joined celebrate company, where I'm now bringing models from the NLP, CV, and multimodal research worlds into production.

Since last year, I have occasionally led a Computer Vision Study Group on the Hugging Face Discord server, where I present papers embedded in presentations with a little geeky twist. You can find an overview of the past study groups (not all led by me) right here: HF CV Study Group.