Word Wranglers & News Navigators: Taming GPT-3 Beast for Media Monitoring (EuroPython 2023)

2023-07-20 11:55–12:25, South Hall 2B

The emergence of ChatGPT has triggered a surge of new opportunities and applications in Natural Language Processing (NLP). Many teams were struck with FOMO (Fear of Missing Out) and hastened to incorporate Large Language Models (LLMs) into their products. Using OpenAI models (text-curie-001, davinci, gpt-3.5-turbo), we integrated LLMs into our production system on March 2, so our users can receive text summaries in their email reports and grasp the essence of any article within our application. Three weeks later, we trained our own large language model for the same purpose. This talk will delve into our journey, exploring the lessons and insights gleaned from our hands-on experience with these cutting-edge tools.

While heaps of research focus on English texts, this talk will zoom in on using LLMs for smaller languages like Czech or Slovak. I'll share some hair-raising examples from our early days after deployment (trust me, when the LLM starts hallucinating about a local politician's resignation, your client won't be thrilled) and how we managed to tackle those challenges. I'll also chat about the quirks that come with non-English languages (more tokens, bigger models, ...) and dish about our experiments with other LLMs like OPT, BLOOM, and LLaMA. I will show how to fine-tune a general LLM on a Czech instruction set and demonstrate the danger of injecting false information. Finally, I will discuss the options for putting an LLM into production.
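As a rough illustration of the "more tokens" quirk: GPT-style tokenizers work on UTF-8 bytes, and Czech letters with diacritics take two bytes each where plain ASCII letters take one, so the same amount of text tends to start from a longer byte sequence. The sketch below only measures bytes per character as a crude proxy. It is not a real tokenizer and not the method used in the talk, and the function name and sample sentences are hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character (hypothetical helper).

    This is only a proxy for tokenizer cost: byte-level tokenizers start
    from UTF-8 bytes, so more bytes per character usually means more tokens.
    """
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Illustrative sentences, not taken from any real dataset.
english = "The quick brown fox jumps over the lazy dog."
czech = "Příliš žluťoučký kůň úpěl ďábelské ódy."

for label, sentence in (("English", english), ("Czech", czech)):
    ratio = utf8_bytes_per_char(sentence)
    print(f"{label}: {ratio:.2f} UTF-8 bytes per character")
```

To get actual token counts, the same comparison would have to be run with the tokenizer of the specific model being used.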

Established in 2015, Monitora Media burst onto the scene and rapidly rose to prominence in the realm of media analysis and monitoring across Central Europe. We've been shaking things up with innovations like real-time media monitoring and scrollable print titles, even scoring a spot on Deloitte's Technology Fast 50 for Central Europe in the past two years. We keep an eye on print and online publications, as well as TV and radio broadcasts, and podcast transcripts – we're talking over 350 programs, 9,000 websites, and 2,500 print titles! Python and Django hold a special place in our hearts.

Expected audience expertise: beginner

Petr Šimeček

Biostatistician by training, time series forecaster at Google, "gradient boosting guy" at Simple Finance. Currently, entangling knots on protein backbones at Masaryk University and applying large language models at Monitora Media.
