2023-07-20, South Hall 2B
Controllers deal with numbers all day long. They have to check large amounts of data from many different sources, and the reports often contain erroneous or missing data. Identifying outliers and suspicious values is time-consuming.
This presentation will introduce an end-to-end workflow for small data problems that uses statistical tools and machine learning to make controllers' jobs easier and help them be more productive.
It is a common business problem that the data provided is incorrect due to misunderstandings, manual input, cultural differences, typos, etc., and these errors can often be weeded out in short order.
This talk will show how heuristic data validation can facilitate the automated detection of inaccuracies, outliers, and input errors - in our specific use case, financial controlling.
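As a rough illustration of the idea, here is a minimal sketch of what such a heuristic check could look like with pandera, one of the libraries listed below. The column names, plausibility bounds, and sample data are invented for this sketch, not taken from our actual reports:

```python
import pandas as pd
import pandera as pa

# Illustrative schema -- column names and plausibility bounds are invented.
schema = pa.DataFrameSchema(
    {
        "cost_center": pa.Column(str, nullable=False),
        "amount": pa.Column(
            float,
            checks=[
                pa.Check.ge(0),          # heuristic: amounts are never negative
                pa.Check.le(1_000_000),  # heuristic: flag implausibly large values
            ],
            nullable=True,               # missing amounts are allowed but surfaced
        ),
    },
    strict=False,  # individually structured reports may carry extra columns
)

report = pd.DataFrame({"cost_center": ["C-100", "C-200"], "amount": [120.5, -3.0]})
try:
    schema.validate(report, lazy=True)  # lazy=True collects all failures at once
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)            # one row per offending value and check
```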
In our use case we have to deal with a lot of reports. Some of them contain hundreds of columns and are structured very individually. Defining data types and expectations for every column of every single report would be too time-consuming.
We are dealing with a technically manageable number of data sets, but too many to leave to human visual inspection alone.
In our talk we will present strategies for how we solved these small data problems using heuristic and statistical methods.
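One statistical building block for such checks could be a robust outlier test. The following sketch (the threshold and sample values are ours, for illustration) uses scipy's median-based MAD score, which holds up better on small, possibly dirty samples than a mean/standard-deviation z-score:

```python
import numpy as np
from scipy import stats

def robust_outliers(values, threshold=3.5):
    """Flag outliers via the modified z-score (median/MAD), which is
    more robust on small, possibly dirty samples than mean/std."""
    x = np.asarray(values, dtype=float)
    mad = stats.median_abs_deviation(x, scale="normal")  # scaled to match std
    if mad == 0:
        return np.zeros(x.shape, dtype=bool)  # degenerate case: no spread at all
    modified_z = (x - np.median(x)) / mad
    return np.abs(modified_z) > threshold

monthly_costs = [1020, 980, 1005, 995, 10100, 1010]  # 10100: likely an extra zero
print(robust_outliers(monthly_costs))  # [False False False False  True False]
```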
Questions to tackle:
* Are None values OK or not - and if so, why?
* Is a value an outlier or a typo? (see the sketch after this list)
* How much deviation is OK, and how much is not?
* How can historical data help, and to what extent?
* What other external information can help to validate the data?
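To make the outlier-or-typo question concrete, here is a hypothetical heuristic (the function, tolerance, and data are ours, not from the talk): if a suspicious value is only a keystroke or two away from a historically plausible value, it is more likely a typo than a genuine outlier. nltk's edit_distance makes this cheap to check:

```python
from statistics import median
from nltk import edit_distance

def could_be_typo(value, history, max_edits=1, tolerance=0.05):
    """Hypothetical heuristic: a suspicious value may be a typo if some
    historically plausible value is only a few keystrokes away."""
    center = median(history)
    lo, hi = center * (1 - tolerance), center * (1 + tolerance)
    return any(
        lo <= candidate <= hi
        and edit_distance(str(value), str(candidate)) <= max_edits
        for candidate in history
    )

history = [1020, 980, 1005, 995, 1010]
print(could_be_typo(10100, history))  # True: "10100" is one deletion from "1010"
print(could_be_typo(5432, history))   # False: no near-miss, likely a real outlier
```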
We will demonstrate how we used, amongst others,
- scipy
- pandera
- dirty_cat
- nltk
- fastnumbers
to create a self-improving system that automates the screening of reports and flags outliers early, so that they can be eliminated more quickly.
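As a taste of how two of these libraries could slot into such a pipeline (the sample data is invented; the dirty_cat project has since evolved into skrub): fastnumbers converts messy numeric strings to floats without raising, and dirty_cat's SimilarityEncoder places typo'd category labels close to their correct counterparts:

```python
from fastnumbers import fast_float
from dirty_cat import SimilarityEncoder

# Tolerant numeric parsing: invalid strings become None instead of raising,
# so a whole column can be converted in one pass and the gaps reported later.
raw_amounts = ["1200.50", "980", "n/a", ""]
parsed = [fast_float(value, default=None) for value in raw_amounts]
print(parsed)  # [1200.5, 980.0, None, None]

# Fuzzy categorical encoding: n-gram string similarity scores the typo
# "Acounting" as close to "Accounting", so it can be mapped back to the
# known label instead of being counted as a new category.
encoder = SimilarityEncoder()
similarities = encoder.fit_transform([["Accounting"], ["Acounting"], ["Marketing"]])
print(similarities.round(2))
```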
Audience:
This presentation is intended for anyone interested in data quality management without heavy lifting, especially for small data problems.
beginner
Alexander Hendorf is responsible for data and artificial intelligence at the boutique consultancy KÖNIGSWEG in Germany. He has many years of experience in the practical application, introduction and communication of data and AI-driven strategies and decision-making processes.
Through his commitment as a speaker and chair of various international conferences such as PyConDE & PyData Berlin, he is a proven expert in the field of data intelligence. He has been appointed a Python Software Foundation and EuroPython fellow for his various contributions. He is currently a board member of the Python Software Verband (Germany) and the EuroPython Society (EPS).