Private Data Anonymization with Python, Fundamentals
2023-07-20, Terrace 2A

How can large repositories of legal documents be brought into the public domain without releasing private data? The fundamental concepts behind document anonymization are entity recognition, masking types, and pseudonymization. Using Python and a collection of libraries such as spaCy, PyTorch, and others, we can achieve good anonymization scores. How is this applied within a workflow that contains AI models for NER? Once documents are anonymized, how can the result be improved with further text mining in Python-based apps and a human in the loop? Although the GDPR was approved in 2016, applying it at the European level remains a challenge in banking, legal, and other contexts. This talk covers the process of transforming PDF and DOCX documents into XML, processing them with regular expressions and spaCy/PyTorch models, and analyzing the results with AntConc and Textacy. All the ideas are supported by real experience from MAPA, a European anonymization project that finished in 2022.
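To make the difference between masking and pseudonymization concrete, here is a minimal sketch using only the Python standard library. It is not the MAPA implementation: the patterns, the `find_entities`, `mask`, and `pseudonymize` helpers, the salt, and the sample text are all hypothetical and chosen for illustration. A real pipeline would combine patterns like these with NER models for names, places, and organizations.

```python
import hashlib
import re

# Hypothetical patterns for illustration only; real projects need patterns
# tuned per language and domain, plus NER models for names and places.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_entities(text):
    """Return (start, end, label) spans for every pattern match, sorted by start."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def _replace_spans(text, spans, make_token):
    """Rebuild the text, swapping each span for a token; overlapping spans are skipped."""
    out, cursor = [], 0
    for start, end, label in spans:
        if start < cursor:
            continue  # overlaps a span that was already replaced
        out.append(text[cursor:start])
        out.append(make_token(label, text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def mask(text, spans):
    """Masking: replace each detected span with its entity label."""
    return _replace_spans(text, spans, lambda label, value: f"[{label}]")


def pseudonymize(text, spans, salt="demo-salt"):
    """Pseudonymization: replace each span with a stable surrogate token,
    so the same original value always maps to the same token."""
    def make_token(label, value):
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
        return f"[{label}_{digest}]"
    return _replace_spans(text, spans, make_token)


sample = (
    "Contact Ana at ana@example.com or +34 600 123 456. "
    "Write to ana@example.com again."
)
spans = find_entities(sample)
masked = mask(sample, spans)
pseudo = pseudonymize(sample, spans)
print(masked)
print(pseudo)
```

In this sketch, masking erases the link between the two occurrences of the same e-mail address, while pseudonymization keeps it by giving both the same surrogate token. The name "Ana" is left untouched, which is the kind of entity that regular expressions cannot reliably catch and where NER models come in.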


Based on more than three years of experience anonymizing documents in different domains, this talk presents the necessary steps in the anonymization process and shows how Python tools are essential to it.

It will include a presentation of MAPA, a European anonymization project completed in 2022 whose data is available to the entire community (https://www.elrc-share.eu/repository/search/?q=MAP).

The talk will focus on the importance of AI models for scaling anonymization in environments with high volumes of documents, and on how Python technologies improve the performance of both the solution and the team that develops it.

The following frameworks will be mentioned in the presentation: spaCy, PyTorch, FastAPI, Textacy, pytest, and other base libraries.


Expected audience expertise:

intermediate

Currently Data Scientist at Datwit US, Inc.
- Private Health Data Anonymization
- Health Data Analysis
- AI adoption in Health and Legal domains
Machine Learning Engineer at Pangeanic (until May 2023)
- AWS solution design based on async requests
- Anonymization Toolkit using NER with customized tags with Python
- Combining Machine Translation with Pseudonymization
- REST API design for NLP tasks with FastAPI, Flair and spaCy
- Flair and spaCy NER Model Evaluation
- Image Anonymization Service design and evaluation
- OCR service design and evaluation

Passionate MLOps and Backend Developer. Currently living and working in Belgrade, Serbia.

Work Experience:
- HTEC Group, Software Engineer
- Datwit US LLC, Data Analyst
- Pangeanic Language Technologies and Translation Services, Machine Learning Engineer
- Auge CRM S.A.S. de C.V, Back End Developer & DevOps Engineer
- OptimalBit LLC, DevOps Engineer