Oscar L. Garcell
Passionate MLOps and Backend Developer. Currently living and working in Belgrade, Serbia.
- HTEC Group, Software Engineer
- Datwit US LLC, Data Analyst
- Pangeanic Language Technologies and Translation Services, Machine Learning Engineer
- Auge CRM S.A.S. de C.V, Back End Developer & DevOps Engineer
- OptimalBit LLC, DevOps Engineer
How can large legal document repositories be brought into the public domain without releasing private data? The fundamental concepts behind document anonymization are entity recognition, masking type, and pseudonymization. Using Python and a collection of libraries such as spaCy, PyTorch, and others, we can achieve good anonymization scores. How is this applied within a workflow containing AI models for named entity recognition (NER)? Once a document is anonymized, how can the result be improved through further text mining with Python-based apps and a human in the loop? Although the GDPR was approved in 2016, applying it at the European level remains a challenge in banking, legal, and other contexts. This talk covers the process of transforming PDF and DOCX documents into XML, processing them with regular expressions and spaCy/PyTorch models, and analyzing the results with AntConc and textacy. All the ideas will be supported by real experience from the MAPA project, a European anonymization project that finished in 2022.
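To make the difference between masking and pseudonymization concrete, here is a minimal sketch of the regular-expression stage only, using the Python standard library. It is not the MAPA implementation: the patterns, the `PATTERNS` table, the function names, and the salt value are all hypothetical and chosen for illustration. In a real workflow, spans predicted by spaCy/PyTorch NER models would be merged with these rule-based spans before replacement.

```python
import hashlib
import re

# Hypothetical, deliberately simple patterns (assumption: not from the talk).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def find_entities(text):
    """Return non-overlapping (start, end, label) spans in text order."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    kept, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:  # skip spans overlapping an earlier one
            kept.append((start, end, label))
            last_end = end
    return kept


def _replace(text, make_replacement):
    """Rebuild text, swapping each entity for make_replacement(label, value)."""
    parts, pos = [], 0
    for start, end, label in find_entities(text):
        parts.append(text[pos:start])
        parts.append(make_replacement(label, text[start:end]))
        pos = end
    parts.append(text[pos:])
    return "".join(parts)


def mask(text):
    """Masking: every entity becomes its type label, e.g. [EMAIL]."""
    return _replace(text, lambda label, value: f"[{label}]")


def pseudonymize(text, salt="demo-salt"):
    """Pseudonymization: the same value always maps to the same token."""
    def token(label, value):
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        return f"{label}_{digest[:8]}"
    return _replace(text, token)


sample = "Contact ana@example.com before 12/03/2021."
print(mask(sample))
print(pseudonymize(sample))
```

Masking discards the identity of each entity, while pseudonymization keeps repeated mentions linkable to one another, which tends to matter for later text mining. Note that a salted hash is only one possible replacement strategy, and whether it is adequate for a given GDPR context is a legal question this sketch does not settle.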