Abel Meneses Abad
Currently Data Scientist at Datwit US, Inc.
- Private Health Data Anonymization
- Health Data Analysis
- AI adoption in Health and Legal domains
Machine Learning Engineer at Pangeanic (until May 2023)
- AWS solution design based on async requests
- Anonymization toolkit in Python using NER with customized tags
- Combining Machine Translation with Pseudonymization
- REST API design for NLP tasks with FastAPI, Flair and spaCy
- Flair and spaCy NER Model Evaluation
- Image Anonymization Service design and evaluation
- OCR service design and evaluation
Session
How can large repositories of legal documents be brought into the public domain without releasing private data? The fundamental concepts behind document anonymization are entity recognition, masking type, and pseudonymization. Using Python and a collection of libraries such as spaCy and PyTorch, we can achieve good anonymization scores. How is this applied within a flow containing AI models for NER? And once the documents are anonymized, how can the result be improved with further text mining, using Python-based apps and a human in the loop? Although it was approved in 2016, applying the GDPR at the European level remains a challenge in banking, legal, and other contexts. This talk covers the process of transforming PDF and DOCX documents into XML, processing them with regular expressions and spaCy/PyTorch models, and parsing the results with AntConc and Textacy. All the ideas will be supported by real experience from the MAPA project, a European anonymization project that finished in 2022.
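To make the difference between masking and pseudonymization concrete, here is a minimal sketch of the regular-expression stage only, written with the Python standard library. The patterns, the labels (`EMAIL`, `PHONE`), and the function names `find_entities`, `mask`, and `pseudonymize` are hypothetical and chosen for illustration; they are not taken from the MAPA project. The sketch assumes that a NER model would contribute additional spans (for example person names) in the same `(start, end, label)` form.

```python
import re

# Hypothetical patterns for illustration; real legal documents need
# broader, language-specific rules plus NER model output.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def find_entities(text):
    """Return non-overlapping (start, end, label) spans sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    # Keep the earliest span whenever two candidates overlap.
    kept, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:
            kept.append((start, end, label))
            last_end = end
    return kept


def _replace(text, spans, make_replacement):
    """Rebuild the text, swapping each span for make_replacement(surface, label)."""
    parts, pos = [], 0
    for start, end, label in spans:
        parts.append(text[pos:start])
        parts.append(make_replacement(text[start:end], label))
        pos = end
    parts.append(text[pos:])
    return "".join(parts)


def mask(text, spans):
    """Masking: every entity becomes a generic tag such as [EMAIL]."""
    return _replace(text, spans, lambda surface, label: f"[{label}]")


def pseudonymize(text, spans):
    """Pseudonymization: the same surface form always gets the same placeholder."""
    mapping, counters = {}, {}

    def placeholder(surface, label):
        key = (label, surface)
        if key not in mapping:
            counters[label] = counters.get(label, 0) + 1
            mapping[key] = f"{label}_{counters[label]}"
        return mapping[key]

    return _replace(text, spans, placeholder)


if __name__ == "__main__":
    sample = "Contact Ana at ana@example.com or +34 600 123 456."
    found = find_entities(sample)
    print(mask(sample, found))
    print(pseudonymize(sample, found))
```

Note that the name "Ana" is left untouched in this sketch: names are exactly the kind of entity that regular expressions handle poorly and that the NER models are there to catch. Pseudonymization keeps repeated mentions linked to one placeholder, which preserves more of the document's readability than masking does.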