Services

Semantic anonymisation for databases

Motivation

Since the arrival of the European General Data Protection Regulation (GDPR) on May 25th 2018 at the latest, people started worrying about how to handle sensitive data.

One of the main issues being addressed in GDPR is that if you store (or process) personal data of some kind, there needs to be a justification why you do it. Furthermore, if you lose personal data of some kind, this needs to be reported and made public. Usually a thing you want to avoid at all costs.

Therefore, — in theory — only a carefully selected number of administrators get access to the highly sensitive database. But in practice you need a valid dataset for your test and staging server, your developers need one to work on locally etc. If you want to read up a little more about the background, have a look at our article on Medium.

Implementation

Working frequently in the django ecosystem, we found the django-scrubber package at some point in 2018. The great thing about that tool is, that you can define all the fields which contain sensitive data and not just hash or empty them but fill it with data having the same meaning as your production state.

Imagine, we have a django model for your customer data:

# models.py

from django.db import models

class Customer(models.Model):
    first_name = models.CharField('First name', max_length=60)
    last_name = models.CharField('Last name', max_length=60)
    last_login = models.DateTimeField('Last login'))

If we take a closer look, we can see that the fields first_name and last_name contain sensitive data whereas last_login is quite uncritical.

With django-scrubber, we can define a subclass within the model like this:

# models.py

from django.db import modelsclass Customer(models.Model):
    ....

    class Scrubbers:
        first_name = scrubbers.Faker('first_name')
        last_name = scrubbers.Faker('last_name')

If we now run the management command scrub_data provided by scrubber, the package knows which fields to handle and how. When you anonymise the dataset, scrubber will pick a random first name and a random last name for every customer record you have in your database.

Scrubber utilises the Faker package which provides an abundance of helpful data types, like job descriptions, street names and many more. And it can even provide localised (language-specific) data! You can read all about your options in the Faker documentation.

In addition, scrubber itself provides a handful of useful tools like empty values or simply hashing the existing value which you can read about (here)[https://github.com/RegioHelden/django-scrubber/blob/master/README.md]. A really nice feature I would like to point out here is the value-casting. Faker only generates strings which the django ORM will not save in a field type different from char or text field. Scrubber tries to cast the faked values, so they fit to the declaration in the django model.

Custom scrubber class

Usually you want as little hassle as possible when creating an anonymised dump. So manipulating all those special cases as mentioned above and afterwards creating test users, forwarding credentials etc. is a thing you surely want to avoid.

For this reason we implemented an abstract class called AbstractScrubbingService:

# services.py

from ai_django_core.services.custom_scrubber import AbstractScrubbingService

class MyScrubbingService(AbstractScrubbingService):
    pre_scrub_functions = [
        'remove_some_data',
    ]

    post_scrub_functions = [
        'scrub_users',
        'delete_logs',
    ]

    def remove_some_data(self):
        pass

    def scrub_users(self):
        pass

    def delete_logs(self):
        pass

The service wraps the general scrubbing command and at the end truncates the scrubber data table. This table contains preprocessed information to speed up the scrubbing process. You do not need it afterwards, though. We do this so the database will be as small as possible for any kind of export.

Furthermore, you can create functions for handling data in any way which run before or after the scrubbing. You can see this in the example above. Only register the functions in the class attributes pre_scrub_functions or post_scrub_functions and implement it. Done!

Finally, you need to create a new management command called custom_scrub. Here is an example implementation:

# custom_scrub.py

from django.core.management.base import BaseCommand
from apps.core.services import MyScrubbingService

class Command(BaseCommand):
    def handle(self, *args, **options):
        scrubbing_service = MyScrubbingService()
        scrubbing_service.process()