Find your high-risk files according to AVG with our DriveScanner

In any business, it is a challenge to keep only the documents we want and will need in the future. Doing so reduces the total number of files but, more importantly, ensures that we are not violating the AVG (Algemene Verordening Gegevensbescherming, the Dutch name for the GDPR). That matters not only to our Privacy Officer, but to all of us.

AVG, we love it (to hate it)

We just have to play by the rules, that sounds simple enough! But what sounds simple can be a lot more complicated in practice. Daily activities occupy our minds and may cause us to forget to clean up our drive after completing a project. And isn't the AVG sometimes just a nuisance too...? 'We just need to share that resume and share it now! And yes, there may be some contact information in that Excel file, but you delete that as soon as you don't need it anymore, right?'

Right ... not so! We are all human, which means that our actions are not always consistent with our intentions. That does not mean that we knowingly violate the AVG, but neither does it mean that we never do.

Your Privacy Officer may be aware of this and try to encourage all users to clean up: check your Downloads folder, delete saved attachments, empty your recycle bin, clean up the project folder at the end of the project. But that doesn't mean an extra check wouldn't be an excellent idea.

Wait a minute, how much?

At Cmotions, we knew we were at risk simply because of the large number of files on our file system, even though we only store our own project files and do not store our clients' data anywhere on our own file system. So our Privacy Officer tried to come up with rules to eliminate AVG-sensitive files as much as possible. To the other employees, it felt like these rules weren't doing what they were supposed to do, and we, as data professionals, were convinced that we could do better. That's when we came up with the idea of developing a Python package to perform these checks for us. The package was meant to make the job a lot easier and solve the problems mentioned above. With just a few clicks, you should be able to see a list of files that you should check for AVG-sensitive information yourself. Preferably, you should also be able to see which AVG rule was violated and how.

With this in mind, we started building our Python package "DriveScanner," and now we are proud to share our first version. It may not be perfect yet, as it is a work in progress, but what better way to improve DriveScanner than with your help? Check out our code in our repository, or just start using our package by installing it with pip: pip install drivescanner.

The genesis of the DriveScanner

How did this package help us? First, it gave us insight into the number of files and the different file types we have stored in our system. A shocking 223,976 files! Assuming that it would take about 10 to 15 minutes to check each file and knowing that we only have one Privacy Officer, we now knew for sure that it would be impossible for us to check all these files manually. By setting up a set of AVG checks that run automatically on each file, we gained insight into how often a specific AVG violation occurred in a specific file. Currently, the package scans for Dutch citizen service numbers (BSN), bank details, e-mail addresses, phone numbers, addresses in general, credentials of any kind, and credit card or passport numbers. It also checks for credential tags such as login information. Optionally, the scan can also detect Named Entities in Dutch and other languages.
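To make the idea of pattern-based checks concrete, here is a minimal sketch of how e-mail addresses and Dutch citizen service numbers could be detected in a piece of text. It is not DriveScanner's actual code or API: the names EMAIL_RE, NINE_DIGITS_RE, is_valid_bsn and find_sensitive are hypothetical, and the e-mail pattern is deliberately simple. The BSN check uses the well-known "11 test" to filter out arbitrary nine-digit numbers.

```python
import re

# Hypothetical patterns for illustration only; not DriveScanner's real rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NINE_DIGITS_RE = re.compile(r"\b\d{9}\b")

# Weights of the "11 test" for a nine-digit BSN (the last digit counts as -1).
BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def is_valid_bsn(candidate: str) -> bool:
    """Return True if a nine-digit string passes the 11 test."""
    if len(candidate) != 9 or not candidate.isdigit():
        return False
    total = sum(int(digit) * weight for digit, weight in zip(candidate, BSN_WEIGHTS))
    # All zeros would pass the modulo check, so exclude a total of 0.
    return total != 0 and total % 11 == 0


def find_sensitive(text: str) -> dict:
    """Collect e-mail addresses and plausible BSNs found in the text."""
    return {
        "email": EMAIL_RE.findall(text),
        "bsn": [m for m in NINE_DIGITS_RE.findall(text) if is_valid_bsn(m)],
    }


sample = "Contact jan@example.com, BSN 111222333, order 123456789"
hits = find_sensitive(sample)
```

In this sample the order number is nine digits long but fails the 11 test, so only one BSN candidate is reported. A checksum like this is one way to keep the number of false positives down.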

Based on the scan results, each file is given a score that reflects the severity of the breach. These scores allowed our Privacy Officer to filter files on a specific breach or on an overall score.
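As an illustration of scoring and filtering, the sketch below turns per-file finding counts into a weighted score and then filters on either a specific breach type or a minimum overall score. The weights, the function names risk_score and filter_files, and the example file names are all assumptions made for this example; the package itself may score files differently.

```python
# Hypothetical weights per finding type; unknown types count as 1.
WEIGHTS = {"bsn": 10, "credit_card": 8, "email": 2, "phone": 2}


def risk_score(counts: dict) -> int:
    """Weighted sum of the number of findings per type."""
    return sum(WEIGHTS.get(kind, 1) * n for kind, n in counts.items())


def filter_files(results: dict, kind=None, min_score=0) -> list:
    """Return (path, score) pairs, highest score first.

    kind: keep only files with at least one finding of this type.
    min_score: keep only files whose overall score is at least this value.
    """
    selected = []
    for path, counts in results.items():
        if kind is not None and counts.get(kind, 0) == 0:
            continue
        score = risk_score(counts)
        if score >= min_score:
            selected.append((path, score))
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Made-up scan results: file path -> number of findings per type.
results = {
    "a.xlsx": {"bsn": 2, "email": 1},
    "b.txt": {"email": 3},
    "c.docx": {},
}
with_email = filter_files(results, kind="email")
high_risk = filter_files(results, min_score=10)
```

Sorting by score puts the files that most need attention at the top of the list, which is the order a reviewer would want to work through them.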

How we used our DriveScanner

Now what? Even when we know which files contain sensitive information, it can still take a lot of time to see where and what type of breach occurred. That's why we also added the type of breach to the output table. That way our Privacy Officer not only knew which file to look at, but also which breach to look for. With just a few clicks and some waiting time, we were able to scan 223,976 files for AVG violations. Not only did this help us rid some files of sensitive information, but it also saved us a lot of time. For example, we found that 90% of the files on our Drive did not require human evaluation. Of the 10% that did, we started with the Excel output and were thus able to rule out another 7% of the files. That left 3% that needed to be opened and evaluated. Still a significant number of files, but far fewer than we started with.
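To get a feel for the time saved, here is a rough back-of-the-envelope calculation. It assumes that the percentages above refer to the total of 223,976 files and reuses the estimate of 10 to 15 minutes of manual review per file; both the percentages and the minutes are approximations, so the outcome is only indicative.

```python
TOTAL_FILES = 223_976
MINUTES_PER_FILE = (10, 15)  # low and high estimate for a manual check


def review_hours(n_files: int) -> tuple:
    """Return the (low, high) estimate in hours for reviewing n_files by hand."""
    low, high = MINUTES_PER_FILE
    return n_files * low / 60, n_files * high / 60


# Assumption: 3% of all files still need to be opened and evaluated.
files_to_open = round(TOTAL_FILES * 0.03)

hours_all = review_hours(TOTAL_FILES)
hours_remaining = review_hours(files_to_open)

print(f"Files to open: {files_to_open}")
print(f"All files: {hours_all[0]:.0f} to {hours_all[1]:.0f} hours")
print(f"Remaining 3%: {hours_remaining[0]:.0f} to {hours_remaining[1]:.0f} hours")
```

Under these assumptions, checking every file by hand would take tens of thousands of hours, while the remaining 3% comes down to somewhere between roughly 1,100 and 1,700 hours.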

You may be wondering whether all those files did any harm at all. Fortunately, no! Mostly, we found some areas of improvement for the DriveScanner itself. Some hits, however, were correct from the DriveScanner's point of view, such as:

A file created by our DataSampler, containing fictitious personal information such as phone numbers, addresses and e-mail addresses;
A project file where we had multiple external stakeholders, with the name, phone number and email address of all reviewers included in the document.

What's next? Will you help us?

So it seems that our assumptions were correct. And why keep something so simple and powerful to ourselves? That's why we would like to share this with you. Check out our repository, pip install our package, see how it works and help us improve!
