How do you create awareness and incorporate ethics into your data-driven organization?
Business has long been unable to do without data: concepts such as data literacy and data-driven work have become commonplace, not to mention machine learning and AI. Daily decisions rely more and more on data, and with this comes an increased focus on making those decisions fair and responsible. Rightfully so: the damage caused by unethical decisions can be great for those involved, and thus for your own organization.
This is a familiar struggle: implementing data-driven work ethically. It goes far beyond just looking at the results of, say, a prediction model. Ethical working is reflected in every step of a process, which makes every employee responsible for it and means every employee should be engaged with it. An understanding of ethical data-driven working is therefore essential for everyone in your organization.
Ethics as a concept is closely related to regulations such as those in the General Data Protection Regulation (GDPR). These cover, for example, gaining access to data in the first place (who, or which role, may view what data?), but also which data may be used and for how long. In this article, we intentionally do not focus on these specific regulations, but instead emphasize ethics as an ongoing process.
We do this by taking you through a step-by-step plan to deploy ethics as a design principle in your data-driven processes. As a guide, we use the gold standard in data-driven processes: CRISP-DM. In part one of this series, we discuss this framework and briefly explain each step from the perspective of the ethical questions you may encounter. In the remaining articles in this series, we will go deeper into each individual step.
CRISP-DM
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a framework for data-driven processes. It provides a structured approach and focuses on collaboration among the various roles involved in such a process. CRISP-DM is widely used, but it does not offer out-of-the-box guidance on ethics. If you want to know more about CRISP-DM, read this article, which discusses it further.
To this framework we would like to add an ethical component: a format in which the ethical blueprint of a process is established, starting from a set of standard questions. By this we do not mean a bulky document that will never be looked at again, but a guide to the substantive ethical discussions that need to take place, capturing key concerns briefly and concisely so they can serve as inspiration for future projects.

Business understanding
Step one is to determine and understand the organizational goals and expected outcomes of a project. Make sure a broad group of people from different business units and different roles is involved in this process. One of the key questions that needs to be answered is:
Who is the end user, how will they be affected by the application of this analysis/model, and what unintended side effects might occur?
Answering this question will reveal whether ethical concerns can be raised against the intended purpose. Sub-questions that may help in this regard include:
- What is the purpose of the project and can any ethical concerns be thought of here?
- What data do we need for this project?
- Has that data been used in previous projects, and were objections found there?
- For what purpose was the data collected and does it fit this project?
- How will the results from this project be put to practical use and what impact might this have on the end user?
Ethics must be a topic on which colleagues can hold each other accountable, at every step of a process. A single ethical appraisal per project is therefore not enough; it needs to happen on an ongoing basis.
Data understanding
This step revolves around exploring and understanding the data that is potentially relevant to the project. Here it is particularly up to the data analyst to find out how useful the data is: what data is available, whether enough data is available, whether the data is fit for purpose, whether a lot of preparation is needed, and so on. An analyst must continually ask herself whether she can achieve the same thing with less data (data minimization) or with less sensitive data (proportionality). After all, a model or product will never be "ethical" if the input data is already biased. The analyst therefore focuses on the question:
What bias is present in the data beforehand, and is the data usable? Is it fairly "distributed" and representative?
In addition, it is also relevant to consider the following questions:
- Is bias already known to exist in the data, and if so, can we take measures to minimize it?
- Are there tests we can run on the final result to assess the influence of bias?
- Can information be "hidden" in the data (for example, a zip code area is known to also encode income level)?
- Does my variable represent what I am aiming for (does the definition used when collecting the data match the definition I need for this project)?
Of course, the analyst, like the rest of the organization, also needs to consider whether ethical concerns can be raised against the stated business goals.
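The proxy and representativeness questions above can be made concrete with a few lines of analysis. A minimal sketch in Python with pandas, using purely illustrative column names and made-up numbers:

```python
import pandas as pd

# Hypothetical dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "zip_code": ["1011", "1011", "9901", "9901", "1011", "9901"],
    "income":   [72000, 68000, 31000, 29000, 75000, 33000],
    "approved": [1, 1, 0, 0, 1, 0],
})

# 1. Representativeness: what share of the data does each group make up?
group_share = df["zip_code"].value_counts(normalize=True)
print(group_share)

# 2. Proxy check: does a seemingly neutral variable encode income level?
income_by_zip = df.groupby("zip_code")["income"].mean()
print(income_by_zip)

# 3. Outcome skew: do historical outcomes differ strongly per group?
approval_by_zip = df.groupby("zip_code")["approved"].mean()
print(approval_by_zip)
```

If the mean income per zip code differs as sharply as in this toy example, the zip code is effectively a proxy for income, and any model using it inherits that bias.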
Data preparation
In this step, the data analyst actually gets to work preparing the data for the next phase. The challenges found in the previous step are addressed and resolved as far as possible, or measures are taken to minimize them. The goal is to end this step with a reliable and useful data set for the next phase of the process. The analyst must consider how bias can be removed from the data and focuses, for example, on the question:
Are you aware of any bias that cannot be avoided, and what steps do you take to prevent new bias from arising?
Here it is important to record what potential risks arise from the known bias and how the results from this step are used in the work process. It is also important to record the process of preparation itself, so that it is transparent which choices have been made.
Further questions of interest at this stage then include:
- What data must be pseudonymized before it can be used?
- How do you ensure that data cannot still be traced back to a person after anonymization?
- Does the preparation process itself introduce unintended bias?
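One common preparation measure is pseudonymizing direct identifiers before analysts work with the data. A minimal sketch using a keyed hash from the Python standard library; the key and field names are illustrative:

```python
import hashlib
import hmac

# Secret key, to be stored separately from the data (e.g. in a vault).
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash.

    Note: this is pseudonymization, not anonymization. The mapping is
    reproducible for anyone holding the key, so the key must be
    protected and the result still counts as personal data under
    the GDPR.
    """
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()

record = {"customer_id": "NL-000123", "age": 34}
record["customer_id"] = pseudonymize(record["customer_id"])
print(record)
```

The same identifier always maps to the same hash, so records can still be joined across tables without exposing the original value.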
Modeling
In the Modeling step, it is again mainly the data analyst's turn. Do not be misled by the name of this step: it is certainly not only about developing prediction models, but also about creating reports and analyses, or about deploying generative AI (such as GPT or other large language models). It can therefore also be about descriptive analysis. The central question here is:
Does the product produce undesirable, unfair or biased results, and do I understand how my model arrives at its prediction?
It is important at this stage to check, each time, that the analyst has validated all her assumptions. In-depth questions that help with this include:
- What principles can I use to test my output for bias and ethics? Read more about how to do this using the Python package Fairlearn here.
- Can I develop unit tests that provide insight into how specific examples are handled (e.g., testing a face-recognition model with images of both women and men)?
- What metric do I use to choose the best model, and what are the implications of that choice? (e.g., when predicting cancer you accept more false positives than when predicting churn)
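The bias test and the unit test from the questions above can be combined in one small check. Fairlearn offers this kind of group-wise metric via its `MetricFrame`; the sketch below computes the same idea, the demographic-parity difference, with plain pandas on made-up predictions, and the threshold is purely illustrative:

```python
import pandas as pd

# Hypothetical model predictions with a sensitive attribute attached.
results = pd.DataFrame({
    "gender":    ["f", "f", "f", "m", "m", "m"],
    "predicted": [1, 0, 1, 1, 1, 1],
})

# Selection rate per group: how often does the model predict "1"?
selection_rate = results.groupby("gender")["predicted"].mean()
print(selection_rate)

# Demographic-parity difference: gap between best- and worst-off group.
dp_difference = selection_rate.max() - selection_rate.min()
print(f"demographic parity difference: {dp_difference:.2f}")

# A simple "unit test" in the spirit of the questions above: fail the
# build if the gap between groups exceeds a threshold agreed on with
# the business.
assert dp_difference <= 0.4, "model favors one group too strongly"
```

Running such a check automatically on every retrained model turns the ethical discussion into something colleagues can actually hold each other accountable for.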
Evaluation
In this step, you evaluate the performance and outcomes of the model or analysis and determine the extent to which they meet the goals set in step one, business understanding. Dare to ask yourself and the team: do I want to publish this, and what would the reactions be? This is where an earlier question returns:
What is the impact of the outcome of an analysis or model on the user?
In doing so, dwell on questions such as:
- Is there any unintended bias in the data or model after all?
- What is the consequence of publishing it?
- Are the outcomes transparent and can they be explained?
Deployment
This step involves the implementation and go-live of a product. Be transparent about any limitations of the model and the bias they may create; you identified and documented these in the previous steps. It is also important to keep a close eye on how a product develops over time and whether unwanted side effects or results arise after all. If so, is it clear and transparent why, and how this can be addressed? One of the monitoring questions that should continue to be asked after going live is:
Are changes to the previous steps needed?
Realize that new factors can influence the outcome of the application and still cause an undesirable impact on end-user privacy, for example. Therefore, consider questions such as:
- What are the long-term consequences of the choices made and of the deployment of the data?
- How is the data stored, and how does this affect its future use?
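Keeping an eye on how a product develops over time can start very simply: compare what the live system receives with what the model was trained on. A minimal drift-monitoring sketch, with made-up numbers and an illustrative threshold:

```python
import pandas as pd

# Hypothetical feature values: distribution at training time versus
# what the live system is receiving now.
train_age = pd.Series([25, 31, 40, 29, 35, 33])
live_age = pd.Series([52, 58, 61, 49, 55, 60])

# A large standardized shift in the mean signals data drift, which
# means the ethical assessments of earlier steps may no longer hold.
shift = abs(live_age.mean() - train_age.mean()) / train_age.std()
print(f"standardized mean shift: {shift:.2f}")

if shift > 1.0:  # threshold is illustrative; tune it per use case
    print("drift detected: revisit the earlier CRISP-DM steps")
```

When such a check fires, the honest answer to "are changes to the previous steps needed?" is usually yes: the population the model serves is no longer the one it was assessed on.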
Outro
In this first article of our series on ethical data-driven work, we have provided initial practical tools to create more awareness of, and reflection on, ethics in your organization. Above all, we want to show that ethical working never rests with one person or one part of the company; it is something your organization should be steeped in.
In our next articles, we will explore the different roles within an organization and flesh out the model outlined above. For example: how do you approach the conversation about ethics from each of those roles?



