
Framing the AI Fairness Question?

Emergent technologies, especially those with applications as wide-ranging as Artificial Intelligence, often prompt conversations about whether they are being used fairly and equitably – especially when their effects might differ across sensitive attributes such as race, age, or sexual orientation. In the case of binary classification, a supervised algorithm might take social data as input and predict a ‘yes’ or ‘no’ for each individual on some outcome of interest. To ensure fairness, the model must not unfairly classify individuals in decisions that affect important outcomes such as state medical benefits, loan applications, or job opportunities.

A natural approach to such questions would be to determine whether the model makes the ‘correct’ predictions at similar rates across sensitive demographic groups. In this blog post, BPC explains how such comparisons can lead different stakeholders to divergent conclusions about the same model – even when the AI model itself is unbiased in its predictions. AI systems often trade off the risks of relying on their recommendations against the benefits of increased efficiency or reduced costs. We will explore how those risks change depending on how the tradeoffs are resolved, and why it is essential to identify the risks and the impact of the outcomes.

To illustrate this point, we simulated data and trained a binary classification algorithm to determine whether prospective college students would graduate or not graduate based on a small selection of features that vary between two groups – group ‘A’ and group ‘B.’ The features of each simulated student are:

  • Base ability: students with a higher ability level graduate at higher rates. This does not vary between group A and group B – average base ability is equivalent between the groups.
  • High school quality: students graduate from one of two high schools, one of which significantly boosts admission test scores (e.g., the SAT) and one of which does not. Students in group A have an 80% chance of attending the high-performing high school. In comparison, students in group B have only a 20% chance of attending – this can be thought of as an effect of the high-performing high school being in a neighborhood made up primarily of group A residents.
  • Admission test score: Based on their base ability and the quality of their schooling, each student receives an admission test score. The average score for group B is lower than that of group A because group A students are more likely to attend the high-performing high school.

The trained model aims to predict whether each simulated student will graduate or not using only which high school the student attended and their test score, not their group status. The simulation’s source code, as well as supplementary material and figures describing the simulation in more detail, can be found on BPC’s GitHub page.
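For readers who want to experiment, the setup above can be sketched in a few lines. This is a minimal illustration and not the simulation published on BPC’s GitHub page: the function name `simulate_students`, the distributions, and every numeric constant other than the 80% and 20% attendance rates are assumptions. In particular, the sketch assumes the high-performing school also raises the chance of graduating, so that the two groups graduate at different rates.

```python
import numpy as np

def simulate_students(n=10_000, seed=0, school_grad_boost=1.0):
    """Simulate students in groups A and B (all constants are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    group = rng.choice(np.array(["A", "B"]), size=n)
    # Base ability is drawn from the same distribution for both groups.
    ability = rng.normal(0.0, 1.0, size=n)
    # Group A attends the high-performing school 80% of the time, group B 20%.
    p_good = np.where(group == "A", 0.8, 0.2)
    good_school = rng.random(n) < p_good
    # Test score: ability plus a boost from the high-performing school, plus noise.
    score = ability + 1.0 * good_school + rng.normal(0.0, 0.5, size=n)
    # Assumption: graduation depends on ability and, hypothetically, on school quality.
    p_grad = 1.0 / (1.0 + np.exp(-(ability + school_grad_boost * good_school - 0.5)))
    graduated = rng.random(n) < p_grad
    return {
        "group": group,
        "ability": ability,
        "good_school": good_school,
        "score": score,
        "graduated": graduated,
    }

students = simulate_students()
for g in ("A", "B"):
    members = students["group"] == g
    print(g, "mean score:", round(float(students["score"][members].mean()), 2),
          "graduation rate:", round(float(students["graduated"][members].mean()), 2))
```

A classifier trained on `good_school` and `score` alone would never see `group`, yet its inputs still carry group information through school attendance.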

Defining a ‘fair’ decision

As policymakers and agency leaders work to evaluate different AI tools, they should consider three basic metrics that model designers and practitioners routinely evaluate: precision, recall, and accuracy (Fig. 1).

  • Precision is a measure of how often the model was correct amongst positive predictions. In the context of this model, this represents how often students actually graduate when they are predicted to graduate. Equivalently, precision is the ratio of true positives to all predicted positives (true positives plus false positives).
  • Recall is a measure of how often the model correctly identifies actual positives. Here, this is the percentage of the students who actually graduate that are correctly classified as positives by the model. Equivalently, recall is the ratio of true positives to all actual positives (true positives plus false negatives).
  • Accuracy is a measure of how often the model is correct in its prediction. Unlike precision and recall, this metric is the percentage of all correct classifications, including true negatives. In the context of this model, this is the percentage of students who were correctly classified by the predictive model, whether they were predicted to graduate or drop out. In this way, accuracy does not distinguish between false negatives and false positives.

Figure 1: The four possible outcomes in a classification process, along with associated metrics. Precision restricts attention to positive classifications only, while recall restricts attention to positive outcomes only. Accuracy is a measure of correct classifications among the entire sample. 
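As a quick illustration, the three metrics can be computed directly from the four outcome counts. The helper name `classification_metrics` and the counts below are hypothetical and chosen only to make the arithmetic easy to follow; the sketch assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from the four outcome counts."""
    precision = tp / (tp + fp)                  # correct among predicted graduates
    recall = tp / (tp + fn)                     # correct among actual graduates
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct among all students
    return precision, recall, accuracy

# Hypothetical counts: 60 true positives, 20 false positives,
# 40 false negatives, 80 true negatives.
precision, recall, accuracy = classification_metrics(tp=60, fp=20, fn=40, tn=80)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Here precision is 60 out of 80 predicted graduates, recall is 60 out of 100 actual graduates, and accuracy is 140 out of 200 students.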

Precision and recall emphasize two very different kinds of risk. High precision reflects a low number of false positives, meaning that most students who were predicted to graduate ended up graduating. High recall, conversely, reflects a low number of false negatives, meaning that most students who go on to graduate were correctly predicted to graduate by the model. The difference between these metrics may seem slight, but it is important to consider who bears the cost of each mistake. A predicted graduate who drops out (false positive) directly lowers the graduation rate of the university using the model. A would-be graduate who is predicted to drop out (false negative), however, represents a societal harm: the unrealized potential of a productive individual.

Figure 2. The precision-recall tradeoff in the context of simulated graduation data. Each point represents a different decision boundary determining which students are predicted to graduate and which are predicted to drop out. 

Though the final decisions are binary, classification algorithms first assign each observation a likelihood of being in the positive class – in this case, a likelihood for a given student to graduate. To convert this probability to a prediction, the model designer chooses a likelihood threshold above which they will label the student a predicted graduate and below which they will label the student a predicted drop-out. At very high thresholds, most predicted graduates end up graduating, but many would-be graduates are filtered out – in other words, precision is high, but recall is low. At very low thresholds the opposite holds: recall is high, but precision is low.
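The effect of the threshold can be seen in a small sketch. The likelihoods below are synthetic and are not drawn from BPC’s simulation; the sketch assumes, for simplicity, that each student graduates with exactly their predicted likelihood. The helper name `precision_recall_at` is hypothetical.

```python
import numpy as np

def precision_recall_at(threshold, likelihood, outcome):
    """Precision and recall when students at or above `threshold` are predicted to graduate."""
    predicted = likelihood >= threshold
    tp = np.sum(predicted & outcome)
    fp = np.sum(predicted & ~outcome)
    fn = np.sum(~predicted & outcome)
    return tp / (tp + fp), tp / (tp + fn)

rng = np.random.default_rng(0)
# Assumed synthetic data: likelihoods are uniform, outcomes follow them exactly.
likelihood = rng.random(20_000)
outcome = rng.random(20_000) < likelihood

sweep = {t: precision_recall_at(t, likelihood, outcome) for t in (0.2, 0.5, 0.8)}
for t, (p, r) in sweep.items():
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold should push precision up and recall down, which is the pattern Figure 2 traces out point by point.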

In context, the choice to prioritize one metric over the other can be relatively straightforward or highly subjective. For example,

  • The designer of a social media content recommendation model would emphasize precision to minimize the number of irrelevant or offensive posts a given user might see to keep them engaged on the site. Users of the social media service may benefit as well from seeing fewer irrelevant or offensive advertisements.
  • The designer of a cancer screening tool would rightly emphasize recall to maximize the number of cancer patients who are identified early enough to treat effectively. Patients benefit from peace of mind – exchanging potentially unneeded follow-ups for better long-term health.
  • The designer of a college admissions tool could emphasize precision to ensure a high graduation rate or could emphasize recall to ensure that most applicants who could graduate are given the opportunity. More precision increases the graduation rate for the university at the expense of some students who are less likely to graduate, while more recall would reduce graduation rates but result in more total graduates – a tradeoff with differing costs to the students applying, the university, and society.

This is known as the precision-recall tradeoff, and it represents one of many design choices that can influence the behavior of AI systems in practice. By choosing a particular threshold, a designer makes an implicit choice about which types of errors are more likely to occur.

How can we determine if a model is ‘fair’?

Each of the metrics described above describes how a model performs on the aggregate level without respect to sensitive classes like race, gender, religion, or other social demographics. To determine whether the model discriminates against these distinct classes, there are a wide number of metrics to consider. For the purposes of illustration, consider two alternatives – one based on predictions, and one based on outcomes:

  • Predictive parity: For each predicted likelihood of graduation, do demographic groups graduate at the same rate? In other words, conditioned on the prediction made by the model, are the graduation outcomes similar between groups?
  • Equal opportunity: For each group, are students who go on to graduate given the same likelihood of graduation? In other words, conditioned on the graduation outcome, are the predictions made by the model similar between groups?

Either of these can be a reasonable definition of fairness, depending on perspective. Predictive parity is reasonable from the perspective of the modeler, who ultimately does not know the potential outcome of every applicant in the real world. The university can, however, check predictive parity by verifying that the students predicted to graduate end up graduating at similar rates across groups. A model with equal precision between groups will achieve ‘predictive parity’ for those predicted to graduate.

Equal opportunity more closely resembles the idea of fairness held by students and society more broadly – ensuring that those who would go on to graduate are given similar risk classifications. Violating this would mean that among those who would go on to graduate, one group would have a higher predicted likelihood of graduation, ultimately leading that group to be admitted at a higher rate. A model with equal recall between groups will achieve equal opportunity for those who would graduate if given a chance to do so.
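In practice, checking either definition comes down to computing precision and recall separately for each group and comparing them. The sketch below uses a hypothetical helper, `rates_by_group`, and eight made-up students; it assumes every group has at least one predicted graduate and one actual graduate.

```python
import numpy as np

def rates_by_group(predicted, outcome, group):
    """Precision and recall computed separately for each group."""
    predicted = np.asarray(predicted, dtype=bool)
    outcome = np.asarray(outcome, dtype=bool)
    group = np.asarray(group)
    rates = {}
    for g in np.unique(group):
        in_group = group == g
        tp = np.sum(predicted & outcome & in_group)
        fp = np.sum(predicted & ~outcome & in_group)
        fn = np.sum(~predicted & outcome & in_group)
        rates[str(g)] = {
            "precision": float(tp / (tp + fp)),
            "recall": float(tp / (tp + fn)),
        }
    return rates

# Made-up example: four students per group.
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
outcome = [1, 1, 0, 0, 1, 1, 0, 0]    # 1 = graduated
predicted = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 = predicted to graduate

rates = rates_by_group(predicted, outcome, group)
# Predictive parity compares precision; equal opportunity compares recall.
parity_gap = abs(rates["A"]["precision"] - rates["B"]["precision"])
opportunity_gap = abs(rates["A"]["recall"] - rates["B"]["recall"])
print(rates, parity_gap, opportunity_gap)
```

A gap near zero on precision indicates predictive parity; a gap near zero on recall indicates equal opportunity. In this toy example neither holds.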

Figure 3. Decision boundaries are chosen to achieve predictive parity (horizontal line) or equal opportunity (vertical line). Percentages represent the decision boundary chosen for each fairness approach; a student is predicted to graduate if their predicted likelihood of graduation exceeds that boundary.

To achieve one of these fairness metrics in this context, where one group graduates at a higher rate than the other, the analyst would need to choose a different decision threshold for each group. To account for the lower graduation rate of group B, the analyst would either raise the decision boundary to reduce the number of eventual drop-outs and achieve predictive parity or lower the decision boundary to capture more possible graduates at the expense of also capturing more eventual drop-outs (Figure 3).
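One way an analyst might pick group-specific thresholds for equal opportunity is to search, within each group, for the highest threshold that still reaches a shared recall target. The sketch below is an assumption-laden illustration rather than the procedure behind Figure 3: the likelihoods are synthetic (drawn from Beta distributions so that group B scores lower on average), the 80% recall target is arbitrary, and the helper names are hypothetical.

```python
import numpy as np

def recall_at(threshold, likelihood, outcome):
    """Share of actual graduates whose likelihood is at or above the threshold."""
    predicted = likelihood >= threshold
    return np.sum(predicted & outcome) / np.sum(outcome)

def threshold_for_recall(likelihood, outcome, target):
    """Highest threshold on a 0.01 grid whose recall is still at least `target`."""
    grid = np.linspace(0.0, 1.0, 101)
    feasible = [t for t in grid if recall_at(t, likelihood, outcome) >= target]
    return float(max(feasible))  # threshold 0.0 always has recall 1.0

rng = np.random.default_rng(0)
# Assumed synthetic likelihoods: group A skews high, group B skews low.
like_a = rng.beta(5, 3, size=5_000)
like_b = rng.beta(3, 5, size=5_000)
grad_a = rng.random(5_000) < like_a
grad_b = rng.random(5_000) < like_b

t_a = threshold_for_recall(like_a, grad_a, target=0.8)
t_b = threshold_for_recall(like_b, grad_b, target=0.8)
print(f"group A threshold={t_a:.2f}, group B threshold={t_b:.2f}")
```

Under these assumptions the search settles on a lower boundary for group B than for group A, which matches the direction described above for equal opportunity. The same search run against a precision target would instead move group B’s boundary upward.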

Figure 4. Number of students predicted to graduate or drop out by choice of decision threshold/fairness objective. Percentages on top of bars represent the graduation rate for students of a given group with a given model classification. Percentage in the gray areas between bars represents the percentage of actual graduates that are correctly classified by the model.  

Unfortunately, these two definitions of fairness are only simultaneously possible if one has a perfect classifier (meaning every prediction is correct) or if the proportion of positives is equal for every group in the data (in this case, meaning that both groups graduate at the same rate). These are extremely strong conditions that are unlikely to be met in most relevant policy spaces; as a result, these two definitions are effectively mutually exclusive.
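One way to see where the tension comes from is to write precision in terms of recall, the false positive rate, and the share of actual positives in a group. The sketch below makes an assumption that goes slightly beyond the text: it holds both recall and the false positive rate equal between groups, which is a stronger condition than equal opportunity alone. The function name and all rates are hypothetical.

```python
def precision_from_rates(recall, false_positive_rate, base_rate):
    """Precision implied by a group's recall, false positive rate, and share of positives."""
    true_pos = recall * base_rate                         # share of group that is a true positive
    false_pos = false_positive_rate * (1.0 - base_rate)   # share that is a false positive
    return true_pos / (true_pos + false_pos)

# Same error rates for both groups, different (hypothetical) graduation rates.
prec_a = precision_from_rates(recall=0.8, false_positive_rate=0.2, base_rate=0.6)
prec_b = precision_from_rates(recall=0.8, false_positive_rate=0.2, base_rate=0.4)
print(f"group A precision={prec_a:.3f}, group B precision={prec_b:.3f}")

# The gap closes with a perfect classifier or with equal graduation rates.
perfect_a = precision_from_rates(recall=1.0, false_positive_rate=0.0, base_rate=0.6)
perfect_b = precision_from_rates(recall=1.0, false_positive_rate=0.0, base_rate=0.4)
print(perfect_a, perfect_b)
```

With identical error rates, the group with the lower graduation rate ends up with lower precision, so predictive parity fails unless one of the two escape conditions holds.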

How can we evaluate these models in the real world?

This contradiction is representative of many high-stakes algorithmic decisions made with respect to people’s lives, where those governed by a model disagree with those actually using the model on what metrics to prioritize. Predictive tools in public policy routinely make decisions in contexts where predicted outcomes vary across sensitive groups.

One such model, now discontinued, was the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) model developed by Northpointe, Inc. (now Equivant Inc.). Since its inception in 1998, this tool was used to assess over 1 million defendants in New York, Wisconsin, California, and Florida, to aid judges in making sentencing decisions.

In 2016, ProPublica reported that the tool was racially biased against black defendants. In a collection of data obtained from Broward County, Florida, the ProPublica team found that black defendants who did not re-offend were incorrectly predicted to re-offend twice as often as their white counterparts, violating equal opportunity. Northpointe rebutted allegations of bias, arguing that among defendants with a similar risk rating, both groups re-offended at the same rate, satisfying predictive parity.

From a purely mathematical perspective (demonstrated on BPC’s GitHub page), neither ProPublica nor Northpointe was incorrect in their calculations, but the controversy exemplifies why these comparisons cannot shed light on a model’s ‘fairness’ until experts and stakeholders can come to a consensus on which types of errors to prioritize. Ultimately, the normative decisions that determine the societal cost of a false positive or false negative should be made not by the designer of an automated tool but by all those whose lives will be affected by a certain model’s decision.
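A toy example shows how both kinds of claim can hold at once. The counts below are invented for illustration and are not the Broward County data; the group labels X and Y and the helper name are hypothetical. One comparison conditions on the prediction (as Northpointe’s did), the other on the outcome (as ProPublica’s did).

```python
def ppv_and_fpr(tp, fp, fn, tn):
    """Return (share of positive predictions that were right, share of actual negatives flagged)."""
    ppv = tp / (tp + fp)
    fpr = fp / (fp + tn)
    return ppv, fpr

# Invented counts for two groups with different underlying rates of the outcome.
ppv_x, fpr_x = ppv_and_fpr(tp=300, fp=200, fn=200, tn=300)
ppv_y, fpr_y = ppv_and_fpr(tp=150, fp=100, fn=150, tn=600)

print(f"group X: PPV={ppv_x:.2f}, false positive rate={fpr_x:.2f}")
print(f"group Y: PPV={ppv_y:.2f}, false positive rate={fpr_y:.2f}")
```

In this made-up case a positive prediction is equally reliable for both groups, yet members of group X who do not have the outcome are flagged far more often than their counterparts in group Y.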

Fairness metrics in isolation are not a comprehensive solution to biased AI, and a complete AI policy agenda should consider which metrics are relevant for specific uses, acknowledging that not all types of ‘errors’ are created equal. Alongside technical approaches to explain the decisions of these models and explicit design choices to mitigate bias, fairness metrics can provide crucial information on whether algorithmic tools are impacting the civil rights of American citizens. However, the true costs of a model’s mistakes can only be understood once it is clear how the organizations using AI define ‘fairness.’
