Location

Hilton Hawaiian Village, Honolulu, Hawaii

Event Website

https://hicss.hawaii.edu/

Start Date

January 3, 2024

End Date

January 6, 2024

Description

We consider interactions between fairness and explanations in neural networks. Fair machine learning aims to achieve equitable allocation of resources, known as distributive fairness, by balancing accuracy and error rates across protected groups or among similar individuals. Methods shown to improve distributive fairness can induce different model behavior for majority and minority groups. This divergence in behavior can be perceived as disparate treatment, undermining acceptance of the system. In this paper, we use feature attribution methods to measure the average explanation for each protected group and show that differences can occur even when the model is fair. We prove a surprising relationship between explanations (via feature attribution) and fairness (in a regression setting), demonstrating that under moderate assumptions there are circumstances in which controlling one can influence the other. We then study this relationship experimentally by designing a novel loss term for explanations, GroupWise Attribution Divergence (GWAD), and comparing its effects with an existing family of loss terms for (distributive) fairness. We show that controlling explanation loss tends to preserve accuracy. We also find that controlling distributive fairness loss tends to reduce explanation loss empirically, even though it is not guaranteed to do so theoretically, and that including both loss terms yields additive improvements. We conclude by considering the implications for trust and policy of reasoning about fairness as the manipulation of explanations.
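
The abstract introduces GWAD only by name and does not give its formula. Purely as an illustration of the general idea it describes (penalizing divergence between group-wise average feature attributions), the PyTorch sketch below uses plain input gradients as the attribution method and an L1 distance between group means; the attribution choice, the distance, the binary `group` encoding, and the weight `lam` in the usage comment are assumptions for illustration, not the paper's definition.

```python
import torch


def input_gradient_attributions(model, x):
    """Gradient-of-output feature attributions (one of many possible attribution choices)."""
    x = x.detach().clone().requires_grad_(True)
    output = model(x).sum()
    # create_graph=True lets an attribution-based loss be backpropagated during training.
    (grads,) = torch.autograd.grad(output, x, create_graph=True)
    return grads  # shape: (batch_size, num_features)


def gwad_loss(model, x, group):
    """Illustrative groupwise attribution divergence: distance between the mean
    attribution vectors of two protected groups (group is a 0/1 tensor)."""
    attributions = input_gradient_attributions(model, x)
    mean_a = attributions[group == 0].mean(dim=0)
    mean_b = attributions[group == 1].mean(dim=0)
    return (mean_a - mean_b).abs().sum()  # L1 divergence; an arbitrary choice here


if __name__ == "__main__":
    # Tiny self-contained demo with a linear model and random data.
    model = torch.nn.Linear(4, 1)
    x = torch.randn(8, 4)
    group = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
    print(gwad_loss(model, x, group))

    # Hypothetical usage inside a training step (lam is an illustrative weight):
    # loss = task_loss + lam * gwad_loss(model, x_batch, group_batch)
```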

Does a Fair Model Produce Fair Explanations? Relating Distributive and Procedural Fairness

https://aisel.aisnet.org/hicss-57/sj/digital-discrimination/6