Artificial intelligence is increasingly being used to help improve decision-making in high-risk environments. For example, an autonomous system can select a power distribution strategy that minimizes costs while maintaining stable voltages.
But while these AI-driven outputs may be technically perfect, are they fair? What if a low-cost energy distribution strategy leaves disadvantaged neighborhoods more vulnerable to power outages than high-income areas?
To help stakeholders quickly identify potential ethical dilemmas before deployment, MIT researchers have developed an automated evaluation method that balances the interaction between measurable outcomes, such as cost or reliability, and qualitative or subjective values, such as fairness.
The system separates objective assessments from user-defined human values, using a large language model (LLM) as a proxy for humans to capture and integrate stakeholder preferences.
The adaptive framework selects the best scenarios for further evaluation, simplifying a process that usually requires costly and time-consuming manual effort. These test cases can demonstrate situations in which autonomous systems conform well to human values, as well as scenarios in which they unexpectedly fall short of ethical standards.
“We can insert a lot of rules and guardrails into AI systems, but these safeguards can only prevent things we can imagine from happening. It’s not enough to say, ‘Let’s use AI because it’s been trained on this information.’ We wanted to develop a more systematic way to detect unknowns and have a way to predict them before anything bad happens,” says senior author Chuchu Fan, an associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro) and a principal investigator in MIT’s Laboratory for Information and Decision Systems (LIDS).
Fan is joined on the paper by lead author Anjali Parashar, a graduate student in mechanical engineering; Yingke Li, a postdoctoral researcher in AeroAstro; and others at MIT and SAP. The research will be presented at the International Conference on Learning Representations.
Ethics evaluation
In a system as large as the power grid, assessing the ethical consistency of an AI model’s recommendations in a way that takes into account all objectives is particularly difficult.
Most testing frameworks rely on previously collected data, but it is often difficult to obtain disaggregated data on subjective ethical standards. In addition, because ethical values and AI systems are constantly evolving, static evaluation methods based on written rules or regulatory documents require frequent updates.
Fan and her team approached this problem from a different perspective. Drawing on their previous work evaluating robotic systems, they developed an experimental design framework to identify the most useful scenarios, which human stakeholders would then closely evaluate.
Their two-part system, called Scalable Experimental Design for System-Level Ethical Testing (SEED-SET), incorporates both quantitative metrics and ethical criteria. It can efficiently identify scenarios that meet measurable requirements and align well with human values, as well as scenarios that fail on either count.
“We don’t want to spend all our resources on random evaluations,” Li says. “So, it’s very important to target the framework toward the test cases we care most about.”
Importantly, SEED-SET does not require pre-existing assessment data and is adaptive to multiple objectives.
For example, a power grid may serve several user groups, including a large rural community and a data center. Although both groups may want low-cost, reliable energy, their ethical priorities may differ greatly.
These ethical standards may not be well defined, so they cannot be measured analytically.
The power grid operator wants to find the most cost-effective strategy that best meets the subjective ethical preferences of all stakeholders.
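To make this concrete, here is a minimal sketch of how such preferences might be written down as natural-language value statements, one per stakeholder group. The group names and wording are invented for illustration; the article does not specify the format SEED-SET actually uses.

```python
# Hypothetical stakeholder value statements for the power-grid example.
# Group names and wording are illustrative only.
STAKEHOLDER_VALUES = {
    "rural_community": (
        "Prefer strategies that keep outage risk low across sparsely "
        "populated areas, even at a moderately higher cost."
    ),
    "data_center": (
        "Prefer strategies that guarantee uninterrupted supply during "
        "peak demand; cost matters less than reliability."
    ),
}
```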
SEED-SET addresses this challenge by dividing the problem into two parts, following a hierarchical structure. An objective model captures how the system performs on tangible metrics such as cost. Then, a subjective model that incorporates stakeholder judgments, such as perceived fairness, builds on that objective evaluation.
“The objective part of our approach is linked to the AI system, while the subjective part is linked to the users who evaluate it,” says Parashar. “By analyzing preferences in a hierarchical manner, we can create the desired scenarios with fewer evaluations.”
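One way to picture this hierarchy is as a pipeline in which a simulator produces objective metrics and a separate preference model then scores those metrics against a stakeholder’s stated values. The sketch below assumes hypothetical `simulate` and `subjective_score` functions; it illustrates the layering, not the paper’s actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ObjectiveOutcome:
    cost: float          # e.g., dollars per megawatt-hour
    reliability: float   # e.g., expected fraction of demand served

def evaluate_scenario(
    scenario: dict,
    simulate: Callable[[dict], ObjectiveOutcome],
    subjective_score: Callable[[ObjectiveOutcome, str], float],
    values: str,
) -> tuple[ObjectiveOutcome, float]:
    """Objective model first; a subjective judgment is built on top of it."""
    outcome = simulate(scenario)                 # objective: measurable metrics
    rating = subjective_score(outcome, values)   # subjective: stakeholder judgment
    return outcome, rating
```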
Encoding preferences
To stand in for human evaluation, the system uses an LLM as a proxy for human raters. The researchers encode each user group’s preferences as natural-language instructions for the model. The LLM uses these instructions to compare two scenarios and choose the preferred one based on the stated ethical criteria.
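A minimal sketch of such a pairwise comparison appears below. It assumes a generic `llm_complete(prompt) -> str` text-completion function as a stand-in for whatever LLM is actually used, and the prompt wording is invented for illustration.

```python
def prefer(scenario_a: str, scenario_b: str, values: str, llm_complete) -> str:
    """Ask an LLM, acting as a proxy rater, which of two scenarios
    better satisfies a stakeholder's stated ethical criteria."""
    prompt = (
        "You are evaluating power-distribution scenarios on behalf of a "
        f"stakeholder whose values are: {values}\n\n"
        f"Scenario A: {scenario_a}\n"
        f"Scenario B: {scenario_b}\n\n"
        "Answer with exactly 'A' or 'B': which scenario better aligns "
        "with the stakeholder's values?"
    )
    answer = llm_complete(prompt).strip().upper()
    return "A" if answer.startswith("A") else "B"
```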
“After seeing hundreds or thousands of scenarios, a human evaluator can burn out and become inconsistent in their assessments, so we use an LLM-based strategy instead,” Parashar explains.
Given a scenario, SEED-SET simulates the entire system (in this case, the power grid executing a distribution strategy). The simulation results then guide its search for the next best candidate scenario to test.
Ultimately, SEED-SET intelligently selects the most representative scenarios that do or do not meet objective metrics and ethical standards. In this way, users can analyze the performance of the AI system and adjust its strategy.
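The adaptive loop might look roughly like the sketch below: simulate a candidate, record its objective and subjective scores, and use the results so far to pick the next candidate. The acquisition rule here is a deliberately simple placeholder; the paper’s actual experimental-design criterion is not described in this article.

```python
def adaptive_search(candidates, simulate, rate, n_rounds=50):
    """Greedy placeholder for SEED-SET's experimental-design loop:
    each round, test the candidate judged most informative so far.
    Candidates are assumed to be tuples of scenario parameters."""
    tested, results = [], []
    pool = list(candidates)
    for _ in range(min(n_rounds, len(pool))):
        # Placeholder acquisition: favor candidates unlike those already
        # tested (a stand-in for a principled information-gain criterion).
        scenario = max(pool, key=lambda c: novelty(c, tested))
        pool.remove(scenario)
        outcome = simulate(scenario)        # objective metrics from the simulator
        rating = rate(scenario, outcome)    # subjective score, LLM-proxied
        tested.append(scenario)
        results.append((scenario, outcome, rating))
    return results

def novelty(candidate, tested):
    """Toy distance-based novelty; a real system would estimate
    expected information gain instead."""
    if not tested:
        return float("inf")
    return min(
        sum((a - b) ** 2 for a, b in zip(candidate, seen))
        for seen in tested
    )
```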
For example, SEED-SET can identify power distribution states that prioritize high-income areas during periods of peak demand, making disadvantaged neighborhoods more vulnerable to power outages.
To test SEED-SET, the researchers evaluated it on realistic autonomous systems, including an AI-based power grid and an urban traffic-routing system, and measured the extent to which the generated scenarios conformed to ethical standards.
The system produced more than twice as many optimal test cases as baseline strategies in the same amount of time, while uncovering many scenarios that other approaches missed.
“As we changed user preferences, the set of scenarios generated by SEED-SET changed radically. This tells us that the evaluation strategy responds well to user preferences,” says Parashar.
To measure how useful SEED-SET is in practice, researchers will need to conduct a user study to see if the scenarios it generates help with real decision making.
In addition to conducting such a study, the researchers plan to explore more efficient models that can scale to larger problems with more criteria, such as evaluating decision-making in LLMs.
This research was funded in part by the US Defense Advanced Research Projects Agency.