A new analysis suggests that current benchmarks and assessments may not be sufficient in the face of growing demand for AI safety and accountability.
Generative AI models, which can analyze and produce text, images, music, videos, and other kinds of output, have a propensity to make mistakes and behave unpredictably, and that has raised concerns about their use. Organizations ranging from large tech companies to public sector agencies are now proposing new benchmarks to evaluate the safety of these models.
Late last year, the startup Scale AI established a lab to assess how well models conform to safety regulations. This month, the U.K. AI Safety Institute and NIST released tools for evaluating model risk.
However, these methods and procedures for probing models may not be sufficient.
The Ada Lovelace Institute (ALI), a nonprofit AI research organization based in the United Kingdom, reviewed prior research on AI safety evaluations and interviewed experts from university labs, civil society organizations, and vendors building models. The co-authors found that while existing evaluations have their uses, they are not comprehensive, they are easy to game, and they don’t necessarily predict how models will behave in real-world situations.
Elliot Jones, a senior researcher at the ALI and co-author of the paper, said, “Whether it’s a smartphone, a prescription drug, or a car, we expect the products we use to be safe and reliable; in these sectors, products are rigorously tested to ensure they are safe before they are deployed.” “Our research sought to evaluate the effectiveness of existing methods for evaluating AI safety, evaluate the current applications of evaluations, and investigate the potential applications of evaluations as a tool for regulators and policymakers.”
Red teaming and benchmarks
The study’s co-authors first surveyed the academic literature to establish an overview of the risks and harms models pose today and the state of existing AI model evaluations. They then interviewed 16 experts, including four employees at unnamed tech companies developing generative AI systems.
The study found sharp disagreement within the AI industry over the best set of methods and taxonomy for evaluating models.
Some evaluations only tested how models aligned with benchmarks in the lab, not how those models might affect real-world users. Others drew on tests developed for research purposes rather than for evaluating production models, yet vendors insisted on using them in production.
According to the experts cited in the paper, it can be difficult to extrapolate a model’s performance from benchmark results, and it’s not even obvious if benchmarks can demonstrate a model’s specific capabilities. For instance, just because a model does well on a state bar exam does not guarantee that it will be able to handle more complex legal problems.
The experts also raised the problem of data contamination: when a model has been trained on the same data it is later evaluated on, benchmark results tend to overstate its performance. Benchmarks, the experts said, are often chosen by companies for convenience and ease of use rather than because they are the best tools for evaluation.
Mahi Hardalupas, a researcher at the ALI and co-author of the study, said that “benchmarks risk being manipulated by developers who may train models on the same data set that will be used to assess the model, equivalent to seeing the exam paper before the exam, or by strategically choosing which evaluations to use.” It also matters which version of a model is being assessed: small updates can cause unpredictable changes in behavior and may override built-in safety features.
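To make the contamination problem concrete, here is a minimal sketch of the kind of overlap check an evaluator might run. It is not from the ALI study: the in-memory benchmark prompts, the training-text sample, the function names, and the n-gram size are all assumptions invented for this example. The sketch flags benchmark items whose word n-grams also appear in the training sample, a rough signal that the model may already have “seen the exam paper.”

```python
# Illustrative sketch only: a crude n-gram overlap check between benchmark
# prompts and a sample of training text. Data, names, and the n-gram size
# are hypothetical, not taken from the ALI study.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(benchmark_items: list[str],
                      training_sample: list[str],
                      n: int = 8) -> list[str]:
    """Return benchmark items sharing at least one n-gram with the training sample."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_sample:
        train_grams |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & train_grams]

if __name__ == "__main__":
    benchmark = [
        "What is the capital of France? Answer with the name of the city only.",
        "Summarize the plot of a novel you have never read in two sentences.",
    ]
    training_sample = [
        "Quiz archive: What is the capital of France? Answer with the name of the city only. Paris.",
    ]
    flagged = flag_contaminated(benchmark, training_sample, n=8)
    print(f"{len(flagged)} of {len(benchmark)} benchmark items overlap with the training sample")
```

Real contamination checks are far more involved, with normalization, fuzzy matching, and web-scale corpora, but even this toy version shows why training on evaluation data inflates scores.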
The ALI study also found problems with “red-teaming,” the practice of tasking individuals or groups with “attacking” a model to find weaknesses and vulnerabilities. Many companies, including the AI firms OpenAI and Anthropic, use red-teaming to evaluate models, but there are few agreed-upon standards for the practice, which makes it difficult to judge the effectiveness of any given effort.
Experts told the study’s co-authors that it can be hard to find people with the skills and experience needed to red-team a model, and that the manual nature of red-teaming makes it costly and labor-intensive, creating barriers for smaller firms without the necessary resources.
Potential solutions
The main reasons AI evaluations haven’t improved include pressure to release models faster and a reluctance to conduct tests that could surface problems before a release.
According to Jones, a foundation model developer the researchers spoke with said there is more pressure on companies to release models quickly, which makes it harder to push back and take evaluations seriously. “Big AI laboratories are releasing models faster than society or they themselves can guarantee the models are trustworthy and safe.”
One interviewee in the ALI study described evaluating models for safety as an “intractable” problem. What chance, then, do the industry and regulators have?
Hardalupas believes there is a way forward, but it will require greater involvement from public sector organizations.
“Policymakers and regulators need to be very clear about what they want out of evaluations,” he stated. “The evaluation community must simultaneously be open and honest about the present constraints and promise of evaluations.”
According to Hardalupas, governments should mandate more public participation in the development of evaluations and put policies in place to support an “ecosystem” of third-party testing, including programs to ensure regular access to the necessary models and data sets.
According to Jones, it might be necessary to create “context-specific” evaluations that go beyond just assessing a model’s response to a prompt. These evaluations should instead consider the kinds of users that a model might affect (such as individuals from a specific background, gender, or ethnicity) as well as the ways that model attacks could circumvent security measures.
“To develop more robust and repeatable evaluations that are based on an understanding of how an AI model operates, investment in the underlying science of evaluations will be required,” she continued.
However, it may never be possible to prove that a model is safe.
“Safety is not a property of models,” Hardalupas said, echoing other researchers. Determining whether a model is “safe” requires understanding the contexts in which it is used, who it is sold to or made available to, and whether the safeguards in place are adequate and robust enough to reduce those risks. Evaluations of a foundation model can serve an exploratory purpose, identifying potential risks, but they cannot guarantee a model is safe, let alone “perfectly safe.” Many of the interviewees agreed that evaluations can only suggest a model is unsafe; they cannot establish that it is safe.