Could machine learning fuel a reproducibility crisis in science?

A CT scan of a tumor in human lungs. Researchers are experimenting with artificial intelligence algorithms that can detect early signs of the disease. Credit: KH Fung/SPL

From biomedicine to political science, researchers are increasingly using machine learning as a tool to make predictions based on patterns in their data. But the claims in many of these studies are likely exaggerated, according to a pair of researchers at Princeton University in New Jersey. They want to sound an alarm about what they call a “looming reproducibility crisis” in science based on machine learning.

Machine learning is sold as a tool that researchers can pick up in a few hours and use for themselves, and many are heeding that advice, says Sayash Kapoor, a machine-learning researcher at Princeton. “But you wouldn’t expect a chemist to be able to learn how to run a lab using an online course,” he says. And few scientists realize that the problems they encounter when applying artificial intelligence (AI) algorithms are common to other fields, says Kapoor, who co-authored a preprint on the “crisis”1. Peer reviewers don’t have time to scrutinize these models, so academia currently lacks mechanisms to weed out irreproducible articles, he says. Kapoor and his co-author Arvind Narayanan created guidelines for scientists to avoid such pitfalls, including an explicit checklist to submit with each article.

What is reproducibility?

Kapoor and Narayanan’s definition of reproducibility is broad. Other teams should be able to replicate the results of a model, given the full details of the data, code, and conditions (often called computational reproducibility), something that is already a concern for machine-learning scientists. The pair also define a model as irreproducible when researchers make errors in data analysis that mean the model is not as predictive as it claims.

Judging such errors is subjective and often requires in-depth knowledge of the field in which machine learning is being applied. Some researchers whose work has been criticized by the team disagree that their papers are flawed, or say Kapoor’s claims are too strong. In the social sciences, for example, researchers have developed machine-learning models that aim to predict when a country is likely to slide into civil war. Kapoor and Narayanan state that, once errors are corrected, these models perform no better than standard statistical techniques. But David Muchlinski, a political scientist at the Georgia Institute of Technology in Atlanta, whose article2 was vetted by the pair, says the field of conflict prediction has been unfairly maligned and that follow-up studies support his work.

Still, the team’s rallying cry has struck a chord. More than 1,200 people have signed up for what was initially a small online workshop on reproducibility on July 28, organized by Kapoor and colleagues and designed to find and spread solutions. “Unless we do something like this, every field will continue to run into these problems over and over again,” he says.

Over-optimism about the powers of machine-learning models could prove detrimental when algorithms are applied to areas such as health and justice, says Momin Malik, a data scientist at the Mayo Clinic in Rochester, Minnesota, who will speak at the workshop. Unless the crisis is addressed, machine learning’s reputation could suffer, he says. “I’m somewhat surprised that there hasn’t been a drop in the legitimacy of machine learning. But I think it could come very soon.”

Machine learning problems

Kapoor and Narayanan say similar pitfalls occur in applying machine learning to multiple sciences. The pair analyzed 20 reviews across 17 research fields and counted 329 research papers whose results could not be fully replicated due to problems in the way machine learning was applied1.

Narayanan himself is not immune: a 2015 article on computer security that he co-authored3 is among the 329. “It really is an issue that needs to be tackled collectively by this entire community,” says Kapoor.

The failures are not the fault of any individual researcher, he adds. Instead, a combination of hype around AI and inadequate checks and balances is to blame. The most prominent problem highlighted by Kapoor and Narayanan is ‘data leakage’, when the data set that a model learns from includes information that it is later evaluated on. If the two are not completely separated, the model has effectively already seen the answers, and its predictions look much better than they really are. The team has identified eight main types of data leakage that researchers can be vigilant against.
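
As a minimal illustration (not drawn from Kapoor and Narayanan’s preprint), the scikit-learn sketch below shows one common form of leakage: fitting a feature-selection step on the full data set before splitting it, so that information about the test labels leaks into training. On pure noise, the leaky version typically reports accuracy well above chance, whereas the properly separated pipeline stays near 50%.

```python
# A minimal, synthetic sketch of data leakage (illustrative only, not from the
# preprint): selecting features on the full data set before the train/test split
# lets information about the test labels leak into training.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # pure-noise features
y = rng.integers(0, 2, size=200)   # random labels: honest accuracy should be ~50%

# Leaky: the feature selector sees every label, including the test labels.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("leaky accuracy:", leaky.score(X_te, y_te))   # typically well above chance

# Correct: split first, then fit selection and model on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(SelectKBest(f_classif, k=20),
                      LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("clean accuracy:", clean.score(X_te, y_te))   # close to 0.5
```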

Some data leaks are subtle. For example, temporal leakage occurs when the training data include points from later in time than the test data, which is a problem because the future depends on the past. As an example, Malik points to a 2011 article4 that claimed a model analyzing the mood of Twitter users could predict the closing value of the stock market with an accuracy of 87.6%. But because the team tested the model’s predictive power using data from an earlier time period than part of its training set, the algorithm was effectively able to see into the future, he says.
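
To see how temporal leakage can inflate results, the hedged sketch below uses a synthetic random-walk series (not the Twitter data): shuffling the data before splitting lets the model train on days that come after the days it is tested on, and the reported error usually looks far better than it does under an honest chronological split.

```python
# A synthetic sketch of temporal leakage (illustrative only): a shuffled split
# lets a forecasting model train on data from *after* its test points.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
price = 100 + np.cumsum(rng.normal(size=1000))              # toy random-walk "price"
lags = np.column_stack([price[i:i - 5] for i in range(5)])  # features: previous 5 values
target = price[5:]                                          # target: the next value

def evaluate(name, X_tr, X_te, y_tr, y_te):
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    print(name, mean_absolute_error(y_te, model.predict(X_te)))

# Leaky: shuffling mixes past and future, so the model effectively sees the future.
evaluate("shuffled-split MAE:     ",
         *train_test_split(lags, target, shuffle=True, random_state=0))

# Honest: train strictly on the past, test strictly on the future.
cut = int(0.75 * len(target))
evaluate("chronological-split MAE:",
         lags[:cut], lags[cut:], target[:cut], target[cut:])
```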

The broader problems include training models on data sets that are narrower than the population they are ultimately intended to reflect, says Malik. For example, an AI that detects pneumonia in chest X-rays but was trained only on older people might be less accurate for younger people. Another problem is that algorithms often end up relying on shortcuts that don’t always hold, says Jessica Hullman, a computer scientist at Northwestern University in Evanston, Illinois, who will speak at the workshop. For example, a computer-vision algorithm could learn to recognize a cow from the grassy background present in most images of cows, so it would fail when it encountered an image of the animal on a mountain or a beach.
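
The entirely synthetic sketch below stands in for the cow-on-grass example: a spurious “background” feature tracks the label almost perfectly in the training data, so the model leans on it, and its apparent accuracy collapses once that correlation breaks.

```python
# A synthetic sketch of shortcut learning (illustrative only): a spurious
# "background" feature tracks the label in training but not at deployment.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shortcut_matches_label):
    y = rng.integers(0, 2, size=n)
    signal = y + rng.normal(scale=2.0, size=n)              # weak genuine signal ("cow shape")
    background = y if shortcut_matches_label else 1 - y
    shortcut = background + rng.normal(scale=0.1, size=n)   # strong spurious cue ("grass")
    return np.column_stack([signal, shortcut]), y

X_train, y_train = make_data(5000, shortcut_matches_label=True)
model = LogisticRegression().fit(X_train, y_train)

# In distribution, the shortcut still works and accuracy looks excellent...
print("in-distribution accuracy:", model.score(*make_data(5000, True)))
# ...but when the background no longer matches the label, the model collapses.
print("shifted-test accuracy:   ", model.score(*make_data(5000, False)))
```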

The high accuracy of predictions in tests often misleads people into thinking the models capture the “true structure of the problem” in a human-like way, she says. The situation is similar to the replication crisis in psychology, where people put too much trust in statistical methods, she adds.

Hype about the capabilities of machine learning has made it too easy for researchers to accept its results, says Kapoor. The word “prediction” itself is problematic, Malik says, since most predictions are tested retrospectively and have nothing to do with foretelling the future.

Fixing data leakage

Kapoor and Narayanan’s solution to address data leakage is for researchers to include with their manuscripts evidence that their models do not have each of the eight types of leakage. The authors suggest a template for such documentation, which they call “model information” sheets.

In the past three years, biomedicine has come a long way with a similar approach, says Xiao Liu, a clinical ophthalmologist at the University of Birmingham, UK, who has helped to create reporting guidelines for studies involving AI, for example in screening or diagnosis. In 2019, Liu and colleagues found that only 5% of more than 20,000 articles using AI for medical imaging were described in enough detail to discern whether they would work in a clinical setting5. The guidelines don’t improve anyone’s models directly, but they “make it really obvious who are the people who have done well, and perhaps the people who haven’t done well,” says Liu, which is a resource regulators can tap into.

Collaboration can also help, says Malik. He suggests that studies involve both specialists in the relevant discipline and researchers in machine learning, statistics, and survey sampling.

Fields in which machine learning is used to find leads for follow-up, such as drug discovery, are likely to benefit greatly from the technology, Kapoor says. But other areas will need more work to show that it is useful, he adds. Although machine learning is still relatively new in many fields, researchers need to avoid the kind of crisis of confidence that followed the replication crisis in psychology a decade ago, he says. “The longer we delay it, the bigger the problem.”