Editor’s Note: This is part of a series of dispatches from the Knight Science Journalism Program’s 2021-22 Fellows.
Since I took a fateful Lyft ride in late 2017, during which the driver told me that he had been interviewed by a robot, I have been covering artificial intelligence in the world of work. I am still fascinated by this technology, and as a Knight Science Journalism Fellow this year, I’ve been thinking about its implications for journalists and for society more broadly.
AI has been marketed with the promise that it will revolutionize healthcare, hiring, financial decisions, criminal justice, and many other fields. But the reality is much more complicated: Researchers and journalists have discovered that some AI tools do not work as promised and may have caused more harm than good.
Other reporters have asked me many times how journalists can hold artificial intelligence accountable; many feel overwhelmed by AI, which often seems so complex that humans can barely understand it. I have wrestled with this question for years and have come up with some ideas:
- Check the data: Companies might claim in their marketing materials that their algorithm is, say, 90 percent accurate. That sounds like a high accuracy rate, but how did they arrive at that number? How did they validate the algorithm? I have often seen companies build an algorithm from one large dataset. Developers might use 80 percent of the data to train the algorithm and save the other 20 percent as a “holdout sample,” which they use to test how well the newly built algorithm works. If a company validates its algorithm only against its own holdout sample, the accuracy will naturally look high, because the test data comes from the same dataset the algorithm was built on (see the first sketch after this list). If a company did only that, I would ask about further validation: Have they run studies using “data from the wild,” or had a third party test the algorithm?
- Check the dataset: It’s also important to understand how the dataset the company used to build the algorithm was assembled. Where did the data come from? What is in the dataset? Which demographics are represented? If a dataset is built, for example, only by interviewing college students, will it be representative of older people? (Spoiler alert: probably not.)
- Create edge cases: Test the technology yourself when that is possible. I once tested an AI tool that was marketed as being able to predict someone’s English competency during job interviews. When I talked to other vendors, they told me that if someone has a speech impairment, or if the tool has trouble detecting a person’s voice, the software recognizes the problem and displays a message. That’s what I was expecting. Here is what I did: I first completed the interview answering all the questions in English and scored an 8.5 out of 9; my English level was rated “competent.” I then tested an “edge case,” answering all of the questions in German, my native language. I still scored a 6 out of 9 on my English ability, and my English level was again rated “competent,” even though I had not said a single word in English. It led to a fascinating conversation with the developers of the tool.
- Benchmark the algorithm against traditional tests: Many AI-based hiring tools are marketed as digital versions of traditional assessments, such as personality tests. In this story for the Wall Street Journal, my colleagues and I compared the results of traditional personality tests with those of AI tools that try to predict a person’s personality from information pulled from their social media accounts. Half of the AI results were completely different from the results of the traditional assessments (the second sketch after this list shows the basic idea).
- Bring your own dataset and run it against the algorithm: We did that when I reported a story for the Wall Street Journal on a company in Israel that claims its tool can identify pedophiles and terrorists based on photos alone. To test the tool, we brought our own dataset of photos of convicted terrorists. We learned that the tool doesn’t work on women, and in our batch it identified a photo of a 9/11 hijacker as a terrorist but not a photo of a white convicted terrorist.
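To make the first point concrete, here is a minimal Python sketch of why a holdout sample drawn from the same dataset can flatter an algorithm. Everything in it is made up for illustration: a toy model, synthetic data, and a crude “shift” standing in for how the real world can differ from the data a company collected. The holdout shares the training data’s quirks; “data from the wild” may not.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# A made-up dataset standing in for whatever the company collected:
# two features and a simple underlying rule.
X = rng.normal(size=(5000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The common 80/20 split: train on 80 percent, keep 20 percent as the "holdout sample."
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the holdout looks excellent -- same source, same quirks.
print("holdout accuracy:", accuracy_score(y_holdout, model.predict(X_holdout)))

# "Data from the wild": the population and the feature-outcome relationship
# both look a bit different (a crude stand-in for new demographics or contexts).
X_wild = rng.normal(size=(5000, 2))
X_wild[:, 1] += 1.5
y_wild = (X_wild[:, 0] + 0.2 * X_wild[:, 1] > 0).astype(int)

# The same model now looks far less impressive.
print("real-world accuracy:", accuracy_score(y_wild, model.predict(X_wild)))
```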
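And here is a similarly hedged sketch of the benchmarking idea: compare the scores a traditional personality questionnaire produces with the scores an AI tool infers for the same people. The numbers below are randomly generated just to show the mechanics; the actual Wall Street Journal analysis was more involved.

```python
import numpy as np

rng = np.random.default_rng(1)
traits = ["openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism"]
n_people = 30

# Scores on a 1-to-5 scale from a standard questionnaire (made-up data).
traditional = rng.uniform(1, 5, size=(n_people, len(traits)))
# Scores an AI tool inferred for the same people from their social media (also made up).
ai_predicted = rng.uniform(1, 5, size=(n_people, len(traits)))

# One simple benchmark: the per-trait correlation between the two methods.
for i, trait in enumerate(traits):
    r = np.corrcoef(traditional[:, i], ai_predicted[:, i])[0, 1]
    print(f"{trait:>17}: correlation {r:+.2f}")

# Another: for how many people do the two methods disagree by more than a point
# on most traits?
big_gaps = np.abs(traditional - ai_predicted) > 1.0
share = (big_gaps.mean(axis=1) > 0.5).mean()
print(f"{share:.0%} of people look substantially different to the two methods")
```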
Over the coming year, with support from the Pulitzer Center, I am working with other dedicated colleagues on building a more comprehensive guide explaining different methods to hold artificial intelligence accountable.
Hilke Schellmann is an Emmy Award-winning journalism professor at New York University and a freelance reporter holding artificial intelligence accountable. Her work has been published in the Wall Street Journal, The New York Times, The Guardian, MIT Technology Review, and NPR, among others. She is currently writing a book on artificial intelligence and the future of work for Hachette.