Generative AI is Software and Software needs testing
Generative AI, Machine Learning Models, Deep learning models are all software. As any software, it needs to be tested to make sure that it’s doing the right things and doing them in the right way. Technology leaders are concerned about Large Language Models (LLMs) non-determinism and hallucinations, plausible but incorrect suggestions. Leaders are also concerned about how GenAI outputs might or not be compliant to their organizations’ culture, ethics, policies and user experience. An unrealistic thought is to just put “humans in the loop” to control the outcomes, but that is an impractical and costly solution, especially when the AI is customer facing.
However, testing Generative AI is not for the faint-hearted
Testing Generative AI is not a well known practice, perhaps not known at all. GenAI is so new that there is not enough consolidated experience in the market on good and effective testing practices, GenAI infused solutions haven’t deployed at scale in production, to pose enough real risks to seriously worry about this enough. I’ve already highlighted how critical testing is when Generative AI is used to assist developers and software development teams, a Generative AI use case that in Forrester we’ve named TuringBots: AI and GenAI enabled development assistants.
Generative AI is more complex to test than all previous types of software we’ve been testing. We’ve explored how to test AI Infused applications before Generative AI through ChatGPT came to the market with two Forrester reports. We recently have updated those two documents with everything we learned so far about experiments and ideas about testing Generative AI. Check out “It’s time to seriously test your AI” (part1 and part2). We are well aware of the additional complexity Generative AI brings, with hallucinations and non determinism, and is why we have planned a dedicated research in 2024 on testing of Generative AI. So stay tuned.
Here are some starting points
Here are some initial thoughts on how to address testing of AI and Generative AI infused applications.
Leverage test benchmarks, adversarial testing, and test case prompting for LLMs.
Testing the performance of LLMs and genAI is hard because it means testing various natural language properties and expressions syntactically and semantically. Diverse free and open source benchmarks and evaluation frameworks exist to test different aspects of language like safety, ethics, and performance. For example, the OpenAI Moderation API is designed to provide a safer environment for users to interact with OpenAI APIs. It incorporates a moderation layer to filter out harmful or unsafe content, ensuring ethical use of LLMs. Benchmarking should be coupled with human testing. One approach, similar to traditional software testing, is to specify test properties in prompts that any correct output should comply with. Manual adversarial testing where possible can evolve to automated through generative adversarial networks as a third option
.
Let’s work together to improve Generative AI outcomes
If you or your team are gaining good experiences and practices, or learning about new tools that are effective in helping test Generative AI, please reach out to me at dlogiudice@forrester.com. As a Forrester client, If instead you are starting the journey of learning and planning on how to test AI infused apps, or need some insights, please do schedule an inquiry or guidance session at inquiry@forrester.com with me. We’re off to a ground breaking new technology, but if we can’t test it, it will be useless.