What IMDA’s LLM Testing Starter Kit Teaches Us At Spritle About AI Assurance

What Spritle's IMDA LLM testing toolkit teaches us about AI assurance

I recently had the opportunity to review IMDA Starter Toolkit for Testing LLM-Based Applications for Safety and Reliability. As someone who has spent over 14 years in QA, I was curious to see how well-established testing principles are being adapted to address the unique challenges presented by large language models (LLMs).

Break the black boxBreak the black box

What I expected was a very technical guide focusing on AI models and algorithms. What I found instead was a framework that looked surprisingly familiar from a QA perspective. While technology may evolve rapidly, the primary goal remains unchanged: to build confidence that the system is behaving safely, reliably, and as intended.

One statement from the document captures this idea perfectly:

Testing and assurance play a critical role in a trusted AI ecosystem“.

This message is woven throughout the guide and, in many ways, reflects how the role of high-quality professionals has evolved today.

Quality risks have changed, the need for testing has not

One of the first things that stood out was the way the document categorizes the most common risks associated with LLM-based applications:

· Hallucinations and inaccuracy

· Bias in decision making

· Unwanted content

· Data leakage

· Exposure to adversarial claims

At first glance, these challenges may seem quite new. However, as I read the explanations and examples, I found myself comparing the risks that QA teams had always managed.

We have always worked to prevent incorrect system behavior. Today, artificial intelligence offers the possibility of hallucinations and fabricated responses.

We have always taken into account security and misuse scenarios. Now, instant injections and adversarial attacks are part of the threat landscape.

We have always cared about compliance and privacy. Data leakage through AI-generated responses is simply a recent extension of this concern.

Terminology may change, but the system of identifying, assessing and mitigating risks remains at the heart of quality assurance.

A good testing strategy begins long before the test is executed

Another aspect that I liked was the emphasis on preparation before starting the exam.

The guide spends significant time discussing how organizations should identify relevant risks, define testing objectives, select representative data sets, and establish meaningful thresholds before conducting assessments.

Amnesty International evaluations: a guide for organisationsAmnesty International evaluations: a guide for organisations

As quality assurance professionals, we know that the success of a test is often determined long before implementation begins. Poorly defined goals, weak test data, or unclear acceptance criteria can undermine even the most rigorous testing cycles.

I particularly liked the properties shown for the “good” test data set. The document emphasizes that datasets should represent the intent of the application, cover real-world user interactions, and include sufficient breadth and depth to uncover meaningful issues.

This reflects the same principles we apply when designing effective test coverage for traditional applications.

Another valuable takeaway is the recommendation to set thresholds before testing begins. The document highlights the importance of avoiding “moving the goalposts” after results are known – a principle that resonates strongly with anyone who has worked in quality management.

When accuracy alone doesn’t tell the whole story

The sections discussing metrics and raters were particularly interesting because they challenge the common assumption that accuracy alone determines quality.

The guide makes clear that metrics should align not only with technical objectives but also with business and policy objectives. Depending on the use case, organizations may prioritize accuracy, fairness, safety, precision, recall, or other measures.

This is an important reminder that quality is context-specific.

I also appreciate the balanced discussion about residents. The document compares rule-based methods with LLM-as-a-judge methods and explicitly acknowledges the strengths and limitations of each.

One observation I strongly agree with is the recommendation that automated assessments should complement, not replace, human judgment, especially in high-risk environments.

As testing professionals, we’ve always relied on tools to improve efficiency, but accountability ultimately remains with the people. The same principle applies here.

Testing the Unknown: Why the Red Team is Important

Of all the sections, the discussion about the Red Team was perhaps the most thought-provoking.

The document makes a clear distinction between benchmarking and red teaming. Benchmarking helps validate known scenarios, while red teaming helps uncover blind spots, edge cases, and unexpected behaviors.

As I read this section, I couldn’t help but think about exploratory testing.

Traditional test cases confirm the validity of the expected results. Exploratory testing helps uncover issues we didn’t think to document.

The red team appears to play a similar role to AI systems.

The guide discusses adversarial prompts, multi-turn interactions, social engineering techniques, and other methods that can expose vulnerabilities in AI behavior. It also emphasizes the importance of engaging a diverse group of participants and domain experts to increase the likelihood of uncovering risks that may remain hidden.

The practical examples provided in the document helped reinforce why structured red teaming has become an important complement to traditional testing approaches.

Red Team Amnesty InternationalRed Team Amnesty International

From preventing defects to verifying trustworthiness

Perhaps the biggest takeaway from the document is this AI testing is not just another testing discipline.

Many of the activities described—risk prioritization, threshold setting, management decisions, human oversight, and interpretation of results—extend beyond engineering teams.

It requires collaboration between business stakeholders, product owners, compliance teams, security specialists, and quality professionals.

This is where I see an interesting opportunity for experienced QA practitioners.

For years, our focus has been on preventing defects and validating requirements. Increasingly, we are also being asked to help organizations understand risks, build trust, and determine whether systems can be trusted.

This shift looks less like a change in tools and more like an evolution of the profession itself.

Final Thoughts: Beyond testing, towards ensuring AI

One of the strongest messages I took away from IMDA’s Starter Kit is that ensuring AI is quickly becoming a business capability, not just a testing activity.

As organizations adopt LLM-enabled applications, success will not be measured solely by feature delivery or model performance. It will increasingly depend on how confidently organizations can explain, govern, monitor and trust AI-driven decisions.

in SprittleWe believe this transformation requires QA teams to evolve into strategic partners in AI adoption. The future of quality lies not only in validating outputs, but also in understanding system behavior, identifying risks early, and helping to create guardrails that enable responsible innovation.

Frameworks like IMDA provide an excellent foundation, but their real value comes from operationalizing them within enterprise environments. This means combining engineering discipline, domain expertise, governance practices, and continuous assurance into a repeatable approach.

As AI systems become more capable, the question organizations will ask is not just “Does it work?

it will be:”Can we trust it widely?

Helping to answer this question is where we see the future of quality engineering – and the direction in which we are actively investing at Spritle.

Leave a Reply