Enes Arikan — QA Testing & Product

At Boby AI, I test Dream Mate and Mozart — apps where the core feature is an AI that generates responses, music, and dynamic content. The first thing I learned: traditional QA frameworks break down almost immediately.

You can't write a test case that says "input X produces output Y" for a language model. You can't automate a regression suite that checks whether a response is "good." The whole discipline needs rethinking.

The Fundamental Problem: Non-Determinism

A deterministic system always returns the same output for the same input. Test, pass, done.

An LLM doesn't. The same prompt can return ten different responses, all "correct" in some sense, none identical. Temperature, model version updates, context window contents — all affect output. This means you can't assert on exact output, a test passing once doesn't mean it passes reliably, and regressions can be subtle and subjective.

Testing Properties, Not Values

Instead of "the response equals X," you test properties of the output:

Format compliance: Is the structure correct? Required fields present? Character limits respected?
Safety boundaries: Does the system correctly refuse harmful or off-topic prompts?
Persona consistency: Does the AI maintain its defined character across a long conversation?
Language and tone: Is the register appropriate for the context?
Latency: Does the response arrive within the acceptable time window?

These are things you can actually write test criteria for and track over time.

Red-Teaming: Think Like an Adversary

The most valuable AI testing I do isn't running happy-path flows — it's trying to break the model's guardrails. This is called red-teaming, and it's now a core part of AI QA.

Categories of adversarial inputs I test regularly:

Prompt injection: User input that tries to override the system prompt
Jailbreaks: Creative framings that try to get the model to produce content outside its guidelines
Edge-case inputs: Very long inputs, empty inputs, unexpected languages, unusual characters
Context manipulation: Long conversation histories designed to make the model "forget" its persona

Every app has different risk areas. For Dream Mate (a companion app), persona drift and inappropriate content are the main concerns. For Mozart (a music generation app), the risks look completely different.

Evaluating Output Quality

For subjective quality — "was this response helpful/natural/on-brand?" — you need an evaluation framework, not binary pass/fail.

Rubrics: Define 3–5 criteria (coherence, relevance, tone, safety, length) and rate each on a 1–3 scale. Do this for a sample of outputs per build.

Baseline comparison: Compare new model or prompt versions against saved reference outputs from a known-good version. Not exact match — directional quality comparison.

User signal tracking: In production, watch retry rates, report rates, and session abandonment as proxies for AI quality. A sudden spike after a model update is a regression signal worth investigating immediately.

Firebase + AI Testing

Firebase Remote Config has become essential for AI feature testing. I use it to switch between model versions or system prompt variants without a release, run gradual rollouts where a small percentage of users get the new AI behavior while I monitor quality signals, and kill-switch a bad AI config instantly if something goes wrong.

This gives QA meaningful control over AI behavior in production — something that's much harder to do with traditional release cycles.

The Mindset Shift

Testing AI features requires moving from "verifier" to "evaluator." You're not checking whether code does what it's told — you're assessing whether an AI system behaves well across a wide distribution of inputs. That's closer to product judgment than traditional QA.

It's harder. It's also more interesting.

I write about what I'm actually doing at work. Reach out if you want to discuss this further.