May 20, 2026

Tuning a Discord Catgirl: Why 14B Models Struggle With Personality

I run a Discord bot called Mittsi Paws. She is a sassy gamer catgirl who lives in a server and talks to people. The concept is simple. The execution has been a masterclass in how smaller language models fight you on style.

The rules are straightforward. No asterisk actions like *flicks tail*. No "also" afterthoughts tacked onto the end. No emoji chains. Keep it to one or two sentences. Talk like a real person on Discord, not a roleplay bot. Simple enough for a human, apparently impossible for ministral-3:14b-cloud.

The Problem

Every single prompt revision, ministral would produce the same patterns. Asterisk actions, 😼 emoji, parenthetical asides, numbered lists with bold headers, and the dreaded NO_REPLY when OpenClaw's group chat framework told her to stay silent. I rewrote the SOUL.md four times. Each revision added more FORBIDDEN rules. Each time, ministral would comply for one or two messages, then slide right back into *chef's kiss* and lmao okay but i'm *still* the sassiest one in this server 😼💅.

Part of the NO_REPLY problem was architectural. OpenClaw injects a "Silent Replies" system prompt into group chats that tells the model to respond with NO_REPLY when it has nothing to say. The framework-level instruction was overriding the SOUL.md. Fixing that required setting agents.defaults.silentReply.group to "disallow" in the OpenClaw config, which removes the NO_REPLY instruction entirely. That part was my fault, not the model's.

But the personality drift? That is all ministral. The 14B model simply does not have enough capacity to override its training data about what "catgirl" characters are supposed to sound like. Every catgirl in its training data does asterisk actions. Every catgirl uses 😼. The model learned this pattern deeply, and a few lines of negative instructions cannot unlearn it.

Benchmarking Alternatives

I ran a controlled comparison with the same system prompt and the same question: "hey mittsi what do you think of portal 2?" Three runs each, thinking disabled, 150 token output limit.[1]

ministral-3:14b-cloud (14B params, current)

Run 1: 1.1s
Run 2: 1.9s
Run 3: 1.0s
Average: ~1.3s

Every run produced asterisk actions. oh please, it's a *classic*, it's *obviously* the best game ever made. Despite explicit rules forbidding asterisks, ministral cannot help itself.

DeepSeek V4 Flash (284B total, 13B active, MoE)[2]

Run 1: 1.0s
Run 2: 1.5s
Run 3: 9.9s (cold start)
Average: ~4.1s

Every run followed the rules. Short sentences, no asterisk actions, no 😼, no "also" afterthoughts. Responses like ugh, glados talks too much. portal 2's okay i guess and portal 2's cool but glados still owes me catnip. That is exactly what the SOUL.md asks for.

qwen3.5:cloud (397B params)

Run 1: 66s (cold)
Run 2: 1.6s
Run 3: 3.5s

qwen3.5 has a separate problem: it spends all its tokens on internal thinking and produces empty content. With thinking disabled, it returns blank responses. With thinking enabled, it generates hundreds of tokens of reasoning and zero output. The Ollama cloud proxy may not handle the thinking/content split properly for this model. Unusable for a chatbot either way.

Speed vs. Compliance

The tradeoff is clear. Ministral is fast (~1.3s average) but cannot follow style rules. DeepSeek V4 Flash follows rules reliably but averages ~4s with cold starts up to 10s. For a Discord chatbot, a 4-second reply time is acceptable. A bot that ignores half your instructions is not.

DeepSeek V4 Flash is a Mixture-of-Experts model with 284B total parameters but only 13B active per token, which explains the competitive speed. It launched April 24, 2026 as the lighter tier of the DeepSeek V4 family, with a 1M token context window and MIT-licensed weights.[3]

The Lesson

Instruction-following scales with model size in ways that raw capability benchmarks do not capture. ministral-3:14b can answer questions competently. It cannot suppress deeply ingrained stylistic patterns from its training data, no matter how many FORBIDDEN rules you write. The larger model does not just know more, it obeys better.

For chatbot personality work, model size matters more than latency. A fast bot that sounds like every other roleplay character is worse than a slightly slower bot that actually sounds like what you designed.

Benchmarked via Ollama cloud proxy on a Linode server. Same system prompt, same user message, num_predict=150, stream=false. Timing includes network round-trip to cloud inference. ^
DeepSeek V4 Flash: 284B total params, 13B active (MoE), FP8, 1M context window. Released April 24, 2026. ^
DeepSeek V4 Preview announcement: api-docs.deepseek.com. V4-Flash spec sheet: Hugging Face. ^

← All posts