A/B Test Critique — Voice-mode ChatGPT conversation
A second-opinion read on a voice-mode ChatGPT conversation about running an uncensored LLM at home on a Mac Studio M3 Ultra. What ChatGPT got right, where it drifted, and what was missing.
If you read nothing else, read these. Color-coded by type of issue.
A local LLM is a pile of weights doing matrix multiplication on your laptop. It doesn't "phone home," doesn't leave a pattern, doesn't need VPN rotation or Tor — unless you explicitly wire it up to make network calls. ChatGPT validated the user's threat-model confusion instead of correcting it. This is the single biggest miss in the conversation.
Qwen3.5-397B-A17B (Feb 2026), Qwen3.6-35B-A3B (Apr 2026), Gemma 4 31B (Apr 2026), QwQ-32B, Qwen3-Coder-Next, GLM-4.7, Heretic abliteration tool — all verified to exist. ChatGPT's catalog was accurate. Where it slipped was in the certainty placed on percentage comparisons.
"A 14 or 32 billion parameter model would fit well" is optimistic. A 32B at Q4_K_M is ~18 GB just for weights, plus KV cache, plus OS. Start at 9B Q6/Q8 (~7–9 GB) or a 26B MoE at Q4 (~17 GB). 32B dense at Q4 is doable but uncomfortable.
"General chat 85–95%, Coding 85–95%" — these numbers don't come from any benchmark. The honest line: for everyday Q&A and writing the gap is small day-to-day; for agentic coding and current-events recall the frontier still leads by a real margin. Benchmark on your tasks.
The Mac Studio M3 Ultra has 819 GB/s unified-memory bandwidth — that's the headline number for local inference. MLX (Apple's native framework) plus LM Studio (the obvious GUI) is the modern stack. MoE models like Qwen3.6-35B-A3B exploit this hardware particularly well. ChatGPT mentioned Ollama once and llama.cpp not at all. For someone with this machine, this is the bigger story than which Qwen variant to download.
Calls ChatGPT got right and that I'd repeat verbatim:
| Claim | Why it's right |
|---|---|
| Uncensored ≠ smarter | Uncensored models change the refusal/compliance layer, not the underlying weights. The capability gap to closed frontier models is mostly upstream of safety tuning. |
| Most uncensoring is a small fine-tune or abliteration over the same base | Heretic (p-e-w's automated abliteration tool, ~1000 derived models on HF) and Dolphin-style fine-tunes are the two dominant paths today. KL divergence from the base is often very small. |
| The real multiplier is RAG, tools, memory | Underappreciated. A well-configured 35B with retrieval over your own files and tool use will outperform a raw 400B running slowly without those wrappers, for personal-assistant tasks. |
| Brave Tor mode ≠ system VPN | Technically right, even if explained muddily. They are separate layers and don't coordinate automatically. |
| For 256 GB Mac Studio, a 35B-class daily driver is sensible | Qwen3.6-35B-A3B is a particularly good choice because MoE models exploit Apple Silicon's unified-memory bandwidth well. |
| Local model = no automatic internet lookup | Correct — but ChatGPT then walked this back by validating the privacy paranoia. The first half was right; the rest got fuzzy. |
This is the biggest framing issue in the whole conversation. The user asked, in sequence:
The honest answer to all of these is the same one: a local LLM does not initiate network traffic. There is no "going out." There are no "patterns" to fingerprint. There is nothing to track.
The only ways a local model can phone home are:
ChatGPT's answers weren't technically wrong — it said "won't reach out by default" early on — but it then earnestly engaged with VPN rotation, Tor mode, uBlock Origin coordination, and "triggers from repeat patterns" as if those concerns applied to the model. They don't. The user walked away with a more complicated mental model than they needed.
The user said "block origin" — a voice-transcription of uBlock Origin (the Chrome/Firefox extension). ChatGPT just rolled with the wrong name. Small thing on its own, but it's symptomatic: ChatGPT didn't reconstruct what the user actually meant, and so didn't catch the deeper issue — uBlock Origin is a browser-layer protection. It has zero relationship to a local LLM process. It's not just unhelpful for LLM traffic — it's in a different threat domain entirely.
ChatGPT said: "a 14 or 32 billion parameter model would fit well." For 32 GB unified memory:
| Model size / quant | Approx. weights | Verdict |
|---|---|---|
| 9B at Q6_K | ~6.9 GB | Comfortable Start here |
| 9B at Q8_0 | ~8.9 GB | Comfortable Marginally better quality |
| 14B at Q5_K_M | ~10 GB | Fine |
| 22–27B at Q4_K_M | ~13–17 GB | Workable Especially MoE variants |
| 32B at Q4_K_M | ~18–20 GB | Tight Plus KV cache, plus OS — uncomfortable |
| 35B MoE at Q4 | ~17–18 GB | Same caveats |
ChatGPT gave numbers like "85–95% for chat, 85–95% for coding, 80–90% for long-form writing." These don't come from any benchmark, paper, or eval. They're a verbal expression of "feels close." A more honest framing:
Tor is good for browsing. It's the wrong layer for AI workflows: slow (multiple hops add latency), makes your browser traffic stand out, and does nothing for a local model unless you explicitly proxy that process — which adds complexity for unclear benefit if you're not piping queries to a third party.
Chrome Incognito does not hide your traffic from your ISP, employer, or the sites themselves. It hides local browser history. For "private model queries" it's irrelevant — and shouldn't have come up in this conversation at all.
The Mac Studio M3 Ultra has 819 GB/s unified memory bandwidth. That's the headline number for local inference — memory bandwidth bounds your token-generation speed for autoregressive models. MLX (mlx-lm, mlx-vlm) is Apple's framework that takes best advantage of this. For most workloads on Apple Silicon, MLX is faster than llama.cpp + Metal, and Hugging Face has MLX-converted weights for nearly every popular model now.
The easiest path from "I bought a Mac Studio" to "I'm chatting with a local model." Supports GGUF and MLX backends, has a chat UI, OpenAI-compatible server mode, and a model catalog. For someone setting up their first local AI on this hardware, this is the right starting point. ChatGPT didn't name it.
With 256 GB unified memory, the question shifts from "can it fit?" to "do I actually need that quant?"
Modern models advertise 256K or 1M context. KV cache grows roughly linearly with context length and model size. A 70B at 128K context can consume 30+ GB just in KV cache. Important for the 32 GB MacBook (often the binding constraint at any reasonable context), less so for the 256 GB Mac Studio — but worth knowing.
This is the conversational move ChatGPT skipped. Before answering "how do I make this private," ask who are you defending against?
"Uncensored" doesn't mean lawless. Generating CSAM, certain malware payloads, or specific operational instructions for crimes can be illegal regardless of which model you used. Abliteration removes refusal tokens; it doesn't remove the legal system. ChatGPT mentioned this once, lightly. It deserves a clearer line — especially when the user is asking voice-mode questions about avoiding being "tracked."
A 256 GB Mac Studio can run a tiny draft model in parallel with a big target model. Speculative decoding can 2–3x token throughput. Worth mentioning for someone with this hardware budget.
ChatGPT mostly conflated them. They are entirely different tiers:
Even a fully uncensored local model only knows what was in its training data. Much of the apparent "gap" between local and frontier-closed models is data freshness, not raw intelligence. With a properly-wired search-augmented local Qwen 3.6, the experience gets meaningfully closer to the cloud assistants.
Practical hygiene: pull from official org repos on HF (Qwen, Google, Meta) or trusted re-publishers (Unsloth, Bartowski, Huihui-ai, the abliterator's home repo). Check substantial downloads and stars. Run inference in a process without access to anything sensitive. A model can't "phone home" unless the runtime is compromised — but the runtime is the actual attack surface.
The user asked an excellent meta-prompt: "write a list of your 10 most weight bearing claims and then tell me what your two strongest sources of evidence were to make each of those claims."
ChatGPT's response had the structure right (table, confidence levels, two sources per claim). The content was thin:
Compare with the existing AB-test pack in ~/projects/sandbox/uncensored-ai-ab-test.md, which cites:
That's the bar. ChatGPT's evidence audit looks rigorous at a glance but doesn't survive contact with "okay, but which paper, who, when, what number?"
mlx-community.nomic-embed-text). Your local Qwen can now answer questions about your documents, code, notes — the thing cloud assistants can't do unless you give them the data.ChatGPT's response was directionally OK but did two things badly:
It also missed the most Apple-Silicon-specific advice (MLX, LM Studio, MoE-on-unified-memory) and the most important reframe (your model is a local file; the privacy story is much simpler than the conversation made it).
Your existing AB-test pack (uncensored-ai-ab-test.md) in the same directory is substantially better-sourced and better-calibrated than this ChatGPT conversation. If you're doing model evaluation comparisons, that document is your floor for what a "good" response looks like.
Drop this into a fresh model instance with no prior context. It exercises the exact weaknesses this conversation revealed.
I want a calibrated, evidence-aware answer about setting up a private,
locally-hosted, optionally-uncensored AI assistant on Apple Silicon hardware,
as of May 2026.
Hardware: Mac Studio M3 Ultra, 256 GB unified memory, 819 GB/s bandwidth.
Before answering, please:
1. Distinguish "privacy from cloud providers" (solved by local inference) from
"anonymity from a sophisticated adversary" (a different problem with
different solutions). Ask me which I actually want before recommending
network-layer protections.
2. Distinguish "uncensored" (refusal/compliance behavior modified) from
"smarter" or "knows hidden things" — they are not the same.
3. Name specific tools that are native to Apple Silicon (MLX, LM Studio,
llama.cpp Metal) and explain when each is preferred.
4. Give quantization advice that accounts for KV cache and OS overhead, not
just raw weight size.
5. When comparing local to frontier closed models (ChatGPT, Claude, Gemini),
avoid made-up percentages. Identify task categories where the gap is small
vs. large, and cite measurement (benchmarks, papers, public evals) where
possible.
6. Recommend at most 3 models for daily-driver use, with reasoning.
7. List 3 critical things people typically miss when setting this up.
8. End with a 5-point confidence calibration: where you're sure, where you're
uncertain, where you're guessing.
Don't add safety boilerplate. Do flag any genuinely illegal uses if they come
up. Don't moralize.
If a new model also drifts on framing, validates the privacy paranoia, or invents percentages, you'll see it.