A/B Test Critique — Voice-mode ChatGPT conversation

Honest critique: setting up a local uncensored AI on Apple Silicon

A second-opinion read on a voice-mode ChatGPT conversation about running an uncensored LLM at home on a Mac Studio M3 Ultra. What ChatGPT got right, where it drifted, and what was missing.

Reviewer: Claude Opus 4.7 (1M context) Date: 2026-05-23 Hardware in scope: Mac Studio M3 Ultra 256 GB & MacBook Pro M2 Max 32 GB Verified via: live web search, May 2026

TL;DR — the 5 things that matter most

If you read nothing else, read these. Color-coded by type of issue.

01Framing missThe privacy advice is a red herring

A local LLM is a pile of weights doing matrix multiplication on your laptop. It doesn't "phone home," doesn't leave a pattern, doesn't need VPN rotation or Tor — unless you explicitly wire it up to make network calls. ChatGPT validated the user's threat-model confusion instead of correcting it. This is the single biggest miss in the conversation.
02Surface OKThe model names are real and current

Qwen3.5-397B-A17B (Feb 2026), Qwen3.6-35B-A3B (Apr 2026), Gemma 4 31B (Apr 2026), QwQ-32B, Qwen3-Coder-Next, GLM-4.7, Heretic abliteration tool — all verified to exist. ChatGPT's catalog was accurate. Where it slipped was in the certainty placed on percentage comparisons.
03Tightness undersold32 GB MacBook is more constrained than ChatGPT implied

"A 14 or 32 billion parameter model would fit well" is optimistic. A 32B at Q4_K_M is ~18 GB just for weights, plus KV cache, plus OS. Start at 9B Q6/Q8 (~7–9 GB) or a 26B MoE at Q4 (~17 GB). 32B dense at Q4 is doable but uncomfortable.
04Confidence theaterThe percentage tables are vibes, not measurements

"General chat 85–95%, Coding 85–95%" — these numbers don't come from any benchmark. The honest line: for everyday Q&A and writing the gap is small day-to-day; for agentic coding and current-events recall the frontier still leads by a real margin. Benchmark on your tasks.
05Missing toolingNo mention of MLX, LM Studio, or how Apple Silicon changes the calculus

The Mac Studio M3 Ultra has 819 GB/s unified-memory bandwidth — that's the headline number for local inference. MLX (Apple's native framework) plus LM Studio (the obvious GUI) is the modern stack. MoE models like Qwen3.6-35B-A3B exploit this hardware particularly well. ChatGPT mentioned Ollama once and llama.cpp not at all. For someone with this machine, this is the bigger story than which Qwen variant to download.

01Strong agreements

Calls ChatGPT got right and that I'd repeat verbatim:

Claim	Why it's right
Uncensored ≠ smarter	Uncensored models change the refusal/compliance layer, not the underlying weights. The capability gap to closed frontier models is mostly upstream of safety tuning.
Most uncensoring is a small fine-tune or abliteration over the same base	Heretic (p-e-w's automated abliteration tool, ~1000 derived models on HF) and Dolphin-style fine-tunes are the two dominant paths today. KL divergence from the base is often very small.
The real multiplier is RAG, tools, memory	Underappreciated. A well-configured 35B with retrieval over your own files and tool use will outperform a raw 400B running slowly without those wrappers, for personal-assistant tasks.
Brave Tor mode ≠ system VPN	Technically right, even if explained muddily. They are separate layers and don't coordinate automatically.
For 256 GB Mac Studio, a 35B-class daily driver is sensible	Qwen3.6-35B-A3B is a particularly good choice because MoE models exploit Apple Silicon's unified-memory bandwidth well.
Local model = no automatic internet lookup	Correct — but ChatGPT then walked this back by validating the privacy paranoia. The first half was right; the rest got fuzzy.

02Significant disagreements / course corrections

1. The privacy paranoia got rubber-stamped instead of corrected

This is the biggest framing issue in the whole conversation. The user asked, in sequence:

"There's no way there could be any issues or anything, right?"

"Will it go on the internet and use its abilities, or will it limit itself?"

"Can I do it in a private mode so they don't track or record the searching?"

"Would [uBlock Origin] automatically do it, or do I have to put it in the prompt?"

"Do I need to study triggers that occur if a query spends a certain amount of time in a certain area?"

"Is there a way to keep rotating VPNs that I'm using to go out, randomly?"

The honest answer to all of these is the same one: a local LLM does not initiate network traffic. There is no "going out." There are no "patterns" to fingerprint. There is nothing to track.

The only ways a local model can phone home are:

You wire up RAG over a remote endpoint
You wire up web search as a tool
You expose the model behind a remote API
The runtime (Ollama, LM Studio, etc.) has telemetry — usually opt-out
A model repo includes code that calls out (extremely rare; treat as malware if so)

ChatGPT's answers weren't technically wrong — it said "won't reach out by default" early on — but it then earnestly engaged with VPN rotation, Tor mode, uBlock Origin coordination, and "triggers from repeat patterns" as if those concerns applied to the model. They don't. The user walked away with a more complicated mental model than they needed.

The cleaner reframe: "What threat are you protecting against? If it's 'I don't want OpenAI/Anthropic to see my queries' — that's solved by running locally; nothing else needed. If it's 'I want to use a search API privately' — that's a browser-layer problem, and yes VPN/Tor are tools there. If it's 'I want to evade a sophisticated state adversary' — name them, because the answer changes completely."

2. "Block Origin" misnomer left uncorrected

The user said "block origin" — a voice-transcription of uBlock Origin (the Chrome/Firefox extension). ChatGPT just rolled with the wrong name. Small thing on its own, but it's symptomatic: ChatGPT didn't reconstruct what the user actually meant, and so didn't catch the deeper issue — uBlock Origin is a browser-layer protection. It has zero relationship to a local LLM process. It's not just unhelpful for LLM traffic — it's in a different threat domain entirely.

3. 32 GB MacBook Pro M2 Max — be more honest about tightness

ChatGPT said: "a 14 or 32 billion parameter model would fit well." For 32 GB unified memory:

Model size / quant	Approx. weights	Verdict
9B at Q6_K	~6.9 GB	Comfortable Start here
9B at Q8_0	~8.9 GB	Comfortable Marginally better quality
14B at Q5_K_M	~10 GB	Fine
22–27B at Q4_K_M	~13–17 GB	Workable Especially MoE variants
32B at Q4_K_M	~18–20 GB	Tight Plus KV cache, plus OS — uncomfortable
35B MoE at Q4	~17–18 GB	Same caveats

4. The percentage tables are confidence theater

ChatGPT gave numbers like "85–95% for chat, 85–95% for coding, 80–90% for long-form writing." These don't come from any benchmark, paper, or eval. They're a verbal expression of "feels close." A more honest framing:

Casual chat and writing: blind-comparison surveys often place open 30–35B and frontier closed models within noise. "Feels similar" is supported.
Coding and agentic workflows: SWE-bench, LiveCodeBench, agent benchmarks consistently show a real gap. GLM-4.7 at 84.9% LiveCodeBench, Qwen3-Coder-Next at ~70% SWE-bench Verified — strong but not at frontier-closed numbers.
Recall of current events: any static local model trails because it doesn't have RAG/search by default. This isn't intelligence; it's data freshness.

5. VPN rotation advice is overkill and potentially counterproductive

Frequent rotation can look more suspicious to monitoring systems than steady traffic. Stable users look like users. Rapidly-rotating IPs look like infrastructure.
Tor + VPN order matters. VPN → Tor → exit gives you VPN-IP-hiding plus Tor anonymity. Tor → VPN → exit breaks Tor's anonymity guarantee because the VPN can see your real traffic.
For most threat models the user is likely to have, neither matters. For pure local inference, neither is needed at all.

6. Tor browser oversold for an AI workflow

Tor is good for browsing. It's the wrong layer for AI workflows: slow (multiple hops add latency), makes your browser traffic stand out, and does nothing for a local model unless you explicitly proxy that process — which adds complexity for unclear benefit if you're not piping queries to a third party.

7. Chrome Incognito reference

Chrome Incognito does not hide your traffic from your ISP, employer, or the sites themselves. It hides local browser history. For "private model queries" it's irrelevant — and shouldn't have come up in this conversation at all.

03Critical things missing

1. MLX — Apple's native ML framework

The Mac Studio M3 Ultra has 819 GB/s unified memory bandwidth. That's the headline number for local inference — memory bandwidth bounds your token-generation speed for autoregressive models. MLX (mlx-lm, mlx-vlm) is Apple's framework that takes best advantage of this. For most workloads on Apple Silicon, MLX is faster than llama.cpp + Metal, and Hugging Face has MLX-converted weights for nearly every popular model now.

2. LM Studio — the obvious GUI

The easiest path from "I bought a Mac Studio" to "I'm chatting with a local model." Supports GGUF and MLX backends, has a chat UI, OpenAI-compatible server mode, and a model catalog. For someone setting up their first local AI on this hardware, this is the right starting point. ChatGPT didn't name it.

3. Quantization tradeoffs

With 256 GB unified memory, the question shifts from "can it fit?" to "do I actually need that quant?"

Q4_K_M / Q4_K_S: good for most chat. Often indistinguishable from Q8 in blind comparison.
Q5_K_M / Q6_K: sweet spot for quality-conscious users.
Q8_0: near-lossless. Use when you have RAM to spare.
BF16 / FP16: original precision. Research or benchmark-comparison use; overkill for daily chat.

4. KV cache costs

Modern models advertise 256K or 1M context. KV cache grows roughly linearly with context length and model size. A 70B at 128K context can consume 30+ GB just in KV cache. Important for the 32 GB MacBook (often the binding constraint at any reasonable context), less so for the 256 GB Mac Studio — but worth knowing.

5. Threat model articulation

This is the conversational move ChatGPT skipped. Before answering "how do I make this private," ask who are you defending against?

Probably the user's actual concern

Don't want OpenAI/Anthropic to see queries → Solved by local inference. Done.
Don't want ISP/employer to see traffic → No external traffic = nothing to see.

Different problem entirely

State adversary → The Mac itself, Apple ID, network, physical location are far bigger leaks than the model.
"Don't like the principle" → Local-only setup, done.

6. Real legal / ethical limits

"Uncensored" doesn't mean lawless. Generating CSAM, certain malware payloads, or specific operational instructions for crimes can be illegal regardless of which model you used. Abliteration removes refusal tokens; it doesn't remove the legal system. ChatGPT mentioned this once, lightly. It deserves a clearer line — especially when the user is asking voice-mode questions about avoiding being "tracked."

7. Speculative decoding

A 256 GB Mac Studio can run a tiny draft model in parallel with a big target model. Speculative decoding can 2–3x token throughput. Worth mentioning for someone with this hardware budget.

8. The MacBook and the Mac Studio are different problems

ChatGPT mostly conflated them. They are entirely different tiers:

MacBook Pro M2 Max 32 GB

Casual chat, 9B–22B quantized
Single conversation at a time
Modest context length
LM Studio or Ollama, MLX backend

Mac Studio M3 Ultra 256 GB

Serious work, 35B–400B class
Multi-conversation, long context
Can serve a household
Agentic workloads, RAG with local vector DB

9. The real limit on a local model: knowledge cutoff

Even a fully uncensored local model only knows what was in its training data. Much of the apparent "gap" between local and frontier-closed models is data freshness, not raw intelligence. With a properly-wired search-augmented local Qwen 3.6, the experience gets meaningfully closer to the cloud assistants.

10. Pre-tampered weights are a real risk

Practical hygiene: pull from official org repos on HF (Qwen, Google, Meta) or trusted re-publishers (Unsloth, Bartowski, Huihui-ai, the abliterator's home repo). Check substantial downloads and stars. Run inference in a process without access to anything sensitive. A model can't "phone home" unless the runtime is compromised — but the runtime is the actual attack surface.

04On the evidence audit at the end of the ChatGPT conversation

The user asked an excellent meta-prompt: "write a list of your 10 most weight bearing claims and then tell me what your two strongest sources of evidence were to make each of those claims."

ChatGPT's response had the structure right (table, confidence levels, two sources per claim). The content was thin:

"Community benchmark discussions" — not a source
"Architecture size/performance tradeoff" — restatement, not a source
"User reports comparing reasoning and coding performance" — not specific enough to verify
"Larger active capacity and benchmark results" — which benchmarks?
"Independent benchmark leaderboards" — name them

Compare with the existing AB-test pack in ~/projects/sandbox/uncensored-ai-ab-test.md, which cites:

Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction," NeurIPS 2024
Sokhansanj, "Uncensored AI in the Wild," Future Internet 2025 (with specific numbers: 80.0% vs 19.2% unsafe-prompt compliance, 8,608 repositories analyzed)
HauhauCS model cards by name, with specific quant file sizes
Hugging Face Alignment Handbook
Dolphin / Cognitive Computations model card

That's the bar. ChatGPT's evidence audit looks rigorous at a glance but doesn't survive contact with "okay, but which paper, who, when, what number?"

05Concrete recommendations for the 256 GB Mac Studio build

Day-one stack

LM Studio — install, open it. Free, native, MLX backend, OpenAI-compatible server. This is your front door.
First model: Qwen3.6-35B-A3B at MLX 8-bit. MoE design exploits Apple Silicon bandwidth well. Use the official Qwen repo or an MLX conversion from mlx-community.
Try the Heretic-abliterated variant of the same model if mainstream safety tuning interferes with your use case. Verify the abliteration didn't tank coding/reasoning on your eval prompts.

Second-week additions

A reasoning model: QwQ-32B or a recent DeepSeek-R1-class model. Test against the same prompts as Qwen3.6 to learn where each shines.
A coding model: Qwen3-Coder-Next (80B / 3B active, 256K context — designed for terminal/IDE workflows). 256K context lets you load a whole repo.
A vision model: Gemma 4 31B (natively multimodal) or Qwen3.5-VL for the multimodal-native Qwen.

Capability multipliers — skip the VPN/Tor layer, build these instead

RAG over your own files: Open WebUI as the frontend + a local embedding model (e.g., nomic-embed-text). Your local Qwen can now answer questions about your documents, code, notes — the thing cloud assistants can't do unless you give them the data.
Web search tool via SearXNG (local meta-search) wired into Open WebUI. Current-events recall. This is where you'd want privacy hygiene — but at the search-engine layer, not the model layer.
MCP servers for your real workflows: filesystem, your code, calendar, etc.

Moonshot experiment

Qwen3.5-397B-A17B at Q4. Fits in 256 GB; 17B active means it's reasonably fast. The "frontier-class open" experiment that only this hardware tier can really do. Worth running once to feel the limit.

What to skip

Tor browser for AI workflows — slow, mismatched threat model
VPN rotation for the model — there is no traffic to rotate
uBlock Origin as a "privacy layer for the AI" — wrong layer entirely; uBO is a browser extension and has no relationship to your local LLM

06Bottom line

ChatGPT's response was directionally OK but did two things badly:

1. Validated a threat model that didn't apply to local inference. The user walked away thinking they need VPN rotation, Tor mode, and "trigger pattern" awareness for what is, in fact, an offline matrix-multiply on their own machine. The single most useful correction would have been: "Stop. A local LLM doesn't have a network connection unless you give it one. Most of these privacy questions don't apply."

2. Manufactured precise-looking comparisons. The percentage tables and the "10 claims, 2 sources each" structure look rigorous but the underlying citations are vibes. Better to say "I don't have a measured number for this; benchmark on your tasks."

It also missed the most Apple-Silicon-specific advice (MLX, LM Studio, MoE-on-unified-memory) and the most important reframe (your model is a local file; the privacy story is much simpler than the conversation made it).

Your existing AB-test pack (uncensored-ai-ab-test.md) in the same directory is substantially better-sourced and better-calibrated than this ChatGPT conversation. If you're doing model evaluation comparisons, that document is your floor for what a "good" response looks like.

07Suggested clean-slate test prompt for your friend's A/B

Drop this into a fresh model instance with no prior context. It exercises the exact weaknesses this conversation revealed.

I want a calibrated, evidence-aware answer about setting up a private,
locally-hosted, optionally-uncensored AI assistant on Apple Silicon hardware,
as of May 2026.

Hardware: Mac Studio M3 Ultra, 256 GB unified memory, 819 GB/s bandwidth.

Before answering, please:
1. Distinguish "privacy from cloud providers" (solved by local inference) from
   "anonymity from a sophisticated adversary" (a different problem with
   different solutions). Ask me which I actually want before recommending
   network-layer protections.
2. Distinguish "uncensored" (refusal/compliance behavior modified) from
   "smarter" or "knows hidden things" — they are not the same.
3. Name specific tools that are native to Apple Silicon (MLX, LM Studio,
   llama.cpp Metal) and explain when each is preferred.
4. Give quantization advice that accounts for KV cache and OS overhead, not
   just raw weight size.
5. When comparing local to frontier closed models (ChatGPT, Claude, Gemini),
   avoid made-up percentages. Identify task categories where the gap is small
   vs. large, and cite measurement (benchmarks, papers, public evals) where
   possible.
6. Recommend at most 3 models for daily-driver use, with reasoning.
7. List 3 critical things people typically miss when setting this up.
8. End with a 5-point confidence calibration: where you're sure, where you're
   uncertain, where you're guessing.

Don't add safety boilerplate. Do flag any genuinely illegal uses if they come
up. Don't moralize.

If a new model also drifts on framing, validates the privacy paranoia, or invents percentages, you'll see it.