
Offline Study Assistant

Android Studio · Kotlin · JNI · C++ · Python · PyTorch · Transformers · PEFT (QLoRA) · llama.cpp · SQLite · Room DB
GitHub · Demo: Coming soon

Overview

The Offline Study Assistant (OSA) is a fully offline, LLM-powered study companion designed to run on resource-constrained Android smartphones with no cloud or internet dependency.

The system enables students to ask questions, summarize textbook excerpts, and receive concise explanations entirely on device, prioritizing privacy, low latency, and real-world usability. OSA demonstrates that practical LLM inference is feasible on mid-range consumer hardware through careful model selection, fine-tuning, and systems-level optimization.

Problem Statement

How can a resource-constrained mobile device run a viable, efficient study assistant without internet access?

OSA answers this by delivering a local-only LLM inference pipeline optimized for mobile CPUs, enabling always-available AI assistance without external dependencies.

Technical Approach

Model Selection & Fine-Tuning

  • Base model: Llama-3.2-1B-Instruct
  • Fine-tuned using QLoRA (PEFT) to align the model with:
    • Concise, student-friendly explanations
    • Strict "answer-from-excerpt" behavior for study tasks
  • Trained on a curated mix of instruction and summarization datasets (SQuAD2.0, CNN/DailyMail, Alpaca, OASST, EduQG)
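
For concreteness, here is a minimal sketch of a QLoRA setup using Hugging Face Transformers and PEFT, the libraries named in the stack above. The rank, target modules, and other hyperparameters are illustrative assumptions, not the project's exact training configuration.

```python
# Minimal QLoRA sketch (hyperparameters are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-3.2-1B-Instruct"

# Load the base model in 4-bit NF4 -- the "Q" in QLoRA.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only this small fraction of weights is trained.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training then proceeds with a standard supervised fine-tuning loop over the
# instruction + summarization mix (SQuAD2.0, CNN/DailyMail, Alpaca, OASST, EduQG).
```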

Quantization & Inference

  • Quantized to Q4_K_M (4-bit) GGUF format for efficient CPU-only inference
  • Deployed using llama.cpp optimized for ARM architectures
  • Context window capped at 512 tokens to balance memory, latency, and answer quality
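
On the phone, inference goes through llama.cpp's native API via JNI, but the same quantized artifact and settings can be sanity-checked off-device. Below is a minimal sketch using the llama-cpp-python bindings (an assumed convenience for desktop validation, not part of the app), with a hypothetical model filename:

```python
# Desktop-side sanity check of the quantized model via llama-cpp-python
# (the app itself links llama.cpp natively through JNI).
from llama_cpp import Llama

llm = Llama(
    model_path="osa-llama-3.2-1b-q4_k_m.gguf",  # hypothetical filename
    n_ctx=512,      # same 512-token context cap used on device
    n_threads=4,    # roughly match the phone's performance cores
)

prompt = "Summarize: Photosynthesis converts light energy into chemical energy..."
# stream=True yields tokens as they are generated, mirroring the app's streaming UI.
for chunk in llm(prompt, max_tokens=128, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```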

Android Inference Stack

  • Jetpack Compose UI with streaming token output
  • JNI bridge connecting Kotlin UI to native C++ inference code
  • Local persistence using SQLite / Room DB
  • No server, no cloud APIs, no network access at inference time

Performance & Optimization

Through iterative benchmarking and tuning on a Snapdragon 720G device:

  • Time-to-first-token reduced from 2.8 minutes → <15 seconds
  • End-to-end response latency reduced from 11.7 minutes → <30 seconds
  • Generation speed improved from 0.3 → 8.2 tokens/sec (~27× speedup)

Evaluation & Benchmarking

A custom benchmarking pipeline was built using:

  • ADB + llama.cpp CLI
  • Fixed prompt sets and output lengths
  • Metrics captured: latency, throughput, memory usage (PSS), battery drain, and temperature rise

Results were logged and analyzed to identify Pareto-optimal configurations balancing speed, accuracy, and device stability.
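
A host-side harness along these lines can drive the on-device llama.cpp CLI over ADB; the binary path, model path, prompt, and thread sweep below are illustrative assumptions rather than the project's exact scripts.

```python
# Host-side benchmark sketch: run llama.cpp's CLI on the phone over ADB
# and record latency/throughput. Paths and values are illustrative assumptions.
import subprocess
import time

DEVICE_BIN = "/data/local/tmp/llama-cli"            # assumed push location
DEVICE_MODEL = "/data/local/tmp/model-q4_k_m.gguf"  # assumed model path

def adb(*args: str) -> str:
    """Run a command on the device via `adb shell` and return its stdout."""
    return subprocess.run(
        ["adb", "shell", *args], capture_output=True, text=True, check=True
    ).stdout

def run_prompt(prompt: str, n_predict: int = 128, threads: int = 4) -> dict:
    start = time.time()
    out = adb(
        DEVICE_BIN, "-m", DEVICE_MODEL, "-c", "512",
        "-t", str(threads), "-n", str(n_predict), "-p", f"'{prompt}'",
    )
    elapsed = time.time() - start
    # Crude tokens/sec: includes model load; llama.cpp also prints its own timings.
    return {"elapsed_s": elapsed, "tok_per_s": n_predict / elapsed, "raw": out}

# Sweep thread counts to locate the speed/stability Pareto frontier.
for t in (2, 4, 6, 8):
    r = run_prompt("Explain osmosis in two sentences.", threads=t)
    print(f"threads={t}  {r['tok_per_s']:.2f} tok/s  ({r['elapsed_s']:.1f}s)")
```

Memory (PSS), battery, and temperature can be sampled alongside each run with standard `adb shell dumpsys` queries and logged to the same results table.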

Key Features

  • Fully offline, privacy-preserving AI assistant
  • CPU-only inference on mid-range Android devices
  • Concise explanations and excerpt-grounded answers
  • Streaming responses for improved UX
  • No cloud cost, no data leakage, no network dependency

Current Status

OSA is an active research and engineering project exploring the limits of on-device LLM deployment for education. Ongoing work includes further latency reduction, UX refinements, and evaluation on lower-powered hardware.