
Offline Study Assistant

Android Studio · Kotlin · JNI · C++ · Python · PyTorch · Transformers · PEFT (QLoRA) · llama.cpp · SQLite · Room DB
GitHub · Demo: Coming soon

Overview

The Offline Study Assistant (OSA) is a fully offline, LLM-powered study companion designed to run on resource-constrained Android smartphones with no cloud or internet dependency.

The system enables students to ask questions, summarize textbook excerpts, and receive concise explanations entirely on device, prioritizing privacy, low latency, and real-world usability. OSA demonstrates that practical LLM inference is feasible on mid-range consumer hardware through careful model selection, fine-tuning, and systems-level optimization.

Problem Statement

How can a resource-constrained mobile device run a viable, efficient study assistant without internet access?

OSA answers this by delivering a local-only LLM inference pipeline optimized for mobile CPUs, enabling always-available AI assistance without external dependencies.

Technical Approach

Model Selection & Fine-Tuning

  • Base model: Llama-3.2-1B-Instruct
  • Fine-tuned using QLoRA (PEFT) to align the model with:
    • Concise, student-friendly explanations
    • Strict "answer-from-excerpt" behavior for study tasks
  • Trained on a curated mix of instruction and summarization datasets (SQuAD2.0, CNN/DailyMail, Alpaca, OASST, EduQG)
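
For concreteness, here is a minimal sketch of a QLoRA setup using Hugging Face Transformers and PEFT, the libraries named in the stack above. The rank, target modules, and other hyperparameters are illustrative assumptions, not the project's exact training configuration.

```python
# Minimal QLoRA sketch (hyperparameters are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-3.2-1B-Instruct"

# Load the base model in 4-bit NF4 -- the "Q" in QLoRA.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only this small fraction of weights is trained.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training then proceeds with a standard supervised fine-tuning loop over the
# instruction + summarization mix (SQuAD2.0, CNN/DailyMail, Alpaca, OASST, EduQG).
```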

Quantization & Inference

  • Quantized to Q4_K_M (4-bit) GGUF format for efficient CPU-only inference
  • Deployed using llama.cpp optimized for ARM architectures
  • Context window capped at 512 tokens to balance memory, latency, and answer quality
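
On the phone, inference goes through llama.cpp's native API via JNI, but the same quantized artifact and settings can be sanity-checked off-device. Below is a minimal sketch using the llama-cpp-python bindings (an assumed convenience for desktop validation, not part of the app), with a hypothetical model filename:

```python
# Desktop-side sanity check of the quantized model via llama-cpp-python
# (the app itself links llama.cpp natively through JNI).
from llama_cpp import Llama

llm = Llama(
    model_path="osa-llama-3.2-1b-q4_k_m.gguf",  # hypothetical filename
    n_ctx=512,      # same 512-token context cap used on device
    n_threads=4,    # roughly match the phone's performance cores
)

prompt = "Summarize: Photosynthesis converts light energy into chemical energy..."
# stream=True yields tokens as they are generated, mirroring the app's streaming UI.
for chunk in llm(prompt, max_tokens=128, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```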

Android Inference Stack

  • Jetpack Compose UI with streaming token output
  • JNI bridge connecting Kotlin UI to native C++ inference code
  • Local persistence using SQLite / Room DB
  • No server, no cloud APIs, no network access at inference time

Performance & Optimization

Through iterative benchmarking and tuning on a Snapdragon 720G device:

  • Time-to-first-token reduced from 2.8 minutes → <15 seconds
  • End-to-end response latency reduced from 11.7 minutes → <30 seconds
  • Generation speed improved from 0.3 → 8.2 tokens/sec (~27× speedup)

Evaluation & Benchmarking

A custom benchmarking pipeline was built using:

  • ADB + llama.cpp CLI
  • Fixed prompt sets and output lengths
  • Metrics captured: latency, throughput, memory usage (PSS), battery drain, and temperature rise

Results were logged and analyzed to identify Pareto-optimal configurations balancing speed, accuracy, and device stability.
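
A host-side harness along these lines can drive the on-device llama.cpp CLI over ADB; the binary path, model path, prompt, and thread sweep below are illustrative assumptions rather than the project's exact scripts.

```python
# Host-side benchmark sketch: run llama.cpp's CLI on the phone over ADB
# and record latency/throughput. Paths and values are illustrative assumptions.
import subprocess
import time

DEVICE_BIN = "/data/local/tmp/llama-cli"            # assumed push location
DEVICE_MODEL = "/data/local/tmp/model-q4_k_m.gguf"  # assumed model path

def adb(*args: str) -> str:
    """Run a command on the device via `adb shell` and return its stdout."""
    return subprocess.run(
        ["adb", "shell", *args], capture_output=True, text=True, check=True
    ).stdout

def run_prompt(prompt: str, n_predict: int = 128, threads: int = 4) -> dict:
    start = time.time()
    out = adb(
        DEVICE_BIN, "-m", DEVICE_MODEL, "-c", "512",
        "-t", str(threads), "-n", str(n_predict), "-p", f"'{prompt}'",
    )
    elapsed = time.time() - start
    # Crude tokens/sec: includes model load; llama.cpp also prints its own timings.
    return {"elapsed_s": elapsed, "tok_per_s": n_predict / elapsed, "raw": out}

# Sweep thread counts to locate the speed/stability Pareto frontier.
for t in (2, 4, 6, 8):
    r = run_prompt("Explain osmosis in two sentences.", threads=t)
    print(f"threads={t}  {r['tok_per_s']:.2f} tok/s  ({r['elapsed_s']:.1f}s)")
```

Memory (PSS), battery, and temperature can be sampled alongside each run with standard `adb shell dumpsys` queries and logged to the same results table.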

Key Features

  • Fully offline, privacy-preserving AI assistant
  • CPU-only inference on mid-range Android devices
  • Concise explanations and excerpt-grounded answers
  • Streaming responses for improved UX
  • No cloud cost, no data leakage, no network dependency

Current Status

OSA is an active research and engineering project exploring the limits of on-device LLM deployment for education. Ongoing work includes further latency reduction, UX refinements, and evaluation on lower-powered hardware.