Quantized Labs | SDK Documentation

The Quantized Labs SDK executes .quantized artifacts on edge devices. It exposes high-level inference APIs alongside low-level hardware hooks, and transparently manages NPU/GPU orchestration, asymmetric entropy routing, and phase-state decapsulation.

1. Installation & Ecosystem Integrations

The Quantized Labs SDK natively supports JavaScript/WebAssembly, iOS (Swift), and Android (Kotlin) targets.

Node.js / React Native / WebAssembly

npm install @quantized-labs/sdk

iOS (Swift Package Manager)

.package(url: "https://github.com/quantized-labs/quantized_labs-swift.git", from: "1.2.0")

Android (Gradle / Kotlin)

implementation("com.quantizedlabs.engine:engine:1.2.0")

2. Native Initialization & Hardware Hooks

By default, the engine detects the available hardware accelerators automatically. Every native wrapper also lets you force a specific hardware target through computeTarget.

TypeScript (Node.js & React Native)

import { QuantizedEngine } from '@quantized-labs/sdk';

const engine = new QuantizedEngine({
  apiKey: 'ql_prod_your_api_key_here',
  computeTarget: 'auto', // Auto-detects NPU -> GPU -> CPU
  memoryLimitMB: 1024
});
await engine.loadModel('./weights/llama3-8b.quantized');

iOS (Swift)

import QuantizedEngine

// Swift bridging directly targets the Apple Neural Engine (ANE)
let engine = try QuantizedEngine(config: QuantizedLabsConfig(
    apiKey: "ql_prod_your_api_key_here",
    computeTarget: .appleNeuralEngine,
    memoryLimitMB: 1024
))
try await engine.loadModel(from: Bundle.main.url(forResource: "llama3-8b", withExtension: "quantized")!)

Android (Kotlin)

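import com.quantizedlabs.engine.QuantizedEngine
// Assumption: ComputeTarget lives in the same package as QuantizedEngine
import com.quantizedlabs.engine.ComputeTarget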

// Kotlin JNI directly targets Snapdragon Hexagon DSP
val engine = QuantizedEngine(
    apiKey = "ql_prod_your_api_key_here",
    computeTarget = ComputeTarget.SNAPDRAGON_NPU,
    memoryLimitMB = 1024
)
engine.loadModel(assets.open("llama3-8b.quantized"))

Supported Hardware Targets

Target Flag	Description	Minimum Requirement
`apple_neural_engine`	Routes matrix multiplication exclusively to the ANE via CoreML buffers.	A14 Bionic / M1
`snapdragon_npu`	Utilizes Qualcomm Hexagon Tensor Accelerators.	Snapdragon 8 Gen 2
`cuda`	Desktop/Server fallback via NVIDIA CUDA.	Compute Capability 7.0+
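
The 'auto' fallback order noted in the TypeScript example (NPU -> GPU -> CPU) can be illustrated with a small selection function. This is only a sketch: the Accelerator type and the pickComputeTarget helper are hypothetical names rather than SDK exports, and the real detection runs inside the engine.

```typescript
// Hypothetical accelerator classes, ordered as in the 'auto' comment: NPU -> GPU -> CPU.
type Accelerator = 'npu' | 'gpu' | 'cpu';

const AUTO_PRIORITY: readonly Accelerator[] = ['npu', 'gpu', 'cpu'];

// Return the highest-priority accelerator that the device reports.
// Falls back to 'cpu' when the list contains nothing usable.
function pickComputeTarget(available: readonly Accelerator[]): Accelerator {
  for (const candidate of AUTO_PRIORITY) {
    if (available.indexOf(candidate) !== -1) {
      return candidate;
    }
  }
  return 'cpu';
}

// A device with no NPU resolves to the GPU.
console.log(pickComputeTarget(['gpu', 'cpu'])); // prints "gpu"
```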

3. API Reference

engine.generate(prompt: string, config?: GenerateConfig): Promise<string>

Performs non-streaming inference. The returned promise resolves with the complete generated string once a stop token is reached or the maximum number of tokens has been generated.

engine.stream(prompt: string, config?: GenerateConfig): AsyncGenerator<string>

Returns an async generator that yields tokens as they are decoded from the asymmetric pathways. Use it in interactive UIs so that text can be rendered incrementally instead of after the full completion.
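
A minimal sketch of consuming the generator with for await follows. To stay self-contained it runs against a stand-in object (fakeEngine, hypothetical) that has the same stream shape as the signature above, with the optional config argument omitted. The collectStream helper is also an illustration, not part of the SDK.

```typescript
// Minimal shape of the streaming call, taken from the signature above.
interface StreamingEngine {
  stream(prompt: string): AsyncGenerator<string>;
}

// Hypothetical stand-in that yields a fixed token sequence.
const fakeEngine: StreamingEngine = {
  async *stream(_prompt: string): AsyncGenerator<string> {
    for (const token of ['Edge', ' inference', ' works']) {
      yield token;
    }
  },
};

// Consume tokens as they arrive. onToken is where a UI would append text.
async function collectStream(
  engine: StreamingEngine,
  prompt: string,
  onToken: (token: string) => void,
): Promise<string> {
  let text = '';
  for await (const token of engine.stream(prompt)) {
    onToken(token);
    text += token;
  }
  return text;
}
```

With a real engine, the same loop would call engine.stream(prompt) directly; the callback keeps rendering logic separate from token collection.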

engine.getTelemetry(): EngineTelemetry

Returns live hardware metrics including instantaneous Tokens-Per-Second (TPS), VRAM allocation, and thermal sensor readings.
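
Because the TPS value is instantaneous, a display may want to smooth it over a few readings. The sketch below averages the most recent samples. The tokensPerSecond field name is an assumption, since the EngineTelemetry fields are not listed here, and TpsSmoother is a hypothetical helper.

```typescript
// Hypothetical subset of EngineTelemetry; the real field name may differ.
interface TelemetrySample {
  tokensPerSecond: number;
}

// Keeps the last `windowSize` samples and reports their mean.
class TpsSmoother {
  private readonly samples: number[] = [];
  private readonly windowSize: number;

  constructor(windowSize: number) {
    if (windowSize < 1) {
      throw new RangeError('windowSize must be at least 1');
    }
    this.windowSize = windowSize;
  }

  // Record one sample and return the mean of the current window.
  add(sample: TelemetrySample): number {
    this.samples.push(sample.tokensPerSecond);
    if (this.samples.length > this.windowSize) {
      this.samples.shift(); // drop the oldest reading
    }
    const total = this.samples.reduce((sum, value) => sum + value, 0);
    return total / this.samples.length;
  }
}
```

In an application, each value returned by engine.getTelemetry() would be passed to add() on a timer.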

engine.unload(): Promise<void>

Gracefully destroys the execution context and frees all associated RAM/VRAM buffers. Call it whenever an engine is no longer needed. In Single Page Applications, where the page is never reloaded, skipping this call leaks memory.
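
One way to make sure unload() always runs is a try/finally wrapper. The sketch below assumes only the generate and unload signatures shown above; the withEngine helper and the UnloadableEngine interface are hypothetical names.

```typescript
// Minimal shape needed for this sketch, taken from the signatures above.
interface UnloadableEngine {
  generate(prompt: string): Promise<string>;
  unload(): Promise<void>;
}

// Run one task against the engine and always release its buffers afterwards,
// even when the task throws.
async function withEngine<T>(
  engine: UnloadableEngine,
  task: (engine: UnloadableEngine) => Promise<T>,
): Promise<T> {
  try {
    return await task(engine);
  } finally {
    await engine.unload();
  }
}
```

This pattern suits one-shot work. For an engine that is reused across many requests, the equivalent is to call unload() once in the component or page teardown path.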

4. Error Handling

The SDK throws an error carrying one of the codes below when initialization or execution fails. Always wrap initialization and inference calls in a try/catch block.

Error Code	Reason	Resolution
`ERR_NPU_UNAVAILABLE`	The requested `computeTarget` could not be accessed.	Fall back to `'auto'` or `'cpu'`.
`ERR_OOM_RESERVE`	Host device has insufficient contiguous memory.	Decrease the `memoryLimitMB` during initialization or free up OS memory.
`ERR_CHECKSUM_MISMATCH`	The `.quantized` file is corrupted or tampered with.	Re-download the artifact from the MCaaS portal.
`ERR_THERMAL_THROTTLE`	Host device temperature exceeded safe limits.	Wait for the device to cool. Generation halts automatically to prevent hardware damage.
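
The resolutions in the table can be expressed as a small decision function that a catch block might call. This is a sketch under assumptions: how the code is exposed on the thrown error is not specified here, the recoveryConfig helper is hypothetical, and halving the memory limit is an arbitrary choice.

```typescript
// Codes copied from the table above.
type EngineErrorCode =
  | 'ERR_NPU_UNAVAILABLE'
  | 'ERR_OOM_RESERVE'
  | 'ERR_CHECKSUM_MISMATCH'
  | 'ERR_THERMAL_THROTTLE';

// Subset of the constructor options from Section 2.
interface EngineConfig {
  computeTarget: string;
  memoryLimitMB: number;
}

// Return an adjusted config to retry with, or null when changing settings
// cannot help (corrupted artifact, overheated device).
function recoveryConfig(config: EngineConfig, code: EngineErrorCode): EngineConfig | null {
  switch (code) {
    case 'ERR_NPU_UNAVAILABLE':
      // Fall back to 'auto' first, then to 'cpu' if 'auto' was already in use.
      return { ...config, computeTarget: config.computeTarget === 'auto' ? 'cpu' : 'auto' };
    case 'ERR_OOM_RESERVE':
      // Decrease memoryLimitMB; halving is an assumption, not an SDK rule.
      return { ...config, memoryLimitMB: Math.floor(config.memoryLimitMB / 2) };
    default:
      // Checksum and thermal errors need a re-download or a cool-down instead.
      return null;
  }
}
```

The function returns a new object rather than mutating its input, so the original settings stay available for logging.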

5. Cognitive Module (LoRA) Training

Cognitive Modules let you fine-tune specific behaviors. The compiler takes a standard HuggingFace PEFT adapter as input and compiles it into a .lora file, which can then be hot-swapped into a running engine that has the base .quantized model loaded.

import { QuantizedLabsCompiler } from '@quantized-labs/quantized_labs-compiler';

// Convert a standard HuggingFace PEFT adapter
const compiler = new QuantizedLabsCompiler();
await compiler.compileLora({
  inputAdapterPath: './hf_lora_weights/',
  baseModelFormat: 'llama3-8b',
  outputArtifact: './medical_expert.lora'
});

// Hot-swap the module at runtime in the SDK
await engine.applyModule('./medical_expert.lora');

6. CI/CD Deployment Hooks

Use the CLI to pull the latest compressed models automatically in GitHub Actions or GitLab pipelines, for example when building Over-The-Air (OTA) app updates.

# Add this to your CI/CD pipeline script
npm install -g @quantized-labs/quantized_labs-cli

# Authenticate using your Enterprise API key
quantized_labs auth --token "$QUANTIZED_LABS_API_KEY"

# Pull the latest compressed production payload directly into your build folder
quantized_labs pull project_49a8f2 --env production --out ./android/app/src/main/assets/models/

7. Advanced SDK Usage

Memory Pinning

To achieve zero-copy inference, you can pin the engine's memory buffers at the OS level. Pinned buffers are not managed by the garbage collector, and the OS cannot page them out.

// Pin 2 GB (2048 MB) of contiguous memory
engine.pinMemory({
  sizeMB: 2048,
  lockToNPU: true,
  preventPaging: true
});

Core Isolation (Android Only)

On big.LITTLE architectures, you can bind the Quantized Labs engine exclusively to the high-performance cores, which avoids thread migration latency.

val engine = QuantizedEngine(
    computeTarget = ComputeTarget.SNAPDRAGON_NPU,
    threadAffinity = intArrayOf(4, 5, 6, 7) // Bind to Gold cores
)

8. Migrating from CoreML / ONNX

If you currently run edge LLMs on CoreML or ONNX Runtime, you may be seeing high RAM overhead and thermal throttling. Quantized Labs is designed as a drop-in replacement, and migration takes the three steps below.

Step 1: Start from the HuggingFace FP16 Model

No ONNX or CoreML export is needed. Start from your base FP16 model in PyTorch or Safetensors format, as published on HuggingFace.

Step 2: Compile to .quantized

Use the Quantized Labs Compiler CLI to process the FP16 model. The compiler will run Asymmetric Entropy Routing to shrink the model by 85%.
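
As a rough worked example of the 85% figure, the sketch below estimates artifact size for an 8-billion-parameter model. It assumes the reduction is measured against the FP16 size (2 bytes per parameter) and uses a nominal parameter count. The estimateCompressedBytes helper is hypothetical, and real artifacts may differ.

```typescript
// FP16 stores each parameter in 2 bytes.
const BYTES_PER_FP16_PARAM = 2;

// Estimate artifact size in bytes after a fractional size reduction.
function estimateCompressedBytes(parameterCount: number, reduction: number): number {
  if (reduction < 0 || reduction >= 1) {
    throw new RangeError('reduction must be in [0, 1)');
  }
  const fp16Bytes = parameterCount * BYTES_PER_FP16_PARAM;
  return Math.round(fp16Bytes * (1 - reduction));
}

// 8e9 parameters: 16 GB in FP16, about 2.4 GB after an 85% reduction.
const compressedBytes = estimateCompressedBytes(8e9, 0.85);
console.log((compressedBytes / 1e9).toFixed(1) + ' GB'); // prints "2.4 GB"
```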

Step 3: Replace MLModel / OrtSession

Remove your MLModel or OrtSession instantiations. Initialize the QuantizedEngine as shown in Section 2.