SDK Documentation
The Quantized Labs SDK provides low-level hooks for executing .quantized artifacts on edge devices. It exposes high-level inference APIs while transparently managing underlying NPU/GPU orchestration, asymmetric entropy routing, and phase-state decapsulation.
1. Installation & Ecosystem Integrations
The Quantized Labs SDK natively supports WebAssembly (JS), iOS (Swift), and Android (Kotlin) targets.
Node.js / React Native / WebAssembly
npm install @quantized-labs/sdk
iOS (Swift Package Manager)
.package(url: "https://github.com/quantized-labs/quantized_labs-swift.git", from: "1.2.0")
Android (Gradle / Kotlin)
implementation("com.quantizedlabs.engine:engine:1.2.0")
2. Native Initialization & Hardware Hooks
The engine automatically detects available hardware accelerators. You can force a specific hardware target across all native wrappers.
TypeScript (Node.js & React Native)
import { QuantizedEngine } from '@quantized-labs/sdk';
const engine = new QuantizedEngine({
apiKey: 'ql_prod_your_api_key_here',
computeTarget: 'auto', // Auto-detects NPU -> GPU -> CPU
memoryLimitMB: 1024
});
await engine.loadModel('./weights/llama3.1-8b.quantized');
iOS (Swift)
import QuantizedEngine
// Swift bridging directly targets the Apple Neural Engine (ANE)
let engine = try QuantizedEngine(config: QuantizedLabsConfig(
apiKey: "ql_prod_your_api_key_here",
computeTarget: .appleNeuralEngine,
memoryLimitMB: 1024
))
try await engine.loadModel(from: Bundle.main.url(forResource: "llama3-8b", withExtension: "quantized")!)
Android (Kotlin)
import com.quantizedlabs.engine.QuantizedEngine
// Kotlin JNI directly targets Snapdragon Hexagon DSP
val engine = QuantizedEngine(
apiKey = "ql_prod_your_api_key_here",
computeTarget = ComputeTarget.SNAPDRAGON_NPU,
memoryLimitMB = 1024
)
engine.loadModel(assets.open("llama3-8b.quantized"))
Supported Hardware Targets
| Target Flag | Description | Minimum Requirement |
|---|---|---|
| apple_neural_engine | Routes matrix multiplication exclusively to the ANE via CoreML buffers. | A14 Bionic / M1 |
| snapdragon_npu | Utilizes Qualcomm Hexagon Tensor Accelerators. | Snapdragon 8 Gen 2 |
| cuda | Desktop/Server fallback via NVIDIA CUDA. | Compute Capability 7.0+ |
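The 'auto' setting in Section 2 is described as probing NPU, then GPU, then CPU. The sketch below illustrates that priority order only; it is not the SDK's detection logic. `pickComputeTarget`, `TargetFlag`, and `PRIORITY` are hypothetical names, and the flag strings are taken from the table above plus the 'cpu' value mentioned in Section 4.

```typescript
// Hypothetical helper: not part of the SDK. It only illustrates the
// NPU -> GPU -> CPU priority order that 'auto' is described as following.
type TargetFlag = 'apple_neural_engine' | 'snapdragon_npu' | 'cuda' | 'cpu';

// Highest-priority targets first: NPUs, then the GPU path, then CPU.
const PRIORITY: readonly TargetFlag[] = [
  'apple_neural_engine',
  'snapdragon_npu',
  'cuda',
  'cpu',
];

// Returns the first target in priority order that the device reports as
// available, falling back to 'cpu' when no accelerator is present.
function pickComputeTarget(available: ReadonlySet<TargetFlag>): TargetFlag {
  for (const target of PRIORITY) {
    if (available.has(target)) return target;
  }
  return 'cpu';
}
```

Keeping the order in a single array means a new accelerator can be slotted in without touching the selection loop.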
3. API Reference
Performs standard blocking inference. Returns the complete generated string once the stop token is reached or max tokens are generated.
Returns an async generator that yields tokens as they are decoded from the asymmetric pathways. Essential for UI responsiveness.
Returns live hardware metrics including instantaneous Tokens-Per-Second (TPS), VRAM allocation, and thermal sensor readings.
Gracefully destroys the execution context and frees all associated RAM/VRAM buffers. Must be called to prevent memory leaks in Single Page Applications.
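The streaming and teardown behavior described above could be consumed roughly as follows. This is a sketch against a mock engine so that it runs without the SDK: `StreamingEngine`, `MockEngine`, `generateStream`, `destroy`, and `streamToString` are all hypothetical names, because the reference here describes the calls without naming them.

```typescript
// Hypothetical interface: the method names below are placeholders chosen for
// this sketch, since the reference above describes behavior only.
interface StreamingEngine {
  generateStream(prompt: string): AsyncGenerator<string>;
  destroy(): Promise<void>;
}

// Stand-in engine that yields a fixed token sequence, so the sketch runs
// without the SDK or any model file.
class MockEngine implements StreamingEngine {
  destroyed = false;

  async *generateStream(_prompt: string): AsyncGenerator<string> {
    for (const token of ['Edge', ' inference', ' works']) {
      yield token;
    }
  }

  async destroy(): Promise<void> {
    this.destroyed = true;
  }
}

// Consumes tokens as they arrive and always releases the engine afterwards,
// even if the stream throws midway.
async function streamToString(
  engine: StreamingEngine,
  prompt: string,
  onToken: (token: string) => void,
): Promise<string> {
  let text = '';
  try {
    for await (const token of engine.generateStream(prompt)) {
      onToken(token); // e.g. append the token to the UI
      text += token;
    }
  } finally {
    await engine.destroy();
  }
  return text;
}
```

Placing the teardown in `finally` ties buffer release to the end of the stream, which matches the requirement that the context be destroyed to avoid leaks.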
4. Error Handling
The SDK throws specific error codes during initialization or execution failures. Always wrap inference calls in a try/catch block.
| Error Code | Reason | Resolution |
|---|---|---|
| ERR_NPU_UNAVAILABLE | The requested computeTarget could not be accessed. | Fall back to 'auto' or 'cpu'. |
| ERR_OOM_RESERVE | Host device has insufficient contiguous memory. | Decrease memoryLimitMB during initialization or free up OS memory. |
| ERR_CHECKSUM_MISMATCH | The .quantized file is corrupted or has been tampered with. | Re-download the artifact from the MCaaS portal. |
| ERR_THERMAL_THROTTLE | Host device temperature exceeded safe limits. | Wait for the device to cool; generation is halted automatically to prevent hardware damage. |
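The fallback suggested for ERR_NPU_UNAVAILABLE can be sketched as a small wrapper. This assumes thrown errors carry a string `code` property matching the table, which the source does not confirm; `Loader`, `hasCode`, and `loadWithFallback` are hypothetical names, and the loader stands in for whatever call initializes the engine and loads the model.

```typescript
// Hypothetical loader signature: initializes an engine on the given target
// and resolves with some handle (a string here, to keep the sketch small).
type Loader = (computeTarget: string) => Promise<string>;

// Assumption: SDK errors expose a string `code` matching the table above.
function hasCode(err: unknown): err is { code: string } {
  return (
    typeof err === 'object' &&
    err !== null &&
    typeof (err as { code?: unknown }).code === 'string'
  );
}

// Tries the preferred target first, then retries once with 'auto' if the
// accelerator is unavailable. Any other error is rethrown unchanged.
async function loadWithFallback(load: Loader, preferred: string): Promise<string> {
  try {
    return await load(preferred);
  } catch (err) {
    if (hasCode(err) && err.code === 'ERR_NPU_UNAVAILABLE') {
      return await load('auto');
    }
    throw err;
  }
}
```

Only the unavailable-accelerator case is retried; errors such as ERR_CHECKSUM_MISMATCH propagate, since retrying on a different target would not fix a corrupted artifact.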
5. Cognitive Module (LoRA) Training
You can fine-tune specific behaviors using Cognitive Modules. Our compiler requires standard HuggingFace PEFT adapters, which are then compiled into .lora files that can be hot-swapped into the running base .quantized engine.
import { QuantizedLabsCompiler } from '@quantized-labs/quantized_labs-compiler';
// Convert a standard HuggingFace PEFT adapter
const compiler = new QuantizedLabsCompiler();
await compiler.compileLora({
inputAdapterPath: './hf_lora_weights/',
baseModelFormat: 'llama3-8b',
outputArtifact: './medical_expert.lora'
});
// Hot-swap the module at runtime in the SDK
await engine.applyModule('./medical_expert.lora');
6. CI/CD Deployment Hooks
To pull the latest compressed models automatically in your GitHub Actions or GitLab pipelines for Over-The-Air (OTA) app updates, use the CLI.
# Add this to your CI/CD pipeline script
npm install -g @quantized-labs/quantized_labs-cli
# Authenticate using your Enterprise API key
quantized_labs auth --token "$QUANTIZED_LABS_API_KEY"
# Pull the latest compressed production payload directly into your build folder
quantized_labs pull project_49a8f2 --env production --out ./android/app/src/main/assets/models/
7. Advanced SDK Usage
Memory Pinning
To achieve zero-copy inference, you can pin memory buffers directly in the OS kernel. Pinned buffers bypass the garbage collector, and the OS cannot page them out.
// Pin 2GB of contiguous memory
engine.pinMemory({
sizeMB: 2048,
lockToNPU: true,
preventPaging: true
});
Core Isolation (Android Only)
On big.LITTLE architectures, you can bind the Quantized Labs engine exclusively to the high-performance cores, preventing thread migration latency.
val engine = QuantizedEngine(
computeTarget = ComputeTarget.SNAPDRAGON_NPU,
threadAffinity = intArrayOf(4, 5, 6, 7) // Bind to Gold cores
)
8. Migrating from CoreML / ONNX
If you are currently using CoreML or ONNX Runtime for edge LLMs, you are likely experiencing high RAM overhead and thermal throttling. Quantized Labs is a drop-in replacement.
You do not need to export to ONNX or CoreML. Simply take your base PyTorch or Safetensors model from HuggingFace.
Use the Quantized Labs Compiler CLI to process the FP16 model. The compiler will run Asymmetric Entropy Routing to shrink the model by 85%.
Remove your MLModel or OrtSession instantiations. Initialize the QuantizedEngine as shown in Section 2.
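To put the 85% figure in concrete terms, here is a back-of-the-envelope estimate. The 85% reduction comes from the text above; the parameter count of exactly 8 billion is an assumption inferred from the `llama3-8b` name, and `fp16SizeGB` and `compressedSizeGB` are hypothetical helper names, not SDK functions.

```typescript
// Back-of-the-envelope estimate only. Assumes exactly 8e9 parameters and
// FP16 storage (2 bytes per parameter); real checkpoints differ slightly.
const BYTES_PER_FP16_PARAM = 2;

// Size of an FP16 checkpoint in decimal gigabytes (1 GB = 1e9 bytes).
function fp16SizeGB(paramCount: number): number {
  return (paramCount * BYTES_PER_FP16_PARAM) / 1e9;
}

// Size after removing `reductionPercent` percent of the original footprint.
function compressedSizeGB(originalGB: number, reductionPercent: number): number {
  return (originalGB * (100 - reductionPercent)) / 100;
}

const originalGB = fp16SizeGB(8e9);
const compressedGB = compressedSizeGB(originalGB, 85);
console.log(`${originalGB} GB -> ${compressedGB.toFixed(1)} GB`);
```

Under these assumptions a 16 GB FP16 checkpoint would come out at about 2.4 GB, which is above the memoryLimitMB value of 1024 used in the Section 2 examples.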