Inference platforms that put machine learning where the data is generated — cameras, radios, sensors — with deterministic latency and predictable power budgets.
The GPUs that win on training throughput often lose on real-world inference — too much idle compute at batch-size-one, too much latency jitter, too much wall power for the deployment site. Over the last few years the stack has split, and so has the hardware. These four shifts drive our silicon choices.
Real workloads don’t batch. A camera sees one frame. A radio sees one burst. A translator sees one utterance. Architectures tuned for 1000-sample batches waste 90% of their compute when you pin the batch to one — which is where FPGAs and inference-dedicated ASICs pull ahead.
INT8 is now the production default. INT4 and binary kernels are routine for vision and signal workloads where post-training quantization gives away <2% accuracy. Hardware that can’t run low-precision densely gets replaced.
Dedicated inference ASICs — datacenter-grade tensor cores through milliwatt edge NPUs — now bracket the design space between general-purpose CPUs and FPGAs. We deploy across all three, and we pick by workload, not vendor allegiance.
When latency has to be bounded to microseconds — tick-to-trade, real-time control, vision pipelines — FPGA remains the only platform that delivers. It’s also the platform where custom operators (HLS kernels) close the final 2× performance gap over stock frameworks.
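One way to make "latency jitter" concrete is to time single-input calls and compare the tail with the median. The sketch below is a host-side illustration only: `dummy_infer` is a hypothetical stand-in for a real runtime call, the warm-up and run counts are arbitrary assumptions, and a Python timer cannot resolve the microsecond bounds an FPGA pipeline would be held to.

```python
import statistics
import time


def measure_latency_us(infer, sample, warmup=50, runs=1000):
    """Call infer(sample) one input at a time; return per-call latencies in microseconds."""
    for _ in range(warmup):              # let caches and clocks settle before timing
        infer(sample)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        infer(sample)
        latencies.append((time.perf_counter_ns() - start) / 1_000)
    return latencies


def summarize(latencies_us):
    """Report median, tail and the tail-to-median ratio as a simple jitter indicator."""
    ordered = sorted(latencies_us)
    p50 = statistics.median(ordered)
    # Integer arithmetic avoids floating-point surprises when picking the index.
    p99 = ordered[min(len(ordered) - 1, (99 * len(ordered)) // 100)]
    ratio = p99 / p50 if p50 > 0 else float("inf")
    return {"p50_us": p50, "p99_us": p99, "max_us": ordered[-1], "p99_over_p50": ratio}


def dummy_infer(x):
    # Hypothetical placeholder for a real inference call; it only burns a little CPU time.
    return sum(v * v for v in x)


if __name__ == "__main__":
    print(summarize(measure_latency_us(dummy_infer, list(range(256)))))
```

A `p99_over_p50` close to 1 indicates predictable latency; a large ratio is the jitter described above.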
Pick your framework. Pick your target. We’ll handle the conversion, quantization and deployment plumbing.
| Framework | FPGA (Kintex/UltraScale+) | ZYNQ MPSoC | Jetson-class | x86 accelerator |
|---|---|---|---|---|
| PyTorch ≥ 1.12 | ONNX → Vitis AI | Vitis AI | TensorRT | OpenVINO / TRT |
| TensorFlow 2.x | TF → ONNX → Vitis | Vitis AI | TF-TRT | OpenVINO / TF |
| ONNX | Native | Native | Native | Native |
| Custom (C / RTL) | Hand-tuned HLS | HLS + PS | CUDA kernels | Intrinsics |
| Hugging Face models | Case-by-case | Quantized | TRT-LLM | OpenVINO |
We pick the silicon around your latency, power and BOM envelope — not the other way round.
ZYNQ UltraScale+ or Jetson-class modules for translators, smart cameras and portable scanners with tight power budgets (<15 W typical).
Kintex / Alveo-style cards that slot into standard 1U/2U servers for data-center pipelines at line rate (10-100 GbE ingress).
1U / 2U servers with multi-FPGA fabric for telco, satellite and sensor analytics, turnkey or OEM-branded.
Most customer models come in as float32 PyTorch checkpoints. These are the four transformations that close the latency and power gap on real silicon.
A typical FPGA deployment runs INT8; the most power-constrained edge deployments go to INT4. We measure the accuracy delta on your validation set before committing: usually a 0.5-2% drop in exchange for a 3-4× speedup.
Channel- and block-level pruning reduces the MAC count. Unlike unstructured pruning, it removes whole channels or blocks, so the savings turn into real speedups on FPGA and GPU.
Conv+BN+ReLU fused into single kernels; redundant transposes eliminated. Typical 1.5-2× throughput improvement with zero accuracy change.
For workloads where stock operators leave performance on the table — e.g. JESD framers, transcoding kernels, or custom attention variants — we write hand-tuned HLS that slots into the graph.
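The operator-fusion step above can be illustrated with batch-norm folding. This is a minimal pure-Python sketch, not a toolchain's implementation: it assumes a 1×1 convolution evaluated at a single pixel, inference-mode batch norm with fixed running statistics, and made-up weights. The fused and unfused paths agree up to floating-point rounding.

```python
import math


def conv1x1(x, weight, bias):
    """1x1 convolution at a single pixel: one dot product per output channel."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch norm using fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]


def relu(y):
    return [max(0.0, v) for v in y]


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN scale and shift into the conv weights and bias, per output channel."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, s in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(s + eps)
        folded_w.append([w * scale for w in row])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


# Illustrative values only (2 output channels, 3 input channels).
weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, -0.5]

unfused = relu(batchnorm(conv1x1(x, weight, bias), gamma, beta, mean, var))
fused_w, fused_b = fold_bn(weight, bias, gamma, beta, mean, var)
fused = relu(conv1x1(x, fused_w, fused_b))   # one pass instead of conv, then BN, then ReLU

assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(unfused, fused))
```

Because the folding happens once at compile time, the batch-norm arithmetic disappears from the runtime graph entirely.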
Every model we deploy goes through the same 5-step flow. Average time from handoff to deployable artifact: 4-8 weeks.
Profile your model, pick target silicon, set accuracy/latency budgets.
PyTorch / TF → ONNX. Validate that the exported graph reproduces the source model's outputs bit-accurately.
PTQ with calibration data; accuracy delta report before commit.
Target-specific compile (Vitis AI / TRT / OpenVINO); micro-benchmarks.
Ship model + runtime + OTA update mechanism; acceptance tests.
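The quantization step in this flow can be sketched as symmetric per-tensor post-training quantization. The example below is an assumption-laden illustration, not the production toolchain: it uses synthetic Gaussian values in place of real calibration data, a simple max-magnitude calibration rule, and reconstruction error as a stand-in for the task-accuracy delta that would be measured on a validation set.

```python
import random


def calibrate_scale(calibration_values, num_bits=8):
    """Symmetric per-tensor scale: map the largest observed magnitude to the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer level (ties go to even) and clamp to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    return [q * scale for q in q_values]


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Synthetic stand-ins for calibration and validation tensors (assumption).
rng = random.Random(0)
calibration = [rng.gauss(0.0, 1.0) for _ in range(2000)]
held_out = [rng.gauss(0.0, 1.0) for _ in range(500)]

errors = {}
for bits in (8, 4):
    scale = calibrate_scale(calibration, bits)
    restored = dequantize(quantize(held_out, scale, bits), scale)
    errors[bits] = mean_abs_error(held_out, restored)
    print(f"INT{bits}: scale={scale:.5f}  mean abs error={errors[bits]:.5f}")
```

The INT4 error comes out noticeably larger than the INT8 error, which is why the delta is measured on real data before a precision is chosen.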
These are families we’ve shipped to production across customer deployments. Outside this list, most PyTorch / ONNX models land with standard toolchain flow.
You don’t always need a full platform build. Some customers come in with a trained model and an impossible power budget. Others have tried an off-the-shelf PCIe card and run out of headroom. Talk to us at any of these five points.
Most common entry point. You own the model and accuracy target; we own the path from PyTorch / TensorFlow checkpoint to a shipping runtime on the target silicon. Typical engagement: 6-10 weeks to a deployable artifact.
Before committing hardware BOM, we build a scoped PoC on representative silicon (FPGA dev-kit, edge SoC or PCIe card) with real inference on your data. Written report documents latency, accuracy and power on your actual workload.
Thermal, physical, cost, certification, supply chain — there are many reasons a standard accelerator card is the wrong answer. We scope custom carrier boards, ruggedized modules or embedded integration instead of forcing the square peg.
Radiation-tolerant, conduction-cooled, fanless, secure-boot, MIL-STD, marine-certified — the shape of the compute has to match the shape of the product. That’s where we do our best work.
Often the harder half of the problem. We own the full signal path — sensor / radio / camera ingress, timing, pre-processing, format conversion — not just the tensor core that sits at the end.
Our customers deploy across a wide range of form factors, from a surveillance camera on a street corner to a special-purpose appliance inside a telco data center. The three archetypes below cover the majority of engagements.
Everything from a surveillance camera to a telephone-pole radio to a networking closet inside an office building. Fanless, thermally bounded, often power-over-Ethernet, always with an OTA path and field-calibration hooks.
Our partners handle the conventional server build-out; we come in for special-purpose devices: inline inference for telco traffic, radar or sensor-fusion analytics appliances, custom PCIe accelerators where no stock card meets the spec.
For some partners we build and supply the development platforms used to evaluate their silicon or IP: complete boards, firmware, host drivers and a demo stack that reviewers can plug in and run on day one.
Talk to our Singapore engineering team about your RF, FPGA/DSP, or AI inference project. We'll help you pick the right silicon and ship on time.