Ark Compiler v0.4.1

Write scalar.
Execute parallel.

Ark is a compute-native language that treats tensors, cost, and distributed state as first-class primitives. You write clear intent — the compiler produces efficient kernels and deterministic dispatch.

  • Types that prove shape: Tensor<T, N>, with dims checked at compile time.
  • Cost-aware compile: VRAM tracked, so builds fail fast before deployment.
  • Portable targets: PTX / WASM / SPIR-V; one source, many backends.
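A hypothetical sketch of how a shape mismatch might surface at compile time (the function, values, and error text here are illustrative, using the Tensor<T, N> syntax shown on this page):

// scale expects a rank-2 tensor
fn[gpu] scale(a: Tensor<f32, 2>, s: f32) -> Tensor<f32, 2> {
  return a * s;
}

let x: Tensor<f32, 3> = ...;  // a rank-3 tensor
let y = scale(x, 2.0);
// error: expected Tensor<f32, 2>, found Tensor<f32, 3>
// (caught at compile time, before any kernel runs)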
The hard way (CUDA/C++)
Manual grids, pointers, copies, and launch params.
__global__ void matrixMul(float *A, float *B, float *C, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0.0f;
  if (row < N && col < N) {
    for (int i = 0; i < N; i++) {
      sum += A[row * N + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
  }
}
// + 100 lines of memory management...
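And the kernel is only half of it; the omitted host code looks roughly like this (a sketch with error checking elided):

float *dA, *dB, *dC;
size_t bytes = (size_t)N * N * sizeof(float);
// Allocate device buffers and copy inputs over.
cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
// Pick launch params by hand, round the grid up to cover N.
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
matrixMul<<<grid, block>>>(dA, dB, dC, N);
// Copy the result back and release everything.
cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
cudaFree(dA); cudaFree(dB); cudaFree(dC);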
The Ark way
Write intent; compiler handles tiling + scheduling.
// Matrix multiplication
fn[gpu] matmul(a: Tensor<f32, 2>, b: Tensor<f32, 2>) -> Tensor<f32, 2> {
  // Ark handles tiling, memory layout, and kernel dispatch.
  return a @ b;
}

Forget malloc and cudaMemcpy.

High-performance kernels shouldn’t require hand-rolled pointer arithmetic and launch tuning. Ark keeps the code simple, while the compiler and runtime handle placement and scheduling.

  • Zero-cost tensor abstractions (no manual launch params)
  • Compile-time verification of shape + constraints
  • Deterministic kernels across heterogeneous GPUs
  • Placement hints are explicit and readable
Runtime placement is a one-liner
Keep your function pure. When you call it, add a runtime hint — a preset or a concrete target.
Preset
let y = matmul(a, b) @runtime preset("prod");
Explicit target
let y = matmul(a, b) @runtime { target: "gpu:0" };
Same call-site ergonomics, with placement controlled by policy instead of rewriting kernels.

Implicit parallelism

Write code as if it runs sequentially. The compiler analyzes dependencies and parallelizes operations across GPU cores.

  • Dependency analysis
  • Auto-tiling + fusion
  • Predictable scheduling
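For example (an illustrative sketch in the same style as the matmul above; the function and names are hypothetical):

fn[gpu] branches(x: Tensor<f32, 2>, w1: Tensor<f32, 2>, w2: Tensor<f32, 2>) -> Tensor<f32, 2> {
  let u = x @ w1;  // no data dependency between u and v,
  let v = x @ w2;  // so the compiler may dispatch them concurrently
  return u + v;    // join point: both must complete here
}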

Resource aware

The type system and compiler track VRAM usage and constraints. If your workload won't fit, you find out before dispatch.

  • VRAM estimation
  • Constraint propagation
  • Fail-fast deployment
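Illustratively (the diagnostic format and budget below are invented, not actual Ark output): three 32768×32768 f32 matrices occupy 4 GiB each, so a matmul over them needs roughly 12 GiB and cannot fit an 8 GiB card:

error: estimated peak VRAM (12.0 GiB) exceeds budget of target "gpu:0" (8.0 GiB)
  hint: reduce tile size, lower precision, or choose a larger target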

Hardware agnostic

Target CUDA, ROCm, and future backends from one codebase. Ark serves as a universal IR with deterministic lowering.

  • Multi-backend lowering
  • Stable IR
  • Portable artifacts
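A hypothetical command-line sketch (the ark CLI and its flags are illustrative, not documented):

$ ark build matmul.ark --target ptx    # NVIDIA
$ ark build matmul.ark --target hip    # AMD
$ ark build matmul.ark --target metal  # Apple

Same source file; only the lowering target changes.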

The toolchain

A pipeline built for correctness first, then ruthless optimization.

Ark source (.ark) → Frontend (HIR): type check + borrow check → Optimizer (MIR): fusion, tiling, layout → Backends: PTX (NVIDIA), HIP (AMD), Metal (macOS)
  • HIR: semantic correctness, shape inference, borrow rules.
  • MIR: optimization passes, cost model, backend selection.
  • Backends: native codegen for PTX (NVIDIA), HIP (AMD), and Metal (macOS).

Ready to port your kernels?

Install the compiler, run the tour, then deploy your first kernel to the grid.