Ark Compiler v0.4.1

Write scalar.
Execute parallel.

Ark is a compute-native language that treats tensors, cost, and distributed state as first-class primitives. You write clear intent — the compiler produces efficient kernels and deterministic dispatch.

  • Types that prove shape: Tensor<T, N>, with dims checked at compile time.
  • Cost-aware compile: VRAM tracked, so builds fail fast before deployment.
  • Portable targets: PTX / WASM / SPIR-V; one source, many backends.
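A hypothetical sketch of how a shape mismatch might surface at compile time (the function, values, and error text here are illustrative, using the Tensor<T, N> syntax shown on this page):

// scale expects a rank-2 tensor
fn[gpu] scale(a: Tensor<f32, 2>, s: f32) -> Tensor<f32, 2> {
  return a * s;
}

let x: Tensor<f32, 3> = ...;  // a rank-3 tensor
let y = scale(x, 2.0);
// error: expected Tensor<f32, 2>, found Tensor<f32, 3>
// (caught at compile time, before any kernel runs)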
The hard way (CUDA/C++)
Manual grids, pointers, copies, and launch params.
__global__ void matrixMul(float *A, float *B, float *C, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0.0f;
  if (row < N && col < N) {
    for (int i = 0; i < N; i++) {
      sum += A[row * N + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
  }
}
// + 100 lines of memory management...
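And the kernel is only half of it; the omitted host code looks roughly like this (a sketch with error checking elided):

float *dA, *dB, *dC;
size_t bytes = (size_t)N * N * sizeof(float);
// Allocate device buffers and copy inputs over.
cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
// Pick launch params by hand, round the grid up to cover N.
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
matrixMul<<<grid, block>>>(dA, dB, dC, N);
// Copy the result back and release everything.
cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
cudaFree(dA); cudaFree(dB); cudaFree(dC);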
The Ark way
Write intent; compiler handles tiling + scheduling.
// Matrix multiplication
fn[gpu] matmul(a: Tensor<f32, 2>, b: Tensor<f32, 2>) -> Tensor<f32, 2> {
  // Ark handles tiling, memory layout, and kernel dispatch.
  return a @ b;
}

Forget malloc and cudaMemcpy.

High-performance kernels shouldn’t require hand-rolled pointer arithmetic and launch tuning. Ark keeps the code simple, while the compiler and runtime handle placement and scheduling.

  • Zero-cost tensor abstractions (no manual launch params)
  • Compile-time verification of shape + constraints
  • Deterministic kernels across heterogeneous GPUs
  • Placement hints are explicit and readable
Runtime placement is a one-liner
Keep your function pure. When you call it, add a runtime hint — a preset or a concrete target.
Preset
let y = matmul(a, b) @runtime preset("prod");
Explicit target
let y = matmul(a, b) @runtime { target: "gpu:0" };
Same call-site ergonomics, with placement controlled by policy instead of rewriting kernels.

Implicit parallelism

Write code as if it runs sequentially. The compiler analyzes dependencies and parallelizes operations across GPU cores.

  • Dependency analysis
  • Auto-tiling + fusion
  • Predictable scheduling
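For example (an illustrative sketch in the same style as the matmul above; the function and names are hypothetical):

fn[gpu] branches(x: Tensor<f32, 2>, w1: Tensor<f32, 2>, w2: Tensor<f32, 2>) -> Tensor<f32, 2> {
  let u = x @ w1;  // no data dependency between u and v,
  let v = x @ w2;  // so the compiler may dispatch them concurrently
  return u + v;    // join point: both must complete here
}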

Resource aware

The type system and compiler track VRAM usage and constraints. If your workload won't fit, you find out before dispatch.

  • VRAM estimation
  • Constraint propagation
  • Fail-fast deployment
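Illustratively (the diagnostic format and budget below are invented, not actual Ark output): three 32768×32768 f32 matrices occupy 4 GiB each, so a matmul over them needs roughly 12 GiB and cannot fit an 8 GiB card:

error: estimated peak VRAM (12.0 GiB) exceeds budget of target "gpu:0" (8.0 GiB)
  hint: reduce tile size, lower precision, or choose a larger target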

Hardware agnostic

Target CUDA, ROCm, and future backends from one codebase. Ark serves as a universal IR with deterministic lowering.

  • Multi-backend lowering
  • Stable IR
  • Portable artifacts
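A hypothetical command-line sketch (the ark CLI and its flags are illustrative, not documented):

$ ark build matmul.ark --target ptx    # NVIDIA
$ ark build matmul.ark --target hip    # AMD
$ ark build matmul.ark --target metal  # Apple

Same source file; only the lowering target changes.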

The toolchain

A pipeline built for correctness first, then ruthless optimization.

Ark source (.ark) → Frontend (HIR): type check + borrow check → Optimizer (MIR): fusion, tiling, layout → Backends: PTX (NVIDIA), HIP (AMD), Metal (macOS)
  • HIR: semantic correctness, shape inference, borrow rules.
  • MIR: optimization passes, cost model, backend selection.
  • Backends: native codegen for PTX (NVIDIA), HIP (AMD), and Metal (macOS).

Ready to port your kernels?

Install the compiler, run the tour, then deploy your first kernel to the grid.