tinytorch, torchlang, ariel | Deep Learning in R

A Linux-only libtorch wrapper, an R-side torch compiler, and a Triton backend.

Posted by Troy Hernandez on Wed, Apr 15, 2026

I’m announcing three packages today: tinytorch, torchlang, and ariel. They address different parts of the ’torch in R’ story, so let me tell that story before explaining the packages.

libtorch

Ten years ago you could tell who was cosplaying as a data scientist when they would post silly things like, “You can’t use R in production! It’s slow! Python is fast!” The reality is that most fast code called in Python is just a wrapper around C/C++. It’s the same for R (plus a healthy dose of Fortran). Deep learning obviously requires fast code. So it is no surprise that under the hood of PyTorch is C++. That C++ backend is bundled as libtorch. So you’d expect R’s torch is just a thin wrapper around that C++, right? Wrong!

libtorch is compiled on Windows using Microsoft Visual C++ (MSVC). But running C++ in R on Windows requires Rtools, which compiles with MinGW. These are not ABI-compatible. So you cannot safely link MinGW-built code (Rtools) against MSVC-built libtorch! And then there’s Apple’s Clang! This is where the lantern piece of R’s torch comes in. It’s a C shim that keeps all these compiler/OS combinations playing nicely with each other. This is why R’s torch package can ship on Linux, macOS, and Windows. CPU and CUDA. It is, to put it lightly, a non-trivial effort. Every libtorch release is a small epic.

I should know. I went through that epic with the libtorch 2.8 bump. It’s why the cornball-ai/torch fork exists. I wanted some libtorch 2.8 functionality (float8) in R and set Claude Code to the task of doing the bump. The bump added over 200k lines of code and docs. Even with CLI agents, it still took a good amount of time, testing, and expertise from package author Daniel Falbel 🙏 to get it shipped. Having vibe-coded a few other forthcoming R packages that wrap C++, I thought there should be an easier way.

tinytorch: a Linux-only C++ wrapper

So what does the package look like if you target one operating system?

tinytorch is the answer. It’s R bindings to libtorch with a torch-compatible API and exactly one dependency: Rcpp. No lantern, no cross-platform build matrix. A configure script finds libtorch on your system and links against it. That’s it.
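Since tinytorch aims for a torch-compatible API, usage should look familiar. A minimal sketch, assuming tinytorch exports the same `torch_*` constructors and ops as R's torch (the exact exports are an assumption on my part, not confirmed here):

```r
# Hypothetical sketch of a torch-compatible API on Linux.
library(tinytorch)

x <- torch_randn(64, 128)   # CPU tensor backed by your system libtorch
w <- torch_randn(128, 32)
y <- torch_relu(torch_matmul(x, w))
dim(y)                      # 64 x 32
```

If your existing torch code sticks to tensor ops like these, switching the `library()` call should be most of the migration.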

torch is the right design choice for the R community. With Daniel’s herculean effort, your R torch code will run on any OS.

tinytorch is the right choice if you only care about Linux and you want a simple binding layer or one that runs faster. It’s the right choice for me when I want to try something quick and dirty from a newer libtorch release. I couldn’t do that while simultaneously servicing three operating systems.

TorchScript and torch.compile

PyTorch has had TorchScript since 2018. You’d decorate a model with @torch.jit.script (or trace it with torch.jit.trace), the compiler would walk it, and you’d get a serialized graph you could run without Python. It’s how I was able to port Stable Diffusion to R last year. That graph also unlocked optimizations: operator fusion, dead code elimination, etc. The downside was apparently that writing TorchScript wasn’t very ‘Pythonic’. So TorchScript is now being deprecated.
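This Python-independence is exactly what made the Stable Diffusion port possible: R's torch can load a serialized TorchScript module directly via `jit_load()`. A sketch (the path and input shape are placeholders):

```r
library(torch)

# Load a TorchScript module exported from Python with
# torch.jit.script or torch.jit.trace. Path is a placeholder.
module <- jit_load("unet_scripted.pt")

# Script modules are callable like regular R torch modules.
out <- module(torch_randn(1, 4, 64, 64))
```

No Python runtime is involved at inference time; the serialized graph carries everything.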

PyTorch’s direction is now torch.compile, a two-stage pipeline. First, TorchDynamo intercepts CPython bytecode as your model runs and captures the tensor ops into an FX graph. When it hits something it can’t represent (data-dependent control flow, a print, a weird Python object), it does a “graph break”, compiles what it has so far, drops back to the interpreter for the messy bit, then starts a new graph. Dynamo is also a herculean effort. There’s a whole team at Meta on it because Python’s dynamic typing and monkey-patching mean you can’t know what’s a tensor op until runtime, which is why the abstract syntax tree (AST) alone isn’t enough and you need the bytecode interception. The output is one or more FX graphs of pure PyTorch ops with Python control flow stitched around them.

Second, TorchInductor takes those FX graphs and turns them into actual kernels. It does the hard work: operator fusion (so adjacent pointwise ops become one kernel instead of paying memory bandwidth round-trips), scheduling, memory planning, and codegen. On GPU it emits Triton source code, which the Triton compiler then lowers to PTX. On CPU it emits C++ with OpenMP. The user just writes standard PyTorch and the optimizations happen.
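To make the fusion payoff concrete, here is the shape of the problem in R torch terms (a sketch; the `device = "cuda"` line assumes a CUDA build):

```r
library(torch)
x <- torch_randn(1024, 1024, device = "cuda")

# Eager mode: three separate kernels, each reading and writing
# the full tensor, so x makes three round-trips through memory.
y <- torch_relu(x * 2 + 1)

# A fusing compiler emits ONE kernel computing relu(x * 2 + 1):
# one read of x, one write of y. For memory-bound pointwise
# chains, that is roughly a 3x cut in memory traffic.
```

Nothing here is about doing less arithmetic; the win is purely in memory bandwidth and kernel-launch overhead.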

The downside of this for the rest of us is that when TorchScript goes away, optimized torch code will live only in Python, since the optimization pipeline is now so tightly coupled to CPython via TorchDynamo. My new favorite conspiracy theory is that this is really an elaborate plan for Pythonistas to defend their deep learning moat. Where does that leave R’s torch?

torchlang and ariel

Three weeks ago, I published a love letter to R’s native AST parser utils::getParseData() in the form of a blog post. I described how I used the AST to format R code with my rformat package. I also described how you can use the AST to give CLI agents like Claude Code a map for one or all of your R packages using my saber package. (It can help humans too I guess, but you should intuitively know these things.)
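For anyone who hasn't seen it, `utils::getParseData()` hands you R's full parse tree as a plain data frame, no external tooling required:

```r
# Parse a line of R torch code and inspect its tokens.
# keep.source = TRUE is required for getParseData() to work.
expr <- parse(text = "y <- torch_relu(x$mm(w))", keep.source = TRUE)
pd <- utils::getParseData(expr)

# Terminal tokens with their source text: SYMBOL,
# LEFT_ASSIGN, SYMBOL_FUNCTION_CALL, '$', and so on.
pd[pd$terminal, c("token", "text")]
```

Each row also carries `id` and `parent` columns, which is what lets you reconstruct the tree and walk it programmatically.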

In that post I alluded to a third use of the AST in R. That is for constructing an intermediate representation (IR). Because R code tends to be more functional, we get fewer side-effects, and a much more usable IR from the AST. cornball.ai does not have a team of 20 engineers to build out TorchDynamo for R.

torchlang is the 80/20 version of TorchDynamo for R. It takes the AST of R torch code and emits an IR capturing the same kind of graph that Triton can use for optimizations. The upside is that, unlike the OOP Python code TorchDynamo has to parse, R’s more well-behaved functional code gets us further, faster.
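The structure is illustrative, not torchlang's actual representation, but the kind of IR you can read off the AST of `torch_relu(x$mm(w) + 1)` looks something like:

```r
# Illustrative IR for torch_relu(x$mm(w) + 1): a flat list of
# ops in dependency order, each naming its inputs and output.
# Field names here are assumptions, not torchlang's real schema.
ir <- list(
  list(op = "matmul", inputs = c("x", "w"),  output = "t1"),
  list(op = "add",    inputs = c("t1", "1"), output = "t2"),
  list(op = "relu",   inputs = "t2",         output = "y")
)

# The add -> relu tail is a pointwise chain: a fusion candidate.
```

Because the graph is explicit, spotting fusable pointwise chains is a simple scan rather than a bytecode-interception project.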

ariel is the back end. It takes torchlang’s IR and lowers it to GPU kernels via Triton’s MLIR infrastructure. (Named after the Little Mermaid, aka King Triton’s daughter.) Triton’s MLIR lowering is C++ under the hood, so I had Claude wrap it with Rcpp and, voila, no Python in the pipeline. torchlang makes the fusion decision (a cost model in cost_model.R scores fusion groups by memory traffic saved versus kernel launch overhead) and ariel emits the fused MLIR, plus dedicated kernels for the load-bearing patterns: tiled matmul with epilogue fusion, fused softmax, fused layer norm. That covers the bulk of where torch.compile actually makes transformer code faster, since the elementwise tail of every transformer block is memory-bound and fusion is the whole game there.
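The fusion trade-off the cost model weighs can be sketched as a toy function. All numbers and names below are illustrative assumptions, not ariel's actual cost model:

```r
# Toy fusion score: microseconds saved by fusing n_ops pointwise
# ops over a tensor of n_elements floats. Positive => fuse.
fusion_score <- function(n_elements, n_ops, bytes = 4,
                         bandwidth_gbs = 900, launch_us = 5) {
  # Unfused: every op reads and writes the whole tensor.
  unfused_traffic <- 2 * n_ops * n_elements * bytes
  # Fused: one read of the input, one write of the output.
  fused_traffic <- 2 * n_elements * bytes
  # GB/s -> bytes per microsecond is bandwidth_gbs * 1e3.
  traffic_saved_us <- (unfused_traffic - fused_traffic) /
    (bandwidth_gbs * 1e3)
  launches_saved_us <- (n_ops - 1) * launch_us
  traffic_saved_us + launches_saved_us
}

fusion_score(1e6, 3)  # a 3-op chain over a million elements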

What ariel and torchlang don’t have yet is the rest of TorchInductor’s scheduler: buffer reuse, in-place rewriting, recompute-vs-cache decisions, full memory planning. Those are the next things to build, and they’re where help would matter most.

Both are early and experimental, but they came up in conversation last week so I made them public. Frankly, in my work translating PyTorch AI models to R torch, the optimizations are well-known enough that it was easier and more reliable to just write them in explicitly than to call the torchlang -> ariel pipeline. If you’ve worked on TorchDynamo, R’s metaprogramming, or just have opinions about how a tracer should behave when it hits an if branch on a tensor transformation, please reach out and/or open issues.

XLA

Three weeks ago, an r/rstats Reddit thread asked what doesn’t exist in R that should. The top comment at the time asked for JAX in R. I looked into the JAX project and it seemed like a way to write deep learning code in a more functional style, reminiscent of how I write mathematical and statistical functions in base R. It also looked like an opportunity for another thin wrapper around C++, so I set Claude to work on it. I do this often these days as a way to learn about projects by just building them.

I built rjax in a morning. But about 90% of the way through it, I realized that someone had previously pointed me to XLA work in R. r-xla/anvil is Sebastian Fischer’s more mature project in the same space. I assumed they were doing the same thing I whipped up.

They are not. anvil is a more developed compiler, written in R, that lowers traced code to StableHLO and hands the bytecode to PJRT. rjax binds xla::XlaBuilder, the legacy XLA C++ builder API. OpenXLA has moved to StableHLO as the preferred IR for PJRT and XlaBuilder is now a second-class citizen. No removal date, but no future either. rjax has its own working tracer, jit, and reverse-mode autodiff (the JAX architecture in miniature), but it’s all sitting on a backend layer the XLA team has moved past. anvil targets the right IR and has months of head start.

rjax was sunset on arrival. anvil is the right bet, and I’m going to spend my XLA hours there instead.

Try it

If you’re on Linux and want something from libtorch 2.11, try tinytorch. If you’ve ever wished R’s torch had torch.compile, look at torchlang and ariel, and tell me what’s wrong with them.

# tinytorch
remotes::install_github("cornball-ai/tinytorch")

# torchlang
remotes::install_github("cornball-ai/torchlang")

# ariel
remotes::install_github("cornball-ai/ariel")