Julia Helps To Bridge The Gap Between User and Creator

julia

Published

October 23, 2020

You might have heard about Julia, the language often praised for the C-like performance it can attain while keeping a clean syntax reminiscent of Python. In this blog post, I want to share a different opinion why I like using Julia, which is only tangentially related to its pure performance. It is about the community Julia enables and how that could have a beneficial influence on the way scientific software is written.

First off, I am not a trained computer scientist. Almost nobody in my research fields, psychology and neuroscience, is a trained computer scientist. But everybody needs to code nowadays to do research. Experiments are setup with code, data is analysed with code, graphs are made with code. So how does that work out if nobody is really trained for coding?

It leads to a situation where people waste time and often produce inferior results because they don’t really grasp the tools they are using. Code is often of low quality, neither version controlled, maintainable nor reproducible. Matlab, R and Python ecosystems offer tools that allow researchers to focus on their direct interest, the data, and spend less time fighting with the compilers and complicated syntax of C++ and Fortran. But this “convenience layer” contributes to a situation where people use tools without a deeper understanding what they are doing, without an idea of what to do when those tools are not enough. We have powerful packages with relatively user-friendly API’s at our disposal in each dynamic language of choice. But the important and performant parts are written in C, C++ and Fortran, hidden from view.

It is not easy to find out how things work under the hood in Python, R or Matlab. I think that this creates a big gap between users and creators of scientific tools.

Students in a university R or Python data analysis course will probably learn a bit about loops and conditionals first, because that’s just how everyone learning to code starts. But after that first phase, they will quickly move on to learning APIs of packages like dplyr, ggplot, pandas, because that’s what’s actually used by everybody. These API’s tend to be quite removed from the basic building blocks of each language (like the common but non-standard %>% syntax in R). If those tools don’t offer something out of the ordinary as a pre-packaged functionality, the students are out of luck. They know that they shouldn’t attempt to write any serious low-level analysis methods in R or Python directly, because that will probably be slow, and their tools of choice are also not made like that. One of the first things Python, R and Matlab novices learn is “Vectorize everything! Don’t write for-loops, they are slow!”. How surprised would they be to find out that the inside of pandas, dplyr and co is full of for-loops, because they are indispensable building blocks in compiled languages?

One such example of the boundaries of existing packages I’ve encountered in my previous work was when I analysed head movement data with Python’s pandas library. I really wanted a column in my dataframe that had one rotation matrix per row, describing head orientation over time. But that was not possible to do effectively because a rotation matrix is just not a data type that pandas expects you to store in a column. At every corner my “weird” datatype caused problems with underlying assumptions of pandas, or numpy or matplotlib. In Julia, it would have been really easy to make a Vector with 3x3 matrices from StaticArrays.jl. DataFrames.jl doesn’t care what your column vectors contain. To me, the point was not even to have the fastest solution, just to have a solution that cleanly expressed my intent.

The two-language problem is often presented as an inconvenience to researchers because of its time cost. I think it is a bit more than that. True, it takes a lot of time to figure out a solution to a problem in a dynamic language and then transfer it faithfully to a compiled language, to write bug-free bindings and package everything up for reuse. Julia tries to solve that problem, and it does very well to bridge the gap between a glue language with simple syntax and a serious numerical powerhouse. Countless benchmarks can attest to its speed. But when we focus only at how much time it takes to use two languages, I think we overlook what kind of effect a language gap has on research communities in this age of code.

The Julia community is filled with people from diverse scientific backgrounds. Many of them, like me, are not computer scientists. That doesn’t stop them from being involved in writing serious low level packages. And if they are not writing packages themselves, they are often helping by filing issues and creating PRs, adding their own perspectives on design questions. They do this even for the Julia language itself, if they find bugs or API inconsistencies. When I was using Matlab, Python and R, I didn’t see other researchers contribute to the fundamentals of their respective ecosystems in this way. In Julia, I see it all the time.

This is possible, in my opinion, because there is a continuous path from surface-level glue code to close-to-the-metal high-performance code in Julia, which can be discovered almost playfully - usually driven by the desire to reduce the number of allocations or runtime of a small function. In Julia, novices can learn first principles in a beginner friendly way, without caring about types, writing code that looks basically like Python. These first principles don’t lose their importance when serious packages are discovered. They instead become ever more powerful, the more knowledge a new user absorbs, because they can be combined in more and more flexible and innovative ways. As another example, if you use Stan from R, you have to feed it a script in a different language, while your Turing.jl models can be written in normal Julia. There’s really no limit to what you can send through Bayesian inference this way. Additionally, advanced topics like metaprogramming and optimization are always only a few steps away, and interesting lessons about one’s own code or the inner workings of Julia can be learned just by applying a couple of macros such as @code_warntype here and there. A transformation from beginner to expert code sometimes goes only through a couple minor changes like adding @inbounds in strategical places, or minimizing the use of allocations.

For example, with Revise.jl and the @edit macro, it’s quite simple to manipulate even Base functions on the fly, and play around with different implementations of Julia’s fundamental building blocks. The multiple dispatch paradigm makes it possible to inject functionality deep into third party code and to connect one’s homegrown implementations with the work of others in a way that I have never seen in Python, R or Matlab. This is not brittle tampering like, e.g., monkeypatching in Python, but allows you to meaningfully extend the available functionality if you want to. Packages are specifically written to be extensible by others, each dispatch presents an opportunity for third parties to hook into. One person might have a small idea, but through Julia’s ability for composition, it can easily become part of something bigger (like the often cited Differential Equations + Measurements + Plot Recipe combo).

I think Julia’s gentle learning curve and raw processing power are an invitation to domain experts outside of computer science to help write the software their fields need. Personally, it makes me feel more powerful and self-sufficient to work in a dynamic language that can not only give access to other people’s well written packages (written in Julia, but also R, Python and Matlab through RCall.jl, PyCall.jl and MATLAB.jl, respectively). It also allows me to write my own, and not just for toy problems but theoretically scalable even to supercomputers.

I hope that in the future Julia loses the perception of being a niche language for numerics, which might discourage people who don’t need just that from trying it. Instead, I would frame it as a powerful general programming language that offers something for everyone from beginners to experts and encourages collaboration and code reuse through its extensible multiple dispatch paradigm. If researchers write more algorithms in high-level Julia and not in low-level C++, this could make them more accessible for novices and easier to check for other researchers, because Julia shares the same “pseudo-code” qualities of Python (which doesn’t actually look so clean anymore if everything is prefixed by np.). The Julia community also highly values CI and good test coverage and I hope that such things become more mainstream in the future, because every scientist that works with code can benefit from adopting these best practices.

I wrote this post because I felt these aspects tend to be under-represented when Julia is discussed. Sometimes, the focus on micro-benchmarks and the resulting exaggerated or one-sided claims of superior performance expectedly cause users of other languages to go into defense mode. That’s not necessary in my opinion. Yes, Python R and Matlab also have amazing fast packages that enable people to do great research. But these languages are also fundamentally limited in how low-level users can reach, and they don’t allow for the same freedom to create beautiful and generic implementations that still harness all available computing power. They don’t allow scientists to bridge the gap from user to creator quite as well. And bridging that gap holds a lot of potential, especially in today’s world where everything is about code.