Guile Optimization Gotchas: There Is No Free Beer, Only Cheap

By Artyom Bologov

Guile (as both the Scheme dialect and the compiler) is well-optimized. Language constructs are meaningful and type-specific. Tail recursive functions tend to reduce to a speedy iteration-like code, making most Scheme algorithms extremely fast. But there still are things one has to optimize by hand.

This post goes through what I encountered optimizing a heavily-numeric piece of code. In case you need more details: it was a kinship matrix computation, which means

Gigantic 2D arrays of individuals and genotypes and operations on them.
Cleaning up these enormous chunks of data.
Interacting with multiple unique file formats.
And making a lot of calls to C libraries.

I can't say my code is lightning-fast. But I optimized it in many places and can share some things I learned doing it. Some are more obvious, some less.

Bottom-Up Profiling and Unknowns #

This section might be relatively obvious to those using Guile ,profile for long enough. I'm including this gotcha anyway because it was not that intuitive when I first used the profiler. So feel free to skip to the next section if you're already comfortable reading profiler output.

Here's a small piece of profiler output (right from the top of it) for those who continued reading:

%     cumulative   self
time   seconds     seconds  procedure
 12.93      6.43      6.18  %read-line
 12.41      5.94      5.94  anon #x117f408
 ...
  9.01      4.31      4.31  substring
  5.27      2.52      2.52  sizeof
  4.59      3.58      2.20  <current input>:1398:0:string->num/nan
  4.42      2.11      2.11  string->number

Excerpt from the profiler output

It's visible that I/O, string operations, FFI (sic,) and anon #x117f408 (???) are the biggest performance offenders. That's why the post is centered around these categories.

But before we get to the categories themselves, here's a bit of a profiler use hint: Look for answers at the bottom of the output rather than at the top. Why? Because there are two types of time measurements in the profiler: cumulative seconds and self seconds. Here's another profiler output snippet (this time from the bottom) to show the dissimilarity:

  %        cumulative self
  time     seconds    seconds  procedure
  ...
  0.00     47.83      0.00  <current input>:1567:9
  0.00     47.67      0.00  <current input>:1550:0:kmain
  0.00     31.32      0.00  <current input>:1417:0:geno.txt->lmdb
  0.00     24.40      0.00  <current input>:468:0:call-with-env-and-txn
  0.00     24.32      0.00  <current input>:398:0:call-with-cursor
  0.00     17.16      0.00  <current input>:1363:0:read-separated-lines
  0.00     17.16      0.00  ice-9/ports.scm:429:0:call-with-port
  0.00     10.25      0.00  <current input>:1520:0:lmdb->genotypes-mtx
  0.00     10.25      0.00  <current input>:412:0:for-cursor
  0.00      9.60      0.00  <current input>:1530:8
  0.00      8.46      0.00  <current input>:1458:0:cleanup-vector

Profiler output for the performance-offending entry points

The striking difference with the first listing is that "self seconds" are zero for all of these. While the "cumulative seconds" are compounding to enormous numbers. So here's the difference between these two categories:

cumulative seconds: Are how much time passed between the function invocation and its termination.
self seconds: Are how much time the actual code inside the function spent.

That's why these functions have zero self seconds—they don't do anything heavy, only calling other functions. That's why these functions have so many cumulative seconds—they take a lot of time to complete. They are long due to the functions called inside.

When cumulative seconds do not help

Cumulative seconds measurements don't always show meaningful information. In particular, any named let recursion is likely to end up with surreal amount of cumulative seconds:

0.22  69958.15      0.03  <current input>:1894:13:read-lines

Named let as the nastiest performance drain? No, just a quirk of the profiler.

This difference between the two types of seconds leads us to a vital optimization strategy: Trace the execution path from the bottom to the top in the profiler output. Seeing which functions have the most cumulative seconds helps to narrow the scope. While looking at self seconds helps optimizing the bottlenecks exactly where they appear.

Disclaimer: yes, I know there's ,trace in Guile. The problem with many Guile APIs (including ,trace) is their un-customizable verbosity. I'm running an operation for every line in the 17000-row table file. For every column in this 2000-column row. Which is likely to destroy the REPL with the sheer amount of trace output. It actually did the last time I tried, hanging Emacs dead after 600000 lines of output. So profiling is my only option, it seems.

Anonymous Offenders: anon #x117f408 #

Now this is a tough one I'm not always decyphering. Some functions are anonymous. Lambdas, as we all know them. Raw lambdas are hard to track: they only display as <current input>:1530:8 in profiler output. So count the lines of code or guess the source location based on the named functions next to the lambda.

One strategy to fight this un-profileability of lambdas is to... name them. This way, profiler listings are properly annotated. But yeah, naming every anonymous function is quite an oxymoron. So it might be worth naming things while debugging, and anonymizing them back once done.

The secret of anon #x117f408 still stands. And I don't have an answer to it, sadly. It's not a lambda somewhere in my code. It's something (is it a lambda? or a function even?) somewhere else. You can tolerate it if it's not a performance hit. In this case, it is, so I'll get to locating it (in this bottom-up fashion) somehow sometime.

Now the domain-specific gotchas:

Conses Are Not Free #

What's the asymptotic complexity of the linked list length algorithm? What's the big-O of referencing an arbitrary list element? O(n). Every time you use length or list-ref, one C programmer dies somewhere in the world.

That's why I'm calling them "conses" here: we must remember that lists are head+tail pairs. Saying "list" implies integrity and abstractness. What's so bad about referencing something in a list? It's a linear data structure like vector, after all...

This might seem obvious, but I've been bitten by it: if you use list-ref—refactor it. list-ref is likely to end up the longest-running piece of your code. Because it walks down the cons (sic) chain to get the N-th element. Every time you want to get an element.

Conses Are Cheap, Though #

A fast alternative to list-ref is cdr-iteration. In cases you need to traverse a list, use good old tail recursion. This pattern is almost zero-cost.

(let rec ((l lst))
  (if (null? l)
      null-value
      (do-something (car l)
                    (rec (cdr l)))))

The most optimized (?) pattern Scheme has

Yes, recursion might get naughty in the rare cases when it's not used right (see below). Consider using do and iteration in such cases.

But do+list-ref is an absolute no-go! Saying this because I committed this mistake. It cost me more than 30 seconds of runtime. There's only so much load conses (sic) can sustain.

Recursion Is Not Free... But #

I've listed the pattern above the way I did for a reason. It certainly looks off to anyone worrying about deeply nested data and stack blow-up. So I might be advised to rewrite it in accumulator+reverse style. I tried:

(let read-lines ((line (first (%read-line port)))
                 (acc '()))
  (if (eof-object? line)
      (reverse acc)
      (read-lines (first (%read-line port))
                  (cons (string-separate line) acc))))

An attempt at making accumulator-styled iteration

This piece of code made the application an order of magnitude slower. Consing to accumulator and reversing it is more expensive time-wise, while cheaper stack-space-wise. (Thanks to u/raevnos on Reddit, I now know of destructive reverse!, which makes this idiom faster, but not necessarily fast enough.) So you have to assess the risks and decide:

Running or relatively flat and "wide" data (like me)?: Use the (do-something (rec ...)) pattern to save on function calls and data shuffling.
Running on deeply nested and "narrow" data?: Use the (rec ... (do-something acc)) pattern to ensure stack integrity and personal sanity.

Input/Output Is Not Free #

Working with multiple file formats means inevitably re-inventing the parser. And Guile is well-prepared for parsing. There are:

Delimited reading with (ice-9 rdelim).
PEG parser generation: (ice-9 peg).
Per-character reading (re-inventing state machines.)
And raw line/string parsing.

I ended up using the latter because my data was line-oriented. PEG grammars might end up faster on bigger pieces of data. Delimited reading might work better for non-line-oriented cases. And writing a custom parser might be the last resort for the desperate ones. But in my case, %read-line+strings (see below for caveats) were the optimal effort/performance ratio compared to e.g. state machines. Even so, %read-line, the lowest level (as per the name) I/O function in Guile... Ended up as another performance hit.

Might be the best option to make an extension to Guile itself. In C. Optimizing the bottlenecks like input and string parsing. Speaking of...

Strings Are Not Free #

Coming from C, it's easy to discard strings as lightweight. Processing strings is just a matter of byte iteration until there's null byte, right?

Not in Guile—Guile strings are encoding-aware and bound-checked. Which incurs a significant performance cost when dealing with lots of strings. And it's hard to convert Guile strings to C strings, because encodings.

I had this requirement: split the string at spaces, tabs, and commas and discard empty strings out of the resulting list. Here's a rating for how my multiple approaches fared:

Super slow: string-tokenize—it requires a charset of allowed characters to end up in tokens. And that list is huge in case you want anything but separators.
Slow: string-split + string-null?. Generates a lot of strings to later be discarded.
Quite fast: Hand-rolling a string-iterating tokenizer, extracting substring-s for tokens.
Segfault: Using strtok from C standard library.

I'll get to the point on segfaults and other FFI fun in a second, but here's a takeaway on strings: don't be afraid to optimize your narrow use case with handmade implementation. There's no standard utility to suit all the possible uses. And try bytevectors too—these are much closer to raw bytes C programmers dream of!

FFI Is Not Free #

One of the functions you can see in the slowest functions profiler printed is sizeof. A regular Guile function to get the size of the C type. It's a compile-time construct in C, so it should be fast in Guile too, right?

The hard truth is: Guile FFI is not instantaneous. Every foreign function call takes time. C function might terminate in a split second, but there will be an overhead—wrapping values and all.

And referencing C functions takes time too. Applying foreign-library-function result to arguments every time you call a wrapper function is wasteful. So I pre-defined some internal functions to avoid runtime symbol resolution:

;; Before:
(define* (del ...)
 ((foreign-library-function ...) ...))
;; After:
(define %del (foreign-library-function ...))
(define* (del ...)
  (%del ...))

Defining internal C-side function before actually using it

Another pain in debugging code using FFI is... that it's not debuggable at all. If something goes wrong, Scheme is no longer able to help you. C function segfaults, killing the whole process. And if you use Emacs with Geiser (you likely do), there's no output indicating an error. The session quietly terminates with "It's been nice interacting with you!" Which is not even a bug in Guile or Geiser. It's a sad reality of trusting FFI—it's non-instantaneous and dangerous.

As for the strtok segfault, it was due to iconv failures converting Scheme strings to C and back. Interestingly enough, once I (supposedly) fixed this on C side, Scheme started complaining about decoding strings too. So strtok is not an option, I guess. And it wasn't speeding thing up that much due to FFI call overhead.

GDB and Guile

You can attach GDB to Guile process to see the backtrace and other info when it fails:

$ gdb -q -p GUILE-PID
continue # To make Guile run as it was before GDB

Attaching GDB to a Scheme REPL process by ID

It's just that in my case the problem wasn't feasibly fixable even with GDB info.

FFI Is Cheap, Though #

This roundtrip between Scheme and C means that calling library functions is slow. If the function in question is some element getter, then any performance benefits C brings—Scheme eats up with the FFI overhead. Here's a bit of code that killed the performance of vector operations:

(let* ((vec (vec:alloc individuals 0))
       (bv (pointer->bytevector
            (mdb:val-data data) (mdb:val-size data)))
       (actual-elements (/ (mdb:val-size data) (sizeof double))))
  (do ((idx 0 (1+ idx)))
      ((= idx actual-elements))
    (vec:set!
     vec idx
     (bytevector-ieee-double-native-ref bv (* idx (sizeof double)))))
  ...)

Getting the data from DB and putting it into numeric vectors

Even though I'm using bytevectors and GSL-provided setter, the data is too overwhelming. Neither of these is fast enough:

bytevectors,
parse-c-struct,
or pointer-address and make-pointer arithmetics.

What helps in this bind is... FFI. Again, my case was copying rows of double numbers from the database (LMDB) to numeric vectors (from GNU Scientific Library). A lucky coincidence was that both are written in C and store the row/vector contents as raw C double arrays. So I spun up memcpy and copied the memory region from DB to vector. Directly.

(define memcpy
  (foreign-library-function
   #f "memcpy"
   #:return-type '*
   #:arg-types (list '* '* size_t)))

(let* ((vec (vec:alloc individuals 0)))
  (memcpy (vec:ptr vec 0) (mdb:val-data data) (mdb:val-size data))
  ...)

Using memcpy to copy data between two libraries

That's a dangerous operation, but the stability of LMDB and GSL guarantees that it'll work for a long enough time (for everyone to forget about my code?)

So the message about FFI is ambivalent:

Be wary of using foreign functions if you need many calls or safety guarantees.
But be aware that you often have no choice but to use FFI for some massive and costly operation.

Takeaways #

An attempt at summarizing the points above:

Conses are not random access.
I/O is complicated.
Strings are heavy-weight.
FFI is not zero-cost.

That's a mildly depressing yet relatively obvious list. And it's more of a joke summarizing the wrong-ish things to take away. So I hope that you learned slightly more than the points above reading my post. Good luck in making your Guile code lightning-fast with what you've learned!

Update May 2024: Another useful technique for debugging FFI code with Geiser. Geiser doesn't show any output in case of error or when it dies from a segfault. It does show error messages unconditionally, though. So put (error ...) everywhere and see where it throws and where it doesn't. If it does, then the code before it is working. If it doesn't, then something is wrong.

Update Aug 2024: Do check out Dave Thompson's post on Optimizing Guile Scheme, there's a lot of more structured advice there! While my post is a series of random gotchas for FFI-heavy code, his is a perfect tutorial for optimization.