Guile (as both the Scheme dialect and the compiler) is well-optimized. Language constructs are meaningful and type-specific. Tail recursive functions tend to reduce to a speedy iteration-like code, making most Scheme algorithms extremely fast. But there still are things one has to optimize by hand.
This post goes through what I encountered optimizing a heavily-numeric piece of code. In case you need more details: it was a kinship matrix computation, which means
I can’t say my code is lightning-fast. But I optimized it in many places and can share some things I learned doing it. Some are more obvious, some less.
This section might be relatively obvious to those using Guile
Here’s a small piece of profiler output (right from the top of it) for those who continued reading:
% cumulative self time seconds seconds procedure 12.93 6.43 6.18 %read-line 12.41 5.94 5.94 anon #x117f408 ... 9.01 4.31 4.31 substring 5.27 2.52 2.52 sizeof 4.59 3.58 2.20 <current input>:1398:0:string->num/nan 4.42 2.11 2.11 string->number
It’s visible that I/O, string operations, FFI (sic,) and
But before we get to the categories themselves, here’s a bit of a profiler use hint: Look for answers at the bottom of the output rather than at the top. Why? Because there are two types of time measurements in the profiler: cumulative seconds and self seconds. Here’s another profiler output snippet (this time from the bottom) to show the dissimilarity:
% cumulative self time seconds seconds procedure ... 0.00 47.83 0.00 <current input>:1567:9 0.00 47.67 0.00 <current input>:1550:0:kmain 0.00 31.32 0.00 <current input>:1417:0:geno.txt->lmdb 0.00 24.40 0.00 <current input>:468:0:call-with-env-and-txn 0.00 24.32 0.00 <current input>:398:0:call-with-cursor 0.00 17.16 0.00 <current input>:1363:0:read-separated-lines 0.00 17.16 0.00 ice-9/ports.scm:429:0:call-with-port 0.00 10.25 0.00 <current input>:1520:0:lmdb->genotypes-mtx 0.00 10.25 0.00 <current input>:412:0:for-cursor 0.00 9.60 0.00 <current input>:1530:8 0.00 8.46 0.00 <current input>:1458:0:cleanup-vector
The striking difference with the first listing is that “self seconds” are zero for all of these. While the “cumulative seconds” are compounding to enormous numbers. So here’s the difference between these two categories:
That’s why these functions have zero self seconds—they don’t do anything heavy, only calling other functions.
That’s why these functions have so many cumulative seconds—they take a lot of time to complete.
They are long due to the functions called inside.
Cumulative seconds measurements don’t always show meaningful information.
In particular, any named let recursion is likely to end up with surreal amount of cumulative seconds:
0.22 69958.15 0.03 <current input>:1894:13:read-lines
This difference between the two types of seconds leads us to a vital optimization strategy: Trace the execution path from the bottom to the top in the profiler output. Seeing which functions have the most cumulative seconds helps to narrow the scope. While looking at self seconds helps optimizing the bottlenecks exactly where they appear.
Disclaimer: yes, I know there’s
Now this is a tough one I’m not always decyphering.
Some functions are anonymous.
Lambdas, as we all know them.
Raw lambdas are hard to track: they only display as
One strategy to fight this un-profileability of lambdas is to... name them. This way, profiler listings are properly annotated. But yeah, naming every anonymous function is quite an oxymoron. So it might be worth naming things while debugging, and anonymizing them back once done.
The secret of
Now the domain-specific gotchas:
What’s the asymptotic complexity of the linked list length algorithm?
What’s the big-O of referencing an arbitrary list element?
That’s why I’m calling them “conses” here: we must remember that lists are head+tail pairs.
Saying “list” implies integrity and abstractness.
What’s so bad about referencing something in a list?
It’s a linear data structure like vector, after all...
This might seem obvious, but I’ve been bitten by it:
if you use
A fast alternative to
Yes, recursion might get naughty in the rare cases when it’s not used right (see below).
Consider using
But
I’ve listed the pattern above the way I did for a reason.
It certainly looks off to anyone worrying about deeply nested data and stack blow-up.
So I might be advised to rewrite it in accumulator+reverse style.
I tried:
This piece of code made the application an order of magnitude slower.
Consing to accumulator and reversing it is more expensive time-wise, while cheaper stack-space-wise.
(Thanks to
u/raevnos on Reddit,
I now know of destructive
Working with multiple file formats means inevitably re-inventing the parser.
And Guile is well-prepared for parsing.
There are:
I ended up using the latter because my data was line-oriented.
PEG grammars might end up faster on bigger pieces of data.
Delimited reading might work better for non-line-oriented cases.
And writing a custom parser might be the last resort for the desperate ones.
But in my case,
Might be the best option to make
an extension to Guile itself.
In C.
Optimizing the bottlenecks like input and string parsing.
Speaking of...
Coming from C, it’s easy to discard strings as lightweight.
Processing strings is just a matter of byte iteration until there’s null byte, right?
Not in Guile—Guile strings are encoding-aware and bound-checked.
Which incurs a significant performance cost when dealing with lots of strings.
And it’s hard to convert Guile strings to C strings, because encodings.
I had this requirement: split the string at spaces, tabs, and commas and discard empty strings out of the resulting list.
Here’s a rating for how my multiple approaches fared:
I’ll get to the point on segfaults and other FFI fun in a second, but here’s a takeaway on strings:
don’t be afraid to optimize your narrow use case with handmade implementation.
There’s no standard utility to suit all the possible uses.
And try bytevectors too—these are much closer to raw bytes C programmers dream of!
One of the functions you can see in the slowest functions profiler printed is
The hard truth is: Guile FFI is not instantaneous.
Every foreign function call takes time.
C function might terminate in a split second, but there will be an overhead—wrapping values and all.
And referencing C functions takes time too.
Applying
Another pain in debugging code using FFI is... that it’s not debuggable at all.
If something goes wrong, Scheme is no longer able to help you.
C function segfaults, killing the whole process.
And if you use Emacs with Geiser (you likely do), there’s no output indicating an error.
The session quietly terminates with “It’s been nice interacting with you!”
Which is not even a bug in Guile or Geiser.
It’s a sad reality of trusting FFI—it’s non-instantaneous and dangerous.
As for the
You can attach GDB to Guile process to see the backtrace and other info when it fails:
It’s just that in my case the problem wasn’t feasibly fixable even with GDB info.
This roundtrip between Scheme and C means that calling library functions is slow.
If the function in question is some element getter, then any performance benefits C brings—Scheme eats up with the FFI overhead.
Here’s a bit of code that killed the performance of vector operations:
Even though I’m using bytevectors and GSL-provided setter, the data is too overwhelming.
Neither of these is fast enough:
What helps in this bind is... FFI.
Again, my case was copying rows of double numbers from the database (LMDB) to numeric vectors (from GNU Scientific Library).
A lucky coincidence was that both are written in C and store the row/vector contents as raw C double arrays.
So I spun up
That’s a dangerous operation, but the stability of LMDB and GSL guarantees that it’ll work for a long enough time
(for everyone to forget about my code?)
So the message about FFI is ambivalent:
An attempt at summarizing the points above:
That’s a mildly depressing yet relatively obvious list.
And it’s more of a joke summarizing the wrong-ish things to take away.
So I hope that you learned slightly more than the points above reading my post.
Good luck in making your Guile code lightning-fast with what you’ve learned!
Update May 2024: Another useful technique for debugging FFI code with Geiser.
Geiser doesn’t show any output in case of error or when it dies from a segfault.
It does show error messages unconditionally, though.
So put
Update Aug 2024: Do check out Dave Thompson’s post on
Optimizing Guile Scheme,
there’s a lot of more structured advice there!
While my post is a series of random gotchas for FFI-heavy code,
his is a perfect tutorial for optimization.
Conses Are Cheap, Though
(let rec ((l lst))
(if (null? l)
null-value
(do-something (car l)
(rec (cdr l)))))
Recursion Is Not Free... But
(let read-lines ((line (first (%read-line port)))
(acc '()))
(if (eof-object? line)
(reverse acc)
(read-lines (first (%read-line port))
(cons (string-separate line) acc))))
Input/Output Is Not Free
Delimited reading with
Strings Are Not Free
Super slow:
FFI Is Not Free
;; Before:
(define* (del ...)
((foreign-library-function ...) ...))
;; After:
(define %del (foreign-library-function ...))
(define* (del ...)
(%del ...))
$ gdb -q -p GUILE-PID
continue # To make Guile run as it was before GDB
FFI Is Cheap, Though
(let* ((vec (vec:alloc individuals 0))
(bv (pointer->bytevector
(mdb:val-data data) (mdb:val-size data)))
(actual-elements (/ (mdb:val-size data) (sizeof double))))
(do ((idx 0 (1+ idx)))
((= idx actual-elements))
(vec:set!
vec idx
(bytevector-ieee-double-native-ref bv (* idx (sizeof double)))))
...)
bytevectors,
(define memcpy
(foreign-library-function
#f "memcpy"
#:return-type '*
#:arg-types (list '* '* size_t)))
(let* ((vec (vec:alloc individuals 0)))
(memcpy (vec:ptr vec 0) (mdb:val-data data) (mdb:val-size data))
...)
Be wary of using foreign functions if you need many calls or safety guarantees.
Takeaways
Conses are not random access.