Coffee Compiler Club June 20, 2025

Ben Seidel works for a company that makes ERP software for mining companies, and he has to do a lot of tedious code refactoring with no supporting tools. He would like to be able to use pattern-matching rewrite rules on the abstract syntax of the code he is working on rather than manually cutting, pasting and editing it all. The problem is that it has to work across C#, JavaScript, TypeScript and Java. A start would be to have grammars for these languages, so that you can parse the code, but you also need a way for the system to understand the build environment in terms of directory structure, .jar files, JavaScript modules and all the bizarre rules that turn up in Cross-Origin Resource Sharing, not to mention the Java runtime.
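To make the idea of a rewrite rule on abstract syntax concrete, here is a minimal sketch using Python's standard `ast` module. It is only an illustration in a single language, not the cross-language tool Ben is after, and the names `RenameCall` and `rewrite` are made up for this example. The rule matches calls to one function name and rewrites them to another, leaving strings and comments that merely contain the name untouched.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rewrite rule: calls to `old_name(...)` become calls to `new_name(...)`."""

    def __init__(self, old_name: str, new_name: str) -> None:
        self.old_name = old_name
        self.new_name = new_name

    def visit_Call(self, node: ast.Call) -> ast.AST:
        # Rewrite nested calls first, then test this node against the pattern.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            node.func = ast.copy_location(
                ast.Name(id=self.new_name, ctx=ast.Load()), node.func
            )
        return node


def rewrite(source: str, old_name: str, new_name: str) -> str:
    """Parse the source, apply the rule, and print the tree back out as code."""
    tree = ast.parse(source)
    tree = RenameCall(old_name, new_name).visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # ast.unparse needs Python 3.9 or later
```

Because the match is on tree nodes rather than text, `rewrite("print('old_fn')", "old_fn", "new_fn")` leaves the string literal alone. Note that `ast.unparse` discards comments and original formatting, which is one reason real refactoring tools tend to work on concrete syntax trees instead.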

Cliff says "Welcome to the wonderful world of language and compiler implementation. There's a lot of reasons why things suck." He has been working on semantics of module languages for external function compilation and has invented a whole new set of module-naming and filesystem rules for his Simple language, and he is still trying to figure out the rules for visibility and things.

The problem is that in many languages these issues are not actually dictated by the language specification and are left up to the implementation, so you have multiple schemes for this, even in a language like Standard ML, which actually has a Basis specification that the MLton developers use (see http://mlton.org/MLBasis for an explanation). In C, these things typically end up being defined by the build system and communicated to the compiler using command-line parameters. In the old days, that was just a Makefile, but then it became a nightmare of auto-generated configure scripts. Then CMake appeared, so for these languages you need to be able to parse and analyse CMake specifications in order to support the kind of refactoring work that Ben is trying to deal with.

Aaron Goldman is thinking about structured logging. You really want to be able to write some sort of abstract syntax into logs, rather than just strings of characters, so that you can use smarter compression and analysis tools. The data that is logged may not fit any pre-determined schema, but it should still have enough "metadata" to enable it to be easily parsed and analysed by the machine itself. Cliff says that this is what H2O categorical encoding is about. Aaron would also like to be able to generate new data during the processing of such logs. Cliff says this is Data Oriented Processing. In a process calculus, logs are just events that don't have pre-determined actions. The log records may only be read months or years later.
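As a rough illustration of logging structure rather than strings, the sketch below writes each event as one self-describing JSON object per line, so a reader can recover field names and types without a pre-agreed schema. The function names `log_event` and `read_events` and the field names are assumptions made for this example, not anything proposed in the discussion.

```python
import io
import json
import time


def log_event(stream, event: str, **fields) -> None:
    """Write one event as a single JSON line; the keys act as the metadata."""
    record = {"ts": time.time(), "event": event, **fields}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(stream):
    """Yield each logged record back as a dict, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


# Usage: an in-memory stream stands in for a log file.
buf = io.StringIO()
log_event(buf, "request", service="billing", latency_ms=12)
log_event(buf, "error", service="billing", code=500)
buf.seek(0)
events = list(read_events(buf))
```

A later analysis can then filter on fields, for example `[e for e in events if e["event"] == "error"]`, instead of matching regular expressions against free text.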

Aaron wants to know what the standard library of a programming language should contain to make structured logging a more reasonable default than just logging strings, because in most companies logs start out as just strings and then after a few years people realise that it would be nice to have had some structure encoded in there, but by that point there are gigabytes of strings being logged per second, presumably from code all over the place that now needs to be refactored. Cliff says logging is just bad and a waste of CPU time, and the debugging rewards are minuscule compared to the overheads in management and maintenance, tracking all the shit that it generates. Aaron talks about using SQL databases for logging. Cliff says that the way he did this in H2O was by using a Java class as a schema object, and presumably just serializing it to the log records, but if the class later changed then you couldn't go back and reread those logs. But logs aren't just for functional debugging; they're also needed for security incident response, which could become critical months or even years later.
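One common way around the "class changed, old logs unreadable" problem is to store a schema version in every record and upgrade old records on read. The sketch below shows that idea only; it is not how H2O did it, and the version numbers, the field rename from `user` to `user_id`, and all function names are hypothetical.

```python
import json

CURRENT_VERSION = 2


def upgrade_v1_to_v2(record: dict) -> dict:
    """Hypothetical schema change: v1 stored "user", v2 stores "user_id"."""
    upgraded = dict(record)
    upgraded["user_id"] = upgraded.pop("user")
    upgraded["schema_version"] = 2
    return upgraded


# One upgrade step per old version; each step moves a record forward by one.
UPGRADERS = {1: upgrade_v1_to_v2}


def read_record(line: str) -> dict:
    """Parse one log line and bring it up to the current schema version."""
    record = json.loads(line)
    # Assumption: records written before versioning existed count as v1.
    version = record.get("schema_version", 1)
    while version < CURRENT_VERSION:
        record = UPGRADERS[version](record)
        version = record["schema_version"]
    return record
```

With this in place, `read_record('{"schema_version": 1, "user": "u42"}')` comes back in the v2 shape, so analysis code only ever has to deal with the current schema.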

Then Aaron explains that these log analyses are being used to discover which internal modules/services depend on which other internal modules, by tracing individual requests through the whole stack in a running system. Cliff says "Geez, I would have thought that would have been labelled in some dependency file somewhere?" But in fact the problem is determining how to prioritize services, and this is an essentially dynamic analysis which depends on the order of real-world events. Then there is a discussion about the choice of data that is logged (29:43). You don't want to write security credentials to logs. (33:00) So you would like a language which allowed you to tag a field on an object as "this is sensitive, don't log it". Go and Java both apparently have such annotations. But in a lot of cases the notion of sensitive data is context-dependent.
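The "tag a field as sensitive" idea can be sketched in Python with dataclass field metadata. This is just one possible shape for such a tag, under the assumption that a static, per-field marker is enough; the helper `sensitive`, the class `LoginAttempt` and the function `loggable` are invented for this example.

```python
from dataclasses import dataclass, field, fields


def sensitive(**kwargs):
    """Declare a dataclass field that must never be written to a log."""
    return field(metadata={"sensitive": True}, **kwargs)


@dataclass
class LoginAttempt:
    username: str
    password: str = sensitive()
    source_ip: str = "unknown"


def loggable(obj) -> dict:
    """Return the object's fields as a dict, redacting the tagged ones."""
    out = {}
    for f in fields(obj):
        if f.metadata.get("sensitive"):
            out[f.name] = "<redacted>"
        else:
            out[f.name] = getattr(obj, f.name)
    return out
```

Calling `loggable(LoginAttempt("alice", "hunter2", "10.0.0.1"))` keeps the username and IP address but replaces the password with `<redacted>`. A fixed tag like this does not capture the context-dependent cases mentioned above, where the same field is sensitive in one setting and harmless in another.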

Then the Java VM class verification process comes up and Cliff asks if one can write a user-defined class annotation and have the loader do user-defined verification on it. I remember that when I was sitting on the street in Cochabamba, around 2017, I spent a while looking at the Java VM class loader specification, and it seemed to me that the whole thing had been designed deliberately to be subvertible, because the class verification mechanism is a class. When I later heard Larry Ellison whingeing about how Google stole "his code" for Android, and ran their own application post-processing, I wondered whether this is because Google (or at least Larry Page) had noticed this. This was from 13 Aug 2013:


Apple Market Cap, 2010-2025

[Embedded video: CBS News]

At 53:00 there's an involved discussion about the problems of resolving dependencies on external modules and problems with versioning. Maven comes up a lot, with mixed opinions: https://mvnrepository.com/. At 1:20:20 there's a discussion about the Go import path URLs so you can automatically download disk wipers:

 

[Embedded video: Low Level]

Mixins have been tabled for discussion. 
