Glue Code + Pipeline Jungles. Thought of the Day 25.0

equation

Machine learning researchers tend to develop general purpose solutions as self-contained packages. A wide variety of these are available as open-source packages at places like mloss.org, or from in-house code, proprietary packages, and cloud-based platforms. Using self-contained solutions often results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages.

This glue code design pattern can be costly in the long term, as it tends to freeze a system to the peculiarities of a specific package. General purpose solutions often have different design goals: they seek to provide one learning system to solve many problems, but many practical software systems are highly engineered to apply to one large-scale problem, for which many experimental solutions are sought. While generic systems might make it possible to interchange optimization algorithms, it is quite often refactoring of the construction of the problem space which yields the most benefit to mature systems. The glue code pattern implicitly embeds this construction in supporting code instead of in principally designed components. As a result, the glue code pattern often makes experimentation with other machine learning approaches prohibitively expensive, resulting in an ongoing tax on innovation.

Glue code can be reduced by choosing to re-implement specific algorithms within the broader system architecture. At first, this may seem like a high cost to pay – reimplementing a machine learning package in C++ or Java that is already available in R or matlab, for example, may appear to be a waste of effort. But the resulting system may require dramatically less glue code to integrate in the overall system, be easier to test, be easier to maintain, and be better designed to allow alternate approaches to be plugged in and empirically tested. Problem-specific machine learning code can also be tweaked with problem-specific knowledge that is hard to support in general packages.

As a special case of glue code, pipeline jungles often appear in data preparation. These can evolve organically, as new signals are identified and new information sources added. Without care, the resulting system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output. Managing these pipelines, detecting errors and recovering from failures are all difficult and costly. Testing such pipelines often requires expensive end-to-end integration tests. All of this adds to technical debt of a system and makes further innovation more costly. It’s worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated “research” and “engineering” roles. When machine learning packages are developed in an ivory-tower setting, the resulting packages may appear to be more like black boxes to the teams that actually employ them in practice.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s