Engineer Your Data Before it Engineers You

Teck Wu — 4/4/2023 — 4 Min Read

The urgency for data engineering and maintenance in your organisation

Excerpted from Kode Vicious's (George V. Neville-Neil) post on Data Engineering and Data Maintenance on ACM Queue

In a world with so many rules and regulations governing how a company handles user data, it is hard not to think about the costs of getting your data engineering and data maintenance wrong. How do I protect my users' data from potential exfiltration? How can I protect my users' data from a company breach? As a company scales, it collects more data, builds more products, extends its data lifecycle, and grows its employee base in size and complexity. The possible vectors for data leakage can then become too many for any company to handle: a sort of debt, akin to software's technical debt, that accumulates gradually over the years until it becomes too big to ignore.

To understand how to approach Data Engineering and Data Maintenance, it is necessary to go all the way back and see whom they involve and how things came to be this way. We have now come to a place in computing history where data has significant value and significant risk, in equal measure. A quick, drunken wobble down memory lane shows us that the trajectory of engineering effort in computing has changed significantly over the last 70 years.

We went from a situation in the 1950s and 1960s in which hardware was the dominant cost as well as the main focus of our efforts, to the rise of software in the latter part of the 20th century, to the rise of data in the early 21st century. Why is this? Moore's law has a lot to answer for here, as does the human inability to throw stuff away once we've collected it. The old adage, "Data expands to fill the space available for storage", which is actually a corollary of Parkinson's Law, "Work expands so as to fill the time available for its completion", has been true ever since we had the ability to store data. As time progressed, software came to dominate the cost of systems: computers became cheaper and more powerful, so we could write larger and more complex programs, which then became systems of programs and then distributed systems of programs. All this increasing complexity forced us to find solutions to the Software Crisis.

It is also not just the amount of data we're storing: it's the relationships among the data. The relationships drive the complexity, just as the explosion of libraries and packages used in modern software drives up the complexity and cost of software systems.

It's well past the time when everyone who even thinks about collecting data, user or otherwise, must first think seriously about Data Engineering and Data Maintenance, because the costs of getting it wrong are far too high, both monetarily and societally. However, go-to-market pressure means startups see code first and data second, unless their real go-to-market plan is to get one of the FAANG companies to buy them for the value of that data. Even then they're more like vacuum cleaners, sucking up everything they can get hold of, with little concern for its safety or its future value and risk. And even when some companies start out down the right path, they usually fail at Data Maintenance, just as companies fail at software maintenance.

Now that data has surpassed the majority of software in size and complexity, it's time to make Data Engineering and Data Maintenance first-class topics of study. To do anything else simply invites us to make the same mistakes, and to put people, as well as our companies, at risk.

George V. Neville-Neil - author of The Kollected Kode Vicious

George V. Neville-Neil is the author of The Kollected Kode Vicious and co-author of The Design and Implementation of the FreeBSD Operating System. He also gives seminars on a variety of programming-related topics. Code spelunking, operating systems, networking, and time protocols are some of his favourite areas of interest. He has been the columnist better known as Kode Vicious for more than 20 years. Coincidentally (or maybe not), he is one of the very first advisors for Borneo, and has been part of our Slack channel from the very beginning, watching us grow and mature. You can read more of George's interesting bytes and tidbits (pun intended) from the last 20+ years, on all things software, in ACM Queue.