I was working on a story on Data Lakes Unlocked recently, around the time of the great Midwestern comedian Bob Newhart’s passing. I got to thinking: the explosion of big web data created challenges that existing technology could not meet, making room for the data lake, which solved some problems and overlooked others.
Initially, data lakes were perceived as ungoverned repositories into which raw data was thrown in the hope of finding insights later – with about as much luck as I might have had with an arcade treasure-hunt crane. But the Data Lakers refined their approach over the years to include more structure, governance, and metadata creation. That evolution led to the data lakehouse, which combines aspects of both data warehouses and data lakes, and which is being renamed as we speak.
This Newhartian dialog came to me.
What it amounts to is walking through a chain of complexities – the challenges that confront a new version of an old technology. Something like a dialectic. The Apache Iceberg table format is a great new tool, but in some ways it is best seen as an improvement on Hadoop, much as Apache Parquet was, and, before that, Apache Hive.
This is a Bob Newhart homage. I think the audio version is a good way to engage with this content.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yeah, hi, Bill? Yes, this is Jack in IT. The CFO was just down here, and she had some questions about the AI GPU analytics processing bill we have.
Yes. You think you have a handle on it?
And so what is the problem?
You say you need a consistent way to manage and query data, and you say you need ACID compliance. Well, it sounds kind of difficult …
To deal with schema evolution?
Well, I know there are a lot of things on your plate – that’s, that’s quite a lot of problems you got there. Go on, I’m sorry.
And, oh, you found a solution? And what’s the solution? Apache Iceberg. Okay!
Bill, why do they call it Iceberg?
You say it’s a fitting metaphor for a vast amount of hidden data.
You know, Bill, if it costs us too much the data maybe can just stay hid.
Okay. Well, how much is saving a lot of time and money going to cost us?
You say the table commits add up to a lot of small files. But that’s okay, because you’re going to mitigate that with compaction, partitioning, and write optimization. Okay.
And you’re going to do data modeling. This time for sure!
Bill, we are on your side. I’m coming down there with the accountants – but we have to know how much this will cost us.
You say you are working remotely from the Cape?
I guess I’ll fire up Zoom.
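~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the record, the things Bill is promising are real Iceberg features. Here is a minimal sketch in PySpark of the moves he describes – creating a table, evolving its schema, and compacting the small files. The catalog and table names (demo, db.events) and the local warehouse path are hypothetical stand-ins, and the snippet assumes the Iceberg Spark runtime jar is on the classpath.

```python
from pyspark.sql import SparkSession

# A local Hadoop catalog for illustration only; production setups would
# point at a real metastore or REST catalog.
spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# An Iceberg table with hidden partitioning on the event timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution is a metadata-only change -- no rewriting of data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN gpu_hours DOUBLE")

# The small-files problem Bill owns up to: frequent commits produce many
# small files, which Iceberg's compaction procedure bin-packs into fewer,
# larger ones.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events', strategy => 'binpack')
""")
```

The ALTER TABLE is Iceberg’s answer to the schema evolution worry, and rewrite_data_files is the compaction Bill is counting on to keep those table commits from adding up – though, as Jack would remind him, somebody still has to schedule it and pay for the compute.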