Data Lakes and Microsoft Subsurface Conference

Abhay Sri
Jul 28, 2021

My first data science conference was the Microsoft Subsurface Conference, and overall, I can confidently say that I will be attending it next year as well. Before this event, I actually did not know about data lakes. The whole conference focused on cloud data lakes and how we could use them to run data processing and analytics easily.

The first day of the conference was a blur. I had almost no clue what was going on for the first couple of hours, but as I picked up key buzzwords such as data lakes and data warehouses, I started researching them and finally understood what the hosts were trying to communicate. The conference dealt with something I have covered in my blogs before - the storage of data. This data could be raw or processed, and a data lake is basically a centralized store where you don't need to structure the data before saving it. Through various software tools and dashboards, you can run machine learning models, process data, and produce real-time analytics on top of a data lake. Because Microsoft is such a huge company, other large companies were there as well. There were presenters from Azure, Intel, and AWS. After learning so much about data lakes and their alternatives, I would like to give a run-down of the pros and cons of data lakes.

Pros

Schema on Read - While some may consider this a con, data lakes apply a schema when the data is read rather than when it is stored. This makes data lakes more versatile than their traditional counterpart, data warehouses. Data warehouses are schema on write, which means the schema must be defined before the data is loaded.
Data can be of any quality. Data warehouses generally store only highly refined data, whereas data lakes can store data of any quality, even "raw" data.
Versatility - Data can come from anywhere and have any structure. The data can even flow in real-time. Because nothing has to be restructured before it is stored, you can save a lot of time and cost up front.
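
To make the schema-on-read versus schema-on-write difference concrete, here is a minimal sketch in plain Python. It is not how any particular data lake or warehouse product works. The sample records, the `READ_SCHEMA` mapping, and the helper names are all hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw events as they might land in a data lake.
# Nothing checks their shape when they are stored.
RAW_LINES = [
    '{"user": "ana", "amount": "12.50", "country": "US"}',
    '{"user": "bo", "amount": 7}',
    '{"user": "cy", "amount": "n/a", "note": "refund pending"}',
]


def to_float(value):
    """Cast a value to float, returning None when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Schema on read: the schema belongs to the reader and is applied at query time.
READ_SCHEMA = {"user": str, "amount": to_float}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({field: cast(record.get(field)) for field, cast in schema.items()})
    return rows


def write_with_schema(lines, required_fields):
    """Schema on write: reject a record that lacks a required field before storing it."""
    stored = []
    for line in lines:
        record = json.loads(line)
        missing = [field for field in required_fields if field not in record]
        if missing:
            raise ValueError(f"record rejected, missing fields: {missing}")
        stored.append(record)
    return stored


# All three raw records can be read, even though their shapes differ.
rows = read_with_schema(RAW_LINES, READ_SCHEMA)
```

In this sketch, calling `write_with_schema(RAW_LINES, ["user", "amount", "country"])` raises a `ValueError` on the second record, which is roughly the warehouse experience: the data has to fit before it gets in. The read path accepts everything and leaves it to the reader to deal with gaps, such as the unparseable amount that comes back as `None`.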

Cons

Query results often take longer on data lakes than on data warehouses. This is partly because data lakes use low-cost storage, whereas data warehouses use costlier, higher-performance storage, and partly because warehouse data is already structured for querying.
Can create complexity. If there is too much raw data from several sources, data lakes have the potential to become too complex. As a result, data scientists may end up spending a significant chunk of time filtering or cleaning the data.
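
As a rough illustration of that cleaning work, the sketch below merges records from two made-up sources that name and format the same fields differently. The source data, field names, and date formats are assumptions invented for this example, not something shown at the conference.

```python
from datetime import datetime

# Hypothetical raw records from two sources describing the same kind of event.
SOURCE_A = [
    {"email": "Ana@Example.com ", "signup": "2021-07-01"},
    {"email": "", "signup": "2021-07-02"},  # unusable: no email
]
SOURCE_B = [
    {"Email_Address": "bo@example.com", "signup_date": "07/03/2021"},
    {"Email_Address": "ana@example.com", "signup_date": "07/01/2021"},  # duplicate of Ana
]


def normalize(record, email_key, date_key, date_format):
    """Map one raw record onto a common shape, or return None if it is unusable."""
    email = record.get(email_key, "").strip().lower()
    if not email:
        return None
    signup = datetime.strptime(record[date_key], date_format).date()
    return {"email": email, "signup": signup.isoformat()}


def clean(sources):
    """Normalize every source, dropping unusable rows and repeated emails."""
    seen = set()
    cleaned = []
    for records, email_key, date_key, date_format in sources:
        for record in records:
            row = normalize(record, email_key, date_key, date_format)
            if row is None or row["email"] in seen:
                continue
            seen.add(row["email"])
            cleaned.append(row)
    return cleaned


cleaned = clean([
    (SOURCE_A, "email", "signup", "%Y-%m-%d"),
    (SOURCE_B, "Email_Address", "signup_date", "%m/%d/%Y"),
])
```

Even with only two tiny sources, each one needs its own key names and date format spelled out. That per-source bookkeeping is what grows into a real time sink as more raw data flows into the lake.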

All in all, this conference was so worth it, and I can't wait to attend again!