While there are definitely exceptions and room to be flexible, you have three things going on here:

Data warehouse: views on top of the tables.

So, hypothetically, if you've got a small data volume you could implement all three environments on a single database server. This wouldn't generally work very well, since your operational data probably doesn't support versioning of the rows, you're limited to a single source of data rather than integrating multiple related datasets, and it's painful to only have access to views for transformations (and they can be slow to query).

Say, Postgres: you could have separate databases for your operational activities, your data lake and your data mart. And you could swap out Postgres for an MPP database like Redshift, BigQuery, or Snowflake. Or you could swap it out for file objects on AWS S3 and then use Athena to access them. Or any combination of the two.

But what if you now need to pull in more data from a different system, cleanse it, and add some sort of correlation between that and the original system? The whole idea of a data lake is to keep volumes of data close to their raw format, catalogued and stored cheaply, in a format that can be readily consumed by other systems. With the way we use data lakes, we end up keeping multiple copies of the same source data, with a level of curation as needed.

Storage costs are a non-factor for the data volumes of most companies. People are storing raw data in tables and keeping it there. Modern warehouses do literally the same thing. Why? It's cheap and it is completely decoupled from other systems. It's decoupled, catalogued and queryable, arguably even more so than in a lake.
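The "warehouse as views on top of the operational tables" idea can be sketched in a few lines. This is a minimal illustration, using an in-memory SQLite database as a stand-in for Postgres; the table and column names (`orders`, `daily_revenue`) are hypothetical, not from the post.

```python
import sqlite3

# In-memory SQLite standing in for a single Postgres server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Operational table: current state only, no row versioning.
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT,
        amount_usd REAL,
        placed_at  TEXT
    );
    INSERT INTO orders VALUES
        (1, 'acme',   120.0, '2024-01-05'),
        (2, 'acme',    80.0, '2024-01-09'),
        (3, 'zenith',  40.0, '2024-01-07');

    -- "Warehouse" layer: a view that reshapes the operational data.
    CREATE VIEW daily_revenue AS
        SELECT placed_at AS day, SUM(amount_usd) AS revenue
        FROM orders
        GROUP BY placed_at;
""")

rows = conn.execute(
    "SELECT day, revenue FROM daily_revenue ORDER BY day"
).fetchall()
print(rows)
```

Note the limitation the post calls out: the view only ever sees the current state of `orders`, so once a row is updated in place, its history is gone.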
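The lake pattern of keeping a raw, unmutated copy of the source data and deriving curated copies from it as needed can be sketched the same way. This is a hedged illustration under assumed names (`raw_events`, `curated_events`); it is not the post's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Raw" zone: data landed close to its source format, never mutated.
    CREATE TABLE raw_events (payload TEXT);
    INSERT INTO raw_events VALUES
        ('signup|alice|2024-02-01'),
        ('signup|bob|2024-02-03'),
        ('login|alice|2024-02-04');
""")

# "Curated" zone: a cleansed, typed copy built from the raw data.
# Keeping both copies is cheap, and the raw zone can be re-curated later.
conn.execute("CREATE TABLE curated_events (event TEXT, user TEXT, day TEXT)")
raw_rows = conn.execute("SELECT payload FROM raw_events").fetchall()
for (payload,) in raw_rows:
    event, user, day = payload.split("|")
    conn.execute(
        "INSERT INTO curated_events VALUES (?, ?, ?)", (event, user, day)
    )

signups = conn.execute(
    "SELECT COUNT(*) FROM curated_events WHERE event = 'signup'"
).fetchone()[0]
print(signups)
```

The same two-zone shape carries over directly to the swaps the post mentions: raw files on S3 queried through Athena, with curated tables in Redshift, BigQuery, or Snowflake.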