Is S3 a data lake

Amazon launches service for data lakes

AWS Lake Formation is now generally available on Amazon Web Services. Amazon first announced the service at re:Invent 2018. It provides managed data lakes on Amazon's cloud platform. AWS charges the usual fees for the underlying services that store and transfer the data, but Lake Formation itself is free of charge.

When it first announced AWS Lake Formation at re:Invent in November 2018, Amazon spoke of 10,000 data lakes running on Amazon Simple Storage Service (S3). The new service automates setting up and managing the lakes and helps customers prepare data before it flows into the data lake. AWS also provides dedicated security functions.

The data lake with its inflows and outflows

Relational or NoSQL databases and other S3 storage can serve as sources. AWS Lake Formation provides a source crawler that handles data ingestion. Within the data lake, the service organizes the incoming data by frequently used query terms and splits it into data blocks, which is intended to ensure efficient processing. AWS Lake Formation uses machine learning to deduplicate data and to find records that differ but refer to the same thing.
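To make the idea of matching records that differ but refer to the same thing more concrete, here is a minimal sketch in Python. It is not how Lake Formation works internally: the service relies on machine learning, whereas this example swaps in plain string similarity from the standard library. The function names, the sample records and the threshold of 0.85 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(record: dict) -> str:
    """Lower-case values and collapse whitespace so trivial differences vanish."""
    parts = []
    for key in sorted(record):
        parts.append(" ".join(str(record[key]).lower().split()))
    return " | ".join(parts)


def similarity(a: dict, b: dict) -> float:
    """Return a similarity score between 0.0 and 1.0 for two records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def deduplicate(records: list, threshold: float = 0.85) -> list:
    """Keep the first record of every group of near-identical records.

    The threshold is a hypothetical value; a real system would tune or learn it.
    """
    kept = []
    for record in records:
        if not any(similarity(record, other) >= threshold for other in kept):
            kept.append(record)
    return kept


if __name__ == "__main__":
    customers = [
        {"name": "Jane Doe", "city": "Seattle"},
        {"name": "jane  doe", "city": "Seattle"},   # same person, different spelling
        {"name": "John Smith", "city": "Boston"},
    ]
    for row in deduplicate(customers):
        print(row)
```

The pairwise comparison is quadratic in the number of records, so this only illustrates the principle and would not scale to the data volumes a data lake typically holds.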

The data stored in the data lake can be passed to Amazon Redshift, Athena, AWS Glue or Amazon Elastic MapReduce (EMR) for further processing. Support for EMR is still in beta at the launch of the service. Integration with Amazon QuickSight and SageMaker is expected to follow in the next few months. Access to the data lake can be controlled via AWS Identity and Access Management (IAM) and AWS Key Management Service (KMS), among other things.
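As an illustration of how access control might look in practice, the following sketch grants an IAM principal read access to a single table through the `grant_permissions` call of the boto3 `lakeformation` client. The article does not describe this call; the role ARN, database name, table name and region are hypothetical placeholders, and the helper functions are assumptions made for this example. Actually running `grant_select` requires boto3, valid AWS credentials and sufficient privileges.

```python
def build_select_grant(principal_arn: str, database: str, table: str) -> dict:
    """Build the keyword arguments for a Lake Formation GrantPermissions request."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


def grant_select(principal_arn: str, database: str, table: str,
                 region: str = "us-east-1"):
    """Send the grant to AWS. Needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the request builder works without AWS access

    client = boto3.client("lakeformation", region_name=region)
    return client.grant_permissions(
        **build_select_grant(principal_arn, database, table)
    )


if __name__ == "__main__":
    # Hypothetical identifiers, shown only to illustrate the request shape.
    request = build_select_grant(
        "arn:aws:iam::123456789012:role/AnalystRole", "sales_db", "orders"
    )
    print(request)
```

Separating the request builder from the network call keeps the permission logic easy to inspect and review before anything is sent to AWS.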

A reservoir for all data

The term data lake goes back to Pentaho founder James Dixon. The concept is designed for large analytics systems. Data initially flows into the lake unprocessed and may be transformed there. The name reflects the fact that the lake receives data from numerous tributaries and combines structured with unstructured and raw data. The term is not tied to any specific storage technology.

One advantage of this approach is that administrators do not have to define formats or structures in advance. However, they must ensure that the data remains manageable and accessible. If they lose control over the data lake or have only poor access to it, the fitting term is data swamp: the lake has turned marshy.
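The idea of not defining structures in advance is often called schema-on-read: raw data is stored as it arrives, and a structure is applied only when someone queries it. The following sketch shows the principle with the Python standard library. The sample lines, the field names `user` and `amount`, and the function `read_amounts` are assumptions made for illustration, not part of any AWS service.

```python
import json

# Raw lines stored exactly as they arrived, including inconsistent and broken ones.
RAW_LAKE = [
    '{"user": "anna", "amount": "19.90", "currency": "EUR"}',
    '{"user": "ben", "amount": 5}',
    'not even json',
]


def read_amounts(raw_lines):
    """Apply a schema at read time: keep rows with a user and a numeric amount."""
    rows = []
    for line in raw_lines:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable input stays in the lake but is skipped here
        if not isinstance(doc, dict) or "user" not in doc or "amount" not in doc:
            continue
        try:
            rows.append((doc["user"], float(doc["amount"])))
        except (TypeError, ValueError):
            continue  # amount present but not convertible to a number
    return rows


if __name__ == "__main__":
    print(read_amounts(RAW_LAKE))
```

The sketch also hints at the swamp risk: every reader has to cope with malformed or inconsistent input, which is why cataloging and governance matter.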

Indeed, orderly access to the data and its optimization are major challenges when creating and managing data lakes. These are precisely the difficulties Amazon wants to address with the new offering. Further details can be found in the announcement. (rme)
