Here is a project I carried out in July 2024:
Data preparation is a critical step in any data pipeline, especially when dealing with large datasets that need to be cleaned, transformed, and made ready for analysis. AWS offers powerful tools like Amazon S3 and AWS Glue to help streamline this process. Whether you’re new to AWS services or an experienced user, understanding how to effectively leverage these tools can significantly improve your data workflows. This article will guide you through the entire process, from setting up your environment to executing data transformations and querying your data.
Setting Up Your Environment:
To begin, you’ll need to set up an Amazon S3 bucket, which will act as your primary data storage location. In our example, we create a bucket named laptop-donnees-s3. Within this bucket, it’s advisable to organize your data by creating two separate folders: one for raw data and another for transformed data. The raw-data folder will store unprocessed files such as CSVs, while the transformed-data folder will hold the data after it has undergone processing through AWS Glue.
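The bucket layout described above could be created with boto3 along the following lines. This is a minimal sketch rather than the exact steps used in the project: the region is an assumption (the article does not name one), and `folder_keys` and `create_layout` are hypothetical helper names.

```python
BUCKET = "laptop-donnees-s3"                    # bucket name from the article
PREFIXES = ("raw-data", "transformed-data")     # the two folders described above


def folder_keys(prefixes=PREFIXES):
    """Return the zero-byte object keys that show up as folders in the S3 console."""
    return [p if p.endswith("/") else p + "/" for p in prefixes]


def create_layout(bucket=BUCKET, region="ca-central-1"):
    """Create the bucket and its folders. Needs boto3 and AWS credentials.

    The region is an assumption; the article does not say which one was used.
    """
    import boto3  # imported lazily so folder_keys() works without boto3 installed

    s3 = boto3.client("s3", region_name=region)
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":  # us-east-1 rejects an explicit LocationConstraint
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3.create_bucket(**kwargs)

    # S3 has no real directories: an empty object whose key ends in "/"
    # is what the console displays as a folder.
    for key in folder_keys():
        s3.put_object(Bucket=bucket, Key=key)
```

Calling `create_layout()` once is enough. Uploading a file under `raw-data/` would also create that prefix implicitly, so the empty folder objects are a convenience, not a requirement.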
Data Crawling and Cataloging with AWS Glue:
AWS Glue is an essential service when it comes to managing and transforming large datasets. One of its core features is the Crawler, a tool that automatically connects to your data source, scans the data, and determines its schema. The Crawler can handle various data sources, including S3 buckets, Oracle databases, and MySQL databases, among others.
Once the Crawler is set up and connected to your data source, it scans the available data. During this process, built-in classifiers in AWS Glue analyze the data and create a schema that describes the structure and format of your dataset. This schema information is then stored in AWS Glue’s Data Catalog as tables. These tables play a crucial role in subsequent ETL (Extract, Transform, Load) processes, as they provide a structured representation of your data.
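For readers who prefer scripting over the console, a crawler pointed at the raw-data folder can be defined with boto3 roughly as follows. Treat it as a sketch under assumptions: the crawler name, IAM role ARN, and database name are hypothetical placeholders, and the role must already exist with read access to the bucket.

```python
def crawler_config(name, role_arn, database, s3_path):
    """Build the keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,                # IAM role the crawler assumes
        "DatabaseName": database,        # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start(config):
    """Create the crawler and start one run. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so crawler_config() works without boto3 installed

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])


# Hypothetical values, for illustration only.
config = crawler_config(
    name="laptop-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="laptop_db",
    s3_path="s3://laptop-donnees-s3/raw-data/",
)
```

`start_crawler` returns immediately; the tables appear in the Data Catalog only after the run finishes.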
Transforming Data with AWS Glue:
With the data cataloged, the next step is to transform it to meet your specific needs. AWS Glue allows you to define jobs that can automate these transformations. For example, if you need to clean up your data by modifying certain columns, such as converting a bigint type to an int, this can be easily done within an AWS Glue job.
AWS Glue supports a range of transformations, from simple data type conversions to more complex operations. It also allows for custom scripting in Python (PySpark) or Scala on Apache Spark, enabling you to handle more sophisticated data processing tasks that aren’t covered by Glue’s out-of-the-box capabilities.
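The bigint-to-int conversion mentioned above is typically expressed in a Glue job as an ApplyMapping step, which takes a list of (source name, source type, target name, target type) tuples. The sketch below builds such a list and checks that the values fit in 32 bits before narrowing. The column names are hypothetical, and the source type string should be whatever your catalog or generated script reports for the column.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # range of a 32-bit signed int


def fits_in_int(values):
    """True if every non-null value survives a bigint -> int cast unchanged."""
    return all(v is None or INT_MIN <= v <= INT_MAX for v in values)


def cast_mapping(columns, to_int):
    """Build ApplyMapping tuples, narrowing the columns listed in to_int.

    columns: list of (name, type) pairs as reported by the catalog.
    to_int:  set of column names to convert to int.
    """
    return [
        (name, typ, name, "int" if name in to_int else typ)
        for name, typ in columns
    ]


def apply_in_glue(dynamic_frame, mappings):
    """Apply the mappings inside a Glue job (the awsglue library exists only there)."""
    from awsglue.transforms import ApplyMapping

    return ApplyMapping.apply(frame=dynamic_frame, mappings=mappings)


# Hypothetical columns, for illustration only.
mappings = cast_mapping([("price", "bigint"), ("brand", "string")], {"price"})
```

Running `fits_in_int` on a sample of the column first is a cheap safeguard, since values outside the 32-bit range cannot be represented after the cast.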
Tables: Organizing Your Data
Tables are structured formats used to organize and store data in a way that makes it easy to query and analyze. In the context of AWS Glue, when raw data is scanned by a Crawler, it is organized into tables based on the schema detected in the data. These tables are then stored in the AWS Glue Data Catalog.
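To see what the Crawler stored, the catalog tables can be read back with the Glue `get_tables` call. The helper below is a small sketch that turns such a response into a name-to-columns dictionary; the database name is whatever you gave the crawler, and the sample response is made up for illustration.

```python
def table_schemas(response):
    """Map each table name to its (column, type) pairs from a get_tables response."""
    return {
        table["Name"]: [
            (col["Name"], col["Type"])
            for col in table["StorageDescriptor"]["Columns"]
        ]
        for table in response["TableList"]
    }


def fetch_schemas(database):
    """Read the first page of tables for a database. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so table_schemas() works without boto3 installed

    glue = boto3.client("glue")
    # Only the first page is read here; large catalogs need the NextToken loop.
    return table_schemas(glue.get_tables(DatabaseName=database))


# Made-up response in the shape get_tables returns, for illustration only.
sample = {
    "TableList": [
        {
            "Name": "raw_data",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "brand", "Type": "string"},
                    {"Name": "price", "Type": "bigint"},
                ]
            },
        }
    ]
}
schemas = table_schemas(sample)
```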
Practical Example: Currency Conversion
Let’s consider a practical example where your dataset contains pricing information in Indian Rupees (INR), but you need to convert these prices to Canadian Dollars (CAD) for analysis. By applying SQL transformations within AWS Glue, you can use the current exchange rate to convert these amounts. The transformed data, now in CAD, can be saved back into the transformed-data folder in S3, making it easier to perform further analysis and comparisons.
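The arithmetic behind this conversion is a single multiplication by the exchange rate, rounded to cents. The sketch below shows it with Python's `Decimal`, plus a helper that builds an equivalent SQL statement of the kind a Glue SQL transform could run. The rate of 0.0165, the table name, and the column name are illustrative assumptions, not values from the project.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def inr_to_cad(amount_inr, rate):
    """Convert an INR amount to CAD, rounded to two decimals.

    Pass amounts and rates as strings or ints so Decimal keeps them exact.
    """
    return (Decimal(amount_inr) * Decimal(rate)).quantize(CENT, rounding=ROUND_HALF_UP)


def conversion_sql(table, column, rate):
    """Build a SQL statement that adds a <column>_cad column to every row."""
    return (
        f"SELECT *, ROUND({column} * {Decimal(rate)}, 2) AS {column}_cad "
        f"FROM {table}"
    )


# Illustrative rate only; look up the current INR -> CAD rate before using it.
price_cad = inr_to_cad("50000", "0.0165")
query = conversion_sql("laptops", "price", "0.0165")
```

Keeping the rate as a parameter rather than hard-coding it in the job makes it easier to rerun the transformation when the exchange rate changes.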
Querying Transformed Data with Amazon Athena:
After your data has been transformed, the next logical step is to analyze it. This is where Amazon Athena comes in. Athena is a serverless query service that allows you to run SQL queries directly on your data stored in S3. Since Athena can read from the tables defined in AWS Glue’s Data Catalog, it seamlessly integrates with your ETL pipeline.
For instance, if you’re working with a dataset that includes sales data, you can use Athena to extract detailed insights such as the types of products sold, quantities, and other key metrics. By querying the transformed data in S3, you gain the ability to perform complex analyses without the need for additional infrastructure.
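Athena queries can also be submitted from code. The following sketch builds the request for boto3's `start_query_execution` and submits it; the database, table, column names, and results location are hypothetical, and the query is an illustration rather than one of the queries used in the project.

```python
def athena_request(sql, database, output_location):
    """Build the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},             # Data Catalog database
        "ResultConfiguration": {"OutputLocation": output_location},  # where result files land
    }


def run_query(request):
    """Submit the query and return its execution id. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so athena_request() works without boto3 installed

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = athena_request(
    sql="SELECT brand, COUNT(*) AS units FROM laptops GROUP BY brand ORDER BY units DESC",
    database="laptop_db",
    output_location="s3://laptop-donnees-s3/athena-results/",
)
```

The call is asynchronous: the returned id is then passed to `get_query_execution` to check the status and to `get_query_results` to read the rows once the query has succeeded.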
Conclusion:
After applying data processing with AWS Glue, running queries in Amazon Athena, and performing analyses in Power BI, crucial insights into computer sales were obtained. The most in-demand models were identified, enabling more efficient stock management and strategic price adjustments to maximize sales.
Customer feedback and reviews were integrated to identify areas for product improvement, while monitoring return rates helped prevent potential issues with certain models or configurations. These efforts led to a better understanding of the market and the optimization of business operations, supporting more informed decisions aligned with consumer needs.
This project also allowed us to develop advanced skills in the use of AWS Glue, S3, and Athena, including managing ETL jobs, configuring crawlers, and optimizing data pipeline performance. Additionally, we strengthened our knowledge of data security on AWS and the use of Power BI for data visualization.
Here are some examples of queries used in Amazon Athena: