How Globe Telecom used Amazon Web Services DataSync at scale to migrate 7.2 PB of Hadoop data
Data migration is a critical first step in an organization’s cloud journey. It often requires a lift and shift of business-critical applications, databases, data analytics workloads, data warehouses, big data, and trained artificial intelligence/machine learning (AI/ML) models. The data is generated and stored in different layers, which adds complexity to the migration process. Because of this complexity, it’s important that data ingestion and migration processes are well designed, with a streamlined methodology that ensures a continuous flow of data transfer.
In this post, we take you through Globe Telecom’s data migration journey, including the migration of Cloudera data to Amazon S3 using Amazon Web Services DataSync.
Globe Telecom’s technical requirements
Globe Telecom needed to build and manage a centralized data repository on Amazon S3 as a raw data lake. Once the data landed in Amazon S3, it needed to be pre-processed and shared with analytics engines to run data insights. Business users would later leverage business intelligence (BI) tools to visualize the data and support data-driven business decisions.
Globe Telecom’s data migration requirements:
- Cloudera is the source HDFS
- Migrate data from HDFS Storage Nodes without a staging area
- Online data migration with an available network bandwidth of 10 Gb/s over Amazon Web Services Direct Connect
- Data size of 7.2 petabytes, consisting of historical data as well as new incoming data
- More than 1 billion files in total
- Historical data and the incremental sync of new data sets
- Minimal on-premises footprint for migration infrastructure
- Automation, monitoring, reporting, and scripting support
Evaluating solutions
Initially, we looked at a vendor product that performs historical data migrations, subsequent transfers of newly ingested data, and the ability to perform an open file sync from the source Cloudera system. The overall feature set was compelling and it met our primary requirements. However, we decided not to proceed due to heavy licensing costs, infrastructure requirements, and complexity when implementing at such a large scale. Additionally, it was difficult to access the software to perform proof of concept testing.
We evaluated additional solutions for migrating the HDFS data to Amazon S3, including Amazon Web Services DataSync.
During the proof of concept, we defined the success criteria listed below and conducted a series of tests for each tool. It was a tight race, and we weighed the following factors:
- Agility
- Reliability
- Functionality
- Availability and scalability
- Security
- Support and future-proofing
- Cost
Among these, the areas of differentiation were cost, availability, and scalability. We ultimately chose DataSync based on these factors, along with the following additional reasons that influenced our decision:
- Easy set up and deployment
- Amazon Web Services Command Line Interface (Amazon Web Services CLI) and scripting support (see the sketch after this list)
- Incremental data transfer between source and target
- Single dashboard for monitoring
- Task based with the ability to scale
- DataSync agents on Amazon Elastic Compute Cloud (Amazon EC2)
- Simple pricing model
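To illustrate the scripting support, the following is a minimal sketch of defining the source HDFS location, the destination S3 location, and a migration task programmatically. It uses the boto3 DataSync client rather than the CLI, and all ARNs, hostnames, bucket names, roles, and paths are hypothetical placeholders, not values from our environment.

```python
import boto3

# Sketch only: all ARNs, hostnames, bucket names, and paths are placeholders.
datasync = boto3.client("datasync", region_name="ap-southeast-1")

# Source: the on-premises Cloudera HDFS cluster, read through agents running on EC2.
hdfs_location = datasync.create_location_hdfs(
    NameNodes=[{"Hostname": "namenode.example.internal", "Port": 8020}],
    AuthenticationType="SIMPLE",
    SimpleUser="hdfs",
    Subdirectory="/S2/data",
    AgentArns=["arn:aws:datasync:ap-southeast-1:111122223333:agent/agent-EXAMPLE1"],
)

# Destination: the Amazon S3 raw data lake.
s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-raw-data-lake",
    Subdirectory="/s2/data",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/ExampleDataSyncS3Role"},
)

# The task ties source and destination together; verification is limited to
# transferred files to keep repeated runs fast.
task = datasync.create_task(
    SourceLocationArn=hdfs_location["LocationArn"],
    DestinationLocationArn=s3_location["LocationArn"],
    Name="hdfs-s2-data-historical",
    Options={"VerifyMode": "ONLY_FILES_TRANSFERRED", "OverwriteMode": "ALWAYS"},
)
print(task["TaskArn"])
```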
Solution overview
We built a unique solution architecture for migrating our 7.2 PB of Cloudera data using DataSync. The Amazon Web Services team recommends that you run the DataSync agent as close to your storage system as possible for low-latency access and better performance. However, in our setup we took the approach of installing all DataSync agents as EC2 instances, with no on-premises footprint, and relied on the high-bandwidth network for consistent performance. For more information, refer to the supported DataSync agent deployment options.
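As a sketch of how an agent can be launched as an EC2 instance, the current DataSync agent AMI ID can be read from the public SSM parameter and used to start an instance. The subnet, security group, and instance name below are hypothetical placeholders.

```python
import boto3

ssm = boto3.client("ssm", region_name="ap-southeast-1")
ec2 = boto3.client("ec2", region_name="ap-southeast-1")

# The current DataSync agent AMI ID is published as a public SSM parameter.
ami_id = ssm.get_parameter(Name="/aws/service/datasync/ami")["Parameter"]["Value"]

# Hypothetical subnet and security group: launch one agent instance in the
# chosen Availability Zone; repeat per agent needed.
ec2.run_instances(
    ImageId=ami_id,
    InstanceType="m5.4xlarge",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0abc1234example",
    SecurityGroupIds=["sg-0abc1234example"],
    TagSpecifications=[
        {
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "datasync-agent-az1-01"}],
        }
    ],
)
```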
We applied “
Building for resilience
We built a resilient architecture for the HDFS data migration by distributing the EC2 instances across Availability Zones (AZs), and grouped the agents based on tasks with “include filters.” We did not encounter any latency or performance issues, as our environment has an available 10-Gbps network pipe and a source storage system that provides consistent read throughput. Careful planning of task allocation per agent and using filters helped us optimize the data ingestion.
Tasks running in each Availability Zone utilize three DataSync agents configured with each source HDFS location. To protect against unforeseen events, two additional standby agents were deployed and activated. While DataSync tasks do not provide automatic failover between agents, standby agents can be used as ready replacements.
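The following is a minimal sketch (with hypothetical ARNs) of how a standby agent can be swapped in by updating the agent list on the source HDFS location, assuming the location was created as shown earlier; subsequent task executions then use the replacement agent.

```python
import boto3

datasync = boto3.client("datasync", region_name="ap-southeast-1")

# Hypothetical ARNs: replace a failed agent on the source HDFS location with a
# standby agent. Tasks are unchanged and use the new agent on their next run.
datasync.update_location_hdfs(
    LocationArn="arn:aws:datasync:ap-southeast-1:111122223333:location/loc-EXAMPLE1",
    AgentArns=[
        "arn:aws:datasync:ap-southeast-1:111122223333:agent/agent-STANDBY1",
        "arn:aws:datasync:ap-southeast-1:111122223333:agent/agent-ACTIVE2",
        "arn:aws:datasync:ap-southeast-1:111122223333:agent/agent-ACTIVE3",
    ],
)
```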
Designing for scale
Amazon Web Services DataSync agents were activated using private VPC endpoints.
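A minimal sketch of activating an agent against a DataSync interface VPC endpoint follows. The activation key, endpoint ID, subnet, and security group ARNs are hypothetical placeholders; the activation key itself is obtained from the agent instance beforehand.

```python
import boto3

datasync = boto3.client("datasync", region_name="ap-southeast-1")

# Hypothetical values: the activation key is obtained from the agent's EC2
# instance first, and the VPC endpoint is a DataSync interface endpoint.
response = datasync.create_agent(
    ActivationKey="EXAMPLE-ACTIVATION-KEY",
    AgentName="hdfs-migration-agent-az1-01",
    VpcEndpointId="vpce-0123456789abcdef0",
    SubnetArns=["arn:aws:ec2:ap-southeast-1:111122223333:subnet/subnet-0abc1234example"],
    SecurityGroupArns=["arn:aws:ec2:ap-southeast-1:111122223333:security-group/sg-0abc1234example"],
)
print(response["AgentArn"])
```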
| Source system | Network bandwidth | Network throughput | Read IOPS |
| --- | --- | --- | --- |
| Cloudera CDH 5.13.3, 370 DataNodes, 2 NameNodes | 10 Gb/s | 800 MB/s | 27K |
From the source Cloudera location shown in the following table, we assigned specific folders to each task for a specific agent to process (see the sketch after the table). Using this method, we ran 6 to 9 tasks across 9 to 12 agents spread over Availability Zones.
| Data type | Source directory location | Destination location on Amazon S3 |
| --- | --- | --- |
| Historical data | HDFS /S2/data/ | Prod S3 /s2/data |
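As an illustration, a task execution can be scoped to an agent group's assigned folders with a SIMPLE_PATTERN include filter; the task ARN and folder names below are hypothetical.

```python
import boto3

datasync = boto3.client("datasync", region_name="ap-southeast-1")

# Hypothetical task ARN and folder names: scope this execution to the folders
# assigned to one agent group using a SIMPLE_PATTERN include filter.
datasync.start_task_execution(
    TaskArn="arn:aws:datasync:ap-southeast-1:111122223333:task/task-EXAMPLE1",
    Includes=[
        {"FilterType": "SIMPLE_PATTERN", "Value": "/folder_a*|/folder_b*|/folder_c*"}
    ],
)
```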
We ran the tasks in parallel and achieved a data migration rate of up to 72 TB per day at a network utilization of 85%. This equates to roughly 800 MB/s, or about 3 TB per hour.
The DataSync agents were deployed as m5.4xlarge Amazon EC2 instances.
The following image shows the task execution strategy and ‘include filters’ that we carved out for DataSync.
In addition, we ran tasks to migrate incremental file updates using the same combinations of source locations and task filters.
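To monitor such runs from a script, a task execution can be polled until it completes; the execution ARN below is a hypothetical placeholder.

```python
import time

import boto3

datasync = boto3.client("datasync", region_name="ap-southeast-1")

# Hypothetical execution ARN: poll an incremental run and report progress.
execution_arn = (
    "arn:aws:datasync:ap-southeast-1:111122223333:task/task-EXAMPLE1"
    "/execution/exec-EXAMPLE1"
)

while True:
    execution = datasync.describe_task_execution(TaskExecutionArn=execution_arn)
    status = execution["Status"]
    print(
        f"{status}: {execution.get('FilesTransferred', 0)} files, "
        f"{execution.get('BytesTransferred', 0)} bytes transferred"
    )
    if status in ("SUCCESS", "ERROR"):
        break
    time.sleep(60)
```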
Conclusion
Globe Telecom’s EDO data migration project was successfully completed within the defined project timeline of four months. DataSync provided the agility, flexibility, and security to build a scale-out architecture for high-performance, fast, and secure movement of data. Built-in automation, monitoring, a single dashboard view, and task completion reports helped our team members focus on the data migration strategy, reduced data transfer costs, and provided peace of mind during the migration phase. DataSync’s data integrity and validation checks gave us confidence in our data post-migration. We quickly kicked off the analytical data pipeline for further processing and data visualization for end users with a shorter turnaround time. DataSync streamlined our HDFS data migration journey to the Amazon Web Services Cloud.
Thank you for reading this post. If you have any comments or questions, don’t hesitate to leave them in the comments section. For more information, please watch our