Canaan, founded in 2013, released the world's first ASIC-based blockchain computing device that same year, leading the industry into the ASIC era, and has since built up extensive experience in chip mass production. Mass production of its 16nm product in 2016 marked Canaan as the first company in mainland China to reach advanced process nodes. Since 2018, Canaan has achieved the world's first mass production of a self-developed 7nm chip, followed by commercial mass production of Kanzhi K210, a self-developed RISC-V-based edge intelligent computing chip whose AI neural network accelerator, the KPU, was developed entirely in-house. Today, Canaan produces tens of millions of chips per month, its products and services reach more than 60 countries and regions, and the company has established a solid customer base in the United States, Canada, Sweden, Iceland, Bosnia and Herzegovina, Malaysia, Korea, Russia, Armenia, and Hong Kong, among others. Looking ahead, Canaan and its business partners will build on chip research and development and high-performance computing to make the benefits of AI broadly accessible and improve everyday life.
Canaan is a fabless chip design company, and we want to devote as much energy and resources as possible to what we do best: chip design and development. By working with AWS, we quickly gained world-leading IT infrastructure that supports multiple chip design projects with less manpower and resource investment. This has significantly accelerated our chip design work, made project cycles more predictable, and delivered overall cost savings of more than 30%.
Vice President of Canaan
Modern chip design relies increasingly on a wide range of EDA (Electronic Design Automation) tools and software as semiconductor processes evolve. In real production practice, however, Canaan found that these tools place heavy demands on an enterprise's IT infrastructure. Investing substantial manpower and effort in a self-built data center to meet those demands usually adds a burden on top of the design work itself.
First, different stages of chip design call for different tools and software, each with different infrastructure requirements. For example, some software depends heavily on CPU performance and stability, some requires very large amounts of memory, and some needs file system storage with high IOPS and throughput. When a chip design enterprise plans an on-premises data center, it is difficult to balance these performance requirements within a sound architecture and cost plan. In addition, the wide variety of equipment substantially increases the difficulty of deployment and operation.
Second, as a specialized high-performance computing scenario, modern chip design software is highly demanding of IT infrastructure performance. A single computing task commonly schedules hundreds of CPU cores, consumes terabytes of memory, and runs continuously for several days. Workloads with tens of millions of small files coexist with workloads whose individual files reach tens of terabytes. For a chip design enterprise, it is very difficult to design and maintain a high-performance computing cluster of this scale and keep it running stably. Seemingly small errors and failures can lead to significant risks such as failed computing tasks, data loss, and schedule delays.
Finally, because of how the semiconductor industry chain works, the workload of a chip design enterprise is usually strongly periodic. Whether the peak is a short one caused by intensive designer activity within a project or a long one caused by overall project scheduling, the result is the same: even after spending heavily on high-specification equipment to meet peak demand, it is hard to avoid large amounts of idle capacity, with annual utilization below 10%.
Beyond these technical difficulties, Canaan had also long been troubled by project management pain points, for example:
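To make the utilization figure concrete, the following is a minimal sketch of how peak-sized capacity leads to low annual utilization. The cluster size, peak length, and background load are hypothetical numbers chosen for illustration; they are not Canaan's data.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def annual_utilization(owned_cores, busy_core_hours):
    """Return the fraction of owned core-hours that did useful work in a year."""
    capacity_core_hours = owned_cores * HOURS_PER_YEAR
    return busy_core_hours / capacity_core_hours


# Hypothetical workload (an assumption, not taken from the case study):
# four one-week peaks per year that saturate a 2,000-core cluster,
# plus a 50-core background load for the rest of the year.
OWNED_CORES = 2000
PEAK_HOURS = 4 * 7 * 24
BACKGROUND_CORES = 50

busy = OWNED_CORES * PEAK_HOURS + BACKGROUND_CORES * (HOURS_PER_YEAR - PEAK_HOURS)
utilization = annual_utilization(OWNED_CORES, busy)
print(f"Annual utilization: {utilization:.2%}")
```

Under these assumptions the cluster is sized for the four peak weeks, so most of the purchased core-hours go unused for the rest of the year and utilization stays just under 10%.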
- Because the local data center is limited in size, multiple projects or teams working in parallel often have to queue serially for IT infrastructure resources, which makes project task scheduling difficult and unpredictable;
- When different teams and projects share the same IT infrastructure, it is difficult to evaluate benefits such as resource utilization and to apportion costs;
- Unexpected equipment procurement at project peaks disrupts financial planning, and lengthy, uncontrollable procurement and deployment cycles create a risk of project delay;
- If IT infrastructure is built separately in branch offices in different regions, it is difficult to manage uniformly and security risks increase. If all offices connect to the same self-built data center, the network infrastructure faces significant challenges in performance, stability, and configuration flexibility.
Canaan's practice on AWS
As a chip design enterprise, Canaan treats security as its foremost concern and the first issue to resolve. By combining different AWS services, Canaan has built a comprehensive security system covering data security, network security, operational security, and audit review.
- With AWS Direct Connect, Canaan established dedicated connections between its self-built data center and multiple AWS Regions, which not only improved network performance but also secured data in transit through encrypted communication;
- For different projects and teams, Canaan creates multiple Amazon Virtual Private Cloud (Amazon VPC) networks to build logically isolated cloud network environments and form security boundaries between clusters, uses private subnets to isolate key resources from the public internet, and uses security groups to control access for internal traffic;
- By calling the AWS Identity and Access Management (IAM) API, Canaan integrates with its on-premises directory management and authentication systems to authenticate users and authorize their use of cloud resources;
- For sensitive data, Canaan uses AWS Key Management Service (KMS) to encrypt the storage services in use: Amazon Elastic File System (Amazon EFS), Amazon FSx for Lustre, and Amazon Elastic Block Store (Amazon EBS);
- Each branch office connects to AWS over an encrypted VPN, and resource and operation logs are collected through AWS CloudTrail and Amazon CloudWatch for later audit;
- Encrypted Amazon Simple Storage Service (Amazon S3) storage provides centralized storage and off-site archival backups in the cloud.
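The network isolation and storage encryption measures in the list above can be sketched as a per-project infrastructure template. The example below builds a CloudFormation-style template as a Python dictionary; it is only an illustration under stated assumptions, not Canaan's actual configuration. The function name, logical resource names, CIDR ranges, and the KMS key ARN are all hypothetical placeholders.

```python
import json


def build_project_template(project, vpc_cidr, subnet_cidr, kms_key_arn):
    """Return a CloudFormation-style template for one isolated project environment."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Isolated environment for project {project} (illustrative)",
        "Resources": {
            # One VPC per project or team forms the outer security boundary.
            "ProjectVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": vpc_cidr},
            },
            # No internet gateway or public route is defined in this template,
            # so the subnet has no path to the public internet.
            "PrivateSubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    "VpcId": {"Ref": "ProjectVpc"},
                    "CidrBlock": subnet_cidr,
                    "MapPublicIpOnLaunch": False,
                },
            },
            # The security group only admits traffic that originates inside the VPC.
            "ClusterSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": f"Internal traffic for {project}",
                    "VpcId": {"Ref": "ProjectVpc"},
                    "SecurityGroupIngress": [
                        {"IpProtocol": "-1", "CidrIp": vpc_cidr}
                    ],
                },
            },
            # Shared file system encrypted at rest with a customer-managed KMS key.
            "SharedFileSystem": {
                "Type": "AWS::EFS::FileSystem",
                "Properties": {"Encrypted": True, "KmsKeyId": kms_key_arn},
            },
        },
    }


# Hypothetical inputs: the project name, CIDR ranges, and key ARN are placeholders.
template = build_project_template(
    project="chip-a",
    vpc_cidr="10.10.0.0/16",
    subnet_cidr="10.10.1.0/24",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(json.dumps(template, indent=2))
```

Generating one template per project keeps each environment's boundary explicit and makes it straightforward to confirm that every storage resource has encryption enabled before anything is deployed.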
After deploying the base network and authentication systems, Canaan adopted AWS ParallelCluster to deploy and manage SGE-based High Performance Computing (HPC) clusters in the AWS cloud. By writing different AWS CloudFormation templates, the infrastructure environments required for different design stages can be built within minutes. For compute-intensive tasks, Canaan selects Amazon Elastic Compute Cloud (Amazon EC2) Z1d instances, with core frequencies of up to 4.0 GHz, or compute-optimized C5 instances; for memory-intensive tasks, it selects X1e instances, with up to 3.9 TB of memory, or memory-optimized R5 instances. To meet the need for high IOPS and high throughput in file storage at different stages of a computing task, Canaan chose Amazon FSx for Lustre, a fully managed, high-performance file system that readily delivers up to hundreds of gigabytes per second of throughput and millions of IOPS while also meeting data high-availability requirements. In Regions where Amazon FSx for Lustre is not yet available, I3 instances are used to deploy GlusterFS clusters that provide the high-performance shared file system the software requires. In addition, placement groups are used to obtain low network latency and high network throughput where workloads depend on network bandwidth between instances.
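The mapping from workload type to instance family described above can be sketched as a small selection function. This is a hedged illustration only: the `Job` fields and the numeric cutoffs are assumptions made for the example, not AWS specifications or Canaan's actual selection rules.

```python
from dataclasses import dataclass

# Illustrative cutoffs (assumptions, not AWS limits or Canaan's policy).
LARGE_MEMORY_GIB = 768      # above this, prefer the largest-memory family
MEMORY_HEAVY_GIB_PER_CORE = 4  # above this ratio, prefer a memory-optimized family


@dataclass
class Job:
    name: str
    cores: int
    memory_gib: int
    clock_bound: bool = False  # True if runtime is dominated by per-core speed


def pick_instance_family(job):
    """Map a job profile to one of the EC2 families named in the text."""
    if job.memory_gib > LARGE_MEMORY_GIB:
        return "x1e"  # very large memory footprint
    if job.clock_bound:
        return "z1d"  # high core frequency
    if job.memory_gib / job.cores > MEMORY_HEAVY_GIB_PER_CORE:
        return "r5"   # memory optimized
    return "c5"       # compute optimized


# Hypothetical jobs for illustration.
jobs = [
    Job("timing-analysis", cores=16, memory_gib=64, clock_bound=True),
    Job("place-and-route", cores=64, memory_gib=3000),
    Job("extraction", cores=32, memory_gib=512),
    Job("regression-sim", cores=96, memory_gib=96),
]
for job in jobs:
    print(job.name, "->", pick_instance_family(job))
```

In practice the cutoffs would come from benchmarking each tool, since per-core memory needs and sensitivity to clock speed vary between design stages.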
To control costs, Canaan benchmarks different computing tasks to select the most cost-effective service and instance type for each deployment. Idle resources are reviewed promptly and released to reduce waste. After a period of use, Canaan also adopted Reserved Instances (RIs) for long-term stable loads and Spot Instances for short-term, anticipated bursty loads to obtain discounts. Fig. 1 is a schematic diagram of Canaan's system; the AWS services adopted include Amazon EC2, Amazon S3, Amazon FSx for Lustre, Amazon VPC, AWS Direct Connect, AWS KMS, IAM, AWS CloudTrail, Amazon CloudWatch, AWS ParallelCluster, and more.
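The two cost-control rules in this paragraph, matching load patterns to purchasing options and releasing idle resources, can be sketched as follows. The function names, the one-year cutoff for steady loads, and the 5% CPU threshold are assumptions made for illustration, and the instance IDs are invented; real idle detection would read metrics from a monitoring service rather than from an in-memory dictionary.

```python
def choose_purchase_option(duration_days, is_steady, tolerates_interruption):
    """Pick a purchasing option from a coarse description of the load."""
    # Long-term stable loads: commit to reserved capacity for a discount.
    if is_steady and duration_days >= 365:
        return "reserved"
    # Short-term bursty loads that can be retried: use spare (Spot) capacity,
    # which can be reclaimed by the provider at short notice.
    if not is_steady and tolerates_interruption:
        return "spot"
    # Everything else pays the standard rate.
    return "on-demand"


def find_idle_instances(cpu_percent_by_instance, threshold=5.0):
    """Return IDs whose CPU usage never reached the threshold in the samples."""
    return sorted(
        instance_id
        for instance_id, samples in cpu_percent_by_instance.items()
        if samples and max(samples) < threshold
    )


# Hypothetical inputs for illustration.
print(choose_purchase_option(730, is_steady=True, tolerates_interruption=False))
print(choose_purchase_option(14, is_steady=False, tolerates_interruption=True))
print(find_idle_instances({
    "i-example-a": [1.2, 0.8, 2.5],
    "i-example-b": [70.0, 85.5, 3.0],
}))
```

Instances with no samples are deliberately left out of the idle list so that a gap in monitoring data never triggers a release.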
By migrating its chip design workloads to AWS, Canaan gains nearly "limitless" infrastructure expansion within minutes, which means it no longer needs to worry about resource shortages for any single computing task and can trade off time against cost more flexibly. Meanwhile, multiple teams and projects can work on separate clusters in parallel, which greatly reduces "queuing" time and thus improves overall chip development speed. Once a computing task finishes, idle cloud resources can be released promptly to save costs, truly achieving "paying only for what is actually used".
"Using AWS services has objectively raised our overall level of security control. AWS's infrastructure operation and management capabilities are far beyond ours, and experience has shown that AWS services run more stably than our self-built data center. We have always believed that professional work should be left to professionals. Since the semiconductor industry already accepts licensing IP from vendors and outsourcing production to contract manufacturers, accepting cloud computing services to strengthen IT support is not a big step," concluded Mr. Wu Jingjie, Vice President of Canaan.
To learn how the AWS Cloud can help you accelerate innovation, optimize production, and deliver advanced products and services, see the Semiconductor and Electronics pages.