Q: What is Amazon EMR?

Amazon EMR is a managed Hadoop service that allows you to run the latest versions of popular big data frameworks such as Apache Spark, Presto, HBase, Hive, and more, on fully customizable clusters. Amazon EMR gives you full control over the configuration of your clusters and the software you install on them.

Q: What can I do with Amazon EMR?

Using Amazon EMR, you can quickly provision popular open source frameworks such as Hadoop and Spark, with as much or as little capacity as you like, to perform data-intensive tasks. Common use cases include web indexing, data mining, log file analysis, extract-transform-load (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon EMR lets you focus on analyzing your data without having to worry about the time-consuming setup, management, or tuning of Hadoop clusters or the compute capacity they run on.

Amazon EMR is well suited to problems that require fast and efficient processing of large amounts of data. The web service interfaces allow you to build processing workflows and programmatically monitor the progress of running clusters. In addition, you can use the web interface of the Amazon Web Services Management Console to launch your clusters and monitor processing-intensive computations.
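As an illustration of the programmatic route, the following is a minimal sketch using the boto3 SDK for Python. The release label, instance types, default role names, and the helper names `build_cluster_request` and `launch_and_wait` are assumptions made for this example, not values taken from this project; adjust them to what exists in your account.

```python
import time

# States in which polling stops: the cluster is idle and ready, or it has ended.
DONE_STATES = {"WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def build_cluster_request(name, log_uri, instance_count=3):
    """Return keyword arguments for the EMR RunJobFlow call (hypothetical helper)."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; use one available to you
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": instance_count,
            # Keep the cluster alive after start-up; terminate it yourself when done.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed: default roles already exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_and_wait(request, poll_seconds=30):
    """Launch a cluster, then poll its state until it is ready or has ended."""
    import boto3  # third-party SDK, imported here so the builder above has no dependencies

    emr = boto3.client("emr")
    cluster_id = emr.run_job_flow(**request)["JobFlowId"]
    while True:
        state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
        print(cluster_id, state)
        if state in DONE_STATES:
            return cluster_id, state
        time.sleep(poll_seconds)
```

Calling `launch_and_wait(build_cluster_request("demo", "s3://example-bucket/logs/"))` would start a billable cluster, so the bucket name here is a placeholder and the call is left to the reader.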

Q: Can I use this project to analyze my own logs?

Yes. You can upload your own log data to an Amazon S3 bucket and use a similar cluster to run queries against it. However, please note that this project is not meant for production environments.

Q: How do I get my data into Amazon S3?

You can easily and securely create buckets, upload objects, and set access controls using the Amazon Web Services Management Console. The Amazon S3 Getting Started Guide shows you how to start using the Amazon Web Services Management Console with Amazon S3.

Amazon S3 is also integrated with a variety of other Amazon Web Services offerings and third-party connectors to help you move data into and out of the cloud.
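Besides the console, objects can be uploaded from a script. The sketch below assumes the boto3 SDK for Python is installed and credentials are configured; the helper names `object_key` and `upload_logs`, and the choice to request server-side encryption, are illustrative assumptions rather than part of this project.

```python
from pathlib import Path


def object_key(prefix, local_path):
    """Build an S3 object key from a key prefix and a local file path."""
    name = Path(local_path).name
    prefix = prefix.strip("/")
    return f"{prefix}/{name}" if prefix else name


def upload_logs(bucket, prefix, local_paths):
    """Upload each local file to s3://bucket/prefix/<file name>."""
    import boto3  # third-party SDK, imported here so object_key has no dependencies

    s3 = boto3.client("s3")
    for path in local_paths:
        s3.upload_file(
            str(path),
            bucket,
            object_key(prefix, path),
            # Optional: ask Amazon S3 to encrypt the object at rest.
            ExtraArgs={"ServerSideEncryption": "AES256"},
        )
```

For example, `upload_logs("example-bucket", "logs", ["access.log"])` would store the file under the key `logs/access.log`; the bucket name is a placeholder.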

Q: How secure is my data?

Amazon S3 is secure by default. Initially, only the bucket and object owners have access to the Amazon S3 resources they create. Amazon S3 supports user authentication to control access to data. You can securely upload data to and download data from Amazon S3 via SSL endpoints using the HTTPS protocol. You can use Amazon Identity and Access Management (IAM) tools such as IAM users and roles to control access and permissions. For example, you could give certain users read but not write access to your clusters. You can also use Amazon EMR security configurations to set options for encryption at rest and in transit, including support for Amazon S3 encryption. Learn more about controlling access to your cluster and Amazon EMR encryption.
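To make the read-only example and the HTTPS requirement concrete, the sketch below builds two policy documents as Python dictionaries. Both are illustrative assumptions, not policies shipped with this project: the function names are hypothetical, the wildcard actions are one possible way to express read-only access to Amazon EMR, and the `partition` argument defaults to `aws` and may need a different value in other partitions.

```python
import json


def emr_read_only_policy():
    """IAM policy allowing users to view, but not modify, EMR clusters."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["elasticmapreduce:Describe*", "elasticmapreduce:List*"],
                "Resource": "*",
            }
        ],
    }


def deny_insecure_transport_policy(bucket, partition="aws"):
    """Bucket policy that rejects any request not sent over HTTPS."""
    bucket_arn = f"arn:{partition}:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Print the JSON so it can be pasted into the IAM or Amazon S3 console.
print(json.dumps(emr_read_only_policy(), indent=2))
print(json.dumps(deny_insecure_transport_policy("example-bucket"), indent=2))
```

The first document would be attached to the IAM users or roles that should only view clusters; the second would be attached to the bucket itself.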