AWS Certified Machine Learning Specialty
A Cloud Guru Quiz Questions (Level 2)
Questions And Answers 2022
You are an ML specialist within a large organization who needs to run SQL queries
and analytics on thousands of Apache log files stored in S3. Which set of tools can
help you achieve this with the LEAST amount of effort?
1-Redshift and Redshift Spectrum
2-AWS Glue Data Catalog and Athena
3-Data Pipeline and Athena
4-Data Pipeline and RDS - Correct Answer- 2 - AWS Glue Data Catalog and Athena
Answer-Using Redshift/Redshift Spectrum or Data Pipeline/RDS could work, but both
would require much more effort to set up and provision resources. With AWS Glue
you can use a crawler to crawl the log files in S3. This will create structured tables
within your AWS Glue database. These tables can then be queried using Athena.
This solution requires the least amount of effort.
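To make the workflow concrete, here is a minimal sketch using boto3. The bucket path, Glue database, crawler name, and IAM role are hypothetical placeholders, not values from the question:

```python
import boto3

# Sketch of the Glue crawler + Athena workflow; all names are illustrative.
glue = boto3.client("glue")
athena = boto3.client("athena")

# Crawl the Apache log files in S3 to populate the Glue Data Catalog.
glue.create_crawler(
    Name="apache-logs-crawler",
    Role="GlueCrawlerRole",  # assumed pre-existing IAM role
    DatabaseName="apache_logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/apache-logs/"}]},
)
glue.start_crawler(Name="apache-logs-crawler")

# Once the crawler has created the table, query it with plain SQL in Athena.
athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM apache_logs GROUP BY status",
    QueryExecutionContext={"Database": "apache_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-log-bucket/athena-results/"},
)
```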
You are an ML specialist who is setting up an ML pipeline. The amount of data you
have is massive and needs to be stored and managed on a distributed system so you
can efficiently run processing and analytics on it. You also plan to use tools like Apache
Spark to process your data to get it ready for your ML pipeline. Which setup and
services can most easily help you achieve this?
1-Redshift outperforms Apache Spark and should be used instead.
2-Multi AZ RDS Read Replicas with Apache Spark installed.
3-Self-managed cluster of EC2 instances with Apache Spark installed.
4-Elastic MapReduce (EMR) with Apache Spark installed. - Correct Answer- Answer-Amazon's EMR allows you to set up a distributed Hadoop cluster to process,
transform, and analyze large amounts of data. Apache Spark is a processing
framework and programming model that helps you do machine learning, stream
processing, or graph analytics using Amazon EMR clusters.
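As a rough illustration, a Spark-enabled EMR cluster can be launched with a single boto3 call. The release label, instance types, and role names below are placeholder assumptions:

```python
import boto3

# Sketch of launching an EMR cluster with Spark installed; values are illustrative.
emr = boto3.client("emr")

emr.run_job_flow(
    Name="ml-data-processing",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],  # install Apache Spark on the cluster
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles assumed to exist
    ServiceRole="EMR_DefaultRole",
)
```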
You have been tasked with converting multiple JSON files within an S3 bucket to
Apache Parquet format. Which AWS service can you use to achieve this with the
LEAST amount of effort?
1-Create a Data Pipeline job that reads from your S3 bucket and sends the data to
EMR. In EMR, create an Apache Spark job to process the data as Apache Parquet
and output the newly formatted files into S3.
2-Create an AWS Glue job to convert the S3 objects from JSON to Apache Parquet.
Output the newly formatted files into S3.
3-Create a Lambda function that reads all of the objects in the S3 bucket. Loop
through each of the objects and convert from JSON to Apache Parquet. Once the
conversion is complete, output the newly formatted files into S3.
4-Create an EMR cluster to run an Apache Spark job that processes the data as
Apache Parquet. Output the newly formatted files into S3. - Correct Answer- Answer-AWS Glue makes it super simple to transform data from one format to another. You
can simply create a job that takes in data defined within the Data Catalog and
outputs in any of the following formats: avro, csv, ion, grokLog, json, orc, parquet,
glueparquet, or xml.
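A minimal sketch of such a Glue ETL job is shown below, assuming a hypothetical Data Catalog database, table, and output path:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Sketch of a Glue ETL job body; database, table, and output path are placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the JSON objects that a crawler has already cataloged.
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="json_input"
)

# Write the same records back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},
    format="parquet",
)
job.commit()
```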
You are an ML specialist within a large organization who helps job seekers find both
technical and non-technical jobs. You've collected data from an engineering company's
data warehouse to determine which skills qualify job seekers for different
positions. After reviewing the data, you realize the data is biased. Why?
1-The data collected has missing values for different skills for job seekers.
2-The data collected only has a few attributes. Attributes like skills and job title are
not included in the data.
3-The data collected needs to be from the general population of job seekers, not just
from a technical engineering company.
4-The data collected is only a few hundred observations, making it biased toward a small
subset of job types. - Correct Answer- Answer-It's important to know what type of
question we are trying to answer. Since our organization helps both technical and
non-technical job seekers, gathering data only from an engineering company is
biased toward those looking for technical jobs. We need to gather data from many
different repositories, both technical and non-technical.
You have been tasked with collecting thousands of PDFs for building a large corpus
dataset. The data within this dataset would be considered what type of data?
1-Unstructured
2-Relational
3-Semi-structured
4-Structured - Correct Answer- Answer-Since PDFs have no real structure to them,
like key-value pairs or column names, they are considered unstructured data.
Your organization has given you several different sets of key-value pair JSON files
that need to be used for a machine learning project within AWS. What type of data is
this classified as, and where is the best place to load it?
1-Semi-structured data, stored in DynamoDB.
2-Structured data, stored in RDS.
3-Unstructured data, stored in S3.
4-Semi-structured data, stored in S3. - Correct Answer- Answer-Key-value pair
JSON data is considered semi-structured data because it doesn't have a rigidly defined
schema but does have some structural properties. If our data is going to be used for a
machine learning project in AWS, we need to find a way to get that data into S3.
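As a small illustrative sketch (the file, bucket, and key names are assumptions), loading the JSON files into S3 with boto3 is a one-liner per file:

```python
import boto3

# Upload a local semi-structured JSON file into S3; names are placeholders.
s3 = boto3.client("s3")
s3.upload_file("records.json", "my-ml-project-bucket", "raw/records.json")
```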
What is the most common data source you can use to pull training datasets into
Amazon SageMaker?
1-RDS
2-S3
3-DynamoDB
4-RedShift - Correct Answer- Answer-Generally, we store our training data in S3 to
use for training our model.
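For illustration, a SageMaker training job typically points its input channels at S3 URIs. The image URI, role ARN, and paths below are placeholders, not values from the question:

```python
import sagemaker
from sagemaker.estimator import Estimator

# Sketch of a SageMaker training job reading its dataset from S3.
session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-image-uri>",   # e.g. a built-in algorithm container (placeholder)
    role="<execution-role-arn>",        # placeholder IAM role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",
    sagemaker_session=session,
)

# The "train" channel reads the training dataset directly from S3.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```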
You are an ML specialist working with data that is stored in a distributed EMR cluster
on AWS. Currently, your machine learning applications are compatible with the
Apache Hive Metastore tables on EMR. You have been tasked with configuring Hive
to use the AWS Glue Data Catalog as its metastore. Before you can do this you
need to transfer the Apache Hive metastore tables into an AWS Glue Data Catalog.
Which two workflows can accomplish the requirements with the LEAST
amount of effort?
1-Create a Data Pipeline job that reads from your Apache Hive Metastore, exports
the data to an intermediate format in Amazon S3, and then imports that data into the
AWS Glue Data Catalog.
2-Create a second EMR cluster that runs an Apache Spark script to copy the Hive
metastore tables from the original EMR cluster into AWS Glue.
3-Run a Hive script on EMR that reads from your Apache Hive Metastore, exports
the data to an intermediate format in Amazon S3, and then imports that data into the
AWS Glue Data Catalog.
4-Setup your Apache Hive application with JDBC driver connections, then create a
crawler that crawls the Apache Hive Metastore using the JDBC connection and
creates an AWS Glue Data Catalog.
5-Create DMS endpoints for both the input Apache Hive Metastore and the output
data store S3 bucket, run a DMS migration to transfer the data, then create a crawler
that creates an AWS Glue Data Catalog. - Correct Answer- Answer- 3&4 - Apache
Hive supports JDBC connections that can easily be used with a crawler to create an
AWS Glue Data Catalog. The benefit of using the Data Catalog (over the Hive Metastore) is
that it provides a unified metadata repository across a variety of data sources
and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon
Redshift, Redshift Spectrum, Athena, and any application compatible with the
Apache Hive metastore. We can simply run a Hive script to query tables and output
that data in CSV (or other formats) into S3. Once that data is on S3, we can crawl it
to create a Data Catalog of the Hive Metastore or import the data directly from S3.
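Here is a rough sketch of the JDBC-crawler approach (option 4) using boto3. It assumes a Glue connection to the Hive metastore and a crawler IAM role already exist; all names are hypothetical:

```python
import boto3

# Sketch of crawling the Hive metastore over JDBC to build a Glue Data Catalog.
glue = boto3.client("glue")

glue.create_crawler(
    Name="hive-metastore-crawler",
    Role="GlueCrawlerRole",  # assumed IAM role
    DatabaseName="hive_catalog",
    Targets={
        "JdbcTargets": [
            # "hive-metastore-jdbc" is an assumed, pre-created Glue connection.
            {"ConnectionName": "hive-metastore-jdbc", "Path": "metastore/%"}
        ]
    },
)
glue.start_crawler(Name="hive-metastore-crawler")
```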
An organization needs to store a massive amount of data in AWS. The data has a
key-value access pattern, developers need to run complex SQL queries and
transactions, and the data has a fixed schema. Which type of data store meets all of
their needs?
1-S3
2-DynamoDB
3-RDS
4-Athena - Correct Answer- Answer-Amazon RDS handles all these requirements.
Transactional and SQL queries are the important terms here. Although RDS is not
typically thought of as optimized for key-value based access, using a schema with a
primary key can solve this. S3 has no fixed schema. Although Amazon DynamoDB
provides key-value access and consistent reads, it does not support complex SQL
based queries. Simple SQL queries are supported for DynamoDB via PartiQL.
Finally, Athena is used to query data on S3, so it is not a data store on AWS.
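As a small illustration of the RDS approach, a fixed-schema table keyed on a primary key supports both point lookups and arbitrary SQL. The table and column names below are made up:

```python
# Illustrative SQL for an RDS table that serves a key-value access pattern
# while still supporting complex queries; names and types are assumptions.
CREATE_TABLE_SQL = """
CREATE TABLE items (
    item_key   VARCHAR(64) PRIMARY KEY,  -- key-value style access via the primary key
    payload    JSON        NOT NULL,
    updated_at TIMESTAMP   NOT NULL
);
"""

# Point lookup by key (the key-value pattern) ...
GET_BY_KEY_SQL = "SELECT payload FROM items WHERE item_key = %s;"

# ... and an arbitrary aggregate query over the same table.
REPORT_SQL = """
SELECT DATE(updated_at) AS day, COUNT(*) AS writes
FROM items
GROUP BY DATE(updated_at)
ORDER BY day;
"""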
You are an ML specialist within a large organization who needs to run SQL queries
and analytics on thousands of Apache log files stored in S3. Your organization
already uses Redshift as their data warehousing solution. Which tool can help you
achieve this with the LEAST amount of effort?
1-Redshift Spectrum
2-Apache Hive
3-Athena
4-S3 Analytics - Correct Answer- Answer-Since the organization already uses
Redshift as their data warehouse solution, Redshift Spectrum would require less
effort than using AWS Glue and Athena.
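For reference, a sketch of the Redshift Spectrum setup looks like the following SQL, run from the existing Redshift cluster. The Glue database, IAM role ARN, and table names are placeholders:

```python
# Redshift Spectrum setup SQL, shown here as Python strings; values are placeholders.
CREATE_EXTERNAL_SCHEMA_SQL = """
CREATE EXTERNAL SCHEMA spectrum_logs
FROM DATA CATALOG
DATABASE 'apache_logs_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# The S3-backed external tables can then be queried (and joined with local
# Redshift tables) using ordinary SQL.
QUERY_SQL = """
SELECT status, COUNT(*) AS hits
FROM spectrum_logs.apache_logs
GROUP BY status;
"""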
Which Amazon service allows you to build a high-quality labeled training dataset for
your machine learning models? This includes using human workers, vendor companies
that you choose, or an internal, private workforce.
1-S3
2-Lambda
3-SageMaker Ground Truth
4-Jupyter Notebooks - Correct Answer- Answer-You could use Jupyter Notebooks or
Lambda to help automate the labeling process, but SageMaker Ground Truth is
specifically used for building high-quality training datasets.
You are trying to set up a crawler within AWS Glue that crawls your input data in S3.
For some reason after the crawler finishes executing, it cannot determine the
schema from your data and no tables are created within your AWS Glue Data
Catalog. What is the reason for these results?
1-The crawler does not have correct IAM permissions to access the input data in the
S3 bucket.
2-The checkbox for 'Do not create tables' was checked when setting up the crawler
in AWS Glue.
3-The bucket path for the input data store in S3 is specified incorrectly.
4-AWS Glue built-in classifiers could not find the input data format. You need to
create a custom classifier. - Correct Answer- Answer-AWS Glue provides built-in
classifiers for various formats, including JSON, CSV, web logs, and many database
systems. If AWS Glue cannot determine the format of your input data, you will need
to set up a custom classifier that helps the AWS Glue crawler determine the schema of
your input data.
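As an illustrative sketch, a custom classifier can be created and attached to the crawler with boto3. The grok pattern and all names here are assumptions:

```python
import boto3

# Sketch of adding a custom grok classifier and attaching it to a crawler.
glue = boto3.client("glue")

glue.create_classifier(
    GrokClassifier={
        "Classification": "apache_logs",
        "Name": "apache-combined-log",
        "GrokPattern": "%{COMBINEDAPACHELOG}",  # illustrative grok pattern
    }
)

glue.create_crawler(
    Name="logs-crawler",
    Role="GlueCrawlerRole",  # assumed IAM role
    DatabaseName="logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/custom-format/"}]},
    Classifiers=["apache-combined-log"],  # crawler tries custom classifiers first
)
```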
In general within your dataset, what is the minimum number of observations you
should have compared to the number of features?
1-10,000 times as many observations as features.
2-100 times as many observations as features.
3-10 times as many observations as features.
4-1000 times as many observations as features. - Correct Answer- Answer-We need
a large, robust, feature-rich dataset. In general, having AT LEAST 10 times as many
observations as features is a good place to start. For example, suppose we have a dataset
with the following features: id, date, full review, full review summary, and a binary
safe/unsafe tag. Since id is just an identifier, we have 4 features (date, full review,
full review summary, and a binary safe/unsafe tag). This means we need AT LEAST
40 rows/observations.
Which service built by AWS makes it easy to set up a retry mechanism, aggregate
records to improve throughput, and automatically submit CloudWatch metrics?
1-Kinesis Producer Library (KPL)
2-Kinesis API (AWS SDK)
3-Kinesis Consumer Library
4-Kinesis Client Library (KCL) - Correct Answer- Answer-Although the Kinesis API
built into the AWS SDK can be used for all of this, the Kinesis Producer Library
(KPL) makes it easy to integrate all of this into your applications.
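For contrast, here is a rough sketch of producing records with the lower-level Kinesis API via boto3; the KPL (a separate library) layers retries, record aggregation, and CloudWatch metric publishing on top of calls like this. The stream name and payloads are placeholders:

```python
import json
import boto3

# Sketch of the lower-level Kinesis Data Streams API that the KPL wraps;
# retries, aggregation, and metric publishing are your responsibility here.
kinesis = boto3.client("kinesis")

records = [
    {"Data": json.dumps({"sensor": i, "value": 42}).encode(), "PartitionKey": str(i)}
    for i in range(10)
]

response = kinesis.put_records(StreamName="my-stream", Records=records)
print("Failed records:", response["FailedRecordCount"])  # you must re-send these yourself
```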
Which service in the Kinesis family allows you to securely stream video from
connected devices to AWS for analytics, machine learning (ML), and other
processing?
1-Kinesis Firehose
2-Kinesis Streams
3-Kinesis Data Analytics
4-Kinesis Video Streams - Correct Answer- Answer-Kinesis Video Streams allows
you to stream video, images, audio, and radar data into AWS to further analyze, build
custom applications around, or store in S3.
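As a small sketch, the stream itself can be created with boto3; pushing the actual video requires a producer SDK on the device. The stream name and retention value are placeholders:

```python
import boto3

# Create a Kinesis Video Stream to receive media from connected devices.
kvs = boto3.client("kinesisvideo")

kvs.create_stream(
    StreamName="device-camera-stream",
    DataRetentionInHours=24,
)
```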
You work for a farming company that has dozens of tractors with built-in IoT
devices. These devices stream data into AWS using Kinesis Data Streams. The
features associated with the data are tractor ID, latitude, longitude, inside temp,
outside temp, and fuel level. As an ML specialist you need to transform the data and
store it in a data store. Which combination of services can you use to achieve this?
(Select 3)
1-Set up Kinesis Firehose to ingest data from Kinesis Data Streams, then send data
to Lambda. Transform the data in Lambda and write the transformed data into S3.
2-Set up Kinesis Data Analytics to ingest the data from Kinesis Data Streams, then
run real-time SQL queries on the data to transform it. After the data is transformed,
ingest the data with Kinesis Data Firehose and write the data into S3.
3-Immediately send the data to Lambda from Kinesis Data Streams. Transform the
data in Lambda and write the transformed data into S3.
4-Use Kinesis Data Streams to immediately write the data into S3. Next, set up a
Lambda function that fires any time an object is PUT onto S3. Transform the data
from the Lambda function, then write the transformed data into S3.
5-Use Kinesis Data Firehose to run real-time SQL queries to transform the data and
immediately write the transformed data into S3. - Correct Answer- Answer-Amazon
Kinesis Data Firehose can ingest streaming data from Amazon Kinesis Data
Streams, which can leverage Lambda to transform the data and load into Amazon
S3.
Amazon Kinesis Data Analytics can query, analyze and transform streaming data
from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose as a
destination for loading data into Amazon S3.
Amazon Kinesis Data Streams can ingest and store data streams for Lambda
processing, which can transform and load the data into Amazon S3.
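To illustrate the Firehose-plus-Lambda path, here is a minimal sketch of a Firehose data-transformation Lambda. The tractor field names come from the question, but the exact transformation logic is an assumption:

```python
import base64
import json

# Sketch of a Kinesis Data Firehose transformation Lambda.
def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record as base64-encoded data.
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: keep selected fields under normalized names.
        transformed = {
            "tractor_id": payload.get("tractor_id"),
            "latitude": payload.get("latitude"),
            "longitude": payload.get("longitude"),
            "inside_temp": payload.get("inside_temp"),
            "outside_temp": payload.get("outside_temp"),
            "fuel_level": payload.get("fuel_level"),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(transformed) + "\n").encode()).decode(),
        })
    return {"records": output}
```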
Continues...