AWS Certified Machine Learning Specialty
A Cloud Guru Quiz Questions (Level 2)
Questions And Answers 2022
You are an ML specialist within a large organization who needs to run SQL queries
and analytics on thousands of Apache log files stored in S3. Which set of tools can
help you achieve this with the LEAST amount of effort?
1-Redshift and Redshift Spectrum
2-AWS Glue Data Catalog and Athena
3-Data Pipeline and Athena
4-Data Pipeline and RDS - Correct Answer- 2 - AWS Glue Data Catalog and Athena
Answer-Using Redshift/Redshift Spectrum or Data Pipeline/RDS could work, but both
would require much more effort to set up and provision resources. With AWS Glue
you can use a crawler to crawl the log files in S3. This will create structured tables
within your AWS Glue database. These tables can then be queried using Athena.
This solution requires the least amount of effort.
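To make the workflow concrete, here is a minimal sketch using boto3. The bucket path, Glue database, crawler name, and IAM role are hypothetical placeholders, not values from the question:

```python
import boto3

# Sketch of the Glue crawler + Athena workflow; all names are illustrative.
glue = boto3.client("glue")
athena = boto3.client("athena")

# Crawl the Apache log files in S3 to populate the Glue Data Catalog.
glue.create_crawler(
    Name="apache-logs-crawler",
    Role="GlueCrawlerRole",  # assumed pre-existing IAM role
    DatabaseName="apache_logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/apache-logs/"}]},
)
glue.start_crawler(Name="apache-logs-crawler")

# Once the crawler has created the table, query it with plain SQL in Athena.
athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM apache_logs GROUP BY status",
    QueryExecutionContext={"Database": "apache_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-log-bucket/athena-results/"},
)
```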
You are an ML specialist who is setting up an ML pipeline. The amount of data you
have is massive and needs to be stored and managed on a distributed system so you
can efficiently run processing and analytics on it. You also plan to use tools like Apache
Spark to process your data to get it ready for your ML pipeline. Which setup and
services can most easily help you achieve this?
1-Redshift outperforms Apache Spark and should be used instead.
2-Multi AZ RDS Read Replicas with Apache Spark installed.
3-Self-managed cluster of EC2 instances with Apache Spark installed.
4-Elastic MapReduce (EMR) with Apache Spark installed. - Correct Answer- Answer-Amazon's EMR allows you to set up a distributed Hadoop cluster to process,
transform, and analyze large amounts of data. Apache Spark is a processing
framework and programming model that helps you do machine learning, stream
processing, or graph analytics using Amazon EMR clusters.
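As a rough illustration, a Spark-enabled EMR cluster can be launched with a single boto3 call. The release label, instance types, and role names below are placeholder assumptions:

```python
import boto3

# Sketch of launching an EMR cluster with Spark installed; values are illustrative.
emr = boto3.client("emr")

emr.run_job_flow(
    Name="ml-data-processing",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],  # install Apache Spark on the cluster
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles assumed to exist
    ServiceRole="EMR_DefaultRole",
)
```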
You have been tasked with converting multiple JSON files within an S3 bucket to
Apache Parquet format. Which AWS service can you use to achieve this with the
LEAST amount of effort?
1-Create a Data Pipeline job that reads from your S3 bucket and sends the data to
EMR. In EMR, create an Apache Spark job to process the data as Apache Parquet
and output the newly formatted files into S3.
2-Create an AWS Glue job to convert the S3 objects from JSON to Apache Parquet.
Output the newly formatted files into S3.
3-Create a Lambda function that reads all of the objects in the S3 bucket. Loop
through each of the objects and convert from JSON to Apache Parquet. Once the
conversion is complete, output the newly formatted files into S3.
4-Create an EMR cluster to run an Apache Spark job that processes the data as
Apache Parquet. Output the newly formatted files into S3. - Correct Answer- Answer-AWS Glue makes it super simple to transform data from one format to another. You
can simply create a job that takes in data defined within the Data Catalog and
outputs in any of the following formats: avro, csv, ion, grokLog, json, orc, parquet,
glueparquet, or xml.
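A minimal sketch of such a Glue ETL job is shown below, assuming a hypothetical Data Catalog database, table, and output path:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Sketch of a Glue ETL job body; database, table, and output path are placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the JSON objects that a crawler has already cataloged.
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="json_input"
)

# Write the same records back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet-output/"},
    format="parquet",
)
job.commit()
```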
You are an ML specialist within a large organization who helps job seekers find both
technical and non-technical jobs. You've collected data from an engineering company's
data warehouse to determine which skills qualify job seekers for different
positions. After reviewing the data, you realize the data is biased. Why?
1-The data collected has missing values for different skills for job seekers.
2-The data collected only has a few attributes. Attributes like skills and job title are
not included in the data.
3-The data collected needs to be from the general population of job seekers, not just
from a technical engineering company.
4-The data collected is only a few hundred observations, making it biased toward a small
subset of job types. - Correct Answer- Answer-It's important to know what type of
question we are trying to answer. Since our organization helps both technical and
non-technical job seekers, gathering data only from an engineering company is
biased toward those looking for technical jobs. We need to gather data from many
different repositories, both technical and non-technical.
You have been tasked with collecting thousands of PDFs for building a large corpus
dataset. The data within this dataset would be considered what type of data?
1-Unstructured
2-Relational
3-Semi-structured
4-Structured - Correct Answer- Answer-Since PDFs have no real structure to them,
like key-value pairs or column names, they are considered unstructured data.
Your organization has given you several different sets of key-value pair JSON files
that need to be used for a machine learning project within AWS. What type of data is
this classified as, and where is the best place to load it?
1-Semi-structured data, stored in DynamoDB.
2-Structured data, stored in RDS.
3-Unstructured data, stored in S3.
4-Semi-structured data, stored in S3. - Correct Answer- Answer-Key-value pair
JSON data is considered semi-structured data because it doesn't have a rigidly defined
schema but does have some structural properties. If our data is going to be used for a
machine learning project in AWS, we need to find a way to get that data into S3.
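As a small illustrative sketch (the file, bucket, and key names are assumptions), loading the JSON files into S3 with boto3 is a one-liner per file:

```python
import boto3

# Upload a local semi-structured JSON file into S3; names are placeholders.
s3 = boto3.client("s3")
s3.upload_file("records.json", "my-ml-project-bucket", "raw/records.json")
```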
What is the most common data source you can use to pull training datasets into
Amazon SageMaker?
1-RDS
2-S3
3-DynamoDB
4-RedShift - Correct Answer- Answer-Generally, we store our training data in S3 to
use for training our model.
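For illustration, a SageMaker training job typically points its input channels at S3 URIs. The image URI, role ARN, and paths below are placeholders, not values from the question:

```python
import sagemaker
from sagemaker.estimator import Estimator

# Sketch of a SageMaker training job reading its dataset from S3.
session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-image-uri>",   # e.g. a built-in algorithm container (placeholder)
    role="<execution-role-arn>",        # placeholder IAM role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",
    sagemaker_session=session,
)

# The "train" channel reads the training dataset directly from S3.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```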
You are an ML specialist working with data that is stored in a distributed EMR cluster
on AWS. Currently, your machine learning applications are compatible with the
Apache Hive Metastore tables on EMR. You have been tasked with configuring Hive
to use the AWS Glue Data Catalog as its metastore. Before you can do this you
need to transfer the Apache Hive metastore tables into an AWS Glue Data Catalog.
Which two workflows can accomplish the requirements with the LEAST
amount of effort?
1-Create a Data Pipeline job that reads from your Apache Hive Metastore, exports
the data to an intermediate format in Amazon S3, and then imports that data into the
AWS Glue Data Catalog.
2-Create a second EMR cluster that runs an Apache Spark script to copy the Hive
metastore tables from the original EMR cluster into AWS Glue.
3-Run a Hive script on EMR that reads from your Apache Hive Metastore, exports
the data to an intermediate format in Amazon S3, and then imports that data into the
AWS Glue Data Catalog.
4-Setup your Apache Hive application with JDBC driver connections, then create a
crawler that crawls the Apache Hive Metastore using the JDBC connection and
creates an AWS Glue Data Catalog.
5-Create DMS endpoints for both the input Apache Hive Metastore and the output
data store S3 bucket, run a DMS migration to transfer the data, then create a crawler
that creates an AWS Glue Data Catalog. - Correct Answer- Answer- 3&4 - Apache
Hive supports JDBC connections that can easily be used with a crawler to create an
AWS Glue Data Catalog. The benefit of using the Data Catalog (over the Hive Metastore) is
that it provides a unified metadata repository across a variety of data sources
and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon
Redshift, Redshift Spectrum, Athena, and any application compatible with the
Apache Hive metastore. We can simply run a Hive script to query tables and output
that data in CSV (or other formats) into S3. Once that data is on S3, we can crawl it
to create a Data Catalog of the Hive Metastore or import the data directly from S3.
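Here is a rough sketch of the JDBC-crawler approach (option 4) using boto3. It assumes a Glue connection to the Hive metastore and a crawler IAM role already exist; all names are hypothetical:

```python
import boto3

# Sketch of crawling the Hive metastore over JDBC to build a Glue Data Catalog.
glue = boto3.client("glue")

glue.create_crawler(
    Name="hive-metastore-crawler",
    Role="GlueCrawlerRole",  # assumed IAM role
    DatabaseName="hive_catalog",
    Targets={
        "JdbcTargets": [
            # "hive-metastore-jdbc" is an assumed, pre-created Glue connection.
            {"ConnectionName": "hive-metastore-jdbc", "Path": "metastore/%"}
        ]
    },
)
glue.start_crawler(Name="hive-metastore-crawler")
```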
An organization needs to store a massive amount of data in AWS. The data has a
key-value access pattern, developers need to run complex SQL queries and
transactions, and the data has a fixed schema. Which type of data store meets all of
their needs?
1-S3
2-DynamoDB
3-RDS
4-Athena - Correct Answer- Answer-Amazon RDS handles all these requirements.
Transactional and SQL queries are the important terms here. Although RDS is not
typically thought of as optimized for key-value based access, using a schema with a
primary key can solve this. S3 has no fixed schema. Although Amazon DynamoDB
provides key-value access and consistent reads, it does not support complex SQL
based queries. Simple SQL queries are supported for DynamoDB via PartiQL.
Finally, Athena is used to query data on S3, so it is not a data store on AWS.
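As a small illustration of the RDS approach, a fixed-schema table keyed on a primary key supports both point lookups and arbitrary SQL. The table and column names below are made up:

```python
# Illustrative SQL for an RDS table that serves a key-value access pattern
# while still supporting complex queries; names and types are assumptions.
CREATE_TABLE_SQL = """
CREATE TABLE items (
    item_key   VARCHAR(64) PRIMARY KEY,  -- key-value style access via the primary key
    payload    JSON        NOT NULL,
    updated_at TIMESTAMP   NOT NULL
);
"""

# Point lookup by key (the key-value pattern) ...
GET_BY_KEY_SQL = "SELECT payload FROM items WHERE item_key = %s;"

# ... and an arbitrary aggregate query over the same table.
REPORT_SQL = """
SELECT DATE(updated_at) AS day, COUNT(*) AS writes
FROM items
GROUP BY DATE(updated_at)
ORDER BY day;
"""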
You are an ML specialist within a large organization who needs to run SQL queries
and analytics on thousands of Apache log files stored in S3. Your organization
already uses Redshift as their data warehousing solution. Which tool can help you
achieve this with the LEAST amount of effort?
1-Redshift Spectrum
2-Apache Hive
3-Athena
4-S3 Analytics - Correct Answer- Answer-Since the organization already uses
Redshift as their data warehouse solution, Redshift Spectrum would require less
effort than using AWS Glue and Athena.
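For reference, a sketch of the Redshift Spectrum setup looks like the following SQL, run from the existing Redshift cluster. The Glue database, IAM role ARN, and table names are placeholders:

```python
# Redshift Spectrum setup SQL, shown here as Python strings; values are placeholders.
CREATE_EXTERNAL_SCHEMA_SQL = """
CREATE EXTERNAL SCHEMA spectrum_logs
FROM DATA CATALOG
DATABASE 'apache_logs_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# The S3-backed external tables can then be queried (and joined with local
# Redshift tables) using ordinary SQL.
QUERY_SQL = """
SELECT status, COUNT(*) AS hits
FROM spectrum_logs.apache_logs
GROUP BY status;
"""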
Which Amazon service allows you to build a high-quality labeled training dataset for
your machine learning models? This includes using human workers, vendor companies
that you choose, or an internal, private workforce.
1-S3
2-Lambda
3-SageMaker Ground Truth
4-Jupyter Notebooks - Correct Answer- Answer-You could use Jupyter Notebooks or
Lambda to help automate the labeling process, but SageMaker Ground Truth is
specifically used for building high-quality training datasets.
You are trying to set up a crawler within AWS Glue that crawls your input data in S3.
For some reason after the crawler finishes executing, it cannot determine the
schema from your data and no tables are created within your AWS Glue Data
Catalog. What is the reason for these results?
1-The crawler does not have correct IAM permissions to access the input data in the
S3 bucket.
2-The checkbox for 'Do not create tables' was checked when setting up the crawler
in AWS Glue.
3-The bucket path for the input data store in S3 is specified incorrectly.
4-AWS Glue built-in classifiers could not find the input data format. You need to
create a custom classifier. - Correct Answer- Answer-AWS Glue provides built-in
classifiers for various formats, including JSON, CSV, web logs, and many database
systems. If AWS Glue cannot determine the format of your input data, you will need
to set up a custom classifier that helps the AWS Glue crawler determine the schema of
your input data.
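As an illustrative sketch, a custom classifier can be created and attached to the crawler with boto3. The grok pattern and all names here are assumptions:

```python
import boto3

# Sketch of adding a custom grok classifier and attaching it to a crawler.
glue = boto3.client("glue")

glue.create_classifier(
    GrokClassifier={
        "Classification": "apache_logs",
        "Name": "apache-combined-log",
        "GrokPattern": "%{COMBINEDAPACHELOG}",  # illustrative grok pattern
    }
)

glue.create_crawler(
    Name="logs-crawler",
    Role="GlueCrawlerRole",  # assumed IAM role
    DatabaseName="logs_db",
    Targets={"S3Targets": [{"Path": "s3://my-log-bucket/custom-format/"}]},
    Classifiers=["apache-combined-log"],  # crawler tries custom classifiers first
)
```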
In general within your dataset, what is the minimum number of observations you
should have compared to the number of features?
1-10,000 times as many observations as features.
2-100 times as many observations as features.
3-10 times as many observations as features.
4-1000 times as many observations as features. - Correct Answer- Answer-We need
a large, robust, feature-rich dataset. In general, having AT LEAST 10 times as many
observations as features is a good place to start. For example, suppose we have a dataset
with the following features: id, date, full review, full review summary, and a binary
safe/unsafe tag. Since id is just an identifier, we have 4 features (date, full review,
full review summary, and a binary safe/unsafe tag). This means we need AT LEAST
40 rows/observations.
Which service built by AWS makes it easy to set up a retry mechanism, aggregate
records to improve throughput, and automatically submit CloudWatch metrics?
1-Kinesis Producer Library (KPL)
2-Kinesis API (AWS SDK)
3-Kinesis Consumer Library
4-Kinesis Client Library (KCL) - Correct Answer- Answer-Although the Kinesis API
built into the AWS SDK can be used for all of this, the Kinesis Producer Library
(KPL) makes it easy to integrate all of this into your applications.
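For contrast, here is a rough sketch of producing records with the lower-level Kinesis API via boto3; the KPL (a separate library) layers retries, record aggregation, and CloudWatch metric publishing on top of calls like this. The stream name and payloads are placeholders:

```python
import json
import boto3

# Sketch of the lower-level Kinesis Data Streams API that the KPL wraps;
# retries, aggregation, and metric publishing are your responsibility here.
kinesis = boto3.client("kinesis")

records = [
    {"Data": json.dumps({"sensor": i, "value": 42}).encode(), "PartitionKey": str(i)}
    for i in range(10)
]

response = kinesis.put_records(StreamName="my-stream", Records=records)
print("Failed records:", response["FailedRecordCount"])  # you must re-send these yourself
```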
Which service in the Kinesis family allows you to securely stream video from
connected devices to AWS for analytics, machine learning (ML), and other
processing?
1-Kinesis Firehose
2-Kinesis Streams
3-Kinesis Data Analytics
4-Kinesis Video Streams - Correct Answer- Answer-Kinesis Video Streams allows
you to stream video, images, audio, and radar data into AWS to further analyze, build
custom applications around, or store in S3.
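As a small sketch, the stream itself can be created with boto3; pushing the actual video requires a producer SDK on the device. The stream name and retention value are placeholders:

```python
import boto3

# Create a Kinesis Video Stream to receive media from connected devices.
kvs = boto3.client("kinesisvideo")

kvs.create_stream(
    StreamName="device-camera-stream",
    DataRetentionInHours=24,
)
```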
You work for a farming company that has dozens of tractors with built-in IoT
devices. These devices stream data into AWS using Kinesis Data Streams. The
features associated with the data are tractor ID, latitude, longitude, inside temp,
outside temp, and fuel level. As an ML specialist you need to transform the data and
store it in a data store. Which combination of services can you use to achieve this?
(Select 3)
1-Set up Kinesis Firehose to ingest data from Kinesis Data Streams, then send data
to Lambda. Transform the data in Lambda and write the transformed data into S3.
2-Set up Kinesis Data Analytics to ingest the data from Kinesis Data Streams, then
run real-time SQL queries on the data to transform it. After the data is transformed,
ingest the data with Kinesis Data Firehose and write the data into S3.
3-Immediately send the data to Lambda from Kinesis Data Streams. Transform the
data in Lambda and write the transformed data into S3.
4-Use Kinesis Data Streams to immediately write the data into S3. Next, set up a
Lambda function that fires any time an object is PUT onto S3. Transform the data
from the Lambda function, then write the transformed data into S3.
5-Use Kinesis Data Firehose to run real-time SQL queries to transform the data and
immediately write the transformed data into S3. - Correct Answer- Answer-Amazon
Kinesis Data Firehose can ingest streaming data from Amazon Kinesis Data
Streams, which can leverage Lambda to transform the data and load into Amazon
S3.
Amazon Kinesis Data Analytics can query, analyze and transform streaming data
from Amazon Kinesis Data Streams and use Amazon Kinesis Data Firehose as a
destination for loading data into Amazon S3.
Amazon Kinesis Data Streams can ingest and store data streams for Lambda
processing, which can transform and load the data into Amazon S3.
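To illustrate the Firehose-plus-Lambda path, here is a minimal sketch of a Firehose data-transformation Lambda. The tractor field names come from the question, but the exact transformation logic is an assumption:

```python
import base64
import json

# Sketch of a Kinesis Data Firehose transformation Lambda.
def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record as base64-encoded data.
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: keep selected fields under normalized names.
        transformed = {
            "tractor_id": payload.get("tractor_id"),
            "latitude": payload.get("latitude"),
            "longitude": payload.get("longitude"),
            "inside_temp": payload.get("inside_temp"),
            "outside_temp": payload.get("outside_temp"),
            "fuel_level": payload.get("fuel_level"),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(transformed) + "\n").encode()).decode(),
        })
    return {"records": output}
```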
Continues...