A company needs to set up a data catalog and metadata management for data sources that run in the AWS
Cloud. The company will use the data catalog to maintain the metadata of all the objects that are in a set of
data stores. The data stores include structured sources such as Amazon RDS and Amazon Redshift. The data
stores also include semistructured sources such as JSON files and .xml files that are stored in Amazon S3.
The company needs a solution that will update the data catalog on a regular basis. The solution also must
detect changes to the source metadata.
Which solution will meet these requirements with the LEAST operational overhead?
A data engineer wants to orchestrate a set of extract, transform, and load (ETL) jobs that run on AWS. The ETL jobs contain tasks that must run Apache Spark jobs on Amazon EMR, make API calls to Salesforce, and load data into Amazon Redshift. The ETL jobs need to handle failures and retries automatically. The data engineer needs to use Python to orchestrate the jobs.
Which service will meet these requirements?
A data engineering team plans to sample a large dataset stored in Amazon S3 for exploratory data analysis. The goal is to understand the overall trends without processing the entire dataset. The dataset is structured and spread across multiple S3 objects.
Which AWS service and method should the team use to implement an efficient and cost-effective sampling strategy?
A company is migrating its database servers from Amazon EC2 instances that run Microsoft SQL Server to
Amazon RDS for Microsoft SQL Server DB instances. The company's analytics team must export large data
elements every day until the migration is complete. The data elements are the result of SQL joins across
multiple tables. The data must be in Apache Parquet format. The analytics team must store the data in Amazon
S3.
Which solution will meet these requirements in the MOST operationally efficient way?
A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company
has an ecommerce application that generates a dataset that contains personally identifiable information (PII).
The company has an internal analytics application that does not require access to the PII.
To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to
implement a solution that will redact PII dynamically, based on the needs of each application that accesses the
dataset.
Which solution will meet the requirements with the LEAST operational overhead?