Tech ONTAP Blogs

Use Cloudera Hadoop S3A connector with StorageGRID

angelacheng
NetApp
1,506 Views

Hadoop has been a favorite of data scientist for some time now. Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programing frameworks. Hadoop was designed to scale up from single servers to thousands of machines, with each machine possessing local compute and storage. 

 

Why use S3A for Hadoop workflows?

As the volume of data has grown over time, the approach of adding new machines with their own compute and storage has become inefficient. Scaling linearly crates challenges for using resources efficiently and managing the infrastructure.

 

To address these challenges, the Hadoop S3A client offers high-performance I/O against S3 object storage. Implementing a Hadoop workflow with S3A helps you leverage object storage as a data repository and enables you to separate compute and storage, which in turn enables you to scale compute and storage independently. Decoupling compute and storage also enable you to dedicate the right amount of resources for your compute jobs and provide capacity based on the size of your data set. Therefore, you can reduce your overall TCO for Hadoop workflows.

 

Configure S3A connector to use StorageGRID

 

Prerequisites

  • A StorageGRID S3 endpoint URL, a tenant s3 access key, and a secret key for Hadoop S3A connection testing.
  • A Cloudera cluster and root or sudo permission to each host in the cluster to install the Java package.

As of April 2022, Java 11.0.14 with Cloudera 7.1.7 was tested against StorageGRID 11.5 and 11.6. However, the Java version number might be different at the time of a new install.

 

Install Java package

  • Check the Cloudera support matrix for the supported JDK version.
  • Download the Java 11.x package that matches the Cloudera cluster operating system. Copy this package to each host in the cluster. In this example, the rpm package is used for CentOS.
  • Log into each host as root or using an account with sudo permission. Perform the following steps on each host.
  1. Install the package

$ sudo rpm -Uvh jdk-11.0.14_linux-x64_bin.rpm

 

  1. Check where Java is installed. If multiple versions are installed, set the newly installed version as default.

alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command

-----------------------------------------------

 +1           /usr/java/jre1.8.0_291-amd64/bin/java

  2           /usr/java/jdk-11.0.14/bin/java

Enter to keep the current selection[+], or type selection number: 2

 

  1. Add this line to the end of /etc/profile. The path should match the path of above selection.

export JAVA_HOME=/usr/java/jdk-11.0.14

 

  1. Run the following command for the profile to take effect.

source /etc/profile

 

Cloudera HDFS S3A configuration

 

Steps

  • From the Cloudera Manager GUI, select Clusters > HDFS, and select Configuration.
  • Under CATEGORY, select Advanced, and scroll down to locate `Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
  • Click the (+) sign and add following value pairs.

fs.s3a.access.key

<S3 tenant key from SG>

fs.s3a.secret.key

<S3 tenant key from SG>

fs.s3a.buffer.dir 

/tmp/   (optional)

fs.s3a.connection.ssl.enabled

[true|false]  (default is https if this entry is missing)

fs.s3a.endpoint

<SG LB endpoint:port>

fs.s3a.impl

org.apache.hadoop.fs.s3a.S3AFileSystem

fs.s3a.path.style.access

[true|false] (default is virtual host style if this entry is missing)

 

HDFS S3A configuration sample screenshot

HDFS S3A configuration sample screenshotHDFS S3A configuration sample screenshot

 

  • Click the Save Changes button. Select the Stale Configuration icon from the HDFS menu bar, select Restart Stale Services on the next page, and select Restart Now.

 

HDFS stale configuration restartHDFS stale configuration restart

 

Test S3A connection to StorageGRID

 

Perform basic connection test

Log into one of the hosts in the Cloudera cluster, and enter this s3 list command

hadoop fs -ls s3a://<bucket-name>/

 

The following example uses path syle with a pre-existing hdfs-test bucket and a test object.

[root@ce-n1 ~]# hadoop fs -ls s3a://hdfs-test/

22/02/15 18:24:37 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

22/02/15 18:24:37 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).

22/02/15 18:24:37 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started

22/02/15 18:24:37 INFO Configuration.deprecation: No unit for fs.s3a.connection.request.timeout(0) assuming SECONDS

Found 1 items

-rw-rw-rw-   1 root root       1679 2022-02-14 16:03 s3a://hdfs-test/test

22/02/15 18:24:38 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...

22/02/15 18:24:38 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.

22/02/15 18:24:38 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

 

 

Troubleshooting

Scenario 1

Use an HTTPS connection to StorageGRID and get a 'handshake_failure' error after a 15-minute timeout.

Reason: Old JRE/JDK version using outdated or unsupported TLS cipher suite for connection to StorageGRID.

Sample error message

[root@ce-n1 ~]# hadoop fs -ls s3a://hdfs-test/

22/02/15 18:52:34 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

22/02/15 18:52:34 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).

22/02/15 18:52:34 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started

22/02/15 18:52:35 INFO Configuration.deprecation: No unit for fs.s3a.connection.request.timeout(0) assuming SECONDS

22/02/15 19:04:51 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...

22/02/15 19:04:51 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.

22/02/15 19:04:51 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

22/02/15 19:04:51 WARN fs.FileSystem: Failed to initialize fileystem s3a://hdfs-test/: org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExistV2 on hdfs: com.amazonaws.SdkClientException: Unable to execute HTTP request: Received fatal alert: handshake_failure: Unable to execute HTTP request: Received fatal alert: handshake_failure

ls: doesBucketExistV2 on hdfs: com.amazonaws.SdkClientException: Unable to execute HTTP request: Received fatal alert: handshake_failure: Unable to execute HTTP request: Received fatal alert: handshake_failure

 

 

Resolution: Make sure that JDK 11.x or later is installed and set to default the Java library.  Refer to the Install Java package section for more information.

 

Scenario 2:

Failed to connect to StorageGRID with error message 'Unable to find valid certification path to requested target'.

Reason: StorageGRID S3 endpoint server certificate is not trusted by Java program.

Sample error message:

[root@hdp6 ~]# hadoop fs -ls s3a://hdfs-test/

22/03/11 20:58:12 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

22/03/11 20:58:13 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).

22/03/11 20:58:13 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started

22/03/11 20:58:13 INFO Configuration.deprecation: No unit for fs.s3a.connection.request.timeout(0) assuming SECONDS

22/03/11 21:12:25 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...

22/03/11 21:12:25 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.

22/03/11 21:12:25 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

22/03/11 21:12:25 WARN fs.FileSystem: Failed to initialize fileystem s3a://hdfs-test/: org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExistV2 on hdfs: com.amazonaws.SdkClientException: Unable to execute HTTP request: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target: Unable to execute HTTP request: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

 

Resolution: NetApp recommends using a server certificate issued by a known public certificate signing authority to make sure that the authentication is secure. Alternatively, add a custom CA or server certificate to the Java trust store.

 

Complete the following steps to add a StorageGRID custom CA or server certificate to the Java trust store.

  1. Backup the existing default Java cacerts file.

cp -ap $JAVA_HOME/lib/security/cacerts $JAVA_HOME/lib/security/cacerts.orig

 

  1. Import the StorageGRID S3 endpoint cert to the Java trust store.

keytool -import -trustcacerts -keystore $JAVA_HOME/lib/security/cacerts -storepass changeit -noprompt -alias sg-lb -file <StorageGRID CA or server cert in pem format>

Troubleshooting tips

  1. Increase the hadoop log level to DEBUG.

export HADOOP_ROOT_LOGGER=hadoop.root.logger=DEBUG,console

 

  1. Execute the command and direct the log messages to error.log.

hadoop fs -ls s3a://_<bucket-name>_/ &>error.log

 

 

 

Public