Access to Azure Data Lake Storage Gen2 with Hadoop

Hadoop FileSystem is one of the primary access methods for data in Azure Data Lake Storage Gen2. Users of Azure Blob Storage can access ADLS Gen2 through a new driver: the Azure Blob File System driver, known as ABFS.

The ABFS driver is part of Apache Hadoop.

How do you set it up?

1. Download the newest Hadoop release (the ABFS driver doesn't work with 3.1.2), extract it, and move it to /opt/:
curl -O http://ftp.man.poznan.pl/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
tar -zxf hadoop-3.2.0.tar.gz
sudo mv hadoop-3.2.0 /opt/

2. You also need the newest Java, which you can install with yum or another package manager:

yum install -y java-1.8.0-openjdk

3. Add the following to your .bashrc file and log out and back in:

export HADOOP_HOME=/opt/hadoop-3.2.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_SHARE=$HADOOP_HOME/share/hadoop
export HADOOP_CLASSPATH=$(find $HADOOP_SHARE -name '*.jar' | xargs echo | tr ' ' ':')
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
export PATH=$PATH:$HADOOP_HOME/bin

The classpath has to be built with the find command because HDFS needs every JAR and cannot attach libraries recursively.
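The effect of that find-based classpath can be sketched with dummy JAR files (the directory and file names below are made up for illustration; a sort is added so the order is predictable):

```shell
# Simulate a Hadoop share directory with jars nested at different depths
mkdir -p /tmp/hadoop-share-demo/common/lib /tmp/hadoop-share-demo/hdfs
touch /tmp/hadoop-share-demo/common/lib/a.jar /tmp/hadoop-share-demo/hdfs/b.jar

# Same construction as in .bashrc: list every jar, then join the paths with ':'
HADOOP_CLASSPATH=$(find /tmp/hadoop-share-demo -name '*.jar' | sort | xargs echo | tr ' ' ':')
echo "$HADOOP_CLASSPATH"
# → /tmp/hadoop-share-demo/common/lib/a.jar:/tmp/hadoop-share-demo/hdfs/b.jar
```

With the real $HADOOP_SHARE directory, the same pipeline produces one long colon-separated list covering every JAR under share/hadoop, which is what the JVM expects in a classpath.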

4. There is more than one way to authorize access to the storage. All authorization settings are configured in core-site.xml ($HADOOP_HOME/etc/hadoop/):

SharedKey:

<configuration>
        <property>
                <name>fs.azure.account.key.[ADLSgen2-name].dfs.core.windows.net</name>
                <value>[SharedKey]</value>
        </property>
</configuration>

OAuth 2.0 Client Credentials:

<configuration>
        <property>
                <name>fs.azure.account.auth.type</name>
                <value>OAuth</value>
                <description>Use OAuth authentication</description>
        </property>
        <property>
                <name>fs.azure.account.oauth.provider.type</name>
                <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
                <description>Use client credentials</description>
        </property>
        <property>
                <name>fs.azure.account.oauth2.client.endpoint</name>
                <value>https://login.microsoftonline.com/[ServicePrincipal-TenantID]/oauth2/token</value>
                <description>URL of OAuth endpoint</description>
        </property>
        <property>
                <name>fs.azure.account.oauth2.client.id</name>
                <value>[ServicePrincipal-ID]</value>
                <description>Client ID</description>
        </property>
        <property>
                <name>fs.azure.account.oauth2.client.secret</name>
                <value>[ServicePrincipal-Secret]</value>
                <description>Secret</description>
        </property>
</configuration>
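If the cluster talks to more than one storage account, the ABFS driver also lets you scope each of the properties above to a single account by appending the account's host name to the property name, as the SharedKey example already does. A sketch for one account (verify the exact property names against the ABFS documentation for your Hadoop version):

```xml
<property>
        <name>fs.azure.account.auth.type.[ADLSgen2-name].dfs.core.windows.net</name>
        <value>OAuth</value>
        <description>Use OAuth only for this storage account</description>
</property>
```

The unsuffixed properties then act as the default for all other accounts.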

5. Create your first directory with the hdfs command-line tool:

hdfs dfs -mkdir abfss://[FileSystem]@[YourDataLake].dfs.core.windows.net/[new-directory]
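The abfss URI follows the pattern filesystem@account.dfs.core.windows.net/path. A small sketch of how the pieces fit together, with made-up names in place of the bracketed placeholders:

```shell
# Hypothetical names for illustration only
FILESYSTEM=myfs          # the ADLS Gen2 filesystem (container)
ACCOUNT=mydatalake       # the storage account name
DIR=new-directory        # the directory to create

# Build the abfss URI in the shape the hdfs command expects
URI="abfss://${FILESYSTEM}@${ACCOUNT}.dfs.core.windows.net/${DIR}"
echo "$URI"
# → abfss://myfs@mydatalake.dfs.core.windows.net/new-directory

# hdfs dfs -mkdir "$URI"   # run this once core-site.xml is configured
```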

This is a simple way to access ADLS Gen2 files.

If you have a different solution, I invite you to share it with the Microsoft Azure User Group Poland.