Table Of Contents¶
- Connecting to Azure Storage from Spark: Methods and Code Samples
- Overview
- Method 1: ABFS Driver for ADLS Gen2
- Method 2: Managed Identity for Azure-hosted Spark
- Method 3: Azure Blob Storage with Access Key
- Method 4: Shared Access Signature (SAS)
- Method 5: Environment Variables/Secrets
Connecting to Azure Storage from Spark: Methods and Code Samples¶
In this guide, I will walk you through the different methods of connecting Apache Spark to Azure Storage services, including Azure Blob Storage and Azure Data Lake Storage. You will learn how to configure your Spark session to read data from, and write data to, Azure Storage using various authentication methods.
Overview¶
Connecting Apache Spark to Azure Storage can be achieved through several methods, each suited to different scenarios and security requirements. We will cover the ABFS driver for Azure Data Lake Storage Gen2, managed identities for Azure-hosted Spark, account access keys, shared access signatures, and credentials supplied through environment variables or secrets.
Method 1: ABFS Driver for ADLS Gen2¶
The ABFS (Azure Blob File System) driver is designed specifically for Azure Data Lake Storage Gen2 and supports OAuth 2.0 authentication, providing a secure method to access your data.
Sample Code for ABFS Driver:¶
Here is sample code for connecting with OAuth authentication and a service principal. The code requires the Hadoop Azure storage JARs, which need to be downloaded separately.
from pyspark.sql import SparkSession
# Replace with your Azure Storage account information
storage_account_name = "your_storage_account_name"
client_id = "your_client_id_of_the_registered_app"
client_secret = "your_client_secret_of_the_registered_app"
tenant_id = "your_tenant_id_of_the_registered_app"
# Adjust the JAR paths and versions below to match your installation
spark = SparkSession.builder \
    .appName("Any_App_Name") \
    .config("spark.jars",
            "/usr/local/lib/python3.8/dist-packages/pyspark/jars/hadoop-azure-3.3.3.jar,"
            "/usr/local/lib/python3.8/dist-packages/pyspark/jars/hadoop-azure-datalake-3.3.3.jar,"
            "/usr/local/lib/python3.8/dist-packages/pyspark/jars/hadoop-common-3.3.3.jar") \
    .config(f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net", "OAuth") \
    .config(f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") \
    .config(f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net", client_id) \
    .config(f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net", client_secret) \
    .config(f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token") \
    .getOrCreate()
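With the session configured, reads and writes go through abfss:// URIs. Below is a minimal usage sketch; the container name data and the file paths are hypothetical, so substitute your own:
# Hypothetical container and paths -- replace with your own
container_name = "data"
base_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net"
# Read a CSV file from the container
df = spark.read.csv(f"{base_path}/input/sales.csv", header=True)
df.show(5)
# Write it back out as Parquet
df.write.mode("overwrite").parquet(f"{base_path}/output/sales_parquet")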
Method 2: Managed Identity for Azure-hosted Spark¶
For Spark clusters hosted on Azure, a managed identity offers a way to access Azure services securely without storing credentials in your code. Managed identity authentication works through OAuth with the ABFS driver, so it uses the dfs.core.windows.net endpoint.
Sample Code for Managed Identity:¶
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Azure Storage with Managed Identity") \
    .config("fs.azure.account.auth.type.<storage_account_name>.dfs.core.windows.net", "OAuth") \
    .config("fs.azure.account.oauth.provider.type.<storage_account_name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider") \
    .getOrCreate()
# For a user-assigned managed identity, also set
# fs.azure.account.oauth2.client.id.<storage_account_name>.dfs.core.windows.net
# to the identity's client ID.
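Data access then looks the same as in Method 1. A short sketch, assuming a hypothetical container named data holding Parquet files under events/:
# Hypothetical container and path -- replace with your own
df = spark.read.parquet("abfss://data@<storage_account_name>.dfs.core.windows.net/events/")
df.printSchema()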
Method 3: Azure Blob Storage with Access Key¶
Using the Azure Blob Storage access key is the most straightforward way to establish a connection, but it is less secure than OAuth 2.0 or managed identities because the key grants full access to the storage account.
Sample Code for Access Key:¶
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Azure Blob Access with Access Key") \
    .config("fs.azure.account.key.<storage_account_name>.blob.core.windows.net", "<access_key>") \
    .getOrCreate()
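This configuration uses the WASB driver, so paths use the wasbs:// scheme against the blob endpoint. A minimal read sketch, assuming a hypothetical container named data with a CSV file at its root:
# Hypothetical container and file -- replace with your own
df = spark.read.csv("wasbs://data@<storage_account_name>.blob.core.windows.net/input.csv", header=True)
df.show(5)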
Method 4: Shared Access Signature (SAS)¶
Shared Access Signatures (SAS) provide a secure way to grant limited access to your Azure Storage resources without exposing your account key.
Sample Code for SAS:¶
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Azure Blob Access with SAS") \
    .config("fs.azure.sas.<container_name>.<storage_account_name>.blob.core.windows.net", "<sas_token>") \
    .getOrCreate()
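Note that this configuration key scopes the token to a single container, which is part of the appeal: the SAS can grant, say, read-only access that expires. Reading then works just as with an access key; a short sketch with hypothetical names:
# Hypothetical container and path -- replace with your own
df = spark.read.json("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/logs/")
df.show(5)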
Method 5: Environment Variables/Secrets¶
For an extra layer of security, use environment variables or a secret scope to manage your credentials, keeping them out of your code base.
Sample Code for Using Environment Variables/Secrets:¶
import os
from pyspark.sql import SparkSession
# Assume the environment variables or secrets are already set
storage_account_name = os.getenv('STORAGE_ACCOUNT_NAME')
container_name = os.getenv('CONTAINER_NAME')
sas_token = os.getenv('SAS_TOKEN')
spark = SparkSession.builder \
    .appName("Azure Blob Access with Environment Variables") \
    .config(f"fs.azure.sas.{container_name}.{storage_account_name}.blob.core.windows.net", sas_token) \
    .getOrCreate()
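With the variables in place, reads work exactly as in Method 4. A sketch assuming a hypothetical CSV file named input.csv at the container root; in practice the variables would be injected by your scheduler or secret manager rather than exported by hand:
# Hypothetical file name -- replace with your own
df = spark.read.csv(
    f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/input.csv",
    header=True,
)
df.show(5)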