Table of contents
  1. Background
    1. If you have Synapse, why not save the data in Synapse SQL warehouse?
    2. Give me some real examples
      1. Removing Patient’s Identity and Sharing Their Data as ADLS Tables
      2. Inlet Wants to Store Chip Sensor Data and Share It as Tables for University Students
  2. CETAS Scenarios
    1. Pull Data from Synapse Warehouse and Put It in ADLS CETAS Tables
      1. Get the access sorted
      2. Custom DB to store connection info - Serverless SQL only
      3. Pull from Warehouse & Put in ADLS as CETAS
    2. Pull Data from ADLS and Put It in ADLS as CETAS Tables
      1. Get the access sorted
      2. Pull from ADLS & Put in ADLS as CETAS
  3. Service Principal Method and SAS Methods
    1. Service Principal Authentication Setup
    2. Shared Access Signature (SAS) Setup
  4. Alternative to CETAS?
    1. 1. Azure Synapse Spark Pools
    2. 2. Azure Data Factory (ADF)
    3. 3. PolyBase
    4. Let’s test our knowledge

Background

CETAS (CREATE EXTERNAL TABLE AS SELECT) is very similar to the SQL CREATE TABLE AS SELECT command. But here the table is created in Azure Data Lake and is called an External Table. External means it's not stored inside the Synapse SQL data warehouse (fancy name: Dedicated SQL Pool).
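
In its simplest shape, a CETAS statement looks like the sketch below. The data source, file format, and table names here are placeholders; how to define them properly is covered step by step later in this article.

-- Minimal CETAS sketch (all object names are placeholders)
CREATE EXTERNAL TABLE MyExternalTable
WITH (
    LOCATION = '/myfolder/',        -- folder in the data lake where the files are written
    DATA_SOURCE = MyADLS,           -- connection to the ADLS account
    FILE_FORMAT = ParquetFormat     -- file format used for the exported files, e.g. Parquet
)
AS
SELECT * FROM SomeSourceTable;      -- the query whose result becomes the external table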

If you have Synapse, why not save the data in Synapse SQL warehouse?

Because Synapse SQL warehouse storage is expensive and not meant for everyone to access.


Give me some real examples

Removing Patient’s Identity and Sharing Their Data as ADLS Tables

A hospital wants to share patient data stored in its Synapse SQL warehouse with researchers. Will they let the entire world access their Synapse warehouse? No way, that would land them in jail. Patient data can't be shared directly. So, they strip out the identifying details and create CETAS tables in Azure Data Lake from the data in their SQL warehouse.

Benefits: Avoids jail time and helps the research world make new medicines.
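
As a rough sketch of what that CETAS could look like (the table and column names are hypothetical, and hashing the patient ID is just one possible de-identification approach):

-- Hypothetical example: export only de-identified columns to the data lake
CREATE EXTERNAL TABLE PatientsDeidentified
WITH (
    LOCATION = '/research/patients/',
    DATA_SOURCE = MyADLS,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT
    HASHBYTES('SHA2_256', CAST(PatientId AS VARCHAR(20))) AS PatientKey, -- pseudonymous key
    Diagnosis,
    TreatmentDate,
    Age
FROM dbo.Patients;  -- identity columns (name, address, phone) are simply never selected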


Inlet Wants to Store Chip Sensor Data and Share It as Tables for University Students

The Inlet company collects a huge amount of chip sensor data in ADLS. An external university research team wants to analyze it. The in-house team connects to Synapse and uses PySpark to clean the data. Then, using Synapse serverless SQL, they create neat, report-ready CETAS tables in ADLS. The university team is given access to these tables only, keeping Inlet's internal Synapse warehouse out of the entire process.

Hence, CETAS is an important command for creating external tables and storing the data from your SQL query permanently as tables in ADLS.

CETAS Scenarios

Pull Data from Synapse Warehouse and Put It in ADLS CETAS Tables

Suppose you are a data engineer with access to both the Azure Data Lake Gen2 account and the Synapse workspace. The patient records are present in the SQL warehouse, and you want to export this data into ADLS as a table.

Get the access sorted

  1. Enable Managed Identity for Synapse Workspace:

    Here, we will use Managed Identity. Why? Because it’s the simplest, and Azure handles everything for us. But there are other options for access too, like the Service Principal method and the SAS method.

    • Go to your Synapse workspace in the Azure portal.
    • Under the “Identity” section, ensure the “System-assigned managed identity” is enabled.
  2. Grant Access to Managed Identity on ADLS Gen2:

    • Go to your ADLS Gen2 account in the Azure portal.
    • Navigate to the “Access Control (IAM)” section.
    • Click on “Add role assignment.”
    • Assign the role “Storage Blob Data Contributor” to the managed identity of your Synapse workspace.

Custom DB to store connection info - Serverless SQL only

When using CETAS with a serverless SQL pool, don’t use the built-in database for connection info, credentials, or file formats. Instead, create a new database to keep things organized. For a dedicated SQL pool, you can use the built-in database.

  1. Create Custom Database:

    CREATE DATABASE MyCustDbForCETASInfo;
    
  2. Use Custom Database:

    USE MyCustDbForCETASInfo;
    

Pull from Warehouse & Put in ADLS as CETAS

Next, create an external table in ADLS Gen2 using managed identity for authentication.

  1. Create Database Scoped Credential:

    USE MyCustDbForCETASInfo;
    
    -- A database master key must exist before a scoped credential can be created.
    -- If the database doesn't have one yet:
    -- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<a strong password>';
    CREATE DATABASE SCOPED CREDENTIAL MyADLSCredential
    WITH
        IDENTITY = 'Managed Identity';
    
  2. Create External Data Source:

    CREATE EXTERNAL DATA SOURCE MyADLS
    WITH (
        TYPE = HADOOP,
        LOCATION = 'abfss://myfilesystem@myadlsaccount.dfs.core.windows.net',
        CREDENTIAL = MyADLSCredential
    );
    
  3. Create External File Format:

    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (
        FORMAT_TYPE = PARQUET
    );
    
  4. Create and fill the CETAS Table:

    CREATE EXTERNAL TABLE SalesDataExternal
    WITH (
        LOCATION = '/salesdata/',
        DATA_SOURCE = MyADLS,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT * FROM SalesData;
    
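Once the CETAS statement finishes, the Parquet files physically sit under the /salesdata/ folder in ADLS, and the external table can be queried with plain T-SQL like any other table:

-- Query the exported data through the external table
SELECT TOP 10 *
FROM SalesDataExternal;

Note that external tables created this way are effectively read-only; to refresh the data you typically drop the external table, remove the underlying files, and run the CETAS again.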

Pull Data from ADLS and Put It in ADLS as CETAS Tables

As in the Inlet company case study, let's say your data is a large collection of .parquet files. You want to clean this data and create tables in ADLS itself using Synapse. To do this, follow these steps:

Get the access sorted

The managed identity setup has been left out here, since it was already explained in the previous scenario.

Pull from ADLS & Put in ADLS as CETAS

  1. Create Database Scoped Credential:

    CREATE DATABASE SCOPED CREDENTIAL MyADLSCredential
    WITH
    IDENTITY = 'Managed Identity';
    
  2. Create External Data Source:

    -- Create an external data source for the Azure storage account
    CREATE EXTERNAL DATA SOURCE MyADLS
    WITH (
        TYPE = HADOOP, -- For a dedicated SQL pool; serverless SQL pools simply omit the TYPE option
        LOCATION = 'abfss://myfilesystem@myadlsaccount.dfs.core.windows.net',
        CREDENTIAL = MyADLSCredential
    );
    
  3. Create External File Format:

    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (
        FORMAT_TYPE = PARQUET
    );
    
  4. Query Data from ADLS:

    SELECT *
    FROM OPENROWSET(
        BULK 'abfss://myfilesystem@myadlsaccount.dfs.core.windows.net/path/to/data/',
        FORMAT = 'PARQUET'
    ) AS [result];
    
  5. Create the CETAS Table:

    CREATE EXTERNAL TABLE ProcessedDataExternal
    WITH (
        LOCATION = '/processeddata/',
        DATA_SOURCE = MyADLS,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT * FROM OPENROWSET(
        BULK 'abfss://myfilesystem@myadlsaccount.dfs.core.windows.net/path/to/data/',
        FORMAT = 'PARQUET'
    ) AS [result];
    
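A small but useful detail: the BULK path in OPENROWSET accepts wildcards, so one query can read many files or folders at once (the path here is hypothetical):

-- Read every Parquet file under the folder, using * as a wildcard
SELECT *
FROM OPENROWSET(
    BULK 'abfss://myfilesystem@myadlsaccount.dfs.core.windows.net/path/to/data/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];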

Service Principal Method and SAS Methods

We used managed identity in our examples. Apart from Managed Identity, you have a couple of other options for connecting Synapse to Azure Data Lake Storage (ADLS):

  1. Service Principal Authentication: A service principal is like a special user for applications to access Azure resources. You create a service principal and give it the needed permissions on ADLS. It's like a functional ID: an identity used by an application rather than a person.

  2. Shared Access Signature (SAS): A Shared Access Signature (SAS) allows limited access to your storage account for a specific time and with specific permissions.

Here’s how you can set up each method:

Service Principal Authentication Setup

  1. Create a Service Principal: You can create a service principal using the Azure portal, Azure CLI, or PowerShell. Here is an example using Azure CLI:
      az ad sp create-for-rbac --name <service-principal-name> --role "Storage Blob Data Contributor" --scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
    
  2. Grant Access to the Service Principal on ADLS Gen2: Assign the necessary role to the service principal:
      az role assignment create --assignee <appId> --role "Storage Blob Data Contributor" --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
    
  3. Configure the External Data Source in Synapse: Use the service principal credentials in your SQL script:
      CREATE DATABASE SCOPED CREDENTIAL MyADLSCredential
      WITH
      IDENTITY = 'service-principal-id',        -- the service principal's application (client) ID
      SECRET = 'service-principal-password';    -- the client secret generated for it
    
      CREATE EXTERNAL DATA SOURCE MyADLS
      WITH (
          TYPE = HADOOP,
          LOCATION = 'abfss://myfilesystem@myadlsaccount.dfs.core.windows.net',
          CREDENTIAL = MyADLSCredential
      );
    

Shared Access Signature (SAS) Setup

  1. Generate a SAS Token: You can generate a SAS token through the Azure portal, Azure Storage Explorer, Azure CLI, or programmatically using Azure Storage SDKs. Here is an example using Azure CLI:
      az storage account generate-sas --permissions rwdlacup --account-name <storage-account> --services b --resource-types co --expiry <expiry-date>
    
  2. Configure the External Data Source in Synapse: Use the SAS token in your SQL script:
      CREATE DATABASE SCOPED CREDENTIAL MyADLSSASCredential
      WITH
      IDENTITY = 'SHARED ACCESS SIGNATURE',
      SECRET = 'sas-token'; -- paste the SAS token without the leading '?'
    
      CREATE EXTERNAL DATA SOURCE MyADLS
      WITH (
          TYPE = HADOOP,
          LOCATION = 'abfss://myfilesystem@myadlsaccount.dfs.core.windows.net',
          CREDENTIAL = MyADLSSASCredential
      );
    

Managed Identity is usually recommended because it’s easy to use and secure, but Service Principal and SAS can be useful in certain situations where Managed Identity might not work.

Alternative to CETAS?

1. Azure Synapse Spark Pools

Synapse comes with Spark capability and PySpark notebooks, so you can use Spark pools directly inside Synapse. For huge volumes of data, this can be more efficient than pulling the data through JDBC.

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("SynapseSparkPoolExample") \
    .getOrCreate()

# Read data from ADLS
df = spark.read \
    .format("parquet") \
    .load("abfss://myfilesystem@myadlsaccount.dfs.core.windows.net/path/to/data/")

# Process the data
processed_df = df.filter(df['SalesDate'] >= '2023-01-01')

# Write the processed data back to ADLS
processed_df.write \
    .mode("overwrite") \
    .format("parquet") \
    .save("abfss://myfilesystem@myadlsaccount.dfs.core.windows.net/processeddata/")

# Register the processed Parquet files as an external (unmanaged) table in the metastore
spark.sql("CREATE TABLE IF NOT EXISTS SalesDataExternal USING parquet LOCATION 'abfss://myfilesystem@myadlsaccount.dfs.core.windows.net/processeddata/'")

2. Azure Data Factory (ADF)

ADF is one of the core components of Synapse (it shows up there as Synapse Pipelines). Its main job is to transfer data from one place to another. This can be a preferred option when you want to avoid lengthy coding. Typical steps to perform the activity would be:

  • Create a pipeline: This is the workflow for the entire activity.
  • Use Copy Activity: This is the activity that actually copies the data, and it is used very frequently in pipelines.

3. PolyBase

PolyBase is a data virtualization technology in Microsoft SQL Server. It allows you to query external data as if it were part of the database.

When we created CETAS tables and the other external objects in Synapse, we were already using PolyBase technology under the hood.
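
That is what "querying external data as if it were part of the database" means in practice: once the external table exists, it can be joined with ordinary tables in the same query. A hedged sketch (the regular table dbo.Products and the column names are hypothetical):

-- Join an external (ADLS-backed) table with a regular warehouse table
SELECT p.ProductName, SUM(s.Amount) AS TotalSales
FROM SalesDataExternal AS s
JOIN dbo.Products AS p ON p.ProductId = s.ProductId
GROUP BY p.ProductName;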

Let’s test our knowledge

  1. What function is used to read the data in files stored in a data lake?
    • FORMAT
    • ROWSET
    • OPENROWSET

    Answer: OPENROWSET

  2. Which character in a file path can be used as a wildcard to select all files/folders that match the rest of the path?

    • &
    • *
    • /

    Answer: *

  3. Which external database object encapsulates the connection information to a file location in a data lake store?

    • FILE FORMAT
    • DATA SOURCE
    • EXTERNAL TABLE

    Answer: DATA SOURCE
