Ways to Load Data into a DW from an External Data Source

Introduction

After performing transformations on data, it needs to be saved in a Data Warehouse (DW) for further analysis. This process is part of ETL (Extract, Transform, Load): extracting data from various sources, transforming it as per business needs, and then loading it into the destination.

There are many ways to load data into a DW:

  • PolyBase
  • COPY INTO command
  • ADF Copy activity
  • Spark Pool in Synapse
  • Databricks

In this blog I will explain these approaches and discuss some of their limitations and the corresponding workarounds.

NOTE: Whenever we use other tools such as ADF, Data Flow, Spark pools, or Databricks to insert data into Synapse, it is highly recommended to enable staging (with some tools it is enabled automatically under the hood). Staging allows for efficient data transfer and handles large datasets more effectively, because it minimizes the load on both the source and the destination during the transfer. Internally, the data is first written to a staging layer (a temporary storage location), and from there it is loaded into the DW. Once the load completes, the temporary data in the staging layer is deleted.

PolyBase

PolyBase is a technology that uses metadata to allow SQL Server and dedicated SQL pools to query and import data from external data sources using T-SQL.

Step 1:

Create a Database Scoped Credential. It defines how Synapse connects to the external data source (Azure Data Lake Storage Gen2). In this Database Scoped Credential, I am using Managed Identity, which means Synapse uses its own identity to access the Data Lake.

You also have to assign the Storage Blob Data Contributor role to the managed identity of the Synapse workspace.

CREATE DATABASE SCOPED CREDENTIAL adlsgen2synp
WITH IDENTITY = 'MANAGED IDENTITY'
GO

Step 2:

Create an External Data Source. It is a definition of the external data source location.

CREATE EXTERNAL DATA SOURCE taxi_raw_data
WITH (
    LOCATION = 'abfss://raw@synapselearningadls.dfs.core.windows.net/',
    CREDENTIAL = adlsgen2synp,
    TYPE = HADOOP
);
GO

Step 3:

Create an External File Format. It defines the format of the file, such as CSV.

CREATE EXTERNAL FILE FORMAT csv_file_format
    WITH(
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"',
            FIRST_ROW = 2,
            USE_TYPE_DEFAULT = FALSE
        )
    );
GO

Step 4:

Create an External Table so that you can query the external data directly from Synapse without importing it.

CREATE EXTERNAL TABLE temp_table.ext_taxi_zone
    (   LocationID SMALLINT,
        Borough VARCHAR(15),
        Zone VARCHAR(50),
        service_zone VARCHAR(15))
    WITH(
        LOCATION = 'taxi_zone.csv',
        DATA_SOURCE = taxi_raw_data,
        FILE_FORMAT = csv_file_format
    );
GO

Step 5:

Use a CTAS (CREATE TABLE AS SELECT) statement to import the data into the Synapse SQL DW.

CREATE TABLE temp_table.taxi_zone
WITH (
    DISTRIBUTION = ROUND_ROBIN
) AS 
SELECT * FROM temp_table.ext_taxi_zone;

Using PolyBase

NOTE:

  • PolyBase does not support the DELTA file format.
  • An External Table can't be modified. To alter columns, you must drop the table and recreate it.

COPY INTO command

COPY INTO is a newer way of loading data into the DW. It doesn't need any extra database objects; it copies the data directly into the target table. It is faster and simpler than PolyBase.

COPY INTO [temp_table].[taxi_zone]
FROM 'https://synapselearningadls.blob.core.windows.net/raw/taxi_zone.csv'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FILE_TYPE = 'CSV',
    FIRSTROW = 2
)

Using Copy

NOTE: It supports only the CSV, PARQUET, and ORC file types.

ADF Copy Activity

Step 1:

Grant permissions to the ADF managed identity so that it can access the Synapse database.

Here EXTERNAL PROVIDER refers to Microsoft Entra ID.

CREATE USER [adf-rnd-learning] FROM EXTERNAL PROVIDER;
GO

EXEC sp_addrolemember 'db_owner', 'adf-rnd-learning';
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [adf-rnd-learning];

Step 2:

  • Create linked services for ADLS Gen2 (the external data source) and for Azure Synapse Analytics.
  • Create datasets for the source file in ADLS and for the sink table in the Synapse database.
  • Create a pipeline and use the Copy activity to copy data from ADLS to the Synapse database.
  • Set the source and sink datasets in the Copy activity and enable staging.

Here the data is stored temporarily in staging, and after it is loaded into the Synapse DB it is deleted.

Staging data

NOTE: The ADF Copy activity doesn't support DELTA. As a workaround, you can use a Data Flow in ADF: while adding the source in the Data Flow, choose Inline as the source type.

Delta Support

Spark Pool in Synapse

In a Spark pool we can work with Synapse DW objects with the help of the Spark connector. The best part is that Synapse automatically manages all authentication for us, so we don't have to worry about managing credentials.

The sqlanalytics package contains a Spark connector that is used to connect to the dedicated SQL pool from the Spark pool, as sketched below.

Spark Pool Connector
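
Below is a minimal sketch of what using this connector from a PySpark notebook in a Synapse Spark pool might look like. It assumes the Spark 3 version of the connector (the synapsesql API); the database, schema, and source table names are the ones used elsewhere in this post, and the target table name is a placeholder.

# Minimal sketch -- assumes a Synapse Spark 3 pool where the built-in
# dedicated SQL pool connector (synapsesql) is available.
import com.microsoft.spark.sqlanalytics

# Read an existing dedicated SQL pool table into a DataFrame.
# Authentication is handled automatically by the Synapse workspace identity.
df = spark.read.synapsesql("prac_dedicated.temp_table.taxi_zone")

# Write the DataFrame back to the dedicated SQL pool as an internal table
# (the target table name here is a placeholder).
(df.write
   .mode("overwrite")
   .synapsesql("prac_dedicated.temp_table.taxi_zone_copy"))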

Here, under the hood, the staging data is written automatically to a default location.

Spark Pool Staging

Databricks

In Databricks, to access Synapse DW objects we first have to manage credentials for both Synapse and ADLS.

Step 1:

Give Databricks the required permissions to access ADLS Gen2.

Create a service principal in Microsoft Entra ID and add a client secret to it.

Create Service Principal

Assign the Storage Blob Data Contributor role on the ADLS account to this service principal.

Role assign to ADLS

Then add the client_id, tenant_id, and client secret of the service principal to Azure Key Vault.

Key Vault

Then create a secret scope in Databricks and add the required details of your Key Vault. Databricks keeps the secret scope creation page in a hidden user interface; to reveal it, append "#secrets/createScope" to the end of the URL on the Databricks homepage. Once created, you can verify the scope from a notebook, as sketched below.

Databricks Scope
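
As a quick check that Databricks can see the scope and its Key Vault-backed secrets, a small sketch (the scope name dblearning-scope matches the one used in the following steps):

# Verify the secret scope and its keys from a Databricks notebook.
# 'dblearning-scope' is the scope name used in the rest of this post.
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list("dblearning-scope"))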

Step 2:

Give the service principal permission to access the Synapse database by executing the query below against the Synapse database.

CREATE USER [databricks_learning_appreg] FROM EXTERNAL PROVIDER;
GO

EXEC sp_addrolemember 'db_owner', 'databricks_learning_appreg';
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [databricks_learning_appreg];

Step 3:

Fetch the credentials from the Databricks secret scope.

client_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-id')
tenant_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-tenant-id')
client_secret = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-secret')

Step 4:

Set the Spark configurations in Databricks for ADLS.

spark.conf.set("fs.azure.account.auth.type.synapselearningadls.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.synapselearningadls.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.synapselearningadls.dfs.core.windows.net", client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret.synapselearningadls.dfs.core.windows.net", client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.synapselearningadls.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

Step 5:

Set the configurations in Databricks for the Synapse SQL DW.

spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", client_id)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", client_secret)

Step 6:

Now load the data from ADLS into a DataFrame and perform some transformations, as sketched below.

read df from ADLS
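
A minimal sketch of this step, assuming the taxi_zone.csv path used earlier in this post and the Borough column from the external table definition; the transformation itself is only an illustration:

# Minimal sketch -- reads taxi_zone.csv from ADLS using the OAuth configuration from Step 4.
# The transformation below (upper-casing Borough) is only an illustration.
from pyspark.sql.functions import col, upper

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("abfss://raw@synapselearningadls.dfs.core.windows.net/taxi_zone.csv"))

df = df.withColumn("Borough", upper(col("Borough")))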

Step 7:

Now write the transformed data to the Synapse SQL DW.

df.write.format("sqldw") \
.option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \
.option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \
.option("forwardSparkAzureStorageCredentials", "false") \
.option("dbTable", "temp_table.taxi_zone") \
.option('enableServicePrincipalAuth', 'true') \
.mode('append') \
.save()

Step 8:

Now check the table; the transformed data has been inserted into the Synapse SQL DW.

df2 = spark.read.format("sqldw") \
.option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \
.option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \
.option("forwardSparkAzureStorageCredentials", "false") \
.option("dbTable", "temp_table.taxi_zone") \
.option('enableServicePrincipalAuth', 'true') \
.load()

DW table

NOTE: Notice that in the PySpark code, I am defining the staging location (tempDir) for both reading and writing.

Staging from Databricks

