- Microsoft Fabric What?!
- Let’s learn some basics about LakeHouse
- Let’s create a Fabric Lakehouse
- Apache Spark In Microsoft Fabric
- Delta Lake in Microsoft Fabric
- Delta Lake Tables
- Delta Lake - Spark Streaming
Microsoft Fabric What?!
One-stop low-to-no-code platform.
To understand Fabric, first understand OneLake.
OneLake is Fabric's data lake.
Each Fabric tenant gets one OneLake by default.
OneLake ~ OneDrive.
OneLake is a layer over ADLS Gen2.
Either import data into OneLake or create shortcuts to external data.
The default storage format is Delta.
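Because OneLake exposes an ADLS Gen2-compatible endpoint, Spark can address it with abfss:// paths. A minimal sketch from a Fabric notebook (the workspace, lakehouse, and file names are placeholders):

# OneLake speaks the ADLS Gen2 protocol, so an abfss:// URI works.
# MyWorkspace / MyLakehouse / sales.csv are placeholder names.
path = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/sales.csv"
df = spark.read.csv(path, header=True, inferSchema=True)
df.show(5)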
Let’s learn some basics about LakeHouse
A lakehouse combines a data lake and a data warehouse.
The foundation of Fabric is the data lakehouse.
Features of Lakehouse:
A lakehouse is a data lake + a warehouse: the flexibility of storing data in a lake with SQL querying like a warehouse!
Pre-requisites for a LakeHouse
Before you can create a lakehouse, you must create a workspace in the Microsoft Fabric platform.
You create and configure a new lakehouse in the Data Engineering workload. Each lakehouse produces three named items in the Fabric-enabled workspace:
- Lakehouse: the lakehouse storage and metadata, where you interact with files, folders, and table data.
- Semantic model (default): an automatically created data model based on the tables in the lakehouse. Power BI reports can be built from the semantic model.
- SQL analytics endpoint: a read-only SQL endpoint through which you can connect and query data with Transact-SQL.
You can work with the data in the lakehouse in two modes:
- Lakehouse: enables you to add and interact with tables, files, and folders in the lakehouse.
- SQL analytics endpoint: enables you to use SQL to query the tables in the lakehouse and manage its relational data model.
Shortcuts in Lakehouse
Shortcuts enable you to integrate data into your lakehouse while keeping it stored in external storage.
Shortcuts can be created in both lakehouses and KQL databases, and appear as a folder in the lake. Spark, SQL, Real-Time Analytics, and Analysis Services can access data via shortcuts when querying data.
Shortcuts have limited data source connectors, so when you can’t use shortcuts, you can ingest data directly into your lakehouse.
Source data permissions and credentials are all managed by OneLake.
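Once a shortcut exists, Spark can read it like any other folder in the lakehouse. A minimal sketch, assuming a hypothetical shortcut named external_sales that points at parquet data:

# A shortcut under Files behaves like a normal folder.
# "external_sales" is a hypothetical shortcut name; parquet is an assumed format.
df = spark.read.parquet("Files/external_sales")
df.show(5)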
Ways to ingest data into Lakehouse
Upload: Upload local files or folders to the lakehouse. You can then explore and process the file data, and load the results into tables.
Dataflows (Gen2): Import and transform data from a range of sources using Power Query Online, and load it directly into a table in the lakehouse.
Notebooks: Use notebooks in Fabric to ingest and transform data, and load it into tables or files in the lakehouse (see the sketch after this list).
Data Factory pipelines: Copy data and orchestrate data processing activities, loading the results into tables or files in the lakehouse.
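As a sketch of the notebook option (the file path and table name are assumptions that match the walkthrough later in this post):

# Read a CSV that was uploaded to the lakehouse Files area
df = spark.read.csv("Files/data/sales.csv", header=True, inferSchema=True)

# Load it into a managed Delta table in the lakehouse
df.write.format("delta").mode("overwrite").saveAsTable("sales")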
Ways to transform data in Fabric Lakehouse
Apache Spark: PySpark, Spark SQL, notebooks, Spark job definitions (see the sketch below)
SQL analytics endpoint: Transact-SQL
Dataflows (Gen2): Power Query
Data pipelines: Copy data activity
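For the Apache Spark option, a minimal PySpark transformation sketch (the sales table and its Item/Quantity columns are assumptions based on the data used later in this post):

from pyspark.sql.functions import sum as sum_

# Aggregate the quantity sold per item and save the result as a new table
sales_df = spark.read.table("sales")
summary_df = sales_df.groupBy("Item").agg(sum_("Quantity").alias("TotalQuantity"))
summary_df.write.format("delta").mode("overwrite").saveAsTable("item_sales_summary")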
Ways to Visualize
Create an end-to-end solution using Power BI and Fabric Lakehouse.
Let’s create a Fabric Lakehouse
Create a LakeHouse Workspace
- Go to https://app.fabric.microsoft.com. Select Synapse Data Engineering
Create a LakeHouse
- Click on Create then Lakehouse. Give it any name.
- Fabric will create everything automatically and you will have your Lakehouse
Upload a simple CSV to the LakeHouse
- Download and save the sample CSV file from here: https://raw.githubusercontent.com/MicrosoftLearning/dp-data/main/sales.csv
- Go to Explorer, create a Data subfolder under Files, then upload the downloaded CSV into it
Load the CSV data into a table
Now, if you want to use SQL, you need to load the CSV into a table. It is pretty simple.
Just click on the ellipsis next to the CSV file and select Load to Tables
Once loaded you can see your table in tabular format
Query the CSV table using SQL
Write a normal SQL query using the SQL editor
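For example (assuming the Item and Quantity columns used in the report step below):

SELECT Item, SUM(Quantity) AS TotalQuantity
FROM sales
GROUP BY Item
ORDER BY TotalQuantity DESC;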
Write a Visual Query
Create a Report
- At the bottom of the SQL Endpoint page, click on the Model tab to view the data model schema for the semantic model.
- Navigate to the Reporting tab in the menu ribbon and select New report to open a new browser tab for report design.
- In the Data pane, expand the sales table and select Item and Quantity. This will add a table visualization to your report.
- Hide the Data and Filters panes to create more space. Change the visualization to a Clustered bar chart and resize it.
- Save the report by selecting Save from the File menu and name it Item Sales Report.
- Close the report tab and return to the SQL endpoint page. In the hub menu, select your workspace to verify it contains your lakehouse, the SQL analytics endpoint, a default semantic model, and the Item Sales Report.
Connect external data using shortcuts
It is easy to import data into Fabric. But what if the data needs to stay outside? Then we create a shortcut to that external data in the lakehouse. It appears as a folder.
Let's create a shortcut to Dataverse.
- Click on the cli
Note: The regions of Dataverse and Fabric should be the same.
Apache Spark In Microsoft Fabric
Spark divides and conquers: it splits a large job across multiple computers, with the SparkContext coordinating the splitting. Spark supports many languages, but in industry PySpark and Spark SQL are the most used.
In Fabric, one workspace gets one Spark cluster.
Run a Spark Notebook
Run simple PySpark code in a notebook
Here, the Spark session is already created. All you have to do is create a dataframe and start coding!
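For example, a tiny sketch (the product names and prices are made up):

# The Spark session is pre-created as `spark` in a Fabric notebook
data = [("Widget", 2.99), ("Gadget", 5.49)]
df = spark.createDataFrame(data, ["Product", "Price"])
df.show()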
Create a Spark job definition
- Access the Spark Job Definition page.
- Create a PySpark job definition.
- Develop a main definition file named anyname.py. The file should include the following code:

from pyspark.sql import SparkSession

# This code executes only when the .py file is run directly.
if __name__ == "__main__":

    # Initialize a Spark session specifically for this job.
    spark = SparkSession.builder.appName("Sales Aggregation").getOrCreate()

    # Read data from a CSV file into a DataFrame.
    df = spark.read.csv('Files/data/sales.csv', header=True, inferSchema=True)

    # Write the DataFrame to a Delta table, overwriting existing data.
    df.write.mode('overwrite').format('delta').save('Files/data/delta')

- Upload and schedule the file.
Delta Lake in Microsoft Fabric
Delta Lake is essentially a data lake with a relational (SQL) layer on top: it adds table semantics and transactional guarantees to files stored in the lake.
In Fabric, any table imported from .csv/excel etc. automatically becomes a Delta Lake table.
For these tables, you'll find .parquet files and a _delta_log folder when you view the files.
Since every table is automatically a Delta Lake table, this is a very useful feature. There’s no need to convert files into Delta tables separately.
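To confirm that a table is stored in Delta format, you can inspect its details from a notebook; a minimal sketch, assuming the sales table loaded earlier:

# DESCRIBE DETAIL reports the table's format (delta), location, and file count
spark.sql("DESCRIBE DETAIL sales").show(truncate=False)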
Delta Lake Tables
Using df.write
Managed table:
df.write.format("delta").saveAsTable("tableName")
External table:
df.write.format("delta").saveAsTable("tableName", path="Files/folderX")
Using API - DeltaTableBuilder
Managed table:
from delta.tables import *
DeltaTable.create(spark) \
.tableName("Planet") \
.addColumn("Size", "INT") \
.addColumn("Name", "STRING") \
.execute()
Using Spark SQL
Create Managed table:
CREATE TABLE salesorders
(
Orderid INT NOT NULL,
OrderDate TIMESTAMP NOT NULL,
CustomerName STRING,
SalesTotal FLOAT NOT NULL
)
USING DELTA;
Create External table:
CREATE TABLE MyExternalTable
USING DELTA
LOCATION 'Files/mydata';
Insert rows:
This is the most common way:
spark.sql("INSERT INTO products VALUES (1, 'Widget', 'Accessories', 2.99)")
Update rows - %%sql magic:
%%sql
UPDATE products
SET Price = 2.6 WHERE ProductId = 1;
No table - Just Delta Files
To just save the dataframe in delta format, with no table created:
df.write.format("delta").mode("overwrite").save("Folder/Path")  # or mode("append")
After running the code you will see a folder with:
- parquet files
- _delta_log sub-folder
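You can read those delta files back into a dataframe later by loading the folder directly (same path as above):

# Load the delta folder (no registered table needed) into a dataframe
df = spark.read.format("delta").load("Folder/Path")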
Later you can create a DeltaTable object from the folder and modify it:
from delta.tables import *
from pyspark.sql.functions import *
# Create a DeltaTable object
delta_path = "Files/mytable"
deltaTable = DeltaTable.forPath(spark, delta_path)
# Update the table
deltaTable.update(
condition = "Category == 'Accessories'",
set = { "Price": "Price * 0.9" })
Use time travel
This command shows all the transactions on the table:
%%sql
DESCRIBE HISTORY products
For a specific version:
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
For a specific date:
df = spark.read.format("delta").option("timestampAsOf", '2022-01-01').load(delta_path)
Delta Lake - Spark Streaming
Delta table - streaming source
Here a delta table stores internet sales data. When new data is added, it flows into a stream:
from pyspark.sql.types import *
from pyspark.sql.functions import *
# Load a streaming dataframe from the Delta Table
stream_df = spark.readStream.format("delta") \
.option("ignoreChanges", "true") \
.load("Files/delta/internetorders")
# Now you can process the streaming data in the dataframe.
# Note: show() is not supported on a streaming dataframe;
# for example, inspect its schema instead:
stream_df.printSchema()
Delta table - streaming sink
Here a folder has JSON files. As new JSON files are added, their contents are appended to a Delta Lake table:
from pyspark.sql.types import *
from pyspark.sql.functions import *
# Create a stream that reads JSON data from a folder
inputPath = 'Files/streamingdata/'
jsonSchema = StructType([
StructField("device", StringType(), False),
StructField("status", StringType(), False)
])
stream_df = spark.readStream.schema(jsonSchema).option("maxFilesPerTrigger", 1).json(inputPath)
# Write the stream to a delta table
table_path = 'Files/delta/devicetable'
checkpoint_path = 'Files/delta/checkpoint'
delta_stream = stream_df.writeStream.format("delta").option("checkpointLocation", checkpoint_path).start(table_path)
To stop the stream writing to the delta lake table, use stop():
delta_stream.stop()
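Once the stream has written data, the delta output can be queried as a regular batch dataframe (a minimal sketch reusing table_path from above):

# Read the streamed output as a batch dataframe and inspect it
batch_df = spark.read.format("delta").load(table_path)
batch_df.show()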