Polytomic connects to a specific Databricks warehouse using the server hostname and HTTP path.

To determine the settings for your warehouse, log into your Databricks workspace. In the sidebar, click SQL > SQL Warehouses. Click the name of the warehouse you want to use with Polytomic. The Connection Details tab contains the hostname, port, and HTTP path, which you'll enter in Polytomic.

Polytomic authenticates to Databricks using a Personal Access Token. See the Databricks documentation for details on generating a token.

Configuring the connection

In Polytomic, go to Connections → Add Connection → Databricks.

Enter the server hostname, port, access token, and HTTP path.
If you don't have Unity Catalog enabled on your Databricks cluster, make sure to unmark the Unity Catalog enabled checkbox.
Click Save.

Writing to Databricks

If you'd like to also write to Databricks, you'll need to select your cloud provider and provide its access settings:

Writing to Databricks on AWS

(For connecting to Databricks on Azure, see the next section.)

To enable Polytomic to write to your AWS Databricks cluster, you'll need to provide the following information:

AWS Access Key ID.
AWS Secret Access Key.
S3 bucket name (Polytomic uses an S3 bucket to stage data for syncs into Databricks).
S3 bucket region (e.g. us-east-1 or such).
Either:
- An AWS Access Key ID and Secret, or
- An AWS IAM Role ARN which Polytomic will assume when staging data into the bucket.

Using an AWS IAM Role to stage data for Databricks

If you're on AWS and require authenticating using an IAM role rather than an access key and secret, see instructions here.

Writing to Databricks on Azure

To enable Polytomic to write to your Azure Databricks cluster, you'll need to provide an Azure Storage container for us to stage data in.

🚧
Azure Storage Account Type
Databricks requires that your Azure storage account is Data Lake Gen2 compatible. For this reason, it is recommended you create a new storage account/container for Polytomic to use. When creating a new storage account, enable the "Hierarchical Namespaces" feature on the Advanced setting tab.
You can read more about creating or migrating existing accounts on the Azure Support Portal.

You will need to enter the following values in Polytomic:

Azure Storage account name - the account name that contains the container Polytomic will write to.
Azure Storage access key - the access key associated with the storage account.
Azure Storage container name - the container that Polytomic will write to.

Modifying table retention policy

When syncing to Databricks using Polytomic's bulk syncs (i.e. ELT workloads), you can override the default Databricks table retention period by turning on the Configure data retention for tables setting at the bottom of your Databricks connection configuration:

Per-sync table retention policy

You can override the global retention policy per bulk sync by going to Advanced settings at the bottom of your bulk sync configuration and turning on the Configure data retention for tables setting:

This will override whatever you have set in your Polytomic Databricks connection config.

Advanced: limiting concurrent queries on Databricks

You can choose to limit the number of concurrent queries Polytomic issues on Databricks by turning on the Concurrency query limit option. This will allow you to enter an integer limit of concurrent queries that Polytomic can issue:

Unless you have a good reason for setting this, you should leave the option unset. During your operation of Polytomic, there are a couple of signs that would indicate the need to set a limit:

An occasional failed to execute query: Invalid OperationHandle error from your Databricks cluster while Polytomic is running.
Any variance in sync running times when writing data to Databricks, where occasionally a sync takes much longer than usual.

Both are symptoms of Polytomic hitting your cluster's capacity for concurrency. The way to get around this is to turn on this option, thus limiting the number of concurrent queries Polytomic issues against your Databricks cluster.

Writing to external tables

If you'd like to write to an external location that you've already set up within Databricks, you can specify this in Step 1 of your Polytomic Bulk Sync configuration:

Databricks permissions

The specific Databricks permissions required vary based on whether you're reading or writing data, and if you're writing, whether you're using an external location.

Reading from Databricks

The following permissions are required to read from Databricks.

USE CATALOG on any catalogs you wish to access
USE SCHEMA on any schemas you wish to access
SELECT on any tables you wish to access

Writing to Databricks

In addition to the permissions required for reading, the following permissions are required to write to Databricks.

CREATE SCHEMA on catalogs you are writing to
CREATE TABLE on catalogs and any existing schemas you are writing to
MODIFY on existing schemas and tables you are writing to

Polytomic will create new schemas and tables as needed; Polytomic's user will be owner of any newly created schemas and tables.

If you are writing to tables stored in an external location, the following additional permissions are required.

CREATE EXTERNAL TABLE on the external location and associated storage credential
READ FILES on the external location
READ VOLUME on the associated volume, if any

Configuring the connection

Writing to Databricks

Writing to Databricks on AWS

Using an AWS IAM Role to stage data for Databricks

Writing to Databricks on Azure

🚧Azure Storage Account Type

Modifying table retention policy

Per-sync table retention policy

Advanced: limiting concurrent queries on Databricks

Writing to external tables

Databricks permissions

Reading from Databricks

Writing to Databricks

🚧
Azure Storage Account Type