Databricks

Source and destination

Databricks provides instructions on how to set up a SQL endpoint with the correct permissions and retrieve login credentials. We recommend following the official Databricks documentation for setting up an ODBC connection.

Once you've completed those instructions, you should have the following credentials from Databricks. You'll use them to connect Polytomic to your Databricks cluster:

  • Server hostname
  • Port
  • Access token
  • HTTP Path

  1. In Polytomic, go to Connections → Add Connection → Databricks.

  2. Enter the server hostname, port, access token, and HTTP path.

  3. If Unity Catalog isn't enabled on your Databricks cluster, uncheck the Unity Catalog enabled checkbox.

  4. Click Save.
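
Before entering the credentials in Polytomic, it can help to sanity-check them locally. The sketch below is only an illustration: the `DatabricksCredentials` class, its field names, and the specific checks are hypothetical and not part of Polytomic. The optional `run_smoke_query` helper assumes the open-source `databricks-sql-connector` package is installed and uses its `sql.connect` call to run `SELECT 1` against your SQL endpoint.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DatabricksCredentials:
    """The four values collected from Databricks (field names are illustrative)."""

    server_hostname: str
    port: int
    access_token: str
    http_path: str

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the values look sane."""
        found = []
        if "://" in self.server_hostname or "/" in self.server_hostname:
            found.append("server hostname should be a bare host, without scheme or path")
        if not 1 <= self.port <= 65535:
            found.append("port must be between 1 and 65535")
        if not self.access_token.strip():
            found.append("access token is empty")
        if not self.http_path.strip():
            found.append("HTTP path is empty")
        return found


def run_smoke_query(creds: DatabricksCredentials) -> int:
    """Run SELECT 1 on the endpoint. Needs `pip install databricks-sql-connector`."""
    from databricks import sql  # imported lazily so problems() works without it

    with sql.connect(
        server_hostname=creds.server_hostname,
        http_path=creds.http_path,
        access_token=creds.access_token,
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
            return cursor.fetchone()[0]
```

If `problems()` returns an empty list and `run_smoke_query` succeeds, the same four values should be usable in the Polytomic connection form.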

Writing to Databricks

If you'd also like to write to Databricks, you'll need to select your cloud provider and provide its access settings.

Writing to Databricks on AWS

(For connecting to Databricks on Azure, see the next section.)

To enable Polytomic to write to your AWS Databricks cluster, you'll need to provide the following information:

  • S3 bucket name (Polytomic uses an S3 bucket to stage data for syncs into Databricks).
  • S3 bucket region (e.g. us-east-1).
  • Either:
    • An AWS Access Key ID and Secret Access Key, or
    • An AWS IAM Role ARN which Polytomic will assume when staging data into the bucket.
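
The either/or choice between an access key pair and an IAM role can be expressed as a small validation step. This is a minimal sketch, not Polytomic's implementation: the `AwsStagingSettings` class and its field names are hypothetical, and treating the two authentication options as mutually exclusive is an assumption drawn from the "Either" wording above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AwsStagingSettings:
    """Illustrative container for the AWS staging values listed above."""

    bucket_name: str
    bucket_region: str
    access_key_id: Optional[str] = None
    secret_access_key: Optional[str] = None
    iam_role_arn: Optional[str] = None

    def auth_mode(self) -> str:
        """Return "access_key" or "iam_role", or raise ValueError if ambiguous."""
        if bool(self.access_key_id) != bool(self.secret_access_key):
            raise ValueError("access key ID and secret access key must be given together")
        has_keys = bool(self.access_key_id)
        has_role = bool(self.iam_role_arn)
        if has_keys == has_role:
            # Assumption: exactly one of the two options should be configured.
            raise ValueError("provide either an access key pair or an IAM role ARN")
        if has_role:
            arn = self.iam_role_arn or ""
            if not (arn.startswith("arn:") and ":role/" in arn):
                raise ValueError("IAM role ARN should look like arn:aws:iam::<account>:role/<name>")
            return "iam_role"
        return "access_key"
```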

Using an AWS IAM Role to stage data for Databricks

If you're on AWS and need to authenticate with an IAM role rather than an access key and secret, see instructions here.

Writing to Databricks on Azure

To enable Polytomic to write to your Azure Databricks cluster, you'll need to provide an Azure Storage container for us to stage data in.

Azure Storage Account Type

Databricks requires that your Azure storage account be Data Lake Gen2 compatible. For this reason, we recommend creating a new storage account and container for Polytomic to use. When creating the storage account, enable the "hierarchical namespace" option on the Advanced tab.

You can read more about creating or migrating existing accounts on the Azure Support Portal.

You will need to enter the following values in Polytomic:

  • Azure Storage account name - the account name that contains the container Polytomic will write to.
  • Azure Storage access key - the access key associated with the storage account.
  • Azure Storage container name - the container that Polytomic will write to.
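
To see how the account name and container name fit together, the sketch below builds the `abfss://` URI that Data Lake Gen2 storage uses for a container. The `staging_uri` helper is hypothetical, and whether Polytomic stages data under this exact path is an assumption; the name checks follow Azure's published naming rules for storage accounts and containers.

```python
import re

# Storage account names: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# Container names: lowercase letters, digits, and single hyphens between them.
_CONTAINER_RE = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def staging_uri(account_name: str, container_name: str, path: str = "") -> str:
    """Build the ADLS Gen2 URI for a container, validating both names first."""
    if not _ACCOUNT_RE.fullmatch(account_name):
        raise ValueError(f"invalid storage account name: {account_name!r}")
    if not (3 <= len(container_name) <= 63 and _CONTAINER_RE.fullmatch(container_name)):
        raise ValueError(f"invalid container name: {container_name!r}")
    return (
        f"abfss://{container_name}@{account_name}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


# Example values are placeholders, not real resources.
print(staging_uri("examplestorage", "polytomic-staging", "syncs/"))
```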

Modifying table retention policy

When syncing to Databricks using Polytomic's bulk syncs (i.e. ELT workloads), you can override the default Databricks table retention period by turning on the Configure data retention for tables setting at the bottom of your Databricks connection configuration.

Per-sync table retention policy

You can override the global retention policy for an individual bulk sync by going to Advanced settings at the bottom of your bulk sync configuration and turning on the Configure data retention for tables setting.

This setting takes precedence over the retention policy set in your Polytomic Databricks connection configuration.

Advanced: limiting concurrent queries on Databricks

You can limit the number of concurrent queries Polytomic issues on Databricks by turning on the Concurrency query limit option. This lets you enter the maximum number of concurrent queries that Polytomic can issue.

Unless you have a good reason to set a limit, leave this option off. While operating Polytomic, two signs indicate that a limit is needed:

  • An occasional "failed to execute query: Invalid OperationHandle" error from your Databricks cluster while Polytomic is running.
  • High variance in sync running times when writing data to Databricks, where a sync occasionally takes much longer than usual.

Both are symptoms of Polytomic exceeding your cluster's capacity for concurrent queries. Turning on this option works around the problem by limiting the number of concurrent queries Polytomic issues against your Databricks cluster.
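
To illustrate what a concurrency limit does in general, the sketch below caps the number of simultaneous "queries" with a semaphore. It is not Polytomic's implementation: `LimitedRunner` and `fake_query` are hypothetical names, and the query is simulated with a short sleep rather than a real Databricks call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class LimitedRunner:
    """Runs callables while allowing at most `limit` of them at the same time."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be a positive integer")
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest number of simultaneous queries observed

    def run(self, query_fn):
        with self._slots:  # blocks while `limit` queries are already running
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return query_fn()
            finally:
                with self._lock:
                    self._running -= 1


def fake_query() -> str:
    """Stand-in for a query against the cluster."""
    time.sleep(0.01)
    return "ok"


runner = LimitedRunner(limit=3)
# Ten workers compete, but the semaphore lets only three queries run at once.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda _: runner.run(fake_query), range(20)))

print(f"completed {len(results)} queries, peak concurrency {runner.peak}")
```

With a limit in place, extra queries wait for a free slot instead of all reaching the cluster at once, which trades some parallelism for more predictable running times.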