Databricks
Databricks now offers a native DB API 2.0 driver, databricks-sql-connector, that can be used with the sqlalchemy-databricks dialect. You can install both with:
pip install "superset[databricks]"
To use the native connector you need the following information from your cluster:
- Server hostname
- Port
- HTTP path
These can be found under "Configuration" -> "Advanced Options" -> "JDBC/ODBC".
You also need an access token from "Settings" -> "User Settings" -> "Access Tokens".
Once you have all this information, add a database of type "Databricks Native Connector" and use the following SQLAlchemy URI:
databricks+connector://token:{access_token}@{server_hostname}:{port}/{database_name}
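As a minimal sketch of how the pieces fit into that URI, the helper below assembles it from the four values. The function name `build_native_uri` and the example hostname, port, and token are placeholders invented for illustration, not part of Superset or the Databricks driver. Percent-encoding the token is a precaution in case it contains characters that are reserved in URLs.

```python
from urllib.parse import quote_plus


def build_native_uri(access_token, server_hostname, port, database_name):
    """Assemble the SQLAlchemy URI for the native Databricks connector.

    Hypothetical helper: it only formats a string and opens no connection.
    """
    # Percent-encode the token so reserved characters cannot break the URI.
    token = quote_plus(access_token)
    return (
        f"databricks+connector://token:{token}"
        f"@{server_hostname}:{port}/{database_name}"
    )


# Placeholder values for illustration only.
print(build_native_uri("example-token", "example.cloud.databricks.com", 443, "default"))
```

The resulting string can be pasted into the SQLAlchemy URI field when adding the database.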
You also need to add the following configuration to "Other" -> "Engine Parameters", with your HTTP path:
{
    "connect_args": {"http_path": "sql/protocolv1/o/****"},
    "http_headers": [["User-Agent", "Apache Superset"]]
}
The User-Agent header is optional, but helps Databricks identify traffic from Superset. If you need to use a different header, please reach out to Databricks and let them know.
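If you prefer to generate the Engine Parameters rather than type them by hand, something like the following sketch could work. The helper name `engine_parameters` is hypothetical; it reproduces the JSON shape shown above and leaves out the optional header when `user_agent` is `None`.

```python
import json


def engine_parameters(http_path, user_agent="Apache Superset"):
    """Return the Engine Parameters JSON for the native connector.

    Hypothetical helper that mirrors the JSON structure documented above.
    """
    params = {"connect_args": {"http_path": http_path}}
    if user_agent is not None:
        # The User-Agent header is optional, so it is only added on request.
        params["http_headers"] = [["User-Agent", user_agent]]
    return json.dumps(params, indent=4)


# Replace the masked HTTP path with the one from your cluster.
print(engine_parameters("sql/protocolv1/o/****"))
```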
Older driver
Originally Superset used databricks-dbapi to connect to Databricks. You might want to try it if you're having problems with the official Databricks connector:
pip install "databricks-dbapi[sqlalchemy]"
There are two ways to connect to Databricks when using databricks-dbapi: using a Hive connector or an ODBC connector. Both ways work similarly, but only ODBC can be used to connect to SQL endpoints.
Hive
To connect to a Hive cluster add a database of type "Databricks Interactive Cluster" in Superset, and use the following SQLAlchemy URI:
databricks+pyhive://token:{access_token}@{server_hostname}:{port}/{database_name}
You also need to add the following configuration to "Other" -> "Engine Parameters", with your HTTP path:
{"connect_args": {"http_path": "sql/protocolv1/o/****"}}
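Because the URI packs several values into one string, it can help to check it before saving the database. The sketch below uses only the Python standard library; the function name `describe_databricks_uri` and its checks are assumptions derived from the URI templates in this section, not a Superset or databricks-dbapi API.

```python
from urllib.parse import urlsplit


def describe_databricks_uri(uri):
    """Split a Databricks SQLAlchemy URI into its parts and sanity-check them.

    Hypothetical helper: it inspects the string only and opens no connection.
    """
    parts = urlsplit(uri)
    if not parts.scheme.startswith("databricks+"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # The templates in this section use the literal user name "token".
    if parts.username != "token":
        raise ValueError("expected the literal user name 'token'")
    if not parts.password:
        raise ValueError("missing access token")
    if parts.port is None:
        raise ValueError("missing port")
    return {
        "driver": parts.scheme.split("+", 1)[1],
        "hostname": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Placeholder values for illustration only.
print(describe_databricks_uri(
    "databricks+pyhive://token:example-token@example.cloud.databricks.com:443/default"
))
```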
ODBC
For ODBC you first need to install the ODBC drivers for your platform.
For a regular connection use this as the SQLAlchemy URI after selecting either "Databricks Interactive Cluster" or "Databricks SQL Endpoint" for the database, depending on your use case:
databricks+pyodbc://token:{access_token}@{server_hostname}:{port}/{database_name}
And for the connection arguments:
{"connect_args": {"http_path": "sql/protocolv1/o/****", "driver_path": "/path/to/odbc/driver"}}
The driver path should be:
- /Library/simba/spark/lib/libsparkodbc_sbu.dylib (Mac OS)
- /opt/simba/spark/lib/64/libsparkodbc_sb64.so (Linux)
For a connection to a SQL endpoint you need to use the HTTP path from the endpoint:
{"connect_args": {"http_path": "/sql/1.0/endpoints/****", "driver_path": "/path/to/odbc/driver"}}
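The ODBC connection arguments differ only in the HTTP path and the platform-specific driver path, so they can be generated. The following is a rough sketch, assuming the drivers are installed at the default locations listed above; `odbc_connect_args` and `DRIVER_PATHS` are hypothetical names, and other platforms or custom install locations would need their own entries.

```python
import sys

# Driver locations quoted in this section, keyed by sys.platform prefix.
DRIVER_PATHS = {
    "darwin": "/Library/simba/spark/lib/libsparkodbc_sbu.dylib",
    "linux": "/opt/simba/spark/lib/64/libsparkodbc_sb64.so",
}


def odbc_connect_args(http_path, platform=None):
    """Build the ODBC connection arguments for the given HTTP path.

    Hypothetical helper; `platform` defaults to the running interpreter's
    sys.platform and can be overridden to prepare settings for another host.
    """
    platform = sys.platform if platform is None else platform
    for prefix, driver_path in DRIVER_PATHS.items():
        if platform.startswith(prefix):
            return {
                "connect_args": {
                    "http_path": http_path,
                    "driver_path": driver_path,
                }
            }
    raise ValueError(f"no known driver path for platform {platform!r}")


# SQL endpoint example for a Linux host; keep your own masked path in place of ****.
print(odbc_connect_args("/sql/1.0/endpoints/****", platform="linux"))
```

Serialize the returned dictionary with `json.dumps` before pasting it into the Engine Parameters field.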