Supported Languages: Python, SQL, Scala, and R
Databricks notebooks support four languages — Python, SQL, Scala, and R — with the ability to mix them freely within a single notebook using magic commands. Each language has full access to the Spark session and Unity Catalog, and data can be shared between languages through temporary views and the Spark catalog. Choose your language based on your workload: Python for general-purpose engineering and ML, SQL for analytics, Scala for performance-critical code, and R for statistical modelling.
- Understand the capabilities and trade-offs of each supported language
- Learn how to mix languages within a single notebook using magic commands
- Share data between cells of different languages
Who this is for: Developers, analysts, and data scientists who want to understand language options and interoperability in Databricks notebooks.
Part of the Databricks Notebooks section of the Databricks tutorial series.
Architecture / Concept Overview
Every Databricks notebook has a default language set at creation time. Individual cells can override this language using magic commands (%python, %sql, %scala, %r). All languages share the same Spark session, which means they can access the same catalog, schemas, and temporary views. Data passes between languages through the Spark catalog — register a DataFrame as a temporary view in Python, then query it in SQL.
*All four languages share the same Spark session, enabling data access and interoperability through the catalog.*
Data sharing between languages uses temporary views as the interchange format.
*Register a Python DataFrame as a temporary view, then query it from SQL or any other language.*
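The handoff described above can be sketched in Python as follows. This is a minimal illustration, not a prescribed pattern: `register_and_query`, `demo`, the view name `orders_tmp`, and the sample rows are all hypothetical, and `spark` is assumed to be the session that Databricks notebooks predefine. Only `createOrReplaceTempView`, `createDataFrame`, and `spark.sql` are actual PySpark calls.

```python
def register_and_query(spark, df, view_name, query_template):
    """Register `df` as a temporary view, then query it with Spark SQL.

    `query_template` must contain a `{view}` placeholder for the view name.
    """
    # The view is session-scoped: any cell that shares this Spark session,
    # whatever its language, can now refer to the data by name.
    df.createOrReplaceTempView(view_name)
    # spark.sql() returns a DataFrame; a %sql cell could run the same query text.
    return spark.sql(query_template.format(view=view_name))


def demo(spark):
    """Hypothetical end-to-end run; pass the notebook's predefined `spark`."""
    orders = spark.createDataFrame(
        [("EMEA", 120.0), ("APAC", 80.0), ("EMEA", 30.0)],
        ["region", "amount"],
    )
    return register_and_query(
        spark,
        orders,
        "orders_tmp",
        "SELECT region, SUM(amount) AS total FROM {view} GROUP BY region",
    )
```

In a notebook, `demo(spark).show()` would display the per-region totals, and a later `%sql`, `%scala`, or `%r` cell could read `orders_tmp` by name for as long as the session lives.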
*Each language has a primary strength: Python for general purpose, SQL for analytics, Scala for JVM performance, R for statistics.*
Key Terms
- Magic Command: A cell prefix (%python, %sql, %scala, %r, %md) that sets the language for that cell.
- PySpark: The Python API for Apache Spark, providing DataFrame and SQL operations.
- Spark SQL: Spark's SQL interface for querying structured data using standard SQL syntax.
- Temporary View: A session-scoped virtual table that makes a DataFrame queryable via SQL from any language.
- SparkR: The R API for Apache Spark, enabling distributed data processing from R.
Prerequisites and Setup
- A Databricks notebook attached to compute
- Understanding of at least one supported language
- Unity Catalog enabled for data access
- For Scala: awareness of JVM and Spark internals is helpful
- For R: familiarity with tidyverse and base R
Step-by-Step Implementation
Configuration Reference
| Language | Magic Command | API | Best For |
|---|---|---|---|
| Python | %python | PySpark, pandas, scikit-learn | General ETL, ML, analysis |
| SQL | %sql | Spark SQL | Queries, analytics, dashboards |
| Scala | %scala | Spark Scala API | Performance-critical, JVM integration |
| R | %r | SparkR, tidyverse | Statistical modelling, visualisation |
| Markdown | %md | Markdown syntax | Documentation, notes |
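To make the override rule concrete, here is a small pure-Python model of how a cell's language follows from its first line and the notebook default. It is only an illustration built from the table above: `MAGIC_LANGUAGES` and `cell_language` are hypothetical names, and this is not how Databricks itself parses cells.

```python
# Illustrative mapping taken from the table above; not a Databricks API.
MAGIC_LANGUAGES = {
    "%python": "python",
    "%sql": "sql",
    "%scala": "scala",
    "%r": "r",
    "%md": "markdown",
}


def cell_language(cell_source, default_language):
    """Return the language a cell would run in under the override rule."""
    first_line = cell_source.split("\n", 1)[0].strip()
    tokens = first_line.split()
    # A recognised magic command at the start of the first line overrides
    # the notebook default; anything else falls back to the default.
    if tokens and tokens[0] in MAGIC_LANGUAGES:
        return MAGIC_LANGUAGES[tokens[0]]
    return default_language
```

For example, `cell_language("%sql\nSELECT 1", "python")` returns `"sql"`, while a cell with no language magic keeps the notebook's default.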
Monitoring, Cost, and Security Considerations
Monitoring
Each cell's execution time is shown regardless of language. Use the Spark UI to inspect query plans and stage execution for all languages. Slow cells in any language may indicate data skew or inefficient transformations.
Cost Optimisation
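- Use SQL for simple queries and aggregations: the whole query runs inside the engine, whereas equivalent Python code is only as efficient if it sticks to DataFrame operations and avoids UDFs and driver-side loops.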
- Avoid collecting large DataFrames to Python or R local memory; use Spark's distributed processing.
- Prefer built-in Spark functions over Python UDFs for better performance and Photon compatibility.
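One way to apply the second point is to aggregate and cap the result on the cluster before anything is copied to local memory. The helper below is a sketch under that assumption; `top_groups_to_pandas` and its parameters are hypothetical names, while `groupBy`, `count`, `orderBy`, `limit`, and `toPandas` are standard PySpark DataFrame methods.

```python
def top_groups_to_pandas(df, group_col, n=20):
    """Aggregate on the cluster, then bring only the top `n` groups to the driver."""
    summary = (
        df.groupBy(group_col)
        .count()                            # distributed aggregation; adds a "count" column
        .orderBy("count", ascending=False)  # largest groups first
        .limit(n)                           # cap the rows that leave the cluster
    )
    # Only now is data copied into local memory, as at most `n` rows.
    return summary.toPandas()
```

Calling `top_groups_to_pandas(spark.table(...), "region", n=10)` would return a small pandas DataFrame suitable for plotting, regardless of how large the source table is.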
Security and Governance
- Unity Catalog enforces the same access policies regardless of which language runs the query.
- On Standard (shared) clusters, some Scala features are restricted to prevent bypassing Lakeguard isolation.
- R and Python run in isolated processes on Standard clusters.
Common Pitfalls and Recommended Patterns
- Collecting large datasets to local memory: use .limit() or aggregation before .collect() or toPandas().
- Using Python UDFs when built-in functions exist: UDFs prevent Photon acceleration and are slower.
- Mixing too many languages in one notebook: stick to 1-2 languages for readability; use temp views for handoffs.
- Forgetting that variables do not share across languages: Python variables are not visible in Scala cells.
- Not using temporary views for cross-language data sharing: temporary views are the standard handoff within a session; for data that must outlive the session, write a table to the catalog instead.
- Writing complex logic in SQL when Python is more maintainable: use the right tool for the task complexity.
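As an illustration of the UDF pitfall, the sketch below adds an upper-cased column with Spark's built-in `upper` function rather than a Python UDF wrapping `str.upper`. The helper name `with_upper_builtin` and the `<column>_upper` naming are hypothetical; `selectExpr` is a standard DataFrame method that accepts Spark SQL expressions.

```python
def with_upper_builtin(df, column):
    """Add `<column>_upper` using Spark's built-in upper() instead of a Python UDF."""
    # selectExpr takes Spark SQL expressions, so upper() is planned and run by
    # the engine itself; no rows are shipped to a Python worker, as they would
    # be for a UDF wrapping str.upper.
    return df.selectExpr("*", f"upper(`{column}`) AS `{column}_upper`")
```

The same idea applies to most string, date, and math operations: check the built-in function list first and reach for a UDF only when no built-in covers the logic.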
Frequently Asked Questions
Can I share variables between Python and SQL?
Not directly. Use temporary views (createOrReplaceTempView) to share DataFrames. You can also use spark.sql() in Python to execute SQL and return results as a DataFrame.
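When a single Python value (rather than a DataFrame) needs to reach a SQL query issued from Python, one option is a named parameter marker. The sketch below assumes Spark 3.4 or later, where `spark.sql()` accepts an `args` mapping; the helper name `orders_above` and the `amount` column are hypothetical.

```python
def orders_above(spark, view_name, min_amount):
    """Query a temp view from Python, passing a Python value as a named SQL parameter."""
    # `:min_amount` is a named parameter marker; `args` binds it to the Python
    # value, which avoids formatting the value into the SQL string by hand.
    # Assumption: Spark 3.4 or later, where spark.sql() accepts `args`.
    return spark.sql(
        f"SELECT * FROM {view_name} WHERE amount > :min_amount",
        args={"min_amount": min_amount},
    )
```

This only covers queries run through `spark.sql()` in a Python cell; a separate `%sql` cell still cannot see Python variables, so DataFrames continue to travel through temporary views.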
Which language is fastest?
For SQL and DataFrame operations, all languages compile to the same Spark execution plan, so performance is equivalent. Scala avoids Python-to-JVM serialisation overhead for some operations. Python UDFs are slower than built-in functions.
Can I install additional Python packages?
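Yes. Use %pip install package_name in a notebook cell. The library is notebook-scoped: it is available only in the current notebook session and does not affect other notebooks attached to the same compute. It must be reinstalled after the notebook is detached or the compute restarts.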
Does R have full Spark support?
SparkR provides DataFrame operations and SQL access. For advanced Spark features, use PySpark or Scala and share results via temporary views.