Unit Testing Notebook Code on Databricks

Unit testing notebook code on Databricks involves extracting business logic into testable Python functions, then running those tests using frameworks like pytest or unittest either within a notebook or as part of a CI/CD pipeline. By separating transformation logic from notebook orchestration, you can write fast, deterministic tests that catch bugs before they reach production, without requiring a running Spark cluster for every test.

Extract notebook logic into importable Python modules for testability
Write and run unit tests using pytest inside Databricks notebooks
Integrate tests into CI/CD pipelines for automated validation

Who this is for: Data engineers and developers who want to apply software engineering best practices to Databricks notebook code.

Part of the Databricks Notebooks section of the Databricks tutorial series.

Architecture / Concept Overview

The core principle is separating concerns: notebooks orchestrate and visualise, while Python modules contain the business logic. Modules are testable in isolation with mock data, while notebooks handle Spark session management, widget parameters, and display. Tests run in three contexts: interactively in a notebook cell, in a CI job on a Databricks cluster, or locally without Spark using mocked DataFrames.
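
As a minimal sketch of this separation, assume a hypothetical module `transforms.py` that holds one pure function, `parse_amount`. Neither name comes from the original text; they only illustrate the pattern. The function needs no Spark session, so its test runs anywhere Python does.

```python
# transforms.py (hypothetical module name, used only for illustration)
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(raw: Optional[str]) -> Optional[Decimal]:
    """Convert a raw currency string such as ' $1,234.50 ' to a Decimal.

    Returns None for missing or unparseable input instead of raising,
    so the calling notebook can decide how to handle bad rows.
    """
    if raw is None:
        return None
    cleaned = raw.strip().replace("$", "").replace(",", "")
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


# test_transforms.py (hypothetical): plain asserts, no cluster required
def test_parse_amount():
    assert parse_amount(" $1,234.50 ") == Decimal("1234.50")
    assert parse_amount("abc") is None
    assert parse_amount("") is None
    assert parse_amount(None) is None
```

The notebook would then only import and call the function (for example `from transforms import parse_amount`), leaving widgets, Spark session handling, and display in the notebook itself.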

Diagram: Notebook → Orchestration; Notebook → Python Module → Unit Tests → CI/CD Pipeline.

*Notebooks orchestrate; modules contain logic; tests validate modules; CI/CD automates testing.*

Tests can run at multiple levels, from pure Python unit tests to Spark integration tests.

Diagram: Unit Tests (No Spark) → Integration Tests (Spark) → End-to-End (Full Pipeline).

*The testing pyramid: fast unit tests at the bottom, slower integration tests above, full pipeline tests at the top.*

Diagram: Pull Request → Run Tests → Pass? If yes, Merge; if no, Fix and Re-test.

*Tests run automatically on pull requests, blocking merge until all tests pass.*

Key Terms

Unit Test: A test that validates a single function or method in isolation, typically without external dependencies.
Integration Test: A test that validates the interaction between components, often requiring a Spark session.
pytest: A popular Python testing framework that can be run inside Databricks notebooks and jobs as well as locally.
Test Fixture: A reusable piece of test setup, like a sample DataFrame or SparkSession, provided to test functions.
Mocking: Replacing real dependencies (like database connections) with fake implementations for isolated testing.
CI/CD: Continuous Integration/Continuous Deployment pipelines that automate testing and deployment.
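
To make the fixture and mocking terms concrete, here is a small sketch using only the standard library. `convert_to_usd` and the `rate_client` interface are hypothetical stand-ins for code that calls an external service. In pytest the same setup would usually be written as a `@pytest.fixture`; `setUp` is the unittest equivalent.

```python
import unittest
from unittest.mock import Mock


def convert_to_usd(amount: float, currency: str, rate_client) -> float:
    """Convert amount to USD using a rate looked up from an external service.

    rate_client is any object with a get_rate(currency) method; the name and
    interface are hypothetical and stand in for a real API or database client.
    """
    rate = rate_client.get_rate(currency)
    return round(amount * rate, 2)


class ConvertToUsdTest(unittest.TestCase):
    def setUp(self):
        # Fixture: runs before each test and provides a fresh mock client.
        self.rate_client = Mock()
        self.rate_client.get_rate.return_value = 1.25

    def test_converts_with_looked_up_rate(self):
        self.assertEqual(convert_to_usd(10.0, "GBP", self.rate_client), 12.5)
        # The mock records its calls, so the interaction can be checked too.
        self.rate_client.get_rate.assert_called_once_with("GBP")
```

Because the mock replaces the real client, the test never touches the network and gives the same result on every run.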

Prerequisites and Setup

A Databricks workspace with Repos (now called Git folders) enabled for code organisation
pytest available on the cluster (install it with %pip install pytest if your runtime does not already include it)
Understanding of Python testing basics
Code structured as importable Python modules (not just notebook cells)

Step-by-Step Implementation

Configuration Reference

Test types and their requirements
Test Type	Requires Spark	Speed	Use Case
Pure Python unit test	No	Fast (ms)	Utility functions, parsing, validation
Spark unit test	Yes	Medium (seconds)	DataFrame transforms, SQL logic
Integration test	Yes + data	Slow (minutes)	End-to-end pipeline validation
Notebook test	Yes	Medium	`dbutils.notebook.run()` workflows

Monitoring, Cost, and Security Considerations

Monitoring

Track test results in job run history. Use pytest's JUnit XML output (--junitxml=results.xml) for integration with CI/CD dashboards. Monitor test execution time to catch performance regressions.
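
As a rough illustration of using the JUnit XML report for monitoring, the sketch below totals results and lists tests slower than a threshold. `summarise_junit` and the sample report are hypothetical; the sample is hand-written in the JUnit XML shape and is not output from a real run.

```python
import xml.etree.ElementTree as ET


def summarise_junit(xml_text: str, slow_threshold_s: float = 1.0) -> dict:
    """Summarise a JUnit XML report: totals plus tests slower than a threshold."""
    root = ET.fromstring(xml_text)
    summary = {"tests": 0, "failures": 0, "errors": 0, "slow": []}
    # iter() also yields the root element, so this handles both a
    # <testsuites> wrapper and a bare <testsuite> root.
    for suite in root.iter("testsuite"):
        summary["tests"] += int(suite.get("tests", 0))
        summary["failures"] += int(suite.get("failures", 0))
        summary["errors"] += int(suite.get("errors", 0))
    for case in root.iter("testcase"):
        duration = float(case.get("time", 0))
        if duration > slow_threshold_s:
            summary["slow"].append((case.get("name"), duration))
    return summary


# Hand-written sample in the JUnit XML shape; not output from a real run.
SAMPLE_REPORT = """
<testsuites>
  <testsuite name="pytest" tests="2" failures="1" errors="0" time="2.6">
    <testcase classname="tests.test_transforms" name="test_parse_amount" time="0.002"/>
    <testcase classname="tests.test_transforms" name="test_join_orders" time="2.5">
      <failure message="assert 3 == 4"/>
    </testcase>
  </testsuite>
</testsuites>
"""

print(summarise_junit(SAMPLE_REPORT))
```

Tracking the `slow` list across runs is one simple way to notice a test whose execution time has regressed.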

Cost Optimisation

- Use single-node clusters (num_workers: 0) for unit tests to minimise DBU consumption.

- Run pure Python tests locally (no cluster) to avoid any Databricks cost.

- Schedule test jobs on spot instances for lower cost.
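
The single-node recommendation can be sketched as a cluster specification for a test job, written here as a Python dictionary. The field names follow the Databricks Clusters API as I understand it, so check them against the current documentation. The `spark_version` and `node_type_id` values are placeholders, not real identifiers.

```python
import json

# Sketch of a "new_cluster" block for a Databricks job that runs tests.
# spark_version and node_type_id are placeholders: choose values that
# exist in your workspace and cloud.
single_node_test_cluster = {
    "spark_version": "<runtime-version>",
    "node_type_id": "<node-type>",
    # No workers: the driver runs Spark in local mode.
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

# Serialise to JSON for use in a job definition or API request body.
print(json.dumps(single_node_test_cluster, indent=2))
```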

Security and Governance

- Test data should be synthetic or anonymised; never test against production PII.

- Use separate catalogs or schemas for test data to prevent accidental production writes.

- Store test credentials in Databricks secrets, not in test code.
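
A possible shape for reading test credentials from a secret scope is shown below. `load_test_credentials`, the scope name, and the key names are all hypothetical. `dbutils` is passed in as an argument rather than used as a notebook global, so the function can be exercised locally with a stand-in object.

```python
from unittest.mock import Mock


def load_test_credentials(dbutils, scope: str = "test-scope") -> dict:
    """Read test credentials from a Databricks secret scope.

    The scope and key names are hypothetical; use the ones defined in
    your workspace. Nothing sensitive is hard-coded in the test code.
    """
    return {
        "user": dbutils.secrets.get(scope=scope, key="test-db-user"),
        "password": dbutils.secrets.get(scope=scope, key="test-db-password"),
    }


# Local check with a stand-in for dbutils; in a notebook, pass the real one.
fake_dbutils = Mock()
fake_dbutils.secrets.get.side_effect = lambda scope, key: f"{scope}/{key}"
creds = load_test_credentials(fake_dbutils)
assert creds["user"] == "test-scope/test-db-user"
```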

Common Pitfalls and Recommended Patterns

Testing logic inside notebook cells: extract into modules first; cells are not importable.
Skipping tests because "it works in the notebook": production failures from untested code are costly.
Testing with full production datasets: use small synthetic fixtures for speed and reproducibility.
Not mocking external dependencies: tests that call APIs or databases are slow and fragile.
Running tests on multi-node clusters: unit tests need only a single node.
Forgetting to assert: tests that run code without assertions do not catch bugs.

Frequently Asked Questions

Can I run pytest in a Databricks notebook?

Yes. Call pytest.main() with the path to the test directory. The results appear in the cell output.
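
A minimal sketch of such a cell follows, assuming pytest is installed on the cluster. The helper names `build_pytest_args` and `run_pytest` are hypothetical. Disabling the cache plugin and bytecode writing is a common precaution when the tests live on a read-only workspace path.

```python
import sys


def build_pytest_args(test_dir: str = ".") -> list:
    """Arguments for running pytest from a notebook cell."""
    return [
        test_dir,
        "-v",
        # Disable the cache plugin so pytest does not try to write a
        # .pytest_cache directory into a read-only workspace path.
        "-p", "no:cacheprovider",
    ]


def run_pytest(test_dir: str = ".") -> int:
    """Run pytest in-process and return its exit code (0 means all passed)."""
    import pytest  # imported lazily; assumes pytest is installed on the cluster

    # Avoid writing .pyc files next to the tests on a read-only filesystem.
    sys.dont_write_bytecode = True
    return int(pytest.main(build_pytest_args(test_dir)))


# In a notebook cell ("tests" is an assumed directory name):
#   retcode = run_pytest("tests")
#   assert retcode == 0, "pytest reported failures; see the cell output."
```

The final assertion makes the cell itself fail when any test fails, which in turn fails a job that runs the notebook.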

Do I need a running cluster for unit tests?

For pure Python tests (no Spark), you do not need a cluster; run them locally. For tests that use Spark DataFrames, you need a cluster or a local Spark session.

How do I test SQL transformations?

Write the SQL as a Spark SQL call in a Python function, then test the function with sample DataFrames using spark.createDataFrame().

Should every notebook have tests?

Not necessarily. Focus on testing the business logic (transformation functions, validation rules). Notebooks that only orchestrate (calling modules, displaying results) need less testing. Aim for high coverage on the extracted modules.