Notebook cell calls the module
Production-quality Databricks notebooks follow software engineering principles: extract logic into testable modules, use version control via Repos, parameterise with widgets, document with markdown cells, and structure execution for top-to-bottom reproducibility. Treat notebooks as thin orchestration layers that call well-tested library code, not as monolithic scripts with hundreds of cells.
- Structure notebooks for readability, reproducibility, and collaboration
- Apply software engineering patterns: modularity, testing, version control
- Avoid common anti-patterns that lead to maintenance debt
Who this is for: Data engineers, analysts, and data scientists who want to write maintainable, production-ready notebooks on Databricks.
Part of the Databricks Notebooks section of the Databricks tutorial series.
Architecture / Concept Overview: Notebook cell calls the module
A well-structured notebook separates concerns into layers: parameters (widgets), imports, configuration, transformation logic (from modules), orchestration, and output. Business logic lives in importable Python modules stored in Repos, making it testable, reusable, and reviewable through standard code review workflows.
*Notebooks follow a standard section order: parameters, imports, configuration, logic, and output.*
*Notebooks are thin wrappers calling tested library code, with CI/CD ensuring quality.*
*Anti-pattern: monolithic, hardcoded, untested. Best practice: modular, parameterised, tested.*
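The thin-notebook idea can be sketched as follows. This is a minimal illustration, not the tutorial's own code: the module name `transforms.py`, the function `clean_orders`, and the sample rows are all hypothetical, and plain dictionaries stand in for the Spark DataFrames a real pipeline would likely use so that the sketch runs anywhere.

```python
# --- transforms.py (hypothetical module stored in the repo) ---
from typing import Dict, Iterable, List


def clean_orders(rows: Iterable[Dict]) -> List[Dict]:
    """Drop rows without an order_id and keep the first row seen per order_id."""
    seen = set()
    cleaned = []
    for row in rows:
        order_id = row.get("order_id")
        if order_id is None or order_id in seen:
            continue
        seen.add(order_id)
        cleaned.append(row)
    return cleaned


# --- notebook cell (orchestration only) ---
# In a real notebook this would start with: from transforms import clean_orders
raw = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 10.0},     # duplicate
    {"order_id": None, "amount": 5.0},   # missing key
    {"order_id": 2, "amount": 7.5},
]
orders = clean_orders(raw)
print(len(orders))
```

Because `clean_orders` has no dependency on the notebook, it can be reviewed in a pull request and tested outside the workspace.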
Key Terms
- Thin Notebook: A notebook that orchestrates work by calling library functions rather than containing all logic inline.
- Repos: Git integration for managing notebook and module source code with version control.
- Idempotent: A notebook that produces the same result when run multiple times on the same input data.
- Top-to-Bottom Execution: Designing notebooks so every cell runs correctly in sequential order without manual intervention.
- Feature Branch: A Git branch used to develop and review changes before merging to the main branch.
Prerequisites and Setup
- A Databricks workspace with Repos enabled
- A Git repository for source code management
- pytest available on the cluster for testing
- Familiarity with Python packaging and module imports
Step-by-Step Implementation
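One possible first step, the parameters section, could look like the sketch below. It assumes two hypothetical widgets (`run_date` and `target_table`) with made-up defaults, and it passes `dbutils` in as an argument so the function can be exercised with a stand-in object outside Databricks. `dbutils.widgets.text` and `dbutils.widgets.get` are the standard widget calls; everything else is illustrative.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RunParams:
    run_date: date
    target_table: str


def read_params(dbutils) -> RunParams:
    """Declare the widgets, then read and validate their values."""
    # Declare text widgets with a default value and a label
    dbutils.widgets.text("run_date", "2024-01-01", "Run date (YYYY-MM-DD)")
    dbutils.widgets.text("target_table", "dev.sales.orders_clean", "Target table")

    # Widget values are strings, so parse and validate them early
    run_date = date.fromisoformat(dbutils.widgets.get("run_date"))
    target_table = dbutils.widgets.get("target_table").strip()
    if not target_table:
        raise ValueError("target_table must not be empty")
    return RunParams(run_date=run_date, target_table=target_table)


# In a notebook cell (dbutils is provided by the Databricks runtime):
# params = read_params(dbutils)
```

Parsing the values in one place means a malformed date fails in the first cell rather than halfway through a write.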
Configuration Reference
| Practice | Description | Priority |
|---|---|---|
| Sections in order | Params → Imports → Config → Logic → Output | High |
| Extract logic to modules | Importable, testable Python files | High |
| Use widgets for parameters | No hardcoded dates, tables, or configs | High |
| Top-to-bottom execution | Every cell runs in order without errors | High |
| Idempotent writes | Safe to re-run without data corruption | High |
| Version control via Repos | Git branches, PRs, and code review | High |
| Markdown documentation | Purpose, owner, schedule, assumptions | Medium |
| Validation checkpoints | Schema, row count, null checks | Medium |
| Error handling | Try/except with dbutils.notebook.exit() | Medium |
| Unit tests | pytest for all extracted modules | Medium |
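The "Unit tests" row might translate into a test file like the one below. It is a sketch under stated assumptions: `add_net_amount` is a hypothetical function that would normally be imported from an extracted module, and it is defined inline here only so the file runs on its own.

```python
# test_transforms.py (hypothetical test file, run with: pytest test_transforms.py)


def add_net_amount(row: dict, tax_rate: float) -> dict:
    """Stand-in for a function that would normally be imported from a module."""
    if not 0 <= tax_rate < 1:
        raise ValueError("tax_rate must be in [0, 1)")
    return {**row, "net_amount": round(row["amount"] * (1 - tax_rate), 2)}


def test_add_net_amount_applies_tax():
    result = add_net_amount({"amount": 100.0}, tax_rate=0.2)
    assert result["net_amount"] == 80.0


def test_add_net_amount_keeps_original_fields():
    result = add_net_amount({"order_id": 7, "amount": 50.0}, tax_rate=0.0)
    assert result == {"order_id": 7, "amount": 50.0, "net_amount": 50.0}


def test_add_net_amount_rejects_bad_rate():
    # Plain try/except keeps the file import-free; pytest.raises is the usual idiom
    try:
        add_net_amount({"amount": 1.0}, tax_rate=1.5)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

pytest discovers functions whose names start with `test_`, so no extra registration is needed.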
Monitoring, Cost, and Security Considerations
Monitoring
Add logging at key checkpoints so job run output provides visibility into what happened. Use dbutils.notebook.exit() with a JSON result string for structured exit status.
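A minimal sketch of checkpoint logging plus a JSON exit value is shown below. The logger name, the payload fields, and the sample numbers are assumptions chosen for illustration; only `dbutils.notebook.exit` comes from the text above, and it is left as a comment because `dbutils` exists only on Databricks.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("orders_pipeline")  # hypothetical logger name


def build_exit_payload(status: str, rows_written: int, run_date: str) -> str:
    """Serialise the run summary so a caller can parse it with json.loads."""
    return json.dumps(
        {"status": status, "rows_written": rows_written, "run_date": run_date},
        sort_keys=True,
    )


logger.info("Starting load for run_date=%s", "2024-01-01")
payload = build_exit_payload("SUCCESS", rows_written=1250, run_date="2024-01-01")
logger.info("Finished load: %s", payload)

# Last cell of the notebook:
# dbutils.notebook.exit(payload)
```

A parent notebook that starts this one with `dbutils.notebook.run` receives the string and can decode it with `json.loads`. Since `dbutils.notebook.exit` ends the notebook as a successful run, raising an exception is the usual way to mark a job run as failed.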
Cost Optimisation
- Modular, well-tested code reduces debugging time and failed job re-runs.
- Short, focused notebooks tend to run faster and use less compute, because each run does only the work it needs.
- Use incremental writes (replaceWhere) instead of full table overwrites to reduce processing time.
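The replaceWhere approach can be sketched as a small helper that builds the predicate for one day of data. The column name, table name, and helper function are hypothetical; the Spark write is shown as a comment because it needs a Delta table and a running cluster.

```python
from datetime import date, timedelta


def replace_where_for_day(column: str, day: date) -> str:
    """Build a predicate covering exactly one day: [day, day + 1)."""
    if not column.isidentifier():
        raise ValueError(f"unexpected column name: {column!r}")
    next_day = day + timedelta(days=1)
    return f"{column} >= '{day.isoformat()}' AND {column} < '{next_day.isoformat()}'"


predicate = replace_where_for_day("order_date", date(2024, 1, 31))
print(predicate)
# order_date >= '2024-01-31' AND order_date < '2024-02-01'

# In the notebook, with a Spark DataFrame `df` that holds only that day's rows:
# (df.write.format("delta")
#    .mode("overwrite")
#    .option("replaceWhere", predicate)
#    .saveAsTable("dev.sales.orders_clean"))  # hypothetical table name
```

Re-running the notebook for the same day overwrites the same slice, which also keeps the write idempotent.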
Security and Governance
- Never hardcode credentials; use dbutils.secrets.get().
- Use Repos for auditable change history and mandatory code review.
- Store notebooks that access sensitive data in restricted workspace folders with appropriate permissions.
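As an illustration of the first point, credentials can be fetched from a secret scope at run time. The helper below is a sketch: the scope name, key names, JDBC URL, and table are hypothetical, and `dbutils` is passed in so the function can be tested with a stand-in object.

```python
def jdbc_properties(dbutils, scope: str, user_key: str, password_key: str) -> dict:
    """Fetch credentials from a secret scope instead of hardcoding them."""
    return {
        "user": dbutils.secrets.get(scope=scope, key=user_key),
        "password": dbutils.secrets.get(scope=scope, key=password_key),
    }


# In a notebook cell (scope, keys, URL, and table are hypothetical):
# props = jdbc_properties(dbutils, "sales-etl", "db-user", "db-password")
# df = spark.read.jdbc(url=jdbc_url, table="public.orders", properties=props)
```

Keeping the lookup in one function makes it easy to confirm in code review that no literal credentials appear in the notebook.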
Common Pitfalls and Recommended Patterns
- Writing 100+ cell notebooks: keep notebooks to roughly 10-30 cells; extract logic into modules.
- Running cells out of order and relying on hidden state: always verify with "Run All".
- Hardcoding connection strings, passwords, or table names: use widgets and secrets.
- Not documenting the notebook purpose: add a markdown header cell with purpose, owner, and schedule.
- Skipping validation: adding assert df.count() > 0 saves hours of debugging downstream issues.
- Using workspace folders without Git: Repos provide proper versioning, branching, and code review.
- Copy-pasting code between notebooks: extract shared logic into a common module.
- Not using exit codes: dbutils.notebook.exit("SUCCESS") provides structured job output.
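The validation pitfall can be addressed with a slightly richer check than a single assert. The sketch below covers schema, row count, and nulls; the function name, column names, and sample inputs are hypothetical, and the inputs are plain Python values so the check can be unit tested without Spark.

```python
from typing import Dict, List, Sequence


def validation_errors(
    columns: Sequence[str],
    row_count: int,
    null_counts: Dict[str, int],
    required_columns: Sequence[str],
) -> List[str]:
    """Return human-readable problems; an empty list means the data passed."""
    errors = []
    missing = [c for c in required_columns if c not in columns]
    if missing:
        errors.append(f"missing columns: {missing}")
    if row_count == 0:
        errors.append("row count is 0")
    for col, nulls in null_counts.items():
        if nulls > 0:
            errors.append(f"{nulls} null value(s) in {col}")
    return errors


problems = validation_errors(
    columns=["order_id", "amount"],
    row_count=3,
    null_counts={"order_id": 0},
    required_columns=["order_id", "amount", "order_date"],
)
print(problems)

# In a notebook, the inputs could come from a Spark DataFrame `df`:
#   columns=df.columns, row_count=df.count(),
#   null_counts={"order_id": df.filter(df["order_id"].isNull()).count()}
# followed by: if problems: raise ValueError("; ".join(problems))
```

Collecting every problem before raising gives one complete error message instead of a fix-and-rerun loop.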
Frequently Asked Questions
How many cells should a notebook have?
Aim for 10-30 cells. If a notebook exceeds this, extract logic into Python modules. Each cell should have a clear, single purpose.
Should I use notebooks or Python scripts for production?
Use notebooks for orchestration and visualisation, and Python modules for business logic. This gives you the best of both worlds: interactive development with testable, reviewable code.
How do I share code between notebooks?
Use %run for simple cases (utility notebooks) or import from Python modules in Repos for larger codebases. Prefer modules for testability.
Should every notebook have documentation?
Yes. At minimum, include a markdown cell at the top with the notebook's purpose, owner, schedule (if applicable), and key assumptions.