Databricks AI ML DL

Databricks notebooks

With Databricks notebooks, you can:

  • Develop code using Python, SQL, Scala, and R.

  • Customize your environment with the Python, ML, and deep learning libraries of your choice.

  • Schedule jobs to run tasks automatically, including multi-notebook workflows that pass data between notebooks (see the sketch after this list).

  • Export results and notebooks in .html or .ipynb format.

  • Use a Git-based repository to store your notebooks with associated files and dependencies.

  • Build and share dashboards.
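
As a minimal sketch of the multi-notebook pattern, the call below runs a second notebook and passes parameters to it; the notebook path, timeout, and argument names are placeholders, not part of the original page.

```python
# Call a downstream notebook as part of a multi-notebook workflow and pass
# it parameters. Path, timeout, and argument names are hypothetical.
result = dbutils.notebook.run(
    "/Shared/prepare_data",        # placeholder notebook path
    600,                           # timeout in seconds
    {"input_date": "2024-01-01"},  # read in the callee via dbutils.widgets
)

# The called notebook can return a value with dbutils.notebook.exit("..."),
# which arrives here as a string.
print(result)
```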

ML Workflow

Databricks File System (DBFS)

  • Interact with object storage using directory and file semantics instead of cloud-specific API commands (see the sketch after this list).

  • Mount cloud object storage locations so that you can map storage credentials to paths in the Databricks workspace.

  • Simplifies the process of persisting files to object storage, allowing virtual machines and attached volume storage to be safely deleted on cluster termination.

  • Provides a convenient location for storing init scripts, JARs, libraries, and configurations for cluster initialization.

  • Provides a convenient location for checkpoint files created during model training with OSS deep learning libraries.
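
A minimal sketch of the mount-and-browse pattern, assuming a notebook with access to a cloud bucket; the bucket name, mount point, and paths below are placeholders.

```python
# Mount a cloud object storage location to a path in the workspace.
# Bucket, mount point, and credentials are placeholders.
dbutils.fs.mount(
    source="s3a://example-bucket",   # hypothetical bucket
    mount_point="/mnt/example",
    extra_configs={},                # supply storage credentials as required
)

# Browse it with directory-and-file semantics instead of cloud-specific APIs.
display(dbutils.fs.ls("/mnt/example"))

# The same path works directly in Spark reads and writes.
df = spark.read.json("/mnt/example/raw/events/")
```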

Data Pipelines for ML

Libraries

  • Workspace libraries serve as a local repository from which you create cluster-installed libraries. A workspace library might be custom code created by your organization, or might be a particular version of an open-source library that your organization has standardized on.

  • Cluster libraries can be used by all notebooks running on a cluster. You can install a cluster library directly from a public repository such as PyPI or Maven, or create one from a previously installed workspace library.

  • Notebook-scoped libraries, available for Python and R, allow you to install libraries and create an environment scoped to a notebook session. These libraries do not affect other notebooks running on the same cluster. 

  • Notebook-scoped libraries do not persist and must be re-installed for each session. Use notebook-scoped libraries when you need a custom environment for a specific notebook (as shown in the sketch below).
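
A minimal notebook-scoped install, assuming a Python notebook on a recent Databricks Runtime; the package and version are arbitrary examples.

```python
# Installed for this notebook's session only; other notebooks attached to the
# same cluster are unaffected, and the install does not persist across sessions.
%pip install scikit-learn==1.3.2
```

In a later cell, import the library as usual; once the notebook detaches, the environment is discarded and the install must be repeated.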
     

Databricks Runtime Engine
  • Engineered from the ground up for performance, Spark can be up to 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations (illustrated briefly below).

  • Spark is also fast when data is stored on disk, and holds a world record for large-scale on-disk sorting.
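
As a brief illustration of the in-memory point, caching a DataFrame keeps it in cluster memory so repeated queries avoid re-reading from storage; the table name below is a placeholder.

```python
# Cache a DataFrame so repeated actions reuse in-memory data instead of
# re-reading from object storage. The table name is hypothetical.
df = spark.read.table("samples.trips")
df.cache()

print(df.count())                               # first action populates the cache
print(df.filter("trip_distance > 10").count())  # reuses the cached data
```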

Unified governance and storage

  • Data streaming on Databricks means you benefit from the foundational components of the Lakehouse Platform — Unity Catalog and Delta Lake. 

  • Raw data is optimized with Delta Lake, the only open source storage framework designed from the ground up for both streaming and batch data. 

  • Unity Catalog gives fine-grained, integrated governance for all your data and AI assets with one consistent model to discover, access, and share data across clouds (see the sketch after this list).

  • Unity Catalog also provides native support for Delta Sharing, the industry’s first open protocol for simple and secure data sharing with other organizations.
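
A minimal governance sketch using Unity Catalog's three-level namespace and SQL grants issued from a Python notebook; the catalog, schema, table, and group names are placeholders.

```python
# Create governed objects in the catalog.schema.table namespace.
# All names below are placeholders.
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.events")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events.clickstream (
        user_id STRING, url STRING, event_time TIMESTAMP)
""")

# Grant fine-grained, group-level read access; the policy is enforced
# consistently across workspaces attached to the same metastore.
spark.sql("GRANT SELECT ON TABLE analytics.events.clickstream TO `data-analysts`")
```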

Delta Lake

Data Engineering

  • Databricks is an optimized platform for Apache Spark, providing an efficient, simple, and scalable platform for running Apache Spark workloads in a clustered, parallel-processing environment.

  • Apache Spark is at the heart of the Databricks Lakehouse Platform and is the technology powering compute clusters and SQL warehouses on the platform.

  • Databricks Runtime for Machine Learning provides pre-built deep learning infrastructure and includes common deep learning libraries like Hugging Face Transformers, PyTorch, TensorFlow, and Keras (see the sketch below).
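
For example, a minimal sketch assuming Databricks Runtime for Machine Learning, where the Hugging Face transformers library is preinstalled; the sentiment model used is the library's default, not something specified on this page.

```python
from transformers import pipeline

# Load a pretrained sentiment-analysis pipeline (library default model)
# and score a sample sentence on the driver node.
sentiment = pipeline("sentiment-analysis")
print(sentiment("Databricks notebooks make iterating on models easy."))
```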

Databricks Notebook

Databricks Workflows

  • Run a Delta Live Tables pipeline that ingests raw clickstream data from cloud storage, cleans and prepares the data, sessionizes the data, and persists the final sessionized data set to Delta Lake.

  • Run a Delta Live Tables pipeline that ingests order data from cloud storage, cleans and transforms the data for processing, and persists the final data set to Delta Lake.

  • Join the order and sessionized clickstream data to create a new data set for analysis.

  • Extract features from the prepared data.

  • Perform tasks in parallel to persist the features and train a machine learning model (a workflow sketch follows this list).
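
A sketch of that workflow as a Jobs API 2.1 create payload, written here as a Python dict; pipeline IDs, notebook paths, and task keys are placeholders, and cluster settings are omitted for brevity.

```python
# Multi-task workflow: two Delta Live Tables pipelines, a join, feature
# extraction, then two parallel tasks. All IDs and paths are placeholders.
workflow = {
    "name": "clickstream-orders-ml",
    "tasks": [
        {"task_key": "ingest_clickstream",
         "pipeline_task": {"pipeline_id": "<clickstream-pipeline-id>"}},
        {"task_key": "ingest_orders",
         "pipeline_task": {"pipeline_id": "<orders-pipeline-id>"}},
        {"task_key": "join_datasets",
         "depends_on": [{"task_key": "ingest_clickstream"},
                        {"task_key": "ingest_orders"}],
         "notebook_task": {"notebook_path": "/Workspace/ml/join_datasets"}},
        {"task_key": "extract_features",
         "depends_on": [{"task_key": "join_datasets"}],
         "notebook_task": {"notebook_path": "/Workspace/ml/extract_features"}},
        # These last two tasks run in parallel once feature extraction finishes.
        {"task_key": "persist_features",
         "depends_on": [{"task_key": "extract_features"}],
         "notebook_task": {"notebook_path": "/Workspace/ml/persist_features"}},
        {"task_key": "train_model",
         "depends_on": [{"task_key": "extract_features"}],
         "notebook_task": {"notebook_path": "/Workspace/ml/train_model"}},
    ],
}
```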

Databricks File System

Databricks Streaming

  • Use Delta Live Tables for near-real-time data ingestion, processing, machine learning, and AI on streaming data (see the sketch after this list).

  • Run Structured Streaming workloads.

  • Streaming table: Each record is processed exactly once. This assumes an append-only source.

  • Materialized views: Records are processed as required to return accurate results for the current data state. Materialized views should be used for data sources with updates, deletions, or aggregations, and for change data capture (CDC) processing.

  • Views: Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets.
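
A minimal Delta Live Tables sketch in Python showing a streaming table and an intermediate view; the source path and dataset names are placeholders, and the code runs inside a DLT pipeline rather than as a plain notebook.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested incrementally from cloud storage.")
def raw_events():
    # Streaming table: each appended record is processed exactly once.
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/mnt/example/raw_events/")          # placeholder path
    )

@dlt.view(comment="Intermediate transformation; not published downstream.")
def cleaned_events():
    return dlt.read_stream("raw_events").where(F.col("user_id").isNotNull())
```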

Python Libraries

Fast & Agile

Lakehouse Architecture

Real-Time Data Streaming

  • Streaming data ingestion and transformation (see the sketch after this list)

  • Real-time analytics, ML and applications

  • Automated operational tooling

  • Next-generation stream processing engine

  • Unified governance and storage
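
A minimal ingestion-and-transformation sketch using Structured Streaming with Auto Loader, writing to a Delta table; the paths, schema location, and table name are placeholders.

```python
# Incrementally ingest files from cloud storage and write the stream to a
# Delta table. All paths and names are placeholders.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/example/_schemas/events")
    .load("/mnt/example/landing/events/")
    .withColumnRenamed("ts", "event_time")
    .writeStream
    .option("checkpointLocation", "/mnt/example/_checkpoints/events")
    .trigger(availableNow=True)      # process available data, then stop
    .toTable("analytics.events.bronze")
)
```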
     

Ingest → Process → Store → Serve
