Leveraging dbt and Python for Predictive Modeling
Exploring dbt’s Python entry point functionality in v1.5

Last week, I had the opportunity to attend a wonderful talk titled “Unleashing the Power of dbt and Python for Modern Data Stack” by Features and Labels (fal-ai) at EuroPython 2023. The talk focused on integrating dbt and Python to build and train ML pipelines, highlighting the untapped potential of combining the two for efficient data transformation and analysis.
While dbt currently supports Python integrations with Snowflake and Databricks, support for other adapters such as Redshift does not exist yet. In this Medium post, I would like to share my perspective on leveraging new dbt functionality in v1.5 for predictive modeling.
We will walk through the entire process, including fetching external data, storing it in a Postgres database, performing transformations and data cleansing with dbt, and finally, making inferences with a pre-trained LightGBM model to profit-score loans and storing the scores in the same database.
More information about how the model was trained and the data it was trained on is available in this LinkedIn post.
Getting Started
To follow along with this tutorial, make sure you have Python 3.11 installed and set up Docker for the Postgres database. We will use Python’s built-in venv package to create a virtual environment for our project. I am using macOS Monterey version 12.6.3 (M2 chip).
Setting up the Virtual Environment
- Install Python 3.11 using brew:
brew install python@3.11
- Download Docker for your machine from Docker’s website
First, you will need to create a project folder, which in my case is dbt-python.
Now, let’s set up the virtual environment and install the necessary packages:
python3.11 -m venv my_env
source my_env/bin/activate
pip install pandas requests scikit-learn lightgbm==3.3.5 psycopg2-binary sqlalchemy dbt-core dbt-postgres
Setting up the PostgreSQL Database
We need to build a Docker container with our PostgreSQL database. You can download Postgres by running:
docker pull postgres
To set up a local instance of Postgres, you can follow this guide from Week 1 of the Data Engineering Zoomcamp. Remember to mount the database to your local drive to avoid data loss if the Docker container is stopped. More guidance on using Postgres with Docker is available in this post.
Below is an example of spinning up a local Postgres database instance:
docker run -d \
--name postgres_db \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="postgres_db" \
-v pg_data:/var/lib/postgresql/data \
-p 5432:5432 \
postgres
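Before moving on, it is worth confirming that the database accepts connections. Below is a minimal Python sanity check, assuming the credentials above (the file name check_connection.py is just an example):

# check_connection.py - optional check that Postgres is reachable
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://root:root@localhost:5432/postgres_db")

with engine.connect() as conn:
    # Print the server version string to confirm the connection works
    print(conn.execute(text("SELECT version();")).scalar())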
Creating the dbt Project
Now, let’s set up our dbt project. Initiate a dbt project folder within the same directory:
dbt init my_dbt_project --skip-profile-setup
Next, set up a profiles.yml file, which contains our credentials:
cd my_dbt_project
nano profiles.yml
Copy the following configuration into the yaml file:
dbt_playground:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: "root"
      pass: "root"
      port: 5432
      dbname: postgres_db
      schema: dbt_playground
      threads: 1
Verify your dbt setup:
dbt debug
If you see “All checks passed”, your dbt setup is complete.
Pipeline

The first module within our pipeline is a Python script, fetch_and_write_data.py, which fetches external data from Bondora and ingests it into Postgres. The resulting table will appear under the name dbt_playground.bondora_loan_dataset in our Postgres database.
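The exact implementation is not reproduced here, but a rough sketch of the idea looks like the following. The download URL is a placeholder, the column handling is simplified, and it assumes the dbt_playground schema already exists in the database:

# fetch_and_write_data.py (sketch): fetch the loan data and write it to Postgres
import io

import pandas as pd
import requests
from sqlalchemy import create_engine

DATA_URL = "https://example.com/bondora_loan_dataset.csv"  # placeholder, not the real endpoint
ENGINE = create_engine("postgresql+psycopg2://root:root@localhost:5432/postgres_db")


def fetch_and_write() -> None:
    # Download the raw CSV into a DataFrame
    response = requests.get(DATA_URL, timeout=60)
    response.raise_for_status()
    df = pd.read_csv(io.StringIO(response.text))

    # Write the raw data to Postgres so dbt can pick it up
    # (assumes the dbt_playground schema has already been created)
    df.to_sql(
        "bondora_loan_dataset",
        ENGINE,
        schema="dbt_playground",
        if_exists="replace",
        index=False,
    )


if __name__ == "__main__":
    fetch_and_write()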
The second part of the pipeline is dbt_transform_data.py, which transforms the raw data from the previous step into a dataset that we can feed to our scoring model.
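This is where dbt’s new programmatic entry point comes in: since v1.5, dbt-core exposes a dbtRunner class (importable from dbt.cli.main) that lets you invoke dbt commands directly from Python instead of shelling out to the CLI. A minimal sketch of what dbt_transform_data.py can look like, assuming the project layout above (the selection and directory flags are illustrative):

# dbt_transform_data.py (sketch): run the dbt model via the v1.5 Python entry point
from dbt.cli.main import dbtRunner, dbtRunnerResult


def run_transformation() -> None:
    dbt = dbtRunner()

    # Equivalent to `dbt run --select dbt_transformation` on the command line
    cli_args = [
        "run",
        "--select", "dbt_transformation",
        "--project-dir", "my_dbt_project",
        "--profiles-dir", "my_dbt_project",
    ]
    result: dbtRunnerResult = dbt.invoke(cli_args)

    # result.success is False if any model failed to build
    if not result.success:
        raise RuntimeError(f"dbt run failed: {result.exception}")


if __name__ == "__main__":
    run_transformation()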
To apply the transformations to the extracted dataset, we will need to create a script called dbt_transformation.sql inside the /models folder of my_dbt_project.
touch dbt_transformation.sql
nano dbt_transformation.sql
Copy the transformation script into dbt_transformation.sql from here.
After the second script executes, the data will become available in the database in the table dbt_playground.dbt_transformation.
The final step is to score loans using the module run_scoring.py. In this step, we download our pickled model and apply it to the transformed dataset. After running the script, we should see that the average predicted Annualized Rate of Return (ARR) in our sample is equal to 7%.
The scoring outputs are stored in the table dbt_playground.scoring_outputs, which contains the contract ID, the profit score, and a timestamp.
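A sketch of what run_scoring.py boils down to is shown below. The model file name and the contract_id column are assumptions for illustration, and it assumes the downloaded pickle holds a LightGBM Booster:

# run_scoring.py (sketch): score the transformed loans and persist the results
import pickle
from datetime import datetime, timezone

import lightgbm as lgb
import pandas as pd
from sqlalchemy import create_engine

ENGINE = create_engine("postgresql+psycopg2://root:root@localhost:5432/postgres_db")
MODEL_PATH = "profit_scoring_model.pkl"  # placeholder for the downloaded model file


def run_scoring() -> None:
    # Load the transformed features produced by dbt
    features = pd.read_sql_table("dbt_transformation", ENGINE, schema="dbt_playground")

    # Load the pre-trained model (assumed to be a pickled LightGBM Booster)
    with open(MODEL_PATH, "rb") as f:
        booster: lgb.Booster = pickle.load(f)

    # Score every loan using the feature columns the model was trained on
    scores = booster.predict(features[booster.feature_name()])

    # Persist the scores together with the contract identifier and a timestamp
    outputs = pd.DataFrame(
        {
            "contract_id": features["contract_id"],  # assumed identifier column
            "profit_score": scores,
            "scored_at": datetime.now(timezone.utc),
        }
    )
    outputs.to_sql(
        "scoring_outputs",
        ENGINE,
        schema="dbt_playground",
        if_exists="replace",
        index=False,
    )
    print(f"Average predicted ARR: {scores.mean():.2%}")


if __name__ == "__main__":
    run_scoring()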
Conclusion
And that’s about it. In this post, we have demonstrated a powerful and flexible approach to predictive modeling by combining dbt and Python.
Integration using dbt’s Python entry point enables smooth data transformation, analysis, and model inference, making it a valuable addition to any predictive modeling workflow. With the example of P2P scoring using Bondora’s loan dataset, we hope this guide inspires you to explore dbt + Python for your own data projects. Happy modeling!