Hi, and welcome to this demo on how to combine the powerful capabilities of IBM Db2 Warehouse with the flexibility of a Jupyter notebook. We will use PySpark to analyze data that is stored in IBM Db2 Warehouse and visualize the results using matplotlib. This notebook is connected to IBM Db2 Warehouse using Livy, an open-source REST service for Apache Spark. You can use Livy with sparkmagic in a Jupyter notebook to execute Spark jobs easily. Refer to the README file for information about how to configure a Livy connection. A Livy connection lets you work in a Jupyter notebook that you open locally while the code runs on a Spark cluster alongside IBM Db2 Warehouse. Let's go back to our notebook. Our use case is to analyze tornado distribution in the US in order to assign a risk score to real-estate insurance holders. Let's get started! First, you have to configure Livy on your client. Credentials can be modified in the config.json file, which is located in the .sparkmagic directory.
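To make that concrete, here is a minimal sketch of what the Livy entry in the ~/.sparkmagic/config.json file might look like. The key names follow sparkmagic's example configuration; the URL, user name, and password are placeholders that you replace with your own Livy endpoint and Db2 Warehouse credentials.

```json
{
  "kernel_python_credentials": {
    "username": "<db2-user>",
    "password": "<db2-password>",
    "url": "https://<livy-endpoint>",
    "auth": "Basic_Access"
  },
  "session_configs": {
    "driver_memory": "2G",
    "executor_cores": 2
  }
}
```

Once this file is in place, the sparkmagic kernel picks it up automatically when the notebook starts a session.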
In the first cell of this notebook, we import the necessary libraries. Notice that running this first cell does more than import the libraries: it also creates a Livy session and a Spark session. The cell shown here uses the ibmdbpy Python library to establish a connection from this notebook to the database. We use sample data that is already available in Db2 Warehouse: here, we check whether the tables containing data about tornadoes and home-insurance holders have already been created, and if not, we build them. Ibmdbpy is an open-source project launched by IBM to accelerate Python analytics through in-database processing. For more details about this package and how to install it, refer to the IBM Knowledge Center. Last but not least, we import a few libraries that will be useful for local data visualization. Note the %%local magic at the beginning of that cell, which indicates that the cell runs on the client, so the images are rendered locally rather than on the server.

Now, let's go back to our use case. We can now load the data with PySpark and explore it. We use the idaxsource format because it enables a co-located data read, which improves efficiency: with idaxsource, the Spark data partitions correspond directly to the Db2 partitions, and that saves a lot of time. This is not the case with a standard JDBC data source in PySpark.

Let's take a look at the tornado dataset. We have around 60,000 records: the dataset consists of all tornado records in the US between 1950 and 2013. For each tornado we have information such as when it started, the geographical coordinates of the locations where it started and ended, and its magnitude. Since we have geographical coordinates, let's display this data on a map to get a better idea of the scale and distribution of tornadoes. If you wish, you can downsample the dataset. To visualize the data, we will use GeoPandas, a Python tool for geographical data. We fetch the data through an SQL query in order to plot it locally. After converting it to the required format and loading a map of the United States as the background, we obtain the following image. We can clearly see that tornadoes mostly hit the eastern and southern parts of the country. However, this is not precise enough for our analysis, so let's dig deeper into the data. A more insightful view is the heat map: the red areas are the hotspots where the most tornadoes started, which corroborates the official map that you see here. We also display a density map, which reveals that Texas is by far the most heavily impacted state, with over 8,000 tornado start locations within its borders!
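The transcript does not reproduce the exact load cell, but reading the tornado table through the co-located idax data source described above might look roughly like the following sketch. The format string follows the Db2 Warehouse Spark integration, and the SAMPLES.TORNADOES table name is an assumption about the sample schema, so adjust both to your environment.

```python
# Runs remotely on the Spark cluster inside Db2 Warehouse (no %%local magic).
# The idax data source reads each Db2 partition into a matching Spark partition,
# avoiding the single-connection bottleneck of a plain JDBC read.
tornadoes = (spark.read
             .format("com.ibm.idax.spark.idaxsource")   # Db2 Warehouse co-located data source
             .options(dbtable="SAMPLES.TORNADOES")      # assumed schema/table name
             .load())

tornadoes.printSchema()
print(tornadoes.count())   # on the order of 60,000 rows for the sample dataset
```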
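As a sketch of the local plotting step, assume the start coordinates were first pulled down to the client with a sparkmagic cell such as %%sql -o tornado_points (the view name and the lon/lat column names are illustrative). The %%local cell below then plots them with GeoPandas; note that the naturalearth_lowres base map ships with older GeoPandas versions, so a newer install may need a different background map.

```python
%%local
# Runs on the client. Assumes `tornado_points` is a pandas DataFrame with
# `lon`/`lat` columns that was pulled down earlier via `%%sql -o tornado_points`.
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

# Turn the start coordinates into a GeoDataFrame of points.
points = gpd.GeoDataFrame(
    tornado_points,
    geometry=[Point(xy) for xy in zip(tornado_points.lon, tornado_points.lat)],
    crs="EPSG:4326",
)

# Load a world map as background and keep only the United States.
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
usa = world[world.name == "United States of America"]

ax = usa.plot(color="lightgrey", figsize=(12, 7))
points.plot(ax=ax, markersize=1, color="red", alpha=0.3)
ax.set_xlim(-125, -65)   # rough bounding box for the contiguous US
ax.set_ylim(24, 50)
plt.title("Tornado start locations, 1950-2013")
plt.show()
```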
From now on, we'll focus on Texas. Let's use the KMeans algorithm to build tornado clusters in Texas; here, we choose to build four clusters. We first select the features we want to use, then define the model settings, such as the number of clusters and the random seed, and finally fit the model to our dataset. Next, we take a look at the model: here is the total number of tornadoes in each cluster. You can see the cluster centers and sizes on this map, and on this other map you can see which tornado belongs to which cluster.

Next, we open the Texas_insurance customers table, which contains, among other features, the geographical coordinates of each customer's real-estate property. Again, if you wish, you can downsample the data for visualization. The next table has a new cluster ID column, so you can see to which cluster each insurance holder belongs. In the second table, you can see how many insured properties each cluster contains. Let's display this information on a map. Again, we bridge remote and local with an SQL temporary view in order to render the map locally. On this map, each grey dot corresponds to an insured property. On this other map, each customer is assigned to the nearest tornado cluster; the result corresponds to the Voronoi cells of the clusters.

Now, it's up to you to define a metric to score the risk associated with each insurance holder. You can easily test your ideas by downloading and completing this notebook. Deploy your application easily and run it anytime on IBM Db2 Warehouse!
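The notebook's clustering cell isn't reproduced in this transcript, but the three steps just described (select the features, set the number of clusters and the seed, fit the model) might look roughly like this with Spark ML; the ST, SLAT, and SLON column names are assumptions about the tornado table.

```python
# Remote cell: cluster the Texas tornado start locations with Spark ML KMeans.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Keep only Texas tornadoes (ST is an assumed state-code column).
texas = tornadoes.filter(tornadoes.ST == "TX")

# 1. Select the features to cluster on: start latitude and longitude.
assembler = VectorAssembler(inputCols=["SLAT", "SLON"], outputCol="features")
texas_features = assembler.transform(texas)

# 2. Define the model settings: number of clusters and random seed.
kmeans = KMeans(k=4, seed=42, featuresCol="features", predictionCol="cluster_id")

# 3. Fit the model and inspect it.
model = kmeans.fit(texas_features)
print(model.clusterCenters())                    # the four cluster centers

clustered = model.transform(texas_features)      # adds the cluster_id column
clustered.groupBy("cluster_id").count().show()   # number of tornadoes per cluster
```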
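Likewise, here is a sketch of how the customer assignment and the remote-to-local bridge described above could be wired together, reusing the fitted model from the previous sketch. The SAMPLES.TEXAS_INSURANCE table and the LAT/LON column names are assumptions; the key point is that a KMeans prediction assigns each point to its closest center, which is exactly the Voronoi-cell assignment shown on the final map.

```python
# Remote cell: assign each insured property to its nearest tornado cluster.
from pyspark.ml.feature import VectorAssembler

# Assumed table and column names for the insurance holders.
customers = (spark.read
             .format("com.ibm.idax.spark.idaxsource")
             .options(dbtable="SAMPLES.TEXAS_INSURANCE")
             .load())

# Assemble the coordinates in the same order used when fitting the model.
cust_features = VectorAssembler(inputCols=["LAT", "LON"],
                                outputCol="features").transform(customers)

# `model` is the KMeans model fitted on the tornado data above; its prediction
# picks the closest cluster center for each property (the Voronoi assignment).
cust_clustered = model.transform(cust_features)

# How many insured properties fall into each cluster?
cust_clustered.groupBy("cluster_id").count().show()

# Bridge to the client: expose the result as a temporary view so that a
# `%%sql -o customers_local` cell can pull it into a local pandas DataFrame
# for the grey-dot and Voronoi maps plotted with GeoPandas/matplotlib.
cust_clustered.drop("features").createOrReplaceTempView("customers_clustered")
```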
Thanks for watching!