Bubbles

Introduction

Bubbles is a Python framework for data processing and data quality measurement. Its basic concepts are abstract data objects, operations, and dynamic operation dispatch.

When a data scientist works with data, that data is typically stored in CSV files, Excel files, databases, and other formats. It is also commonly loaded as a pandas DataFrame. For simplicity in the examples, I’ll be using Python lists that contain our data. I’m assuming that you have some knowledge about Python data types, functions, methods, and packages. If you don’t have that knowledge, I suggest you read my previous article that covers these topics.

Bubble charts display data as a cluster of circles. The data required to create a bubble chart consists of the x and y coordinates, the size of each bubble, and the colour of each bubble. The colours can be supplied by the library itself.

A bubble chart can be created using the pandas DataFrame.plot.scatter() method or, as in the example below, Matplotlib's plt.scatter() function.

import matplotlib.pyplot as plt
import numpy as np

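# create random data: 40 points with coordinates, sizes and colours
x = np.random.rand(40)
y = np.random.rand(40)
z = np.random.rand(40)  # bubble sizes, scaled up below
colors = np.random.rand(40)  # mapped to colours by the default colormap
# use the scatter function; s sets the marker area in points squared
plt.scatter(x, y, s=z * 1000, c=colors)
plt.show()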
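The same chart can also be drawn from a DataFrame with DataFrame.plot.scatter(). The following is a minimal sketch, assuming pandas is installed. The column names ("x", "y", "size", "color"), the fixed random seed and the "viridis" colormap are my own choices for illustration, not something the text above prescribes.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random data with one column per bubble attribute (column names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "x": rng.random(40),
        "y": rng.random(40),
        "size": rng.random(40),
        "color": rng.random(40),
    }
)

# s takes the marker areas, c names the column that drives the colours.
ax = df.plot.scatter(
    x="x",
    y="y",
    s=df["size"] * 1000,
    c="color",
    colormap="viridis",
    alpha=0.6,  # some transparency so overlapping bubbles stay visible
)
ax.set_title("Bubble chart from a DataFrame")
# plt.show()  # uncomment to open the chart in a window
```

Calling plt.show() afterwards displays the figure, and ax.figure.savefig("bubbles.png") writes it to a file instead.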

Bubbles is still a prototype, and the next iteration will take a slightly different approach. mETL is a nice Python ETL framework. The main difference is that mETL streams the data and its operations work at the row/record level. In mETL you need to fetch the data to apply the operation in Python, whereas in Bubbles the operation is executed in the source system if possible. For example, "keep all records from year 2014" will be executed differently depending on the input. If the input is a SQL statement, then WHERE year = 2014 will be composed; if the input is an iterator (or any unknown object that has no such operation implemented), then a Python iterator filter will be applied and the data will be streamed.

Bubbles is a framework for ETL (Extract, Transform and Load) written in Python. It uses metadata to describe the data processing pipeline instead of a script-based description.
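To make the idea of dispatching on the input more tangible, here is a small sketch of the "keep all records from year 2014" example. It does not use the Bubbles API. It relies on the standard library's functools.singledispatch as a stand-in for dynamic operation dispatch, and the names SQLStatement, compile and filter_by_value are hypothetical, invented only for this illustration.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class SQLStatement:
    """Hypothetical data object that describes a query instead of holding rows."""

    table: str
    conditions: tuple = ()  # (column, value) pairs, combined with AND

    def compile(self):
        """Return the SQL text and the list of bound parameters."""
        sql = f"SELECT * FROM {self.table}"
        if self.conditions:
            clauses = " AND ".join(f"{column} = ?" for column, _ in self.conditions)
            sql += " WHERE " + clauses
        return sql, [value for _, value in self.conditions]


@singledispatch
def filter_by_value(source, field, value):
    """Fallback for iterators and unknown objects: filter in Python, streaming."""
    return (record for record in source if record.get(field) == value)


@filter_by_value.register(SQLStatement)
def _(source, field, value):
    """SQL input: compose a WHERE condition so the database does the filtering."""
    return SQLStatement(source.table, source.conditions + ((field, value),))


records = [{"id": 1, "year": 2013}, {"id": 2, "year": 2014}]

streamed = list(filter_by_value(iter(records), "year", 2014))
composed = filter_by_value(SQLStatement("records"), "year", 2014).compile()

print(streamed)
print(composed)
```

The caller writes the same filter_by_value(...) call in both cases. Only the type of the input decides whether a WHERE clause is composed or the rows are streamed through a Python generator.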

There are many types of visualizations. Some of the most famous are: line plot, scatter plot, histogram, box plot, bar chart, and pie chart. But among so many options, how do we choose the right visualization? First, we need to do some exploratory data analysis. After we know the shape of the data, the data types, and other useful statistical information, it will be easier to pick the right visualization type. By the way, when I use the words “plot”, “chart”, and “visualization”, I mean the same thing. I also found an image with chart suggestions that can be useful here.

To be more concrete, take simple filtering as an example. Say we have a sample of tweets stored in a SQL database, in MongoDB, and obviously on Twitter. We want to get all tweets by OKFN. In SQL we use a SQL driver, connect to the database and do:

SELECT * FROM tweets WHERE screen_name = 'okfn'
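As a rough sketch of what "use a SQL driver, connect to the database and do" looks like in Python, the snippet below runs that query with the standard library's sqlite3 module. It assumes the table is called tweets (matching the MongoDB collection name used in this article) and that it has screen_name and text columns. The in-memory database and the sample rows are made up for the example.

```python
import sqlite3

# In-memory database so the sketch is self-contained; a real setup would
# connect to an existing database instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (screen_name TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO tweets (screen_name, text) VALUES (?, ?)",
    [
        ("okfn", "Open data!"),
        ("someone_else", "Hello"),
        ("okfn", "Open knowledge"),
    ],
)

# Parameterized query: the driver binds the value instead of string formatting.
rows = conn.execute(
    "SELECT * FROM tweets WHERE screen_name = ?", ("okfn",)
).fetchall()
print(rows)

conn.close()
```

Here the filtering happens inside the database engine, so only the matching rows are handed back to Python.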

In MongoDB we use a MongoDB driver, connect to the database and do:

db.tweets.find({ screen_name: 'okfn' })
