AECData Python Library Step 3: Visualizations & Statistical Analysis
TL;DR
We go step by step through grouping, outlier removal, and distribution plots with aecdata, building on the data filtered and retrieved in the previous step (see code below).
Done with data retrieval, on with the stats!
Welcome to the third tutorial on using the open-source AECdata library provided by 2050 Materials.
In this tutorial, we’ll learn how to plot visualizations and derive statistics from your data. This guide will cover grouping data by category and location, removing outliers, and calculating median values and quartiles. Plus, we’ll show how to create a distribution plot.
Setting Up Your Environment
Before diving into the statistics and plots, ensure you’ve imported the necessary classes from the aecdata library:
from aecdata import ProductData, ProductStatistics
import pandas as pd
Initializing the ProductStatistics Class
Start by creating an instance of the ProductStatistics class. This class extends the functionality of the ProductData class, allowing for advanced data analysis.
# Assuming you have already fetched data and stored it in a product_data object
stats_obj = ProductStatistics(product_data.dataframe, unit='m2')
Grouping and Filtering Data
One of the powerful features of the ProductStatistics class is its ability to group and filter data efficiently. Here’s how you can do it:
# Define the criteria for grouping
group_by = [
    'country',
    'manufacturing_continent',
    'material_type',
]
# Retrieve all available fields in the output
all_available_fields = stats_obj.get_available_fields()
# Generate the grouped dataframe with the statistics you want
stat_df = stats_obj.get_statistics(
    group_by=group_by, fields=all_available_fields,
    statistical_metrics=['count', 'mean', 'median'], include_estimated_values=False,
    remove_outliers=True, method='IQR', sqrt_tranf=True, min_count=3,
)
print(stat_df)
This code groups the data by country, manufacturing continent, and material type, which is particularly useful for regional analysis and comparisons between different materials.
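To make the grouping step concrete, here is a minimal sketch in plain pandas of what a grouped summary with a minimum group size can look like. It is not the aecdata implementation: the toy data, the `impact` column, and the `grouped_stats` helper are all assumptions made for illustration.

```python
import pandas as pd

# Toy data: column names mirror the group_by list above, values are made up.
toy_df = pd.DataFrame({
    "country": ["FR", "FR", "FR", "DE", "DE", "DE", "DE", "IT"],
    "material_type": ["Ceramic"] * 3 + ["Timber"] * 4 + ["Ceramic"],
    "impact": [1.0, 2.0, 6.0, 0.5, 1.5, 2.5, 3.5, 9.0],  # hypothetical field
})


def grouped_stats(df, group_by, field, min_count=3):
    """Count, mean, and median of `field` per group, dropping small groups."""
    stats = (
        df.groupby(group_by)[field]
        .agg(["count", "mean", "median"])
        .reset_index()
    )
    # Groups with fewer than `min_count` rows are too small to summarise.
    return stats[stats["count"] >= min_count].reset_index(drop=True)


summary = grouped_stats(toy_df, ["country", "material_type"], "impact")
print(summary)
```

In this toy example the single Italian product is dropped because its group has fewer than three rows, which is the kind of effect a minimum-count threshold is meant to have.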
Outliers can skew the results of your data analysis. The ProductStatistics class includes methods to remove them. Note the parameter remove_outliers=True in the get_statistics() call above.
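For readers unfamiliar with IQR-based filtering, the sketch below shows one common way to do it with pandas and NumPy. It is an assumption about what an IQR method with an optional square-root transform might look like, not the library's own code; the `remove_outliers_iqr` helper, its `sqrt_transform` flag, and the 1.5 multiplier are illustrative choices.

```python
import numpy as np
import pandas as pd


def remove_outliers_iqr(values, sqrt_transform=False, k=1.5):
    """Keep values inside the fences [Q1 - k*IQR, Q3 + k*IQR].

    If sqrt_transform is True, the fences are computed on sqrt(values),
    which assumes non-negative data. The returned values are untransformed.
    """
    values = pd.Series(values, dtype=float)
    scaled = np.sqrt(values) if sqrt_transform else values
    q1, q3 = scaled.quantile(0.25), scaled.quantile(0.75)
    iqr = q3 - q1
    keep = scaled.between(q1 - k * iqr, q3 + k * iqr)
    return values[keep]


sample = [1.0, 1.2, 1.1, 0.9, 1.3, 25.0]  # made-up values with one extreme point
cleaned = remove_outliers_iqr(sample)
cleaned_sqrt = remove_outliers_iqr(sample, sqrt_transform=True)
print(cleaned.tolist())
```

A square-root transform compresses large values, so on right-skewed data it tends to flag fewer points as outliers than fences computed on the raw scale.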
Plotting Data Distributions
Visualizations help you understand how your data is distributed. Let’s plot a histogram and a boxplot:
import matplotlib.pyplot as plt
def plot_histogram(df, field):
    bin_count = min(len(df[field].unique()), 50)  # Limit the number of bins to a maximum of 50
    plt.figure(figsize=(10, 6))
    n, bins, patches = plt.hist(df[field], bins=bin_count, color='#2ab0ff', alpha=0.7, rwidth=0.85)
    plt.grid(axis='y', alpha=0.75)
    plt.xlabel('Value', fontsize=15)
    plt.ylabel('Frequency', fontsize=15)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.title(f'Distribution of {field}', fontsize=15)
    plt.axvline(x=df[field].mean(), color='r', linestyle='-', label=f'Mean: {df[field].mean():.4f}')
    plt.axvline(x=df[field].median(), color='m', linestyle='-', label=f'Median: {df[field].median():.4f}')
    plt.legend(loc='upper right')
    plt.show()
# Plot the distribution without removing outliers (set remove_outliers=True to remove them)
filters = {
    'material_type': 'Ceramic',
}
field = 'material_facts.manufacturing'
df = stats_obj.get_field_distribution(
    field=field, filters=filters, include_estimated_values=True,
    remove_outliers=False, method='IQR', sqrt_tranf=True,
)
plot_histogram(df, field)
def plot_boxplot(grouped_data_dict, field, group_by_field):
    plt.figure(figsize=(10, 6))
    # Pass plain lists so values and labels stay aligned in dictionary order.
    # Note: matplotlib 3.9+ renames the `labels` parameter to `tick_labels`.
    plt.boxplot(list(grouped_data_dict.values()), patch_artist=True, labels=list(grouped_data_dict.keys()))
    plt.grid(axis='y', alpha=0.75)
    plt.xlabel(group_by_field, fontsize=15)
    plt.ylabel('Value', fontsize=15)
    plt.xticks(fontsize=12, rotation=45)  # Rotate labels if there are many groups
    plt.yticks(fontsize=12)
    plt.title(f'Distribution of {field} by {group_by_field}', fontsize=15)
    plt.show()
# Plot boxplots
group_by_field = 'product_type'
# Get a dictionary mapping each product type to a series of its values
grouped_data_dict = stats_obj.get_field_distribution_boxplot(
    field=field, group_by_field=group_by_field, filters=None,
    include_estimated_values=True, remove_outliers=True, method='IQR', sqrt_tranf=True,
)
plot_boxplot(grouped_data_dict, field, group_by_field)
These plots provide visual insight into the distribution and variance of the chosen field, both within a material type and across product types.
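If you also want the numbers behind a boxplot, the quartiles per group can be computed directly with pandas. The sketch below uses made-up data and a hypothetical `quartiles_by_group` helper; it does not call aecdata and only illustrates the calculation.

```python
import pandas as pd

# Toy data: product types and values are invented for illustration.
toy_df = pd.DataFrame({
    "product_type": ["Tile", "Tile", "Tile", "Tile", "Brick", "Brick", "Brick"],
    "impact": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0],  # hypothetical field
})


def quartiles_by_group(df, field, group_by_field):
    """Return the first quartile, median, and third quartile of `field` per group."""
    q = df.groupby(group_by_field)[field].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    return q


quartiles = quartiles_by_group(toy_df, "impact", "product_type")
print(quartiles)
```

The box in a matplotlib boxplot spans q1 to q3 with a line at the median, and by default the whiskers extend to the furthest data points within 1.5 times the interquartile range of the box.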
Done, for now!
You’re now set up with aecdata and have used the ProductStatistics class to perform detailed statistical analysis and visualizations.
This tutorial covered grouping data, removing outliers, and visualizing distributions, which are crucial for making informed decisions based on your data.
Stay tuned for our next tutorial, where we’ll go over how to implement aecdata within a data-science environment!
Happy coding!
This library is provided by 2050 Materials, a company dedicated to unlocking the value of data in the construction industry to enable the climate transition.
If you are interested in embedding this data within your workflows, or have a specific problem, reach out to us at api@2050-materials.com