Takeaways
- Feature engineering, also known as feature creation, is the process of constructing new features from existing data to train a machine learning model.
- Feature engineering means building additional features out of existing data which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table which can then be used to train a machine learning model.
- We can group the operations of feature creation into two categories: transformations and aggregations. A transformation acts on a single table by creating new features out of one or more of the existing columns. Aggregations are performed across tables, and use a one-to-many relationship to group observations and then calculate statistics.
- Aggregation operations are not difficult by themselves, but if we have hundreds of variables spread across dozens of tables, this process is not feasible to do by hand.
- Fortunately, featuretools is exactly the solution we are looking for. This open-source Python library will automatically create many features from a set of related tables.
Introduction: Automated Feature Engineering¶
In this notebook, we will look at an exciting development in data science: automated feature engineering. A machine learning model can only learn from the data we give it, and making sure that data is relevant to the task is one of the most crucial steps in the machine learning pipeline (this is made clear in the excellent paper "A Few Useful Things to Know about Machine Learning").
However, manual feature engineering is a tedious task and is limited both by human imagination - there are only so many features we can think to create - and by time - creating new features is time-intensive. Ideally, there would be an objective method to create an array of diverse new candidate features that we can then use for a machine learning task. This process is meant not to replace the data scientist, but to make her job easier by allowing her to supplement domain knowledge with an automated workflow.
In this notebook, we will walk through an implementation of automated feature engineering using Featuretools, an open-source Python library for automatically creating features from relational data (where the data is in structured tables). Although there are now many efforts working to enable automated model selection and hyperparameter tuning, there has been comparatively little work on automating the feature engineering aspect of the pipeline. This library seeks to close that gap, and the general methodology has proven effective both in machine learning competitions with the Data Science Machine and in business use cases.
Dataset¶
To show the basic idea of featuretools we will use an example dataset consisting of three tables:
- clients: information about clients at a credit union
- loans: previous loans taken out by the clients
- payments: payments made/missed on the previous loans
The general problem of feature engineering is taking disparate data, often distributed across multiple tables, and combining it into a single table that can be used for training a machine learning model. Featuretools has the ability to do this for us, creating many new candidate features with minimal effort. These features are combined into a single table that can then be passed on to our model.
First, let's load in the data and look at the problem we are working with.
# Run this if featuretools is not already installed
# !pip install -U featuretools
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np
# featuretools for automated feature engineering
import featuretools as ft
# ignore warnings from pandas
import warnings
warnings.filterwarnings('ignore')
# Read in the data
clients = pd.read_csv('data/clients.csv', parse_dates = ['joined'])
loans = pd.read_csv('data/loans.csv', parse_dates = ['loan_start', 'loan_end'])
payments = pd.read_csv('data/payments.csv', parse_dates = ['payment_date'])
clients.head()
loans.sample(10)
payments.sample(10)
Manual Feature Engineering Examples¶
Let's show a few examples of features we might make by hand. We will keep this relatively simple to avoid doing too much work! First we will focus on a single dataframe before combining them together. In the clients dataframe, we can take the month of the joined column and the natural log of the income column. Later, we will see that these are known in featuretools as transformation feature primitives because they act on columns in a single table.
# Create a month column
clients['join_month'] = clients['joined'].dt.month
# Create a log of income column
clients['log_income'] = np.log(clients['income'])
clients.head()
To incorporate information about the other tables, we use the df.groupby method, followed by a suitable aggregation function, followed by df.merge. For example, let's calculate the average, minimum, and maximum amount of previous loans for each client. In featuretools terms, this would be considered an aggregation feature primitive because we are using multiple tables in a one-to-many relationship to calculate aggregation figures (don't worry, this will be explained shortly!).
# Groupby client id and calculate mean, max, min previous loan size
stats = loans.groupby('client_id')['loan_amount'].agg(['mean', 'max', 'min'])
stats.columns = ['mean_loan_amount', 'max_loan_amount', 'min_loan_amount']
stats.head()
# Merge with the clients dataframe
clients.merge(stats, left_on = 'client_id', right_index=True, how = 'left').head(10)
We could go further and include information about payments in the clients dataframe. To do so, we would have to group payments by the loan_id, merge it with the loans, group the resulting dataframe by the client_id, and then merge it into the clients dataframe. This would allow us to include information about previous payments for each client.
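As a rough sketch of what that two-step aggregation might look like in pandas (the intermediate column names, such as mean_payment_amount and mean_loan_payment, are just illustrative choices, not anything produced by featuretools):
# Sketch: roll payments up to the loan level, then up to the client level
loan_payment_stats = payments.groupby('loan_id')['payment_amount'].agg(['mean'])
loan_payment_stats.columns = ['mean_payment_amount']
# Attach the per-loan payment statistics to the loans dataframe
loans_with_payments = loans.merge(loan_payment_stats, left_on = 'loan_id', right_index = True, how = 'left')
# Roll the per-loan statistics up to the client level
client_payment_stats = loans_with_payments.groupby('client_id')['mean_payment_amount'].agg(['mean'])
client_payment_stats.columns = ['mean_loan_payment']
# Merge into the clients dataframe
clients.merge(client_payment_stats, left_on = 'client_id', right_index = True, how = 'left').head()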
Clearly, this process of manual feature engineering can grow quite tedious with many columns and multiple tables, and I certainly don't want to have to do this process by hand! Luckily, featuretools can automatically perform this entire process and will create more features than we would have ever thought of. Although I love pandas, there is only so much manual data manipulation I'm willing to stand!
Featuretools¶
Now that we know what we are trying to avoid (tedious manual feature engineering), let's figure out how to automate this process. Featuretools operates on an idea known as Deep Feature Synthesis. You can read the original paper here, and although it's quite readable, it's not necessary to understand the details to do automated feature engineering. The concept of Deep Feature Synthesis is to use basic building blocks known as feature primitives (like the transformations and aggregations done above) that can be stacked on top of each other to form new features. The depth of a "deep feature" is equal to the number of stacked primitives.
I threw out some terms there, but don't worry because we'll cover them as we go. Featuretools builds on simple ideas to create a powerful method, and we will build up our understanding in much the same way.
The first part of Featuretools to understand is an entity. This is simply a table, or in pandas, a DataFrame. We corral multiple entities into a single object called an EntitySet. This is just a large data structure composed of many individual entities and the relationships between them.
EntitySet¶
Creating a new EntitySet is pretty simple:
es = ft.EntitySet(id = 'clients')
Entities¶
An entity is simply a table, which is represented in Pandas as a dataframe. Each entity must have a uniquely identifying column, known as an index. For the clients dataframe, this is the client_id because each id only appears once in the clients data. In the loans dataframe, client_id is not an index because each id might appear more than once. The index for this dataframe is instead loan_id.
When we create an entity in featuretools, we have to identify which column of the dataframe is the index. If the data does not have a unique index we can tell featuretools to make an index for the entity by passing in make_index = True and specifying a name for the index. If the data also has a time index, we can pass that in as the time_index parameter.
Featuretools will automatically infer the variable types (numeric, categorical, datetime) of the columns in our data, but we can also pass in specific datatypes to override this behavior. As an example, even though the repaid column in the loans dataframe is represented as an integer, we can tell featuretools that this is a categorical feature since it can only take on two discrete values. This is done using a dictionary with the variables as keys and the feature types as values.
In the code below we create the three entities and add them to the EntitySet. The syntax is relatively straightforward with a few notes: for the payments dataframe we need to make an index, for the loans dataframe we specify that repaid is a categorical variable, and for the payments dataframe we specify that missed is a categorical feature.
# Create an entity from the client dataframe
# This dataframe already has an index and a time index
es = es.entity_from_dataframe(entity_id = 'clients', dataframe = clients,
index = 'client_id', time_index = 'joined')
# Create an entity from the loans dataframe
# This dataframe already has an index and a time index
es = es.entity_from_dataframe(entity_id = 'loans', dataframe = loans,
variable_types = {'repaid': ft.variable_types.Categorical},
index = 'loan_id',
time_index = 'loan_start')
# Create an entity from the payments dataframe
# This does not yet have a unique index
es = es.entity_from_dataframe(entity_id = 'payments',
dataframe = payments,
variable_types = {'missed': ft.variable_types.Categorical},
make_index = True,
index = 'payment_id',
time_index = 'payment_date')
es
All three entities have been successfully added to the EntitySet. We can access any of the entities using Python dictionary syntax.
es['loans']
Featuretools correctly inferred each of the datatypes when we made this entity. We can also see that we overrode the type for the repaid feature, changing it from numeric to categorical.
es['payments']
Relationships¶
After defining the entities (tables) in an EntitySet, we now need to tell featuretools how they are related with a relationship. The most intuitive way to think of relationships is with the parent-to-child analogy: a parent-to-child relationship is one-to-many because for each parent, there can be multiple children. The clients dataframe is therefore the parent of the loans dataframe because while there is only one row for each client in the clients dataframe, each client may have several previous loans covering multiple rows in the loans dataframe. Likewise, the loans dataframe is the parent of the payments dataframe because each loan will have multiple payments.
These relationships are what allow us to group together datapoints using aggregation primitives and then create new features. As an example, we can group all of the previous loans associated with one client and find the average loan amount. We will discuss the features themselves more in a little bit, but for now let's define the relationships.
To define relationships, we need to specify the parent variable and the child variable, which is the column that links the two entities together. In our example, the clients and loans dataframes are linked together by the client_id column. Again, this is a parent-to-child relationship because for each client_id in the parent clients dataframe, there may be multiple entries of the same client_id in the child loans dataframe.
We codify relationships in the language of featuretools by specifying the parent variable and then the child variable. After creating a relationship, we add it to the EntitySet.
# Relationship between clients and previous loans
r_client_previous = ft.Relationship(es['clients']['client_id'],
es['loans']['client_id'])
# Add the relationship to the entity set
es = es.add_relationship(r_client_previous)
The relationship has now been stored in the entity set. The second relationship is between the loans and payments entities. These two are related by the loan_id variable.
# Relationship between previous loans and previous payments
r_payments = ft.Relationship(es['loans']['loan_id'],
es['payments']['loan_id'])
# Add the relationship to the entity set
es = es.add_relationship(r_payments)
es
We now have our entities in an entityset, along with the relationships between them, and can start making new features from all of the tables using stacks of feature primitives to form deep features. First, let's cover feature primitives.
Feature Primitives¶
A feature primitive, at a very high level, is an operation applied to data to create a feature. Feature primitives represent very simple calculations that can be stacked on top of each other to create complex features. Feature primitives fall into two categories:
- Aggregation: a function that groups together child datapoints for each parent and then calculates a statistic such as mean, min, max, or standard deviation. An example is calculating the maximum loan amount for each client. An aggregation works across multiple tables using relationships between tables.
- Transformation: an operation applied to one or more columns in a single table. An example would be extracting the day from dates, or finding the difference between two columns in one table.
Let's take a look at feature primitives in featuretools. We can view the list of primitives:
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100
primitives[primitives['type'] == 'aggregation'].head(10)
primitives[primitives['type'] == 'transform'].head(10)
If featuretools does not have enough primitives for us, we can also make our own.
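For example, a minimal sketch of a custom transform primitive, assuming the make_trans_primitive helper and the Numeric variable type from featuretools (the log_column and LogColumn names are just illustrative choices):
# Sketch of a custom transform primitive (illustrative only, not used below)
from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Numeric
def log_column(column):
    # Natural log of a numeric column
    return np.log(column)
# LogColumn could then be passed to ft.dfs via trans_primitives
LogColumn = make_trans_primitive(function = log_column,
                                 input_types = [Numeric],
                                 return_type = Numeric)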
To get an idea of what a feature primitive actually does, let's try out a few on our data. Using primitives is surprisingly easy with the ft.dfs function (which stands for deep feature synthesis). In this function, we specify the entityset to use; the target_entity, which is the dataframe we want to make the features for (where the features end up); the agg_primitives, which are the aggregation feature primitives; and the trans_primitives, which are the transformation primitives to apply.
In the following example, we are using the EntitySet we already created, the target entity is the clients dataframe because we want to make new features about each client, and then we specify a few aggregation and transformation primitives.
# Create new features using specified primitives
features, feature_names = ft.dfs(entityset = es, target_entity = 'clients',
agg_primitives = ['mean', 'max', 'percent_true', 'last'],
trans_primitives = ['years', 'month', 'subtract', 'divide'])
pd.DataFrame(features['MONTH(joined)'].head())
pd.DataFrame(features['MEAN(payments.payment_amount)'].head())
features.head()
Already we can see how useful featuretools is: it performed the same operations we did manually, plus many more. Examining the names of the features in the dataframe brings us to the final piece of the puzzle: deep features.
Deep Feature Synthesis¶
While feature primitives are useful by themselves, the main benefit of using featuretools arises when we stack primitives to get deep features. The depth of a feature is simply the number of primitives required to make a feature. So, a feature that relies on a single aggregation would be a deep feature with a depth of 1, a feature that stacks two primitives would have a depth of 2, and so on. The idea itself is a lot simpler than the name "deep feature synthesis" implies. (I think the authors were trying to ride the wave of deep neural network hype when they named the method!) To read more about deep feature synthesis, check out the documentation or the original paper by Max Kanter et al.
Already in the dataframe we made by specifying the primitives manually we can see the idea of feature depth. For instance, the MEAN(loans.loan_amount) feature has a depth of 1 because it is made by applying a single aggregation primitive. This feature represents the average size of a client's previous loans.
# Show a feature with a depth of 1
pd.DataFrame(features['MEAN(loans.loan_amount)'].head(10))
As we scroll through the features, we see a number of features with a depth of 2. For example, the LAST(loans.MEAN(payments.payment_amount)) feature has depth = 2 because it is made by stacking two aggregation primitives: first the MEAN of the payments for each loan, and then the LAST of those values across each client's loans. This feature represents the average payment amount on the last (most recent) loan for each client.
# Show a feature with a depth of 2
pd.DataFrame(features['LAST(loans.MEAN(payments.payment_amount))'].head(10))
We can create features of arbitrary depth by stacking more primitives. However, when I have used featuretools I've never gone beyond a depth of 2! After this point, the features become very convoluted to understand. I'd encourage anyone interested to experiment with increasing the depth (maybe for a real problem) and see if there is value to "going deeper".
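As a sketch of what that experiment might look like (left commented out since deeper runs can take a while; the primitive choices here are arbitrary illustrations):
# Uncomment to experiment with a depth of 3
# deeper_features, deeper_feature_names = ft.dfs(entityset = es, target_entity = 'clients',
#                                                agg_primitives = ['mean', 'last'],
#                                                trans_primitives = ['month'],
#                                                max_depth = 3)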
Automated Deep Feature Synthesis¶
In addition to manually specifying aggregation and transformation feature primitives, we can let featuretools automatically generate many new features. We do this by making the same ft.dfs function call, but without passing in any primitives. We just set the max_depth parameter and featuretools will automatically try many combinations of feature primitives up to the specified depth.
When running on large datasets, this process can take quite a while, but for our example data, it will be relatively quick. For this call, we only need to specify the entityset, the target_entity (which will again be clients), and the max_depth.
# Perform deep feature synthesis without specifying primitives
features, feature_names = ft.dfs(entityset=es, target_entity='clients',
max_depth = 2)
features.iloc[:, 4:].head()
Deep feature synthesis has created 90 new features out of the existing data! While we could have created all of these manually, I am glad to not have to write all that code by hand. The primary benefit of featuretools is that it creates features without any subjective human biases. Even a human with considerable domain knowledge will be limited by their imagination when making new features (not to mention time). Automated feature engineering is not limited by these factors (instead it's limited by computation time) and provides a good starting point for feature creation. This process likely will not remove the human contribution to feature engineering completely because a human can still use domain knowledge and machine learning expertise to select the most important features or build new features from those suggested by automated deep feature synthesis.
Next Steps¶
While automatic feature engineering solves one problem, it provides us with another problem: too many features! Although it's difficult to say which features will be important to a given machine learning task ahead of time, it's likely that not all of the features made by featuretools add value. In fact, having too many features is a significant issue in machine learning because it makes training a model much harder. The irrelevant features can drown out the important features, leaving a model unable to learn how to map the features to the target.
This problem is known as the "curse of dimensionality" and is addressed through the process of feature reduction and selection, which means removing low-value features from the data. Defining which features are useful is an important problem where a data scientist can still add considerable value to the feature engineering task. Feature reduction will have to be another topic for another day!
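That said, as a tiny sketch of one common heuristic (dropping one of each pair of highly correlated features; the 0.95 threshold is an arbitrary illustrative choice), a first pass might look like:
# Sketch: drop one of each pair of highly correlated numeric features (commented out; threshold is arbitrary)
# corr_matrix = features.corr().abs()
# upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))
# to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# reduced_features = features.drop(columns = to_drop)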
Conclusions¶
In this notebook, we saw how to apply automated feature engineering to an example dataset. This is a powerful method which allows us to overcome the human limits of time and imagination to create many new features from multiple tables of data. Featuretools is built on the idea of deep feature synthesis, which means stacking multiple simple feature primitives - aggregations and transformations - to create new features. Feature engineering allows us to combine information across many tables into a single dataframe that we can then use for machine learning model training. Finally, the next step after creating all of these features is figuring out which ones are important.
Featuretools is currently the only Python option for this process, but with the recent emphasis on automating aspects of the machine learning pipeline, other competitors will probably enter the sphere. While the exact tools will change, the idea of automatically creating new features out of existing data will grow in importance. Staying up-to-date on methods such as automated feature engineering is crucial in the rapidly changing field of data science. Now go out there and find a problem on which to apply featuretools!
For more information, check out the documentation for featuretools. Also, read about how featuretools is used in the real world by Feature Labs, the company behind the open-source library.