
The easy way.
Concatenating DataFrames is an expensive operation, especially in terms of processing time. Imagine having 12 pandas DataFrames of varying sizes that you want to concatenate along the column axis, as shown in the following box.
df1 Shape: (24588, 31201)
df2 Shape: (24588, 1673)
df3 Shape: (24588, 5)
df4 Shape: (24588, 1)
df5 Shape: (24588, 148)
df6 Shape: (24588, 1)
df7 Shape: (24588, 6)
df8 Shape: (24588, 1)
df9 Shape: (24588, 1)
df10 Shape: (24588, 1)
df11 Shape: (24588, 1)
df12 Shape: (24588, 19)
To speed up your pd.concat(), there are two things you need to remember.
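For every DataFrame, always run df = df.reset_index(drop=True). Keep in mind that the concatenation command aligns rows on the index; without a consistent index across all DataFrames, you will get misaligned results. This assumes the rows of every DataFrame are already in the same order.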
Always try to concatenate a list of DataFrames. Concatenating a list in a single call is faster than concatenating the DataFrames one at a time, i.e., df_concat = pd.concat([df1, df2, …], axis=1)
df_concat = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12], axis=1)
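To see why the index matters, here is a minimal sketch of both tips on two tiny frames. The frames left and right and the helper concat_columns are hypothetical names made up for this illustration, and the sketch assumes the rows of each frame are already in matching order.

```python
import pandas as pd


def concat_columns(frames):
    """Reset each frame's index, then concatenate column-wise in one call.

    Hypothetical helper for illustration; assumes all frames have the same
    number of rows and that the rows are already in matching order.
    """
    aligned = [df.reset_index(drop=True) for df in frames]
    return pd.concat(aligned, axis=1)


# Two small made-up frames: same row count, different index labels.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [10, 20, 30]}, index=[5, 6, 7])

# Without resetting, concat takes the union of both indexes and fills with NaN.
misaligned = pd.concat([left, right], axis=1)

# With the indexes reset, rows line up by position.
aligned = concat_columns([left, right])

print(misaligned.shape)  # (6, 2)
print(aligned.shape)     # (3, 2)
```

With the 12 DataFrames above, the same helper would be called as concat_columns([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12]).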
That’s all you need to know :)
Dr. Ori Cohen has a Ph.D. in Computer Science with a focus on machine learning. He is a Senior Director of Data and the author of the ML & DL Compendium and StateOfMLOps.com.