Recently a colleague of mine was struggling to remove duplicate records, identified by a composite key, from a large CSV file before inserting it into an Oracle database. I wrote a short Python program to address the problem.
Since Python was already installed on the server, I could install pandas
in a virtualenv using pip. The program is pretty short. First we import pandas, a software library for data manipulation and analysis.
import pandas as pd
Then we use read_csv()
to read the CSV file into a DataFrame, a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes. Since our data contains the literal value NA, we need to set keep_default_na
to False so that it is not converted into a missing value.
d = pd.read_csv('LARGE_CSV_FILE.csv', keep_default_na = False)
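To see what this flag does, here is a quick sketch with a tiny in-memory CSV (the data is made up purely for illustration):

import io
import pandas as pd

csv_text = 'KEY,VALUE\nNA,1\nNA,1\n'  # here 'NA' is a real key value, not a missing value

# Default behaviour: the string 'NA' is parsed as NaN
print(pd.read_csv(io.StringIO(csv_text)))

# With keep_default_na=False, 'NA' is kept as a literal string
print(pd.read_csv(io.StringIO(csv_text), keep_default_na=False))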
Once we have the DataFrame, we can call drop_duplicates()
to remove duplicate rows. Since we identify duplicates by a composite key, we pass the key columns to subset
. Setting inplace
to True drops the duplicates in place instead of returning a copy, and keep = 'first' keeps only the first occurrence of each duplicated record.
d.drop_duplicates(subset = ['COMPOSITE_KEY1', 'COMPOSITE_KEY2', 'COMPOSITE_KEY3',
                            'COMPOSITE_KEY4', 'COMPOSITE_KEY5', 'COMPOSITE_KEY6',
                            'COMPOSITE_KEY7', 'COMPOSITE_KEY8', 'COMPOSITE_KEY9',
                            'COMPOSITE_KEY10'],
                  inplace = True, keep = 'first')
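As a side note, here is a tiny sketch (toy data, not the real file) of how the keep parameter behaves:

import pandas as pd

toy = pd.DataFrame({'KEY': ['a', 'a', 'b'], 'VALUE': [1, 2, 3]})

# keep = 'first' keeps the first row of each duplicated key (rows 0 and 2 survive)
print(toy.drop_duplicates(subset = ['KEY'], keep = 'first'))

# keep = False drops every row whose key is duplicated (only row 2 survives)
print(toy.drop_duplicates(subset = ['KEY'], keep = False))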
Finally, we save the result to another CSV file for verification. By default, to_csv() writes the DataFrame index as an extra column; we can disable it by setting index
to False.
d.to_csv('LARGE_CSV_FILE_PROCESSED.csv', index = False)
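Putting it all together, the whole program looks roughly like this; the before/after row count print is my own addition as a quick sanity check and was not part of the original script:

import pandas as pd

# Read the CSV file, keeping literal 'NA' values as strings
d = pd.read_csv('LARGE_CSV_FILE.csv', keep_default_na = False)

before = len(d)

# Drop duplicate rows based on the composite key, keeping the first occurrence
d.drop_duplicates(subset = ['COMPOSITE_KEY1', 'COMPOSITE_KEY2', 'COMPOSITE_KEY3',
                            'COMPOSITE_KEY4', 'COMPOSITE_KEY5', 'COMPOSITE_KEY6',
                            'COMPOSITE_KEY7', 'COMPOSITE_KEY8', 'COMPOSITE_KEY9',
                            'COMPOSITE_KEY10'],
                  inplace = True, keep = 'first')

# Quick sanity check: how many duplicate rows were removed
print('Removed', before - len(d), 'duplicate rows')

# Save the de-duplicated data without the index column
d.to_csv('LARGE_CSV_FILE_PROCESSED.csv', index = False)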