Sunday, 29 December 2019

Removing duplicate records in a CSV file

Recently a colleague of mine was struggling to remove duplicate records, matched on a composite key, from a large CSV file before inserting it into an Oracle database. I wrote a simple Python program to address the problem.

Since Python was already installed on the server, I could install pandas into a virtualenv using pip. The program is pretty short. First, we import pandas, a software library for data manipulation and analysis.

import pandas as pd

Then, we use read_csv() to read the CSV file into a DataFrame, which is a 2-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes. Since our data contains the literal value "NA", we need to set keep_default_na to False so that pandas does not interpret it as a missing value.

d = pd.read_csv('LARGE_CSV_FILE.csv', keep_default_na = False)
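To see why keep_default_na matters, here is a minimal sketch using a hypothetical two-row sample (the column names and values are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical sample: one row whose CODE column is the literal text "NA".
sample = "CODE,VALUE\nNA,1\nUS,2\n"

# Default behaviour: pandas parses the string "NA" as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(sample))
print(default_df['CODE'].isna().any())   # True: "NA" became NaN

# With keep_default_na=False, "NA" survives as plain text.
kept_df = pd.read_csv(io.StringIO(sample), keep_default_na = False)
print(kept_df['CODE'].tolist())          # ['NA', 'US']
```

Without this flag, a legitimate "NA" value would round-trip back to the output file as an empty field.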

Once we have the DataFrame, we can call drop_duplicates() to remove duplicate rows. Since we remove them based on a composite key, we pass the key columns to subset. Setting inplace to True drops the duplicates in place instead of returning a copy. If duplicate records are found, we keep only the first one.

d.drop_duplicates(subset = ['COMPOSITE_KEY1', 'COMPOSITE_KEY2', 'COMPOSITE_KEY3',
                            'COMPOSITE_KEY4', 'COMPOSITE_KEY5', 'COMPOSITE_KEY6',
                            'COMPOSITE_KEY7', 'COMPOSITE_KEY8', 'COMPOSITE_KEY9',
                            'COMPOSITE_KEY10'], inplace = True, keep = 'first')
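A small self-contained sketch of the same idea, using a hypothetical two-column composite key (K1, K2) instead of the real ten-column one:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 share the same (K1, K2) pair,
# so they are duplicates under the composite key, even though VAL differs.
df = pd.DataFrame({
    'K1':  ['a', 'a', 'b'],
    'K2':  [1, 1, 2],
    'VAL': [10, 20, 30],
})

# keep='first' retains row 0 and drops row 1; row 2 is untouched.
df.drop_duplicates(subset = ['K1', 'K2'], keep = 'first', inplace = True)
print(df['VAL'].tolist())  # [10, 30]
```

Note that uniqueness is judged on the subset columns only; differences in the remaining columns do not prevent a row from being dropped.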

Finally, we save the result to another CSV file for verification. By default, to_csv() writes the DataFrame index as an extra leading column; we can disable that by setting index to False.

d.to_csv('LARGE_CSV_FILE_PROCESSED.csv', index = False)
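To show what index=False actually changes, here is a minimal sketch that writes a tiny made-up frame to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2]})

# With index=False, only the data columns are written.
buf = io.StringIO()
df.to_csv(buf, index = False)
print(buf.getvalue())   # "A\n1\n2\n"

# With the default index=True, each row is prefixed with its row label.
buf2 = io.StringIO()
df.to_csv(buf2)
print(buf2.getvalue())  # ",A\n0,1\n1,2\n"
```

Leaving the index on would add a spurious column that the Oracle load step does not expect.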

