Sunday 29 December 2019

Removing duplicate records in a CSV file

Recently my colleague was struggling on removing duplicate records based on composite keys in a large csv file before inserting into Oracle database. I wrote a simple python program to address the problem. Given that python has already been installed in the server. I could install ``pandas`` in virtualenv using pip. The program is pretty short. First we need to import pandas. Pandas is a software library for data manipulation and analysis. ```python import pandas as pd ``` Then, we use ``read_csv()`` to read a csv file into DataFrame, which is a 2d size-mutatable, potentially heterogeneous tabular data structure with labeled axes. Since our data contains a value of NA, so we need to set `keep_default_na` to False. ```python d = pd.read_csv('LARGE_CSV_FILE.csv', keep_default_na = False) ``` Once we have the dataframe, we can call ``drop_duplicates()`` to remove duplicate rows. Since we remove them based on composite keys, we can pass those keys to `subset`. Setting `inplace` to True can drop duplicates in place instead of returning a copy. If depulicate records are found, we only keep the first one. ```python d.drop_duplicates(subset = ['COMPOSITE_KEY1', 'COMPOSITE_KEY2', 'COMPOSITE_KEY3', 'COMPOSITE_KEY4', 'COMPOSITE_KEY5', 'COMPOSITE_KEY6', 'COMPOSITE_KEY7', 'COMPOSITE_KEY8', 'COMPOSITE_KEY9', 'COMPOSITE_KEY10'], inplace = True, keep = 'first') ``` At the end, we would like to save the result to another csv file for verification. By default, it comes with index. We can disable it by setting `index` to False. ```python d.to_csv('LARGE_CSV_FILE_PROCESSED.csv', index = False) ```

No comments:

Post a Comment

A Fun Problem - Math

# Problem Statement JATC's math teacher always gives the class some interesting math problems so that they don't get bored. Today t...