Sunday, 29 December 2019
Removing duplicate records in a CSV file
Recently my colleague was struggling on removing duplicate records based on composite keys in a large csv file before inserting into Oracle database. I wrote a simple python program to address the problem.
Given that python has already been installed in the server. I could install ``pandas`` in virtualenv using pip. The program is pretty short. First we need to import pandas. Pandas is a software library for data manipulation and analysis.
```python
import pandas as pd
```
Then, we use ``read_csv()`` to read a csv file into DataFrame, which is a 2d size-mutatable, potentially heterogeneous tabular data structure with labeled axes. Since our data contains a value of NA, so we need to set `keep_default_na` to False.
```python
d = pd.read_csv('LARGE_CSV_FILE.csv', keep_default_na = False)
```
Once we have the dataframe, we can call ``drop_duplicates()`` to remove duplicate rows. Since we remove them based on composite keys, we can pass those keys to `subset`. Setting `inplace` to True can drop duplicates in place instead of returning a copy. If depulicate records are found, we only keep the first one.
```python
d.drop_duplicates(subset = ['COMPOSITE_KEY1', 'COMPOSITE_KEY2', 'COMPOSITE_KEY3', 'COMPOSITE_KEY4', 'COMPOSITE_KEY5', 'COMPOSITE_KEY6', 'COMPOSITE_KEY7', 'COMPOSITE_KEY8', 'COMPOSITE_KEY9', 'COMPOSITE_KEY10'], inplace = True, keep = 'first')
```
At the end, we would like to save the result to another csv file for verification. By default, it comes with index. We can disable it by setting `index` to False.
```python
d.to_csv('LARGE_CSV_FILE_PROCESSED.csv', index = False)
```
Subscribe to:
Post Comments (Atom)
A Fun Problem - Math
# Problem Statement JATC's math teacher always gives the class some interesting math problems so that they don't get bored. Today t...
-
SHA stands for Secure Hashing Algorithm and 2 is just a version number. SHA-2 revises the construction and the big-length of the signature f...
-
Contest Link: [https://www.e-olymp.com/en/contests/19775](https://www.e-olymp.com/en/contests/19775) Full Solution: [https://github.com/...
No comments:
Post a Comment