What is the best way to deal with corrupt data files?

Hi, I have some corruption in the data files from the SD card. Some values are negative, or show up as #NAME?. How do I clean the dataset?


Hi @ttcao,

Thanks for submitting a question on our new forum! Which program or programming language are you comfortable using for this cleaning step? At QuantAQ, we prefer Python and have some Python-based tools available for customers (and others!) to use. Are you familiar with Python?

Yes, I can use Python for data cleaning.

Great! While this example will apply to the raw data files taken directly from the µSD card, it is a generic solution that should work for all tabular data.

To use this approach, you must have pandas installed; pandas is a great library for working with analytical data like that from QuantAQ sensors.
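
If you don't already have it, pandas can be installed with pip:

pip install pandas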

Loading the data

First, you will want to import the library and load your data using the read_csv function available in pandas.

import pandas as pd

# Load your file as a csv
df = pd.read_csv(
    "<your-filepath-here>",
    skiprows=3,
    on_bad_lines='skip',
    encoding='unicode_escape',
    parse_dates=['timestamp_iso'],
)

There are a few key arguments we used above to make loading data that may contain invalid characters a bit easier:

skiprows

We use the skiprows=3 argument to indicate that we want to skip the top three rows of the file, which in this case contain metadata that is not part of the tabular data we are trying to clean.
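
As a quick self-contained illustration (the metadata lines below are made up; your raw files will differ):

import io
import pandas as pd

# A hypothetical raw file: three metadata rows precede the real header row
raw = io.StringIO(
    "device: SN000-001\n"
    "firmware: v1.2.3\n"
    "downloaded: 2023-01-01\n"
    "timestamp_iso,pm25\n"
    "2023-01-01T00:00:00,5.1\n"
)

df = pd.read_csv(raw, skiprows=3)
print(df.columns.tolist())  # ['timestamp_iso', 'pm25']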

on_bad_lines

We set on_bad_lines='skip' to skip malformed lines (for example, rows with too many fields) without raising an exception or warning when they are encountered.
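
For example (a minimal sketch using an in-memory file; note that on_bad_lines requires pandas 1.3 or newer):

import io
import pandas as pd

# The row '3,4,5' has too many fields and would normally raise a ParserError
raw = io.StringIO("a,b\n1,2\n3,4,5\n6,7\n")

df = pd.read_csv(raw, on_bad_lines='skip')
print(len(df))  # 2 -- the malformed row was silently dropped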

encoding

We set encoding='unicode_escape' as a flexible way to read in all data, even when it may not be UTF-8 compliant.
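
To see why this helps, here is a sketch that writes a file containing a byte that is not valid UTF-8 (the byte value is arbitrary):

import pandas as pd

# 0xB5 is not valid UTF-8, so the default encoding would raise UnicodeDecodeError
with open("corrupt_example.csv", "wb") as f:
    f.write(b"timestamp_iso,pm25\n2023-01-01T00:00:00,5.1\xb5\n")

# With encoding='unicode_escape', the file loads; the mangled value ('5.1µ')
# will be coerced to NaN in the cleaning step below
df = pd.read_csv("corrupt_example.csv", encoding='unicode_escape')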

parse_dates

We are setting the parse_dates argument to parse_dates=['timestamp_iso'] to indicate that we want to parse the timestamp_iso column into proper datetime values (rather than plain strings), which will make analysis easier moving forward.
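
A quick way to confirm the parsing worked:

import io
import pandas as pd

raw = io.StringIO("timestamp_iso,pm25\n2023-01-01T00:00:00,5.1\n")

df = pd.read_csv(raw, parse_dates=['timestamp_iso'])
print(df['timestamp_iso'].dtype)  # datetime64[ns]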

Cleaning the data

In order to use our time-series data, we need to convert all of the columns to be numeric if that’s what they should be (they should!). Rather than search through the entire dataframe for bad data, we can simply force all of the columns to be numeric and set non-complying data points to NaNs. To do this, we use the to_numeric method from the pandas library as follows:

# First, set the index to be the timestamp
df.set_index("timestamp_iso", inplace=True)

# Next, force the entire dataframe to be numeric values
df = df.apply(pd.to_numeric, errors='coerce')

It is important that we use the errors='coerce' argument above so that we force all aberrant values to be NaNs.
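
For instance, values like the #NAME? entries from the original question are coerced to NaN:

import pandas as pd

s = pd.Series(['5.1', '#NAME?', '-3'])
print(pd.to_numeric(s, errors='coerce'))
# 0    5.1
# 1    NaN
# 2   -3.0
# dtype: float64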

Finally, if we want to drop all rows where there is a NaN (depends on what you’re trying to do…), we can use the dropna method as follows:

df = df.dropna(how='any')
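
Alternatively, if you only care about missing values in certain columns, you can drop rows based on those columns alone (pm25 here is a hypothetical column name):

# Drop rows only where the (hypothetical) pm25 column is NaN
df = df.dropna(subset=['pm25'])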

At this point, you should have a clean dataframe to use for your analysis.
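
As a quick sanity check:

print(df.dtypes)        # every column should now be numeric
print(df.isna().sum())  # all zeros if you dropped rows with NaNs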

Thank you. That works well for the data cleaning. I just want to comment that some of the data corruption is in the timestamp_iso column, and it leads to an error when loading the data into a dataframe. For these cases, I just open the file and delete those rows manually.
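
(For anyone who hits the same issue: a programmatic alternative is to load the file without parse_dates and coerce the timestamps afterwards. This is a sketch assuming the corruption is limited to unparseable timestamp strings:)

import pandas as pd

# Load without parse_dates so corrupt timestamps don't break the date parsing
df = pd.read_csv("<your-filepath-here>", skiprows=3,
                 on_bad_lines='skip', encoding='unicode_escape')

# Coerce unparseable timestamps to NaT, then drop those rows
df['timestamp_iso'] = pd.to_datetime(df['timestamp_iso'], errors='coerce')
df = df.dropna(subset=['timestamp_iso']).set_index('timestamp_iso')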