Hi, I have some corruption in the data files from the SD card. Some values are negative or show up as #NAME?. How do I clean the dataset?
Hi @ttcao,
Thanks for submitting a question on our new forum! What program or programming language are you familiar with for performing this cleaning step? At QuantAQ, we prefer Python and have some Python-based tools available for customers (and others!) to use. Are you familiar with Python?
Yes, I can use Python for data cleaning.
Great! While this example will apply to the raw data files taken directly from the µSD card, it is a generic solution that should work for all tabular data.
To use this approach, you must have pandas installed; pandas is a great library for working with analytical data like that from QuantAQ sensors.
Loading the data
First, you will want to import the library and load your data using the read_csv function available in pandas.
import pandas as pd
# Load your file as a csv
df = pd.read_csv("<your-filepath-here>", skiprows=3,
                 on_bad_lines='skip', encoding='unicode_escape',
                 parse_dates=['timestamp_iso'])
There are a few key arguments we used above to make loading data that may contain invalid characters a bit easier:
skiprows
We use the skiprows=3 argument to indicate that we want to skip the top three rows of the file, which in this case contain metadata that is not part of the tabular data we are trying to clean.
on_bad_lines
We set on_bad_lines='skip' to skip the bad lines without raising an exception or warning when they are encountered.
encoding
We set encoding='unicode_escape' as a flexible way to read in all data, even when it may not be UTF-8 compliant.
parse_dates
We set the parse_dates argument to parse_dates=['timestamp_iso'] to indicate that we want to convert the timestamp_iso column to a Python datetime object, which will make analysis easier moving forward; a quick check that the parsing worked is sketched just after this list.
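As that quick check, you can confirm the column's dtype and that the .dt accessor is available (the hour extraction below is only an illustration of what the parsed column enables, not a required cleaning step):

# If parsing succeeded, the column dtype will be datetime64[ns]
print(df['timestamp_iso'].dtype)

# The .dt accessor now works, e.g., to pull out the hour of each sample
print(df['timestamp_iso'].dt.hour)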
Cleaning the data
In order to use our time-series data, we need to convert all of the columns to be numeric if that's what they should be (they should!). Rather than search through the entire dataframe for bad data, we can simply force all of the columns to be numeric and set non-complying data points to NaNs. To do this, we use the to_numeric function from the pandas library as follows:
# First, set the index to be the timestamp
df.set_index("timestamp_iso", inplace=True)
# Next, force the entire dataframe to be numeric values
df = df.apply(pd.to_numeric, errors='coerce')
It is important that we use the errors='coerce' argument above so that we force all aberrant values to be NaNs.
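To see what the coercion does to the kinds of values from the original question, here is a small standalone example (the values are made up for illustration):

import pandas as pd

s = pd.Series(["1.25", "0.98", "#NAME?"])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.25
# 1    0.98
# 2     NaN
# dtype: float64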
Finally, if we want to drop all rows where there is a NaN (depends on what you're trying to do…), we can use the dropna method as follows:
df = df.dropna(how='any')
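If you only care about missing values in certain columns, dropna also accepts a subset argument; a small sketch using a hypothetical column name:

# Only drop rows missing a value in the 'pm25' column ('pm25' is a hypothetical name)
df = df.dropna(subset=['pm25'])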
At this point, you should have a clean dataframe to use for your analysis.
Thank you. That works well for the data cleaning. I just want to comment that some of the data corruption is in the timestamp_iso column, and it leads to an error when loading the data into a dataframe. For those cases, I just open the file and delete the rows manually.
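If you'd rather avoid deleting those rows by hand, they can also be dropped programmatically. A minimal sketch, assuming the corrupted timestamps simply fail to parse as dates: load the file without parse_dates, coerce the column with pd.to_datetime, and drop the rows that come back as NaT:

import pandas as pd

# Load without parse_dates so corrupted timestamps don't raise on read
df = pd.read_csv("<your-filepath-here>", skiprows=3,
                 on_bad_lines='skip', encoding='unicode_escape')

# Coerce the timestamp column; any value that can't be parsed becomes NaT
df['timestamp_iso'] = pd.to_datetime(df['timestamp_iso'], errors='coerce')

# Drop the rows whose timestamps could not be parsed
df = df.dropna(subset=['timestamp_iso'])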