Hi, I have some corruption in the data files from the SD card. Some values are negative or show up as #NAME?. How do I clean the dataset?
Thanks for submitting a question on our new forum! What program or programming language are you comfortable using for this cleaning step? At QuantAQ, we prefer Python and have some Python-based tools available for customers (and others!) to use. Are you familiar with Python?
Yes, I can use Python for data cleaning.
Great! While this example will apply to the raw data files taken directly from the µSD card, it is a generic solution that should work for all tabular data.
To use this approach, you must have pandas installed; pandas is a great library for working with analytical data like that from QuantAQ sensors. First, you will want to import the library and load your data using the `read_csv` function available in pandas:
```python
import pandas as pd

# Load your file as a csv
df = pd.read_csv(
    "<your-filepath-here>",
    skiprows=3,
    on_bad_lines='skip',
    encoding='unicode_escape',
    parse_dates=['timestamp_iso'],
)
```
There are a few key arguments we used above to make loading data that may contain invalid characters a bit easier:

- `skiprows=3` indicates that we want to skip the top three rows of the file, which in this case contain metadata that is not part of the tabular data we are trying to clean.
- `on_bad_lines='skip'` skips malformed lines without raising an exception or warning when they are encountered.
- `encoding='unicode_escape'` is a flexible way to read in all data, even when it may not be utf-8 compliant.
- `parse_dates=['timestamp_iso']` indicates that we want to convert the `timestamp_iso` column to a Python datetime object, which will make analysis easier moving forward.
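To see these arguments in action, here is a minimal, self-contained sketch; the three metadata lines and the `pm1`/`pm25` column names are made up for illustration, so substitute your own file:

```python
import io

import pandas as pd

# Hypothetical file contents: three metadata lines followed by tabular data
raw = (
    "device: sn-000-000\n"
    "firmware: 1.0\n"
    "download: 2023-01-01\n"
    "timestamp_iso,pm1,pm25\n"
    "2023-01-01T00:00:00,1.2,3.4\n"
    "2023-01-01T00:01:00,1.3,3.5\n"
)

df = pd.read_csv(
    io.BytesIO(raw.encode("utf-8")),
    skiprows=3,       # skip the metadata lines
    on_bad_lines='skip',
    encoding='unicode_escape',
    parse_dates=['timestamp_iso'],
)

print(df.shape)                      # (2, 3)
print(df['timestamp_iso'].dtype)     # datetime64[ns]
```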
In order to use our time-series data, we need to convert all of the columns to numeric types if that’s what they should be (they should!). Rather than search through the entire dataframe for bad data, we can simply force all of the columns to be numeric and set non-complying data points to NaNs. To do this, we use the `to_numeric` function from the pandas library as follows:
```python
# First, set the index to be the timestamp
df.set_index("timestamp_iso", inplace=True)

# Next, force the entire dataframe to be numeric values
df = df.apply(pd.to_numeric, errors='coerce')
```
It is important that we use the `errors='coerce'` argument above so that all aberrant values are forced to NaNs rather than raising an exception.
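As a quick illustration of what `errors='coerce'` does, here is a toy column containing a #NAME? artifact like the one you described (the values themselves are made up):

```python
import pandas as pd

# A column as it might appear after SD-card corruption (made-up values)
s = pd.Series(["12.5", "-3.0", "#NAME?", "7.1"])

# Anything that cannot be parsed as a number becomes NaN
cleaned = pd.to_numeric(s, errors='coerce')
print(cleaned.tolist())  # [12.5, -3.0, nan, 7.1]
```

Note that a negative reading like `-3.0` is still a valid number, so `to_numeric` keeps it; if negative values are physically impossible for a given column, you would need to filter those separately.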
Finally, if we want to drop all rows that contain a NaN (whether you should depends on what you’re trying to do…), we can use the `dropna` method as follows:
```python
df = df.dropna(how='any')
```
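The `how` argument controls how aggressive the drop is; a small sketch with a toy frame (made-up values) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy frame with partially corrupt rows (made-up values)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

print(len(toy.dropna(how='any')))  # 1 -- drops every row containing any NaN
print(len(toy.dropna(how='all')))  # 3 -- only drops rows that are entirely NaN
```

If you only care about NaNs in specific columns, `dropna(subset=[...])` restricts the check to those columns.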
At this point, you should have a clean dataframe to use for your analysis.
Thank you, that works well for the data cleaning. I just want to comment that some of the data corruption is in the timestamp_iso column, which leads to an error when loading the data into a dataframe. In those cases, I just open the file and delete the rows manually.
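For reference, there is a programmatic alternative to deleting those rows by hand. This is a sketch, assuming the timestamp column is named `timestamp_iso` as above (the file contents here are made up): load the file without `parse_dates` so a corrupt timestamp cannot break the load, then coerce the column with `pd.to_datetime(..., errors='coerce')` and drop the rows where parsing failed.

```python
import io

import pandas as pd

# Hypothetical file with one corrupt timestamp (made-up values)
raw = (
    "timestamp_iso,pm1\n"
    "2023-01-01T00:00:00,1.2\n"
    "###CORRUPT###,1.3\n"
    "2023-01-01T00:02:00,1.4\n"
)

# Load without parse_dates so corrupt timestamps come in as plain strings
df = pd.read_csv(io.StringIO(raw))

# Coerce unparseable timestamps to NaT, then drop those rows
df["timestamp_iso"] = pd.to_datetime(df["timestamp_iso"], errors="coerce")
df = df.dropna(subset=["timestamp_iso"])

print(len(df))  # 2 -- the corrupt row is gone
```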