Data Exploration
- Data Quality Report
The “Occupancy Detection Data Set” was selected. In general, it represents information about detection occupancy of an office room from light, temperature, humidity, CO2 measurements.
Here is the link: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Short metadata description for the Occupancy dataset on UCI:
- This dataset has 20K rows, where each row consists of 6 attributes (which are the mix of multivariate and time-series data) and 1 label (whether room was occupied or not).
- Dataset is divided into three parts: one for training and two for testing. I concatenated all of them, for testing and for training for more informative analysis.
- From my observation, there is no information about missing values in dataset.
- I am assuming, there is no information about duplicates in dataset.
2. Data Exploration
Python Pandas was used for exploring the data.First 5 examples in dataset:
- Attributes presented here: “date”, “Temperature”, “Humidity”, “Light”, “CO2”, “HumidityRatio”, and label “Occupancy”
Dataset have 6 attributes and 1 label.
Number of instances: 20560.
Number of instances and number of attributes align with main information about this dataset on UCI site.
- Use pandas dtypes to get information about data types in dataset.
Datatypes in main information were: “Multivariate, Time-Series”
Datatypes we got: object, float64, int64. In Python date and time types are objects. Multivariate types are float64, int64.
So, attributes types align with main information about this dataset on UCI site.
- There is no information about missing data in the dataset on UCI site. I can get any information about missing data by checking it with the help of pandas library with command isnull().values.any() where data is DataFrame containing data from Occupancy dataset. According to result given by the comman execution, there is no missing data in the Occupancy dataset.
- Trying to find duplicates in data using command duplicated().values.any() == True
As I can see, there is no duplicates in Occupancy dataset. To find outliers in dataset I was using boxplot. Columns Light and CO2 have much more outliers unlike other columns.
At this stage, I am going to analyze count of outliers, and their maximumvalue. It’s can be done with the help of quantile; in statistics and theory of probability, quantiles are cut points, such that a given random value can’t be greater these cut points values with a fixed probability.
So I have next results with exploring quantiles on Occupancy dataset:
Occupancy dataset have 206 records with values, which differ too much from the average.Generated summary stats for the entire dataset.
3.Data Transformation
Transform data in column which contains date time information in several ones to process them easier.
Convert date column to three columns: Month, Week Day and Hour.
Transform data in column which contains date time information in several ones to process them easier.
Convert date column to three columns: Month, Week Day and Hour.
Result dataset with reduced date column and three new columns Month, Week Day and Hour:
Removing outliers from dataset
Filter columns Light and CO2. Leaving only that ones, where the values of the certain column within say -3 and +3 standard deviations from mean.
As I could remember, we have 20560 rows and 7 columns in dataset. After transformation and removing outliers dataset has 20127 rows and 9 columns. Check Light and CO2 columns:
Now I am going to get Occupancy label equal to one columns percent and Occupancy label equal to zero columns percent.
Sampling data
Use Under Sampling scenario. I have about 21% of records with Occupancy label equal to one and about 79% of records which have their Occupancy label equal to zero. Let’s reduce the number of records with Occupancy label equal to zero. Now under-sampling target size is equal to number of records with Occupancy label equal to one (it’s 4376 records). I will select rows with Occupancy label equal to zero randomly by the appropriate percentage until target size reached.
4.Summary
Occupancy dataset was chosen and downloaded from UCI site. This dataset represents the main dependence of the occupancy of a room or office on the following characteristics of a room or office such as date, temperature, humidity and humidity ratio, room lighting level and also the concentration level of carbon dioxide (CO2).
Initially, immediately after downloading, it could be seen, that Occupancy dataset consisted of three parts: one for training and two for testing. The first part for testing had 2664 records, the second part for testing had 9752, and the third part for training had 8143 records. It was decided to combine all parts for a common analysis.
After the first acquaintance with the data the following conclusions are made by me:
a dataset column showing the date and time of the measurement is difficult to process.
It was decided to transform the value in this column to get rid of the object data type and make all the values in the dataset numeric.
Then the graphs, clearly visualizing the data, were built, and it was noticed that there are some values that are too prominent from all data. It was decided to get rid of them in order to avoid noise in the data during it further analysis. These data outliers could be seen on boxplot in the beginning of this report. This graph is in the form of Japanese candles and it reflects the spread of the values of each particular column of the dataset. That points, which are above the top of the candle named outliers, and must be deleted from dataset. As it also could be seen, there are two columns in the dataset, the Light and the CO2, which have outliers.
During all the analysis, python and pandas framework were used.After it analysis, it was decided to apply some transformations on the data. At first, it had to transform date column and extract all special features. “Apply” function of the pandas data frame was used, and date column was transformed into three columns: “Month”, “Week Day”, “Hour”. After that the date column was dropped. It means, that dataset will contain nine columns instead of seven.
I also deleted records, which had abnormal big or small values in Light and CO2 columns as leaving only the ones, where the values of the certain column within -3 and +3 standard deviations from mean.
As result, I had a cleaner data. If we also look on the second boxplot, which represents Light and CO2 columns, we can notice that they had no outliers any more.
In the end sampling was applied. As it could be noticed from statistics, new clean dataset had 21% of records which Occupancy label was equal to one, and 79% of records which Occupancy label was equal to zero. It was decided to use under-sampling to make groups of samples with the equal count of records of Occupancy level one and Occupancy level zero.
And it also was shuffled. As result, we have dataset of about 9K records, and all examples in this dataset are differ from each other, and their common quantity is sufficient for further processing.
In brief, I can summarize the following: after applying all transformations on chosen dataset from UCI we have the following results:
- This dataset has about 9K rows, where each row consists of 8 attributes (all of them contains numeric data) and 1 label (whether room was occupied or not).
- Dataset has the same number of examples have Occupancy label equal to one and Occupancy label equal to zero.
- There are no missing values in dataset.
- There are no duplicates in dataset.