Machine Learning Project

Wednesday, March 3, 2021

Recognizing Handwritten Digits

Recognizing handwritten text is a problem that dates back to the first automatic machines that needed to recognize individual characters in handwritten documents. Think, for example, of the ZIP codes on letters at the post office and the automation needed to recognize these five digits. Accurate recognition of these codes is necessary in order to sort mail automatically and efficiently. Another application that may come to mind is OCR (Optical Character Recognition) software, which must read handwritten text or pages of printed books and turn them into electronic documents in which each character is well defined.

The Digits Dataset

The scikit-learn library provides numerous datasets that are useful for testing many data analysis and prediction problems. For this case, there is a dataset of images called Digits.

This dataset consists of 1,797 images that are 8x8 pixels in size.

Let's start with importing the dataset from scikit-learn:
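A minimal sketch of the import step, assuming the standard `load_digits` loader from `sklearn.datasets` is what the post used (the original code cell is not preserved here):

```python
from sklearn import datasets

# Load the bundled Digits dataset (no download is needed).
digits = datasets.load_digits()

print(digits.images.shape)  # (number of images, rows, columns)
print(digits.target[:10])   # labels of the first ten images
```

The `digits` object also exposes `digits.data`, where each 8x8 image is flattened into a row of 64 values, which is the form the estimator expects.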

The images of the handwritten digits are contained in the digits.images array. Each element of this array is an image represented by an 8x8 matrix of numerical values that correspond to shades of gray, from white, with a value of 0, to black, with a value of 16.

We can visually check the contents of this result using the matplotlib library.
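One possible way to do this check is sketched below. The helper name `show_digit` is made up for illustration, and the reversed grayscale colormap is an assumption chosen so that 0 appears white, as described above.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

def show_digit(index):
    """Draw one 8x8 image with its label as the title and return the figure."""
    fig, ax = plt.subplots(figsize=(2, 2))
    # gray_r maps low values to white and high values to black.
    ax.imshow(digits.images[index], cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Label: {digits.target[index]}")
    ax.axis("off")
    return fig

fig = show_digit(0)
```

In a notebook with `%matplotlib inline` the figure is displayed automatically; in a script, call `plt.show()` afterwards.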

The dataset is a training set consisting of 1,797 images.

The Hypothesis to be tested:

A scientist claims that a model trained on the Digits dataset of the scikit-learn library predicts the digit correctly 95% of the time. Perform data analysis to accept or reject this hypothesis.

An estimator that is useful in this case is sklearn.svm.SVC, which uses the technique of Support Vector Classification (SVC). It tends to work well on small image datasets such as this one.

What is a Support Vector Machine?

The Support Vector Machine, abbreviated as SVM, is one of the simpler algorithms that every machine learning practitioner should have in their arsenal. It is preferred by many because it produces good accuracy with relatively little computation power. SVM can be used for both regression and classification tasks, but it is most widely used for classification.

We define the estimator as follows:

from sklearn import svm

svc = svm.SVC(gamma=0.001, C=100.)

Once we have defined a predictive model, we must train it with a training set, which is a set of data for which we already know the class each element belongs to.

Given the large number of elements contained in the Digits dataset, we should obtain a very effective model, i.e., one that is capable of recognizing handwritten digits with good certainty.

This dataset contains 1,797 elements, so we can use the first 1,750 as a training set and the remaining 47 as a validation set. We can look at a few of these remaining handwritten digits in detail by using the matplotlib library:
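A sketch of how the validation digits could be displayed, assuming the split is a plain slice at index 1750. The 2x3 grid and the choice to show only the first six images are illustrative, not taken from the original post.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

# Images from index 1750 onward form the validation set described above.
validation_images = digits.images[1750:]
validation_labels = digits.target[1750:]

# Show the first six validation digits in a 2x3 grid.
fig, axes = plt.subplots(2, 3, figsize=(6, 4))
for ax, image, label in zip(axes.ravel(), validation_images, validation_labels):
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Label: {label}")
    ax.axis("off")
```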

Training the model:

Now we can train the svc estimator that we have defined earlier on the training data.
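The training and prediction steps might look like the sketch below. It reuses the `svc` settings shown earlier and the 1,750/47 split; the variable names `X_train`, `X_val`, and so on are chosen here for illustration.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
svc = svm.SVC(gamma=0.001, C=100.)

# First 1,750 samples for training, the remaining 47 for validation.
X_train, y_train = digits.data[:1750], digits.target[:1750]
X_val, y_val = digits.data[1750:], digits.target[1750:]

svc.fit(X_train, y_train)
predicted = svc.predict(X_val)

# Fraction of validation digits that were labeled correctly.
accuracy = (predicted == y_val).mean()

print("Predicted:", predicted)
print("Actual:   ", y_val)
print("Accuracy: ", accuracy)
```

Comparing the two printed arrays side by side makes it easy to spot any digit the model got wrong.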

We can see that the svc estimator has learned correctly. It is able to recognize the handwritten digits, correctly interpreting all digits of the validation set.

In the above case, we got 100% accurate predictions, but this may not always be the case.

We will run at least 3 cases, each with a different split between the training and validation sets.

Case-1:

Case-2:

Case-3:

We will take 80% of the data as a training set and 20% as validation.
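A hedged sketch of how such a case could be scored. The helper `evaluate_split` is hypothetical, and it assumes the split is sequential (first part for training, rest for validation), as in the earlier example. Only the 80/20 case is shown, because the ranges used for the other cases are not given in the post.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()

def evaluate_split(train_fraction):
    """Train on the first part of the data and score on the rest."""
    n_train = int(len(digits.data) * train_fraction)
    model = svm.SVC(gamma=0.001, C=100.)
    model.fit(digits.data[:n_train], digits.target[:n_train])
    predicted = model.predict(digits.data[n_train:])
    return accuracy_score(digits.target[n_train:], predicted)

# 80% training, 20% validation, as in the case above.
accuracy = evaluate_split(0.8)
print(f"Accuracy: {accuracy:.2%}")
print("Meets the 95% claim:", accuracy >= 0.95)
```

Passing a different fraction to `evaluate_split` gives the accuracy for another split, which is the comparison needed to test the 95% claim.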

In all of the above cases, we got 96.36% accurate predictions, which is quite a remarkable performance from our model. Since this is above the claimed 95%, the results support the hypothesis.

Tuesday, March 2, 2021

Data Visualization Meteorological data

In this blog, we are going to analyze meteorological data and test a hypothesis based on visualizations.

The null hypothesis H0 is: "Have the apparent temperature and humidity, compared monthly across 10 years of data, increased due to global warming?"

H0 means we need to find out whether the average apparent temperature for a given month, say April, from 2006 to 2016, and the average humidity for the same period, have increased or not. This monthly analysis has to be done for all 12 months over the 10-year period.

So let's start:

## Importing libraries

import numpy as np ## for linear algebra

import pandas as pd ## for data manipulation and visualization

import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

data = pd.read_csv('weatherHistory.csv')

data.head()

data.info()

data.isnull().sum()

As is clear from the output above, our desired features, Apparent Temperature, Humidity, and Formatted Date, have no null values, so there is no need to perform interpolation here.

Our Formatted Date column is of object type, so we will first convert it into DateTime format.

Follow these steps:
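The conversion could be sketched as follows. The two sample rows and their timestamp format are assumptions standing in for weatherHistory.csv. The `Year` and `Month` column names match the ones used later in this post, while `Day` is a name chosen here.

```python
import pandas as pd

# Two made-up rows standing in for weatherHistory.csv (illustrative values only).
sample = pd.DataFrame({
    "Formatted Date": [
        "2006-04-01 12:00:00.000 +0200",
        "2007-01-15 08:00:00.000 +0100",
    ],
    "Humidity": [0.89, 0.75],
})

# Parse the strings and normalize every timestamp to UTC.
sample["Formatted Date"] = pd.to_datetime(sample["Formatted Date"], utc=True)

# Pull the calendar parts out into their own columns.
sample["Year"] = sample["Formatted Date"].dt.year
sample["Month"] = sample["Formatted Date"].dt.month
sample["Day"] = sample["Formatted Date"].dt.day

print(sample.dtypes)
```

Passing `utc=True` avoids a mixed-offset column when the timestamps switch between summer and winter time. Note that a timestamp close to midnight can land on the previous day once it is shifted to UTC.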

We have extracted the year, month, and day from the Formatted Date attribute.

Now that we have cleaned our data, it is time to test the null hypothesis, that is, to check whether Apparent Temperature and Humidity have increased during the last 10 years due to global warming.

To check the hypothesis, we will visualize the yearly variation of the attributes for each month. I am going to use the plotly and cufflinks libraries first to plot bar graphs.

from plotly.offline import iplot

import plotly as py

import plotly.tools as tls

import cufflinks as cf

py.offline.init_notebook_mode(connected=True)

cf.go_offline()

Now we have imported and connected our notebook with Plotly, so let's visualize the attributes to get some insights.

jan = data.loc[data['Month']==1]

jan.iplot(x="Year", y=["Humidity", "Apparent Temperature (C)"], kind="bar")

We can see from this plot that Humidity for January is roughly constant and does not vary with the year, but Apparent Temperature shows variations year by year.

But this plot does not give us an exact idea of how much the Apparent Temperature varies.

So now we will first resample the desired attributes to their monthly mean (average) and then visualize our hypothesis.

data.set_index('Formatted Date', inplace = True)

data = data[["Humidity", "Apparent Temperature (C)"]].resample('MS').mean()

data.head()

As we can see, there is hardly any change in humidity over the past 10 years (2006-2016) for the month of January, whereas apparent temperature increases sharply from 2006 to 2007, decreases from 2007 to 2010, increases again from 2010 to 2014, and decreases from 2014 to 2016.

This plot has given us a clear idea about the variations. So let's do the same plotting for each month as well.
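A sketch of how the per-month plots could be produced from a frame resampled to month starts. The data below is randomly generated as a stand-in for the real resampled frame, and the helper `plot_month` is hypothetical. It uses matplotlib line plots, which may differ from the plotting style used in the original post.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the resampled frame: one row per month start.
index = pd.date_range("2006-01-01", "2016-12-01", freq="MS")
rng = np.random.default_rng(0)
monthly = pd.DataFrame(
    {
        "Humidity": rng.uniform(0.6, 0.9, len(index)),
        "Apparent Temperature (C)": rng.uniform(-5, 25, len(index)),
    },
    index=index,
)

def plot_month(frame, month):
    """Plot both attributes across the years for one calendar month (1-12)."""
    one_month = frame[frame.index.month == month]
    fig, ax = plt.subplots(figsize=(8, 4))
    for column in one_month.columns:
        ax.plot(one_month.index.year, one_month[column], marker="o", label=column)
    ax.set_xlabel("Year")
    ax.set_title(f"Month {month}")
    ax.legend()
    return one_month, fig

january, fig = plot_month(monthly, 1)
```

Calling `plot_month(monthly, m)` for `m` from 1 to 12 yields one figure per month.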

February & March

April & May

June & July

August & September

October & November

And lastly, December

Now we have plotted the variation for each month over the 10-year period.

Analysis: From April to August there is a slight change in Apparent Temperature but nearly no change in humidity over the 10 years (2006-2016). For the months from September to March, there is a large variation in temperature, but again humidity remains unchanged. So our null hypothesis does not hold up well.

Check the Null Hypothesis

We verified our analysis with a t-test as well.
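The post does not say which t-test was used, so the following is only one possible sketch: a Welch two-sample t-test from SciPy comparing early-year and late-year averages for a single month. The numbers are randomly generated placeholders, not values from the dataset, and the year groupings and 0.05 threshold are assumptions.

```python
import numpy as np
from scipy import stats

# Made-up yearly averages for one month (illustrative values, not the real data).
rng = np.random.default_rng(42)
early_years = rng.normal(loc=10.0, scale=2.0, size=5)  # e.g. 2006-2010
late_years = rng.normal(loc=10.0, scale=2.0, size=5)   # e.g. 2012-2016

# Welch's two-sample t-test: do the two groups have different means?
t_stat, p_value = stats.ttest_ind(early_years, late_years, equal_var=False)

alpha = 0.05  # conventional significance level, chosen here as an assumption
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("Means differ significantly:", p_value < alpha)
```

With real data, `early_years` and `late_years` would be filled from the resampled monthly means, and a p-value above `alpha` would indicate no significant difference between the two periods.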

So the conclusion here is that we reject the null hypothesis, i.e., Apparent Temperature and Humidity have not been increasing over the last 10 years (2006-2016) due to global warming.
