Understanding Data Science Project Pipeline — Part 1

Kranti
4 min read · Jul 3, 2020

TL;DR

Data-driven, neural networks, powered by AI: these are some of the buzzwords that have been making the rounds in recent times. With almost every company jumping on the AI/ML bandwagon, it becomes all the more critical to ask: what is it that we (the company) want to achieve? What type of business insights are required? Do we have enough ‘oil’, a.k.a. ‘data’, to understand these focus areas? How is this data gathered? What approaches should we choose to understand this data?

Let’s dive in and systematically find the answers to all of these questions. The various steps involved in the overall process are easiest to understand by solving an actual problem.

Credit Card Fraud Detection — Kaggle Problem

A typical data science project pipeline consists of the following stages: input data, exploratory data analysis (EDA), feature engineering, model building, and evaluation. This first part covers the input data and EDA.

In the given problem, the input data refers to credit card transactions.

Here’s a quick summary of the dataset

  • Two days of transactions made by European cardholders in September 2013
  • Number of transactions: 284,807; fraud [492], genuine [284,315]
  • The column ‘Class’ indicates whether a given transaction is fraud [1] or genuine [0]
  • The columns ‘V1’ to ‘V28’ are the result of a PCA transformation applied to the original (confidential) features
  • Screenshot of the dataset along with its columns (the loading sketch below reproduces this view)
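Before diving into the analysis, it helps to load the data and eyeball it. Here is a minimal pandas sketch; it assumes the Kaggle CSV has been downloaded locally as creditcard.csv (file name per the Kaggle download).

```python
import pandas as pd

# Load the Kaggle credit card fraud dataset
# (assumes the CSV is saved locally as 'creditcard.csv')
df = pd.read_csv("creditcard.csv")

print(df.shape)              # (284807, 31): 30 features + the 'Class' label
print(df.columns.tolist())   # ['Time', 'V1', ..., 'V28', 'Amount', 'Class']
print(df.head())
```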

With that intro, let’s begin our data analysis.

Exploratory Data Analysis (EDA) is the most fundamental building block in the data science pipeline. By doing EDA, we can quickly visualize what is available in the given data, see what kind of insights it can bring, and validate different hypotheses on the go. The following sections discuss the typical methods for understanding the data in depth.

1.1 Check for NULL: Are there any NULL values in the given dataset?

In the given dataset, there are no NULL values for any of the features. If there were NULL values, then depending on the importance of that feature (future sections will discuss how to judge whether a feature is important), data imputation should be applied; otherwise, the feature can be dropped from further analysis.
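A quick way to run this check with pandas (a sketch reusing the df loaded above):

```python
# Missing values per column; every count is 0 for this dataset
print(df.isnull().sum())

# Single boolean answer: does any cell contain a NULL?
print(df.isnull().values.any())
```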

1.2 Check for Class Imbalance: What is the % of Fraud vs Genuine transactions?

Fraudulent transactions make up only 0.17% of the whole dataset. This severe class imbalance has to be accounted for in the next steps of the pipeline.
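The imbalance can be verified directly from the ‘Class’ column (sketch, reusing df):

```python
# Absolute counts and percentages per class
print(df["Class"].value_counts())                      # 0: 284315, 1: 492
print(df["Class"].value_counts(normalize=True) * 100)  # fraud ≈ 0.17%
```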

1.3 Uni-variate Analysis:

  • What is the range of values for ‘Amount’ in fraud and genuine transactions?
Close to 99% of transactions have a value ≤ 2500 (verified in the checks below)
  • Is there any specific time when fraudulent transactions happen?
No pattern observed; fraud and genuine transactions happen throughout the day
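Both observations can be checked with a couple of pandas one-liners (a sketch reusing df; the 2500 threshold comes from the claim above):

```python
# Distribution of 'Amount' split by class
print(df.groupby("Class")["Amount"].describe())

# Fraction of all transactions with Amount <= 2500 (close to 99%)
print((df["Amount"] <= 2500).mean())

# 'Time' is seconds elapsed since the first transaction; bucket it
# into hours and count frauds per bucket to look for time patterns
print(df.loc[df["Class"] == 1, "Time"].floordiv(3600).value_counts().sort_index())
```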

1.4 Bi-Variate Analysis

  • Do any patterns emerge when we use two features together?
No specific pattern emerged from Amount vs Time (see the scatter-plot sketch below)
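One simple way to look for such patterns is an Amount-vs-Time scatter plot colored by class. The matplotlib sketch below (reusing df) overlays the few fraudulent points on top of the genuine ones:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))

# Genuine transactions as a faint background cloud
genuine = df[df["Class"] == 0]
ax.scatter(genuine["Time"], genuine["Amount"], s=2, alpha=0.3, label="Genuine")

# Fraudulent transactions highlighted in red on top
fraud = df[df["Class"] == 1]
ax.scatter(fraud["Time"], fraud["Amount"], s=6, color="red", label="Fraud")

ax.set_xlabel("Time (seconds since first transaction)")
ax.set_ylabel("Amount")
ax.legend()
plt.show()
```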

1.5 Summary

  • Time seems to be an optional feature and can be dropped
  • Amount is not normalized and has to be transformed (e.g., scaled) to be useful in the next steps
  • V1 to V28 are PCA-transformed variables; identify whether any of them can be dropped by examining the variance among them (see the preprocessing sketch below)
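These summary points translate into a few concrete preprocessing steps. The sketch below (assuming scikit-learn is available) drops ‘Time’, scales ‘Amount’ with StandardScaler (RobustScaler is a common alternative for heavy-tailed amounts), and lists the variance of the PCA components as a starting point for deciding which, if any, to drop:

```python
from sklearn.preprocessing import StandardScaler

# Drop 'Time' since no time-based pattern was found
df = df.drop(columns=["Time"])

# Scale 'Amount' so it is on a footing comparable to V1..V28
df["Amount"] = StandardScaler().fit_transform(df[["Amount"]]).ravel()

# Variance of the PCA components; low-variance columns are
# candidates for dropping in later feature-selection steps
pca_cols = [f"V{i}" for i in range(1, 29)]
print(df[pca_cols].var().sort_values(ascending=False))
```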

---

Thanks to @Co-learning Lounge and its mentors @Netali Agrawal, @Ashu Prasad, @Ankit Kumar Bhagat, @Yogesh Kothiya for bringing us together ❤️

Also, thanks to the entire team for taking the first step in solving the Kaggle problem. Let’s learn Data Science together :)

@VINAYKUMAR GANDHAPU, @Srinivas Dasu, @Vamsi Krishna, @JITENDRA KUMAR T, @Aditi Kothiya, @Vidya Sankar, @Rinu Badjatya, @Dilshadbegum Shaik, @Sagar Pahlajani, @Raja Simha, @Poulami Das

#colearninglounge #deeplearning #machinelearning #neuralnetworks #datascience #artificialintelligence #kaggle
