Feature Engineering for AI and Machine Learning

4 min read · Aug 19, 2020

Feature engineering is the process of using domain knowledge to create new features and drop irrelevant ones so that an algorithm can learn more effectively. This article covers the following techniques:

Imputation
Handling Outliers
Drop Outlier with Standard Deviation
Drop with Percentiles
Binning
Log Transform
One-Hot Encoding

Understanding Data Quality

Before choosing feature engineering techniques, we need to explore the data and assess its quality from several angles.

The first step is to install the Pandas Profiling library as shown below.

pip install pandas-profiling[notebook]

Import the required libraries, load the data, and generate the profiling report

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv("titanic.csv")
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

We use the isnull() and sum() functions to count the missing values in each column.

print(df.isnull().sum())
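
To make the counting step concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its values are assumptions for illustration only and are not taken from titanic.csv. It also shows `isnull().mean()`, which gives the fraction of missing values per column instead of the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

missing_counts = toy.isnull().sum()   # number of missing cells per column
missing_share = toy.isnull().mean()   # fraction of missing cells per column

print(missing_counts)
print(missing_share)
```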

Imputation

From df.isnull().sum() we find that the Age column has 177 missing cells. We will try replacing those missing values with the mean of all ages in the Age column using the fillna() function.

new_df = df.copy()
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
new_df['Age'] = new_df['Age'].fillna(df['Age'].mean())
print(new_df.isnull().sum())

As a result, the missing values in the Age column are replaced with the mean, and its missing-value count drops to 0.
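
The effect of mean imputation is easier to see on a tiny series. The sketch below uses made-up ages (an assumption, not data from the article) and also shows median imputation as one possible alternative when a column contains extreme values. Which statistic fits best depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries and one large value
ages = pd.Series([20.0, 30.0, np.nan, 100.0, np.nan], name="Age")

# mean() and median() skip NaN by default
mean_filled = ages.fillna(ages.mean())      # fills with (20 + 30 + 100) / 3 = 50.0
median_filled = ages.fillna(ages.median())  # fills with 30.0, less affected by the 100

print(mean_filled.tolist())
print(median_filled.tolist())
```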

Replacing missing values with the mean can distort the data, however. An alternative is to drop rows or columns whose fraction of missing values reaches a chosen threshold, as in the example below.

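df.isnull().mean()
threshold = 0.5
# Keep only the columns whose fraction of missing values is below the threshold
new_df = df[df.columns[df.isnull().mean() < threshold]]
# Keep only the rows whose fraction of missing values is below the threshold
new_df = new_df.loc[new_df.isnull().mean(axis=1) < threshold]
new_df.isnull().mean()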
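
Here is a small, self-contained sketch of the same threshold idea on made-up data. The `toy` frame and its column names are assumptions for illustration. Column B is 75% missing, so it is dropped. After that, row 1 is half missing, so it is dropped as well.

```python
import numpy as np
import pandas as pd

# Hypothetical data: B is 75% missing, C is 25% missing
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 1.0],
    "C": [1.0, np.nan, 3.0, 4.0],
})
threshold = 0.5

# Keep columns whose missing fraction is below the threshold
kept_cols = toy.loc[:, toy.isnull().mean() < threshold]

# Then keep rows whose missing fraction is below the threshold
kept_rows = kept_cols.loc[kept_cols.isnull().mean(axis=1) < threshold]

print(kept_rows)
```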

Handling Outliers

An outlier is a data point whose value is much higher or much lower than most of the data in a dataset. Here we handle outliers in two ways: by standard deviation and by percentile.

Import the required libraries and visualize the distribution of the data

import seaborn as sns
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(12,8))
sns.boxplot(x=df['Age'], color='lime')
plt.xlabel('Age Featured', fontsize=14)
plt.show()

df['Age'].describe()

From the output, the mean age is about 29.

Drop Outlier with Standard Deviation

This approach drops rows whose Age lies more than a chosen number of standard deviations from the mean.

print(df.shape)

factor = 3
upper_lim = df['Age'].mean() + df['Age'].std() * factor
lower_lim = df['Age'].mean() - df['Age'].std() * factor

drop_outlier1 = df[(df['Age'] < upper_lim)&(df['Age'] > lower_lim)]

print(drop_outlier1.shape)

fig = plt.figure(figsize=(12,8))
sns.boxplot(x=drop_outlier1['Age'], color='lime')
plt.xlabel('Age Featured', fontsize=14)
plt.show()

drop_outlier1['Age'].describe()
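
The same rule can be wrapped in a small reusable function. This is only a sketch: the function name `drop_outliers_std` and the toy data are assumptions for illustration. Note that rows with a missing value in the column are dropped too, because comparisons with NaN evaluate to False.

```python
import pandas as pd

def drop_outliers_std(df, column, factor=3.0):
    """Keep rows whose `column` value lies strictly within mean ± factor * std."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation (ddof=1)
    lower, upper = mean - factor * std, mean + factor * std
    return df[(df[column] > lower) & (df[column] < upper)]

# Hypothetical data: twenty plausible ages and one obviously wrong entry
toy = pd.DataFrame({"Age": [25.0] * 10 + [30.0] * 10 + [500.0]})
cleaned = drop_outliers_std(toy, "Age")

print(toy.shape, cleaned.shape)
```

A single extreme value also inflates the mean and the standard deviation themselves, so on very small samples this rule may fail to flag it.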

Drop with Percentiles

We can also drop rows whose Age is less than or equal to the 0.05 quantile or greater than or equal to the 0.95 quantile.

print(df.shape)
upper_lim = df['Age'].quantile(.95)
lower_lim = df['Age'].quantile(.05)
drop_outlier2 = df[(df['Age'] < upper_lim) & (df['Age'] > lower_lim)]
print(drop_outlier2.shape)

fig = plt.figure(figsize=(12,8))
sns.boxplot(x=drop_outlier2['Age'], color='lime')
plt.xlabel('Age', fontsize=14)
plt.show()

drop_outlier2['Age'].describe()
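
As a sketch of the percentile rule in isolation, the helper below (the name `drop_outliers_quantile` and the toy ages 1 to 100 are assumptions) keeps only the values strictly between the 5th and 95th percentiles, mirroring the strict comparisons used above.

```python
import pandas as pd

def drop_outliers_quantile(df, column, lower_q=0.05, upper_q=0.95):
    """Keep rows whose `column` value lies strictly between the two quantiles."""
    lower = df[column].quantile(lower_q)
    upper = df[column].quantile(upper_q)
    return df[(df[column] > lower) & (df[column] < upper)]

# Hypothetical data: ages 1 through 100
toy = pd.DataFrame({"Age": list(range(1, 101))})
trimmed = drop_outliers_quantile(toy, "Age")

print(trimmed["Age"].min(), trimmed["Age"].max(), len(trimmed))
```

Unlike the standard deviation rule, this always removes roughly the same share of rows from each tail, whether or not those rows are truly anomalous.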

Binning

Binning, or grouping values into predefined ranges, can help reduce overfitting to some extent when training a model, at the cost of losing some detail in the data.

For example, wine prices could be divided into Low, Mid, and High using bins = [0, 20, 40, 100]. Here we apply the same idea to Age.

labels = ['Childhood', 'teens', 'Mature', 'Elderly']
bins = [0., 12., 22., 60., 100.]
# Store the bins in a new Age_bin column so the numeric Age column stays available
drop_outlier2 = drop_outlier2.assign(
    Age_bin=pd.cut(drop_outlier2['Age'], labels=labels, bins=bins, include_lowest=False)
)
drop_outlier2.sample(n=5)
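
To see exactly where the bin edges fall, here is a minimal sketch with a handful of made-up ages (the values are assumptions chosen to sit on and around the edges). By default pd.cut uses right-closed intervals, so an age of exactly 12 lands in the first bin and 13 in the second.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([5, 12, 13, 22, 45, 60, 61], name="Age")
labels = ["Childhood", "teens", "Mature", "Elderly"]
bins = [0, 12, 22, 60, 100]

# Default intervals: (0, 12], (12, 22], (22, 60], (60, 100]
# With include_lowest=False, a value of exactly 0 would fall outside every bin
age_bin = pd.cut(ages, bins=bins, labels=labels)

print(age_bin.tolist())
```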

Log Transform

A log transform applies the mathematical logarithm to the data. It reduces right skew, and after the transform the distribution is typically closer to a normal distribution. It only works on strictly positive values.

# Age must be numeric and strictly positive for np.log
drop_outlier2 = drop_outlier2.assign(log=np.log(drop_outlier2['Age']))
drop_outlier2.sample(n=5)
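
The skew reduction is easier to see on a strongly right-skewed series. The numbers below are made up for illustration and are not from the Titanic data. The sketch compares the sample skewness before and after the transform with Series.skew().

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (all strictly positive)
values = pd.Series([1, 2, 3, 4, 5, 10, 50, 200, 1000], dtype=float)

logged = np.log(values)  # natural logarithm, element-wise

# Skewness should be noticeably smaller after the transform
print(values.skew(), logged.skew())
```

If a column can contain zeros, np.log1p (which computes log(1 + x)) is a common alternative, since np.log(0) is negative infinity.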

One-Hot Encoding

One-hot encoding is a data encoding scheme commonly used in machine learning. It expands a single column into several columns of 0s and 1s, one per category in the original column. Each row gets a 1 in the new column that corresponds to its category and a 0 in all the other new columns.

encoded_columns = pd.get_dummies(drop_outlier2['Age_bin'])
drop_outlier2 = drop_outlier2.join(encoded_columns)
drop_outlier2.sample(n=4)
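
As a small self-contained sketch of the encoding, the example below uses a made-up Age_bin column (the values are assumptions). The prefix and dtype arguments are optional: prefix makes the new column names easier to trace back, and dtype=int produces 0/1 integers, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical categorical column
toy = pd.DataFrame({"Age_bin": ["teens", "Mature", "teens", "Elderly"]})

# One new column per category; exactly one of them is 1 in each row
encoded = pd.get_dummies(toy["Age_bin"], prefix="Age_bin", dtype=int)
result = toy.join(encoded)

print(result)
```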