Haribo ML, AI, MATH, Algorithm

당뇨병 예측[숫자형 데이터 훑어보기]

2021-03-22
Haribo
ML  AI

Numercial Data Explore

discrete

  • AGE_GROUP
  • OLIG_PROTE_CD

countinous

  • HEIGHT
  • WEIGHT
  • WAIST
  • SIGHT_LEFT
  • SIGHT_RIGHT
  • BP_HIGH
  • BP_LWST
  • TOT_CHOLE
  • TRIGLYCERIDE
  • HDL_CHOLE
  • LDL_CHOLE
  • HMG
  • CREATININE
  • SGOT_AST
  • SGPT_ALT
  • GAMMA_GTP

Discrete Feature

AGE_GROUP

결측치 없음
정규분포를 따른다

40대 이상 사람들이 당뇨병걸릴 확률이 높다

g = sns.FacetGrid(df, col='BLDS')
g.map(plt.hist, 'AGE_GROUP', bins=20)
<seaborn.axisgrid.FacetGrid at 0x7fbcfec04dc0>

png

OLIG_PROTE_CD(요단백 수치)

결측치 9274

포아송분포? 지수분포? 멱분포? 절대 정규분포를 따르지 않는다.
OLIG_PROTE_CD는 압도적으로 1이 많으므로 결측치는 전부 1로 채워주도록 한다.

df['OLIG_PROTE_CD'].fillna(1, inplace = True)
plt.figure(figsize=(15,10))
sns.countplot(df['OLIG_PROTE_CD'])
plt.title("요단백 수치",fontsize=15)
plt.show()

png

g = sns.FacetGrid(df[df['OLIG_PROTE_CD'] > 1], col='BLDS')
g.map(plt.hist, 'OLIG_PROTE_CD', bins=20)
<seaborn.axisgrid.FacetGrid at 0x7fbcfd8ccd30>

png

Continous Feature

Height

결측치 없음
정규분포를 따름

plt.figure(figsize=(15,10))
sns.countplot(df['HEIGHT'])
plt.title("HEIGHT",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

WEIGHT

결측치 없음
정규분포를 따름

plt.figure(figsize=(15,10))
sns.countplot(df['WEIGHT'])
plt.title("WEIGHT",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

WAIST

결측치 680개
이상치(999) 70개

허리둘레는 비만도와 관련있으므로, BMI feature를 생성하고 결측치를 채운다
BMI와 WAIST는 상관도가 0.8로 높은 상관성을 가진다.

WAIST_LM 이라는 새로운 column을 만들어 결측치를 쳐리해준다.

print("The average person waist for {:.1f}cm, 99% of people is {}cm or less, while the biggest waist {}cm.".format(df['WAIST'].mean(), df['WAIST'].quantile(0.99), df['WAIST'].max()))
The average person waist for 81.3cm, 99% of people is 105.0cm or less, while the biggest waist 999.0cm.
# 900이상의 허리둘레는 잘못 기입된 값이므로 이상치로 처리한다.
df['WAIST'] = np.where(df['WAIST']>900, np.nan, df['WAIST'])
plt.figure(figsize=(15,10))
sns.distplot(df['WAIST'])
plt.title("WAIST",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

png

create BMI feature

df['BMI'] = df['WEIGHT'] / (df['HEIGHT']/100)**2
sns.jointplot(x="BMI", y="WAIST", data=df)
plt.suptitle("BMI수치와 허리둘레", y=1.02)
plt.show()

png

# BMI와 허리둘레는 높은 상관관계를 가진다.
df[['BMI','WAIST']].corr()
BMI WAIST  
BMI 1.000000 0.802917
WAIST 0.802917 1.000000

Linear Regression으로 WAIST 결측치 및 이상치 채우기

70% 정도의 정확성을 가진 Linear Regression 모델로 WAIST결측치를 채운 새로운 WAIST_LM feature를 생성하고 기존 WAIST는 삭제

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df_copy = df[df['WAIST'].notnull()][['BMI', 'WAIST', 'HEIGHT', 'WEIGHT']]
X_train, X_test, y_train, y_test = train_test_split(df_copy[['BMI', 'HEIGHT', 'WEIGHT']],
                                                    df_copy['WAIST'], random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
(1494975, 3) (498326, 3) (1494975,) (498326,)

#모델 정확도
0.7010780973495436
print(lr.coef_, lr.intercept_)
[1.78121448 0.15046672 0.12376376] 6.352174058556571
# 결측치가 없으면 기존값, 결측치면 precict값으로 채운 새로운 feature WAIST_LM
df['WAIST_pred'] = lr.predict(df[['BMI', 'HEIGHT', 'WEIGHT']])
df['WAIST_LM'] = np.where(df['WAIST_pred']>0, df['WAIST'], df['WAIST_pred'])
# 기존 WAIST, predict로만 구성된 WAIST_pred 삭제
drop_cols = ['WAIST_pred', 'WAIST']
df.drop(drop_cols, axis = 1, inplace = True)

SIGHT

실명한사람(9.9) -> 0
BLDS와의 상관관계가 0이므로 시력 평균치인 1을 결측치 464개에 대입

print("The average person sight_left for {:.1f}, 99% of people is {} or less, while the biggest sight_left {}.".format(df['SIGHT_LEFT'].mean(), df['SIGHT_LEFT'].quantile(0.99), df['SIGHT_LEFT'].max()))
print("The average person sight_right for {:.1f}, 99% of people is {} or less, while the biggest sight_right {}.".format(df['SIGHT_RIGHT'].mean(), df['SIGHT_RIGHT'].quantile(0.99), df['SIGHT_RIGHT'].max()))
The average person sight_left for 1.0, 99% of people is 2.0 or less, while the biggest sight_left 9.9.
The average person sight_right for 1.0, 99% of people is 2.0 or less, while the biggest sight_right 9.9.
df['SIGHT_LEFT'] = np.where(df['SIGHT_LEFT']>2.5, 0, df['SIGHT_LEFT'])
df['SIGHT_RIGHT'] = np.where(df['SIGHT_RIGHT']>2.5, 0, df['SIGHT_RIGHT'])
sns.countplot(x="variable", hue="value", data=pd.melt(df[['SIGHT_LEFT', 'SIGHT_RIGHT']]))
plt.title("SIGHT",fontsize=15)
plt.show()

png

df[(df['SIGHT_LEFT'].isnull()) | (df['SIGHT_RIGHT'].isnull())].shape
(464, 45)
df[['SIGHT_LEFT', 'SIGHT_RIGHT','BLDS']].corr()
SIGHT_LEFT SIGHT_RIGHT BLDS  
SIGHT_LEFT 1.00000 0.711090 -0.081750
SIGHT_RIGHT 0.71109 1.000000 -0.080196
BLDS -0.08175 -0.080196 1.000000
df['SIGHT_LEFT'].fillna(1, inplace = True)
df['SIGHT_RIGHT'].fillna(1, inplace = True)

BP_HIGH

결측치 40개

  • 40개의 결측치는 평균으로 매꾼다.

10자리 단위마다 데이터가 몰려있는것으로 보아 반올림의 문제인듯하다
구간을 20개로 나누어 discrete 형태로 변형시켜 안정적인 정규분포를 따르게 바꿨다.

print("Average BP_HIGH : {:.1f}, 99% of people is {} or less, 1% of people is {} or less, max BP_HIGH : {}, min BP_HIGH {}".format(df['BP_HIGH'].mean(), df['BP_HIGH'].quantile(0.99), df['BP_HIGH'].quantile(0.01), df['BP_HIGH'].max(), df['BP_HIGH'].min()))
Average BP_HIGH : 122.5, 99% of people is 162.0 or less, 1% of people is 91.0 or less, max BP_HIGH : 273.0, min BP_HIGH 58.0
plt.figure(figsize=(15,10))
sns.countplot(df['BP_HIGH'])
plt.title("BP_HIGH",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

df['BP_HIGH_BD'] = pd.cut(df['BP_HIGH'], 20, labels=[x for x in range(20)])
df.drop(['BP_HIGH'], axis = 1, inplace = True)
df.head()
SEX AGE_GROUP HEIGHT WEIGHT SIGHT_LEFT SIGHT_RIGHT HEAR_LEFT HEAR_RIGHT BP_LWST BLDS TOT_CHOLE TRIGLYCERIDE HDL_CHOLE LDL_CHOLE HMG OLIG_PROTE_CD CREATININE SGOT_AST SGPT_ALT GAMMA_GTP SMK_STAT_TYPE_CD DRK_YN HCHK_OE_INSPEC_YN CRS_YN TTR_YN SIDO_50 SIDO_busan SIDO_chongB SIDO_chongN SIDO_dagu SIDO_dajeon SIDO_gangwon SIDO_gyeongB SIDO_gyeongN SIDO_gyeonggi SIDO_incheon SIDO_jeonB SIDO_jeonN SIDO_kwangju SIDO_sejong SIDO_seoul SIDO_ulsan BMI WAIST_LM BP_HIGH_BD  
0 1 8 170 75 1.0 1.0 1.0 1.0 80.0 0.0 193.0 92.0 48.0 126.0 17.1 1.0 1.0 21.0 35.0 40.0 1.0 1.0 1 NaN 1.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25.951557 90.0 5
1 1 7 180 80 0.9 1.2 1.0 1.0 82.0 0.0 228.0 121.0 55.0 148.0 15.8 1.0 0.9 20.0 36.0 27.0 0.0 0.0 1 NaN 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 24.691358 89.0 6
2 1 9 165 75 1.2 1.5 1.0 1.0 70.0 0.0 136.0 104.0 41.0 74.0 15.8 1.0 0.9 47.0 32.0 68.0 1.0 0.0 0 NaN NaN 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 27.548209 91.0 5
3 1 11 175 80 1.5 1.2 1.0 1.0 87.0 0.0 201.0 106.0 76.0 104.0 17.6 1.0 1.1 29.0 34.0 18.0 1.0 0.0 1 NaN 0.0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 26.122449 91.0 8
4 1 11 165 60 1.0 1.2 1.0 1.0 82.0 0.0 199.0 104.0 61.0 117.0 13.8 1.0 0.8 19.0 12.0 25.0 1.0 0.0 1 NaN 0.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 22.038567 80.0 7
plt.figure(figsize=(15,10))
sns.countplot(df['BP_HIGH_BD'])
plt.title("BP_HIGH",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

df['BP_HIGH_BD'].fillna(5, inplace = True)
df = df.astype({'BP_HIGH_BD': 'int'})

BP_LWST

결측치 40개

  • 40개의 결측치는 평균으로 매꾼다.

10자리 단위마다 데이터가 몰려있는것으로 보아 반올림의 문제인듯하다
구간을 20개로 나누어 discrete 형태로 변형시켜 안정적인 정규분포를 따르게 바꿨다.

print("Average BP_LWST : {:.1f}, 99% of people is {} or less, 1% of people is {} or less, max BP_LWST : {}, min BP_LWST {}".format(df['BP_LWST'].mean(), df['BP_LWST'].quantile(0.99), df['BP_LWST'].quantile(0.01), df['BP_LWST'].max(), df['BP_LWST'].min()))
Average BP_LWST : 76.1, 99% of people is 102.0 or less, 1% of people is 56.0 or less, max BP_LWST : 185.0, min BP_LWST 27.0
plt.figure(figsize=(15,10))
sns.countplot(df['BP_LWST'])
plt.title("BP_LWST",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

df['BP_LWST_BD'] = pd.cut(df['BP_LWST'],20, labels=[x for x in range(20)])
df.drop(['BP_LWST'], axis = 1, inplace = True)
df.head()
SEX AGE_GROUP HEIGHT WEIGHT SIGHT_LEFT SIGHT_RIGHT HEAR_LEFT HEAR_RIGHT BLDS TOT_CHOLE TRIGLYCERIDE HDL_CHOLE LDL_CHOLE HMG OLIG_PROTE_CD CREATININE SGOT_AST SGPT_ALT GAMMA_GTP SMK_STAT_TYPE_CD DRK_YN HCHK_OE_INSPEC_YN CRS_YN TTR_YN SIDO_50 SIDO_busan SIDO_chongB SIDO_chongN SIDO_dagu SIDO_dajeon SIDO_gangwon SIDO_gyeongB SIDO_gyeongN SIDO_gyeonggi SIDO_incheon SIDO_jeonB SIDO_jeonN SIDO_kwangju SIDO_sejong SIDO_seoul SIDO_ulsan BMI WAIST_LM BP_HIGH_BD BP_LWST_BD  
0 1 8 170 75 1.0 1.0 1.0 1.0 0.0 193.0 92.0 48.0 126.0 17.1 1.0 1.0 21.0 35.0 40.0 1.0 1.0 1 NaN 1.0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25.951557 90.0 5 6
1 1 7 180 80 0.9 1.2 1.0 1.0 0.0 228.0 121.0 55.0 148.0 15.8 1.0 0.9 20.0 36.0 27.0 0.0 0.0 1 NaN 2.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 24.691358 89.0 6 6
2 1 9 165 75 1.2 1.5 1.0 1.0 0.0 136.0 104.0 41.0 74.0 15.8 1.0 0.9 47.0 32.0 68.0 1.0 0.0 0 NaN NaN 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 27.548209 91.0 5 5
3 1 11 175 80 1.5 1.2 1.0 1.0 0.0 201.0 106.0 76.0 104.0 17.6 1.0 1.1 29.0 34.0 18.0 1.0 0.0 1 NaN 0.0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 26.122449 91.0 8 7
4 1 11 165 60 1.0 1.2 1.0 1.0 0.0 199.0 104.0 61.0 117.0 13.8 1.0 0.8 19.0 12.0 25.0 1.0 0.0 1 NaN 0.0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 22.038567 80.0 7 6
plt.figure(figsize=(15,10))
sns.countplot(df['BP_LWST_BD'])
plt.title("BP_LWST",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

df['BP_LWST_BD'].fillna(6, inplace = True)
df = df.astype({'BP_LWST_BD': 'int'})
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1994043 entries, 0 to 999999
Data columns (total 45 columns):
 #   Column             Dtype  
---  ------             -----  
 0   SEX                int64  
 1   AGE_GROUP          int64  
 2   HEIGHT             int64  
 3   WEIGHT             int64  
 4   SIGHT_LEFT         float64
 5   SIGHT_RIGHT        float64
 6   HEAR_LEFT          float64
 7   HEAR_RIGHT         float64
 8   BLDS               float64
 9   TOT_CHOLE          float64
 10  TRIGLYCERIDE       float64
 11  HDL_CHOLE          float64
 12  LDL_CHOLE          float64
 13  HMG                float64
 14  OLIG_PROTE_CD      float64
 15  CREATININE         float64
 16  SGOT_AST           float64
 17  SGPT_ALT           float64
 18  GAMMA_GTP          float64
 19  SMK_STAT_TYPE_CD   float64
 20  DRK_YN             float64
 21  HCHK_OE_INSPEC_YN  int64  
 22  CRS_YN             float64
 23  TTR_YN             float64
 24  SIDO_50            uint8  
 25  SIDO_busan         uint8  
 26  SIDO_chongB        uint8  
 27  SIDO_chongN        uint8  
 28  SIDO_dagu          uint8  
 29  SIDO_dajeon        uint8  
 30  SIDO_gangwon       uint8  
 31  SIDO_gyeongB       uint8  
 32  SIDO_gyeongN       uint8  
 33  SIDO_gyeonggi      uint8  
 34  SIDO_incheon       uint8  
 35  SIDO_jeonB         uint8  
 36  SIDO_jeonN         uint8  
 37  SIDO_kwangju       uint8  
 38  SIDO_sejong        uint8  
 39  SIDO_seoul         uint8  
 40  SIDO_ulsan         uint8  
 41  BMI                float64
 42  WAIST_LM           float64
 43  BP_HIGH_BD         int64  
 44  BP_LWST_BD         int64  
dtypes: float64(21), int64(7), uint8(17)
memory usage: 513.5 MB

TOT_CHOLE

너무많은 결측치(661336)가 있기 때문에 보류

df['TOT_CHOLE'].isnull().sum()
661336
plt.figure(figsize=(15,10))
sns.countplot(df['TOT_CHOLE'])
plt.title("TOT_CHOLE",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

TRIGLYCERIDE

너무많은 결측치(661346)가 있기 때문에 보류

df['TRIGLYCERIDE'].isnull().sum()
661346
plt.figure(figsize=(15,10))
sns.countplot(df['TRIGLYCERIDE'])
plt.title("TRIGLYCERIDE",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

HDL_CHOLE

너무많은 결측치(661347)가 있기 때문에 보류

df['HDL_CHOLE'].isnull().sum()
661347
plt.figure(figsize=(15,10))
sns.countplot(df['HDL_CHOLE'])
plt.title("HDL_CHOLE",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

LDL_CHOLE

너무많은 결측치(671083)가 있기 때문에 보류

df['LDL_CHOLE'].isnull().sum()
671083
plt.figure(figsize=(15,10))
sns.countplot(df['LDL_CHOLE'])
plt.title("LDL_CHOLE",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

HMG

결측치 26개

  • 결측치는 평균치로 넣어준다

정규분포를 따른다.

df['HMG'].isnull().sum()
26
plt.figure(figsize=(15,10))
sns.countplot(df['HMG'])
plt.title("HMG",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

df['HMG'].fillna(df['HMG'].mean(), inplace = True)

CREATININE

결측치 4개
상위 99% 이상값을 가지는 데이터들을 하나로 묶고 분포도를 출력한 결과 예쁜 정규분포형이 나옴

  • 이상치를 제거해도 될듯함
df[df['CREATININE'] > 80]['CREATININE'].value_counts()
96.0    3
90.0    2
86.0    2
98.0    2
93.0    2
87.0    2
95.0    2
89.0    1
84.0    1
83.0    1
97.0    1
92.0    1
85.0    1
94.0    1
81.0    1
Name: CREATININE, dtype: int64
print("Average CREATININE : {:.1f}, 99% of people is {} or less, 1% of people is {} or less, max CREATININE : {}, min CREATININE {}".format(df['CREATININE'].mean(), df['CREATININE'].quantile(0.99), df['CREATININE'].quantile(0.01), df['CREATININE'].max(), df['CREATININE'].min()))
Average CREATININE : 0.9, 99% of people is 1.4 or less, 1% of people is 0.5 or less, max CREATININE : 98.0, min CREATININE 0.1
plt.figure(figsize=(15,10))
sns.countplot(df['CREATININE'])
plt.title("CREATININE",fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

temp= df.copy()
temp.loc[temp['CREATININE'] > temp['CREATININE'].quantile(0.99)] = '1.5+'
plt.figure(figsize=(15,10))
sns.countplot(temp['CREATININE'].astype('str').sort_values())
plt.title('CREATININE')
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

SGOT_AST

결측치 2개

심하게 오른쪽으로 치우쳐진 분포

feature = 'SGOT_AST'
df[feature].isnull().sum()
1
print("Average {} : {:.1f}, 99% of people is {} or less, 1% of people is {} or less, max {} : {}, min CREATININE {}".format(feature, df[feature].mean(), df[feature].quantile(0.99), df[feature].quantile(0.01), feature, df[feature].max(), feature, df[feature].min()))
Average SGOT_AST : 26.1, 99% of people is 81.0 or less, 1% of people is 12.0 or less, max SGOT_AST : 9999.0, min CREATININE SGOT_AST
plt.figure(figsize=(15,10))
sns.countplot(df[feature])
plt.title(feature,fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

plt.figure(figsize=(15,10))
sns.countplot(df[df[feature] < df[feature].quantile(0.99)][feature])
plt.title(feature,fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

SGPT_ALT

결측치 3개 심하게 오른쪽으로 치우쳐진 분포

feature = 'SGPT_ALT'
df[feature].isnull().sum()
2
print("Average {} : {:.1f}, 99% of people is {} or less, 1% of people is {} or less, max {} : {}, min CREATININE {}".format(feature, df[feature].mean(), df[feature].quantile(0.99), df[feature].quantile(0.01), feature, df[feature].max(), feature, df[feature].min()))
Average SGPT_ALT : 26.0, 99% of people is 107.0 or less, 1% of people is 7.0 or less, max SGPT_ALT : 7210.0, min CREATININE SGPT_ALT
plt.figure(figsize=(15,10))
sns.countplot(df[feature])
plt.title(feature,fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

plt.figure(figsize=(15,10))
sns.countplot(df[df[feature] < df[feature].quantile(0.99)][feature])
plt.title(feature,fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

GAMMA_GTP

결측치 6개 심하게 오른쪽으로 치우쳐진 분포

feature = 'GAMMA_GTP'
df[feature].isnull().sum()
0
print("Average {} : {:.1f}, 99% of people is {} or less, 1% of people is {} or less, max {} : {}, min CREATININE {}".format(feature, df[feature].mean(), df[feature].quantile(0.99), df[feature].quantile(0.01), feature, df[feature].max(), df[feature].min()))
Average GAMMA_GTP : 37.4, 99% of people is 233.0 or less, 1% of people is 8.0 or less, max GAMMA_GTP : 999.0, min CREATININE 1.0
plt.figure(figsize=(15,10))
sns.countplot(df[feature])
plt.title(feature,fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

plt.figure(figsize=(15,10))
sns.countplot(df[df[feature] < 233][feature])
plt.title(feature,fontsize=15)
plt.show()
/Users/minsuha/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

df = df.dropna(subset = ['GAMMA_GTP_BD'], how = 'any', axis=0)
df = df.astype({'GAMMA_GTP_BD': 'int'})

이전 포스트 Logistic Regression

다음 포스트 Shallow NN

Comments

Content