Data Observation – Customer Data
Let's deepen our skills in handling real-world data with the pandas library.
In this challenge, we will explore the information contained in a customer database.
Our goal is to visualize how our data is distributed.
Q1. Download the dataset from the following address, then place it in the ‘Sample Data’ folder (the code below reads it from `sample_data/data.csv`).
Q2. Import the pandas library, which will be used throughout this project.
# Import pandas with the alias pd
import pandas as pd
%whos
Variable   Type         Data/Info
---------------------------------
df         DataFrame    ID Gender Ever<...>n[8068 rows x 11 columns]
pd         module       <module 'pandas' from '/u<...>ages/pandas/__init__.py'>
2. Exploring the dataset
In this part, our goal is to explore the dataset containing the customer information.
Here is the list of pandas functions, methods and attributes used in this part:
- `pd.read_csv('file path')`: imports a dataset from its file path,
- `.head(n)`: displays the first n rows of the dataset,
- `.columns`: displays the list of variables,
- `.shape`: displays the dimensions of the dataset (number of rows and number of columns),
- `.isna().sum()`: displays the number of missing values per column,
- `.info()`: displays general information about the dataset,
- `.describe()`: displays summary statistics of the dataset.
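The calls listed above can be tried on a tiny table before running them on the real file. The sketch below uses a hypothetical three-column DataFrame named `toy` (the values are made up and only the column names are borrowed from the customer dataset), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical miniature of the customer table (values are made up for illustration).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [22, 38, 67, 40],
    "Work_Experience": [1.0, None, 1.0, 0.0],
})

print(toy.head(2))        # first 2 rows
print(list(toy.columns))  # names of the variables
print(toy.shape)          # (number of rows, number of columns)
print(toy.isna().sum())   # number of missing values per column
toy.info()                # dtypes and non-null counts (prints directly)
print(toy.describe())     # summary statistics of the numeric columns
```

Note that `.columns` and `.shape` are attributes (no parentheses), while the others are method calls.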
Q3. Import your customer dataset into a variable named df. Then display the first 10 rows of this dataset.
# Import the dataset into a variable df
df = pd.read_csv('sample_data/data.csv')
# Display the variable df (use df.head(10) to show only the first 10 rows)
df
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
… | … | … | … | … | … | … | … | … | … | … | … |
8063 | 464018 | Male | No | 22 | No | NaN | 0.0 | Low | 7.0 | Cat_1 | D |
8064 | 464685 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | Cat_4 | D |
8065 | 465406 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | Cat_6 | D |
8066 | 467299 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | Cat_6 | B |
8067 | 461879 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | Cat_4 | B |
8068 rows × 11 columns
Q4. Display the list of columns of the dataset.
# Display the list of columns of the dataset df
df.columns
Index(['ID', 'Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession', 'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1', 'Segmentation'], dtype='object')
Q5. Display the dimensions of the dataset as well as the number of missing values.
# TODO - Display the dimensions
df.shape
(8068, 11)
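# Booleans add up like integers (True = 1, False = 0), which is why .isna().sum() counts the missing values
False + False + True + True
2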
# TODO - Display the number of missing values
df.isna().sum()
ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64
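Why does `.isna().sum()` count missing values? `.isna()` returns booleans, and summing booleans treats `True` as 1 and `False` as 0. A minimal sketch on a made-up Series `s` (not taken from the dataset):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0, None])  # made-up values, two of them missing

mask = s.isna()                # False, True, False, True
n_missing = int(mask.sum())    # True counts as 1, False as 0
share_missing = mask.mean()    # the mean of booleans is the fraction of True values

print(n_missing, share_missing)
```

The same idea gives the proportion of missing values per column of a DataFrame with `df.isna().mean()`.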
Q6. Write down your first observations on the data displayed:
- what does the dataset contain, and what are the variables,
- what are its dimensions,
- which variables contain the most missing values?
The dataset contains 8,068 rows and 11 columns. The variable with the most missing values is Work_Experience, with 829 missing values.
[Text area]
Q7. Display the general information, then the summary statistics of the dataset.
# TODO - Display the information of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   ID               8068 non-null   int64
 1   Gender           8068 non-null   object
 2   Ever_Married     7928 non-null   object
 3   Age              8068 non-null   int64
 4   Graduated        7990 non-null   object
 5   Profession       7944 non-null   object
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object
 10  Segmentation     8068 non-null   object
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB
# TODO - Display the summary statistics of the dataset
df.describe()
ID | Age | Work_Experience | Family_Size | |
---|---|---|---|---|
count | 8068.000000 | 8068.000000 | 7239.000000 | 7733.000000 |
mean | 463479.214551 | 43.466906 | 2.641663 | 2.850123 |
std | 2595.381232 | 16.711696 | 3.406763 | 1.531413 |
min | 458982.000000 | 18.000000 | 0.000000 | 1.000000 |
25% | 461240.750000 | 30.000000 | 0.000000 | 2.000000 |
50% | 463472.500000 | 40.000000 | 1.000000 | 3.000000 |
75% | 465744.250000 | 53.000000 | 4.000000 | 4.000000 |
max | 467974.000000 | 89.000000 | 14.000000 | 9.000000 |
Q8. Using a filter, create a subsample of the dataset containing the individuals who are male, married, over 30 years old and engineers by profession. Then display the mean age of these individuals.
condition1 = df.Gender == 'Male'
condition2 = df.Ever_Married == 'Yes'
condition3 = df.Age > 30
condition4 = df.Profession == 'Engineer'
df_male_engineer = df[(condition1) & (condition2) & (condition3) & (condition4)]
df_male_engineer.describe()
ID | Age | Work_Experience | Family_Size | |
---|---|---|---|---|
count | 85.000000 | 85.000000 | 74.000000 | 82.000000 |
mean | 463275.764706 | 47.964706 | 2.175676 | 3.036585 |
std | 2378.399420 | 10.922287 | 3.031154 | 1.551146 |
min | 459065.000000 | 31.000000 | 0.000000 | 1.000000 |
25% | 461530.000000 | 40.000000 | 0.000000 | 2.000000 |
50% | 463094.000000 | 45.000000 | 1.000000 | 3.000000 |
75% | 465193.000000 | 55.000000 | 2.750000 | 4.000000 |
max | 467855.000000 | 72.000000 | 12.000000 | 8.000000 |
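The `describe()` output above contains the mean age in its `mean` row. To obtain that single number directly, one option is to combine the conditions into a boolean mask and select the `Age` column with `.loc`. The sketch below runs on a hypothetical four-row table `toy` (made-up values, same column names as the dataset).

```python
import pandas as pd

# Made-up rows that reuse the column names of the customer dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male"],
    "Ever_Married": ["Yes", "Yes", "Yes", "No"],
    "Age": [45, 28, 52, 61],
    "Profession": ["Engineer", "Engineer", "Engineer", "Lawyer"],
})

# Each comparison yields a boolean Series; & keeps the rows where all are True.
mask = (
    (toy["Gender"] == "Male")
    & (toy["Ever_Married"] == "Yes")
    & (toy["Age"] > 30)
    & (toy["Profession"] == "Engineer")
)

mean_age = toy.loc[mask, "Age"].mean()
print(mean_age)
```

The parentheses around each comparison are required, because `&` binds more tightly than `==` and `>` in Python.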
Q9. Write down your observations in the text area below:
- Which variables are qualitative?
- Which variables are quantitative?
- What is the size (memory usage) of the dataset?
- What can you say about the age of the customers?
- What can you say about family size?
- What can you say about the number of years of work experience?
[Text area]
3. Cleaning the dataset
In this part, we will process the data in our DataFrame by replacing the qualitative data with quantitative data.
This processing will later allow us to use our segmentation and clustering algorithms.
Here is the list of treatments we will apply to our dataset:
- Removal of the rows containing missing values,
- Removal of the useless columns and of the duplicated rows,
- Replacement of the qualitative data with quantitative data,
- Saving the dataset after cleaning.
Here is the list of functions and methods used in this part:
- `df.dropna()`: removes the rows containing missing values from a dataset df. To keep the changes, reassign the variable: df = df.dropna(),
- `.drop_duplicates()`: removes the duplicated rows,
- `df.drop(['column name'], axis=1)`: removes a column from the dataset df. To keep the changes, reassign the variable df,
- `df['column name'].value_counts()`: displays the distribution of the values of the variable named between the brackets,
- `df.reset_index()`: resets the index of the dataset (pass drop=True to discard the old index instead of keeping it as a column),
- `categorical(df, 'column name')`: function (defined in Q12) that replaces qualitative values with quantitative values. To keep the changes, reassign the variable: df = categorical(df, 'column name'),
- `df.to_csv('file.csv')`: saves your dataset.
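The cleaning steps listed above can be chained on a small example to see what each one does. This is a sketch on a hypothetical table `toy` with made-up values; none of the methods modify `toy` in place, which is why each result is assigned to a new name.

```python
import pandas as pd

# Made-up data: row 2 duplicates row 1, and row 3 has a missing Gender.
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Gender": ["Male", "Female", "Female", None],
    "Age": [22, 38, 38, 67],
})

clean = toy.drop_duplicates()         # drops the repeated row
clean = clean.dropna()                # drops the row with a missing value
clean = clean.drop(["ID"], axis=1)    # drops the ID column
clean = clean.reset_index(drop=True)  # renumbers the rows 0, 1, ...

print(clean)
print(toy.shape)  # the original table is unchanged
```

Without `drop=True`, `reset_index()` would keep the old row labels in an extra column named `index`.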
# TODO - Remove the duplicated rows
df = df.drop_duplicates()
df
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
… | … | … | … | … | … | … | … | … | … | … | … |
8063 | 464018 | Male | No | 22 | No | NaN | 0.0 | Low | 7.0 | Cat_1 | D |
8064 | 464685 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | Cat_4 | D |
8065 | 465406 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | Cat_6 | D |
8066 | 467299 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | Cat_6 | B |
8067 | 461879 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | Cat_4 | B |
8068 rows × 11 columns
# TODO - Remove the rows containing missing values
df = df.dropna()
[Text area]
Q10. Remove the ‘Var_1’ column and the ‘ID’ column from the dataset.
df.drop(['ID', 'Var_1'], axis=1)
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | D |
2 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | B |
3 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | B |
5 | Male | Yes | 56 | No | Artist | 0.0 | Average | 2.0 | C |
6 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
8062 | Male | Yes | 41 | Yes | Artist | 0.0 | High | 5.0 | B |
8064 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | D |
8065 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | D |
8066 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | B |
8067 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | B |
6665 rows × 9 columns
# Drop the ID and Var_1 columns and reassign df to keep the result
df = df.drop(['ID', 'Var_1'], axis=1)
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | D |
2 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | B |
3 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | B |
5 | Male | Yes | 56 | No | Artist | 0.0 | Average | 2.0 | C |
6 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
8062 | Male | Yes | 41 | Yes | Artist | 0.0 | High | 5.0 | B |
8064 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | D |
8065 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | D |
8066 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | B |
8067 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | B |
6665 rows × 9 columns
# Without drop=True, reset_index() keeps the old index as an extra 'index' column
df.reset_index()
index | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | D |
1 | 2 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | B |
2 | 3 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | B |
3 | 5 | Male | Yes | 56 | No | Artist | 0.0 | Average | 2.0 | C |
4 | 6 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … | … |
6660 | 8062 | Male | Yes | 41 | Yes | Artist | 0.0 | High | 5.0 | B |
6661 | 8064 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | D |
6662 | 8065 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | D |
6663 | 8066 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | B |
6664 | 8067 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | B |
6665 rows × 10 columns
# Reset the index of the dataset, discarding the old index
df = df.reset_index(drop=True)
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | D |
1 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | B |
2 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | B |
3 | Male | Yes | 56 | No | Artist | 0.0 | Average | 2.0 | C |
4 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
6660 | Male | Yes | 41 | Yes | Artist | 0.0 | High | 5.0 | B |
6661 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | D |
6662 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | D |
6663 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | B |
6664 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | B |
6665 rows × 9 columns
# Dropping rows by index label (optional); the result is not reassigned, so df itself is unchanged
df.drop([0,1,2], axis=0)
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
3 | Male | Yes | 56 | No | Artist | 0.0 | Average | 2.0 | C |
4 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | 3.0 | C |
5 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 3.0 | D |
6 | Female | Yes | 61 | Yes | Engineer | 0.0 | Low | 3.0 | D |
7 | Female | Yes | 55 | Yes | Artist | 1.0 | Average | 4.0 | C |
… | … | … | … | … | … | … | … | … | … |
6660 | Male | Yes | 41 | Yes | Artist | 0.0 | High | 5.0 | B |
6661 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | D |
6662 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | D |
6663 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | B |
6664 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | B |
6662 rows × 9 columns
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | D |
1 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | B |
2 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | B |
3 | Male | Yes | 56 | No | Artist | 0.0 | Average | 2.0 | C |
4 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
6660 | Male | Yes | 41 | Yes | Artist | 0.0 | High | 5.0 | B |
6661 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | D |
6662 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | D |
6663 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | B |
6664 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | B |
6665 rows × 9 columns
Q11. Processing the Gender column
- Display the distribution of the values in the `Gender` column,
- Write down your observations in the text area,
- Apply the `categorical` function to the `Gender` column.
# Display the distribution of the 'Gender' column
df.Gender.value_counts()
Male      3677
Female    2988
Name: Gender, dtype: int64
# List of the distinct values in a column
df.Gender.unique()
array(['Male', 'Female'], dtype=object)
# Number of distinct values in a column
df.Gender.nunique()
2
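Q12. Run the following cell to define the `categorical` function, which we will then use to transform our qualitative data into quantitative data.
def categorical(df, column):
    # Work on a copy without missing values so that the caller's DataFrame is not modified
    df = df.dropna().copy()
    # Categories ordered from the most frequent to the least frequent
    liste_ = list(df[column].value_counts().index)
    # Replace each value by its position in that list (0 = most frequent category)
    df[column] = df[column].apply(lambda x: liste_.index(x))
    return df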
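To see what the `categorical` function above produces, the same frequency-rank idea can be written with an explicit dictionary and `Series.map`. This is only an illustrative sketch, not the notebook's method: the names `toy`, `order`, `codes` and the `Spending_Score_code` column are hypothetical, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Spending_Score": ["Low", "High", "Low", "Average", "Low", "High"]}
)

# Categories sorted from most to least frequent: Low (3), High (2), Average (1).
order = list(toy["Spending_Score"].value_counts().index)

# Explicit mapping label -> code, kept so the encoding can be inspected or reversed.
codes = {label: rank for rank, label in enumerate(order)}

toy["Spending_Score_code"] = toy["Spending_Score"].map(codes)
print(codes)
print(toy)
```

In this toy example 'High' receives code 1 and 'Average' code 2: the codes follow the frequency of each category, not its meaning, so they should not be read as an ordered scale unless the frequencies happen to match the natural order.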
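# Side check: a mean can only be computed on numbers, so this value was obtained after 'Profession' had been encoded (see Q15)
df.Profession.mean()
2.3038259564891224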
# Replace the qualitative values with quantitative values using the `categorical` function, 'Gender' column
df = categorical(df, 'Gender')
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | No | 22 | No | 1 | 1.0 | Low | 4.0 | D |
1 | 1 | Yes | 67 | Yes | 4 | 1.0 | Low | 1.0 | B |
2 | 0 | Yes | 67 | Yes | 6 | 0.0 | High | 2.0 | B |
3 | 0 | Yes | 56 | No | 0 | 0.0 | Average | 2.0 | C |
4 | 0 | No | 32 | Yes | 1 | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
6660 | 0 | Yes | 41 | Yes | 0 | 0.0 | High | 5.0 | B |
6661 | 0 | No | 35 | No | 5 | 3.0 | Low | 4.0 | D |
6662 | 1 | No | 33 | Yes | 1 | 1.0 | Low | 1.0 | D |
6663 | 1 | No | 27 | Yes | 1 | 1.0 | Low | 4.0 | B |
6664 | 0 | Yes | 37 | Yes | 5 | 0.0 | Average | 3.0 | B |
6665 rows × 9 columns
Q13. Processing the Ever_Married column
- Display the distribution of the values in the `Ever_Married` column,
- Write down your observations in the text area,
- Apply the `categorical` function to the `Ever_Married` column.
# Display the distribution of the 'Ever_Married' column
df.Ever_Married.value_counts()
Yes    3944
No     2721
Name: Ever_Married, dtype: int64
[Text area]
# Replace the qualitative values with quantitative values, 'Ever_Married' column
df = categorical(df, 'Ever_Married')
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 22 | No | 1 | 1.0 | Low | 4.0 | D |
1 | 1 | 0 | 67 | Yes | 4 | 1.0 | Low | 1.0 | B |
2 | 0 | 0 | 67 | Yes | 6 | 0.0 | High | 2.0 | B |
3 | 0 | 0 | 56 | No | 0 | 0.0 | Average | 2.0 | C |
4 | 0 | 1 | 32 | Yes | 1 | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
6660 | 0 | 0 | 41 | Yes | 0 | 0.0 | High | 5.0 | B |
6661 | 0 | 1 | 35 | No | 5 | 3.0 | Low | 4.0 | D |
6662 | 1 | 1 | 33 | Yes | 1 | 1.0 | Low | 1.0 | D |
6663 | 1 | 1 | 27 | Yes | 1 | 1.0 | Low | 4.0 | B |
6664 | 0 | 0 | 37 | Yes | 5 | 0.0 | Average | 3.0 | B |
6665 rows × 9 columns
Q14. Processing the Graduated column
- Display the distribution of the values in the `Graduated` column,
- Write down your observations in the text area,
- Apply the `categorical` function to the `Graduated` column.
# Display the distribution of the 'Graduated' column
df.Graduated.value_counts()
Yes    4249
No     2416
Name: Graduated, dtype: int64
[Text area]
# Replace the qualitative values with quantitative values
df = categorical(df, 'Graduated')
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 22 | 1 | 1 | 1.0 | Low | 4.0 | D |
1 | 1 | 0 | 67 | 0 | 4 | 1.0 | Low | 1.0 | B |
2 | 0 | 0 | 67 | 0 | 6 | 0.0 | High | 2.0 | B |
3 | 0 | 0 | 56 | 1 | 0 | 0.0 | Average | 2.0 | C |
4 | 0 | 1 | 32 | 0 | 1 | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
6660 | 0 | 0 | 41 | 0 | 0 | 0.0 | High | 5.0 | B |
6661 | 0 | 1 | 35 | 1 | 5 | 3.0 | Low | 4.0 | D |
6662 | 1 | 1 | 33 | 0 | 1 | 1.0 | Low | 1.0 | D |
6663 | 1 | 1 | 27 | 0 | 1 | 1.0 | Low | 4.0 | B |
6664 | 0 | 0 | 37 | 0 | 5 | 0.0 | Average | 3.0 | B |
6665 rows × 9 columns
Q15. Processing the Profession column
- Display the distribution of the values in the `Profession` column,
- Write down your observations in the text area,
- Apply the `categorical` function to the `Profession` column.
# Display the distribution of the 'Profession' column
df.Profession.value_counts()
Artist           2192
Healthcare       1077
Entertainment     809
Doctor            592
Engineer          582
Executive         505
Lawyer            500
Marketing         233
Homemaker         175
Name: Profession, dtype: int64
[Text area]
# Replace the qualitative values with quantitative values
df = categorical(df, 'Profession')
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 22 | 1 | 1 | 1.0 | Low | 4.0 | D |
1 | 1 | 0 | 67 | 0 | 4 | 1.0 | Low | 1.0 | B |
2 | 0 | 0 | 67 | 0 | 6 | 0.0 | High | 2.0 | B |
3 | 0 | 0 | 56 | 1 | 0 | 0.0 | Average | 2.0 | C |
4 | 0 | 1 | 32 | 0 | 1 | 1.0 | Low | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
6660 | 0 | 0 | 41 | 0 | 0 | 0.0 | High | 5.0 | B |
6661 | 0 | 1 | 35 | 1 | 5 | 3.0 | Low | 4.0 | D |
6662 | 1 | 1 | 33 | 0 | 1 | 1.0 | Low | 1.0 | D |
6663 | 1 | 1 | 27 | 0 | 1 | 1.0 | Low | 4.0 | B |
6664 | 0 | 0 | 37 | 0 | 5 | 0.0 | Average | 3.0 | B |
6665 rows × 9 columns
Q16. Processing the Spending_Score column
- Display the distribution of the values in the `Spending_Score` column,
- Write down your observations in the text area,
- Apply the `categorical` function to the `Spending_Score` column.
# Display the distribution of the 'Spending_Score' column
df.Spending_Score.value_counts()
Low        3999
Average    1662
High       1004
Name: Spending_Score, dtype: int64
# Replace the qualitative values with quantitative values
df = categorical(df, 'Spending_Score')
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 22 | 1 | 1 | 1.0 | 0 | 4.0 | D |
1 | 1 | 0 | 67 | 0 | 4 | 1.0 | 0 | 1.0 | B |
2 | 0 | 0 | 67 | 0 | 6 | 0.0 | 2 | 2.0 | B |
3 | 0 | 0 | 56 | 1 | 0 | 0.0 | 1 | 2.0 | C |
4 | 0 | 1 | 32 | 0 | 1 | 1.0 | 0 | 3.0 | C |
… | … | … | … | … | … | … | … | … | … |
6660 | 0 | 0 | 41 | 0 | 0 | 0.0 | 2 | 5.0 | B |
6661 | 0 | 1 | 35 | 1 | 5 | 3.0 | 0 | 4.0 | D |
6662 | 1 | 1 | 33 | 0 | 1 | 1.0 | 0 | 1.0 | D |
6663 | 1 | 1 | 27 | 0 | 1 | 1.0 | 0 | 4.0 | B |
6664 | 0 | 0 | 37 | 0 | 5 | 0.0 | 1 | 3.0 | B |
6665 rows × 9 columns
Q17. Processing the Segmentation column
- Display the distribution of the values in the `Segmentation` column,
- Write down your observations in the text area,
- Apply the `categorical` function to the `Segmentation` column.
# TODO - Display the distribution of the 'Segmentation' column
df.Segmentation.value_counts()
D    1757
C    1720
A    1616
B    1572
Name: Segmentation, dtype: int64
[Text area]
# Replace the qualitative values with quantitative values
df = categorical(df, 'Segmentation')
df
Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Segmentation | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 22 | 1 | 1 | 1.0 | 0 | 4.0 | 0 |
1 | 1 | 0 | 67 | 0 | 4 | 1.0 | 0 | 1.0 | 3 |
2 | 0 | 0 | 67 | 0 | 6 | 0.0 | 2 | 2.0 | 3 |
3 | 0 | 0 | 56 | 1 | 0 | 0.0 | 1 | 2.0 | 1 |
4 | 0 | 1 | 32 | 0 | 1 | 1.0 | 0 | 3.0 | 1 |
… | … | … | … | … | … | … | … | … | … |
6660 | 0 | 0 | 41 | 0 | 0 | 0.0 | 2 | 5.0 | 3 |
6661 | 0 | 1 | 35 | 1 | 5 | 3.0 | 0 | 4.0 | 0 |
6662 | 1 | 1 | 33 | 0 | 1 | 1.0 | 0 | 1.0 | 0 |
6663 | 1 | 1 | 27 | 0 | 1 | 1.0 | 0 | 4.0 | 3 |
6664 | 0 | 0 | 37 | 0 | 5 | 0.0 | 1 | 3.0 | 3 |
6665 rows × 9 columns
Q18. Save your dataset to a file named ‘data_clean.csv’.
# Save your dataset
df.to_csv('data_clean.csv')
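One detail worth knowing when saving: by default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below illustrates this with an in-memory buffer instead of a real file; the table `toy` and the buffer names are hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Gender": [0, 1], "Age": [22, 67]})  # made-up cleaned data

# Default behaviour: the index is written as a first, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

If the index carries no information (as after `reset_index(drop=True)`), passing `index=False` keeps the saved file identical in shape to the DataFrame.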
4. Data Visualisation
Using the following site: https://www.python-graph-gallery.com and the seaborn library, we will display a few charts.
- Importing the library: `import seaborn as sns`
- A histogram: https://www.python-graph-gallery.com/histogram/ `sns.histplot(data)`
- A box plot: https://www.python-graph-gallery.com/boxplot/ `sns.boxplot(data)`
- A pie chart: https://www.python-graph-gallery.com/pie-plot/ `import matplotlib.pyplot as plt` `plt.pie(data)`
import pandas as pd
df = pd.read_csv('data.csv')
df
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
… | … | … | … | … | … | … | … | … | … | … | … |
8063 | 464018 | Male | No | 22 | No | NaN | 0.0 | Low | 7.0 | Cat_1 | D |
8064 | 464685 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | Cat_4 | D |
8065 | 465406 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | Cat_6 | D |
8066 | 467299 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | Cat_6 | B |
8067 | 461879 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | Cat_4 | B |
8068 rows × 11 columns
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 5))
plt.title('Histogram of ages')
sns.histplot(df.Age)
plt.show()
plt.figure(figsize=(8, 5))
plt.title('Box plot of the "Age" variable')
sns.boxplot(x=df.Age)
plt.show()
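A box plot draws the quartiles of a variable, and with the default settings of matplotlib and seaborn the whiskers extend at most 1.5 times the interquartile range beyond the box; points further away are drawn individually. The numbers behind the picture can be computed with pandas. The sketch below uses a made-up Series `ages`, chosen so that its quartiles (30, 40, 53) match those reported for `Age` by `df.describe()` earlier.

```python
import pandas as pd

ages = pd.Series([18, 22, 30, 35, 40, 45, 53, 60, 89])  # made-up ages

# Quartiles: edges of the box and the line inside it.
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range (height of the box)

# Limits beyond which points are drawn as individual outliers (1.5 * IQR rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers.tolist())
```

With these quartiles the upper limit is 87.5, so under this rule an age of 89 would appear as an isolated point above the upper whisker.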
plt.figure(figsize=(15, 5))
plt.title('Box plot of ages by gender')
sns.boxplot(x=df.Gender, y=df.Age)
plt.show()
plt.figure(figsize=(15, 5))
plt.title('Box plot of ages by profession and gender of the customers')
sns.boxplot(x=df.Profession, y=df.Age, hue=df.Gender)
plt.show()
plt.figure(figsize=(15, 5))
plt.title('Box plot of ages by profession and marital status of the customers')
sns.boxplot(x=df.Profession, y=df.Age, hue=df.Ever_Married)
plt.show()
plt.figure(figsize=(15, 5))
plt.title("Scatter plot of years of work experience as a function of age")
sns.scatterplot(x=df.Age, y=df.Work_Experience, hue=df.Ever_Married)
plt.show()
# Two equivalent ways of selecting a column: bracket notation and attribute notation
df['Profession'].value_counts()
df.Profession.value_counts()
Artist           2516
Healthcare       1332
Entertainment     949
Engineer          699
Doctor            688
Lawyer            623
Executive         599
Marketing         292
Homemaker         246
Name: Profession, dtype: int64
plt.figure(figsize=(15, 5))
plt.title("Pie chart of professions")
plt.pie(df.Profession.value_counts(), labels=df.Profession.value_counts().index, autopct='%1.1f%%')
plt.show()
plt.figure(figsize=(15, 5))
plt.title("Histogram of professions")
sns.histplot(y=df.Profession)
plt.show()
Advanced pandas manipulations
import pandas as pd
df = pd.read_csv('data.csv', delimiter=',', encoding='utf-8')
df
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
… | … | … | … | … | … | … | … | … | … | … | … |
8063 | 464018 | Male | No | 22 | No | NaN | 0.0 | Low | 7.0 | Cat_1 | D |
8064 | 464685 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | Cat_4 | D |
8065 | 465406 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | Cat_6 | D |
8066 | 467299 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | Cat_6 | B |
8067 | 461879 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | Cat_4 | B |
8068 rows × 11 columns
# Distribution of a variable
df.Profession.value_counts()
Artist           2516
Healthcare       1332
Entertainment     949
Engineer          699
Doctor            688
Lawyer            623
Executive         599
Marketing         292
Homemaker         246
Name: Profession, dtype: int64
df.Profession.value_counts().index
Index(['Artist', 'Healthcare', 'Entertainment', 'Engineer', 'Doctor', 'Lawyer', 'Executive', 'Marketing', 'Homemaker'], dtype='object')
df.Profession.value_counts().values
array([2516, 1332, 949, 699, 688, 623, 599, 292, 246])
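The `.index` and `.values` attributes shown above are the labels and the counts of the result, which is what the earlier pie chart call passes as `labels` and sizes. Two options of `value_counts` are often useful alongside them: `normalize=True` returns proportions instead of counts, and `dropna=False` also counts the missing values. A sketch on a made-up Series `professions`:

```python
import pandas as pd

# Made-up values, including one missing entry.
professions = pd.Series(
    ["Artist", "Artist", "Doctor", "Artist", "Lawyer", "Doctor", None]
)

counts = professions.value_counts()                    # missing values are ignored
shares = professions.value_counts(normalize=True)      # proportions that sum to 1
with_missing = professions.value_counts(dropna=False)  # NaN gets its own row

print(counts.index.tolist(), counts.values.tolist())
print(shares)
print(with_missing)
```

Because missing values are ignored by default, the counts of a column sum to the number of non-missing rows, not to the number of rows of the DataFrame.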
# Filter: customers aged 18 who report 9 years of work experience
df_filtre = df[(df.Age==18) & (df.Work_Experience==9)]
df_filtre
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | |
---|---|---|---|---|---|---|---|---|---|---|---|
368 | 463560 | Female | No | 18 | No | Healthcare | 9.0 | Low | 5.0 | Cat_4 | D |
2044 | 460485 | Female | No | 18 | No | Marketing | 9.0 | Low | 5.0 | Cat_3 | B |
7244 | 466642 | Male | No | 18 | No | Healthcare | 9.0 | Low | 3.0 | Cat_6 | D |
# Grouping data
# Select several columns with a list (double brackets); the tuple form is deprecated
df.groupby(['Gender', 'Profession'])[['Work_Experience', 'Age']].mean()
Gender | Profession | Work_Experience | Age |
---|---|---|---|
Female | Artist | 2.817446 | 46.406430 |
Female | Doctor | 2.387097 | 37.815789 |
Female | Engineer | 2.650794 | 41.509839 |
Female | Entertainment | 3.276596 | 39.818182 |
Female | Executive | 2.423077 | 43.593750 |
Female | Healthcare | 2.766000 | 27.475763 |
Female | Homemaker | 6.560000 | 37.113300 |
Female | Lawyer | 1.247232 | 74.905537 |
Female | Marketing | 3.271429 | 36.201220 |
Male | Artist | 2.606873 | 46.254029 |
Male | Doctor | 2.809117 | 36.757812 |
Male | Engineer | 2.370968 | 42.685714 |
Male | Entertainment | 2.559809 | 44.163793 |
Male | Executive | 2.322709 | 51.520282 |
Male | Healthcare | 2.464912 | 26.361290 |
Male | Homemaker | 6.000000 | 41.744186 |
Male | Lawyer | 1.211896 | 75.515823 |
Male | Marketing | 1.805310 | 37.609375 |
df.groupby(['Gender', 'Profession'])['Age'].max()
Gender  Profession
Female  Artist           89
        Doctor           81
        Engineer         81
        Entertainment    89
        Executive        83
        Healthcare       63
        Homemaker        85
        Lawyer           89
        Marketing        76
Male    Artist           89
        Doctor           89
        Engineer         72
        Entertainment    84
        Executive        89
        Healthcare       86
        Homemaker        73
        Lawyer           89
        Marketing        89
Name: Age, dtype: int64
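The mean and the maximum are computed in two separate calls above. As a sketch, named aggregation with `.agg` can produce several statistics in one grouped table; the toy data and the output column names (`mean_experience`, `mean_age`, `max_age`) are hypothetical.

```python
import pandas as pd

# Hypothetical toy data reusing column names from the customer file
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male"],
    "Profession": ["Artist", "Artist", "Doctor", "Doctor"],
    "Age": [30, 40, 50, 70],
    "Work_Experience": [1.0, 3.0, 2.0, None],
})

# One output column per (source column, aggregation) pair
summary = toy.groupby(["Gender", "Profession"]).agg(
    mean_experience=("Work_Experience", "mean"),
    mean_age=("Age", "mean"),
    max_age=("Age", "max"),
)

print(summary)
```

Missing values are skipped by `mean`, so the Male/Doctor group averages only the one known `Work_Experience` value.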
# Label encoding: map each profession to an integer code with sklearn's LabelEncoder
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
Profession_encod = encoder.fit_transform(df.Profession)
Profession_encod
array([5, 2, 2, ..., 5, 5, 4])
encoder.classes_
array(['Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive', 'Healthcare', 'Homemaker', 'Lawyer', 'Marketing', nan], dtype=object)
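Note that `nan` appears in `classes_` above, so missing professions receive an ordinary integer code of their own. As a hedged alternative (not the approach used in this notebook), pandas categorical codes give a similar alphabetical mapping while marking missing values with -1. The toy Series below is hypothetical.

```python
import pandas as pd

# Hypothetical toy column with one missing profession
professions = pd.Series(["Healthcare", "Engineer", None, "Artist", "Engineer"])

# Convert to a categorical; categories are sorted alphabetically
as_category = professions.astype("category")

# Integer code per row; missing values are encoded as -1
codes = as_category.cat.codes

# The categories play the role of encoder.classes_
labels = list(as_category.cat.categories)

print(codes.tolist())
print(labels)
```

Keeping missing values at -1 makes them easy to spot later, whereas a regular class code can be mistaken for a real profession.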
# One-hot encoding: pd.get_dummies(..., prefix=...)
df_Profession = pd.get_dummies(df.Profession, prefix='Profession')
df_Profession
 | Profession_Artist | Profession_Doctor | Profession_Engineer | Profession_Entertainment | Profession_Executive | Profession_Healthcare | Profession_Homemaker | Profession_Lawyer | Profession_Marketing |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
… | … | … | … | … | … | … | … | … | … |
8063 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8064 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8065 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
8066 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
8067 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8068 rows × 9 columns
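Row 8063 in the table above contains only zeros because its profession is missing, and `get_dummies` ignores missing values by default. The sketch below, on a hypothetical toy Series, shows the `dummy_na=True` option, which adds an indicator column for missing values. `dtype=int` is passed here on the assumption that 0/1 output is wanted, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy column with one missing profession
professions = pd.Series(["Healthcare", None, "Artist"], name="Profession")

# Default: the missing row gets zeros in every indicator column
plain = pd.get_dummies(professions, prefix="Profession", dtype=int)

# dummy_na=True adds one extra indicator column for missing values
with_na = pd.get_dummies(professions, prefix="Profession", dummy_na=True, dtype=int)

print(plain)
print(with_na)
```

With the extra column, every row sums to exactly 1, which makes all-zero rows impossible to confuse with a real category.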
# Concatenate the two DataFrames column-wise, then drop the original Profession column
df = pd.concat([df, df_Profession], axis=1).drop(['Profession'], axis=1)
df
 | ID | Gender | Ever_Married | Age | Graduated | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation | Profession_Artist | Profession_Doctor | Profession_Engineer | Profession_Entertainment | Profession_Executive | Profession_Healthcare | Profession_Homemaker | Profession_Lawyer | Profession_Marketing |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 462809 | Male | No | 22 | No | 1.0 | Low | 4.0 | Cat_4 | D | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 462643 | Female | Yes | 38 | Yes | NaN | Average | 3.0 | Cat_4 | A | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 466315 | Female | Yes | 67 | Yes | 1.0 | Low | 1.0 | Cat_6 | B | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 461735 | Male | Yes | 67 | Yes | 0.0 | High | 2.0 | Cat_6 | B | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 462669 | Female | Yes | 40 | Yes | NaN | High | 6.0 | Cat_6 | A | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
8063 | 464018 | Male | No | 22 | No | 0.0 | Low | 7.0 | Cat_1 | D | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8064 | 464685 | Male | No | 35 | No | 3.0 | Low | 4.0 | Cat_4 | D | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8065 | 465406 | Female | No | 33 | Yes | 1.0 | Low | 1.0 | Cat_6 | D | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
8066 | 467299 | Female | No | 27 | Yes | 1.0 | Low | 4.0 | Cat_6 | B | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
8067 | 461879 | Male | Yes | 37 | Yes | 0.0 | Average | 3.0 | Cat_4 | B | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8068 rows × 19 columns
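The encode, concatenate and drop steps above can also be done in a single call by passing the whole DataFrame to `pd.get_dummies` with the `columns` argument. This is a sketch on hypothetical toy data, offered as an alternative rather than as the notebook's method.

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the customer file
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Profession": ["Artist", "Doctor", "Artist"],
})

# Two-step version, mirroring the cells above
dummies = pd.get_dummies(toy["Profession"], prefix="Profession", dtype=int)
two_step = pd.concat([toy, dummies], axis=1).drop(["Profession"], axis=1)

# One-step version: the listed columns are encoded and removed automatically
one_step = pd.get_dummies(toy, columns=["Profession"], dtype=int)

print(one_step)
```

In the one-step form the column name is used as the prefix, so the resulting column names match the two-step version.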
# Change the data type: .astype(str)
df['Age_Str'] = df.Age.astype(str)
df.Age_Str = df.Age_Str + ' ans'
df.Age_Str
0       22 ans
1       38 ans
2       67 ans
3       67 ans
4       40 ans
         ...
8063    22 ans
8064    35 ans
8065    33 ans
8066    27 ans
8067    37 ans
Name: Age_Str, Length: 8068, dtype: object
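# Custom data transformation: .apply(lambda x: ...)
# First check the string operation on a single value
'22 ans'.replace(' ans', '')
'22'
# Two equivalent ways to strip the suffix and convert back to integers
df.Age_Str.apply(lambda x: int(x.replace(' ans', '')))
df.Age_Str.apply(lambda x: x.replace(' ans', '')).astype(int)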
0       22
1       38
2       67
3       67
4       40
        ..
8063    22
8064    35
8065    33
8066    27
8067    37
Name: Age_Str, Length: 8068, dtype: int64
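For simple string clean-up like this, the vectorized `.str` accessor is a common alternative to `.apply` with a lambda. The sketch below compares both on a hypothetical toy Series.

```python
import pandas as pd

# Hypothetical toy column standing in for df.Age_Str
age_str = pd.Series(["22 ans", "38 ans", "67 ans"], name="Age_Str")

# Element-wise version with apply, as in the cells above
via_apply = age_str.apply(lambda x: int(x.replace(" ans", "")))

# Vectorized version: literal (non-regex) replacement, then cast to int
via_str = age_str.str.replace(" ans", "", regex=False).astype(int)

print(via_str.tolist())
```

`.apply` remains the more flexible choice when the transformation cannot be expressed with the built-in string methods.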
# Replace missing values
median_Work_Experience = df.Work_Experience.median()
median_Work_Experience
1.0
# fillna returns a new Series; df itself is unchanged unless the result is assigned back
df.Work_Experience.fillna(int(median_Work_Experience))
0       1.0
1       1.0
2       1.0
3       0.0
4       1.0
       ...
8063    0.0
8064    3.0
8065    1.0
8066    1.0
8067    0.0
Name: Work_Experience, Length: 8068, dtype: float64
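The call above only displays the filled values. A minimal sketch on hypothetical toy data shows how the result could be written back to the column so that the missing values are actually replaced.

```python
import pandas as pd

# Hypothetical toy data with two missing Work_Experience values
toy = pd.DataFrame({"Work_Experience": [1.0, None, 0.0, None, 3.0]})

# The median ignores missing values
median_value = toy["Work_Experience"].median()

# Count missing values before the replacement
missing_before = int(toy["Work_Experience"].isna().sum())

# Assign the filled Series back to the column to make the change persistent
toy["Work_Experience"] = toy["Work_Experience"].fillna(median_value)

missing_after = int(toy["Work_Experience"].isna().sum())
print(missing_before, missing_after)
```

Checking `isna().sum()` before and after is a simple way to confirm that the replacement took effect.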
# Save the notebook as an HTML page
#jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb