Data Analysis and Visualization with Python

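Python is widely used for data analysis because of its robust libraries and tools for managing data. Among these libraries is Pandas, which makes data exploration, manipulation, and analysis easier. We will use Pandas to analyze a dataset called Country-data.csv from Kaggle. While working with this data, we will also introduce some key Pandas concepts.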

1. Installation

The easiest way to install pandas is to use pip:

Python

    pip install pandas

Alternatively, download it from here.

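2. Creating a DataFrame in Pandas

A DataFrame is a table-like data structure in Pandas that stores data in rows and columns. A DataFrame can be created by passing multiple Series objects (built with pd.Series) to the DataFrame class (pd.DataFrame()). In this example, two Series objects are used: s1 as the first row and s2 as the second row.

Example 1: Creating a DataFrame from Series: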

Python

    import pandas as pd

    # Creating two Series: s1 (numbers) and s2 (names)
    s1 = pd.Series([1, 2])
    s2 = pd.Series(['Ashish', 'Sid'])

    # Creating DataFrame by combining Series as rows
    dataframe = pd.DataFrame([s1, s2])

    # Displaying the DataFrame
    print(dataframe)


Example 2: DataFrame from a list with custom index and column names:

Python

    dataframe1 = pd.DataFrame([[1, 2], ['Ashish', 'Sid']],
                              index=['r1', 'r2'],
                              columns=['c1', 'c2'])
    print(dataframe1)


Example 3: DataFrame from a dictionary:

Python

    dataframe2 = pd.DataFrame({
        'c1': [1, 'Ashish'],
        'c2': [2, 'Sid']
    })
    print(dataframe2)

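The difference between building a DataFrame from a list of Series and from a dictionary can be sketched as follows. This is a minimal illustration, not part of the original examples; the column names 'id' and 'name' are hypothetical.

```python
import pandas as pd

# Same sample values as in Example 1
s1 = pd.Series([1, 2])
s2 = pd.Series(['Ashish', 'Sid'])

# A list of Series: each Series becomes one ROW
by_rows = pd.DataFrame([s1, s2])

# A dict of Series: each Series becomes one COLUMN, named by its dict key
# ('id' and 'name' are hypothetical column names)
by_cols = pd.DataFrame({'id': s1, 'name': s2})

print(by_rows)
print(by_cols)
```

In by_rows the names sit side by side in the second row, while in by_cols they run down the 'name' column.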

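3. Importing Data with Pandas

The first step is to read the data. In this example, the data is stored as a CSV (comma-separated values) file, in which each row is separated by a newline and each column by a comma. To work with the data in Python, the CSV file must be read into a Pandas DataFrame.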

Python

    import pandas as pd

    # Read Country-data.csv into a DataFrame
    df = pd.read_csv('Country-data.csv')

    # First 5 rows of the DataFrame (5 is the default for head())
    df.head()

    # Number of rows and columns of the DataFrame
    df.shape

Output:

(167, 10)
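If the Kaggle file is not at hand, the same read step can be tried on an in-memory CSV. The sketch below is an assumption-laden stand-in: the three rows and the country names are invented for illustration, and only the column names child_mort and exports are taken from the dataset used in this article.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like a CSV file:
# one record per line, fields separated by commas
csv_text = """country,child_mort,exports
Aland,10.5,20.0
Borduria,3.2,45.1
Carpania,60.0,12.3
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # column types inferred while parsing
```

The header line supplies the column names, and numeric columns are parsed as floats automatically.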

4. Indexing DataFrames with Pandas

Pandas provides powerful indexing capabilities. A DataFrame can be indexed using both position-based and label-based methods.

Position-based indexing (using iloc):

Python

    # First 5 rows and every column; equivalent to df.head()
    df.iloc[0:5, :]

    # All rows and all columns
    df.iloc[:, :]

    # Rows from position 5 onward, first 5 columns
    df.iloc[5:, :5]


Label-based indexing (using loc):

Indexing can also be done with pandas.DataFrame.loc, which selects by label instead of by position.

Example:

Python

    # Rows labeled 0 through 5 (loc includes the end label) and every column
    df.loc[0:5, :]

    # Rows from label 5 onward and every column
    df.loc[5:, :]


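The above is not much different from df.iloc[0:5, :]. Row labels can take any value, but in this DataFrame they match the positions exactly. The one difference is that loc includes the end label, so df.loc[0:5, :] returns six rows where df.iloc[0:5, :] returns five. Column labels, however, make working with the data much easier.

Example: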

Python

    # The child_mort column for rows labeled 0 through 5
    df.loc[:5, 'child_mort']

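The distinction between iloc and loc becomes visible once the row labels are not the default 0, 1, 2, ... The following is a minimal sketch on a hypothetical four-row frame with string labels; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical frame whose row labels are strings rather than integers
df_demo = pd.DataFrame(
    {'child_mort': [90.2, 16.6, 27.3, 119.0],
     'exports': [10.0, 28.0, 38.4, 62.3]},
    index=['a', 'b', 'c', 'd'],
)

# iloc slices by position and excludes the end position: rows 0 and 1
by_position = df_demo.iloc[0:2, :]

# loc slices by label and includes the end label: rows 'a', 'b' and 'c'
by_label = df_demo.loc['a':'c', :]

print(by_position)
print(by_label)
```

Here the positional slice returns two rows and the label slice returns three, even though both look like "start up to end".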

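5. Computations on DataFrames with Pandas

Pandas makes it easy to perform mathematical operations on the data stored in a DataFrame. These operations are vectorized, so they are fast and are applied to every element automatically, without writing loops.

Example - column-wise computation: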

Python

    # Adding 5 to every element in the child_mort column
    df['child_mort'] = df['child_mort'] + 5

    # Multiplying values in the exports column by 10
    df['exports'] = df['exports'] * 10

    df

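What "vectorized" means here can be sketched by comparing one expression on a whole Series with the explicit loop it replaces. The sample values are hypothetical, and the loop is shown only for comparison.

```python
import pandas as pd

# Hypothetical sample values
values = pd.Series([1.0, 2.5, 4.0])

# Vectorized: a single expression applied to every element
vectorized = values + 5

# Equivalent explicit loop, shown only for comparison
looped = pd.Series([v + 5 for v in values])

# Both approaches produce the same Series
print(vectorized.equals(looped))
```

The two results are identical; the vectorized form is shorter and avoids a Python-level loop.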

Statistical functions in Pandas:

Computations on a DataFrame can also be performed with the statistical functions built into pandas, such as:

df.sum() → sum of values
df.mean() → mean
df.max() / df.min() → maximum and minimum values
df.describe() → quick statistical summary

Python

    # Computes summary statistics for the numeric columns, excluding NaN values
    df.describe()

    # Sum of the values in each column
    df.sum()

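How these functions treat missing values can be checked on a small frame. This is a sketch with invented numbers; the column names income and imports are borrowed from the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one missing value in 'imports'
stats_demo = pd.DataFrame({'income': [1000, 2000, 6000],
                           'imports': [10.0, 20.0, None]})

# Column-wise sum and mean; NaN values are skipped by default
col_sums = stats_demo.sum()
col_means = stats_demo.mean()

# describe() reports how many non-missing values each column has
summary = stats_demo.describe()

print(col_sums)
print(col_means)
print(summary.loc['count'])
```

The mean of 'imports' is taken over the two available values only, and its count in describe() is 2 rather than 3.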

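6. Data Visualization with Pandas and Matplotlib

Pandas works very well with Matplotlib, a powerful library for creating basic plots and charts. With just a few lines of code, you can visualize your data and understand it better. Below are some simple examples to help you get started with plotting using Pandas and Matplotlib.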

Python

    # Import the library first
    import matplotlib.pyplot as plt

Histogram

A histogram shows the distribution of values in a column.

Python

    df['income'].hist(bins=10)
    plt.title('Histogram of Income')
    plt.xlabel('Income Value')
    plt.ylabel('Frequency')
    plt.show()

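The bar heights in a histogram are simply counts of values per equal-width bin. As a rough sketch, the same counts can be computed numerically with numpy.histogram; the income values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical income values
income = pd.Series([1000, 1500, 2000, 5000, 9000, 10000])

# Split the range from min to max into 3 equal-width bins and count values
counts, edges = np.histogram(income, bins=3)

print(counts)  # number of values falling in each bin
print(edges)   # bin boundaries, one more than the number of bins
```

Three values fall in the lowest bin, one in the middle bin, and two in the highest bin, which is the shape a three-bin histogram of this data would draw.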

Box Plot

A box plot helps detect outliers and understand the spread of the data.

Python

    # Use only the first 10 countries so the x-axis stays readable;
    # a separate variable keeps the full df intact for later examples
    top10 = df.head(10)

    # Wider figure to make x-axis labels clearer
    top10.boxplot(column='imports', by='country', figsize=(20, 6))
    plt.title('Boxplot by Country')
    plt.suptitle('')  # Removes default title
    plt.xlabel('Country')
    plt.ylabel('Imports')
    plt.xticks(rotation=45)  # Optional: rotate x-axis labels for better visibility
    plt.tight_layout()  # Adjust layout to avoid clipping
    plt.show()

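The outliers a box plot marks can also be found numerically. The sketch below assumes the common 1.5 × IQR rule (Matplotlib's default whisker setting) and uses hypothetical import values.

```python
import pandas as pd

# Hypothetical import values with one unusually large entry
imports = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 100])

# Quartiles and interquartile range (the height of the box)
q1 = imports.quantile(0.25)
q3 = imports.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the box are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = imports[(imports < lower) | (imports > upper)]

print(q1, q3, iqr)
print(outliers.tolist())
```

With these numbers the box spans 14 to 22, the fences sit at 2 and 34, and only the value 100 is flagged.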

Scatter Plot

A scatter plot shows the relationship between two variables.

Python

    x = df['health']
    y = df['life_expec']

    plt.scatter(x, y, label='Data Points', color='m', marker='*', s=30)
    plt.xlabel('Health')
    plt.ylabel('Life Expectancy')
    plt.title('Scatter Plot of Health vs Life Expectancy')
    plt.legend()
    plt.show()

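The relationship that a scatter plot shows visually can be summarized as a single number with Series.corr, which computes the Pearson correlation by default. The following is a minimal sketch on hypothetical, perfectly linear data; real data would give a value somewhere between -1 and 1.

```python
import pandas as pd

# Hypothetical values in which life_expec rises in a straight line with health
demo = pd.DataFrame({'health': [1.0, 2.0, 3.0, 4.0],
                     'life_expec': [52.0, 54.0, 56.0, 58.0]})

# Pearson correlation between the two columns
r = demo['health'].corr(demo['life_expec'])

# Reversing one column flips the sign of the relationship
r_reversed = demo['health'].corr(demo['life_expec'][::-1].reset_index(drop=True))

print(round(r, 6))
print(round(r_reversed, 6))
```

A value near 1 corresponds to points rising along a line in the scatter plot, and a value near -1 to points falling along a line.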