Pandas
Most scientists are first introduced to data tables in the form of spreadsheets, such as an Excel spreadsheet. In such a structure, each record or feature is represented by a row of data, while each column represents a specific piece of information recorded for every record. For example, each row in a table could contain all the data for a specific student, and each column could hold a single attribute, such as academic year, age, or height, for all students.
Further, spreadsheets can hold a different data type in each column. For example, academic year is categorical or ordinal, while height is ratio or continuous.
A comparable data structure would be handy for use in Python. This is made available by the Pandas library. Pandas allows for data to be stored in DataFrames. If you work in the R environment, this is very similar to the concept of data frames in R. In fact, Pandas DataFrames were inspired by R data frames.
Pandas makes use of the NumPy library, so it is generally a good idea to import NumPy if you plan to use Pandas. Also, you will need to install Pandas and NumPy into your environment prior to using them.
The complete documentation for Pandas can be found here.
After working through this module you will be able to:
- create Series and DataFrames using Pandas.
- query and subset DataFrames.
- manipulate and combine DataFrames.
- summarize and group DataFrames.
Intro to Series
import numpy as np
import pandas as pd
Before we talk about DataFrames, I will introduce the concept of a Series. A Series is actually very similar to a NumPy array, except that it allows axis labels to be assigned. Examples of generating a Series from a list, a NumPy array, and a dictionary are provided below.
A Series is comparable to a single column from a spreadsheet.
lst1 = ["GIS", "Remote Sensing", "Spatial Analysis", "Digital Cartography"]
arr1 = np.array([350, 455, 457, 642])
dict1 = {'Class1':"GIS", "Class2":"Remote Sensing", "Class3":"Spatial Analysis", "Class4":"Digital Cartography"}
s_lst = pd.Series(data=lst1, index = ["Class1", "Class2", "Class3", "Class4"])
s_arr = pd.Series(data=arr1, index = ["Class1", "Class2", "Class3", "Class4"])
s_dict = pd.Series(dict1)
print(s_lst)
print(s_arr)
print(s_dict)
Class1 GIS
Class2 Remote Sensing
Class3 Spatial Analysis
Class4 Digital Cartography
dtype: object
Class1 350
Class2 455
Class3 457
Class4 642
dtype: int32
Class1 GIS
Class2 Remote Sensing
Class3 Spatial Analysis
Class4 Digital Cartography
dtype: object
Labels or names can then be used to select data using either bracket notation or dot notation.
You can use whichever method you prefer; however, if you use dot notation, the column names should not include spaces.
print(s_dict["Class3"])
print(s_dict.Class3)
Spatial Analysis
Spatial Analysis
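Since a Series wraps a NumPy array, it also supports vectorized arithmetic while keeping its labels. A quick sketch using the s_arr Series created above:
# Arithmetic is applied element-wise and the index labels are preserved
print(s_arr * 2)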
Intro to DataFrames
Let's start by building a DataFrame from a set of lists. First, I create three lists to hold different components of course title information. Next, I combine these lists into a dictionary. Finally, I convert the dictionary into a DataFrame. Note that a well-formatted table is generated by simply calling the DataFrame name without the print() function; however, I am using print() here since I have found that it renders better on a web page. Also, the keys from the dictionary have been used as the column names, and a default index has been assigned to each row.
prefix = ["Geol", "Geol", "Geol", "Geol", "Geog", "Geog", "Geog"]
cnum = [103, 321, 331, 341, 350, 455, 462]
cname = ["Earth Through Time", "Geomorphology", "Paleontology", "Structural Geology", "GIScience", "Remote Sensing", "Digital Cartography"]
course_dict = {"prefix": prefix, "course_number": cnum, "course_name": cname}
course_df = pd.DataFrame(course_dict)
print(course_df)
prefix course_number course_name
0 Geol 103 Earth Through Time
1 Geol 321 Geomorphology
2 Geol 331 Paleontology
3 Geol 341 Structural Geology
4 Geog 350 GIScience
5 Geog 455 Remote Sensing
6 Geog 462 Digital Cartography
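As with spreadsheet columns, each DataFrame column stores its own data type. These can be checked with the dtypes attribute, as sketched below.
# Report the data type of each column
print(course_df.dtypes)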
Since column names are assigned, they can be used to select out individual columns using bracket or dot notation. A single column can be saved as a Series.
print(course_df["course_name"])
# Or course_df.course_name
0 Earth Through Time
1 Geomorphology
2 Paleontology
3 Structural Geology
4 GIScience
5 Remote Sensing
6 Digital Cartography
Name: course_name, dtype: object
A list of column names can be provided to subset out multiple columns.
print(course_df[["course_number", "course_name"]])
course_number course_name
0 103 Earth Through Time
1 321 Geomorphology
2 331 Paleontology
3 341 Structural Geology
4 350 GIScience
5 455 Remote Sensing
6 462 Digital Cartography
You can also subset based on column names and row labels combined using the .loc method. The .iloc method, in contrast, is used for selection based on integer positions.
print(course_df.loc[[1, 2, 4],["course_number", "course_name"]])
# Or course_df.iloc[[1, 2, 4],[1, 2]]
course_number course_name
1 321 Geomorphology
2 331 Paleontology
4 350 GIScience
You can even use the data stored in existing columns to create a new column. Note that the new column does not need to be declared prior to writing to it. In the example, I write the entire course name to a new column. The map() method is used to make sure all data are treated as strings. It allows the same function, in this case str(), to be applied to each element in an iterable, in this case each value in the column. I am including blank spaces so that the components are not run together.
course_df["full_name"] = course_df["prefix"].map(str) + " " + course_df["course_number"].map(str) + ": " + course_df["course_name"].map(str)
print(course_df)
prefix course_number course_name full_name
0 Geol 103 Earth Through Time Geol 103: Earth Through Time
1 Geol 321 Geomorphology Geol 321: Geomorphology
2 Geol 331 Paleontology Geol 331: Paleontology
3 Geol 341 Structural Geology Geol 341: Structural Geology
4 Geog 350 GIScience Geog 350: GIScience
5 Geog 455 Remote Sensing Geog 455: Remote Sensing
6 Geog 462 Digital Cartography Geog 462: Digital Cartography
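As a side note, the astype() method could be used in place of map(str) to cast each column to strings before concatenation; a brief sketch that builds the same full_name column:
# astype(str) converts the whole column to strings in one step
course_df["full_name"] = course_df["prefix"].astype(str) + " " + course_df["course_number"].astype(str) + ": " + course_df["course_name"].astype(str)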
Instead of creating data tables or DataFrames manually, you are probably more likely to read in a data table from a file or web link. Fortunately, Pandas provides functions for reading data in from a variety of formats. Here are some examples:
- read_table(): delimited file (TXT, CSV, etc.)
- read_csv(): comma-separated values (CSV)
- read_excel(): Excel Spreadsheet
- read_json(): JavaScript Object Notation (JSON)
- read_html(): HTML table
- read_sas(): SAS file
Full documentation on reading in data can be found here.
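For example, reading an Excel spreadsheet follows the same pattern as reading a CSV file. The sketch below uses a hypothetical file path and sheet name; depending on your setup, you may also need an Excel engine such as openpyxl installed.
# Hypothetical path and sheet name; adjust to match your own workbook
xl_df = pd.read_excel("D:/data/example.xlsx", sheet_name="Sheet1", header=0)
print(xl_df.head())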
In the example below, I am reading in a CSV file from my local computer. The sep argument is used to define the delimiter; however, a comma is the default, so it isn't necessary to include this argument in this case. Setting the header argument to 0 indicates that the first row of data should be treated as column names or headers. It isn't always necessary to specify the character encoding; however, I found that it is necessary for this data set due to the use of special characters.
To view the first 10 rows of the data, I use the head() method. The len() function returns the number of rows.
This data set was created by my brother and provides his ratings for movies along with the movie's name, director, release year, genre, and whether or not he owns it.
movies_df = pd.read_csv("D:/data/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
print(movies_df.head(10))
Movie Name Director Release Year My Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
2 Groundhog Day Harold Ramis 1993 9.96
3 Donnie Darko Richard Kelly 2001 9.95
4 Children of Men Alfonso Cuaron 2006 9.94
5 Annie Hall Woody Allen 1977 9.93
6 Rushmore Wes Anderson 1998 9.92
7 Memento Christopher Nolan 2000 9.91
8 No Country for Old Men Joel and Ethan Coen 2007 9.90
9 Seven David Fincher 1995 9.88
Genre Own
0 Drama Yes
1 Drama Yes
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
8 Thriller Yes
9 Thriller Yes
print(len(movies_df))
1852
DataFrame Query and Subset
Let's now use this data table to explore data query and selection methods. In the first example, I am selecting out all movies that are dramas and saving them to a new DataFrame. Note the use of bracket notation: the boolean expression inside the outer brackets performs the selection.
The second example includes a compound query. Note the use of parentheses within the query.
Lastly, it is also possible to subset out only certain columns that meet the query. In the last example, I am subsetting out just the movie name and director.
#Example 1
just_dramas = movies_df[movies_df["Genre"]=="Drama"]
print(just_dramas.head(10))
print(len(just_dramas))
#Example 2
dramas_gt_9 = movies_df[(movies_df["Genre"]=="Drama") & (movies_df["My Rating"]>9.0)]
print(len(dramas_gt_9))
#Example 3
dramas_gt_9_sub = movies_df[(movies_df["Genre"]=="Drama") & (movies_df["My Rating"]>9.0)][["Movie Name", "Director"]]
print(dramas_gt_9_sub.head(10))
Movie Name Director Release Year My Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
21 Boyhood Richard Linklater 2014 9.75
26 American Beauty Sam Mendes 1999 9.70
39 Mystic River Clint Eastwood 2003 9.56
51 Hannah and Her Sisters Woody Allen 1986 9.47
52 In the Bedroom Todd Field 2001 9.47
57 Big Fish Tim Burton 2003 9.44
60 The Sweet Hereafter Atom Egoyan 1997 9.42
62 Stand By Me Rob Reiner 1986 9.40
Genre Own
0 Drama Yes
1 Drama Yes
21 Drama No
26 Drama Yes
39 Drama Yes
51 Drama Yes
52 Drama Yes
57 Drama Yes
60 Drama Yes
62 Drama Yes
321
23
Movie Name Director
0 Almost Famous Cameron Crowe
1 The Shawshank Redemption Frank Darabont
21 Boyhood Richard Linklater
26 American Beauty Sam Mendes
39 Mystic River Clint Eastwood
51 Hannah and Her Sisters Woody Allen
52 In the Bedroom Todd Field
57 Big Fish Tim Burton
60 The Sweet Hereafter Atom Egoyan
62 Stand By Me Rob Reiner
Another option for performing queries is to use the query() method provided by Pandas. When using this method, the query will need to be provided as an expression in string form. Also, spaces in column names can be problematic, so spaces should be removed or replaced with underscores.
#Remove spaces in column names using list comprehension
movies_df.columns = [column.replace(" ", "_") for column in movies_df.columns]
#Example 1
just_dramas = movies_df.query('Genre=="Drama"')
print(just_dramas.head(10))
print(len(just_dramas))
#Example 2
dramas_gt_9 = movies_df.query('Genre=="Drama" and My_Rating > 9.0')
print(len(dramas_gt_9))
#Example 3
dramas_gt_9_sub = movies_df.query('Genre=="Drama" and My_Rating > 9.0')[["Movie_Name", "Director"]]
print(dramas_gt_9_sub.head(10))
Movie_Name Director Release_Year My_Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
21 Boyhood Richard Linklater 2014 9.75
26 American Beauty Sam Mendes 1999 9.70
39 Mystic River Clint Eastwood 2003 9.56
51 Hannah and Her Sisters Woody Allen 1986 9.47
52 In the Bedroom Todd Field 2001 9.47
57 Big Fish Tim Burton 2003 9.44
60 The Sweet Hereafter Atom Egoyan 1997 9.42
62 Stand By Me Rob Reiner 1986 9.40
Genre Own
0 Drama Yes
1 Drama Yes
21 Drama No
26 Drama Yes
39 Drama Yes
51 Drama Yes
52 Drama Yes
57 Drama Yes
60 Drama Yes
62 Drama Yes
321
23
Movie_Name Director
0 Almost Famous Cameron Crowe
1 The Shawshank Redemption Frank Darabont
21 Boyhood Richard Linklater
26 American Beauty Sam Mendes
39 Mystic River Clint Eastwood
51 Hannah and Her Sisters Woody Allen
52 In the Bedroom Todd Field
57 Big Fish Tim Burton
60 The Sweet Hereafter Atom Egoyan
62 Stand By Me Rob Reiner
Once a query is complete, you may want to save the result back to a file on your local machine. The code below provides an example for saving out the last subset of data to a CSV file. The Pandas documentation provides examples for saving to other formats.
dramas_gt_9_sub.to_csv("D:/data/matts_movies_sub.csv", sep=",", header=True)
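Note that to_csv() also writes the row index as an extra column by default. If you don't want that behavior, the index argument can be set to False, as sketched below.
# index=False prevents the row labels from being written as a separate column
dramas_gt_9_sub.to_csv("D:/data/matts_movies_sub.csv", sep=",", header=True, index=False)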
Missing Values
The NULL, NoData, or missing data indicator in Pandas is NaN. To begin exploring missing values, let's recode some of the data to NaN in the movies data set. In the example below, I am changing the "Drama" and "Fantasy" genres to NaN. I am also recoding any ratings between 8.9 and 9.9 to NaN. The replace() method is used to change the categories, while the mask() method is used to recode the rating values. np.nan is the NumPy constant used to represent a null value.
movies_nan = movies_df.copy()
movies_nan["Genre"] = movies_nan[["Genre"]].replace(["Drama", "Fantasy"], np.nan)
movies_nan['My_Rating'].mask(movies_nan['My_Rating'].between(8.9, 9.9), inplace=True)
print(movies_nan.head(10))
print(len(movies_nan))
Movie_Name Director Release_Year My_Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
2 Groundhog Day Harold Ramis 1993 9.96
3 Donnie Darko Richard Kelly 2001 9.95
4 Children of Men Alfonso Cuaron 2006 9.94
5 Annie Hall Woody Allen 1977 9.93
6 Rushmore Wes Anderson 1998 9.92
7 Memento Christopher Nolan 2000 9.91
8 No Country for Old Men Joel and Ethan Coen 2007 NaN
9 Seven David Fincher 1995 NaN
Genre Own
0 NaN Yes
1 NaN Yes
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
8 Thriller Yes
9 Thriller Yes
1852
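Before dropping or filling missing values, it is often helpful to count how many there are. One option, sketched below, combines the isna() and sum() methods to tally the missing values in each column.
# isna() flags each missing cell as True; summing counts the missing values per column
print(movies_nan.isna().sum())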
The dropna() method can be used to remove rows or columns that contain missing data. If the axis parameter is set to 0, rows with missing values in any column will be removed. If it is set to 1, columns with missing data in any row will be removed.
movies_drop = movies_nan.dropna(axis=0)
print(movies_drop.head(10))
print(len(movies_drop))
movies_dropc = movies_nan.dropna(axis=1)
print(movies_dropc.head(10))
print(len(movies_dropc))
Movie_Name Director Release_Year My_Rating \
2 Groundhog Day Harold Ramis 1993 9.96
3 Donnie Darko Richard Kelly 2001 9.95
4 Children of Men Alfonso Cuaron 2006 9.94
5 Annie Hall Woody Allen 1977 9.93
6 Rushmore Wes Anderson 1998 9.92
7 Memento Christopher Nolan 2000 9.91
127 After Hours Martin Scorsese 1985 8.89
128 Shotgun Stories Jeff Nichols 2007 8.89
129 The Untouchables Brian de Palma 1987 8.89
130 2 Days in Paris Julie Delpy 2007 8.88
Genre Own
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
127 Comedy Yes
128 Independent Yes
129 Action Yes
130 Romance Yes
1420
Movie_Name Director Release_Year Own
0 Almost Famous Cameron Crowe 2000 Yes
1 The Shawshank Redemption Frank Darabont 1994 Yes
2 Groundhog Day Harold Ramis 1993 Yes
3 Donnie Darko Richard Kelly 2001 Yes
4 Children of Men Alfonso Cuaron 2006 Yes
5 Annie Hall Woody Allen 1977 Yes
6 Rushmore Wes Anderson 1998 Yes
7 Memento Christopher Nolan 2000 Yes
8 No Country for Old Men Joel and Ethan Coen 2007 Yes
9 Seven David Fincher 1995 Yes
1852
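As a side note, the subset argument of dropna() limits the check to specific columns. The sketch below would drop only rows that are missing a genre, keeping rows that are missing a rating.
# Only rows with a missing Genre value are removed
movies_drop_genre = movies_nan.dropna(axis=0, subset=["Genre"])
print(len(movies_drop_genre))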
The .fillna() method can be used to replace NA values with another value or string. In the example, I am changing the missing genres to "Unknown Genre".
movies_nan["Genre"] = movies_nan["Genre"].fillna(value="Unknown Genre")
print(movies_nan.head(10))
Movie_Name Director Release_Year My_Rating \
0 Almost Famous Cameron Crowe 2000 9.99
1 The Shawshank Redemption Frank Darabont 1994 9.98
2 Groundhog Day Harold Ramis 1993 9.96
3 Donnie Darko Richard Kelly 2001 9.95
4 Children of Men Alfonso Cuaron 2006 9.94
5 Annie Hall Woody Allen 1977 9.93
6 Rushmore Wes Anderson 1998 9.92
7 Memento Christopher Nolan 2000 9.91
8 No Country for Old Men Joel and Ethan Coen 2007 NaN
9 Seven David Fincher 1995 NaN
Genre Own
0 Unknown Genre Yes
1 Unknown Genre Yes
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
8 Thriller Yes
9 Thriller Yes
It is also possible to replace null values with a statistic derived from the available values. In the example, I am replacing the missing ratings with the mean of all available ratings. This would likely not be valid in this case, since the ratings that were recoded to NaN were all high values, so the mean of the remaining ratings is not representative of them. This is just an example.
movies_nan["My_Rating"] = movies_nan["My_Rating"].fillna(value=movies_nan["My_Rating"].mean())
print(movies_nan.head(10))
Movie_Name Director Release_Year My_Rating \
0 Almost Famous Cameron Crowe 2000 9.990000
1 The Shawshank Redemption Frank Darabont 1994 9.980000
2 Groundhog Day Harold Ramis 1993 9.960000
3 Donnie Darko Richard Kelly 2001 9.950000
4 Children of Men Alfonso Cuaron 2006 9.940000
5 Annie Hall Woody Allen 1977 9.930000
6 Rushmore Wes Anderson 1998 9.920000
7 Memento Christopher Nolan 2000 9.910000
8 No Country for Old Men Joel and Ethan Coen 2007 6.786509
9 Seven David Fincher 1995 6.786509
Genre Own
0 Unknown Genre Yes
1 Unknown Genre Yes
2 Comedy Yes
3 Sci-Fi Yes
4 Sci-Fi Yes
5 Comedy Yes
6 Independent Yes
7 Thriller Yes
8 Thriller Yes
9 Thriller Yes
Grouping and Summarizing DataFrames
Pandas provides methods for summarizing data as described in the examples below. First, I am creating individual statistics and saving them to variables. I then create a Series from a dictionary of these statistics, convert it to a DataFrame using the to_frame() method, then transpose the DataFrame using transpose().
movies_df = pd.read_csv("D:/data/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns = [column.replace(" ", "_") for column in movies_df.columns]
mov_cnt = movies_df["My_Rating"].count()
mov_mn = movies_df["My_Rating"].mean()
mov_max = movies_df["My_Rating"].max()
mov_min = movies_df["My_Rating"].min()
mov_rang = mov_max-mov_min
mov_stats= pd.Series({"Count": mov_cnt, "Mean": mov_mn, "Max": mov_max, "Min": mov_min, "Range": mov_rang}).to_frame().transpose()
print(mov_stats)
Count Mean Max Min Range
0 1852.0 6.952527 9.99 0.67 9.32
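As an alternative to calculating each statistic separately, the agg() method accepts a list of statistic names and returns them in a single call; the range would still need to be derived from the max and min. A brief sketch:
# agg() computes several summary statistics at once and returns them as a Series
print(movies_df["My_Rating"].agg(["count", "mean", "max", "min"]))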
It is also possible to obtain summary statistics for each group separately by applying the very useful groupby() method. In the example below, I am obtaining stats for each genre and saving them into a DataFrame. The columns do not need to be defined beforehand.
movies_df = pd.read_csv("D:/data/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]
genre_stats = pd.DataFrame()
genre_stats["Count"] = movies_df.groupby("Genre")["My_Rating"].count()
genre_stats["Mean"] = movies_df.groupby("Genre")["My_Rating"].mean()
genre_stats["Max"] = movies_df.groupby("Genre")["My_Rating"].max()
genre_stats["Min"] = movies_df.groupby("Genre")["My_Rating"].min()
genre_stats["Range"] = genre_stats["Max"] - genre_stats["Min"]
print(genre_stats)
Count Mean Max Min Range
Genre
Action 197 6.651878 9.87 0.67 9.20
Classic 47 6.741064 8.33 4.03 4.30
Comedy 275 6.738545 9.96 1.76 8.20
Documentary 78 7.283718 8.87 5.43 3.44
Drama 321 7.140561 9.99 0.88 9.11
Family 84 6.670714 9.86 1.01 8.85
Fantasy 21 7.473810 9.85 6.21 3.64
Foreign 157 7.453439 9.63 3.76 5.87
Horror 96 6.523021 9.60 1.65 7.95
Independent 190 6.940263 9.92 1.81 8.11
Romance 94 7.238723 9.84 3.45 6.39
Sci-Fi 62 6.872419 9.95 1.92 8.03
Sports 23 6.391304 8.78 2.94 5.84
Thriller 164 7.095976 9.91 3.11 6.80
War 27 6.542593 8.67 4.37 4.30
Western 14 6.960000 8.65 4.53 4.12
comedy 1 6.030000 6.03 6.03 0.00
thriller 1 5.750000 5.75 5.75 0.00
You may have noticed above that there are some issues with the Genre names. For example, "Comedy" and "Thriller" show up twice in the summary because these genres were not always capitalized. This can be fixed.
In the example below, I am using the unique() method to obtain a list of all genre names. I then define a list of new names in which the capitalization is consistent. Next, I use the replace() method to change the original name set to my new name set and fix the capitalization issues.
If I re-execute the code from above, you can see that the duplicates have been removed in the summary.
orig_names = list(movies_df["Genre"].unique())
print(orig_names)
new_names = ['Drama', 'Comedy', 'Sci-Fi', 'Independent', 'Thriller', 'Action', 'Family',
'Fantasy', 'Romance', 'Foreign', 'Horror', 'Documentary', 'Sports', 'War',
'Western', 'Classic', 'Comedy', 'Thriller']
movies_df["Genre"] = movies_df["Genre"].replace(orig_names, new_names)
list(movies_df["Genre"].unique())
genre_stats = pd.DataFrame()
genre_stats["Count"] = movies_df.groupby("Genre")["My_Rating"].count()
genre_stats["Mean"] = movies_df.groupby("Genre")["My_Rating"].mean()
genre_stats["Max"] = movies_df.groupby("Genre")["My_Rating"].max()
genre_stats["Min"] = movies_df.groupby("Genre")["My_Rating"].min()
genre_stats["Range"] = genre_stats["Max"] - genre_stats["Min"]
print(genre_stats)
['Drama', 'Comedy', 'Sci-Fi', 'Independent', 'Thriller', 'Action', 'Family', 'Fantasy', 'Romance', 'Foreign', 'Horror', 'Documentary', 'Sports', 'War', 'Western', 'Classic', 'comedy', 'thriller']
Count Mean Max Min Range
Genre
Action 197 6.651878 9.87 0.67 9.20
Classic 47 6.741064 8.33 4.03 4.30
Comedy 276 6.735978 9.96 1.76 8.20
Documentary 78 7.283718 8.87 5.43 3.44
Drama 321 7.140561 9.99 0.88 9.11
Family 84 6.670714 9.86 1.01 8.85
Fantasy 21 7.473810 9.85 6.21 3.64
Foreign 157 7.453439 9.63 3.76 5.87
Horror 96 6.523021 9.60 1.65 7.95
Independent 190 6.940263 9.92 1.81 8.11
Romance 94 7.238723 9.84 3.45 6.39
Sci-Fi 62 6.872419 9.95 1.92 8.03
Sports 23 6.391304 8.78 2.94 5.84
Thriller 165 7.087818 9.91 3.11 6.80
War 27 6.542593 8.67 4.37 4.30
Western 14 6.960000 8.65 4.53 4.12
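An alternative to listing every replacement by hand is to normalize the strings directly with the .str accessor. For example, the sketch below title-cases each genre name, which would also collapse "comedy" into "Comedy" and "thriller" into "Thriller".
# str.title() capitalizes each word, making the genre labels consistent
movies_df["Genre"] = movies_df["Genre"].str.title()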
The describe() method can be used to obtain a set of default summary statistics for a column of data. Combining this with groupby() allows for the calculation of statistics by group.
print(movies_df.groupby("Genre")["My_Rating"].describe())
count mean std min 25% 50% 75% max
Genre
Action 197.0 6.651878 1.687098 0.67 5.8400 6.950 7.7300 9.87
Classic 47.0 6.741064 0.903364 4.03 6.3000 6.790 7.2550 8.33
Comedy 276.0 6.735978 1.494788 1.76 5.9025 6.800 7.7025 9.96
Documentary 78.0 7.283718 0.778165 5.43 6.8725 7.290 7.8800 8.87
Drama 321.0 7.140561 1.292603 0.88 6.5800 7.150 7.8900 9.99
Family 84.0 6.670714 1.654381 1.01 5.9550 6.930 7.4875 9.86
Fantasy 21.0 7.473810 1.113838 6.21 6.6500 7.030 7.9400 9.85
Foreign 157.0 7.453439 0.778073 3.76 6.9500 7.410 7.9200 9.63
Horror 96.0 6.523021 1.491829 1.65 5.9000 6.770 7.2425 9.60
Independent 190.0 6.940263 1.208283 1.81 6.3300 6.880 7.6525 9.92
Romance 94.0 7.238723 1.471279 3.45 6.4925 7.085 8.2150 9.84
Sci-Fi 62.0 6.872419 1.901382 1.92 5.9300 7.040 8.2000 9.95
Sports 23.0 6.391304 1.591958 2.94 5.5800 6.770 7.3000 8.78
Thriller 165.0 7.087818 1.427839 3.11 6.3200 7.120 8.0300 9.91
War 27.0 6.542593 1.096518 4.37 5.8400 6.560 7.2100 8.67
Western 14.0 6.960000 1.411033 4.53 6.0400 7.070 8.2275 8.65
Concatenate and Merge
The Pandas concat() function is used to concatenate DataFrames that have the same columns. This is comparable to copying and pasting the rows from two spreadsheets into a new spreadsheet. To demonstrate this, I first extract two sets of rows using index ranges. Next, I concatenate them back into a new DataFrame.
movies_df = pd.read_csv("D:/data/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]
movies_sub1 = movies_df[100:500]
movies_sub2 = movies_df[900:1300]
movies_subc = pd.concat([movies_sub1, movies_sub2])
print(len(movies_subc))
800
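Note that concat() keeps the row labels from each input DataFrame, so the combined index is not sequential and could even contain duplicates. Setting ignore_index to True, as sketched below, assigns a fresh default index instead.
# ignore_index=True discards the original row labels and assigns 0 through n-1
movies_subc = pd.concat([movies_sub1, movies_sub2], ignore_index=True)
print(len(movies_subc))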
Merging is comparable to a table join in SQL. This requires the use of keys and the declaration of a joining method, such as "left", "right", "inner", or "outer".
In the example, I first create a unique ID by copying the row index to a column. I then break the data into two components, each containing the ID and a subset of the remaining columns. I then use the merge() function to merge the DataFrames using the "inner" method and the common "id" field. "inner" will only return rows that occur in both data sets. Since both DataFrames were derived from the same original DataFrame, they have identical rows, so the result would be the same as using "left", where all rows from the left table are kept even if they don't occur in the right table, or "right", where all rows from the right table are kept even if they don't occur in the left table.
In the second example, I use a query to separate out only movies released in 2008. When I perform a join with all of the data using the "inner" method, I only get back the common or shared rows.
Note that there is also a join() method that joins based on indexes. However, that will not be demonstrated here.
movies_df = pd.read_csv("D:/data/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
movies_df.columns =[column.replace(" ", "_") for column in movies_df.columns]
movies_df["id"] = movies_df.index
print(movies_df.head())
movies_first = movies_df[["id", "Movie_Name"]]
movies_second = movies_df[["id", "Director", "Release_Year"]]
movies_merge = pd.merge(movies_first, movies_second, how="inner", on="id")
print(movies_merge.head())
print(len(movies_merge))
movies_third = movies_df.query('Release_Year == 2008')[["id", "Director", "Release_Year"]]
movies_merge2 = pd.merge(movies_first, movies_third, how="inner", on="id")
print(movies_merge2.head())
print(len(movies_merge2))
Movie_Name Director Release_Year My_Rating Genre \
0 Almost Famous Cameron Crowe 2000 9.99 Drama
1 The Shawshank Redemption Frank Darabont 1994 9.98 Drama
2 Groundhog Day Harold Ramis 1993 9.96 Comedy
3 Donnie Darko Richard Kelly 2001 9.95 Sci-Fi
4 Children of Men Alfonso Cuaron 2006 9.94 Sci-Fi
Own id
0 Yes 0
1 Yes 1
2 Yes 2
3 Yes 3
4 Yes 4
id Movie_Name Director Release_Year
0 0 Almost Famous Cameron Crowe 2000
1 1 The Shawshank Redemption Frank Darabont 1994
2 2 Groundhog Day Harold Ramis 1993
3 3 Donnie Darko Richard Kelly 2001
4 4 Children of Men Alfonso Cuaron 2006
1852
id Movie_Name Director Release_Year
0 11 Wall-E Andrew Stanton 2008
1 38 The Dark Knight Christopher Nolan 2008
2 100 In Bruges Martin McDonagh 2008
3 102 Gran Torino Clint Eastwood 2008
4 104 Let the Right One In Tomas Alfredson 2008
93
Concluding Remarks
There are many more topics that could be explored relating to Pandas. However, I have found the methods and techniques discussed here to be the ones I use most often and find most useful.
For more examples and details, please consult the documentation for Pandas.
If you need to perform a specific task that I did not cover, Google it! Python has a large user base, so it is likely that you will find examples similar to your needs with a simple search.
In the next module, we will discuss methods for graphing and visualizing data using matplotlib, Seaborn, and Pandas.