Hello everyone! Today, I'm going to write about the Pandas library (link to the site). Pandas stands for "Python Data Analysis Library". According to the Wikipedia page on Pandas, "the name is taken from 'panel data', an econometrics term used to describe complex structured sets of data." Personally, I just think it's a cute name for a very useful Python library!
Pandas is an absolute game changer when it comes to analyzing data with Python, and it is one of the most popular and widely used tools for data wrangling and munging, if not the most popular one. Pandas is open source and completely free to use (under the BSD license), and was originally created by Wes McKinney.
What's cool about Pandas is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame, which looks very similar to a table in statistical software (think Excel or SPSS, for example; people who are familiar with R will recognize similarities as well). This is so much easier to work with than lists or dictionaries manipulated through for loops or list comprehensions (please feel free to read one of my earlier blog posts on basic data analysis with plain Python — the same tasks would have been much easier with Pandas!).
Installation and Getting Started
In order to "get" Pandas, you will need to install it. You also need to have Python 3.5.3 or higher as a prerequisite (it will also work with Python 3.6, 3.7, and 3.8). Pandas depends on other packages (like NumPy) and has optional dependencies (like Matplotlib for plotting). Therefore, I think the easiest way to get Pandas set up is to install it through a package like the Anaconda distribution, "a cross-platform distribution for data analysis and scientific computing." You can download it for Windows, OS X, and Linux. If you'd like to install it another way, here are the full installation instructions.
Before you can use Pandas in your Python IDE (Integrated Development Environment), such as Jupyter Notebook or Spyder (both ship with Anaconda by default), you first need to import the Pandas library. Importing a library means loading it into memory so that it's there for you to work with. To import Pandas, all you need to do is run the following code:
import pandas as pd
import numpy as np
Usually you'll add the second part ('as pd') so you can access Pandas with 'pd.command' instead of having to write 'pandas.command' every time you use it. It's also a good idea to import NumPy, since it's a very useful library for scientific computing with Python. Now Pandas is ready for use! Remember that you'll need to do this every time you start a new Jupyter Notebook, Spyder file, and so on.
Working with Pandas
Loading or Saving Data with Pandas
When you want to use Pandas for data analysis, you'll usually obtain your data in one of three ways:
Convert a Python list, dictionary, or NumPy array to a Pandas data frame
Open a local file with Pandas, usually a CSV file, but it could also be a delimited text file (like TSV) or an Excel spreadsheet
Open a remote file or database, such as a CSV or JSON on a website through a URL, or read from a SQL table or database
There are different commands for each of these options, but when you open a file, the command looks like this:
pd.read_filetype()
As I mentioned before, there are different file types Pandas can work with, so you would replace "filetype" with the actual type of your file (like csv), and put the name of the file, together with its path, inside the parentheses. Inside the parentheses you can also pass additional arguments that relate to how the file is opened. There are numerous arguments, and in order to know them all you'll have to read the documentation (for example, the documentation for pd.read_csv() lists all the arguments you can pass to that Pandas command).
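As a minimal sketch of what this looks like in practice (the column names and values here are made up, and an inline string stands in for a real file on disk):

```python
import io
import pandas as pd

# A small inline CSV stands in for a real file (hypothetical data).
csv_text = "name,year,score\nAda,1984,9.5\nGrace,1959,8.7\n"

# pd.read_csv accepts a file path or any file-like object, plus many
# optional arguments (sep, header, usecols, ...).
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3): two rows, three columns
```

With a real file you would simply write something like `pd.read_csv("my_file.csv")` instead of using the `StringIO` wrapper.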
To convert a certain Python object (a dictionary, list, etc.), the basic command is:
pd.DataFrame()
In the parentheses, specify the object(s) you're creating the data frame from. This command also takes a variety of optional arguments.
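For example, a dictionary of lists converts straight into a data frame, with the keys becoming column names (the data below is invented for illustration):

```python
import pandas as pd

# Build a data frame from a plain Python dictionary
# (column name -> list of values). Keys become column labels.
data = {"city": ["Oslo", "Lima", "Kyoto"],
        "population_m": [0.7, 9.7, 1.5]}
df = pd.DataFrame(data)
print(df)
```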
You can also save a data frame you're working with to different kinds of files (like CSV, Excel, JSON, and SQL tables). The general code is:
df.to_filetype(filename)
Viewing and Inspecting Data
Now that you've loaded your data, it's time to take a look at it. How does the data frame look? Running the name of the data frame will give you the entire table, but you can also get just the first n rows with df.head(n) or the last n rows with df.tail(n). df.shape gives you the number of rows and columns. df.info() gives the index, datatype, and memory information. The command s.value_counts(dropna=False) lets you view unique values and their counts for a series (like a column or a few columns). A very useful command is df.describe(), which outputs summary statistics for the numerical columns. You can also get statistics for the entire data frame or a series (a column, etc.):
df.mean() - returns the mean of all columns
df.corr() - returns the correlation between columns in the data frame
df.count() - returns the number of non-null values in each column of the data frame
df.max() - returns the highest value in each column
df.min() - returns the lowest value in each column
df.median() - returns the median of each column
df.std() - returns the standard deviation of each column
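To see a few of these inspection commands together on a toy data frame (the year/sales columns are invented):

```python
import pandas as pd

df = pd.DataFrame({"year": [1999, 2005, 2005, 2012],
                   "sales": [10.0, 14.5, 9.0, 20.0]})

print(df.head(2))                             # first 2 rows
print(df.shape)                               # (4, 2)
print(df["year"].value_counts(dropna=False))  # counts of unique years
print(df.describe())                          # summary stats, numeric cols
print(df["sales"].mean())                     # mean of one column
```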
Selection of Data
One of the things that is so much easier in Pandas is selecting the data you want, in comparison to selecting a value from a list or a dictionary. You can select a column (df[col]) and return it as a Series, or select a few columns (df[[col1, col2]]) and return the columns as a new DataFrame. You can select by position (s.iloc[0]) or by index label (s.loc['index_one']). To select the first row you can use df.iloc[0,:], and to select the first element of the first column you would run df.iloc[0,0]. These can be used in different combinations, so I hope this gives you a sense of the different selection and indexing options you can perform in Pandas.
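The selection options above can be sketched on a small example (column names and index labels here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30], "col2": ["a", "b", "c"]},
                  index=["x", "y", "z"])

series = df["col1"]            # one column -> Series
subset = df[["col1", "col2"]]  # several columns -> DataFrame
first_row = df.iloc[0, :]      # select first row by position
by_label = df.loc["y"]         # select a row by index label
cell = df.iloc[0, 0]           # first row, first column -> 10
print(cell)
```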
Sort, Filter and Groupby
You can use different conditions to filter columns. For example, df[df['year'] > 1984] would give you only the rows where the year column is greater than 1984. You can use & (and) or | (or) to add different conditions to your filtering. This is also called boolean filtering.
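Here's a short sketch of boolean filtering on made-up data; note that when combining conditions with & or |, each condition goes in its own parentheses:

```python
import pandas as pd

df = pd.DataFrame({"year": [1980, 1984, 1990, 2001],
                   "title": ["A", "B", "C", "D"]})

# Keep only rows whose year is greater than 1984.
recent = df[df["year"] > 1984]

# Combine conditions with & (and) / | (or); wrap each in parentheses.
eighties = df[(df["year"] >= 1980) & (df["year"] < 1990)]
print(recent)
print(eighties)
```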
You can sort values in a certain column in ascending order using df.sort_values(col1), and in descending order using df.sort_values(col2, ascending=False). Furthermore, it's possible to sort values by col1 in ascending order and then by col2 in descending order using df.sort_values([col1,col2], ascending=[True,False]).
The last command in this section is groupby. It involves splitting the data into groups based on some criteria, applying a function to each group independently, and then combining the results into a data structure. df.groupby(col) returns a groupby object grouped by the values of one column, while df.groupby([col1,col2]) returns a groupby object grouped by the values of multiple columns.
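Sorting and groupby can be sketched together on a toy data frame (the team/points columns are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"team": ["red", "blue", "red", "blue"],
                   "points": [3, 1, 5, 4]})

# Sort by one column, descending.
ranked = df.sort_values("points", ascending=False)

# Split-apply-combine: group by team, then sum the points per group.
totals = df.groupby("team")["points"].sum()
print(ranked)
print(totals)
```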
Data Cleaning
Data cleaning is a very important step in data analysis. For example, we always check for missing values in the data by running pd.isnull(), which checks for null values and returns an array of booleans (True for missing values and False for non-missing values). To get a sum of missing/null values, run pd.isnull().sum(). pd.notnull() is the opposite of pd.isnull(). After you get a list of missing values, you can get rid of them with df.dropna() to drop the rows, or df.dropna(axis=1) to drop the columns. A different approach is to fill the missing values with other values by using df.fillna(x), which fills the missing values with x (you can put in any value you want), or s.fillna(s.mean()) to replace all null values with the mean (the mean can be replaced with almost any function from the statistics section).
Sometimes it is necessary to replace values with different values. For example, s.replace(1,'one') would replace all values equal to 1 with 'one'. It's possible to do it for multiple values at once: s.replace([1,3],['one','three']) would replace all 1 with 'one' and 3 with 'three'. You can also rename specific columns by running df.rename(columns={'old_name': 'new_name'}), or use df.set_index('column_one') to change the index of the data frame.
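A compact sketch of the cleaning commands above, on a tiny data frame with deliberately missing values (column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [1.0, np.nan, 3.0],
                   "grade": ["a", "b", None]})

missing = df.isnull().sum()                       # null count per column
filled = df["score"].fillna(df["score"].mean())   # NaN -> column mean
dropped = df.dropna()                             # drop rows with any null
renamed = df.rename(columns={"grade": "letter"})  # rename a column
print(missing)
```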
Join/Combine
The last set of basic Pandas commands is for joining or combining data frames, or rows/columns. The three commands are:
df1.append(df2) - adds the rows in df2 to the end of df1 (columns must be identical)
pd.concat([df1, df2], axis=1) - adds the columns in df2 to the end of df1 (rows should be identical)
df1.join(df2, on=col1, how='inner') - SQL-style join of the columns in df1 with the columns of df2 where the rows of col1 have identical values. 'how' can be one of: 'left', 'right', 'outer', 'inner'
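A small sketch of combining data frames (invented data; note that in recent versions of Pandas, pd.concat is the usual way to stack rows, and df.merge gives the SQL-style join described above when both frames share a key column):

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b"], "left_val": [1, 2]})
df2 = pd.DataFrame({"key": ["a", "b"], "right_val": [10, 20]})

# Stack rows (columns must match) -- here df1 on top of a copy of itself.
stacked = pd.concat([df1, df1], ignore_index=True)

# Place columns side by side (rows should align).
side_by_side = pd.concat([df1, df2], axis=1)

# SQL-style inner join on the shared "key" column.
joined = df1.merge(df2, on="key", how="inner")
print(joined)
```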
These are the basic Pandas commands, but I hope you can see how powerful Pandas can be for data analysis. This post is just the tip of the iceberg; whole books can be (and have been) written about data analysis with Pandas. I hope it inspired you to take a dataset and start playing with it using Pandas! 🙂