Data Structures and Object-Oriented Programming

Chapter 2: Pandas Library

1. Why Pandas?

Think of Pandas as the “Excel-plus” of Python: it lets you load data from many sources, rearrange it with a few commands, and save it back out—while keeping everything in regular Python code so it’s easy to automate or share.


2. The Two Core Building Blocks

Object    | What it represents                              | Quick mental picture
Series    | One-dimensional labelled array                  | A single column from a spreadsheet
DataFrame | Two-dimensional labelled table (rows × columns) | A whole spreadsheet tab

You’ll spend 90% of your time with DataFrame, but it helps to know that each column inside it is itself a Series.
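That relationship can be seen directly; a minimal sketch with a made-up two-column table (the names and scores are illustrative only):

```python
import pandas as pd

# A tiny made-up table: two columns, two rows
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})

col = df["Score"]           # selecting one column yields a Series
print(type(df).__name__)    # DataFrame
print(type(col).__name__)   # Series
```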


3. Reading Data In

You have …                                              | Use this function          | What you get back
A CSV file                                              | pd.read_csv("file.csv")    | A DataFrame with typed columns
An Excel file                                           | pd.read_excel("file.xlsx") | Same idea; column names come from the sheet
Many other formats (JSON, SQL, HTML tables, Parquet, …) | the pd.read_*() family     | The pattern is consistent

Under the hood: Pandas calls highly-optimized C/NumPy code, so even huge files load quickly.
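A minimal sketch of pd.read_csv: in real use you would pass a path such as "file.csv"; an in-memory buffer stands in for the file here so the example runs without anything on disk.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk
csv_text = "Name,Age,Score\nAlice,25,85\nBob,30,90\n"
df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))   # ['Name', 'Age', 'Score']
print(df["Age"].dtype)    # int64 -- columns are typed, not just strings
```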


4. Writing Data Out

After you finish cleaning or analysing, one line puts the result where colleagues can open it:

df.to_excel("results.xlsx", index=False)   # or df.to_csv(...)

Tip: index=False prevents the row numbers from becoming an extra column.
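The effect of index=False is easy to see with to_csv (the same idea applies to to_excel); writing to in-memory buffers keeps the sketch self-contained:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})

with_index, without_index = io.StringIO(), io.StringIO()
df.to_csv(with_index)                   # default: the index becomes a column
df.to_csv(without_index, index=False)   # index dropped

print(with_index.getvalue().splitlines()[0])     # ,Name,Score
print(without_index.getvalue().splitlines()[0])  # Name,Score
```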


5. Grabbing the Data You Need

Situation                                           | Use …                                  | Explanation
“Give me row 0, column 'Score' by its label.”       | df.loc[0, "Score"]                     | loc = location by label (row index / column name)
“Give me the second row (position 1) by number.”    | df.iloc[1]                             | iloc = integer location
“I want the raw NumPy matrix for machine-learning.” | df.values (or the newer df.to_numpy()) | Returns a 2-D ndarray

Mnemonic: loc = label, iloc = integer.
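The difference shows up clearly once the row labels are not 0, 1, 2. A short sketch with a hypothetical roster indexed by employee ID:

```python
import pandas as pd

# Hypothetical roster indexed by employee ID rather than 0, 1, 2
df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Score": [85, 90]},
    index=[101, 102],
)

print(df.loc[101, "Score"])   # 85 -> label-based: the row labelled 101
print(df.iloc[0]["Score"])    # 85 -> position-based: the first row
```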


6. A Mini-Walk-through (ties it all together)

import pandas as pd

df = pd.read_csv("sample.csv")          # 1. Load
df.to_excel("sample.xlsx", index=False) # 2. Save elsewhere

alice_score = df.loc[0, "Score"]        # 3a. Label-based lookup
second_row  = df.iloc[1]                # 3b. Position-based lookup
as_array    = df.values                 # 4. NumPy view

print("Read CSV:")
print(df)
print("Alice's Score:", alice_score)
print("Second row:")
print(second_row)

What you would see in the notebook/console:

Read CSV:
      Name  Age  Score
0    Alice   25     85
1      Bob   30     90
2  Charlie   35     88

Alice's Score: 85
Second row:
Name     Bob
Age       30
Score     90
Name: 1, dtype: object

7. Practical Tips & Gotchas

  1. Row labels matter. If your CSV has its own unique ID column (say “EmployeeID”), pass index_col="EmployeeID" to read_csv; then loc feels natural.
  2. Large files? Pass dtype hints or chunksize=... to read_csv to process the file in streaming-style pieces.
  3. Copy vs view. Chained indexing such as df["Age"][0] = 26 may write to a copy, and Pandas warns you (SettingWithCopyWarning). Use .loc to be explicit: df.loc[:, "Age"] += 1.
  4. NumPy interop. df.values (and df.to_numpy()) may share memory with the DataFrame, but only when every column has the same dtype; with mixed dtypes you get a copy. Pass copy=True to to_numpy() when you need a guaranteed independent array.
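A quick sketch of the copy-vs-view point, with made-up values; writing through .loc keeps the assignment unambiguous:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30], "Score": [85, 90]})

# Explicit label-based assignment: no SettingWithCopy ambiguity
df.loc[:, "Age"] += 1
print(df["Age"].tolist())   # [26, 31]
```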

8. Where to Go Next

  • Filtering: Boolean masks (df[df["Score"] > 90]).
  • Grouping + summarising: df.groupby("Age")["Score"].mean() (select the numeric column, or pass numeric_only=True, since recent Pandas versions refuse to average text columns).
  • Merging: SQL-style joins with pd.merge().
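A compact sketch of all three techniques on a made-up table (the department table joined at the end is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Cara"],
                   "Age": [25, 30, 25],
                   "Score": [85, 92, 88]})

top = df[df["Score"] > 90]                      # Boolean-mask filtering
avg_by_age = df.groupby("Age")["Score"].mean()  # grouping + summarising

depts = pd.DataFrame({"Name": ["Alice", "Bob"], "Dept": ["HR", "IT"]})
merged = pd.merge(df, depts, on="Name", how="left")  # SQL-style left join

print(top["Name"].tolist())   # ['Bob']
print(avg_by_age.loc[25])     # 86.5
```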

Master these three techniques alongside today’s basics, and you’ll cover most everyday data tasks.


Key takeaway: With pd.read_csv → DataFrame → loc/iloc → to_excel, you can already build a complete data pipeline: read, manipulate, and export. Everything else in Pandas extends or refines these fundamentals.
