Chapter 2: Pandas Library
1. Why Pandas?
Think of Pandas as the “Excel-plus” of Python: it lets you load data from many sources, rearrange it with a few commands, and save it back out—while keeping everything in regular Python code so it’s easy to automate or share.
2. The Two Core Building Blocks
| Object | What it represents | Quick mental picture |
|---|---|---|
| Series | One-dimensional labelled array | A single column from a spreadsheet |
| DataFrame | Two-dimensional labelled table (rows × columns) | A whole spreadsheet tab |
You’ll spend most of your time with DataFrame objects, but it helps to know that each column inside one is itself a Series.
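A minimal sketch of that relationship, assuming a small made-up table (the names and numbers are illustrative only):

```python
import pandas as pd

# Illustrative data only; any small table would do.
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35], "Score": [85, 90, 88]}
)

scores = df["Score"]                  # selecting one column returns a Series
print(type(df).__name__)              # DataFrame
print(type(scores).__name__)          # Series
print(scores.index.equals(df.index))  # True: the Series keeps the row labels
```

Because the Series carries the same row labels as its parent table, values stay aligned when you combine columns.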
3. Reading Data In
| You have … | Use this function | What you get back |
|---|---|---|
| A CSV file | pd.read_csv("file.csv") | A DataFrame with typed columns |
| An Excel file | pd.read_excel("file.xlsx") | Same idea—column names come from the sheet |
| Many other formats (JSON, SQL, HTML tables, Parquet…) | pd.read_*() family | The pattern is consistent |
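Under the hood: the default CSV parser in Pandas is written in C, so parsing is fast. By default the whole file is loaded into memory, though, so very large files may still need options such as usecols or chunksize.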
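To try the reader without a file on disk, the sketch below feeds read_csv an in-memory string. The CSV text is invented for illustration, and io.StringIO simply stands in for a real file.

```python
import io

import pandas as pd

# Made-up CSV content; in practice this would be a path such as "file.csv".
csv_text = "Name,Age,Score\nAlice,25,85\nBob,30,90\nCharlie,35,88\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (3, 3)
print(df.dtypes)  # one inferred dtype per column
```

Checking df.dtypes right after loading is a cheap way to catch a numeric column that was read as text.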
4. Writing Data Out
After you finish cleaning or analysing, one line puts the result where colleagues can open it:
df.to_excel("results.xlsx", index=False) # or df.to_csv(...)
Tip: index=False prevents the row labels (the index) from being written as an extra column.
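The effect of index=False is easiest to see with to_csv, which returns the CSV text when no path is given. This is a small sketch with invented data, not a recommendation to skip writing real files.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,Name,Score  (empty header for the index column)
print(without_index.splitlines()[0])  # Name,Score
```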
5. Grabbing the Data You Need
| Situation | Use … | Explanation |
|---|---|---|
| “Give me row 0, column 'Score' by its label.” | df.loc[0, "Score"] | loc = location by label (row index / column name) |
| “Give me the second row (position 1) by number.” | df.iloc[1] | iloc = integer location |
| “I want the raw NumPy matrix for machine-learning.” | df.to_numpy() (df.values is the older equivalent) | Returns a 2-D ndarray; columns of mixed types yield dtype object |
Mnemonic: loc - label, iloc - integer.
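The difference between the two accessors only shows up when row labels differ from row positions. The sketch below uses hypothetical labels 10, 20, 30 to make that visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Score": [85, 90, 88]},
    index=[10, 20, 30],  # hypothetical row labels, chosen to differ from positions
)

by_label = df.loc[10, "Score"]  # row labelled 10, column "Score" -> 85
by_position = df.iloc[1, 1]     # second row, second column -> 90

# Slices behave differently: loc includes the end label, iloc excludes the end position.
label_slice = df.loc[10:20]     # rows labelled 10 and 20
position_slice = df.iloc[0:1]   # only the first row

print(by_label, by_position, len(label_slice), len(position_slice))  # 85 90 2 1
```

With the default 0, 1, 2, ... index the two accessors return the same single rows, which is why the distinction is easy to overlook.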
6. A Mini-Walk-through (ties it all together)
import pandas as pd
df = pd.read_csv("sample.csv") # 1. Load
print("Read CSV:")
print(df)
df.to_excel("sample.xlsx", index=False) # 2. Save elsewhere (needs an Excel engine such as openpyxl)
alice_score = df.loc[0, "Score"] # 3a. Label-based lookup
print("Alice's Score:", alice_score)
second_row = df.iloc[1] # 3b. Position-based lookup
print("Second row:")
print(second_row)
as_array = df.to_numpy() # 4. NumPy array
What you would see in the notebook/console:
Read CSV:
Name Age Score
0 Alice 25 85
1 Bob 30 90
2 Charlie 35 88
Alice's Score: 85
Second row:
Name Bob
Age 30
Score 90
Name: 1, dtype: object
7. Practical Tips & Gotchas
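- Row labels matter. If your CSV has its own unique ID column (say “EmployeeID”), pass index_col="EmployeeID" to read_csv; then loc feels natural.
- Large files? Add dtype hints to reduce memory use, or chunksize=... to read the file in pieces.
- Copy vs view. Chained indexing such as df[df["Age"] > 30]["Score"] = 0 may modify a temporary copy instead of df, and older Pandas versions warn about it (SettingWithCopyWarning). Select and assign in a single .loc call instead: df.loc[df["Age"] > 30, "Score"] = 0.
- NumPy interop. Do not rely on df.values or df.to_numpy() sharing memory with the DataFrame: a single-dtype frame may return a view, while mixed dtypes always produce a copy. Pass copy=True to to_numpy() when you need an independent array.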
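The tips above can be sketched in one short script. The CSV text and the EmployeeID values are invented for illustration, and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

csv_text = "EmployeeID,Name,Age\nE1,Alice,25\nE2,Bob,30\nE3,Charlie,35\n"

# index_col turns the ID column into row labels, so loc can look rows up by ID.
df = pd.read_csv(io.StringIO(csv_text), index_col="EmployeeID")
print(df.loc["E2", "Name"])  # Bob

# chunksize makes read_csv return an iterator of DataFrames instead of one big frame.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
print(total_rows)  # 3

# One .loc call selects and assigns in a single step, so the edit lands on df itself.
df.loc[df["Age"] > 28, "Age"] += 1
print(df["Age"].tolist())  # [25, 31, 36]

# copy=True guarantees an independent array; editing it leaves df untouched.
arr = df.to_numpy(copy=True)
arr[0, 1] = -1
print(df["Age"].iloc[0])  # 25
```

The chunked loop only counts rows here; in real use each chunk would be filtered or aggregated before the next one is read.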
8. Where to Go Next
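- Filtering: Boolean masks (df[df["Score"] > 90]).
- Grouping + summarising: df.groupby("Age")["Score"].mean() (select the numeric column first; averaging a text column such as Name raises an error in recent Pandas versions).
- Merging: SQL-style joins with pd.merge().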
Master these three techniques alongside today’s basics, and you’ll cover most everyday data tasks.
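As a preview, here is a minimal sketch of all three techniques on invented data. The Team and Lead columns are hypothetical and exist only to give the grouping and the join something to work with.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Dana"],
        "Team": ["A", "B", "A", "B"],
        "Score": [85, 90, 88, 95],
    }
)

# Filtering: the comparison builds a Boolean mask, and df[mask] keeps the True rows.
high = df[df["Score"] > 88]
print(high["Name"].tolist())  # ['Bob', 'Dana']

# Grouping + summarising: mean Score per team.
team_mean = df.groupby("Team")["Score"].mean()
print(team_mean.to_dict())  # {'A': 86.5, 'B': 92.5}

# Merging: a left join adds each team's lead to every matching row.
teams = pd.DataFrame({"Team": ["A", "B"], "Lead": ["Eve", "Frank"]})
merged = pd.merge(df, teams, on="Team", how="left")
print(merged.shape)  # (4, 4)
```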
Key takeaway:
With pd.read_csv → DataFrame → loc/iloc → to_excel, you can already build a complete data pipeline—read, manipulate, and export. Everything else in Pandas extends or refines these fundamentals.