Chapter 3: Matplotlib (Plotting and Line of Best Fit)

1. Why visualize?

Pictures turn rows of numbers into patterns your eyes can spot instantly. Scatter plots show relationships, line plots show trends, and a “line of best fit” summarizes a messy cloud of points with one clean equation.

2. Building blocks

Task	“Plain-English” goal	Function(s)
Line plot	connect points in order	`plt.plot()`
Scatter plot	show individual dots	`plt.scatter()`
Fit a straight line	find the best-fit slope & intercept	`np.polyfit(x, y, 1)`
Turn those numbers into a function	evaluate the line at any x	`np.poly1d()`

Tip: the last argument 1 in np.polyfit means “degree 1 polynomial” → a straight line.
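
As a small sketch of how the last two rows of the table fit together, the snippet below hands the coefficients from np.polyfit to np.poly1d to get a callable line. The data points are made up for illustration and lie exactly on y = 2x + 1, so the fit should recover that slope and intercept.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

coeffs = np.polyfit(x, y, 1)   # degree 1 -> array of [slope, intercept]
line = np.poly1d(coeffs)       # callable: line(x) = slope * x + intercept

slope, intercept = coeffs
prediction = line(10)          # evaluate the fitted line at x = 10
print(f"slope={slope:.2f}, intercept={intercept:.2f}, line(10)={prediction:.2f}")
```

Note that np.polyfit returns the highest-degree coefficient first, which is why the slope comes before the intercept.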

3. A step-by-step mini-example

import matplotlib.pyplot as plt
import numpy as np

x = np.array([1,2,3,4,5])
y = np.array([2.1,3.8,6.2,7.9,9.8])

plt.scatter(x, y, color="red")           # 1. dots
m, b = np.polyfit(x, y, 1)               # 2. slope (m) & intercept (b)
plt.plot(x, m*x + b, label=f"y = {m:.2f}x {b:+.2f}")   # 3. fitted line; {b:+.2f} prints the sign, so a negative intercept reads correctly
plt.legend(); plt.xlabel("x"); plt.ylabel("y")
plt.grid(True); plt.show()

What just happened?

Scatter: raw data in red.
np.polyfit: finds the least-squares line, the one that minimizes the sum of squared vertical distances between the points and the line.
Plot: draws that line over the dots so you can judge the fit by eye.
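
To make "least-squares" concrete, here is a minimal sketch that reuses the five points from the mini-example and compares the fitted line against two slightly nudged lines. The helper name sum_squared_residuals is made up for this illustration; it is not a NumPy function.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 9.8])

def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical gaps between the data and a candidate line."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

m, b = np.polyfit(x, y, 1)

best = sum_squared_residuals(m, b)
steeper = sum_squared_residuals(m + 0.1, b)   # nudge the slope
higher = sum_squared_residuals(m, b + 0.1)    # nudge the intercept

print(f"fitted line:  {best:.4f}")
print(f"steeper line: {steeper:.4f}")
print(f"higher line:  {higher:.4f}")
```

Any nudge away from the fitted slope or intercept makes the total larger, which is what "best fit" means here.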

4. Bringing Pandas to the party

The PDF example works with a Student_Grades.csv file. Here are the everyday Pandas moves you see there:

Goal	One-liner
Read the file	`df = pd.read_csv("Student_Grades.csv")`
Replace missing values	`df["Scores"] = df["Scores"].fillna(df["Scores"].mean())` (or any fill value; assign the result back, since it is not applied in place)
Quick scatter straight from Pandas	`df.plot(kind="scatter", x="Hours", y="Scores")`
Basic stats	`df["FinalExam"].describe()` (or `.min()`, `.max()`, `.mean()`)
Filter rows containing “A”	`df[df["Grade"].str.contains("A", na=False)]` (`na=False` skips rows with a missing grade)
Set an index for easy slicing	`df = df.set_index("Hours").sort_index()`
Slice by label	`df.loc[1.1:5.5]`
Slice by position	`df.iloc[0:3]`
Single value fast	`df.at[1.1, "Scores"]`, `df.iat[2, 4]`
Rename	`df = df.rename(columns={"Hours": "Hours Studied"})`

These tools let you clean, summarize, and grab exactly the subset you need before plotting.
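
The sketch below tries several of these moves on a tiny in-memory table, so it runs without the CSV file. The values are invented stand-ins for Student_Grades.csv, and fillna on a single column is used as one way to fill the missing score.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for Student_Grades.csv (values are illustrative only)
df = pd.DataFrame({
    "Hours":  [1.1, 2.5, 3.2, 5.5, 8.3],
    "Scores": [17.0, np.nan, 27.0, 60.0, 81.0],
    "Grade":  ["F", "F", "F", "C", "A"],
})

# Fill the missing score with the mean of the scores that are present
df["Scores"] = df["Scores"].fillna(df["Scores"].mean())

# Keep only rows whose Grade contains "A"
a_students = df[df["Grade"].str.contains("A", na=False)]

# Index by Hours so rows can be sliced by label
by_hours = df.set_index("Hours").sort_index()
label_slice = by_hours.loc[1.1:5.5]       # label slice: both ends included
position_slice = by_hours.iloc[0:3]       # position slice: end excluded
one_score = by_hours.at[1.1, "Scores"]    # single value looked up by label

print(label_slice)
print(len(position_slice), one_score)
```

The difference worth noticing is that loc includes the end label while iloc excludes the end position.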

5. Regression fit in practice

With cleaned data:

x = df["Hours"].to_numpy()
y = df["Scores"].to_numpy()

m, b = np.polyfit(x, y, 1)      # slope & intercept
x_line = np.linspace(0, 10, 100)
plt.scatter(x, y, s=8)
plt.plot(x_line, m*x_line + b, c="red", label="Best-fit")
plt.xlim(0,10); plt.ylim(0,100)
plt.xlabel("Hours studied"); plt.ylabel("Score")
plt.legend(); plt.show()
print(f"Slope: {m:.2f},  Intercept: {b:.2f}")

Interpretation

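Slope (m) ≈ 9.8 means each extra study hour goes with a score about 9.8 points higher, on average. This describes the trend in the data; it does not prove the extra hour causes the gain.
Intercept (b) ≈ 2.5 means the line predicts about 2.5 points for a student who studied 0 hours (noise, luck, or guessing!).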
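
As a quick arithmetic check on that reading, the sketch below plugs the approximate slope and intercept quoted above into the line equation. The helper name predict_score is hypothetical, chosen only for this example.

```python
# Approximate slope and intercept quoted in the interpretation above
m, b = 9.8, 2.5

def predict_score(hours):
    """Score predicted by the fitted line for a given number of study hours."""
    return m * hours + b

for hours in (0, 2, 5):
    print(f"{hours} h -> {predict_score(hours):.1f} points")

# The gap between two predictions one hour apart equals the slope
gain_per_hour = predict_score(3) - predict_score(2)
```

Predictions far outside the range of hours in the data should be treated with caution, since the line was only fitted to the points it saw.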

6. Common gotchas & tips

Always drop or fill NaNs before feeding arrays to np.polyfit; a single missing value can make the fit fail or come back as NaN.
Sort your index when you plan to slice by labels (df.sort_index()).
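Save vs. show:
- plt.show() displays the figure now.
- plt.savefig("figure.png") stores it for reports. To do both, call plt.savefig() first and plt.show() second; once the window from show() is closed, the figure may be gone and the saved file can come out blank.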
Scale matters: if x-values span 1 – 1 000 000, center or standardize first; a poorly conditioned fit can lose precision in np.polyfit, and the problem gets worse at higher degrees.
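
The first and last tips can be sketched together as follows. This is one possible approach, not the only one: the data are made up (they follow y = 2x + 1 with one missing value), np.isfinite is used to build the mask, and centering is done by subtracting the mean of x.

```python
import numpy as np

# Made-up data: y = 2x + 1, one missing y-value, x spread over a huge range
x = np.array([1.0, 250_000.0, 500_000.0, 750_000.0, 1_000_000.0])
y = np.array([3.0, np.nan, 1_000_001.0, 1_500_001.0, 2_000_001.0])

# Tip 1: keep only the pairs where both x and y are real numbers
mask = np.isfinite(x) & np.isfinite(y)
x_clean, y_clean = x[mask], y[mask]

# Last tip: center x before fitting, then undo the shift afterwards
x_mean = x_clean.mean()
m, b_centered = np.polyfit(x_clean - x_mean, y_clean, 1)
b = b_centered - m * x_mean   # intercept in the original x units

print(f"slope={m:.4f}, intercept={b:.4f}")
```

Centering leaves the slope unchanged and only shifts the intercept, which is why a single subtraction is enough to convert back.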

7. Mini-practice (try these on your own data)

Make a scatter plot of Hours vs FinalExam directly with Pandas.
Use df.describe() to find the mean and std-dev of Scores.
Write a loop (or just df["Scores"].min()) to find the smallest score.
Convert Hours to strings, set it as the index, then slice everything from "1.0" to "3.5".
Generate 10 random student rows, deduce their letter grades, and append them to Student_Grades.csv (see Homework section in the PDF for hints).

Key takeaways

Matplotlib draws; NumPy calculates; Pandas organizes.
A line of best fit is just a 1st-degree polynomial from np.polyfit.
Pandas indexing (loc, iloc, at, iat) lets you grab data precisely.
Clean data → explore with stats → visualize → model → interpret.

Data Structures and Object-Oriented Programming