Chapter 3: Matplotlib (Plotting and Line of Best Fit)
1. Why visualize?
Pictures turn rows of numbers into patterns your eyes can spot instantly. Scatter plots show relationships, line plots show trends, and a “line of best fit” summarizes a messy cloud of points with one clean equation.
2. Building blocks
| Task | “Plain-English” goal | Function(s) |
|---|---|---|
| Line plot | connect points in order | plt.plot() |
| Scatter plot | show individual dots | plt.scatter() |
| Fit a straight line | find the best-fit slope & intercept | np.polyfit(x, y, 1) |
| Turn those numbers into a function | evaluate the line at any x | np.poly1d() |
Tip: the last argument 1 in np.polyfit means “degree 1 polynomial” → a straight line.
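The last two rows of the table work as a pair: np.polyfit returns the coefficients, and np.poly1d wraps them in a callable. The sketch below is only an illustration on a small made-up data set (the arrays are assumptions, not data from the course files).

```python
import numpy as np

# Illustrative data (assumed for this sketch)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 9.8])

# polyfit returns coefficients from the highest power down: [slope, intercept]
coeffs = np.polyfit(x, y, 1)

# poly1d turns those coefficients into a function you can call at any x
line = np.poly1d(coeffs)

print(coeffs)    # roughly [1.95, 0.11]
print(line(6))   # predicted y at x = 6, roughly 11.81
print(line(x))   # same as coeffs[0] * x + coeffs[1]
```

Using np.poly1d saves you from retyping m*x + b each time, and it keeps working unchanged if you later raise the degree.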
3. A step-by-step mini-example
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1,2,3,4,5])
y = np.array([2.1,3.8,6.2,7.9,9.8])
plt.scatter(x, y, color="red") # 1. dots
m, b = np.polyfit(x, y, 1) # 2. slope (m) & intercept (b)
plt.plot(x, m*x + b, label=f"y={m:.2f}x+{b:.2f}")
plt.legend(); plt.xlabel("x"); plt.ylabel("y")
plt.grid(True); plt.show()
What just happened?
- Scatter: raw data in red.
- np.polyfit: finds the least-squares line, the one that minimizes the sum of squared vertical distances between the points and the line.
- Plot the equation so you can judge the fit by eye.
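To see what "least squares" means in practice, one option is to compute the slope and intercept by hand with the standard closed-form formulas and compare them with np.polyfit. This is a minimal sketch that reuses the mini-example data; the variable names (m_manual, b_manual, sse) are chosen here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 9.8])

# Closed-form least-squares solution for a straight line
x_mean, y_mean = x.mean(), y.mean()
m_manual = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b_manual = y_mean - m_manual * x_mean

# The same fit via NumPy
m, b = np.polyfit(x, y, 1)

# Residuals are the vertical gaps between each point and the line
residuals = y - (m * x + b)
sse = np.sum(residuals ** 2)   # the quantity least squares minimizes

print(m_manual, b_manual)      # should match m and b up to rounding
print(sse)
```

Nudging m or b away from the fitted values and recomputing the sum of squared residuals should always give a larger number, which is a quick way to convince yourself the line really is the best one by this measure.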
4. Bringing Pandas to the party
The PDF example works with a Student_Grades.csv file. Here are the everyday Pandas moves you see there:
| Goal | One-liner |
|---|---|
| Read the file | df = pd.read_csv("Student_Grades.csv") |
| Replace missing values | df = df.replace(np.nan, np.nanmean(df["Scores"])) (fills NaN in every column with the Scores mean; assign the result to keep it, or use any other fill value) |
| Quick scatter straight from Pandas | df.plot(kind="scatter", x="Hours", y="Scores") |
| Basic stats | df["FinalExam"].describe() (or .min(), .max(), .mean()) |
| Filter rows containing “A” | df[df["Grade"].str.contains("A")] |
| Set an index for easy slicing | df = df.set_index("Hours").sort_index() |
| Slice by label | df.loc[1.1:5.5] |
| Slice by position | df.iloc[0:3] |
| Single value fast | df.at[1.1, "Scores"], df.iat[2, 4] |
| Rename | df.rename(columns={"Hours": "Hours Studied"}) |
These tools let you clean, summarize, and grab exactly the subset you need before plotting.
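Since Student_Grades.csv is not reproduced here, the sketch below tries several of the moves from the table on a tiny in-memory DataFrame. The column names follow the table, but the values are invented for illustration, and the column-specific fillna call is an alternative to the replace one-liner rather than the PDF's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Student_Grades.csv (values are made up)
df = pd.DataFrame({
    "Hours": [2.5, 1.1, 5.5, 3.2, 8.5],
    "Scores": [21.0, 17.0, np.nan, 27.0, 75.0],
    "Grade": ["F", "F", "C", "F", "A"],
})

# Fill the missing score with the mean of that column only
df["Scores"] = df["Scores"].fillna(df["Scores"].mean())

# Use Hours as a sorted index so label slicing behaves predictably
df = df.set_index("Hours").sort_index()

by_label = df.loc[1.1:5.5]      # label slice: both endpoints included
by_position = df.iloc[0:3]      # position slice: end excluded, first 3 rows
one_value = df.at[1.1, "Scores"]             # single cell by label
a_students = df[df["Grade"].str.contains("A")]

print(by_label)
print(by_position)
print(one_value)
print(a_students)
```

Note the difference between the two slices: loc includes the end label, while iloc follows ordinary Python slicing and stops before the end position.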
5. Regression fit in practice
With cleaned data:
x = df["Hours"].to_numpy()
y = df["Scores"].to_numpy()
m, b = np.polyfit(x, y, 1) # slope & intercept
x_line = np.linspace(0, 10, 100)
plt.scatter(x, y, s=8)
plt.plot(x_line, m*x_line + b, c="red", label="Best-fit")
plt.xlim(0,10); plt.ylim(0,100)
plt.xlabel("Hours studied"); plt.ylabel("Score")
plt.legend(); plt.show()
print(f"Slope: {m:.2f}, Intercept: {b:.2f}")
Interpretation
- Slope (m) ≈ 9.8 means each extra study hour is associated with a score about 9.8 points higher, on average.
- Intercept (b) ≈ 2.5 means the line predicts a score of ~2.5 points for a student who studied 0 hours (noise, luck, or guessing!).
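Once you have a slope and intercept, prediction is plain arithmetic. The sketch below plugs in the approximate values quoted above (9.8 and 2.5); predict_score is a hypothetical helper name, not something from the PDF or from NumPy.

```python
# Approximate slope and intercept quoted in the interpretation above
m, b = 9.8, 2.5

def predict_score(hours):
    """Evaluate the fitted line y = m * hours + b (hypothetical helper)."""
    return m * hours + b

print(predict_score(0))                      # the intercept, about 2.5
print(predict_score(5))                      # about 51.5
print(predict_score(6) - predict_score(5))   # one extra hour: about 9.8
```

Predictions are most trustworthy inside the range of hours the data actually covers; the line says nothing reliable about, say, 20 hours of study if no student studied that long.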
6. Common gotchas & tips
- Always drop or fill NaNs before feeding arrays to np.polyfit; missing values break the math.
- Sort your index when you plan to slice by labels (df.sort_index()).
- Save vs. show: plt.show() pops the window now; plt.savefig("figure.png") stores it for reports. To do both, call plt.savefig() before plt.show(), because the figure may already be cleared once the window is closed.
- Scale matters: if x-values span 1 to 1 000 000, center or standardize first; numerical instability can skew np.polyfit.
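Two of the tips above, dropping NaNs and centering x, can be sketched with plain NumPy. The arrays are invented for illustration, and the boolean-mask approach is just one reasonable way to drop incomplete pairs.

```python
import numpy as np

# Made-up data with a missing value in each array
x = np.array([1.0, 2.0, 3.0, np.nan, 5.0, 6.0])
y = np.array([2.0, 4.1, np.nan, 8.2, 9.9, 12.1])

# Keep only the pairs where both x and y are present
mask = ~np.isnan(x) & ~np.isnan(y)
x_clean, y_clean = x[mask], y[mask]

m, b = np.polyfit(x_clean, y_clean, 1)

# Centering: subtract the mean of x before fitting.
# The slope is unchanged; the intercept now refers to x = x_mean.
x_mean = x_clean.mean()
m_c, b_c = np.polyfit(x_clean - x_mean, y_clean, 1)
b_back = b_c - m_c * x_mean   # convert back to the original x scale

print(len(x_clean))   # 4 complete pairs remain
print(m, b)
print(m_c, b_back)    # should match m and b up to rounding
```

On small, well-scaled data like this, centering makes no visible difference; it starts to matter when x-values are very large or when you fit higher-degree polynomials.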
7. Mini-practice (try these on your own data)
- Make a scatter plot of Hours vs FinalExam directly with Pandas.
- Use df.describe() to find the mean and std-dev of Scores.
- Write a loop (or just df["Scores"].min()) to find the smallest score.
- Convert Hours to strings, set it as the index, then slice everything from "1.0" to "3.5".
- Generate 10 random student rows, deduce their letter grades, and append them to Student_Grades.csv (see Homework section in the PDF for hints).
Key takeaways
- Matplotlib draws; NumPy calculates; Pandas organizes.
- A line of best fit is just a 1st-degree polynomial from np.polyfit.
- Pandas indexing (loc, iloc, at, iat) lets you grab data precisely.
- Clean data → explore with stats → visualize → model → interpret.