Data Structures and Object-Oriented Programming

Chapter 3: Matplotlib (Plotting and Line of Best Fit)

1. Why visualize?

Pictures turn rows of numbers into patterns your eyes can spot instantly. Scatter-plots show relationships, line-plots show trends, and a “line of best fit” summarizes a messy cloud of points with one clean equation.


2. Building blocks

Task                                | “Plain-English” goal                 | Function(s)
Line plot                           | connect points in order              | plt.plot()
Scatter plot                        | show individual dots                 | plt.scatter()
Fit a straight line                 | find the best-fit slope & intercept  | np.polyfit(x, y, 1)
Turn those numbers into a function  | evaluate the line at any x           | np.poly1d()

Tip: the last argument 1 in np.polyfit means “degree 1 polynomial” → a straight line.
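
As a minimal sketch of how the last two rows of the table fit together, the snippet below feeds the output of np.polyfit into np.poly1d. The data values are illustrative only, and the names coeffs and line are chosen here for the example.

```python
import numpy as np

# Illustrative data; any two equal-length numeric arrays would do.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.8, 6.2, 7.9, 9.8])

coeffs = np.polyfit(x, y, 1)   # array([slope, intercept]), highest power first
line = np.poly1d(coeffs)       # callable object: line(v) = slope*v + intercept

slope, intercept = coeffs
print(line(6))                                        # evaluate the line at x = 6
print(np.allclose(line(x), slope * x + intercept))    # same numbers either way
```

Writing m*x + b by hand and calling line(x) give the same values; np.poly1d is mostly a convenience once the degree goes above 1.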


3. A step-by-step mini-example

import matplotlib.pyplot as plt
import numpy as np

x = np.array([1,2,3,4,5])
y = np.array([2.1,3.8,6.2,7.9,9.8])

plt.scatter(x, y, color="red")           # 1. dots
m, b = np.polyfit(x, y, 1)               # 2. slope (m) & intercept (b)
plt.plot(x, m*x + b, label=f"y={m:.2f}x{b:+.2f}")  # 3. best-fit line (sign of b shown correctly)
plt.legend(); plt.xlabel("x"); plt.ylabel("y")
plt.grid(True); plt.show()

What just happened?

  1. Scatter: raw data in red.
  2. np.polyfit: finds the least-squares line, i.e. the line that minimizes the sum of squared vertical distances to the points.
  3. Plot the fitted equation over the dots so you can judge the fit by eye.
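
The least-squares idea in step 2 can be checked numerically. The sketch below reuses the mini-example data, computes the slope and intercept from the standard closed-form formulas, and compares them with np.polyfit. The helper sse and the names m_manual and b_manual are introduced here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.8, 6.2, 7.9, 9.8])

# Closed-form least-squares estimates for a straight line.
x_mean, y_mean = x.mean(), y.mean()
m_manual = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b_manual = y_mean - m_manual * x_mean

m, b = np.polyfit(x, y, 1)

def sse(slope, intercept):
    """Sum of squared vertical distances between the data and a line."""
    return np.sum((y - (slope * x + intercept)) ** 2)

print(np.isclose(m, m_manual), np.isclose(b, b_manual))
print(sse(m, b) < sse(m + 0.1, b))   # nudging the slope makes the fit worse
print(sse(m, b) < sse(m, b + 0.1))   # so does nudging the intercept
```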

4. Bringing Pandas to the party

The PDF example works with a Student_Grades.csv file. Here are the everyday Pandas moves you see there:

Goal                                | One-liner
Read the file                       | df = pd.read_csv("Student_Grades.csv")
Replace missing values              | df["Scores"] = df["Scores"].fillna(df["Scores"].mean()) (or any fill value)
Quick scatter straight from Pandas  | df.plot(kind="scatter", x="Hours", y="Scores")
Basic stats                         | df["FinalExam"].describe() (or .min(), .max(), .mean())
Filter rows containing “A”          | df[df["Grade"].str.contains("A")]
Set an index for easy slicing       | df = df.set_index("Hours").sort_index()
Slice by label                      | df.loc[1.1:5.5]
Slice by position                   | df.iloc[0:3]
Grab a single value quickly         | df.at[1.1, "Scores"], df.iat[2, 4]
Rename a column                     | df.rename(columns={"Hours": "Hours Studied"})

These tools let you clean, summarize, and grab exactly the subset you need before plotting.
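
To try several of these moves without the CSV file, here is a small sketch on an in-memory DataFrame. The rows are made up for illustration, and the real Student_Grades.csv may have different columns and values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Student_Grades.csv (values invented for the demo).
df = pd.DataFrame({
    "Hours":  [1.1, 2.5, 3.2, 5.5, 8.3],
    "Scores": [17.0, 21.0, np.nan, 60.0, 81.0],
    "Grade":  ["F", "F", "D", "C", "A"],
})

# Fill the missing score with the column mean; assign the result back.
df["Scores"] = df["Scores"].fillna(df["Scores"].mean())

# Index by Hours (sorted) so rows can be sliced by label.
by_hours = df.set_index("Hours").sort_index()

subset = by_hours.loc[1.1:5.5]           # label slice: both ends included
first_three = by_hours.iloc[0:3]         # position slice: end excluded
one_score = by_hours.at[1.1, "Scores"]   # single cell by label
a_students = df[df["Grade"].str.contains("A", na=False)]

print(subset)
print(one_score, len(first_three), len(a_students))
```

Note the difference between the two slices: loc includes the end label 5.5, while iloc stops before position 3.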


5. Regression fit in practice

With cleaned data:

x = df["Hours"].to_numpy()
y = df["Scores"].to_numpy()

m, b = np.polyfit(x, y, 1)      # slope & intercept
x_line = np.linspace(0, 10, 100)
plt.scatter(x, y, s=8)
plt.plot(x_line, m*x_line + b, c="red", label="Best-fit")
plt.xlim(0,10); plt.ylim(0,100)
plt.xlabel("Hours studied"); plt.ylabel("Score")
plt.legend(); plt.show()
print(f"Slope: {m:.2f},  Intercept: {b:.2f}")

Interpretation

  • Slope (m) ≈ 9.8 means each extra hour of study is associated with a score about 9.8 points higher, on average.
  • Intercept (b) ≈ 2.5 is the predicted score for a student who studied 0 hours: about 2.5 points (noise, luck, or guessing!).
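
The interpretation can be made concrete with a short sketch that plugs the approximate coefficients quoted above into np.poly1d. The name predict is chosen here for illustration; with your own data, m and b would come from np.polyfit.

```python
import numpy as np

# Approximate coefficients quoted in the interpretation above.
m, b = 9.8, 2.5
predict = np.poly1d([m, b])   # predict(hours) = m*hours + b

print(predict(0))                # intercept: predicted score at 0 hours
print(predict(4) - predict(3))   # slope: change in score per additional hour
print(predict(10))               # slightly above 100: the line ignores the score cap
```

The last line is a reminder that a straight line is only a summary of the observed range; it does not know that scores stop at 100.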

6. Common gotchas & tips

  • Always drop or fill NaNs before feeding arrays to np.polyfit; a single missing value typically makes the fit fail or return NaN coefficients.

  • Sort your index when you plan to slice by labels (df.sort_index()).

  • Save vs. show

    • plt.show() pops the window now.
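    • plt.savefig("figure.png") stores it for reports. To do both, call plt.savefig() before plt.show(); once the window is closed the figure may be cleared, and a later save can come out blank.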
  • Scale matters: if x-values span 1 to 1,000,000, center or standardize first; a poorly conditioned fit can make the coefficients from np.polyfit inaccurate, especially at higher degrees.
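
The NaN and scale tips can be sketched in a few lines of NumPy. The data below is hypothetical, with one missing y-value, and the variable names are chosen for the example. Centering is shown in its simplest form: subtract the mean of x, fit, then shift the intercept back.

```python
import numpy as np

# Hypothetical data with one missing y-value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.8, np.nan, 7.9, 9.8])

# Keep only the pairs where both values are present.
mask = ~np.isnan(x) & ~np.isnan(y)
x_clean, y_clean = x[mask], y[mask]
m, b = np.polyfit(x_clean, y_clean, 1)

# Centering: fit against (x - mean), then convert the intercept back.
# y = m_c*(x - x_mean) + b_c  is the same line as  y = m_c*x + (b_c - m_c*x_mean)
x_mean = x_clean.mean()
m_c, b_c = np.polyfit(x_clean - x_mean, y_clean, 1)
b_from_centered = b_c - m_c * x_mean

print(m, b)
print(np.isclose(m, m_c), np.isclose(b, b_from_centered))
```

On small, well-scaled data like this the centered and uncentered fits agree; centering only starts to matter when the x-values are very large or the polynomial degree is higher.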


7. Mini-practice (try these on your own data)

  1. Make a scatter plot of Hours vs FinalExam directly with Pandas.
  2. Use df.describe() to find the mean and std-dev of Scores.
  3. Write a loop (or just df["Scores"].min()) to find the smallest score.
  4. Convert Hours to strings, set it as the index, then slice everything from "1.0" to "3.5".
  5. Generate 10 random student rows, deduce their letter grades, and append them to Student_Grades.csv (see Homework section in the PDF for hints).

Key takeaways

  1. Matplotlib draws; NumPy calculates; Pandas organizes.
  2. A line of best fit is just a 1st-degree polynomial from np.polyfit.
  3. Pandas indexing (loc, iloc, at, iat) lets you grab data precisely.
  4. Clean data → explore with stats → visualize → model → interpret.