NASCAR Transformation Module

Linear regression
Transformations
Polynomial regression
Using NASCAR driver rating data to explore a series of transformations to improve linearity in regression.
Authors and Affiliations

Alyssa Bigness, St. Lawrence University
Ivan Ramler, St. Lawrence University
Jack Fay, St. Lawrence University

Published

February 5, 2024

Stock cars used in NASCAR

Introduction

In NASCAR, driver rating is a metric used to evaluate the performance of drivers in races. It provides a comprehensive assessment of a driver’s overall competitiveness, efficiency, and consistency during a race. The driver rating is based on several key performance factors and is designed to offer a more objective view of a driver’s abilities. In this activity, you will explore the relationship between a driver’s average running position per lap over a season and the corresponding driver rating. Using data transformation techniques and polynomial regression to build several variations of a linear model, you will improve how well your models fit the data and how accurately they predict.

NASCAR, the National Association for Stock Car Auto Racing, is a preeminent motorsports organization in the United States, distinguished by its high-speed competitions and fervent fanbase. Established in 1948 by Bill France Sr., NASCAR has evolved into a leading racing series that encompasses three national divisions: the NASCAR Cup Series, the Xfinity Series, and the Truck Series. The Cup Series is the most prestigious, showcasing the elite drivers and teams as they compete in events that span a variety of track types, typically oval in shape, and ranging from short tracks (0.24 miles) to expansive superspeedways (over 3 miles).

The races take place on diverse tracks across the nation, including renowned venues such as Daytona International Speedway and Talladega Superspeedway, where vehicles frequently exceed speeds of 200 miles per hour. NASCAR races serve as a complex interplay of driver skill, team strategy, and vehicle performance. Key metrics such as lap times, pit stop efficiency, and vehicle dynamics offer insights into the determinants of success in the sport. Analyzing these metrics not only enhances the understanding of race outcomes but also contributes to the development of strategies that optimize performance.

The video linked below is a panel discussion from the 2023 Sloan Sports Analytics Conference entitled “Start Your Engines: How Data is Fueling NASCAR’s Strategy to Engage an Evolving Customer Base.”

Although the panel discussion is not directly related to this module, Justin Marks discusses how data is used in NASCAR from 26:53 to 30:44.

This activity is suitable for in-class use lasting approximately one class period or as an out-of-class assignment.

By the end of this activity, students will have reinforced the following skills:

  • Assessing the effectiveness of simple linear regression models
  • Checking regression model assumptions
  • Using log transformations to improve linear regression model fit
  • Applying polynomial regression to model curved relationships
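
The first three skills above can be sketched numerically. The example below is a minimal illustration, not the handout's solution: it uses made-up data (not the NASCAR file), and the helper names `fit_line` and `curvature_signal` are hypothetical. It fits a straight line before and after a log transformation of the predictor and uses a crude numeric stand-in for a residuals versus fitted plot to check whether curvature is left over.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares straight line.

    Returns (intercept, slope, fitted, residuals, r_squared).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # coefficients, highest power first
    fitted = intercept + slope * x
    residuals = y - fitted
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return intercept, slope, fitted, residuals, 1.0 - ss_res / ss_tot


def curvature_signal(x, residuals):
    """Correlation between residuals and squared, centered x.

    A value far from 0 suggests a curved pattern remains in the residuals.
    This is a rough numeric check, not a replacement for looking at the plot.
    """
    x = np.asarray(x, dtype=float)
    return np.corrcoef((x - x.mean()) ** 2, residuals)[0, 1]


# Made-up illustration data (an assumption, NOT the NASCAR data):
# the response falls off with log(x), plus random noise.
rng = np.random.default_rng(0)
avg_pos = rng.uniform(5, 40, size=200)
rating = 140 - 30 * np.log(avg_pos) + rng.normal(0, 2, size=200)

# Straight line on the raw predictor versus on the log-transformed predictor.
_, _, _, resid_raw, r2_raw = fit_line(avg_pos, rating)
_, _, _, resid_log, r2_log = fit_line(np.log(avg_pos), rating)

print("R^2 raw:", r2_raw, " R^2 log:", r2_log)
print("curvature raw:", curvature_signal(avg_pos, resid_raw))
print("curvature log:", curvature_signal(np.log(avg_pos), resid_log))
```

With real data, the same comparison would be made by plotting residuals against fitted values for each model and looking for a systematic bend.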

For this activity, students will need software to create scatterplots and plots of residuals versus fitted values for the models they build. They will also need to fit polynomial models and mutate the data by applying mathematical functions to columns.

Students are assumed to have already been exposed to these concepts; the activity is intended to reinforce these skills rather than introduce them.

Data

The data comes from the NASCAR website and covers season statistics from 2007 to 2022. Each row contains the metrics of one driver for one season. The data frame contains 1111 rows of observations and 20 variables.

Download data: nascar_driver_statistics.csv
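
One possible way to read the two columns used in this activity is sketched below with Python's standard `csv` module. The helper name `load_columns` is hypothetical, and the sketch assumes the file has a header row containing the column names `AvgPos` and `DriverRating` as listed in the variable descriptions; the file's exact layout is not confirmed here.

```python
import csv


def load_columns(path, x_name="AvgPos", y_name="DriverRating"):
    """Read two numeric columns from a CSV file with a header row.

    Rows where either value is missing or not a number are skipped.
    Raises KeyError if a requested column is not in the header.
    """
    xs, ys = [], []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = {x_name, y_name} - set(reader.fieldnames or [])
        if missing:
            raise KeyError(f"columns not found: {sorted(missing)}")
        for row in reader:
            try:
                x = float(row[x_name])
                y = float(row[y_name])
            except (TypeError, ValueError):
                continue  # skip incomplete or non-numeric rows
            xs.append(x)
            ys.append(y)
    return xs, ys
```

After downloading the file, a call such as `avg_pos, rating = load_columns("nascar_driver_statistics.csv")` would return two lists of equal length ready for plotting or model fitting.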

Variable Descriptions

| Variable | Description |
| --- | --- |
| Wins | The sum of the driver’s victories |
| AvgStart | The sum of the driver’s starting positions divided by the number of races |
| AvgMidRace | The sum of the driver’s mid-race positions divided by the number of races |
| AvgFinish | The sum of the driver’s finishing positions divided by the number of races |
| AvgPos | The sum of the driver’s position each lap divided by the number of laps |
| PassDiff | Green flag passes minus the number of times passed under green flag |
| GreenFlagPasses | Number of green flag passes performed by the driver |
| GreenFlagPassed | Number of times the driver is passed under green flag |
| QualityPasses | Number of passes by the driver in the top 15 while under green flag conditions |
| PercentQualityPasses | The sum of quality passes divided by green flag passes |
| NumFastestLaps | Number of laps where the driver had the fastest speed on the lap |
| LapsInTop15 | Number of laps completed while running in a top 15 position |
| PercentLapsInTop15 | The sum of the laps run in the top 15 divided by total laps completed |
| LapsLed | The sum of the laps led in a race |
| PercentLapsLed | The sum of the laps led divided by total laps completed |
| TotalLaps | The sum of the laps completed by a driver that year |
| DriverRating | Formula combining wins, finish, top-15 finish, average running position while on lead lap, average speed under green, fastest lap, led most laps, and lead lap finish, with a maximum rating of 150 points |
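
Several of the variables above are simple functions of others. The sketch below works through the arithmetic for PassDiff, PercentQualityPasses, and PercentLapsInTop15 using made-up season totals (not taken from the data). The function name `derived_metrics` is hypothetical, and reporting the percentages on a 0 to 100 scale is an assumption, since the scaling used in the file is not stated.

```python
def percent(numerator, denominator):
    """Return 100 * numerator / denominator, or NaN when the denominator is 0."""
    if denominator == 0:
        return float("nan")
    return 100.0 * numerator / denominator


def derived_metrics(green_flag_passes, green_flag_passed,
                    quality_passes, laps_in_top15, total_laps):
    """Recompute three derived variables from their stated definitions."""
    return {
        # Passes made minus times passed, both under green flag
        "PassDiff": green_flag_passes - green_flag_passed,
        # Share of green flag passes that were quality passes (0-100 scale assumed)
        "PercentQualityPasses": percent(quality_passes, green_flag_passes),
        # Share of completed laps run in the top 15 (0-100 scale assumed)
        "PercentLapsInTop15": percent(laps_in_top15, total_laps),
    }


# Made-up season totals for illustration only.
example = derived_metrics(
    green_flag_passes=2000,
    green_flag_passed=1900,
    quality_passes=1000,
    laps_in_top15=6000,
    total_laps=8000,
)
print(example)
```

Recomputing derived columns this way is also a quick consistency check on the downloaded data.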

Materials

Class handout

Class handout - with solutions: MS Word

Class handout - with solutions: Quarto

In conclusion, the Transforming NASCAR Driver Data worksheet offers valuable insights into the relationship between average position and driver rating in NASCAR. By transforming the average position variable, students improve the linearity of their models and, with it, the accuracy of their predictions and analysis. Because the relationship between the two variables is curved, a quadratic regression model is also highly effective, and it can be compared conceptually with the model that uses a square root transformation to capture the same curve. By assessing the effectiveness of each model, students can critically evaluate the different approaches and determine the most suitable model for the data. This exercise equips students with essential data transformation and modeling skills, empowering them to make informed decisions and gain a deeper understanding of the factors influencing a driver’s performance.
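
The comparison between a quadratic model and a square root transformation can be sketched as follows. This is a minimal illustration on made-up data (not the NASCAR file, and not the worksheet's solution); the helper names `r_squared`, `fit_polynomial`, and `fit_sqrt` are hypothetical. Both approaches are fit by least squares with `numpy.polyfit` and compared with a plain straight-line fit using R².

```python
import numpy as np


def r_squared(y, fitted):
    """Proportion of variance in y explained by the fitted values."""
    y = np.asarray(y, dtype=float)
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)


def fit_polynomial(x, y, degree):
    """Polynomial regression of y on x; returns (coefficients, fitted values)."""
    coeffs = np.polyfit(x, y, deg=degree)  # highest power first
    return coeffs, np.polyval(coeffs, x)


def fit_sqrt(x, y):
    """Straight-line regression of y on sqrt(x); returns (coefficients, fitted values)."""
    root_x = np.sqrt(x)
    coeffs = np.polyfit(root_x, y, deg=1)
    return coeffs, np.polyval(coeffs, root_x)


# Made-up illustration data (an assumption, NOT the NASCAR data):
# the response decreases with the square root of x, plus random noise.
rng = np.random.default_rng(1)
avg_pos = rng.uniform(3, 40, size=300)
rating = 150 - 20 * np.sqrt(avg_pos) + rng.normal(0, 3, size=300)

_, fitted_line = fit_polynomial(avg_pos, rating, degree=1)
_, fitted_quad = fit_polynomial(avg_pos, rating, degree=2)
_, fitted_sqrt = fit_sqrt(avg_pos, rating)

r2_line = r_squared(rating, fitted_line)
r2_quad = r_squared(rating, fitted_quad)
r2_sqrt = r_squared(rating, fitted_sqrt)
print("line:", r2_line, " quadratic:", r2_quad, " sqrt:", r2_sqrt)
```

Note that the quadratic model estimates three coefficients while the square root model estimates two, so R² alone slightly favors the quadratic; residual plots and interpretability should also weigh in when choosing between them.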