The SAT is a test that high school students in the U.S. take. It's important because colleges use the SAT to determine which students to admit.
In this project, we'll investigate NYC high school data to learn how student demographics correlate with test scores.
We'll investigate the following demographic indicators with SAT scores:
From our analysis, we found the following:
New York City has published data on student SAT scores by high school, along with additional demographic datasets. The datasets we'll use are the following:
We can note the following:
3 or higher being a passing score.
Number of Exams with scores 3 4 or 5 is effectively just tracking the number of exams passed out of those counted in the Total Exams Taken column.
SchoolName are the only columns with the exact same count.
DBN is an alias for SchoolName and is shared across most datasets. This makes it a good column for JOIN operations.
DBN column, though it can be computed by combining the
CSD column is actually shorthand for Community School District.
GRADE column we're interested in is
SCHOOLWIDE PUPIL-TEACHER RATIO column only has a count of 1484, suggesting that it's missing over 90% of its data.
gender, with percentage columns such as
DBN value has a frequency of 7, suggesting we'll need to filter this column to preserve
schoolyear column as some other datasets only have 2011/2012 data.
DBN value has a frequency of
DBN value only appears once.
Demographic column are what prevents DBN from being unique so we'll explore this later.
TOTAL column by summing the SAT Critical Reading Avg. Score, SAT Math Avg. Score and SAT Writing Avg. Score.
SAT columns are an object data type instead of a
The survey datasets are broken into two:
survey_d75. They have
1773 columns, respectively.
We'll concatenate both datasets and select columns that give us information about how parents, teachers, and students feel about school safety and more.
From the Data Overview, we learned that
DBN acts as a unique identifier, but that not all datasets had this column.
We can, however, generate this column using other dataset information.
We can sum the individual Math, Reading and Writing SAT columns to get a total column.
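The two steps above can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the project's actual code. It assumes that a DBN is the zero-padded district number followed by a school code, that the school code lives in a column called SCHOOL CODE, and that suppressed scores appear as a non-numeric string; the name sat_score for the total is borrowed from later in this write-up.

```python
import pandas as pd

# Toy rows standing in for a dataset without a DBN column.
# "SCHOOL CODE" is an assumed column name.
class_size = pd.DataFrame({
    "CSD": [1, 1, 10],
    "SCHOOL CODE": ["M015", "M292", "X368"],
})

# Assumption: DBN = two-digit, zero-padded district number + school code.
class_size["DBN"] = (
    class_size["CSD"].astype(str).str.zfill(2) + class_size["SCHOOL CODE"]
)

# Toy SAT rows. Scores arrive as strings (object dtype); "s" is an assumed
# marker for a suppressed value.
sat = pd.DataFrame({
    "DBN": ["01M292", "10X368"],
    "SAT Critical Reading Avg. Score": ["355", "s"],
    "SAT Math Avg. Score": ["404", "s"],
    "SAT Writing Avg. Score": ["363", "s"],
})

score_cols = [
    "SAT Critical Reading Avg. Score",
    "SAT Math Avg. Score",
    "SAT Writing Avg. Score",
]

# errors="coerce" turns non-numeric strings into NaN instead of raising.
sat[score_cols] = sat[score_cols].apply(pd.to_numeric, errors="coerce")

# skipna=False keeps the total missing when any component is missing.
sat["sat_score"] = sat[score_cols].sum(axis=1, skipna=False)
```

Keeping the total as NaN when a component is missing avoids reporting a partial sum as if it were a full SAT score.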
We'll need to filter out irrelevant information discovered in the Data Overview:
class_size dataset, keep rows where the
PROGRAM TYPE values are
DBN and calculate average values for each column.
demographics dataset, keep rows where the
grad_outcomes dataset, keep rows where the
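The filter-then-average pattern for the class_size dataset could look like the sketch below. The rows are invented, and the filter values ("09-12" and "GEN ED") and the AVERAGE CLASS SIZE column are assumptions chosen only to make the example run; substitute the values identified in the Data Overview.

```python
import pandas as pd

# Invented rows; several rows share a DBN, as in the real class size data.
class_size = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M292", "01M448"],
    "GRADE": ["09-12", "09-12", "0K", "09-12"],
    "PROGRAM TYPE": ["GEN ED", "GEN ED", "GEN ED", "CTT"],
    "AVERAGE CLASS SIZE": [20.0, 30.0, 18.0, 25.0],
})

# Assumed filter values, for illustration only.
mask = (class_size["GRADE"] == "09-12") & (class_size["PROGRAM TYPE"] == "GEN ED")
filtered = class_size[mask]

# Collapse to one row per DBN by averaging the numeric columns.
class_size_by_dbn = filtered.groupby("DBN", as_index=False).mean(numeric_only=True)
```

After this step each DBN appears once, which is what makes the dataset safe to join on. The demographics and grad_outcomes filters follow the same boolean-mask pattern with their own columns.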
When combining the datasets on the
DBN columns, we'll use the following
LEFT JOINS when we want to preserve the rows from the left-hand dataset and the right-hand dataset contains many missing
INNER JOINS when the left-hand and right-hand datasets both contain important information with few missing
It just means that the order in which we merge our datasets is important:
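As a rough illustration of why the order matters, the sketch below merges three tiny, made-up tables. The table contents and the ap_takers and avg_class_size column names are hypothetical; only DataFrame.merge and its how argument are real pandas API.

```python
import pandas as pd

sat = pd.DataFrame({"DBN": ["A", "B", "C"], "sat_score": [1100, 1500, 1300]})
# Sparse dataset: most schools are missing.
ap = pd.DataFrame({"DBN": ["A"], "ap_takers": [40]})
# Well-populated dataset: few schools are missing.
class_size = pd.DataFrame({"DBN": ["A", "B"], "avg_class_size": [25.0, 28.0]})

# Left join: keep every SAT row even though most AP values are missing.
combined = sat.merge(ap, on="DBN", how="left")

# Inner join: keep only schools present in both tables.
combined = combined.merge(class_size, on="DBN", how="inner")

# Starting from the sparse table instead discards most schools up front.
ap_first = ap.merge(sat, on="DBN", how="left")
```

Here combined keeps two schools while ap_first keeps only one, so which dataset sits on the left of each merge changes what survives.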
CSD from the
class_size dataset is shorthand for
Community School District. And that the first two
DBN characters are actually the school district.
We can use this information to create a
school_district column which may be useful for later analysis.
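A minimal sketch of deriving that column, assuming the merged data is held in a DataFrame named combined (a hypothetical name) with string DBN values:

```python
import pandas as pd

combined = pd.DataFrame({"DBN": ["01M292", "10X368", "31R080"]})

# The first two characters of the DBN are the school district.
combined["school_district"] = combined["DBN"].str[:2]
```

The district is kept as a string so that leading zeros such as "01" are preserved.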
saf_s_11 measures how teachers and students perceive safety at school.
Let's explore how the perceived safety of a school's environment relates to SAT scores.
While there isn't a strong correlation between SAT Scores and how Teachers perceive safety, we can observe that the best SAT scores come from schools with a teacher rating of at least
Similarly, this doesn't show a strong correlation, though the best scoring schools have a Student safety perception of at least ~
Evaluating these scores by district may help us uncover larger patterns.
When grouping schools by district, there seems to be a weak correlation between the SAT score and Teacher safety perception.
District SAT score and Student safety perception show an even weaker correlation.
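The district-level comparison can be sketched as a group-by followed by a correlation. The numbers below are invented and will not reproduce the findings above; saf_t_11 is assumed to be the teacher counterpart of saf_s_11, and combined is a hypothetical name for the merged data.

```python
import pandas as pd

# Invented values, two schools per district.
combined = pd.DataFrame({
    "school_district": ["01", "01", "02", "02", "03", "03"],
    "sat_score": [1200, 1300, 1500, 1600, 1100, 1150],
    "saf_t_11": [7.0, 7.4, 8.0, 8.2, 6.5, 7.1],  # assumed teacher safety score
    "saf_s_11": [6.5, 6.9, 7.0, 7.4, 6.8, 6.4],  # student safety score
})

# Average each measure within a district, giving one row per district.
districts = combined.groupby("school_district")[
    ["sat_score", "saf_t_11", "saf_s_11"]
].mean()

# Pearson correlation of each safety measure with the district SAT average.
corrs = districts.corr()["sat_score"].drop("sat_score")
print(corrs)
```

Averaging first means each district counts once, regardless of how many schools it contains.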
There are a few columns that indicate the percentage of each race at a given school:
It looks like:
hispanic_per as this grouping has the largest negative effect.
This paints a clearer picture as we can see the SAT score trend downwards with a higher hispanic percentage. Let's explore the schools with a
hispanic_per over 90%.
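Selecting those schools is a one-line boolean filter. The sketch below uses placeholder rows; the SCHOOL NAME column and the combined DataFrame name are assumptions.

```python
import pandas as pd

# Placeholder rows, not real schools.
combined = pd.DataFrame({
    "SCHOOL NAME": ["School A", "School B", "School C"],
    "hispanic_per": [99.8, 45.0, 92.0],
    "sat_score": [951.0, 1400.0, 1058.0],
})

# Keep only schools where more than 90% of students are Hispanic.
high_hispanic = combined[combined["hispanic_per"] > 90]
print(high_hispanic[["SCHOOL NAME", "hispanic_per", "sat_score"]])
```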
What might explain the negative correlation is that these schools are mostly international. They're also geared towards recent US immigrants who are likely learning English in addition to sitting the SAT. This would put them at an obvious disadvantage.
From the Pan American International High School, we learn that "Students at Pan American International High School all speak Spanish and have been in the United States for less than four years."
The above are schools with an average
sat_score above 1800 and a
hispanic_per less than 10%.
We can observe that these are specialized science and technology schools where entrants are required to pass an admissions test. This would explain why their students tend to do better on the SAT.
From Brooklyn Technical High School, we learn that "Students take the SHSAT (Specialized High School Admissions Test) in the fall of their 8th- or 9th-grade year and are admitted solely based on their test scores."
There are two columns that indicate the percentage of each gender at a high school:
If we use
0.25 as our threshold for a meaningful correlation, this suggests that the gender percentage at a school isn't meaningfully correlated with SAT scores.
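One way to apply such a threshold is sketched below. The data is invented and does not reproduce the result above; the male_per and female_per column names are assumptions, while 0.25 is the threshold from the text.

```python
import pandas as pd

# Invented values; female_per is simply 100 - male_per here.
combined = pd.DataFrame({
    "sat_score": [1100, 1250, 1400, 1300, 1200, 1500],
    "male_per": [50.0, 48.0, 55.0, 40.0, 60.0, 51.0],
    "female_per": [50.0, 52.0, 45.0, 60.0, 40.0, 49.0],
})

THRESHOLD = 0.25

# Pearson correlation of each gender column with the SAT score.
gender_corrs = combined[["male_per", "female_per"]].corrwith(combined["sat_score"])

# Keep only correlations whose magnitude reaches the threshold.
notable = gender_corrs[gender_corrs.abs() >= THRESHOLD]
print(gender_corrs)
print(notable)
```

Because the two percentages sum to 100 in this toy data, their correlations with the SAT score are equal in magnitude and opposite in sign.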
This confirms that there isn't a strong correlation, though we can observe that schools with an even gender mix tend to score highest.
We'll investigate whether schools with a higher percentage of AP takers perform better on the SAT.
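A possible way to set this up is to express AP takers as a share of enrollment before correlating with SAT scores. The sketch below uses invented numbers; the AP Test Takers, total_enrollment, and ap_per column names are assumptions.

```python
import pandas as pd

# Invented values for three schools.
combined = pd.DataFrame({
    "AP Test Takers": [50, 120, 30],
    "total_enrollment": [500, 600, 300],
    "sat_score": [1200, 1500, 1150],
})

# Percentage of enrolled students who took at least one AP exam.
combined["ap_per"] = 100 * combined["AP Test Takers"] / combined["total_enrollment"]

# Pearson correlation between AP participation and SAT score.
r = combined["ap_per"].corr(combined["sat_score"])
print(round(r, 2))
```

Normalizing by enrollment keeps large schools from looking like heavy AP participants simply because they have more students.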
There doesn't appear to be a strong correlation between the percentage of students who take the AP Test and SAT Scores.
From our analysis, we've found the following: