Scroll down for bigger images and an explanation.
Update: Links in the Huffington Post, SF Gate, Mission Mission, The Tender and Mission Loc@l.
This is part 2 of an analysis of postings reporting stolen bikes on Craigslist. Part 1 is here.
Bike theft sucks. What can we learn about it?
Bike theft sucks. If your bike is stolen, not only will you have lost a fairly significant possession, but there’s a good chance you’ll be stranded or stuck if you use your bike as a mode of transportation. (If you’re looking for tips on how to avoid having your bike stolen, try the San Francisco Bike Coalition’s page on theft prevention.)
I was curious about how bike thefts occur and what kind of patterns there were in bike theft incidents. To find out, I turned to Craigslist, where, while looking through listings for used bikes, I’d occasionally stumble upon a post where someone pleaded for their bike to be found and returned. Or I’d see a post where someone promised revenge if the thief was ever found on their bike. So, to get a better idea of what was happening with stolen bikes and Craigslist, I gathered some data.
Gathering and processing data
I archived San Francisco Craigslist listings from July 2nd, 2011 until October 17th, 2011 in the “bicycles” section with “stolen” in the title. This includes listings all over the Bay Area – San Jose, Mountain View, Oakland, Berkeley, San Leandro, Santa Cruz and so on. This is only about 3 months of data, but I think it’s fairly representative.
The first part of my analysis was a simple graph with word counts for an aggregate of the postings. It showed that there was a tendency for people to post about bikes (obviously) with pleas for help. Shimano components and Specialized bikes dominated the listings. Black was the most popular color in the posts.
I was also curious about the geographical distribution of stolen bikes. If you were to park your bike somewhere, in which neighborhood is it more likely to be stolen? Which city has more stolen bikes? I converted the postings to a spreadsheet, used software to make a treemap, and then manually cleaned up the graphic to make it a little prettier. (If you care for details: I created a Google Reader feed back in July, exported the feed to XML, cleaned up the data and exported it to CSV using Google Refine, then used R and the map.market function in the portfolio package to create an image. I then used Adobe Illustrator to make things a bit more attractive and readable. Flowing Data’s “An Easy Way to Make a Treemap” was very helpful in this process. If I had known how to code in R better, I would probably have tried to modify map.market to create a more refined treemap and remove some of the manual steps.)
Above: Google Refine converted postings into a tabular format.
I settled on a treemap as a format to display the data, but I think a geographical map would have been the best way to represent the data. I guess I just wasn’t up for tracing neighborhood boundaries and all of the other associated work.
I should also note that going through the listings made me kind of sad. Bike thieves suck.
How good is the data? Can it be refined?
Analyzing Craigslist postings isn’t a perfect way to determine where bikes are stolen. There are a few registries out there that may have some good data. I was curious about Craigslist postings specifically since I had stumbled across so many while shopping around for bikes for myself.
So, for a point of data from Craigslist to show up correctly in this analysis, someone who had their bike stolen would need to:
1) report a missing bike as stolen on Craigslist
2) identify the neighborhood where the bike was stolen in the posting properly
There were 633 postings in total with the word “stolen” in the title. It seems that most people do 2) pretty well. Only about 9% (59 of 633) of the postings did not have an actual location in the title. I don’t know how many people who have had their bikes stolen actually post on Craigslist and report their bikes as stolen. I’m pretty sure it’s not all people.
Duplicates were removed
Some people are really good at posting on Craigslist though. They posted multiple times. This is totally understandable for someone who wants to get their bike back. I used Google Refine to remove these duplicates so that they would not skew the data, but I probably ended up removing some unique posts in the process. All in all, 64/633 were removed because they were duplicates.
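For anyone who would rather script this step than click through Google Refine, here is a minimal Python sketch of one way to drop reposts. It is not the method used above: it assumes duplicates can be spotted by comparing titles after lowercasing and stripping punctuation, and the sample titles are invented for illustration.

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())

def drop_duplicates(titles: list[str]) -> list[str]:
    """Keep the first posting for each normalized title, preserving order."""
    seen = set()
    unique = []
    for title in titles:
        key = normalize_title(title)
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique

# Hypothetical titles, invented for illustration only.
sample = [
    "STOLEN: black Specialized road bike (mission district)",
    "Stolen - Black Specialized Road Bike (Mission District)",
    "Stolen blue Trek (berkeley)",
]
print(len(drop_duplicates(sample)))  # the first two titles collapse into one
```

The same trade-off applies here as with the manual approach: two different people who happen to write the same title would be merged, so some unique posts can be lost.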
“Stolen” is a bike company name
There’s a company that builds BMX bikes that is named “Stolen.” I removed some of these listings, but I think about 19 of the 633 postings that made it into the infographics were for Stolen-brand bikes that weren’t actually stolen. In an ironic twist, 1 posting was for a stolen “Stolen” brand BMX bike.
The way people post locations can vary
Since anyone can post almost whatever they want on Craigslist, the data was pretty messy. Craigslist has created a bunch of predefined neighborhoods such as the “Mission District” and “Hayes Valley,” but sometimes people don’t stick to the naming convention. Sometimes people are ambiguous with neighborhood names and post “Mission” instead of the “Mission District.” Some other people are much more specific – they post “Near Mission Cliffs” or a specific intersection like “Stockton and North Point.” For the most part, I didn’t convert these to their applicable neighborhoods unless the poster’s intent was obvious, like with typos, for example.
Some people did not include a location or included not very useful but understandably frustrated-sounding “locations” such as, “you tell me” and “can you find it?” Other locations contained text like, “United States” and “The Bay Area.” I grouped these ambiguous “locations” into their own category. They are still included in the infographics below.
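The clean-up rules described above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the typo table is hypothetical (the actual corrections were made by hand), and the name of the catch-all category is made up. Specific intersections and landmarks are left exactly as posted, matching the approach described.

```python
# Hypothetical typo fixes; the real corrections were made manually.
TYPO_FIXES = {
    "misson district": "Mission District",
    "hays valley": "Hayes Valley",
}

# Non-locations quoted in the text, plus the empty string for missing values.
AMBIGUOUS = {"", "you tell me", "can you find it?", "united states", "the bay area"}

def canonical_location(raw: str) -> str:
    """Map a raw location string to a cleaned-up label."""
    key = " ".join(raw.lower().split())
    if key in AMBIGUOUS:
        return "Ambiguous"  # assumed name for the catch-all category
    # Anything not in the typo table is kept as the poster wrote it.
    return TYPO_FIXES.get(key, raw.strip())

print(canonical_location("  You tell me "))
print(canonical_location("Stockton and North Point"))
```

Whether a bare “Mission” should be folded into “Mission District” is a judgment call, so it is deliberately left out of the table.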
Also, some listings contained locations covered by other local Craigslist websites. “Sacramento,” for example, has its own Craigslist page, but a posting was still filed under the San Francisco Bay Area Craigslist page.
Good Samaritans also post
There were some postings from people who saw or purchased possibly stolen bikes and were trying to reunite them with their owner. (The Laney College Flea Market in Oakland is a good place to find your bike if it’s been stolen, by the way.) This is good for the world, but it clouded the data set just a little bit. I don’t know exactly how many postings were of this type but it was not too large of a number. I’d estimate about 5% of postings were from good Samaritans.
Where does bike theft occur?
So, with all of that out of the way, here are the infographics with treemaps. The size of a rectangle is proportional to the number of occurrences for that location. Larger rectangles mean more bikes were reported as stolen, and smaller rectangles mean the opposite.
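To make the “size is proportional to count” idea concrete, here is a small Python sketch. It uses a simplified single-row layout, which is an assumption for illustration and not the algorithm that map.market uses, and the counts are invented rather than taken from the dataset.

```python
def slice_layout(counts: dict[str, int], width: float = 100.0, height: float = 50.0):
    """Return (label, x, y, w, h) rectangles whose areas are proportional to counts."""
    total = sum(counts.values())
    rects = []
    x = 0.0
    # Largest group first, laid out left to right across one row.
    for label, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        w = width * count / total
        rects.append((label, x, 0.0, w, height))
        x += w
    return rects

# Hypothetical counts, not the figures from the dataset.
example = {"City A": 50, "City B": 30, "City C": 20}
for label, x, y, w, h in slice_layout(example):
    print(f"{label}: x={x:.1f} width={w:.1f} area={w * h:.0f}")
```

Because every rectangle shares the same height, a location with twice as many postings gets a rectangle that is twice as wide and therefore twice the area.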
The first infographic is a treemap, with postings separated by city:
Interestingly, nearly half of the stolen bike postings were from San Francisco. I expected to see more theft in Oakland and Berkeley. It’s also surprising that there was only 1 reported theft in Emeryville. Is it because Emeryville is that much smaller? Are there not many cyclists there?
Workflow note: I made this treemap manually in Illustrator based on the neighborhood map below. Open the image in a new window to view at full resolution.
This second infographic is also a treemap, but with areas divided by cities and neighborhoods. You’ll notice that they are color coded with the same hue as in the above city infographic.
Holy crap, there are a lot of neighborhoods. The Mission district wins for being the neighborhood with the most stolen bike listings. In Oakland, the largest chunk of thefts occurred by the Lake. There’s a pretty large number of stolen bikes in Santa Cruz and Berkeley, probably due to high bike usage by college students and perhaps naïveté with regards to bike locking strategies. Strangely, there aren’t a lot of listings from Palo Alto or Stanford. Is there less theft there or do people just not look to Craigslist when trying to recover their bike?
I like how they turned out, but making these damned infographics took a lot of time. I think there’s still some interesting stuff to get out of the dataset. I’m curious about how bikes are stolen. Did somebody cut through a lock? Did they break into an apartment? Did somebody just lean their bike and then look away for a few seconds? I’ll try to find that out next.
- Phillip Yip
October 27th, 2011
This is part 1 of a multi-part series. Part 2 is here.
Bike theft sucks. For quite some time, I’ve been wanting to compile some stats on bike theft in order to understand how and where it happens and how it can be prevented. I archived San Francisco craigslist listings from July 2nd, 2011 until October 17th, 2011 in the “bikes” section with “stolen” in the title. This includes listings all over the Bay Area – San Jose, Mountain View, Oakland, Berkeley, San Leandro, and so on. I was curious about how bikes get stolen and what kind of patterns there were in the posts and incidents. I’ve compiled a little graph that shows the frequency of words in the listings. I used a messy combination of Google Refine, Google Spreadsheets and Open Office Calc and a word frequency counter that was a bit slow but quite useful. The data consists of about 633 entries.
We’ll see what other data I can squeeze out of the set in the future.
I had to do some manual cleaning of the data – I purposely omitted words under 4 letters in length, like “the”, “a” and so on, since they probably wouldn’t have been too informative. There were also various html-related words like “http”, “href” and “nofollow” that I removed as well.
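The filtering described above could also be scripted. Here is a rough Python sketch, offered as an alternative to the spreadsheet tools actually used: it counts words of 4 or more letters and skips the HTML-related words named in the text. The sample posts are invented.

```python
import re
from collections import Counter

# HTML-related words mentioned in the text as removed.
HTML_WORDS = {"http", "href", "nofollow"}

def word_counts(posts, min_length=4):
    """Count lowercase words of at least min_length letters, skipping HTML debris."""
    counts = Counter()
    for post in posts:
        for word in re.findall(r"[a-z]+", post.lower()):
            if len(word) >= min_length and word not in HTML_WORDS:
                counts[word] += 1
    return counts

# Hypothetical posts, invented for illustration only.
posts = [
    "Please help, my black bike was stolen",
    "Reward for black bike http href",
]
for word, n in word_counts(posts).most_common(3):
    print(word, n)
```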
Here’s a graph of the word counts:
and here’s the corresponding table:
some interesting things:
A lot of the terms are fairly obvious, but some words stick out. Most of the posts are obviously requests for help and the words show this – “please”, “reward”, “thanks”, “return”. As far as bike companies go, it appears that “shimano” dominates the component world and “specialized” is the most popular bicycle manufacturer. The color “black” is the most popular, but “white”, “blue” and “silver” also show up. Another word that stands out is “photobucket”, the popular image-sharing site.
I’m going to try to continue poring over this data and see if anything else interesting emerges.
October 17th, 2011
May 10, 2013: Updates!
CheapAir.com has performed a similar analysis and created a graph that I think looks similar. The peaks and valleys have been smoothed out because they’ve got a lot more data to average out. For domestic flights, the average cheapest flight is 49 days prior to departure. This is earlier than my graph, but I didn’t start my analysis nearly as early as they did. Clicking the graph links to their informative post.
I also found a study by ARC (Airlines Reporting Corporation) via marketplace.org that created a very similar looking graph to that of CheapAir.com’s, but from a different set of data.
Data Analysis: Flying from San Francisco to New York – when is the cheapest time to buy tickets?
Kayak.com has this nice feature where you can subscribe to price alerts for certain itineraries. This is helpful as fares change fairly frequently and it’s hard to know when to purchase tickets. Microsoft purchased a company called farecast.com back in 2008, which originally grew by using data to predict when prices would rise, fall, or hold steady. Microsoft has since integrated it into bing travel. They claim about 75% accuracy.
I visited New York a few weeks ago, and when searching for a ticket, I decided that I didn’t really trust bing travel’s technology. I decided that I’d monitor fares on my own using Kayak’s emailed price alerts, and then make a purchase when prices seemed to be reasonable. I identified travel dates for a round trip where I’d depart on June 17th at any time of the day and return June 21st, at any time of the day. San Francisco International Airport (SFO) and Oakland International Airport (OAK) are both just about as easy to get to for me. It also didn’t matter whether I arrived at John F. Kennedy International (JFK) or LaGuardia (LGA) in New York.
I took the prices from all of the emails, put them together in a data set, and plotted them. One of the big assumptions here is that the travel dates are fixed – if you’re able to fly on different days, you’ll of course most likely be able to find cheaper tickets.
but first, key findings:
* Prices go up at the last minute. In this case, they almost doubled.
* In the 6-week monitoring period, the cheapest flights were found about 3 weeks prior to departure
* There seems to be some truth to prices being lower mid-week
* There doesn’t seem to be a big price difference for OAK vs SFO or JFK vs LGA
* When one airline dropped fares, others seemed to follow
onto the graphs:
when should I buy tickets?
One interesting finding is that buying early (I am speaking relatively here as I didn’t start my search until about 6 weeks before departure) isn’t always the cheapest. In this case, the cheapest fares were found about 3 weeks prior to departure. Tickets may have, of course, been cheaper prior to 6 weeks before departure.
An ABC News article states that “Airfare sales tend to occur early in the week … And increases tend to occur at the end of the week.” My data set isn’t very large, but here’s a histogram of prices, grouped by day of the week:
What does the histogram show? For my set of data, the cheapest prices occurred on Wednesday and Thursday. You can see the little bumps of lower fares on the left side of the graph for Wednesday and Thursday. I’m not sure if much can be made of the rest of it – there aren’t too many data points to draw any strong conclusions.
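Grouping fares by day of the week is easy to reproduce in code. The Python sketch below shows the idea under assumptions: the observations are made-up (date, fare) pairs rather than the Kayak alert data, and the year 2011 is inferred from the date of this post.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def fares_by_weekday(observations):
    """Group (date, fare) pairs by weekday and summarize each group."""
    groups = defaultdict(list)
    for day, fare in observations:
        groups[WEEKDAYS[day.weekday()]].append(fare)
    return {name: {"min": min(fares), "mean": mean(fares), "n": len(fares)}
            for name, fares in groups.items()}

# Hypothetical observations, invented for illustration only.
observations = [
    (date(2011, 5, 25), 300),  # a Wednesday
    (date(2011, 5, 26), 350),  # a Thursday
    (date(2011, 6, 1), 400),   # a Wednesday
]
for name, stats in fares_by_weekday(observations).items():
    print(name, stats)
```

Reporting the number of observations per weekday alongside the minimum and mean helps show how thin each group is, which matters with a data set this small.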
Prices were probably also the highest Tuesday-Thursday because those were the last 3 days before the flight and as can be expected, last-minute tickets were much more expensive.
where should I fly from/to?
I had two theories about the relationship between airfare and the size of the airport. I was thinking that flights might be cheaper out of SFO since it’s a much more popular airport (Based on what I could find here and here, they handled about 45 million passengers in 2010 compared to about 9.5 million for OAK). Conversely, I also thought that flights may be cheaper out of OAK since I know that a strategy of low-cost carriers like Southwest, JetBlue, and AirTran is to use secondary airports in larger markets (think Midway for Chicago, BWI for DC, Providence for Boston, and Love Field for Dallas) to keep costs down and thus offer lower fares.
There doesn’t look to be a big price difference, on average. There’s a piddly $4 to $6 difference between flying out of SFO vs OAK and landing in JFK vs LGA. Maybe the two theories are both correct. Or incorrect. Also, the Kayak data doesn’t include Southwest, since Southwest doesn’t make its data available to third parties.
why did prices drop?
The lowest price I encountered was on May 25th, when United/Continental dropped their prices for a nonstop flight from SFO to JFK to $319 from $549 a day earlier. American and Delta also lowered their prices for nonstop flights that day to $439 and $359. Some of the airlines also lowered their prices from SFO to LGA (note: no nonstop flights). This may have been because the price of connecting flights was reduced and the SFO to LGA and SFO to JFK trips share similar legs. Interestingly, flights out of OAK didn’t change by much when prices of flights from SFO dropped by over $200. These prices didn’t last long – the $319 fare was available for only two days. $319 seems like a pretty good deal. I don’t have historical flight price data, but from what I can recall, this appears to be near the bottom of the fare range.
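As a quick check on the numbers above, here is the arithmetic in Python. The dates and fares come from the text; the year 2011 is an assumption based on the date of this post.

```python
from datetime import date

departure = date(2011, 6, 17)        # year assumed from the post date
lowest_fare_day = date(2011, 5, 25)  # day the $319 fare appeared

days_before = (departure - lowest_fare_day).days
drop = 549 - 319
drop_pct = 100 * drop / 549

print(days_before)         # 23, a little over 3 weeks
print(drop)                # 230
print(round(drop_pct, 1))  # 41.9
```

So the $319 fare showed up 23 days before departure, and the overnight drop from $549 was $230, or about 42%.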
There was another temporary price drop from SFO to LGA offered by Delta on May 31st to $341. Prices were back up by the next day.
If you’ve made it this far, thanks for reading. I’ve been wanting to get more into data analysis on topics that we all can relate to and this is part of my foray into the field. There’s a lot more to learn and study out there, so if you have any suggestions of things I should look into regarding airfares or anything else, let me know.
When my schedule freed up, I ended up changing my travel dates in order to find flight times that worked better for me and found two nonstop flights from SFO to JFK on Virgin America.
June 28th, 2011
A while back, I decided to try to teach myself R. I thought that running races would have some interesting data to look through. Here’s what I’ve come up with so far:
This is a scatter plot of finishing times versus runner ages with different colors for male and female runners:
Males generally finished the race faster. There were more female runners (I wonder why?). The fastest age group looks to be runners in their mid 20s. There are a few data points where I’m guessing no age was given and therefore the runner was assigned the age of “1”. I’m impressed at the people who are still completing half marathons in their 60s and 70s!
More charts to come, maybe!
February 19th, 2011