# Farmer's Fridge Yelp Reviews Analysis

## Summary
Using data from 173 Yelp reviews for three Chicago fridge locations, I examined trends like average ratings over time, item popularity, and common phrases mentioned in reviews. The results show that, pre-pandemic:
- The airport location had more and, on average, better ratings while it was active.
- Salad popularity decreased over time while bowl and sandwich popularity increased, but salads are still by far the most popular type of item.
- Common positive phrases include “healthy”, “good”, and “fresh”, while analysis of negative reviews suggests price and portion sizes factored into their low ratings.
## Data Source
I scraped Yelp reviews using the Selenium package on 6/20/2021.
Data include the rating (1 to 5 stars), the main text of the review, and metadata like the number of Yelp reviews a user has given or the number of people who found a review useful.
Each review constitutes a row in the yelp.csv file, with each piece of data (rating, text, etc.) as a separate column.
I scraped three different Chicago fridges. I found the locations by manually searching Yelp for “farmer’s fridge” in Chicago and selecting fridges with more than 10 reviews. The three fridges have 93, 23, and 57 reviews, for a total of 173. Two fridges are located in The Loop, while the third, with 57 reviews, is located in O’Hare International Airport.
My complete code to scrape the reviews can be toggled below, but is not the main focus of this writeup.
```python
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
import csv


def scrape_reviews(yelp_url, sleep_time=1):
    """Retrieve all text data from reviews, including star rating.
    Sleep between pages of reviews so the html loads before accessing data."""
    browser = webdriver.Firefox()
    browser.get(yelp_url)
    all_reviews = []
    more_pages_exist = True
    while more_pages_exist:
        time.sleep(sleep_time)  # Wait for page to load
        reviews = get_reviews_on_page(browser)
        all_reviews.extend(reviews)
        click_succeeded = click_on_next_page(browser)
        if not click_succeeded:
            more_pages_exist = False
    return all_reviews


def get_reviews_on_page(browser):
    """Extract all reviews on a single Yelp page."""
    reviews = []
    review_num = 1
    more_reviews_exist = True
    while more_reviews_exist:
        try:
            review = browser.find_element_by_xpath(
                "/html/body/div[2]/div[2]/yelp-react-root/div/div[3]/div/div"
                + "/div[2]/div/div[1]/div[2]/section[2]/div[2]/div/ul"
                + "/li[" + str(review_num) + "]"
                + "/div"
            )
            rating = browser.find_element_by_xpath(
                "/html/body/div[2]/div[2]/yelp-react-root/div/div[3]/div/div/"
                + "div[2]/div/div[1]/div[2]/section[2]/div[2]/div/ul"
                + "/li[" + str(review_num) + "]"
                + "/div/div[2]/div/div[1]/span/div"
            ).get_attribute('aria-label')
            reviews.append(rating + "\n" + review.text)
            review_num += 1
        except NoSuchElementException:  # No more reviews on page
            more_reviews_exist = False
    return reviews


def click_on_next_page(browser):
    """Click on the next page of Yelp reviews, if it exists. Returns
    whether the next page exists."""
    try:
        next_page_button = browser.find_element_by_css_selector('.next-link')
        next_page_button.click()
        return True
    except NoSuchElementException:
        return False


def write_to_csv(file_name, reviews, append=False, location=''):
    """Write the following data as columns in a csv file:
    - rating (1 to 5 scale)
    - reviewer data: name, location, number of friends, number of
      reviews, and number of photos
    - review data: date of review, updated or not, review text,
      number of people who found the review useful, funny, and cool
    """
    mode = 'a' if append else 'w'
    with open(file_name, mode, newline='') as file:
        writer = csv.writer(file)
        if not append:
            header = [
                'rating',
                'name',
                'elite',
                'reviewer_location',
                'n_friends',
                'n_reviews',
                'n_photos',
                'review_photos',
                'date',
                'updated',
                'text',
                'n_useful',
                'n_funny',
                'n_cool',
            ]
            if location:
                header.append('business_location')
            writer.writerow(header)
        for review in reviews:
            row = process_review(review)
            if location:
                row.append(location)
            writer.writerow(row)


def process_review(review):
    """Process the structure of a Yelp review line by line into a row
    to be written by the `write_to_csv()` function."""
    lines = review.split('\n')
    rating = lines.pop(0)[0]
    # Reviewer data
    name = lines.pop(0)
    if is_elite(lines[0]):
        elite = 'True'
        lines.pop(0)
    else:
        elite = 'False'
    location = lines.pop(0)
    n_friends = lines.pop(0)
    n_reviews = lines.pop(0)
    if is_int(lines[0]):
        n_photos = lines.pop(0)
    else:
        n_photos = 0
    # Review data
    date = lines.pop(0)
    if is_updated_review(lines[0]):
        updated = 'True'
        lines.pop(0)
    else:
        updated = 'False'
    if has_photos(lines[0], single=True):
        review_photos = '1'
        lines.pop(0)
    elif has_photos(lines[0], single=False):
        review_photos = strip_nonint(lines[0], 'photos')
        lines.pop(0)
    else:
        review_photos = 0
    end_of_review_text = find_end_of_review_text(lines)
    text = '\n'.join(lines[:end_of_review_text])
    n_useful = strip_nonint(lines[end_of_review_text], 'Useful')
    n_funny = strip_nonint(lines[end_of_review_text + 1], 'Funny')
    n_cool = strip_nonint(lines[end_of_review_text + 2], 'Cool')
    row = [
        rating,
        name,
        elite,
        location,
        n_friends,
        n_reviews,
        n_photos,
        review_photos,
        date,
        updated,
        text,
        n_useful,
        n_funny,
        n_cool,
    ]
    return row


def find_end_of_review_text(lines):
    """Find the index of the line where the main text of a review ends."""
    for i in range(len(lines) - 2):
        line1 = lines[i]
        line2 = lines[i + 1]
        line3 = lines[i + 2]
        if (line1.startswith('Useful')
                and line2.startswith('Funny')
                and line3.startswith('Cool')):
            return i
    print("Warning: no expected end of text found.")


def is_elite(string):
    """Determine if given string indicates an elite Yelp user."""
    return string.startswith("Elite")


def is_int(string):
    """Determine if given string is an integer."""
    try:
        int(string)
        return True
    except ValueError:
        return False


def has_photos(string, single=False):
    """Determine if given string ends in 'photo' or 'photos'."""
    if single:
        return string.endswith('photo')
    else:
        return string.endswith('photos')


def is_updated_review(string):
    """Determine if given string indicates an updated Yelp review."""
    return string == "Updated review"


def strip_nonint(string, word_to_strip):
    """Strip a word and spaces from a string, keeping the remainder.
    If the remainder is not an integer, return '0'. So, 'Useful 12'
    becomes '12' while 'Useful' becomes '0'."""
    # str.strip() removes *characters* in `word_to_strip` from the ends,
    # which suffices here because any digits sit between the word and
    # the end of the string.
    stripped = string.strip(word_to_strip).strip(' ')
    if is_int(stripped):
        return stripped
    else:
        return '0'


if __name__ == '__main__':
    url_location_pairs = [
        ('https://www.yelp.com/biz/farmers-fridge-chicago-13', 'The Loop 1'),
        ('https://www.yelp.com/biz/farmers-fridge-chicago-14', 'The Loop 2'),
        ('https://www.yelp.com/biz/farmers-fridge-chicago-87', "O'Hare Airport"),
    ]
    file_name = 'yelp.csv'
    counter = 1  # Write a new file for the first location, then append
    for url, location in url_location_pairs:
        print(f"Scraping url number {counter}")
        reviews = scrape_reviews(url)
        if counter == 1:
            write_to_csv(file_name, reviews, append=False, location=location)
        else:
            write_to_csv(file_name, reviews, append=True, location=location)
        counter += 1
```
## Data Cleaning and Exploration
Let’s load in the data and examine it:
**Note**: You can run and modify the code on this page, Jupyter Notebook style, without leaving the page! Hover over the launch button at the top of the page, then click the Live Code button. If you prefer a Jupyter environment, use the Binder button instead.

You’ll need to run the following cell manually because Live Code sessions assume a cwd of the root directory. Refresh the page to revert to the original view.
```python
import os

try:
    os.chdir("./website/stat-ml/")
except FileNotFoundError:
    pass
print(f"cwd: {os.getcwd()}")
```

```
cwd: /home/adam/projects/website/website/stat-ml
```
```python
import pandas as pd

df = pd.read_csv("yelp.csv")
df.head()
```
| | rating | name | elite | reviewer_location | n_friends | n_reviews | n_photos | review_photos | date | updated | text | n_useful | n_funny | n_cool | business_location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | Josh C. | True | Chicago, IL | 215 | 22 | 129 | 9 | 1/9/2021 | False | We received a Farmer's Fridge delivery straigh... | 0 | 0 | 1 | The Loop 1 |
| 1 | 5 | Jim Z. | False | Napa, CA | 0 | 25 | 8 | 0 | 5/19/2021 | False | Repeat buyer... awesome food, great service...... | 0 | 0 | 0 | The Loop 1 |
| 2 | 5 | Kathy L. | False | New Berlin, WI | 0 | 2 | 0 | 0 | 4/2/2021 | False | So happy I found Farmers Fridge in MKE airport... | 0 | 0 | 0 | The Loop 1 |
| 3 | 5 | Regina X. | True | Seattle, WA | 521 | 220 | 432 | 0 | 9/1/2019 | False | The business stock photos did not prepare me f... | 0 | 0 | 0 | The Loop 1 |
| 4 | 1 | Alice J. | False | Oak Park, IL | 0 | 23 | 1 | 0 | 8/14/2020 | False | $8 for a salad in a jar.\n$8.\nThis is ridicul... | 0 | 3 | 0 | The Loop 1 |
The important variables that will be used for this analysis are:
- `rating`: an integer rating between 1 (worst) and 5 (best)
- `date`: the date the review was posted on Yelp
- `text`: the main contents of the review
- `business_location`: the location of the fridge
The others are not used, but some ideas of how to use them are sketched in § Future Work.
It’s much more convenient to work with datetime objects, so let’s convert them:
```python
df.date = pd.to_datetime(df.date)
```
Now we can easily group reviews by year. Also grouping by fridge (i.e. business) location, we can count the number of reviews posted at each location in a given year:
```python
grouped_by_year_and_loc = df.groupby([df.date.dt.year, df.business_location])
annual_counts = grouped_by_year_and_loc.rating.agg("count").unstack()
annual_counts
```
| date | O'Hare Airport | The Loop 1 | The Loop 2 |
|---|---|---|---|
| 2013 | NaN | 13.0 | NaN |
| 2014 | NaN | 21.0 | NaN |
| 2015 | NaN | 22.0 | NaN |
| 2016 | NaN | 19.0 | 5.0 |
| 2017 | 10.0 | 5.0 | 6.0 |
| 2018 | 24.0 | 5.0 | 6.0 |
| 2019 | 19.0 | 3.0 | 5.0 |
| 2020 | 4.0 | 2.0 | 1.0 |
| 2021 | NaN | 3.0 | NaN |
```python
ax = annual_counts.plot()
ax.set(xlabel="Year", ylabel="Number of Annual Reviews")
```

*(Figure: line plot of the number of annual reviews at each location.)*
The first Loop location seems to have been one of the first fridges, with reviews available in the same year the company was founded. The number of reviews per year begins to decline around 2016, coinciding with the opening of the second Loop location. In 2017, a fridge opened in O’Hare International Airport and became relatively popular. In 2020, when the COVID-19 pandemic drastically altered the lives of Chicago residents, the number of reviews sharply declined, a trend that continued into 2021.
If we’re interested in annual trends over time, it may be helpful to combine the two Loop locations to avoid computing aggregates from only 5 or fewer reviews:
```python
import numpy as np


def combine_loop_locations(locations):
    combined_locations = []
    for loc in locations:
        if loc.startswith("The Loop"):
            combined_locations.append("The Loop")
        else:
            combined_locations.append(loc)
    return combined_locations


df["combined_location"] = combine_loop_locations(df.business_location)
np.unique(df.combined_location)
```

```
array(["O'Hare Airport", 'The Loop'], dtype=object)
```
## Exploratory Data Analysis
This section examines a few specific questions, focusing on graphical presentation and intuition rather than formal statistical tests. These analyses can help inform what avenues Farmer’s Fridge could look into in the future and serve as a foundation for more detailed analyses later.
### Ratings Over Time
A plot of annual ratings over time, grouped by location, can be constructed by grouping by the relevant variables and invoking the .plot() method.
In addition, I’ll overlay points with sizes proportional to the number of reviews.
This gives years with many reviews more visual weight than years with few.
That weighting is especially important for 2020 and 2021, when there is very little data;
for example, the 5/5 rating for the Loop locations in 2021 is based on only 3 reviews.
```python
import matplotlib.pyplot as plt

grouped_by_year_and_loc = df.groupby([df.date.dt.year, df.combined_location])
annual_ratings = grouped_by_year_and_loc.rating.agg("mean").unstack()
annual_counts = grouped_by_year_and_loc.rating.agg("count").unstack()

ax = annual_ratings.plot()
ax.set(xlabel="Year", ylabel="Average Annual Rating (1 to 5)")
for i, loc in enumerate(annual_ratings.columns):
    col = "C" + str(i)
    point_sizes = [5 * count for count in annual_counts[loc]]
    ax.scatter(annual_ratings.index, annual_ratings[loc], s=point_sizes, c=col)
```
There appears to have been a steady decline in ratings throughout 2016. In 2017, the Loop locations’ rating fell substantially, but the introduction of the O’Hare location was a success. The Loop rating rebounded somewhat and the O’Hare location continued to see success, until the pandemic hit, drastically lowering the number of ratings available in 2020 and 2021.
### What Items are Popular?
A simple analysis assumes a correlation between the number of mentions of an item and its popularity. I’ve grouped the items following the menu on Farmer’s Fridge’s website, along with additional words likely to indicate each grouping. The terms counted as salads, for example, are:
```python
salads = [
    "Smoked Cheddar Cobb Salad",
    "Buffalo Chicken Ranch Salad",
    "Elote Salad",
    "North Napa Salad",
    "Grilled Chicken Caesar Salad",
    "Southwest Salad with Chipotle Chicken",
    "Green Goddess Salad",
    "Greek Salad",
    "Salad",
]
```
The other groupings are defined as follows:
```python
bowls = [
    "Apple Cinnamon Oats",
    "Falafel Bowl",
    "Pesto Pasta Bowl",
    "Thai Noodle Bowl",
    "Burrito Bowl",
    "Red Chile Braised Pork Bowl by Rick Bayless",
    "Pineapple Coconut Chia Pudding",
    "Berries & Granola Greek Yogurt",
    "Almond Butter Oats Bowl",
    "Chickpea Tikka Masala",
    "Chicken Tikka Masala",
    "Grilled Chicken & Veggie Bowl",
    "Truffle Couscous with Chicken",
    "Bowl",
    "oats",
    "yogurt",
]

sandwiches = [
    "Italian Turkey Wrap",
    "Turkey, Apple & White Cheddar Wrap",
    "Sandwich",
    "Wrap",
]

snacks = [
    "White Cheddar Cheese",
    "Cookie Dough Bites",
    "Dark Chocolate Trail Mix",
    "Chips & Guac",
    "snack",
    "trail",
    "dough",
]

proteins = [
    "Falafel",
    "Grilled Chicken",
]

drinks = [
    "La Colombe Vanilla Draft Latte",
    "La Colombe Triple Draft Latte",
    "Brew Dr. Kombucha Love",
    "Spindrift Raspberry Lime",
    "drink",
]
```
With the item groups defined, let’s compute the fraction of reviews mentioning each group:
```python
def count_item_mentions(reviews, item_group):
    """Compute the fraction of reviews mentioning at least one item
    in `item_group` (each review is counted at most once)."""
    count = 0
    for text in reviews:
        text = text.lower()
        for item in item_group:
            item = item.lower()
            if item in text:
                count += 1
                break
    return count / len(reviews)
```
```python
item_mention_counts = {
    "salads": count_item_mentions(df.text, salads),
    "bowls": count_item_mentions(df.text, bowls),
    "sandwiches": count_item_mentions(df.text, sandwiches),
    "snacks": count_item_mentions(df.text, snacks),
    "proteins": count_item_mentions(df.text, proteins),
    "drinks": count_item_mentions(df.text, drinks),
}
item_mention_counts
```

```
{'salads': 0.7052023121387283,
 'bowls': 0.10404624277456648,
 'sandwiches': 0.07514450867052024,
 'snacks': 0.10982658959537572,
 'proteins': 0.023121387283236993,
 'drinks': 0.005780346820809248}
```
```python
fig, ax = plt.subplots()
ax.barh(list(item_mention_counts.keys()), list(item_mention_counts.values()))
ax.set(xlabel="Fraction of times mentioned in a review")
```

*(Figure: horizontal bar chart of the fraction of reviews mentioning each item group.)*
Salads are the overwhelming favorite. Have these trends changed over time? Below is the annual share for each item type, i.e. the fraction of a given year’s reviews that mention the item group. Points sized by the total number of reviews that year are again overlaid.
```python
def annual_item_share_plot(ax, item_group, title):
    annual_share = grouped_by_year_and_loc.text.apply(
        count_item_mentions, item_group=item_group
    ).unstack()
    annual_share.plot(ax=ax)
    ax.set(xlabel="Year", ylabel="Annual share", title=title)
    for i, loc in enumerate(annual_counts.columns):
        col = "C" + str(i)
        point_sizes = [5 * count for count in annual_counts[loc]]
        ax.scatter(annual_share.index, annual_share[loc], s=point_sizes, c=col)


fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(9.6, 7.2), sharex=True, sharey=True)
annual_item_share_plot(axs[0][0], salads, "Salad Popularity")
annual_item_share_plot(axs[0][1], bowls, "Bowl Popularity")
annual_item_share_plot(axs[1][0], sandwiches, "Sandwich Popularity")
annual_item_share_plot(axs[1][1], snacks, "Snack Popularity")
```
Salad popularity has been decreasing but still holds the largest share. Bowl popularity shows the opposite trend, though bowls still have a much smaller share. Sandwiches seem popular at the airport, with a sudden drop in 2020 (perhaps due to the pandemic, as sandwiches are not usually eaten with utensils?). Finally, snacks were initially popular (perhaps because of vending machine familiarity?) but have somewhat declined, except for 2021, where there are only 3 data points.
### What Phrases are Common?
We can answer this question with more sophisticated language processing.
Using NLTK, the Natural Language Toolkit, we can parse individual words out of a string of unstructured text, or more generally ngrams, sequences of n words strung together.
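For instance, the bigrams (2-grams) of a short, made-up token list pair each token with its successor:

```python
from nltk.util import ngrams

list(ngrams(["salads", "are", "so", "fresh"], 2))
# [('salads', 'are'), ('are', 'so'), ('so', 'fresh')]
```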
Before any of that, let’s define a function that combines all the reviews in the df.text column of a dataframe:
```python
def combine_review_text(df):
    review_text = ""
    for text in df.text:
        review_text += text + "\n" * 3
    return review_text


text = combine_review_text(df)
print(text[:1200])
```
```
We received a Farmer's Fridge delivery straight to our doors! We got a text when the driver was close and when the delivery was made. It was so easy and super convenient.
Their menu includes everything from salads to breakfast items to dinner plates and snacks. We received a variety and not only was it delicious, it was healthy and filling! Salads are typically not our first choice, but we were so impressed with how much flavor was packed into each one and am already craving them again!
We have tried other meal delivery services but were impressed with Farmer's Fridge the most. The convenience of the order, the ease of preparing the meal, and the overall quality and healthy options puts them above the competition. We can't wait to try more of their menu! Already eyeing the Buffalo Mac & Cheese and the Enchilada Roja plates......
See all photos from Josh C. for Farmer's Fridge

Repeat buyer... awesome food, great service... highly recommend!!! AAAA ++++

The salads are fresh and very tasty

So happy I found Farmers Fridge in MKE airport!!!! As a flight attendant it was wonderful finding their "fridge" full of many, many fresh, nutritional and delicious food items!! I couldn't d
```
Next, let’s import nltk and get a list of words.
Technically, in language-processing terms, we are extracting tokens, since objects like punctuation are retained.
It’s also useful to drop the most common words like “to”, “then”, and “the” using nltk’s library of English stop words, and to convert everything to lowercase for simplicity (the stopwords and punkt resources must be downloaded first):
```python
import nltk

nltk.download("stopwords")
nltk.download("punkt")
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
words = nltk.word_tokenize(text)
filtered_lowercase_words = []
for word in words:
    word = word.lower()
    if word not in stop_words:
        filtered_lowercase_words.append(word)
filtered_lowercase_words[:12]
```
```
[nltk_data] Downloading package stopwords to /home/adam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/adam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

['received',
 'farmer',
 "'s",
 'fridge',
 'delivery',
 'straight',
 'doors',
 '!',
 'got',
 'text',
 'driver',
 'close']
```
The result is terse, but the meaning of the sentences is maintained.
Now we’re ready to count the most common words and display them. I’ll define functions for general ngrams for later use; even though a “onegram” is overkill for the list of words we already have, it keeps the code general.
```python
from nltk.util import ngrams


def count_ngrams(ngram_list):
    """Return a dictionary of ngram, count pairs."""
    counts = {}
    for ngram in ngram_list:
        if ngram in counts:
            counts[ngram] += 1
        else:
            counts[ngram] = 1
    return counts


def sort_dict_by_vals(d):
    """Sort a dictionary by its values, ascending."""
    return {k: v for k, v in sorted(d.items(), key=lambda item: item[1])}


def print_top_vals(ngram_counts, min_count, min_length):
    """Print ngrams that appear at least `min_count` times and whose
    components are each at least `min_length` characters long."""
    for ngram, count in ngram_counts.items():
        if each_elem_is_long_enough(ngram, min_length) and count >= min_count:
            print(ngram, count)


def each_elem_is_long_enough(ngram, min_length):
    for elem in ngram:
        if len(elem) < min_length:
            return False
    return True
```
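As an aside, the standard library’s `collections.Counter` could replace `count_ngrams` and `sort_dict_by_vals`; I keep the hand-rolled versions for transparency. A minimal equivalent:

```python
from collections import Counter

# Counter tallies hashable items; most_common() sorts by count, descending.
Counter(ngrams(filtered_lowercase_words, 2)).most_common(5)
```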
```python
word_counts = list(ngrams(filtered_lowercase_words, 1))
word_counts = sort_dict_by_vals(count_ngrams(word_counts))
print_top_vals(word_counts, min_count=50, min_length=3)
```
```
('great',) 50
('one',) 51
('like',) 53
('delicious',) 57
('farmer',) 60
('healthy',) 65
('good',) 66
('vending',) 77
('fridge',) 85
('salads',) 87
('machine',) 92
('fresh',) 97
("n't",) 100
('food',) 111
('salad',) 220
```
Unsurprisingly, salad is the most common word. Positive words include “great”, “like”, “delicious”, “healthy”, “good”, and “fresh”. Because the overall rating across all reviews is high, around 4.3, it’s also unsurprising that there are no negative words among the most frequent. We can repeat the above for bigrams and trigrams, which are strings of two and three words, respectively.
```python
bigrams = list(ngrams(filtered_lowercase_words, 2))
bigram_counts = sort_dict_by_vals(count_ngrams(bigrams))
print_top_vals(bigram_counts, min_count=10, min_length=3)
```

```
("n't", 'wait') 10
('could', "n't") 10
('southwest', 'salad') 10
('napa', 'salad') 12
('revival', 'food') 13
('food', 'hall') 13
('salad', 'vending') 15
('vending', 'machines') 16
('farmers', 'fridge') 19
('vending', 'machine') 60
```
```python
trigrams = list(ngrams(filtered_lowercase_words, 3))
trigram_counts = sort_dict_by_vals(count_ngrams(trigrams))
print_top_vals(trigram_counts, min_count=5, min_length=3)
```

```
('north', 'napa', 'salad') 5
('high', 'protein', 'salad') 5
('honey', 'mustard', 'dressing') 6
('salad', 'vending', 'machine') 13
('revival', 'food', 'hall') 13
```
These are less informative, as they are not a positive word plus a noun; instead, they are mostly names of menu items or expected phrases, like “Farmer’s Fridge” and “vending machine”.
Since most of the reviews are positive, it’s interesting to repeat the ngram analysis for negative reviews only. Only about 10% of reviews have a rating of 2 or lower.
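The 10% figure is a quick check away:

```python
# Fraction of reviews rated 2 stars or lower.
(df.rating <= 2).mean()
```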
```python
negative_text = combine_review_text(df[df.rating <= 2])

words = nltk.word_tokenize(negative_text)
filtered_lowercase_words = []
for word in words:
    word = word.lower()
    if word not in stop_words:
        filtered_lowercase_words.append(word)

word_counts = list(ngrams(filtered_lowercase_words, 1))
word_counts = sort_dict_by_vals(count_ngrams(word_counts))
print_top_vals(word_counts, min_count=5, min_length=3)
```
```
('prices',) 5
('like',) 5
('revival',) 5
('hall',) 5
('back',) 5
('little',) 5
('open',) 5
('times',) 5
('lettuce',) 5
('got',) 6
('would',) 7
('vending',) 8
('lunch',) 8
('machine',) 13
("n't",) 13
('food',) 14
('salad',) 20
```
While we don’t see obvious negative terms like “bad”, we can infer from the given words that these reviewers didn’t like the “prices”, “little” [portions?], [available?] “times”, etc. Similarly, from the bigrams and trigrams,
```python
bigrams = list(ngrams(filtered_lowercase_words, 2))
bigram_counts = sort_dict_by_vals(count_ngrams(bigrams))
print_top_vals(bigram_counts, min_count=2, min_length=3)
```

```
('salad', 'vending') 2
('every', 'day') 2
('open', '9am') 2
('used', 'get') 2
('salad', 'portions') 2
('brown', 'lettuce') 2
('come', 'back') 3
('could', "n't") 3
('revival', 'food') 5
('food', 'hall') 5
('vending', 'machine') 7
```
```python
trigrams = list(ngrams(filtered_lowercase_words, 3))
trigram_counts = sort_dict_by_vals(count_ngrams(trigrams))
print_top_vals(trigram_counts, min_count=2, min_length=3)
```

```
('salad', 'vending', 'machine') 2
('revival', 'food', 'hall') 5
```
we can infer that these reviewers were not impressed by the salad portions or by the vending machine itself (perhaps they were just not expecting one?).
## Future Work
These analyses would benefit from more review data and more use of data already collected in reviews. The Selenium web-scraping approach is very flexible and generalizable. Future work can automate it to search for all fridge locations in Chicago and scrape all locations with 1 or more reviews. It is also easy to extend this approach to other cities, like New York City. I collected lots of metadata, like how many other reviews a given reviewer has posted or whether they’ve obtained the “Elite” Yelp user status. In addition, it may be useful to take a more detailed look at reviews that have lots of engagement, measured by how many other Yelpers found the review useful, funny, and/or cool.
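Automating the location search could, as one sketch, go through Yelp’s Fusion API rather than HTML scraping (an assumption on my part; the analysis above scrapes HTML, and the API key below is a placeholder):

```python
import requests

API_KEY = "..."  # placeholder: a free Yelp Fusion API key

# Search for Farmer's Fridge locations in Chicago, keeping any with reviews.
resp = requests.get(
    "https://api.yelp.com/v3/businesses/search",
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"term": "farmer's fridge", "location": "Chicago, IL", "limit": 50},
)
fridges = [b for b in resp.json()["businesses"] if b["review_count"] >= 1]
print([b["name"] for b in fridges])
```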
Additional data could be estimated from the variables collected, for example reviewer sex from an analysis of their names. This could be useful information if Farmer’s Fridge wanted to better target a particular demographic. More in-depth natural language processing could also improve the accuracy of the popularity metrics, e.g. salad popularity. Rather than just looking at whether “salad” was mentioned, analyzing the sentiment around that mention would give better information.
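A minimal sketch of that idea, assuming NLTK’s VADER sentiment analyzer (not used above; it requires the separate `vader_lexicon` download):

```python
import nltk

nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()


def mention_sentiment(reviews, item):
    """Average VADER compound score (-1 to 1) of sentences mentioning `item`."""
    scores = [
        sia.polarity_scores(sentence)["compound"]
        for text in reviews
        for sentence in nltk.sent_tokenize(text)
        if item in sentence.lower()
    ]
    return sum(scores) / len(scores) if scores else None


mention_sentiment(df.text, "salad")
```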