Post-Individual World

A Machine Learning-driven organization may say things like…

“We deliver content curated for the individual.”

“People like you also liked <insert name of recommendation>.”

“We predict that you’ll respond better to text messages than a phone call.”

“We seek to get to the N of 1.”

Great. But that’s no longer good enough.

As a music listener, there are times when I will turn on Vivaldi, Dave Matthews, or Taylor Swift (…like you don’t?). Yet one day, pulling up Spotify on my work computer, I clicked on a playlist called “Recommended for you.” The first few songs were nice, soft classical pieces that helped me settle in for a heavy focus session. But then the playlist took a turn for the unexpected — “Louis and Dan and the Invisible Band” (a great children’s music duo, I might add). I nearly spit out my coffee as the rockers passionately conveyed their desire for hot dogs.

Seriously? I’ve never listened to kids’ music on my work computer. In fact, I have only ever listened to soft piano music. You can do better than this, Spotify!

Now, Spotify wasn’t completely wrong. We do listen to kids’ music occasionally at my house. But that is almost exclusively on our kitchen Echo device after 5pm on weekdays. So the content wasn’t “wrong” for the individual in general. What was wrong was the time and the place. As humans, we are complicated. We have moods, and sometimes we have routines. Therefore, the traditional data science approach of treating us all as one-dimensional vectors will largely miss the mark.

We are in the Post-Individual world. What is true for an individual now may not be true for that same individual 5 hours from now. And data scientists need to begin taking situational cues into account in their applications.

One good example is Apple. iPhone users may have noticed that upon connecting to a specific Bluetooth device, or joining a specific Wi-Fi network, they are prompted with an app recommendation commonly associated with that context event. Traditional data science might say, “Your favorite app is Reddit – so let’s present the Reddit app!” Apple knows, however, that when I connect a Bluetooth speaker, I am not looking to pull up Reddit.

Many digital companies are aware that nothing frustrates customers more than purchasing a product or service at full price and then, a day later, being served an ad with a discount. Traditional data science may have predicted that “this person is likely to convert.” But a smart digital marketer will have hooks in place to control the timing of the ad based on context events, and will even suppress the ad when appropriate.

Companies can begin to move beyond the N of 1 (maybe toward an N of 0.1, I guess) by gathering a set of situational data which can help describe a person’s mood or patterns of behavior. There is plenty of digital information made available from web experiences and even more from IoT devices. Life events are critical sources of data that can change a person’s values, power, and ultimately decisions. A birth, home purchase, accident, divorce, or graduation are all major events that will fundamentally change a customer’s needs and availability. Companies should look to capture all of this data and store it in real time.

Rules and models can be deployed that trigger an immediate recommendation upon recognizing a particular pattern or event. This way, we don’t simply have a “Customer Record,” we have a record for every possible mood for that customer, and a set of models and scores for that mood. Perhaps a set of records for different times in the day for that customer, so that we can recommend the best experience or service for that customer at any given moment. Learning algorithms can be deployed to learn routines about the individual and even understand which devices are being engaged with at various times throughout the day.

Whatever it may be, remember that humans are complicated. Unpredictable (pun intended). But our ability to capture situational context will help create moments of comfort (and dare I say trust) between a brand and the individual.
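To make this concrete, here is a minimal Python sketch of what an event-triggered layer on top of the traditional customer record might look like. The event names, device labels, and rules below are entirely hypothetical; the point is that the lookup key is the situation, not just the customer.

from datetime import datetime

# A hypothetical situational event captured in real time
# (device connections, location, time of day, life events, etc.)
event = {
    "customer_id": 12345,
    "event_type": "bluetooth_connect",
    "device": "kitchen_echo",
    "timestamp": datetime(2022, 5, 31, 17, 30),
}

# Contextual profiles for the same customer: one record per situation,
# not one record per individual
context_profiles = {
    ("kitchen_echo", "weekday_evening"): "kids_music",
    ("work_laptop", "weekday_morning"): "soft_piano",
}

def time_bucket(ts):
    """Collapse a timestamp into a coarse situational bucket."""
    if ts.weekday() < 5 and ts.hour >= 17:
        return "weekday_evening"
    if ts.weekday() < 5 and ts.hour < 12:
        return "weekday_morning"
    return "other"

def recommend(event):
    """Look up the recommendation for this customer in this situation."""
    key = (event["device"], time_bucket(event["timestamp"]))
    return context_profiles.get(key, "default_playlist")

print(recommend(event))  # -> "kids_music"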

(originally published to LinkedIn 5/31/2022)

(Photo Credit: Kat B Photography, licensed under CC BY-NC 2.0 )

Flip Flop and Flourish

I remember back in the 2004 presidential campaign, the words “flip flop” being used critically against John Kerry for a change of position on several key political issues. As a shallow, carefree 20-something, I found the phrase always conjured a Hawaiian-shirt-clad John Kerry in sandals in my mind. Flip-flops were for the beach! Over time, I learned the truth: “flip-flopping” was a pejorative label meant to assault the character of another by suggesting a lack of grounding or confidence. Possibly bending to the whim of whatever audience is present. So it stuck. “Flip-flopping is bad. Better to form decisive opinions early and defend them with confidence.”

As I grew personally and professionally, I came to learn how pathetic such a mindset can be. In my field, Data Science, those who fail to acknowledge new information and potentially re-evaluate previous hypotheses will never get past the first rung of the ladder. Each word in “data science” is equally important: Data – the completeness and organization of information, and Science – the work of continually learning and questioning. In his book Think Again, author Adam Grant describes scientists as those “who dare to disagree with our own arguments” and notes that the profession requires “searching for reasons why we might be wrong.” It is a quest to continually seek truth and explain reality. But for some, it is a difficult path to follow.

What, then, stands in the way? Maybe pride? Acknowledging that maybe, just maybe, I didn’t have all the information at the beginning, or that my initial judgement may have been a little emotionally biased, could make me appear weak. So, I’ll convince myself that I was right, and stick to the story. Propping up anecdotal evidence and garnering support.

Does this sound like what our country and our world needs right now? People who are too proud to admit they were wrong? Time allows many great things to occur: 1) it allows new data to present itself, and 2) it allows us to rethink the problem when our emotions and distractions are under control. Flip-flopping, in my mind, should therefore not be discouraged but embraced. To me, someone who changes their position is someone who can demonstrate keeping an open mind and who is able to make decisions based on logic, new data, and relevance. And so, ironically, I have flip-flopped on my understanding of the term flip-flop. “Flip-flopping is good. Better to keep an open mind. To listen. To question. To embrace humility. To grow.”

This summer, as you put on your sandals, take a moment and think. Will you be a flip-flopper too?

7 Weeks of Rest (A Team Playbook)

Inspired by Rich Strobel’s post on LinkedIn, I wanted to take a great concept regarding Dr. Dalton-Smith’s 7 Types of Rest and bring it into action with my team at work.

The purpose of this plan is to explore the 7 different kinds of rest, and evaluate which types seem to meet unmet needs for each individual. My suspicion is that types of rest will yield different responses for each individual. By practicing each type of rest as a group, we can learn from each other and begin to take inventory of how to re-energize ourselves in the future. I’ve set the course to take 8 weeks, with the team exploring a different type of “rest” each week, and then discussing the overall feedback and impressions during the 8th and final week.

Feel free to leverage this plan with your own team, and please let me know if you have feedback or have any recommendations to improve the exercise!

As I mentioned above, the concept behind this exercise comes from a TED Talk on Rest by Saundra Dalton-Smith, MD.

SECTION 1: The Plan

Each week, the team will take time for a different kind of rest.  We will coordinate the “rest” exercise, and sometimes will schedule a concurrent “rest” session, and later discuss how it went and how we feel afterwards.  This way, each of us will learn what type of rest seems to help the most, and we can learn from others if there are steps or techniques that sound interesting for future “rest” exercises.

Week One: Creative Rest

Creative rest reawakens the awe and wonder inside each of us. Do you recall the first time you saw the Grand Canyon, the ocean or a waterfall? Allowing yourself to take in the beauty of the outdoors — even if it’s at a local park or in your backyard — provides you with creative rest.

The Exercise

  • Team Lead will Schedule 1 Hour for the team
  • First 5 minutes:  Read about Creative Rest (from the blurb above) 
  • Next 25 Minutes: ask everyone to choose 1 of 3 options:  (1) put on their shoes and go for a 20-minute walk (2) listen to your favorite music for 20 minutes (3) log into Flickr, Frieze, or Artsy or another art-centric website and peruse pieces for 20 minutes.
    • On your walk (or listening or viewing session), take time to notice something small or something new or interesting.  We will ask you to talk about what you noticed!
    • Reschedule if inclement weather is possible!
  • Last 30 minutes: Everyone hop back onto the call. One by one, share the thing you noticed. Did this help you feel (a little) more rested?

Week Two: Sensory Rest

Bright lights, computer screens, background noise and multiple conversations — whether they’re in an office or on Zoom calls — can cause our senses to feel overwhelmed. This can be countered by doing something as simple as closing your eyes for a minute in the middle of the day, as well as by  intentionally unplugging from electronics at the end of every day.

The Exercise

  • Team Lead will Schedule 1 Hour for the team
  • First 5 minutes:  Read about Sensory Rest (from the blurb above)
  • Next 25 minutes: ask everyone to unplug: step away from the screen, silence notifications, dim the lights, and close your eyes for at least a minute or two.
  • Last 30 minutes: Everyone hop back onto the call. One by one, share the thing you experienced or felt (or didn’t feel). Did this help you feel (a little) more rested?

Week Three: Emotional Rest

Emotional rest also requires the courage to be authentic. An emotionally rested person can answer the question “How are you today?” with a truthful “I’m not okay” — and then go on to share some hard things that otherwise go unsaid.

The Exercise

  • Team Lead will Schedule 30 minutes for the team
  • First 10 minutes:  Read about Emotional Rest (from the blurb above) and ask everyone to
    • Write down how they are feeling.
    • “Thinking back 6 months ago, did you expect you would feel this way by this day?”
    • If possible, find a partner on this team, or in this company and share how you feel with them.  
  • Allow for last 20 minutes to have the team share their feelings in groups of 2 or 3.
  • The following day: set up another 30 minutes to meet as a group.
    • Discuss: How was the conversation? Did it help you feel better? Was it awkward or embarrassing?

Week Four: Social Rest

A social rest deficit occurs when we fail to differentiate between the relationships that revive us and those that exhaust us. To experience more social rest, surround yourself with positive and supportive people. Even if your interactions have to occur virtually, you can choose to engage more fully in them by turning on your camera and focusing on who you’re speaking to.

The Exercise

  • Team Lead will Schedule 30 minutes for the team
  • First 10 minutes:  Read about Social Rest (from the blurb above) and ask everyone to
    • Set up a “for fun” call with friends or family over the next week.  Try to use “zoom” or meet in person if appropriate.
  • Since it takes time to schedule a call with others, it is not realistic to assume everyone will be able to do this within a week.  Maybe consider checking in voluntarily at the beginning of the  “Week Five” session to see if this exercise was beneficial to anyone.

Week Five: Mental Rest

Do you know that coworker who starts work every day with a huge cup of coffee? He’s often irritable and forgetful, and he has a difficult time concentrating on his work. When he lies down at night to sleep, he frequently struggles to turn off his brain as conversations from the day fill his thoughts. And despite sleeping seven to eight hours, he wakes up feeling as if he never went to bed. He has a mental rest deficit.

The Exercise

  • Team Lead will Schedule 30 minutes for the team
  • First 10-15 minutes: Debrief from previous week (Social Rest).  Does anyone feel like it was beneficial?
  • Next 10 minutes:  Read about Mental Rest (from the blurb above) and discuss:
    • Are there things that any of you do to “take a break” during the day?  Are you able to “let work go” at night?
    • Ask everyone to schedule at least four 15-minute breaks the following day.
    • Set up a call at the end of the following day to debrief.
  • The following day: as mentioned, hold a 30-minute meeting at the end of the day where we can share “did having 4 breaks help you feel mentally rested?”  Please be honest!

Week Six: Physical Rest

Passive physical rest includes sleeping and napping, while active physical rest means restorative activities such as yoga, stretching and massage therapy that help improve the body’s circulation and flexibility.

The Exercise

  • Team Lead will Schedule 15 minutes for the team
  • First 10 minutes:  Read about Physical Rest (from the blurb above) and encourage w/in next 24 hours:
    • 20-minute yoga session (you will need a mat or a soft floor)
    • Plan for FULL 8-10 hours of sleep tonight!
  • The following day (or the next day): hold a meeting at the beginning of the day where we can share “did the active and passive rest help?”  Please be honest!

Week Seven: Spiritual Rest

The ability to connect beyond the physical and mental and feel a deep sense of belonging, love, acceptance and purpose. To receive this, engage in something greater than yourself and add prayer, meditation or community involvement to your daily routine.

The Exercise

  • Team Lead will Schedule 30 minutes for the team
  • First 10 minutes:  Read about Spiritual Rest (from the blurb above) and share that this one feels the most elusive.  
    • To me, these are things that can bring you confidence, or bring you back to center:
      • Meditation
      • Reading about science or history
      • Prayer or Practicing Religion
      • Other interesting ideas:  Genealogy/Ancestral Praise, Crying, Fellowship on a shared interest (e.g., tribal)
      • Any other ideas
  • Encourage each person to partake in 1 or more of the above over the course of the next week.
  • We will have a chance to share or discuss results from this exercise during our final meeting (mentioned below in section 2).

SECTION 2: The Results

As a group, let’s discuss this exercise and whether or not we found any value in exploring the different types of rest.

The Exercise

  • Leader will schedule a 1-hour meeting
  • 5-10 min: Leader can review each of the 7 weeks and what steps were taken.
  • 40 min: go around the room and let each individual share her/his experiences. What types of rest were beneficial? Are you likely to try any of these again? Anything that was uncomfortable or that you would not want to attempt again?
  • Last 10 min: Any other suggestions on the exercises or other ideas to improve well-being?

Leading with Authenticity

Inspiration comes from within. But before one can feel inspired, a series of needs must be met. Basic needs (food, water, and shelter) followed by the need to feel safe and then belonging. Leaders have an opportunity to create the space that fosters safety and belonging in the workplace. 

Trust is not assumed. It is earned. And a behavior that I, as an employee, have found impactful to pave the way to a safe and trusting environment is authenticity. Employees are perceptive. They can pick up on nuance and contradictions, and when they do, it can quickly erode that feeling of safety and trust that leaders have to work so hard to cultivate.

There are moments in a leader’s tenure where she or he can decide how to cascade messages or prioritize direction, and leaders were tested in 2020. The challenges of 2020 (and into 2021) were fueled by the pandemic, which disrupted revenue, careers, goals, and generally the way of life for everyone on earth. But it was also a wonderful opportunity for leaders to build trust with the sentiment that “we’re in this together.”

My inspiration today comes from learning of the passing of a great leader, Arne Sorenson. Arne was the CEO of Marriott (the 3rd CEO in 93 years) and had been battling pancreatic cancer. Not only was this a difficult year for Arne due to his struggle with cancer, but this year hit the hospitality industry harder than I can ever remember, and I can only imagine how difficult it must have been for him to manage a company that size through a situation over which he literally had no control. I remember seeing a video of Arne addressing his employees soon after COVID-19 became a household term, and I was struck by his candor and concern for his employees. Through countless testimonies of others who knew him, I knew that his message here was authentic. Here is that video…

My thoughts are with your family today, Arne.

Watch this video (above). Do you feel like you can trust this man? Do you feel like you could work for this man? I do. And I know that there are many today who mourn because of what Arne drew out of them. Inspiration. We, as leaders, can learn from Arne, and bring our authentic selves to work to create that safe and trusting place that lends itself to innovation.


(originally published on LinkedIn Feb 16, 2021)

Icario Blog

The blog at Icario is a great place to learn about health engagement strategies, particularly around using data to create behavior-changing experiences for your members. Here is a list of articles I have contributed to or been mentioned in over the years with Icario (formerly known as Revel). Give them a look!

COVID-19 Weekly Averages

EDIT (JULY 2020): I AM NO LONGER REFRESHING THIS DATA, AS THERE ARE FINALLY MORE OPTIONS TO VISUALLY OBSERVE THE TRENDS. I WILL KEEP THE POST HERE IN ORDER TO REMIND PEOPLE ABOUT FALLACIES IN DATA COLLECTION.

Lots of data has been published regarding COVID-19, but one frustrating factor (in my opinion) has been that the data is often at the “daily” level, which makes analysis difficult. The way that counties collect and publish data varies greatly, so we tend to see repeatable patterns of new cases being highest on Fridays and lowest on Mondays. These fluctuations distract from the overall theme, so one way to deal with them is to use moving averages or weekly averages.

By doing so, we can get a better sense of how cases and deaths are trending week over week.
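For anyone who wants to reproduce the smoothing, here is a minimal pandas sketch. It assumes the NY Times national file with a date column and cumulative cases and deaths columns; adjust the URL and column names to whatever source you actually use.

import pandas as pd

# Daily national data from the NY Times COVID-19 repository (cumulative counts)
url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv"
df = pd.read_csv(url, parse_dates=["date"]).set_index("date")

# Convert cumulative totals into daily new counts
daily = df[["cases", "deaths"]].diff()

# Smooth out the day-of-week reporting pattern with a 7-day rolling average
weekly_avg = daily.rolling(window=7).mean()

print(weekly_avg.tail())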

In this dashboard, it is important to remember the issues regarding data collection:

  1. Testing is not administered at random. So the number of cases is skewed based on test kits available and individuals experiencing symptoms (people not experiencing symptoms are likely NOT being tested, so the true number of cases is unknown).
  2. Deaths are not attributed to co-morbidity. We’ve seen that COVID-19 has largely affected populations that are already at risk (elderly, chronic health issues). The death counts here attribute 100% of each death to COVID-19, which is not entirely accurate. The data is still useful but must be considered in this context.

Unfortunately, Tableau Public does NOT allow for real-time data updating, so the only time that this data can be updated is when I re-publish the report using an updated data set.

For anyone interested in refreshing this viz with new data, you may do so by downloading the viz and then refreshing it with the latest data available from the NY Times GitHub repository.

Advice for aspiring data scientists: Learn SQL

My advice to data scientists hoping to get into IT or business: Learn SQL.

 

Pronounced “sequel,” SQL has been around since the 70s and remains the most popular language for interacting with databases today. Please note the subtle choice of words here…I said “interacting with databases,” not “interacting with data.” There continue to be strides made by the open source community and other tool sets that make working with data a breeze, but the work of bringing the data down from the database largely requires someone who knows a thing or two about SQL.

My understanding is that SQL was originally created so that marketers would be able to interact with databases on their own! The idea never really caught fire, though, and technical resources and analysts became the bearers of the skill over time.

“But I’m a data scientist, can’t someone else get the data for me?”

Sure, I get that point of view. But there are also some limitations to one’s productivity by abstaining from learning SQL. For example:

  1. You will be limited to only organizations large enough to staff a full-time database extract resource to pull your data down for you. Not every organization can afford to specialize data extraction apart from data science. By learning SQL, you will be able to expand your prospective workplace search by spanning into both roles.
  2. There will be misunderstandings between you and your data extraction partner, in terms of what data you need for your work. I did not say “there may be…” I said “there will be.” By learning SQL, you will be forced to understand the cardinality and sources of the data at the level of detail you require. Even if you are fortunate enough to have a data extraction partner on staff to support your data needs, you will find that your communication will benefit from a strong grasp of SQL. For example, stating that you need “sum of sales and count of distinct transactions grouped by region” will make your data partner’s life much, much easier.
  3. You will have to rely on someone else’s competing bandwidth. Imagine your boss demanding updated model results by end-of-day, but your data extraction partner is on vacation or simply busy with another project. How does that make you look? Even if that situation only happens a few times, imagine how your performance looks compared to a peer data scientist who knows SQL and can consistently meet deadlines.

 

My prediction is that SQL will likely be around for another 20 years, so learning it now will keep you relevant and valuable in the short-term and the long-term.

As far as data querying goes (e.g., SELECT statements), SQL syntax is generally common between platforms, so a Microsoft SQL Server version of SQL will be relatively similar to an Oracle version of SQL. You may not be able to completely copy a giant SQL statement and run it in another system, but it will be pretty close. As a data scientist, of all the SQL you write, 95% or more will be data querying, so don’t get too worked up about learning data INSERT or UPDATE statements across platforms.
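To make that concrete, here is a hedged sketch of the kind of SELECT statement that makes up most of that 95%, pulled straight into Python. The database, table, and column names are invented for illustration.

import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")  # hypothetical database with a transactions table

query = """
    SELECT
        region,
        SUM(sale_amount)               AS total_sales,
        COUNT(DISTINCT transaction_id) AS distinct_transactions
    FROM transactions
    WHERE sale_date >= '2016-01-01'
    GROUP BY region
    ORDER BY total_sales DESC;
"""

# Pull the aggregated result straight into a dataframe, ready for modeling
df = pd.read_sql(query, conn)
print(df.head())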

Even some of the most advanced massively parallel processing (MPP) systems use SQL: for example, Redshift, Google BigQuery, Netezza, Vertica, and Aster Data. Many of these are relatively new platforms that have chosen to support SQL.

“But what about the new data management platforms that don’t use SQL, like Hadoop and NoSQL?”

Sure. I would say to learn those platforms, as well! The point I am trying to make here is that it will do your career well to become autonomous at extracting and providing your own data. Data will never become less abundant, and by enabling yourself to pull the data from its source, you become significantly more valuable to an employer.

When you need a friend, MatchIt.

I’ve recently fallen in love with a great R package called MatchIt, and I’m going to talk about it at an upcoming MinneAnalytics conference, Big Data Tech 2017.

MatchIt was created by Ho, Imai, King, and Stuart with the original intent of measuring political figures’ behavior regarding particular topics. Interestingly, a powerful side effect of this usage is that the tool can be used to help find a “like” sub-population between two different samples, thereby recreating similar distributions and attributes across multiple populations. Not surprisingly, this has a direct application to retail testing.

A “mocked” version of my code to perform the matching can be found below. Here, I’m identifying 5 “matched” control stores for each test store. My poor choice for a blog post title refers to the tool “finding a friend” or finding a match for each specified test store.

If you’d like to see the original documentation by the authors, a great reference can be found here: https://www.jstatsoft.org/article/view/v042i08/v42i08.pdf. It includes a deep description of what’s happening within the tool. Everything made a lot more sense to me after reading the doc.

UPDATE: I’ve found a much better parameter for the MatchIt script when using it for match-pair purposes. The distance=”mahalanobis” parameter will choose nearest neighbors based on Mahalanobis distance instead of matching on propensity score, which is the default method. I much prefer my results with the mahalanobis parameter. Model code updated below to reflect this addition.


#load required packages
library(dplyr)
library(MatchIt)

#Assign "test / control" flag
testdlrsDF <- filter(subsetDF, Focus.Market != "CONTROL")
controldlrsDF <- filter(subsetDF, Focus.Market == "CONTROL")
testdlrsDF$treatment <- 1
controldlrsDF$treatment <- 0
#stack test and control rows back into a single dataframe
pairsDF <- union(testdlrsDF, controldlrsDF)


#apply Match-Pair Assignment
m.out1 <- matchit(treatment ~ latitude + metric1 + metric2 + dim1 + dim2
                  , method = "nearest", distance="mahalanobis",
                  ratio=5,data = pairsDF,replace=TRUE)

m.out2 <- m.out1$match.matrix


#construct output dataframe
pairsdlr <- data.frame(TestDealerNum = integer(), ControlDealerNum = integer())
newRow1 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,1],]$DealerNum)
newRow2 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,2],]$DealerNum)
newRow3 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,3],]$DealerNum)
newRow4 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,4],]$DealerNum)
newRow5 <- data.frame(TestDealerNum = pairsDF[row.names(m.out2),]$DealerNum, ControlDealerNum = pairsDF[m.out2[,5],]$DealerNum)
#note: actual analysis used 10 samples instead of 5...

pairsdlr <- dplyr::bind_rows(pairsdlr, newRow1, newRow2, newRow3,newRow4,newRow5)

#write data
write.csv(pairsdlr, "matchpairs.csv")
 

A Gentle Introduction into Random Forests

I can still picture exactly where I was the first time I heard someone suggest implementing a Random Forest as a modeling technique. We had hired a contractor to help with a particular modeling problem, and I literally laughed out loud and asked the individual if he was serious. Hearing it for the first time, the word “random” certainly does not do much to instill confidence in predictive accuracy. “How would something random add any sort of value?” I wondered. However, I later learned that Random Forests have proven to be an extremely effective modeling technique, able to protect against correlated variables and bias.

In this post, I will provide context around what Random Forests are and what value they bring to business.

 

Overview of Random Forests

Random Forest models are based on the Decision Tree modeling technique, which is based on splits of data rather than linear correlation. Developed by Breiman (2001), the algorithm follows a Bagging technique (also developed by Breiman several years prior) except that in addition to randomizing bootstrapped samples of data, Random Forest also randomizes bootstrapped samples of predictors (Figure 1).

 

Figure 1: Evolution of Random Forests
[Figure 1 image]

 

In Figure 1, notice how there is a single tree for the CART model. The next evolution, Bagging, employs multiple trees based on bootstrapped samples of data (James, et al, 2014). We refer to this as ensemble modeling, because we use multiple models simultaneously to determine a prediction for a single observation (Seni & Elder, 2010). Ensemble modeling has proven to occasionally yield more accurate results at the expense of model interpretability. In Bagging, notice how the top of the tree is generally the same “Important Predictor.” This leads to correlated models. The correlation can be addressed by implementing a random factor (called perturbation) which only selects a subset of predictors with each bootstrap sample. Random Forest, another ensemble method, employs this approach. In the end, these ensemble techniques combine all of their models together and export a composite score or predictor (for example, through voting), for each observation.

While it operates essentially as a black box, a Random Forest does leave us a few minor clues as to what’s going on underneath the hood. In statistics packages, there are generally some “variable importance” plots which can be conjured once a model is fit. These plots allow us to see which variables are most “interesting” but don’t necessarily explain why they’re interesting or even give a correlation sign. Also, if needed, we can generally extract a few actual “trees” or splits from within the model construct, but since there are generally so many trees, simply reviewing a handful of them closely would not be helpful and, in fact, may be misleading.
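For example, here is a minimal scikit-learn sketch (using a built-in toy dataset rather than any particular business problem) that fits a Random Forest and prints its variable importance scores:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data: a few hundred observations with 30 numeric predictors
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree sees a bootstrap sample of rows AND a random subset of predictors
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

print("Holdout accuracy:", rf.score(X_test, y_test))

# Variable importance: which predictors the forest found most "interesting"
importances = sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1])
for name, score in importances[:5]:
    print(f"{name}: {score:.3f}")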

 

Value of Random Forests

 

The value we realize from Random Forests is that they protect against correlated variables and give each predictor more of a chance to be recognized in the model rather than be overshadowed by a few strong or greedy predictor variables. This is awesome when there is high multicollinearity or a large number of predictors present. Overall, these additions lead to greater predictive accuracy (Seni & Elder, 2010). The downside of the Random Forest model is that it is not interpretable to the analyst or the business. It will be very difficult to peel back the covers and determine “why” a particular observation was classified in such a manner. The business must learn to “trust” the model through cross-validation and constant model performance monitoring.

 

 

For more Random Forest fun (I can tell you are hardly able to contain your excitement), head on over to this other author’s blog post, or one more, for more “gentle” conversations regarding your new favorite landmark in Statisticsland, the Random Forest.

 

 

 

References

Breiman, L. (2001). Random Forests, random features. Berkeley: University of California.

Chapman, C. & Feit, E. (2015). R for Marketing Research and Analytics. Switzerland: Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning. New York: Springer.

Seni, G. & Elder, J. (2010). Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions. Morgan & Claypool.

Economics of CRM Modeling

Here are several conversations I have had with many marketing leaders; maybe you can relate?

 

Conversation 1

Marketing Manager: If we get $2 in incremental sales per customer mailed, we should be able to increase our list by 10,000 and then make an additional $20,000 in incremental sales, right?

Me: Customer response and incremental sales are not linear!

Conversation 2

Marketing Manager: If you build a model, it will drive incremental sales, right?

Me: We’re already mailing nearly all of our customers.

Marketing Manager: Right, but if you build a model, it will drive more sales, though, right?

Me: No, that’s not quite how it works…just a sec…here (comes back with handy-dandy economic CRM diagram)

[Image: net contribution economic diagram]

 

Here’s the PDF, for people who like PDFs.

I built this little diagram to bring all the key metrics into a single place that I can then use to demonstrate the relationships between cost, file size, incremental sales, ROI, and, even more importantly, the NET CONTRIBUTION. Here, I define [NET CONTRIBUTION] = [Incremental Sales] – [Cost].

Many leaders get caught up on ROI, but again, ROI is not a linear thing. Once the model is built, if we invest more in going deeper in the mail file, we cannot expect to maintain the same level of ROI.  ROI might be fantastic for the first decile, but the marketing manager needs to also consider, “will I get enough volume of incremental sales from this tactic at this depth to even make the ROI worth it?”

I like this  graph because it also emphasizes that there exists some “arbitrary incremental sales ceiling.”  In other words, if we mail EVERYONE in our entire CRM database, we will probably generate a lot of incremental sales, but it will probably be at an even GREATER expense…which is why net contribution (and ROI) is nearly zero at the full size of the file.

The goal, then, of marketing analytics is to optimize on maximum net contribution. Maximum ROI will likely only be a few records translating to minimum incremental sales, and Maximum Incremental Sales will be the whole file (which will probably also be super-high cost).  So, maximum net contribution is where the focus should be.
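Here is a small Python sketch of that optimization. The response curve and cost figures are invented purely for illustration; the shape of the logic (net contribution peaks somewhere between maximum ROI and maximum incremental sales) is the point.

import numpy as np

# All figures below are hypothetical
file_size = 100_000          # customers in the CRM file, ranked by model score
cost_per_piece = 0.50        # cost to mail one customer

# A made-up diminishing-returns curve: incremental sales flatten as we mail deeper
depths = np.arange(1_000, file_size + 1, 1_000)
incremental_sales = 150_000 * (1 - np.exp(-depths / 30_000))

cost = depths * cost_per_piece
net_contribution = incremental_sales - cost
roi = net_contribution / cost

best = np.argmax(net_contribution)
print(f"Depth that maximizes net contribution: {int(depths[best]):,} pieces")
print(f"Net contribution at that depth: ${net_contribution[best]:,.0f}")
print(f"ROI at that depth: {roi[best]:.2f}")
print(f"ROI at the shallowest depth: {roi[0]:.2f} (higher ROI, far less contribution)")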

Once our “customer selection model” is established, we can use other models that “choose offers, choose times, or choose creative” to send our targeted customers in order to improve response or incremental sales for that group. For example, these types of models could include offer or product propensity models, lifecycle modeling, or creative/behavioral segmentation.  In other words, a customer selection model that chooses “who” to target won’t necessarily increase incremental sales from one campaign to the next (if circulation depth is held constant),* but an offer, timing, or creative model might be able to improve incremental sales because the “right message” and “right time” principles are addressed.

 

*Hey, I know there’s an argument to be made for improving models for customer selection and uplift modeling, etc. (which would boost incremental sales slightly on the same circulation), but that’s another discussion, mm-kay?

INFP? Consider a career in analytics.

I realize that this post will only resonate with a small number of people (3% to be exact) who associate with the Myers-Briggs personality type INFP.

For those of you who aren’t familiar, INFPs are the types of people who aren’t very task-oriented, daydream (a lot), become obsessed and forget about other important things, and may reserve some judgement because they don’t want to hurt other people’s feelings. Sounds like a great candidate for hire, am I right?!

I’ve taken the Myers-Briggs (MBTI) test several times and am consistently scored as an INFP. Then, after the test, I’m usually provided a little blurb with “career advice for INFPs” alongside relationship advice (which I clearly need help with) and other things. Within the career advice section, it always mentions social careers or art or music. Now, not surprisingly, I do enjoy music and do break out the ol’ guitar from time to time, but I have always found it curious that there is never any mention of Analytics as a career option, which has been an incredibly fulfilling path for me.

I’m usually provided a little blurb with “career advice for INFPs” … but I have always found it curious that there is never any mention of Analytics as a career option, which has been an incredibly fulfilling path for me.

Here are some examples:

  • As an INFP, I am innately curious. Analytics THRIVES in an environment where someone is curious. It is only when the analyst becomes familiar with the data that interesting findings become more apparent.
  • As an INFP, I need a creative outlet. Analytics allows me to be extremely creative through creating presentations, storytelling with the data I curate, choosing which techniques to experiment with and which paths to follow.
    • I find that having supportive leadership is incredibly beneficial. When leadership trusts me and allows me to take some time to explore, I feel that I’m able to find some really interesting patterns and information.
  • As an INFP, I like having time at work to be by myself and be thoughtful. As an analyst/data scientist, I will have occasional meetings with people when I have questions or need to learn about data or context around certain problems or when I need to present. But, I like being able to go back to my desk and get work done, too. And if I do need to meet, I prefer the smaller meetings where there is focus and some sort of goal.
  • As an INFP, I tend to become obsessed with a particular technology or problem. This allows me to ramp up and adapt very quickly and become an expert.
  • As an INFP, I strive to get my customers the answers they need. In analytics, I’ve found that my customers are most delighted when I’m able to respond to their questions within the same day. Therefore, jumping through hoops or diverging from process (occasionally) needs to happen in order for me to turn something around quickly. As an INFP, though, I have no guilty conscience about breaking process, because at the end of the day, if my customer is happy and is able to do her job better, then it’s better for the company.
    • Hey, you “Js” sorry about breaking the process. It’s fine. It won’t happen again. 😉
  • As an INFP, I like being able to learn how things work. That’s why writing code has been really fun for me. I am “fluent” in many data gathering, analysis, and coding languages, and I love being able to sit back and watch my code “run.” I was never in love with Mathematics in school, but now that I can see the value and how it fits within an interesting career for me, I’ve been much more interested in learning math.

 

I could go on and on about this. Ultimately, the reason I wanted to jot down my thoughts here is that I feel like analytics is a GREAT career option for INFPs, and I don’t feel like it is getting the attention it deserves. I’ve read several blogs from INFPs who are having a career crisis, and so I wanted to mention this in the hopes that it may be helpful for someone like me, an INFP.

 

 

(Photo credit: Daydreaming - Craig Edwards)

Running Total (Cumulative) Line Graph by Day

In the early stages of its life, Microsoft PowerBI still lacks quite a bit of the basic functionality that one would find in other data visualization tools. One missing piece is the ability to quickly perform cumulative (or running total) calculations. Fortunately, using the CALCULATE command, we are able to hack our way to a solution.

 

Begin by creating a NEW MEASURE. Right-click on the table that you want to add the measure to (or follow this tutorial if you still need help). Use the calculation template below:

CALCULATE (
    SUM ( <value> ),
    FILTER (
        ALL ( <tablename> ),
        <group-by-field> <= MAX ( <group-by-field> )
    )
)
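If you want to sanity-check what the measure should produce outside of Power BI, here is a minimal pandas sketch of the same running-total idea (column names are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=5, freq="D"),
    "value": [10, 5, 8, 12, 3],
})

# Cumulative (running total) by day, matching what the DAX measure returns per axis point
df = df.sort_values("date")
df["running_total"] = df["value"].cumsum()
print(df)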

My example pasted below:

[Image: the new measure definition]

 

 

Now, add a line graph widget by clicking the line button in the visualizations menu.

[Image: line graph widget button]

 

Now, add the new MEASURE that you created into the “Values” field and add the <group-by-field> into the “Axis” field of the tool.  In my example, I added “RunningTotal” to the “Values” field and “date” to the “Axis” field.

[Image: Values and Axis field selections]

 

Voilà. Cumulative Line Graph.  I also like to add a “Total” card at the top, just to double-check that it worked.

[Image: resulting cumulative line graph]

 

Microsoft Power BI (or is it PowerBI with no spaces…I don’t even know) has the potential to be a threat to an already saturated BI Data Visualization market, but it has a long way to go, yet.

 

Rise of the Gamer

No, I’m not talking about the Super Mario fiend or the World of Warcraft junkie. I’m talking about what unifies us as a human race. “Fun.” Humans desire fun.

By now, many of us have heard that we humans have an attention span of eight seconds (although, my pal Phil definitely has less). So, as marketers, how on earth will all of our hard-earned collective creative genius be recognized and appreciated by our customer?

Do you know how many hours and revisions we went through to get that shade of gray on the nav bar just right?! More than fifty.

One answer is out there. It’s games.

And the brands that can figure it out first will thrive in this new day and age. Creating portals and processes that tie in to gaming or character growth and recognition will create two invaluable assets:

  • Consumer experiences which are fun and memorable and which will create loyalty, curiosity, and advocacy.
  • Tons of actual human response data that can be consumed and analyzed creating an infinite testbed.

Not much more needs to be said about creating the “fun experiences.” I think we all get that finding a way to entertain and reward our customers will create strong affinities. I do, however, want to touch on how these data can then be used to create incredible outcomes.

First off, a game could be used to simulate shopping behavior.

Do you remember playing Sim City? All that time spent building a city, just to start over? Think of all that data which could be collected from that game. Why did the customer choose the hydro plant instead of the coal plant (maybe the customer is an environmental segment)? Why build a baseball field before a library (maybe the customer is a sports-minded segment)?  Did only 0.001% of gamers choose to build a harbor (maybe customers don’t see the value in a harbor)?

The data can be compiled to establish segments, determine product value, and simulate customer choice.

Second, humans can refine your analytical models.

Anyone else annoyed with Siri? Think about if Apple had invented some sort of trivia game, like “Who wants to be a millionaire.”  The list of questions could either be curated by Apple or solicited from the human playing the game. Responses would be fed back into the model and then refined over time.  At the very least, Apple should have developed some sort of feedback loop where humans could say “no, Siri, THIS is what I meant <next action>.” This feedback loop allows the algorithm to get better over time, because let’s face it, we humans are tough to figure out (and…as Boaty McBoatface helped remind us, hard to trust).

How about “voting”? Think about how Tinder has taken off (grown in popularity, I mean). It’s so easy to quickly discern taste. A game where players can quickly decide whether or not they like a certain feature could be invaluable for companies planning to release new products or enter new markets.

I’m barely scratching the surface here, guys, but the amount of value from the data we could source from interactions with games could be ENORMOUS.

10 years ago, social data was quite the rage. Companies needed to figure it out, and now, we see social marketing managers staffed within nearly every organization with even the smallest digital presence. I think we will see the same shift in gaming.

If I may be so bold, I predict that in the next 10 years, we will begin to see Gaming Strategist roles begin to pop up in organizations and agencies, as well.  These will be individuals who understand human attention, understand what creates curiosity, mystery, reward, and prowess. This individual will partner with Loyalty teams to create memorable and growth experiences, R&D teams to create market-ready products, and Sales/CRM teams to create razor-sharp targeting profiles.

 

Who’s already doing it?

Waze does it for driving conditions, and now Google is using them for their maps app. Foursquare was one of the first to begin giving users badges and status. Any type of “forum” nowadays gives the user’s rank alongside their comments. Rally Health and Novu are doing it for health tracking and living a better lifestyle. Mastercard has been creating games with a charity element attached to it.

 

Games are fun.

And games are data. The game (“fun”) component creates interest/loyalty and the side-effect of game-play data could lead to a fascinating behavioral analysis around how particular segments (or direct customers) make decisions.

Game on.  -Pete


 

Originally published on LinkedIn April 7, 2016.

This is Your Life! (brought to you by LinkedIn & Python & Gephi)

Alas, this totally awesome feature of the LinkedIn API will be discontinued sometime in May of 2015. My hope is that LinkedIn either changes its mind or allows a partner to continue lending the data to aspiring analysts (such as myself). Regardless, I wanted to take a moment and share a little bit about my experience with some extraordinarily powerful tools for explaining network data using the soon-to-be-discontinued LinkedIn API.

Before I begin, I need to give the majority of the credit for this experience to the author of this post on Linkurio. The author goes by the pseudonym, Jean, and that’s all I’m able to find out. Thank you, Jean, whoever you are! Also, thanks to Thomas Cabrol for compiling some excellent code as a place to start.

Process

Ok, to start we must begin by creating an API account, which can be done by visiting the LinkedIn developer URL: https://www.linkedin.com/secure/developer

Add a new application, and then record the API and User keys because they are needed in the Python code below.  Note: if you are unable to retrieve the user tokens, the Linkurio post provides an option on how to collect them. Here is the code I ran:

Step 1: Connect to LinkedIn API

#!/usr/bin/env python
# encoding: utf-8
"""
linkedin-2-query.py

Created by Thomas Cabrol on 2012-12-03.
Copyright (c) 2012 dataiku. All rights reserved.

Building the LinkedIn Graph
"""
#Note: run this first, and then run cleaner.py

from __future__ import division, print_function

import os
os.chdir("C://Users//Peter//Documents//code//linkedin")

print(os.getcwd()) 
import oauth2 as oauth
import urlparse
import simplejson
import codecs
 
 


# Define CONSUMER_KEY, CONSUMER_SECRET,  
# USER_TOKEN, and USER_SECRET from the credentials 
# provided in your LinkedIn application
CONSUMER_KEY = 'xxxx'
CONSUMER_SECRET = 'xxxx'
USER_TOKEN = 'xxxxx'
USER_SECRET = 'xxxxx'
 
 

OAUTH_TOKEN = USER_TOKEN
OAUTH_TOKEN_SECRET = USER_SECRET
 
OUTPUT = "linked.csv"
 

consumer = oauth.Consumer(key=CONSUMER_KEY, secret=CONSUMER_SECRET)
token = oauth.Token(key=OAUTH_TOKEN, secret=OAUTH_TOKEN_SECRET)
client = oauth.Client(consumer, token)
# Fetch first degree connections
resp, content = client.request('http://api.linkedin.com/v1/people/~/connections?format=json')
results = simplejson.loads(content)    
# File that will store the results
output = codecs.open(OUTPUT, 'w', 'utf-8')
# Loop thru the 1st degree connection and see how they connect to each other
for result in results["values"]:
    con = "%s %s" % (result["firstName"].replace(",", " "), result["lastName"].replace(",", " "))
    output.write("Peter Eliason," + con + "\n")
    # This is the trick, use the search API to get related connections
    u = "https://api.linkedin.com/v1/people/%s:(relation-to-viewer:(related-connections))?format=json" % result["id"]
    resp, content = client.request(u)
    rels = simplejson.loads(content)
    try:
        for rel in rels['relationToViewer']['relatedConnections']['values']:
            sec = "%s %s" % (rel["firstName"].replace(",", " "), rel["lastName"].replace(",", " "))
            output.write(con + "," + sec + "\n")
    except:
        pass

 

This code will connect to my LinkedIn account and DOWNLOAD my ENTIRE 1st and 2nd degree network!  Spooky, but awesome.

Now that the data has been downloaded, the next step is to clean it up.  We may run the below to remove bad characters and set everything to lowercase:

 

Step 2: Cleaner code



#!/usr/bin/env python
# encoding: utf-8
"""
linkedin-3-cleaner.py

Created by Thomas Cabrol on 2012-12-04.
Copyright (c) 2012 dataiku. All rights reserved.

Clean up and dedup the LinkedIn graph
"""
#note: run network_test.py first

from __future__ import division, print_function

import os
os.chdir("C://Users//Peter//Documents//code//linkedin")

print(os.getcwd())

import codecs
from unidecode import unidecode
from operator import itemgetter
 
INPUT = 'linked.csv'
OUTPUT = 'linkedin_total.csv'
 
def stringify(chain):
    # Simple utility to build the nodes labels
    allowed = '0123456789abcdefghijklmnopqrstuvwxyz_'
    c = unidecode(chain.strip().lower().replace(' ', '_'))
    return ''.join([letter for letter in c if letter in allowed])
 
def cleaner():
    output = open(OUTPUT, 'w')
    # Store the edges inside a set for dedup
    edges = set()
    for line in codecs.open(INPUT, 'r', 'utf-8'):
        from_person, to_person = line.strip().split(',')
        _f = stringify(from_person)
        _t = stringify(to_person)
        # Reorder the edge tuple
        _e = tuple(sorted((_f, _t), key=itemgetter(0, 1)))
        edges.add(_e)
    for edge in edges:
        output.write(edge[0] + "," + edge[1] + "\n")
    output.close()
 
 
if __name__ == '__main__':
    cleaner() 


The next part of the Python code uses a library called NetworkX to create a file format called graphml which can be imported by a network graphing tool called Gephi. NetworkX is actually capable of far more than simply converting API files to graphml, but we’ll hold off on that tangent for another post. For now, we’ll focus on Gephi and graphml.

Step 3: Create graphml file


# Defining and Visualizing Simple Networks (Python)

# prepare for Python version 3x features and functions
from __future__ import division, print_function

import os
os.chdir("C://Users//Peter//Documents//code//linkedin")

print(os.getcwd())


# load package into the workspace for this program
import networkx as nx
import matplotlib.pyplot as plt  # 2D plotting

# read the LinkedIn edge list, creating a NetworkX directed graph object g
f = open('linkedin_total.csv', 'rb')
g = nx.read_edgelist(f, delimiter=",", create_using = nx.DiGraph(), nodetype = str)
f.close()

nx.write_graphml(g,'linkedin_total.graphml')

Ok, so now we’ve got our graphml file. The next step is to import it into this tool called Gephi. You’ll need to download Gephi as an executable — it’s not a Python library or anything like that. It is a standalone visualization tool.

I’m a Windows user and I had problems getting Gephi to install properly. I was able to work around this by UNINSTALLING Java, and then reinstalling an old version of Java, version 7. After I did this, I was able to install Gephi without problems.

I’m told that Mac users are able to install Gephi with no problems. Figures, ha!

Now, after importing the graphml file into Gephi, I took these steps:

  1. On the left-hand-side, ran “Force Atlas 2.”  It takes a LONG time for the process to complete, so I cancelled it after about 10 minutes because the visualization was close enough for my needs.
  2. Activated the “Show node labels” to see who each node represented
  3. Ran the modularity algorithm in the Statistics panel (on the right). I went to the partition window (select Window > Partition) and chose to color the nodes according to their “Modularity class.”

 

I’m left with a stunning graph of my network with me listed as the center node.  Each color represents a certain cluster within my list of connections. If you look closely, you can see that some nodes have names next to them (I’ve purposefully left them obscenely small to protect the identities of my connections), but Gephi allows the analyst to zoom in and out in order to explore the network.

[Image: Gephi network graph]

After only a couple of minutes, it becomes blatantly clear what each of these clusters and colors represents. They’re a representation of ME and MY LIFE! The incredibly beautiful part of this entire process was that the analysis was entirely undirected! I had nothing to do with directing the creation of these clusters…NetworkX and Gephi did all of that for me by themselves!

To call attention to each of these clusters, I’ve gone ahead and named each cluster, here.  Each cluster represents a key time and network (aka: clique) in my life.

[Image: Gephi network graph with clusters labeled]

The Undergrad section represents all of my connections from my undergrad school, Luther College in Decorah, IA.

MSPA represents grad school connections (in another state, and 10 years after undergrad, so not much connection between those two networks!) as part of Northwestern University in Evanston, IL.

Also interesting, Best Buy had some hard years back in 2008-2010 and a lot of folks left Best Buy to join SUPERVALU, which explains the many connections between the two.

The fascinating thing about this analysis is that, through LinkedIn, I have a basic map of my Personal AND Professional life.

Conclusion

While this particular map may not be extraordinarily useful for advancing my career, it allows me to be reflective on the state of my network, and in essence, a brief story of my life.

In a business setting, however, I can see how this process might be interesting for identifying clusters, tribes, and influencers, using relationship data to understand the influence of products, lifestyles, and consumer choices.

Machine Learning is not Artificial Intelligence

Remember the first time you heard the words “Big Data?” Well, there’s a new buzzword in town — “Machine Learning.”

Ok, when I say “Machine Learning,” what happens in your mind? What images have I conjured by saying “Machine Learning?” Maybe, you saw a brief shadow of a floating, intelligent, robotic metal squid, or a flying Keanu Reeves? Maybe, you heard the name “Ah-nold” or “I’ll be back” with occasional lasers flashing in the distance.

Well, I’m sorry to say that I’m here to burst your bubble. Pop! There it goes… When we discuss it within the context of statistics and analytics, Machine Learning is NOT the same thing as Artificial Intelligence.

Machine Learning isn’t even a super simple, intuitive approach to data modeling and analytics. Machine Learning basically has to do with the fact that technology has finally come far enough to allow computers to apply brute-force methods and build predictive models that were not possible 30 or even 15 years ago. You may have actually already heard of many Machine Learning algorithms — for example: Decision Trees, Neural Networks, Gradient Boosting, GenIQ, and even K-means clustering. Many analytical tools, such as Python and R, already support these modeling techniques. The scikit-learn package in Python offers a great tutorial on Decision Trees.
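To show just how un-futuristic it is, here is a minimal scikit-learn sketch that fits one of those algorithms, a Decision Tree, on a built-in toy dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data: 150 flowers, 4 numeric measurements, 3 species
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Machine Learning" here is just a computer searching for good splits in the data
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Holdout accuracy:", tree.score(X_test, y_test))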

Ultimately, what I want you to walk away with is that, when we talk about statistics and analytics, Machine Learning isn’t some super-fancy, futuristic process that will enlighten all of your analytics capabilities. It is actually a set of functionality that already exists and can be drawn upon to create predictive models using heavy computer processing.

If you’re interested in learning more, I’d recommend the book “Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data” by Bruce Ratner. He talks about many of these techniques, what they are used for, and how to avoid pitfalls.

SQL Server to SAS ODBC: The Partial Truth

Issue:

There are occasions when your ODBC connection from SAS to SQL Server returns partial data without an error flag, and you’d never know that you were missing data from looking at the SAS log.

 

I was running a SAS script in SAS Enterprise Guide (SASEG) which uses SQL pass-through to interface and retrieve data from Microsoft SQL Server. In this instance I wanted to create a dataset of all coupon redemptions in a given time frame. Below is the SAS script I was running (grayed out a few proprietary database names just to be safe).

[Image: SAS script]

 

Checking the log, my script ran without error and created a dataset with 15,330 rows.

 

[Image: SAS log from the first run]

 

Later in the day, I wanted to change some code further down in the script. Since our database is not live, I would anticipate that a single query would return the exact same result set each time it’s executed. However, I noticed that my row counts were changing each time I ran the script. The second time I ran the script, it returned 2,902 rows, and then the third time I ran the script it returned 13,816 rows!

[Images: SAS logs from the second and third runs]

What was going on?!?

Confused, I took the exact SQL code and executed it directly in SQL Server Management Studio. This time I noticed an error: “Arithmetic overflow error converting expression to data type bigint.” However, even though there was an error, SQL Server still returned 1,737 rows.

[Image: SQL Server output from the first run]

I ran the exact query a second time. This time it returned 3,813 rows!

[Image: SQL Server output, second run]

 

It seemed that SQL Server was encountering an error on a particular record and returning all rows up to that point, but somehow the ERROR message never made its way to SAS. Also, since there was no “order by” clause in the query, SQL Server was free to return the rows in whatever order its execution plan happened to produce, which explains why my row counts kept changing.

[Image: SQL code]

After removing the “cast” function, the query returned 19,994 records without error, and it consistently returned 19,994 records on every refresh thereafter. I had discovered the issue, but I remain very concerned that SAS did not acknowledge the partial result set or the overflow error anywhere in the execution log. If my row counts had not changed, I would never have realized that there was an error in the data.

 

Possible Solutions:

1) Execute every SQL Server script in SQL Server Management Studio before pasting it into SAS. This could be a huge pain, because macros in pass-through scripts must be resolved before running. An alternative would be to stress-test the SQL scripts in SQL Server Management Studio, executing every possible permutation of the SAS query to guard against partial result sets.
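If Python is handy, one rough way to automate that sanity check is to pull the row count straight from SQL Server and compare it with what actually landed in SAS. Below is a minimal sketch using pyodbc; the connection string, table name, and counts are hypothetical placeholders:

import pyodbc

# Hypothetical connection string; substitute your own server and database.
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=MYSERVER;DATABASE=MYDB;Trusted_Connection=yes")
cursor = conn.cursor()

# Wrap the same logic used in the SAS pass-through query in a COUNT(*).
cursor.execute("SELECT COUNT(*) FROM dbo.CouponRedemptions WHERE RedeemDate >= '2015-01-01'")  # hypothetical table
expected_rows = cursor.fetchone()[0]
conn.close()

sas_rows = 15330  # the row count reported in the SAS log
if sas_rows != expected_rows:
    print("WARNING: SAS pulled %d rows but SQL Server reports %d" % (sas_rows, expected_rows))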

 

2)  <TBD> Is there an error-catching option? I have tried the errorstop, stopontruncate, and exitcode options in the proc sql statement, but none of them seem to warn the user about this error. Please comment on this post if you have a suggestion, and I will update the post! Thanks.


Control Redeemer Effect

Intro to Test vs. Control

One of the beautiful things about CRM analytics is the ability to design controlled experiments. Randomly sampling from a homogeneous group of individuals allows CRM professionals to test the impact of a particular CRM offer or message against another offer (or no offer at all), and then, after applying some statistics, conclude whether or not that offer or message drove a desired behavior, like an open, click, or conversion. This process is generally referred to as Test versus Control.

[Image: Test vs. Control diagram]
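As a simple illustration of the random assignment step, here is a minimal sketch in Python; the customer IDs and the 10% holdout rate are made up for the example:

import random

random.seed(42)  # make the assignment reproducible

customer_ids = ["C%05d" % i for i in range(1, 10001)]  # made-up customer IDs
control_rate = 0.10                                     # hypothetical 10% Control holdout

assignments = {}
for cid in customer_ids:
    assignments[cid] = "Control" if random.random() < control_rate else "Test"

n_control = sum(1 for grp in assignments.values() if grp == "Control")
print("Test: %d customers, Control: %d customers" % (len(assignments) - n_control, n_control))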

One potential pitfall that muddies the analytical waters is dealing with customers who “cross over” from the Control side to the Test side during the campaign. Take, for example, a customer who was assigned to the Control group but somehow ended up redeeming an offer that was assigned to the Test group.

*Gasp!* What do we do now? Our experiment is ruined!

Well, not to fear (sort of). One widely accepted method to handle this problem is to exclude these Control Redeemers, as they are called, from the analysis entirely. The decision to exclude them is supported by the fact that these customers are marked as “dirty” for somehow getting their hands on an offer that was intended for someone else. Herein lies the issue!

These “Control Redeemers” tend to be more engaged customers.

  • Therefore, I believe that excluding them creates an unfair bias.
  • It seems that this bias is magnified with longer campaigns.
  • In the chart below, “Best” refers to the best, most engaged customers, and “Sec.” refers to secondary customers, or customers who are not as likely to redeem anyway.

[Image: bar chart comparison of “Best” vs. “Sec.” customers]

 

The longer a campaign, the higher the impact from Control Redeemers.

  • I noticed this pattern within a recent campaign I analyzed and was having a hard time explaining why results appeared so much stronger as the campaign carried on.
  • To support my hypothesis of the Control Redeemer Effect, I conducted Sales Lift calculations using two methods:
    • Calculating Test vs. Control lift including Control Redeemers in the control group.
    • Calculating Test vs. Control lift excluding Control Redeemers in the control group.
  • For a 4-week campaign, the lift is 1.2 times higher when excluding Control Redeemers.
  • For a 52-week campaign the lift is 3.7 times higher when excluding Control Redeemers.

[Image: magnitude comparison by campaign length]

 

Conclusion

I felt my hypothesis had been validated. The longer a campaign runs, the more opportunity highly-engaged Control customers have to get their hands on a coupon (either from a friend or family member, from generous Test group customers who leave their coupons in public places, or from gracious sales associates at the store). Once they redeem, they are excluded from the analysis, which strips the most engaged customers out of the Control group, dilutes its sales, and inflates the measured lift.

There are several ways to mitigate this Control Redeemer issue.

  • Assign unique offer codes to customers. This way coupon sharing is impossible, and Test vs. Control groups stay cleaner.
  • Stratify Test vs. Control results by some sort of “Customer Value” segmentation. Some companies have a “Metals” or “Value” segmentation which ranks customers from Best to Worst. Stratifying results by this segmentation would alleviate some of the pressure.
  • Consider replacing each “Control Redeemer” with a composite of matched-pair control customers (drawn from a pool of Control customers who have not redeemed), matching on keys from an arbitrary pre-period (say 1 month, 3 months, or a year, depending on average purchase cycles); a rough sketch of this idea follows this list. Note: this option is likely to be analysis-heavy.
  • If none of the above methods is feasible, then ultimately, the “real” answer for “campaign lift” probably lies somewhere between the “including Control Redeemers” and “excluding Control Redeemers” methods.
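Here is a rough sketch of the matched-pair idea from the third bullet, assuming each customer is summarized by a single pre-period sales figure. This is a simplified assumption for illustration, not a full matching methodology:

# Hypothetical pre-period sales, keyed by customer ID.
control_pool = {"C001": 120.0, "C002": 85.5, "C003": 240.0, "C004": 60.0}  # non-redeeming Control customers
control_redeemers = {"C099": 110.0, "C098": 250.0}                         # Control customers who redeemed

def closest_match(target_sales, pool):
    """Return the pool customer whose pre-period sales are closest to the target."""
    return min(pool, key=lambda cid: abs(pool[cid] - target_sales))

replacements = {}
available = dict(control_pool)
for cid, pre_sales in control_redeemers.items():
    match = closest_match(pre_sales, available)
    replacements[cid] = match
    del available[match]  # don't reuse the same matched customer twice

print(replacements)  # {'C099': 'C001', 'C098': 'C003'}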

 

Please let me know if you have other thoughts or ideas on this measurement methodology topic!

 

Appendix

Sales Lift calculation (not standard, by any means, but a common one):

\left\{ \frac{[\text{TestGroupCampaignSales}]}{[\text{TestGroupTargetedCounts}]} - \frac{[\text{ControlGroupCampaignSales}]}{[\text{ControlGroupTargetedCounts}]} \right\} \ast [\text{TestGroupTargetedCounts}] = [\text{SalesLift}]

In the above, we include Control Redeemers in the total sales lift. To exclude Control Redeemers, we use the version below:

\left\{ \frac{[\text{TestGroupCampaignSales}]}{[\text{TestGroupTargetedCounts}]} - \frac{[\text{ControlGroupCampaignSales}] - [\text{ControlRedeemerSales}]}{[\text{ControlGroupTargetedCounts}] - [\text{ControlRedeemerCounts}]} \right\} \ast [\text{TestGroupTargetedCounts}] = [\text{SalesLift}]
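For completeness, here is a small Python sketch of both versions of the calculation; the inputs are placeholder numbers, not results from a real campaign:

def sales_lift(test_sales, test_counts, control_sales, control_counts,
               redeemer_sales=0.0, redeemer_counts=0):
    """Sales lift per the formulas above; pass the redeemer_* arguments to exclude Control Redeemers."""
    test_rate = test_sales / float(test_counts)
    control_rate = (control_sales - redeemer_sales) / float(control_counts - redeemer_counts)
    return (test_rate - control_rate) * test_counts

# Placeholder inputs, purely for illustration.
including = sales_lift(500000.0, 100000, 45000.0, 10000)
excluding = sales_lift(500000.0, 100000, 45000.0, 10000, redeemer_sales=8000.0, redeemer_counts=500)
print("Lift including Control Redeemers: %.0f" % including)
print("Lift excluding Control Redeemers: %.0f" % excluding)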


Twitter API

I’m finding out that the Twitter API has been fairly consistent between Python environments (my home and my work machines, both running Windows 7). However, I’m running into significant issues with Twitter data encoding on my work machine. By encoding, I mean which characters are interpretable or printable. In terms of Python IDE, I’m running Canopy at home and Spyder on my work machine.

For collecting Tweets, I have been using Tweepy. There are many other packages available to extract tweets from Twitter’s API, but Tweepy was very easy to install and manage. Like I mentioned above, the problem now is extracting the data from the JSON file created by Tweepy so that it is usable on my work machine. Once I figure out how to extract the data there, I will post the code for both environments.
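In the meantime, here is a minimal sketch of the direction I am heading, assuming the tweets were saved as one JSON object per line; the file names and the handful of fields pulled out are placeholder assumptions:

import io
import json

# Hypothetical file names; the input is assumed to be one tweet JSON object per line.
tweets_file = "tweets.json"
output_file = "tweets_clean.txt"

with io.open(tweets_file, "r", encoding="utf-8") as f_in, \
     io.open(output_file, "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Pull a few common fields from the tweet record (see the "map of a tweet" below).
        text = tweet.get("text", u"")
        user = tweet.get("user", {}).get("screen_name", u"")
        created = tweet.get("created_at", u"")
        # Writing through io.open with an explicit encoding avoids the console codec
        # errors I keep hitting on the work machine.
        f_out.write(u"%s | %s | %s\n" % (created, user, text))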

The picture below has been particularly helpful in extracting data from the Twitter JSON record. It’s a bit dated, but much of the record structure still holds. I found the image in another post by Mike Teczno.

Here is the map of a tweet, published by Raffi Krikorian:

[Image: Raffi Krikorian’s map of a tweet]

Context Filters with Tableau, Important for Top N Filters

This took me FOREVER to finally figure out, so I wanted to share a method to avoid a common mistake when using Tableau’s Top N or Bottom N filter.  The issue is that, often, the Top filter is applied against the entire, unfiltered source data, while the user likely expects the Top N (or Bottom N) to be selected only after all the other filters have already been applied.  Here are the steps I’ve taken with some sample data, with the ultimate goal of selecting the “Top 3 Markets in Texas.”

 

Step 1: our original data.

Here, I’ve taken a list of customers by state.

[Image: Step 1: customers in all markets]

Step 2: Filter the top 3 markets.

Right-click on LocationMetro > Filter > Top tab. Then select “By Field” and enter 3.

[Image: Step 2: top 3 metro areas filter]

 

Step 3: Results – top 3 markets overall (still need to filter on Texas).

[Image: Step 3: result, top 3 metro areas]

 

Step 4: Filter on Texas.

Wait! Our results have only 1 Market? I wanted 3 markets!

[Image: Step 4: select TX]

Step 5: Apply Context Filter on State

In order to preserve our “Top 3” filter, we must add a Context Filter. A Context Filter is applied FIRST, before any other filters on the page.

What was happening in Step 4 was that the worksheet was choosing the “Top 3” markets out of all of the states first, and then applying the Texas filter.

 

[Image: Step 5: click Add to Context]

 

Step 6: Make Sure your Context Filter didn’t reset.  In this example, make sure Texas is the only state selected.

In my experience, Tableau often resets all of the filters when a context filter is added, which requires the user to go back and re-select them. In this case, all the states were selected again, so I had to go back, unselect them all, and then choose Texas.

 

[Image: Step 6: ensure the proper filter is applied]

 

We’re done! Our chart now shows the Top 3 Markets in Texas!

Happy filtering!

 

 

 

 


Parsing Exact Target (aka SalesForce Marketing Cloud) List Detective Import Error Files

When sending emails from ExactTarget (aka: SalesForce Marketing Cloud) — you know, we’ll just call it “ET” for short — we occasionally receive error logs which contain ET subscriber keys where a small subset of targeted customers match profiles of other customers who have unsubscribed or been flagged as undeliverable.  Our goal is to process these log files and mark these customers as “undeliverable” in our parent system.  Since the log files are not formatted for data import, we need to use a parsing tool to extract the subscriber keys. I chose Python.

In the code below, I enter the path of the log file which was emailed to me from ET, and then I use re (regular expressions) to find all instances that match the subscriber key format, which is “[varchars-varchars-varchars-varchars-varchars]”.

Your version of subscriber keys will undoubtedly look different than mine, but you should be able to modify the regular expression in the re.compile() function to search for the right format. More information is available in the documentation for Python’s re module.
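As a quick sanity check of the pattern itself, here is what the expression finds in a single, made-up log line (the key is the fake one from the sample files below):

import re

# The same pattern used in the full script below.
subscriberKeySearch = re.compile(r'\[([A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+)\]')

sample = "Excluded by List Detective: [9989FFA6-BD29-8DBA-B712-C6E8ED32F0X9]"  # made-up log line
print(subscriberKeySearch.findall(sample))
# ['9989FFA6-BD29-8DBA-B712-C6E8ED32F0X9']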

Let me know what you think!

 
I have included sample files with fake data for reference:
Input File (9989FFA6-BD29-8DBA-B712-C6E8ED32F0X9 Results 988623122.txt)
Output File (cleanFile.txt)

 

 
Python Code


# -*- coding: utf-8 -*-
"""
Created on Wed Feb  4 13:59:50 2015

@author: peliason

Purpose: Parse out subscriber key from ET List detective.

"""

from __future__ import division, print_function



##PARAMETERS
inputFile = "C://Users//peliason//Documents//reference//python//ET parse subscriber//for post//9989FFA6-BD29-8DBA-B712-C6E8ED32F0X9 Results 988623122.txt"

outputFile = "C://Users//peliason//Documents//reference//python//ET parse subscriber//for post/cleanFile.txt"
###END PARAMETERS





import re
import codecs


#initialize output list
found = []

#this is my search function: five blocks of word characters separated by
#   hyphens and wrapped in square brackets, e.g. [aaaa-bbbb-cccc-dddd-eeee]
subscriberKeySearch = re.compile(r'\[([A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+-[A-Za-z0-9_]+)\]')

## Open the file with read only permit
f = codecs.open(inputFile, 'r','utf-16')

##IMPORTANT: if your input file is utf-8, then you should be able to run this
#   file open code instead (without the codecs module):
#   f = open(inputFile, 'r')


## Read the first line 
line = f.readline()

## till the file is empty
while line:
    
    found.extend(subscriberKeySearch.findall(line))
    
    line = f.readline()
f.close()

#write the list to the outputFile, one subscriber key per line
myOutput = open(outputFile, 'w')
for ele in found:
    myOutput.write(ele + '\n')
    
myOutput.close()



Python and Web Scraping (using Scrapy)

Certainly the most extensible scripting language I have ever used, Python allows the user to build powerful programs ranging from web crawling to text mining to machine learning. With the invaluable NumPy and SciPy packages, Python can tackle complex modeling tasks, while other packages such as BeautifulSoup and Scrapy allow for thorough data collection through web crawling and scraping.

In the Tableau Project below, I have provided an example (with code included on the second tab) of how web crawling and data collection work, by taking a snapshot of listings for my old motorcycle model and comparing prices across two different markets. The data was scraped using Scrapy and exported to a CSV file, which I then imported into Tableau.

[Embedded Tableau visualization: https://public.tableausoftware.com/javascripts/api/viz_v1.js]

Here is the Spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_mcy.items import CraigslistMcyItem
import re, string

class MySpider2(BaseSpider):
  name = "craigmcy2"
  allowed_domains = ["craigslist.org"]
  start_urls = ["http://minneapolis.craigslist.org/search/mca?query=vulcan 900",
                "http://phoenix.craigslist.org/search/mca?query=vulcan 900",
                "http://phoenix.craigslist.org/search/mca?query=vulcan 900&s=100"]

  def parse(self, response):
      hxs = HtmlXPathSelector(response)
      
      titles = hxs.select("//p[@class='row']")
      items = []
      for title in titles:
          item = CraigslistMcyItem()
          item ["title"] = title.select("span[@class='txt']/span[@class='pl']/a/text()").extract()
          item ["link"] = title.select("span[@class='txt']/span[@class='pl']/a/@href").extract()
          item ["postedDt"] = title.select("span[@class='txt']/span[@class='pl']/time/@datetime").extract()
          item ["price"] =title.select("a[@class='i']/span[@class='price']/text()").extract()
          item ["debug"] = "" #blank for now...before, it was: title.select("a[@class='i']").extract()
          item ["location"] = re.split('[s"] ',string.strip(str(hxs.select("//title/text()").extract())))
          items.append(item)
      return items	

Items code:

from scrapy.item import Item, Field

class CraigslistMcyItem(Item):
  title = Field()
  link = Field()
  postedDt = Field()
  price = Field()
  debug = Field()
  location = Field()

Run code (aka “Main”):


import os
import scrapy  # object-oriented framework for crawling and scraping


os.system('scrapy list & pause')                             # list the available spiders, then pause (Windows)
os.system('scrapy crawl craigmcy2 -o craigslist_peter.csv')  # run the spider and export the items to CSV


Grad School Progress

The field of analytics is constantly evolving. I have enrolled in Northwestern University’s Master of Science in Predictive Analytics program (in Evanston, IL) to give me a fresh perspective on today’s top methodologies, tools, and business case studies.  You can track my grad school progress with a Gantt chart that I created using Tableau. I will keep it up to date until I’ve earned my degree (expected 2016).