If the average person knows anything about economic forecasting, it’s that it’s a job for the number-crunching experts. And however the analysts assemble their latest predictions about employment and the nation’s wealth and health, we assume they aren’t simply googling it. However, a current SBE research project suggests that we shouldn’t discount Google entirely. Stephan Smeekes, an NWO Vidi award-winning associate professor in the Department of Quantitative Economics at SBE and a member of the Data-Driven Decision-Making research theme, is working with SBE colleagues to find out whether parts of the economic forecasting puzzle could be hidden inside the mountains of data generated by our search engine queries.
You’re currently studying people’s search engine behaviour and what it tells us about unemployment and the economy. How are you going about it?
Stephan Smeekes: “This is a work in progress; a joint project with CBS, the Centraal Bureau voor de Statistiek, or Statistics Netherlands. I’m working with Professor Franz Palm [former dean of SBE], Jan van den Brakel, who works both at CBS and part-time in our department, and our PhD student Caterina Schiavoni, and we are looking at using alternative data sources for producing official unemployment statistics.
“Every month, CBS publishes a report on unemployment data, and to compile it they use surveys. They essentially pick up the phone, call people, and ask, ‘Have you been unemployed since the end of last month?’ Because they know all the characteristics of the people that they ask, they can extrapolate what they find against the general population.
“These are high-quality data, but they are very time-intensive to gather, and very costly, and moreover people are not always interested in answering questions like this. So CBS are currently looking for other data sources, and we thought it would be particularly interesting to look at Google search data. Google publish regular reports on the popularity of search terms, and you would expect that when people become unemployed, or just before, they’ll start searching for things like information on unemployment benefits, or advice on how to write a CV, or job vacancies, those kinds of things. Therefore, we would expect to see that as more people become unemployed, the popularity of these search terms would increase.
“But the problem with these data is that there is a lot of noise in there, and moreover we don’t really know how Google produces these data; it’s not very open and explicit. And they might change the way they produce it without us knowing. So, there’s no quality guarantee, and additionally there are a lot of reasons why people use those search terms that are not related at all to them being unemployed.
“As an example – and I just looked this one up again – if you look at the search term ‘CV’ in the Netherlands, you see a very strong seasonal pattern. Every winter there is a high number of searches compared to summer. I found myself wondering about that – and then I remembered that CV can also mean centrale verwarming, or central heating, and indeed, every winter, people in the Netherlands start searching for the term CV! Or take the search term ‘jobs’: there was a huge worldwide spike in October 2011… when [Apple co-founder] Steve Jobs died.
“The problem is that we have lots of data series that may have content that is useful, but there will also be lots of noise, which is not useful. We are now developing methods to extract the right information from these series without creating something full of fake relationships, basically. Which is essentially what Google themselves got recently when they predicted flu epidemics. I think they took about 50 million search terms and then looked at the ones that had the highest correlation with actual flu epidemics. That worked really well for a year, and then the model broke down completely. No one knows why, but it was likely because they were picking up things which appeared to be related to flu but were not, and in the end, reality won out. Naturally these are the kinds of things CBS would rather avoid!”
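The pitfall described in the flu example, picking the series with the highest correlation out of a very large pool, can be illustrated with a small simulation. This is a minimal sketch and not the method used in the CBS project: the function names, the pool size and the sample lengths are all assumptions chosen for illustration, and every series here is pure random noise with no real link to the target.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spurious_selection_demo(n_series=1000, n_train=24, n_test=24, seed=0):
    """Pick the noise series that best matches a noise target in-sample,
    then check how that same series does on later, unseen months.

    Hypothetical setup: the target stands in for monthly unemployment,
    the candidates stand in for search-term popularity series.
    """
    rng = random.Random(seed)
    total = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(total)]

    best_series = None
    best_train_corr = 0.0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(total)]
        corr = pearson(candidate[:n_train], target[:n_train])
        # Keep whichever candidate looks most related in the training window.
        if best_series is None or abs(corr) > abs(best_train_corr):
            best_series = candidate
            best_train_corr = corr

    # Correlation of the "winner" on data it was not selected on.
    test_corr = pearson(best_series[n_train:], target[n_train:])
    return best_train_corr, test_corr


if __name__ == "__main__":
    train_corr, test_corr = spurious_selection_demo()
    print(f"in-sample correlation of selected series: {train_corr:.2f}")
    print(f"out-of-sample correlation of same series: {test_corr:.2f}")
```

Because the "winning" series is selected only for how well it happened to match in the first window, its apparent relationship with the target typically does not carry over to the second window. With a larger pool of candidates, the in-sample match tends to look even more convincing while remaining just as meaningless.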
Will CBS some day publish an alternative unemployment report based on Google searches rather than their traditional approach?
SS: “No, we are not going as far as to say that. At the moment, the idea is that we could use it as supplementary data, which may mean we can make estimates more precise, or it could mean that at some point they could interview fewer people, and not need to run such a large survey, which would save them quite a lot of time and money.”
What are the challenges, besides making sure that you don’t confuse people’s CVs with their central heating?
SS: “The main issue is that we want to build something that can be used not just with Google Trends but with any kind of data. I mean, who knows whether Google will end up with privacy problems similar to Facebook’s and have to withdraw access to this data. So we don’t want to rely too much on this particular data. We want to create methods where you can essentially put everything in, because there might be many other data sources that would potentially work, and basically get whatever is useful out of those data.”
“It’s hard to say when CBS reports might reflect this data, because it’s still ongoing work. And at the moment Caterina is trying all kinds of ways to get the information out of this data. We are not ready yet to have a method out there, but I hope that in the next six months or so, we will. Not a finished product, but something that will be a first step.
“If it’s going to be included in official unemployment data, it needs to be validated and validated and validated, over and over again, first. But maybe CBS will be able to publish some sort of detailed statistics; they also do that occasionally.
“One thing we still need to discover is whether we actually can get information out of these data. Maybe they are so noisy that the data turn out not to be usable. But even if we can show that, it will still be a useful contribution, because at least we can tell people to stop looking at those series and try to find other data that might be better.”
Have you already spotted anything interesting?
SS: “Well, it seems that at least the methods we are looking at now are able to filter out the irrelevant information. What we see is that we can put in the data, and even if it will not help, it doesn’t hurt, which is a good thing. We still need to find better ways to extract more of the relevant information. But I’m quite happy with the fact that we don’t seem to be going in the direction of what Google did with their flu trends, namely getting fake relations that suddenly break down. Which means that CBS won’t get completely crazy results because it’s winter and people’s heating has broken down!”
If this works, could it also be applied elsewhere?
SS: “It would be useful for any situation where you want to include data in your analysis where you’re actually not sure if it’s very relevant. This research is related to what I do for my research theme projects, where it’s really about developing methods for filtering out irrelevant from relevant information. Another thing that I want to look at, with another PhD student, is more like classical macroeconomics, predicting GDP or output. We have lots of variables that might be useful or might not be useful.
“We have also played with the idea of applying it to more general text-mining, even though of course there’s lots of noise there. In marketing, for instance, people use text-mining a lot. While I don’t have any concrete applications in mind, this is something where it could be useful to apply these methods.”
You’re a member of SBE’s Data-Driven Decision-Making research theme. In your view, what’s the value in working as part of a theme?
SS: “It creates a way to connect with people that you wouldn’t otherwise connect with. We all have co-authors that we work regularly with, but there may also be other people doing things related to what we do, and it’s hard to get connected to them because we’re all in our own groups and we have our own seminars. To me, that’s the most important thing that these research themes can do: helping us to work with people where there are natural connections. It’s not about some notion that everyone has to work together all the time and we all have to be a happy group! But there are many natural connections, I think, that we can exploit, and I hope the theme will allow us to create these links. And, of course, we can also improve connections to the outside world.”
How easy is it to work with someone whose methodology is different to your own?
SS: “The major difficulty is language. We’re all speaking English, sure, but we all speak different languages in our academic fields. Because all fields use their own terminology, it’s hard to realise that so many things are connected. When I read the abstract of a paper published by people in other fields, I may not realise that it’s actually connected to what I do, simply because of the way things are written; the same is likely true the other way around. I think we need to bridge this language gap. That is one of the major things that I think the theme can help with. It’s not necessarily difficult to work with other groups, as long as you can cross the language barrier.”
Without a research theme, perhaps interdisciplinary initiatives would just be something on your wish list.
SS: “Exactly: it takes time and effort, and we are all busy. I think everything that facilitates this kind of interaction is very helpful. Without a theme, it’s much harder, because who are you going to approach? You’re not going to ask everyone you see, ‘Hey, can we work on something together?’ But if you regularly have [research theme] meetings and you read about what other people are doing, and it’s explained in a common language, it becomes much clearer. We all have to learn from each other who has what expertise, so that if you have a problem, you know who you can go to.”
Are research themes just another way of encouraging interactions among academics in which they meet and exchange ideas? Isn’t this something you already do at academic conferences, for example?
SS: “Yes, but the difference is that conferences are typically in your own field, so that’s who you’re talking to there. I work mainly on statistical methodology, so in terms of developing methods for data analysis, I don’t always know what these methods can be used for. In the field of macroeconomics, which is fairly close to my field, I might know where it’s useful, but the same method might be equally useful in marketing research or even in biology. If I don’t know this, then I can’t apply it there – and of course I don’t go to conferences in those fields.
“I think interactions outside your field, such as via research themes, help to make research stronger. It can show me where I need to improve methods to be able to use them in other fields, and people from other fields might encounter helpful methods that they’ve never seen before. That’s hard to find out just by going to your own field’s conferences.”
Like a lot of things in scholarship, research themes seem like a process of planting slow-germinating seeds. How long will it take for the Data-Driven Decision-Making theme to bear fruit?
SS: “It’s hard to say. I think we can expect small things to start to come soon, but larger projects are going to take time. We want to start with small projects, multiple small groups of people from different disciplines getting together, and then, based on that, they might grow into one of our big projects or spin off into different kinds of projects.”
Last year you won an NWO Vidi award, worth 800,000 euros. Are you a little bored of all the attention by now?
SS: “No, not at all! Of course, life goes on and research goes on and all the meetings go on, and in that sense, not that much has actually changed. But certainly it’s nice; we’ve been able to hire PhD students, for example, and I’ve spent more time on my research than I would have otherwise, so these are all good things. But it’s not as though data is suddenly important where it wasn’t before. It hasn’t changed my life in that sense.”
All of us non-academics have recently heard somewhere that the future is all about data. What would you say to that?
SS: “It’s nice to hear, but it’s also nothing new!”