ML for scientific insight

Building ML models is easy; answering science questions with them is hard. This short talk will introduce common issues in applying ML, illustrated with real failures from astronomy and healthcare - including some by the speaker. We hope sharing the lessons learned from these failures will help participants build useful models in their own field.

Transcript

Thanks, Jesper. And thanks, everybody, for being here. Real quick, just before we get started: can you give me a tick in the chat if you are a scientist - a literal, academic, doing-science scientist - and a cross if you're anything else: a product person, a data scientist, somebody building stuff that's actually being useful, and so on. I'd be curious.

Quite a few. Jesper, how are we looking? I can't see the chat while I'm screensharing. Okay, a lot of crosses. So we're mostly product people - which means I'll frame this a little differently.

The aim for this talk - and that's why I asked the question - is to illustrate some issues that I've run into in my work trying to do science, actual science, with machine learning, and how that can be uniquely difficult. A lot of the computer science and data science research in academia focuses on benchmarks - ImageNet, getting your paper into NeurIPS - and conferences like PyData, as I think the quick poll shows, tend to focus a lot on commercial applications: I'm guessing many of you are people at companies building things. A lot less of it focuses on science. So scientists often take advice aimed at computer scientists and product people, and then fall into a few traps that the advice wasn't really designed around. But since most of you aren't scientists, I'll change the words around a little and explain what's different for scientists, and how some of these issues might also affect the kind of products you're building.

Very quickly, who am I? My name is Mike Walmsley. I'm a researcher at the University of Manchester, an astronomer, and the lead data scientist for a project called Galaxy Zoo, which uses volunteers to classify very large numbers of galaxies - millions of them. We have hundreds of thousands of volunteers contributing their effort to saying "oh, this galaxy is smooth, this one's featured", and so on, and they then get asked different follow-up questions about those galaxies. Ultimately I'm trying to answer questions about how galaxies work, so for me the model is a means to an end: answering a science question. This talk is very much about getting to that end, and the ways the model can go wrong. I know most of you aren't astronomers, but I think these issues are quite common, so for each issue that comes up I've also tried to find where it could be a problem in the case of COVID.

Very briefly, these are the issues I want to talk about. I've arranged them loosely on a scale from "this is a general ML problem you'd expect to affect everyone" to "this is really specific to the cultural context of science". I'll run through them quickly, because it's a short talk and we're a little over time.

The first one is shortcut learning, which is essentially when the model works, but not in the way that you want it to. As a quick example, here are two pieces of astronomy data from a radio telescope called CHIME. The aim of the model is to tell the difference between human-made junk - that's the one on the left - and a real signal from space, from these mysterious extragalactic sources we don't understand. The obvious thing you'd like the model to learn is "this is a vertical strip of some particular shape, and this one is junk". But actually, a very good way to tell the difference is simply to ask: what is the standard deviation within this image? That alone gives you an almost perfect classifier between the two. So you can solve the problem, but not in the way that you actually want. And as scientists, we need to get at that causal effect, because when we then try the model on new datasets - fainter data - we need to make sure it still works. I think that's important for product people too: if your model is working for a coincidental reason, it isn't going to generalize well to new datasets.
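To make the shortcut concrete, here is a minimal, hypothetical sketch - synthetic arrays rather than real CHIME data, and not the speaker's actual code - showing how a single summary statistic like the per-image standard deviation can look like a near-perfect classifier while ignoring the morphology you actually care about:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical setup: `images` is an (N, H, W) array of cutouts and
# `labels` marks real astrophysical signals (1) vs human-made junk (0).
rng = np.random.default_rng(0)
images = rng.normal(size=(200, 64, 64))
images[:100] *= 3.0                      # stand-in for "junk" with higher variance
labels = np.array([0] * 100 + [1] * 100)

# The "shortcut": one summary statistic per image, no morphology at all.
per_image_std = images.std(axis=(1, 2)).reshape(-1, 1)

clf = LogisticRegression().fit(per_image_std, labels)
auc = roc_auc_score(labels, clf.predict_proba(per_image_std)[:, 1])
print(f"AUC using only the standard deviation: {auc:.3f}")
# A near-perfect score here is a warning sign: the model may be keying on
# noise/brightness statistics rather than the vertical-strip shape you
# actually care about, and may fail on fainter data.
```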
The second one is train-test leakage. Building on Valeria's talk, this is where you're careful to partition up your dataset in sensible ways - but the real world often doesn't fit this nice row-based idea of data. For example, here are four simulated galaxies from a real astronomy paper, Ćiprijanović et al. 2020, on dividing galaxies into different classes. Except they're not four different galaxies: they're four views of the same simulated galaxy. So if you do your train-test split or your cross-validation on this data, you'll be mixing up different views of the same galaxy between your train and test sets, and your model can cheat that way. This really happened in that paper, although it was fixed in follow-up work, and if you're interested in deep learning in astronomy I really recommend that group's work, here and in general, as a really nice example of how to do it. (There's a quick sketch of a group-aware split at the end of this section.)

The third one is around what I call begging the question. This is where you see that your model's predictions depend on some feature, and you write it up in your paper: "oh yeah, this is the important thing". Or, in a product context, you're a data scientist trying to understand what is driving your conversion rate, and you find "oh, this thing is what's driving conversion events". But that's not what's really happening. What's really happening is that you're rediscovering the biases of your model. To make that concrete: in astronomy, it's generally true that spiral galaxies are pretty blue, and elliptical galaxies are pretty red. That means the model can learn, just from the colour information, "okay, if this is blue and I don't know what it is, I'm going to call it a spiral". Then when you look through your predictions, you say "look, the machine learning is finding all these blue spirals - spirals are super blue!" That's kind of right, but also wrong, because you've biased your classifier in the first place; you're just rediscovering what your classifier has learned from your general population. This is very tricky to fix, but it's mostly about controlling the information that you pass to the model. So, for example, we might just pass in grayscale images here, or we might shuffle the colour channels.
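One way to read the colour trick just described is as a preprocessing step: remove or scramble colour before it ever reaches the model, so a "blue means spiral" shortcut can't be learned in the first place. A minimal sketch, under that assumption and with hypothetical image arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

def to_grayscale(image: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, 3) image to (H, W) by averaging the channels,
    removing colour as a feature the model could latch onto."""
    return image.mean(axis=-1)

def shuffle_channels(image: np.ndarray) -> np.ndarray:
    """Randomly permute the colour channels of an (H, W, 3) image,
    keeping morphology but scrambling colour (used as an augmentation)."""
    return image[..., rng.permutation(image.shape[-1])]

# Hypothetical galaxy cutout.
galaxy = rng.random((64, 64, 3))
print(to_grayscale(galaxy).shape)      # (64, 64)
print(shuffle_channels(galaxy).shape)  # (64, 64, 3)
```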
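And going back to the train-test leakage issue: a minimal sketch of a group-aware split, assuming each row carries an identifier for the underlying simulated galaxy (an assumption for illustration, not the cited paper's actual pipeline), so that all views of one galaxy land on the same side of the split:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 4 views each of 50 simulated galaxies.
n_galaxies, n_views = 50, 4
galaxy_ids = np.repeat(np.arange(n_galaxies), n_views)  # which galaxy each row belongs to
features = np.random.default_rng(0).normal(size=(n_galaxies * n_views, 16))
labels = np.repeat(np.random.default_rng(1).integers(0, 2, n_galaxies), n_views)

# Split by galaxy, not by row, so no galaxy appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(features, labels, groups=galaxy_ids))

assert set(galaxy_ids[train_idx]).isdisjoint(galaxy_ids[test_idx])
print(f"{len(train_idx)} training rows, {len(test_idx)} test rows, no shared galaxies")
```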
The last issue I wanted to highlight is what I call sandcastles, and this is really a cultural thing rather than a code problem. In astronomy, we tend to make many, many models for classifying galaxies, and in COVID there's a lovely review by Wynants and collaborators which found some 500-odd models produced for essentially diagnosis or prognosis of COVID. It turns out that the vast majority of these were never validated independently. People were just throwing out hundreds of models, and then they were useless. It's largely considered a massive failure of our industry that, despite big promises, we were not very helpful, broadly speaking, in addressing COVID. The reason for this - and I think this might be controversial given Jesper's talk - is that there are very minimal incentives, really, in the short term, for doing reproducible, carefully validated science. In the sense that: I need to get papers out, I need my grant application funded, I need my conference talk, I need to keep practically progressing my science. I don't really have time to carefully benchmark everything or make it a nice Docker package.

So my advice, apart from trying to take a slightly longer view for the reasons Jesper said, is: just be honest about what you're trying to do. If you're a scientist and you want to be a data scientist, that's okay - write data science papers. But if you're genuinely trying to push the limits of human knowledge, then maybe the best thing in this imperfect world is to use a few simple tools: put it on GitHub, make a runnable Jupyter notebook, list your requirements, and so on. There's some really low-hanging fruit - which Jesper went through, and which other speakers like Gaku will go through shortly - that might help you get there.

So that was a really quick run-through of a few issues that I've found in my work doing science with machine learning. I think they might be relevant to people who are trying to build products, but who are also ultimately trying to find the answers behind why the models work - because that, in a practical sense, is science. Thank you for listening.