Can you briefly describe your professional and personal experience relating to water and space technologies?

I work with remote sensing of the environment, and my focus is on using radar technology for observing the land surface. Since radar instruments are very sensitive to water, our work has shifted increasingly towards observing variables related to the water cycle at the Earth's surface -- most importantly soil moisture, but also water bodies or water within the vegetation.

Tell us about your current work, your latest project, or your proudest professional moment?

At the institute, we have long been focused on retrieving soil moisture from active microwave remote sensing data and have developed a couple of operational services together with partners from outside the university. I am most proud of publishing the first near-real-time soil moisture data service in cooperation with EUMETSAT in 2009, which has been operational ever since.

Since then, we have been developing new services that extend beyond soil moisture. Recently, a new Copernicus Emergency Management service was launched that allows flood extent to be monitored on a global scale. This service is still in a beta version, but we are continuously improving it.

What do you need to innovate?

Resources -- and the most important resources are people and knowledge. People who are very good in the technical sciences and who have a good scientific understanding are most important. We also need people who can develop and run a big IT infrastructure. The data volumes we are dealing with nowadays are on the order of tens of terabytes to petabytes. Obviously, for this you need a very powerful IT infrastructure.

What does it take to initiate the creation of a global data set? What steps are involved? What made you decide to do it, and how did you go about it?

The first dataset we released in 2002 was very much just a research product. We started with a regional focus. For example, we had individual studies in Canada, Spain and Ukraine, and realised we could create soil moisture products for whole countries. Then the European Space Agency motivated us to develop something for the whole of Africa, which also turned out to perform well. And from there we decided to do it globally.

The big question is, do we dare to publish this? It was a very exciting moment, because once you have a global data set, you are not able to actually check the quality of the data yourself. You are not quite sure what you are delivering. In the end, I was motivated, for example, by an American scientist called Alan Robock who suggested that every dataset becomes more valuable the more people use it. He told me: “You will see you will profit from making it freely available.” And this is what we experienced. Of course, it was not a perfect data set, but the response from the scientific community was very good, and we learned a lot through cooperating with other people and kept on improving the system.

Please elaborate, why is it that once you develop global datasets you cannot check the validity or accuracy of your own data?

Because the data volume becomes so large that you cannot assess the quality of the data as an individual organization. You also depend very much on the availability of good reference data. That means the feedback from the community and from experts from different disciplines, from holders of different reference datasets, is extremely important for the developers of remotely sensed data products.

What is the relevance of soil moisture data for overall water resource management?

Coming from Austria, when I started my PhD research I got many questions from colleagues, family and friends: why soil moisture? You know, Austria is quite wet, so people thought we do not need this. But in the last few years Austria has experienced more heat waves and droughts. It has become abundantly clear to everybody that you need soil moisture information. It is such an important variable because it determines the strong coupling of water and energy fluxes at the land surface. And of course, soil moisture provides the water for plants to grow. It is a very integrative variable, impacting different disciplines from hydrology to meteorology, agronomy, etc.

What potential do you see in this field for water resource management? In which water related research field do you see promising but unharnessed potential of active sensors?

The strength of satellite data is that one gets an overview over larger areas and can still observe what is happening locally. The more one deals with phenomena that extend over very large domains, the more important it is to use satellite technology. The phenomenon of drought, for example, is not something that happens in a small valley, but over a whole landscape, country, or part of a continent. To gather information about such a drought and have a good overview, you need to have such data.

What fields of application do you see in terms of water resource management that have not been harnessed? Where is the future pointing us to in terms of active remote sensing?

While active remote sensing satellites (e.g., the MetOp ASCAT series or the Sentinel-1 Synthetic Aperture Radar) are already providing many different variables, like soil moisture, water bodies, vegetation classifications, etc., I think the uptake in applications could be even stronger. People have started to realize that there are accurate and operationally available data out there (for free), and that they can count on having them for much longer. What is most important now is to continuously develop new and improved data products and to understand how to integrate the newly available data into the applications.

Do you think we are going to discover other variables in the existing data? Some we have not thought of using yet, or indicators for various parameters?

One of the exciting ongoing research topics is how to estimate water within vegetation, which is complex to determine because the signal reflected from vegetation is strongly dominated by the vegetation's structure. Separating the structural contributions from the water-related part of the signal is very difficult. But if we manage to develop a reliable product, it would have an important impact on many applications.
Another underutilised capability of radar sensors is deriving information about frozen soils.

If you have three wishes to be fulfilled by satellite engineers in a space agency, what would they be?

Typically, scientists have lots of wishes for the space agencies. If you look at what is needed by the applications, temporal sampling remains most important.
We work with two main sensors. One is MetOp ASCAT, with three operational satellites at the moment. There we have very good temporal coverage, sampling almost twice a day over Europe, but the spatial resolution is rather coarse (on the order of 25 km). The other one, the Sentinel-1 SAR, has a very good spatial resolution of 20 m but a repeat coverage on the order of two to four days over Europe, which is often too long for monitoring hydrological processes. For example, we miss many flash floods, because the satellite is not watching often enough.

In the future, we would need sensing technology that is able to provide us with much denser measurements at a good spatial resolution. This could be a fleet of satellites, or the possibility of putting a SAR sensor in a geosynchronous orbit to observe the same area all the time. I think the easiest thing to do would be to work with satellite constellations and have more satellites than just two, as is currently the case for Sentinel-1. This would be very valuable for applications.

Terrain patterns and digital elevation models play an important role in hydrological modelling. Can you explain their relevance and limitations related to water resource management and the best process to generate them?

Water forms the landscape. And the landscape determines where water flows and runs off. In that sense, it is clear that topography and digital elevation models are of utmost importance for hydrology and anything related to water.

If you ask which data sources one can use for deriving digital elevation models, then certainly, on a small scale, it is airborne laser scanner data, which have the big advantage of seeing through vegetation. Unfortunately, airborne laser scanner data are normally only available over small areas. For the development of global datasets, one can use interferometric SAR techniques, such as those provided by the TanDEM-X mission. I'm sure we will see important improvements here.

What areas of application of LIDAR remote sensing, physical modelling and the combination of full-waveform laser scanner data are relevant for water resource management? What are the gaps in this field, and what is suggested by state-of-the-art research that we do not see implemented in practice so far?

Airborne laser scanning has grown because of the need for good terrain models, particularly driven by large-scale flood events. Over the years, people have understood that airborne laser scanning is also very useful for vegetation modelling, e.g. biomass assessment, as well as for describing urban structures. These data together with image data form a good basis for very detailed information on a regional scale.

Airborne laser scanning could benefit from using not just the geometric information, but also the radiometric information. And while researchers have started to work in this direction, I don't think it has had a big impact in practice yet. I think combining the backscatter information that you observe with airborne laser scanning with the geometric information is really very beneficial for many applications.

What kind of applications can we derive from that?

For example, if one uses the 3D point clouds from laser scanning, one can classify echoes from vegetation and the ground; very often these classifications are based only on the geometric relationships within the 3D point cloud.

With calibrated radiometric information, you additionally have, e.g., a physical backscatter value, a cross-section, for each point. You can then use the geometric relationships together with the radiometric information to say more accurately whether a point still comes from low vegetation or already comes from the ground. You can improve the classification accuracy of the laser scanner data. This is the basis for deriving more accurate models of the terrain, vegetation biomass, or buildings.
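
To make that combination of cues concrete, here is a minimal Python sketch, with invented threshold values, of how a geometric attribute (height above a local ground estimate) and a radiometric attribute (a calibrated cross-section per echo) might be combined to separate ground from low vegetation. It is an illustration of the idea, not the classification method used in practice.

```python
import numpy as np

def classify_echoes(height_above_ground, cross_section,
                    height_thresh=0.3, sigma_thresh=0.05):
    """Toy ground vs. low-vegetation labelling of laser scanner echoes.

    height_above_ground : (N,) height of each echo above a local ground
                          estimate, in metres (the geometric cue)
    cross_section       : (N,) calibrated backscatter cross-section per echo
                          (the radiometric cue)
    The threshold values are invented for illustration only.
    """
    # Start by assuming every echo is (low) vegetation.
    labels = np.full(height_above_ground.shape, "low_vegetation", dtype=object)

    # Near the ground the geometric cue alone is ambiguous, so add the
    # radiometric cue: in this toy rule, near-ground echoes with a
    # cross-section above the threshold are treated as bare ground.
    near_ground = height_above_ground <= height_thresh
    labels[near_ground & (cross_section >= sigma_thresh)] = "ground"
    return labels

# Synthetic example: four echoes
heights = np.array([0.05, 0.10, 0.80, 0.20])   # metres above local ground
sigma = np.array([0.08, 0.02, 0.01, 0.07])     # arbitrary calibrated units
print(classify_echoes(heights, sigma))
# -> ['ground' 'low_vegetation' 'low_vegetation' 'ground']
```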

What are hot problems academia currently tries to research and resolve?

At the moment, the buzzwords are machine learning (ML) and artificial intelligence (AI). Many research groups apply different sorts of ML techniques to remote sensing data. Data volume has become so large that humans cannot explore the data themselves any longer. They need intelligent tools to explore and understand the data better. We have seen substantial progress over the last few years.

As a physicist, I always try to understand where the data come from and what they actually mean from a physical point of view. I am sometimes unsatisfied with the fact that machine learning models are used to predict the quantity you are measuring without really understanding what you are doing. I hope we will see more integration of physical models and ML techniques to solve these problems.

We have developed study programmes in data science and have students skilled in ML, but once they start working with the data, they still have a steep learning curve in interpreting the data.

What are currently the biggest needs and gaps in water resource management that remote sensing can fill? How can we increase opportunities to manage and deliver the wealth of information made available by technological innovations to a wider audience?

Flood extent mapping, runoff data and water level data. In some regions of the world, there is a decline in in-situ measurement capability. We certainly need observation technologies to understand runoff in rivers throughout the world. Moreover, by getting to ever finer resolutions, we can better understand the hydrological processes of smaller lakes and rivers. I see considerable progress within the remote sensing community in providing virtual runoff gauging stations and in developing this monitoring capability.

Another important need is improved rainfall data. Certainly, one can use rainfall observations from ground stations. Some industrialized countries have very good coverage of their territory with in situ measurements. However, even if stations are fairly close to each other, e.g., 50 km apart, much smaller spacings, such as 1 km, are needed to estimate precipitation and address other currently unresolved problems. Ground-based radars looking up into the sky could be used in this context. However, they are typically limited in spatial coverage. We need innovative remote sensing approaches to get higher resolution global precipitation data.

A very interesting development has been pushed forward by a team under Luca Brocca at CNR-IRPI (the Italian Research Institute for Geo-Hydrological Protection) using soil moisture data to derive how much water has fallen on the ground.

The idea behind this is that the soil is a natural rain gauge. By observing changes in soil moisture and inverting a land-surface model, one can retrospectively estimate the amount of rain that has fallen. Once a soil moisture data product at a spatial scale of 1 km is available, one can expect much higher resolution information about potential rainfall patterns and can start integrating it with other available sources of precipitation data.
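
As a rough illustration of that "soil as a natural rain gauge" idea (and not the operational algorithm developed by Brocca's team), here is a minimal Python sketch that treats the soil as a simple bucket: the stored water column follows the observed changes in relative saturation, a crude power-law term stands in for drainage and other losses, and rainfall per time step is whatever is needed to close the balance. All parameter values are made up for the example and would need calibration in any real application.

```python
import numpy as np

def rainfall_from_soil_moisture(sm, dt_hours, porosity=0.4, depth_m=0.1,
                                a=0.2, b=1.5):
    """Toy inversion of a bucket-type soil water balance.

    sm        : (T,) relative saturation time series (0..1) at fixed time steps
    dt_hours  : time step in hours
    porosity, depth_m : convert relative saturation to a water column (mm)
    a, b      : parameters of a simple power-law drainage/loss term (invented)
    Returns an estimated rainfall amount (mm) per time step.
    """
    water_capacity_mm = porosity * depth_m * 1000.0  # storable water column

    # Change in stored water between consecutive observations
    d_storage = water_capacity_mm * np.diff(sm)

    # Simple loss term evaluated at the mean saturation of each step
    sm_mid = 0.5 * (sm[1:] + sm[:-1])
    losses = a * sm_mid**b * dt_hours

    # Rainfall is whatever filled the bucket plus what was lost;
    # negative values (drying periods) are clipped to zero rain.
    return np.clip(d_storage + losses, 0.0, None)

# Example: a wetting event followed by a dry-down
sm_series = np.array([0.30, 0.55, 0.50, 0.45, 0.44])
print(rainfall_from_soil_moisture(sm_series, dt_hours=12))
```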

I imagine that must be very complex, because structures like streets and things made out of concrete would not absorb the water. You would probably have runoff on the street and right next to it grassland. Isn't it hard to control for all these factors?

Yes, you are right. Sometimes, when you go to a finer spatial scale, things become more challenging. The rainfall data product mentioned before strongly depends on the quality of the soil moisture retrievals; looking at the error maps, we can directly see the errors in these retrievals. This is our task: we need to deal with heterogeneous land cover and properly model the different contributions to the microwave signal to extract the soil moisture information appropriately. Now that data at a high spatial resolution is available, we can start to disregard the measurements over a street, an old house or any other area where the data show no sensitivity to soil moisture, and still get a good estimate of soil moisture at a 1 km scale.
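
As a small illustration of that masking step, here is a hedged Python sketch that averages a fine-resolution soil moisture field up to a coarser grid cell while ignoring pixels flagged as insensitive to soil moisture (sealed surfaces, buildings, open water). The pixel and block sizes, which assume 20 m pixels aggregated to 1 km cells, and the sensitivity mask itself are assumptions made purely for the example.

```python
import numpy as np

def aggregate_to_coarse(sm_fine, sensitive_mask, block=50):
    """Masked aggregation of a fine-resolution soil moisture field.

    sm_fine        : (H, W) fine-resolution soil moisture estimates
    sensitive_mask : (H, W) boolean, True where the signal is sensitive
                     to soil moisture
    block          : pixels per coarse cell (50 assumes 20 m -> 1 km)
    Returns the masked mean per coarse cell (NaN where no usable pixels exist).
    """
    H, W = sm_fine.shape
    h, w = H // block, W // block
    # Reshape into (coarse_row, block, coarse_col, block) tiles
    sm = sm_fine[:h * block, :w * block].reshape(h, block, w, block)
    ok = sensitive_mask[:h * block, :w * block].reshape(h, block, w, block)

    summed = np.where(ok, sm, 0.0).sum(axis=(1, 3))
    counts = ok.sum(axis=(1, 3))
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, summed / counts, np.nan)

# Example: a 100 x 100 field, i.e. 2 x 2 coarse cells in this toy setup
rng = np.random.default_rng(0)
sm = rng.uniform(0.1, 0.4, size=(100, 100))
mask = rng.uniform(size=(100, 100)) > 0.3   # ~30% of pixels flagged insensitive
print(aggregate_to_coarse(sm, mask))
```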

Don’t we also need in situ data to train, verify, and test, e.g., machine learning models on Earth observation data? How much in situ data do we need? I imagine that it depends on the observed parameter.

It strongly depends on the variable that you are looking at. The more in situ data we have, the better; that is the rule of thumb. However, we still need to be careful because often what in-situ sensors are measuring is not what the satellite is observing.

Moreover, the overall understanding is that in-situ measurements per se may not be very representative of the surrounding landscape. It really depends on the type of variable measured and how easily we can use in situ measurements to calibrate the models.

Let us start with a simple example: surface temperature is a variable that is rather representative of a larger patch of land, while a precipitation measurement is less representative.

Regarding soil moisture measurements, some stations are well representative while others are not. We often do not know why that happens and to what extent a measurement is actually representative of a certain area. Using all station data for training satellite data with machine learning can be dangerous, since a machine learning algorithm does not know a priori which data samples are good and which are not for a specific purpose.

Can we coordinate global data collection in any way?

I believe a lot can be achieved through good standards. There are many scientific groups, like CEOS or GCOS, that work on those standards and give recommendations. However, some physical challenges cannot be overcome with standards, and then clever decisions need to be taken in terms of setting up the algorithms.

A good data scientist with knowledge in machine learning also needs to understand the limitations of the data. Only then are they able to select the algorithm most suitable for the task. How can we get a feeling for how much ground-truth data is necessary for training, testing and verification? How can we get a better feeling for the set of problems that we are trying to solve and for how much in-situ data we need in proportion to the Earth observation data?

A rule of thumb: it is never enough! We really need extensive datasets for a proper calibration of machine learning algorithms.

Physical models deal with a few parameters to tune, say on the order of 5, 10 or 20. Machine learning algorithms, on the other hand, need training of thousands of parameters. Even if you have done an amazing field data collection for two years at 50 sites, it may still not be enough for proper training of a machine learning algorithm with the aim of general applicability.

For the global soil moisture dataset, we used data from different countries, put them together in a joint database and started to use it to validate our algorithms. It is an effort we started almost 20 years ago with Peter van Oevelen from GEWEX, who launched the initiative to establish the International Soil Moisture Network. It has grown so extensively that today we have data from all around the world, and we can actually train our models not just with data from one country, but from all countries. It suddenly starts to make sense, and algorithms become more generally applicable.

The broader you go in terms of regional scope and the more variability there is in the data and in the physical phenomena, the more training data you also need to be able to distinguish them, don't you?

Yes, that is true. It is very important to understand that this is how models become physically more correct. In the past, remote sensing scientists used to investigate small regions using small in-situ or reference datasets. That way, it is always easy to get good correlations.

But then the model might not be valid in a very nearby area, which also means the model does not have much value. This can only be overcome by designing the training of the model in a way that takes different environmental conditions into consideration, so the model can learn to deal with those different conditions.

How can you make global models applicable to local phenomena and the other way around? How can you achieve transferability of local models to other sites?

I always tell my team that I want the algorithms developed over one region to also work elsewhere. If we develop an algorithm over Austria, for example, I want it to work in Australia or in South America; it shouldn't matter. The processes should be described as generically as possible in what we are doing. You may then also fine-tune them with machine learning to specific conditions. That is possible and imaginable.

How can we assess which Earth observation data sources are most suitable for a given research problem? What would you recommend to young researchers who do not have an internalised library of sensors and data products? Where should they start? How do they get a good feeling for what is there?

Considering the tremendous amounts of data available nowadays, it is really hard. Many young researchers will be overwhelmed by the offerings. Data come from different sources, different countries, and from public and private actors. Moreover, the data volume is becoming increasingly large, so it is challenging to process data locally. For this reason, a very prominent first place to gain experience is Google Earth Engine, which is a cloud platform with lots of data and the capability to process them on a global scale. I think this is a good first step. However, to be innovative and develop new things, or to gain special expertise on a certain sensor or data product, one must probably also work on more specialized platforms. There are several platforms with specific expert knowledge that also offer the means to start processing the data. For example, we have founded the EODC Earth Observation Data Centre for Water Resources Monitoring, where users can work with pre-processed Sentinel-1 data on a global scale. It is a pre-processed global data cube based on Sentinel-1 backscatter data. All the burden of pre-processing and quality control has already been taken care of. This is a big leverage for young people, who can now concentrate on specific problems and do not have to deal with the whole processing chain.
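
For instance, a first look at Sentinel-1 backscatter in Google Earth Engine can be as short as the following Python sketch. The location, dates and printed output are placeholders chosen for the example; the collection ID 'COPERNICUS/S1_GRD' refers to the publicly available Sentinel-1 ground-range-detected archive, and the snippet assumes Earth Engine authentication has already been set up.

```python
import ee

ee.Initialize()  # assumes Earth Engine authentication is already configured

# Area and period of interest (placeholders: a point near Vienna, summer 2023)
point = ee.Geometry.Point([16.37, 48.21])

# Sentinel-1 ground-range-detected backscatter, IW mode, VV polarisation
s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(point)
      .filterDate('2023-06-01', '2023-09-01')
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
      .select('VV'))

print('Number of acquisitions:', s1.size().getInfo())

# Temporal mean backscatter (dB) sampled at the point of interest
mean_vv = s1.mean().reduceRegion(
    reducer=ee.Reducer.mean(), geometry=point, scale=20).getInfo()
print('Mean VV backscatter:', mean_vv)
```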

You are a highly cited researcher and a proponent of open data. What values does it have for researchers to share their demanding work freely and with everyone?

By publishing data in an open way, you can improve your impact and reputation. People will use your data and read your papers, and through this, the number of your citations will grow. Your reputation will be boosted. So there is even a selfish motivation for sharing data in an open manner with the scientific community.

Moreover, data sharing is very important because it allows us to critically analyse what other people have done so we are able to reproduce research and examine the assumptions. This makes sure that science is done in a proper manner, and very importantly, it generates feedback from a larger audience in case something is wrong.

An increasing amount of scientific work is based on technology that originated in software development, where an entire environment is shared with other developers. These containers are used to package research, data and models together. What do you consider an appropriate way to share scientific work with other researchers, and how far would you go?

Data should be packaged with the models (code) to allow for reproducibility and traceability of results. Making a scientific study fully reproducible is a big challenge. However, scientists should try to go as far as possible, by providing at least some of the data from their analysis and, if possible, some code so that other people can understand what has been done. There are natural limitations to this. Think about the huge amounts of data available nowadays; we cannot always share everything with everybody, it is impossible. But we need to find ways of describing input and output datasets in a systematic manner, e.g., how you made a query of a certain dataset at a certain time in a certain area, so readers at least know which specific dataset was used.

This puts a burden on researchers because it is additional work. But again, I think it is worth doing, because our experience shows that the more people look at a dataset, the more valuable it becomes, because you understand its strengths and weaknesses much quicker.

I think it is not a burden that should lie on researchers alone. Maybe the infrastructure of academia will change with this, if it becomes a more commonly used approach. Maybe pioneers nowadays must do all the work and make sure others see the value in it. Eventually, we may also need different data structures and infrastructure to access and share data with one another.

Yes, we have a long way to go. Certainly, universities are starting to do this and offer services to scientists to help them in the process. However, it is still very demanding, because then it is not just about publishing good papers; it is also about making sure that the data are well curated and presented. It is also about code. There is a significant difference between scientific code and code that you share. Other people cannot really help you tidy up your code or make sure it is clean. These tasks come on top of what researchers already need to do.

What is your favourite aggregate state of water?

Well, I like skiing, so the snow would be a candidate, but I probably like going to rivers, lakes, and the sea even more. So, I think it is the liquid state.