European Commission logo
INSPIRE Community Forum

How to get distinct values to query large datasets from WFS endpoints?

This post tries to provide a simple work-around for a little-known limitation with the WFS standard, especially regarding being able to provide ways of filtering the available datasets using data-specific filters.

The need for this post: just as an example, imagine a WFS service providing access to species distribution from a Member State INSPIRE endpoint. Due to the nature of the data theme, that dataset might be have tens, hundreds or even thousands of different species distribution elements, and since there is no way to determine without prior knowledge, which species are provided by that download service, there is little alternative right now to find out exactly what the dataset contains without getting the entire dataset. I am, of course presuming that the dataset’s metadata contains little to no useful information on the structure of the data in the endpoint.

WFS through Stored Queries provide a way of filtering the data, but how to get the exact values from the dataset to provide to the filter? If we were in a database we could just have a SELECT DISTINCT SQL syntax, or even a GROUP BY + COUNT to have more information on the features in the dataset, even with services from a REST endpoint would allow a distinct values and count of the underlying data. But right now there is no working alternative to achieve this.

Until this issue will make it to the newest version of the WFS documentation, and we can only speculate on when it might be implemented in Server-side applications, there is a work-around for this issue that I have deployed on our development server that is running the latest Geoserver 2.12 with a PostGIS database.

The trick was to getting Geoserver to perform a SQL syntax on the database similar to:

SELECT COL, COUNT(COL) from TABLE GROUP BY COL

where we want COL and TABLE (or database view) to be parameterized as required by each user, and in this way have only one server layer fulfill the purpose of getting the distinct values from the dataset.

Taking pointers from the very useful post from the Geosolutions blog:

http://geoserver.geo-solutions.it/edu/en/adding_data/add_sqllayers.html

Although the post is pretty self-explanatory, in a nutshell it says something like:

-Publish a new layer in Geoserver using the “Configure new SQL view” option, connecting to a DB store

-Write de SQL group by that you are interested in and publish the layer to geoserver.

Since the goal is to parameterize both column name and table/view name, the SQL syntax of the layer’s definition should look similar to:

SELECT "%fieldname%" as VALUE, count("%fieldname%") as COUNT

FROM "%table_name%"

GROUP BY "%fieldname%" ORDER BY "%fieldname%"

*note the quotes before/after columns and tables just to make the SQL query work if there are UPPERCASE and mixed-case table/column names in the database.

All that remains is to set the SQL parameters, set default values so that the SQL query will work if no values are specified with the “viewparams” parameter for the WFS endpoint.

An example for getting the exact values from the database for protected sites designations:

http://inspire.teamnet.ro/geoserver_WFS_default/ows?service=wfs&version=1.1.0&request=GetFeature&typename=select:GroupBy&viewparams=fieldname:designation_href;table_name:vw_protectedsites

These values can then be used in querying data using a stored query, where the exact value for the propertyValue parameter is obtained from the select:GroupBy feature on the Geoserver.

http://inspire.teamnet.ro/geoserver/ows?

service=wfs&

version=2.0.0&

request=GetFeature&

StoredQueryId=GetFeatureSingleCriterion&

featureType=ps:ProtectedSite&

propertyPath=ps:ProtectedSite/ps:siteDesignation/ps:DesignationType/ps:designation/@xlink:href&

propertyValue=http://inspire.ec.europa.eu/codelist/IUCNDesignationValue/naturalMonument

*where the black bold is the property being filtered, and the red being the value used in the filtering.

There is still the issue of users being told explicitly what valid combinations of column & table/view should be specified to the users in either the dataset’s metadata, or some other wiki for it. While this is not perfect it might be a step in the right direction.

  • Ilkka RINNE

    Hi Sorin,

    Interesting application for Geoserver parametrized views! I was wondering how did you validate the view parameters against SQL injections? Quoting the Geoserver blog that you referred to:

    Only use SQL parameters as a last resort, improperly validated parameters can open the door to SQL injection attacks.

     

  • Sorin RUSU

    Hi Ilkka,

    Thank you for your kind consideration and you raise a valid point. The geosolutions post does point out this vulnerability and they also provide an useful way of preventing most SQL injections.

    Always provide default values for each parameter in order to let the layer work properly and also be sure the regular expression for the values validation are correct (the second bullet).

    • ^[\d\.\+-eE]+$ will check that the parameter value is composed with valid elements for a floating point number, eventually in scientific notation, but will not check that the provided value is actually a valid floating point number
    • [^;']+ will check the parameter value does not contain quotes or semicolumn, preventing common sql injection attacks, without actually imposing much on the parameter value structure
  • Sorin RUSU

    You can also just create non-parameterized SQL selects, and publish as many layers (select distincts) as necessary to retrieve the data from the database. The downside is increasing the number of layers in the server.

    Kathi Schleidt did something like this with PD dataset: http://bolegweb.geof.unizg.hr:2017/geoserver/pd-s/ows?service=WFS&version=2.0.0&request=GetFeature&typeNames=pd-s:sd_statdistmeasure_test&count=50

     

  • Iurie MAXIM

    Hi,

    But what to do if the data provider is not making such an SQL Layer or if the user/machine/application is not aware that such an SQL Layer exist, nor he does not know what the results of such an layer means or how to use the results in other WFS requests ?

    The solution is to extend the WFS standard in order to allow requests to provide distinct values of an element.

    By then, a proposal for adding a new functionality to Geoserver to allow users to get distinct values of an element trough a GetPropertyValue request was added to OSGEO's JIRA.

    If you are interested in this proposed functionality, you can see its description or you can vote for it here:

    https://osgeo-org.atlassian.net/browse/GEOS-8229?filter=-2

    Best regards,

    Iurie Maxim

Biodiversity & Area Management

Biodiversity & Area Management

If themes like Protected Sites, Area Management/Restriction/Regulation Zones and Reporting Units, Habitats and Biotopes, Species Distribution, Bio-geographical Regions matters to you, join these groups!