In DataStage, wildcards in the Azure Data Lake Storage connector are only allowed in the filename, not in the filepath

With the Azure Data Lake Storage connector, DataStage can read multiple files at once by specifying wildcards. However, these wildcards are only allowed in the filename, not in the filepath. This means that all the files we want to read in one go have to be in the same directory.

In a data lake, files are often distributed over folders and subfolders to improve performance, either by limiting the number of files that have to be read or by allowing parallel processing.

As an example, let’s assume we are measuring the air quality at hundreds of measurement points. In the data lake, the folder structure and filenames look like this: /gold/air/<measurement point>/<year>/<pollutant>/<month><day>.parquet

Each directory therefore contains 365 files (366 in leap years) with the measurements of one pollutant in one year at one measurement point.
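To make the per-directory file count concrete, here is a minimal Python sketch that lists the daily filenames for one year. It assumes that <month> and <day> are both zero-padded to two digits (the month padding is suggested by the "07*" wildcard used further down; the day padding is an assumption), and the helper name daily_filenames is hypothetical.

```python
from datetime import date, timedelta


def daily_filenames(year: int) -> list[str]:
    """Return one '<month><day>.parquet' name per calendar day of the year.

    Assumption: month and day are zero-padded to two digits each.
    """
    names = []
    day = date(year, 1, 1)
    while day.year == year:
        names.append(f"{day:%m%d}.parquet")
        day += timedelta(days=1)
    return names


files_2019 = daily_filenames(2019)  # regular year
files_2020 = daily_filenames(2020)  # leap year
print(len(files_2019), len(files_2020))
```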

Some folders:

/gold/air/Antwerp/2019/NH3
/gold/air/Antwerp/2019/CO2
/gold/air/Antwerp/2020/NH3
/gold/air/Antwerp/2020/CO2
/gold/air/Ghent/2019/NH3
/gold/air/Ghent/2019/CO2
/gold/air/Ghent/2020/NH3
/gold/air/Ghent/2020/CO2
…

Depending on the question asked, I will select the files based on the measurement point, the year, the pollutant, the month and/or the day of the month.

To get the concentration values of ammonia in July (of all available years) I would use the wildcard “/gold/air/*/*/NH3/07*.parquet”, and for all pollutants in Antwerp in 2019 “/gold/air/Antwerp/2019/*/*.parquet”.

Fast and easy.
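The two selections above can be illustrated with a small Python sketch. This is not how the connector works internally; it only shows the requested matching behaviour, under the assumption that a "*" matches within a single path segment and never crosses a "/". The sample paths and the helper names matches and select are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, path: str) -> bool:
    """Compare pattern and path segment by segment.

    Assumption: '*' never crosses a '/', so both must have the same depth.
    """
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(
        fnmatchcase(part, pat) for pat, part in zip(pattern_parts, path_parts)
    )


def select(pattern: str, candidates: list[str]) -> list[str]:
    """Keep only the candidate paths that match the wildcard pattern."""
    return [p for p in candidates if matches(pattern, p)]


# Hypothetical sample paths that follow the folder layout described above.
paths = [
    "/gold/air/Antwerp/2019/NH3/0701.parquet",
    "/gold/air/Antwerp/2019/NH3/0815.parquet",
    "/gold/air/Antwerp/2019/CO2/0701.parquet",
    "/gold/air/Antwerp/2020/NH3/0702.parquet",
    "/gold/air/Ghent/2019/NH3/0731.parquet",
    "/gold/air/Ghent/2020/CO2/0101.parquet",
]

# Ammonia in July, all measurement points, all years.
ammonia_july = select("/gold/air/*/*/NH3/07*.parquet", paths)

# All pollutants in Antwerp in 2019.
antwerp_2019 = select("/gold/air/Antwerp/2019/*/*.parquet", paths)
```

With wildcards limited to the filename, only the last segment of each pattern could contain a "*", so neither selection could be expressed in a single read.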

To achieve the same functionality with wildcards on the filename only, we would have to put a huge number of files (more than 1,000,000) in a single directory.

Needed By: Quarter
