IBM Data Platform Ideas Portal for Customers



Status: Delivered
Workspace: StreamSets
Created by: Guest
Created on: Jan 16, 2025

Read Parquet Schema from Origin that can be used in the Pipeline

When a Parquet file is read that was saved by a previous pipeline in which the schema was inferred from the record, the parquetSchema header attribute does not appear to be populated correctly. It contains an Avro schema instead, and when the destination stage's Parquet Schema Location is configured with the Parquet Schema in Record Header option, the following error occurs:

com.streamsets.pipeline.api.StageException: HADOOPFS_13 - Error while writing to HDFS: java.lang.IllegalArgumentException: start with 'message': expected 'message' but got '{' at line 0: {
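
The first token explains the failure: a Parquet schema string always begins with the keyword 'message', while an Avro schema is JSON and begins with '{'. As a rough illustration (the example schema is made up, not taken from the actual pipeline), a minimal sketch using the Apache Parquet and Avro libraries that Data Collector builds on:

    import org.apache.avro.Schema;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class SchemaMismatchDemo {
        public static void main(String[] args) {
            // A Parquet schema string starts with the keyword "message"; this is
            // the form the Parquet Schema in Record Header option expects.
            MessageType parquet = MessageTypeParser.parseMessageType(
                "message example { required binary name (UTF8); required int64 id; }");
            System.out.println(parquet);

            // An Avro schema is JSON, so it starts with '{' -- apparently what
            // the parquetSchema header attribute contains here.
            Schema avro = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"example\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"id\",\"type\":\"long\"}]}");

            // Handing the Avro JSON to the Parquet parser reproduces the
            // HADOOPFS_13 failure: expected 'message' but got '{'.
            try {
                MessageTypeParser.parseMessageType(avro.toString());
            } catch (IllegalArgumentException e) {
                System.out.println("Rejected: " + e.getMessage());
            }
        }
    }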

 

Should the schema be pulled from the Parquet file's metadata?
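
If it were, a sketch of what that could look like with the parquet-hadoop API, assuming a hypothetical file path (every Parquet file carries its own schema in the footer metadata):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;
    import org.apache.parquet.schema.MessageType;

    public class FooterSchemaSketch {
        public static void main(String[] args) throws Exception {
            // The file footer holds the authoritative schema, so an origin
            // could populate parquetSchema from there instead of storing the
            // inferred Avro schema.
            Path file = new Path("hdfs:///data/example.parquet");  // hypothetical path
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(file, new Configuration()))) {
                MessageType schema = reader.getFooter().getFileMetaData().getSchema();
                System.out.println(schema);  // prints "message ...", the expected form
            }
        }
    }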

Needed By: Not sure -- Just thought it was cool
  • Admin
    Miguel Angel Garcia Olmedo
    Feb 11, 2025

    Hi @Guest, I am happy to announce that the Avro schema problem was already fixed in version 5.11. Our recommendation is to update these engines.

  • Admin
    Miguel Angel Garcia Olmedo
    Feb 10, 2025

    Hi @Guest, could you tell us which engine version the client is using? According to the documentation, the schema can be taken from origins, and we have verified that it works with the Directory origin. Maybe they are using an older version and the fix is in a newer one.
    https://docs.streamsets.com/portal/platform-datacollector/latest/datacollector/UserGuide/Origins/Directory.html?hl=directory%2Corigin

  • Guest
    Feb 7, 2025

    My thought is that for a schema that is nested with different types, writing it into parquetSchema would bypass the need to infer the schema from the records and generate it.
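
    For illustration, if the header does end up holding an Avro schema, parquet-avro could translate it into the Parquet message type the destination expects; a minimal sketch, with a made-up Avro JSON standing in for the header value:

        import org.apache.avro.Schema;
        import org.apache.parquet.avro.AvroSchemaConverter;
        import org.apache.parquet.schema.MessageType;

        public class AvroToParquetSketch {
            public static void main(String[] args) {
                // Made-up stand-in for the Avro JSON found in the parquetSchema
                // header attribute.
                String avroJson = "{\"type\":\"record\",\"name\":\"example\","
                    + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}";
                Schema avro = new Schema.Parser().parse(avroJson);
                // parquet-avro translates the Avro schema into the Parquet
                // message type the destination expects.
                MessageType parquet = new AvroSchemaConverter().convert(avro);
                System.out.println(parquet);  // starts with "message example {"
            }
        }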

  • Admin
    Miguel Angel Garcia Olmedo
    Feb 7, 2025

    Hi @Guest @Guest, could you please give us the information that Fran is requesting? It would really help us resolve this for Simon. Thanks in advance.

  • Guest
    Feb 7, 2025

    Hi. We really appreciate your comments, and we'll take a careful look at this so you can successfully run your pipeline.

    As you say, the Parquet schema should be extracted directly from the Parquet file's metadata, and that is most probably the case. Even so, there seems to be an error there.

    I'd like to take a deeper look, but I'd need some more information.

    1. I understand that you have one pipeline to generate Parquet files and then a second pipeline to ingest them into some data lake. Am I right?

    2. If this is the case, which exact origin are you using in the second pipeline to read the Parquet files?

    Looking forward to hearing from you.

    Best regards.