Redshift will accept UTF-8 “fine” in VARCHAR columns, although it has issues with UTF-8 characters longer than 4 bytes and with codepoints outside the valid range. I haven’t tested the length issue with Apache Spark, but I hit the other issue head on.
Apache Spark happily accepts EF BF BF (the UTF-8 encoding of the codepoint \uFFFF) as valid when storing data in Parquet. Redshift can’t load this character.
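To see Spark’s side of this, here’s a minimal sketch (the column name and output path are made up) that writes a row containing \uFFFF to Parquet without any complaint:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row whose string column contains U+FFFF (EF BF BF in UTF-8).
df = spark.createDataFrame([("\uffff",)], ["col"])

# Spark writes this happily; the failure only shows up later,
# when Redshift tries to COPY the file.
df.write.mode("overwrite").parquet("/tmp/bad_utf8.parquet")
```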
When confronted with a problematic row, in Redshift you can either use ACCEPTINVCHARS or set a reasonable number in the MAXERROR parameter of your COPY command. Except that these parameters are not available when loading Parquet. In other words, if you have a Parquet file with one row with one field with one invalid UTF-8 character, your whole load will fail. And it fails very noisily.
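For reference, a load along these lines is what blows up. This is just a sketch, with psycopg2 standing in for whatever client you use, and the table, bucket, and IAM role as placeholders:

```python
import psycopg2

conn = psycopg2.connect("host=... dbname=... user=... password=...")
with conn.cursor() as cur:
    # With FORMAT AS PARQUET, COPY accepts neither ACCEPTINVCHARS
    # nor MAXERROR, so a single bad character aborts the entire load.
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/bad_utf8/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-role'
        FORMAT AS PARQUET;
    """)
conn.commit()
```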
Luckily, you can replace the problematic character in Spark:
translate(encode(decode(col, 'UTF-8'), 'UTF-8'), '\uFFFF', '\uFFFD')
You may not need the combo of decode and encode, but every time I have issues with UTF-8 I prefer to play it safe with one additional layer.
This line uses translate to replace exactly one byte sequence (that of the character EF BF BF, i.e. \uFFFF) with the byte sequence of the UTF-8 replacement character EF BF BD (\uFFFD).
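If you prefer the DataFrame API over a SQL expression, the equivalent looks roughly like this (a sketch; df and col are assumed names, and the round trip here goes string → binary → string):

```python
from pyspark.sql import functions as F

cleaned = df.withColumn(
    "col",
    F.translate(
        # Round-trip through binary as the extra layer of safety,
        # then swap U+FFFF for the replacement character U+FFFD.
        F.decode(F.encode(F.col("col"), "UTF-8"), "UTF-8"),
        "\uffff",
        "\ufffd",
    ),
)
```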
This is not foolproof: you can still find yourself with UTF-8 codepoints that are valid for Spark but invalid for Redshift. But now you know how to replace them in Spark before the data reaches Redshift.