Redshift will accept UTF-8 “fine” in VARCHAR columns, although it has issues with UTF-8 characters longer than 4 bytes and with codepoints outside the valid range. I haven’t tested the length issue with Apache Spark, but I hit the other issue head on.
Apache Spark happily accepts EF BF BF (the UTF-8 encoding of the codepoint \uFFFF) as valid when storing data in Parquet. Redshift can’t load this character.
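To see Spark’s side of this, here’s a minimal sketch (the column name and output path are made up) that writes a row containing \uFFFF to Parquet without any complaint:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row whose string column contains U+FFFF (EF BF BF in UTF-8).
df = spark.createDataFrame([("\uffff",)], ["col"])

# Spark writes this happily; the failure only shows up later,
# when Redshift tries to COPY the file.
df.write.mode("overwrite").parquet("/tmp/bad_utf8.parquet")
```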
When confronted with a problematic row, in Redshift you can either use ACCEPTINVCHARS or set a reasonable number in the MAXERROR parameter of your COPY command. Except that these parameters are not available when loading Parquet. In other words, if you have a Parquet file with one row with one field with one invalid UTF-8 character, your whole load will fail. And it fails very noisily.
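For reference, a load along these lines is what blows up. This is just a sketch, with psycopg2 standing in for whatever client you use, and the table, bucket, and IAM role as placeholders:

```python
import psycopg2

conn = psycopg2.connect("host=... dbname=... user=... password=...")
with conn.cursor() as cur:
    # With FORMAT AS PARQUET, COPY accepts neither ACCEPTINVCHARS
    # nor MAXERROR, so a single bad character aborts the entire load.
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/bad_utf8/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-role'
        FORMAT AS PARQUET;
    """)
conn.commit()
```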
Luckily, you can replace the problematic character in Spark:
translate(encode(decode(col, 'UTF-8'), 'UTF-8'), '\uFFFF', '\uFFFD')
You may not need the combo of decode and encode, but every time I have issues with UTF-8 I prefer to play it safe with one additional layer.
This line uses translate to replace exactly one byte sequence (that of the character EF BF BF, i.e. \uFFFF) with the byte sequence of the UTF-8 replacement character EF BF BD (\uFFFD).
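If you prefer the DataFrame API over a SQL expression, the equivalent looks roughly like this (a sketch; df and col are assumed names, and the round trip here goes string → binary → string):

```python
from pyspark.sql import functions as F

cleaned = df.withColumn(
    "col",
    F.translate(
        # Round-trip through binary as the extra layer of safety,
        # then swap U+FFFF for the replacement character U+FFFD.
        F.decode(F.encode(F.col("col"), "UTF-8"), "UTF-8"),
        "\uffff",
        "\ufffd",
    ),
)
```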
This is not foolproof: you can still find yourself with UTF-8 codepoints that are valid for Spark but invalid for Redshift. But now you know how to replace them in Spark before the data reaches Redshift.