Managing logging in Spark ain’t easy, and it’s even harder on managed platforms like Databricks or EMR.
In this short post I will explain how to use log4j property files in Databricks clusters, particularly when launching JAR jobs. The method (with adequate modifications) also works for Spark jobs in AWS EMR; I will point out the required changes for that situation.
Why you may want to do this:
As was my case, to send logs to a Graylog instance.
To be able to adjust the verbosity of Spark logging and your packages, separately.
To customise how the logs look.
There is no end-to-end guide for this in Databricks' documentation, although each of the required building blocks is well documented. It all boils down to using a custom init script that copies the required files into the right places, and passing the right arguments to spark-submit.
In the end, it is the same approach I had used some years ago for custom log4j on EMR; the only differences are where the files end up and how I build the script.
The operational workflow I have is:
Airflow triggers a Databricks submit-run JAR job, configured via a task in a DAG. The init script and parameters to spark-submit are passed here. In EMR, it was a custom operator creating a transient spot or on-demand cluster. The behaviour was thus the same as in submit-run, but we added a custom EMRSensor to monitor the job/cluster status.
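To make this concrete, here is a sketch of the kind of submit-run payload such a task ends up sending through the Databricks Runs Submit API. Every name, path and cluster setting below is a hypothetical placeholder; the point is where the init script, the libraries and the spark conf plug in:

```json
{
  "run_name": "my-spark-job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "init_scripts": [
      { "dbfs": { "destination": "dbfs:/jobs/my-spark-job/init-logging.sh" } },
    ],
    "spark_conf": {}
  },
  "libraries": [
    { "jar": "dbfs:/jobs/my-spark-job/my-spark-job-assembly.jar" }
  ],
  "spark_jar_task": { "main_class_name": "com.example.Main" }
}
```

The logging-related spark conf discussed below goes into `spark_conf`; the init script referenced in `init_scripts` is the one that places the log4j files.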
All artifacts required for this job (JAR, log4j.properties, init script, additional JARs) are copied to DBFS (with the CLI tool, dbfs) as part of the deployment process. They could also be copied to S3 (or whatever storage you use), but keeping them in DBFS makes the init script simpler, as you will see. In the EMR case, they were copied to S3.
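As an illustration, the deployment step can be as simple as a few dbfs cp calls (all local paths and the DBFS folder are made up for this sketch):

```shell
# Hypothetical paths; adjust to your project layout.
dbfs cp --overwrite target/scala-2.12/my-spark-job-assembly.jar dbfs:/jobs/my-spark-job/my-spark-job-assembly.jar
dbfs cp --overwrite conf/log4j.properties dbfs:/jobs/my-spark-job/log4j.properties
dbfs cp --overwrite scripts/init-logging.sh dbfs:/jobs/my-spark-job/init-logging.sh
dbfs cp --overwrite lib/gelfj.jar dbfs:/jobs/my-spark-job/gelfj.jar
```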
The init script for Databricks looks like this:
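In outline (the DBFS folder and file names are placeholders matching the deployment sketch above; the /databricks destinations are the standard locations for log4j configuration and extra JARs on Databricks clusters):

```shell
#!/bin/bash
set -e

# Our artifacts, uploaded to DBFS during deployment (placeholder folder).
SRC=/dbfs/jobs/my-spark-job

# Overwrite the default log4j configuration for driver and executors.
cp "$SRC/log4j.properties" /databricks/spark/dbconf/log4j/driver/log4j.properties
cp "$SRC/log4j.properties" /databricks/spark/dbconf/log4j/executor/log4j.properties

# Put the GELF appender on the JVM classpath.
cp "$SRC/gelfj.jar" /databricks/jars/

# json-simple is needed by gelfj but not bundled with our packaging of it.
wget -q -O /databricks/jars/json-simple-1.1.1.jar \
  "https://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1.1/json-simple-1.1.1.jar"
```

On EMR the same idea becomes a bootstrap action that copies from S3 into the Spark configuration directory (roughly /etc/spark/conf) and drops the extra JARs where Spark picks them up (roughly /usr/lib/spark/jars).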
The “spark conf” passed to our Databricks operator is this one, and would be exactly the same in EMR (the Glue setting has nothing to do with logging, but you’d better use AWS Glue as metastore if you are on AWS… it will pay off if you ever let go of Databricks).
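A reconstruction of that conf map, assuming the jar destination from the init script; `production` is a placeholder value, and the Glue class is the standard AWS Glue Data Catalog client factory:

```json
{
  "spark.driver.extraJavaOptions": "-Denvironment=production",
  "spark.executor.extraJavaOptions": "-Denvironment=production",
  "spark.executor.extraClassPath": "/databricks/jars/*",
  "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
```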
I’m not 100% sure now that we need to pass the executor extra classpath, but I vaguely remember that without it, no dice. You can try removing it.
The -D-prefixed arguments are how you pass additional parameters that log4j interpolates in properties files. I use that for the environment in the file below, but you could pass anything that can go in a properties value.
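A sketch of such a log4j.properties, wired to the gelfj GelfAppender; the package name, host, port and facility are placeholders you would replace with your own:

```properties
log4j.rootLogger=WARN, graylog

# Our own packages log at INFO (package name is a placeholder).
log4j.logger.com.example=INFO

# GELF appender from gelfj.
log4j.appender.graylog=org.graylog2.log.GelfAppender
log4j.appender.graylog.graylogHost=graylog.internal.example.com
log4j.appender.graylog.graylogPort=12201
log4j.appender.graylog.facility=my-spark-job
log4j.appender.graylog.layout=org.apache.log4j.PatternLayout
# ${environment} is interpolated from the -Denvironment=... JVM argument.
log4j.appender.graylog.additionalFields={'environment': '${environment}'}
```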
I recommend you additionally make sure verbose third-party logs are silenced with the following:
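For example, raising the level on the usual noisy loggers in a Spark job (extend the list to whatever is chatty in your stack):

```properties
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.hadoop=WARN
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.parquet=WARN
log4j.logger.com.amazonaws=WARN
```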
Graylog, gelfj and keepalive connections in AWS ELB
For the specific case of using Graylog, we copy the gelfj JAR to the classpath, and additionally fetch the JAR for json-simple directly from Maven, since for some reason it is not bundled with our packaging of gelfj. We use a custom gelfj connector that makes sure the connection is closed and re-opened every time logs are sent (yes, this could be inefficient).
Why do we do this? In AWS, ELB (Elastic Load Balancer) TCP/UDP connections have an idle timeout for persistent connections without transmission, and some of our Spark jobs may not log for a long while (since we silence verbose logs and only log at the “right level” to Graylog, we may not log at all while a task is running). If the connection is closed, the default gelfj (at least the version we started with) would not reconnect, and we’d stop logging after a while. So we forked and added reconnection; it has worked well since.
I hope you find this post helpful in setting up your logging. Don’t neglect having good logs: they will save your 🥓 very often.