Headless Chrome: A few pointers to note

I’ve been helping a client scale their technology platform for a new area of the business and they have a need for headless Chrome.

The workflow is simple: get some data, inspect that data, do some other things with it and save the results. We are using SQS as part of this process, and I’ve got some tips on SQS for you to read.

Don’t use headless Chrome

I’ll get this out of the way early. Do you really need to use it? Can HttpClient or curl do what you want? Running instances of Chrome is not cheap compared with making the requests yourself through a web request library. If you can get away without Chrome, you’ll increase your workload density significantly.
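As a rough sketch of the cheaper route: if the data you need is already in the raw HTML, a plain request plus a little parsing may be enough. This assumes Node 18 or later, where fetch is a global. The extractTitle and fetchTitle names are made up for illustration.

```javascript
// Pull the <title> text out of an HTML string. A regular expression is
// enough for a sketch; a real scraper would likely use an HTML parser.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Fetch a page with a plain HTTP request instead of a browser.
// Assumption: Node 18+, where `fetch` is available globally.
async function fetchTitle(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractTitle(await response.text());
}
```

If the page only produces its data after running JavaScript, this approach will not see it, and that is the point where a headless browser starts to earn its cost.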

Don’t take screenshots at high volume

Don’t use headless Chrome to take screenshots at high volume. Probably don’t take screenshots at any volume. It’s slow and resource-intensive.

If you have to take screenshots, see if you can do it asynchronously to your primary processes.

Don’t load images if you don’t need them

If you’re not taking screenshots you can speed up page loads by disabling images. You’re in a headless browser, nobody is going to see those images.
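One way to do this, assuming you drive Chrome with Puppeteer (the post does not say which driver is in use), is to intercept requests and abort the ones for images. The helper names shouldBlock and blockImages are illustrative.

```javascript
// Resource types to drop. 'image' is the one this tip is about.
const BLOCKED_TYPES = new Set(['image']);

// Pure decision helper so the rule is easy to test and extend.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Wire the rule into a page.
// Assumption: `page` is a Puppeteer Page object.
async function blockImages(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

Call blockImages(page) once, right after creating the page and before the first navigation.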

Don’t cache

Set --disk-cache-size=0 and --media-cache-size=0 in your arguments.

I also set --user-data-dir= to a value similar to /tmp/c_743, where 743 is the process ID. When I’m done with Chrome I delete the directory (and if I don’t, /tmp will get cleared up eventually). If you’re using the cluster module in Node, you can do this from the parent process and delete the directory when the child dies.

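// `w` is a worker returned by cluster.fork(); needs const fs = require('fs');
w.on('exit', () => {
  try {
    // Remove the dead worker's Chrome profile directory.
    // fs.rmSync (Node 14.14+) replaces the deprecated recursive rmdirSync.
    fs.rmSync('/tmp/c_' + w.process.pid, { recursive: true, force: true });
  } catch (err) {
    // Best-effort cleanup: /tmp gets cleared eventually anyway.
  }
});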
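Putting the flags above together, a minimal sketch of an argument builder might look like this. The chromeArgs name is hypothetical, and the launch call in the comment assumes Puppeteer.

```javascript
// Build Chrome launch arguments for one worker process.
// `pid` defaults to the current process ID, matching the /tmp/c_<pid>
// convention described above.
function chromeArgs(pid = process.pid) {
  return [
    '--disk-cache-size=0',
    '--media-cache-size=0',
    `--user-data-dir=/tmp/c_${pid}`,
  ];
}

// Usage sketch (assumption: Puppeteer is the driver):
//   const browser = await puppeteer.launch({ args: chromeArgs() });
```

Because the directory name is derived from the PID, the parent can work out which directory to delete from the worker's PID alone.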

Keep the browser alive

Don’t launch and close the browser on every request. Keep it alive and reuse it. It’s faster that way.
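A minimal sketch of one way to do that: wrap whatever launches your browser so it runs once and every later request gets the same instance. createSharedBrowser is an illustrative name, and launch stands in for something like () => puppeteer.launch() if you use Puppeteer.

```javascript
// Wrap a launch function so the browser is started once and then reused.
// Assumption: `launch` returns a promise for a browser object.
function createSharedBrowser(launch) {
  let browserPromise = null;
  return {
    // Returns the same promise on every call until reset() is called,
    // so concurrent callers never trigger a second launch.
    get() {
      if (browserPromise === null) {
        browserPromise = launch();
      }
      return browserPromise;
    },
    // Forget the current browser, e.g. after it crashed or was closed.
    reset() {
      browserPromise = null;
    },
  };
}
```

Each request then does const browser = await shared.get(), opens its own page and closes only that page when it is done.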

Watch disk IO and iowait

In high-volume scenarios you may run into issues with disk IO. If this occurs, your processes themselves will slow down and requests will take longer. You can avoid this by reducing your caching, and you should also keep an eye on /dev/shm. You will see lots of suggestions to use --disable-dev-shm-usage, but this has consequences.

I would suggest reading up on what /dev/shm is before deciding on --disable-dev-shm-usage. By its very nature it will increase disk IO, because Chrome then writes its shared memory files to a temporary directory instead of /dev/shm.

Your mileage may vary on this option.

Watch browser memory

I’ve got no hard evidence of this but I have observed over many requests that memory usage can grow. In this application after X requests, I just restart Chromium and start again.

If you reuse a page for multiple requests, you may occasionally see that memory does not go down as you visit new pages. Try visiting 'about:blank' after each request and see what happens.
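Both habits can be sketched roughly as follows. The counter is plain JavaScript. The usage comment assumes Puppeteer-style page and browser objects, and createRestartCounter is a made-up name.

```javascript
// Count requests and signal when the browser should be recycled.
// `maxRequests` is the "X" from the text; pick it from your own measurements.
function createRestartCounter(maxRequests) {
  let served = 0;
  return {
    // Call once per finished request. Returns true when the limit is reached
    // and resets the count so the next browser starts from zero.
    record() {
      served += 1;
      if (served >= maxRequests) {
        served = 0;
        return true;
      }
      return false;
    },
  };
}

// Usage sketch (assumption: Puppeteer-style `page` and `browser`):
//   await page.goto('about:blank');   // drop the previous document
//   if (counter.record()) {
//     await browser.close();
//     browser = await puppeteer.launch();
//   }
```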

Headless Chrome & proxies

You can’t set a different proxy per page without plugins, and those do little more than intercept each request and redirect it.

If you’re using proxy services, look at Luminati. They give you a constant proxy server address, and you can use a special username to send requests from a particular country or location.
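With a single fixed proxy address, the setup might look like the sketch below. It assumes Puppeteer, whose page.authenticate call supplies the credentials for the proxy's auth challenge. proxyConfig is a hypothetical helper and every value passed to it is a placeholder. The username format that selects a country is provider-specific, so take it from your provider's documentation rather than from here.

```javascript
// Build the pieces needed to route a browser through one proxy endpoint.
// All inputs are placeholders supplied by the caller.
function proxyConfig({ host, port, username, password }) {
  return {
    // --proxy-server applies to the whole browser, not to a single page.
    args: [`--proxy-server=${host}:${port}`],
    credentials: { username, password },
  };
}

// Usage sketch (assumption: Puppeteer is the driver):
//   const { args, credentials } = proxyConfig({ host, port, username, password });
//   const browser = await puppeteer.launch({ args });
//   const page = await browser.newPage();
//   await page.authenticate(credentials);
```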

Stop loading remote fonts

There isn’t much need for them in this environment, so use the --disable-remote-fonts switch to stop them loading.
