Why doesn't lighttpd always generate ETags?

ETags are a wonderful part of the HTTP standard. You can read about them in RFC 2616 and on Wikipedia. An ETag is an optional opaque identifier returned by an HTTP server to identify a version of a resource. Servers generate the ETag of a file from a piece of data that changes when the file does; the modified time of the file is one example. Clients can send the ETag back to the server in the If-None-Match header of later requests for the same resource. The server compares the ETag provided by the client against the current ETag of the resource. If the ETags match, the server can respond with a status line of 304 Not Modified and omit the content of the resource from the body of the response, since the client already has it.
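The server side of that exchange can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names make_etag and respond are hypothetical, the ETag is derived from nothing but the modification time and size, and this is not how lighttpd computes its values.

```python
import os


def make_etag(path):
    """Derive a quoted ETag from the file's modification time and size."""
    info = os.stat(path)
    return '"%d-%d"' % (info.st_mtime_ns, info.st_size)


def respond(path, if_none_match=None):
    """Return (status, etag, body) for a GET of path."""
    etag = make_etag(path)
    if if_none_match == etag:
        # The client's copy is current, so the body is omitted.
        return 304, etag, b""
    with open(path, "rb") as handle:
        return 200, etag, handle.read()
```

A first call without if_none_match returns status 200 along with the ETag; passing that ETag back returns 304 with an empty body until the file's modification time or size changes.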

This is particularly excellent if you have a process that needs to take some action when a resource changes. The process can poll the resource at some frequency and download a new copy only when the resource changes. As you may have noticed, I have been working with the pickle format recently. In my case, a Python process periodically writes a new pickled file into a part of the filesystem that is served by the web server lighttpd. On Ubuntu 12.04 LTS, the lighttpd package from the repositories serves requests with ETags by default.
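A polling client along those lines might look like the following sketch. The function name fetch_if_changed and its opener parameter are assumptions made for illustration; the only library behavior relied on is that urllib raises HTTPError for a 304 response.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, etag=None, opener=urllib.request.urlopen):
    """Return (body, etag); body is None when the server answers 304."""
    request = urllib.request.Request(url)
    if etag is not None:
        # Ask the server to skip the body if our copy is still current.
        request.add_header("If-None-Match", etag)
    try:
        with opener(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not Modified: keep using the copy we already have.
            return None, etag
        raise
```

Calling fetch_if_changed in a loop, with a sleep between calls and the most recent ETag passed back in, gives the polling behavior described above. If the server never sends an ETag, the returned etag is None and every poll transfers the whole file.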

ETag generation is part of the core of lighttpd. If your distribution doesn't enable ETags by default in lighttpd, you can do so by setting the following value in your lighttpd.conf.

# This configuration file is usually located at /etc/lighttpd/lighttpd.conf on Debian & Ubuntu machines.
static-file.etags = "enable"

The missing ETags

For example, using curl to retrieve a file from a server on my home network returns the resource with an ETag. The -D - flag causes curl to print the HTTP response headers to standard output.

ericu@ericu-desktop:~$ curl -D - http://corei7/~ericu/example.txt
HTTP/1.1 200 OK
Content-Type: text/plain
Accept-Ranges: bytes
ETag: "3912221703"
Last-Modified: Sat, 20 Sep 2014 16:37:00 GMT
Content-Length: 13
Date: Sat, 20 Sep 2014 16:37:26 GMT
Server: lighttpd/1.4.28

Hello ETags!

The ETag header in this case has a value of "3912221703". The double quotes are actually part of the header value. I executed touch example.txt on the server to change the modified time of the file. Then I performed the same request again.

ericu@ericu-desktop:~$ curl -D - http://corei7/~ericu/example.txt
HTTP/1.1 200 OK
Content-Type: text/plain
Accept-Ranges: bytes
ETag: "3912220878"
Last-Modified: Sat, 20 Sep 2014 16:39:09 GMT
Content-Length: 13
Date: Sat, 20 Sep 2014 16:39:11 GMT
Server: lighttpd/1.4.28

Hello ETags!

The ETag now has a value of "3912220878" because lighttpd generated a new ETag when the modified time of the file changed.

Interestingly, I found that this did not work with pickled data. This is what I got when I serialized the string "Hello ETags!" using the pickle module into a file called example.pkl.

ericu@ericu-desktop:~$ curl -D - http://corei7/~ericu/example.pkl
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Accept-Ranges: bytes
Content-Length: 20
Date: Sat, 20 Sep 2014 16:41:14 GMT
Server: lighttpd/1.4.28

S'Hello ETags!'
p0

The content is returned but there is no ETag header. This means I cannot use the pattern of having a process poll for a change in the file. This would result in unacceptably high bandwidth usage as the file would be transferred completely on each request.
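For reference, a file like example.pkl can be produced with a few lines of Python. This is a sketch rather than the exact script used for this post: the function name write_pickle and the temporary output directory are assumptions. Protocol 0 is the text-based pickle format seen in the response above. Note that Python 3 writes a str with a leading V instead of the S'...' form, so the bytes on disk differ slightly from the Python 2 output shown.

```python
import os
import pickle
import tempfile


def write_pickle(value, path):
    """Serialize value to path using the text-based pickle protocol 0."""
    with open(path, "wb") as handle:
        pickle.dump(value, handle, protocol=0)


# Hypothetical location; the post serves the file from a user's web directory.
path = os.path.join(tempfile.mkdtemp(), "example.pkl")
write_pickle("Hello ETags!", path)

# Read the file back to confirm the value round-trips.
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```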

Finding the problem

I initially poked around in the lighttpd configuration. I assumed I just needed to flip some configuration value to cause lighttpd to always generate ETags. Unfortunately, the only thing I found in my search was numerous people reporting problems with ETags and mod_compress. No change to lighttpd.conf seemed able to do what I wanted.

Giving up on that avenue, I decided to pull the source code for lighttpd. On Ubuntu 12.04 LTS the version is 1.4.28. You can download the source here. Using find, xargs, and grep I located the section of code responsible for ETag generation in src/mod_staticfile.c. Here is the relevant excerpt.

    440         if (NULL == array_get_element(con->response.headers, "Content-Type")) {
    441                 if (buffer_is_empty(sce->content_type)) {
    442                         /* we are setting application/octet-stream, but also announce that
    443                          * this header field might change in the seconds few requests 
    444                          *
    445                          * This should fix the aggressive caching of FF and the script download
    446                          * seen by the first installations
    447                          */
    448                         response_header_overwrite(srv, con, CONST_STR_LEN("Content-Type"), CONST_STR_LEN("application/octet-stream"));
    449 
    450                         allow_caching = 0;
    451                 } else {
    452                         response_header_overwrite(srv, con, CONST_STR_LEN("Content-Type"), CONST_BUF_LEN(sce->content_type));
    453                 }
    454         }
    455 
    456         if (con->conf.range_requests) {
    457                 response_header_overwrite(srv, con, CONST_STR_LEN("Accept-Ranges"), CONST_STR_LEN("bytes"));
    458         }
    459 
    460         if (allow_caching) {
    461                 if (p->conf.etags_used && con->etag_flags != 0 && !buffer_is_empty(sce->etag)) {
    462                         if (NULL == array_get_element(con->response.headers, "ETag")) {
    463                                 /* generate e-tag */
    464                                 etag_mutate(con->physical.etag, sce->etag);
    465 
    466                                 response_header_overwrite(srv, con, CONST_STR_LEN("ETag"), CONST_BUF_LEN(con->physical.etag));
    467                         }
    468                 }

Line 460 guards ETag generation: only when allow_caching is non-zero does lighttpd add an ETag header, and then only if the response does not already have one. The value allow_caching is set to zero on line 450. The if statement on line 440 checks for the absence of a Content-Type header in the response. On line 441, if the content type of the file is empty, a default of application/octet-stream is used and caching is disabled. The result is simple: if lighttpd doesn't know the content type of a file, it won't generate an ETag.
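The same branch structure can be modeled in Python to make the consequence easier to see. This is a simplified model and not a port of the C code: the function name choose_headers is hypothetical, and it ignores headers that are already set, con->etag_flags, and the etag_mutate step.

```python
def choose_headers(known_content_type, file_etag, etags_enabled=True):
    """Model which headers the excerpt would set for a static file."""
    headers = {}
    allow_caching = True
    if not known_content_type:
        # Unknown type: fall back to a default and disable caching (line 450).
        headers["Content-Type"] = "application/octet-stream"
        allow_caching = False
    else:
        headers["Content-Type"] = known_content_type
    if allow_caching and etags_enabled and file_etag:
        # Only reached when the content type was known (line 460).
        headers["ETag"] = file_etag
    return headers
```

With known_content_type set to "text/plain" the result includes an ETag entry; with an empty content type the result contains only Content-Type: application/octet-stream, which matches the two curl responses shown earlier.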

I am unsure why lighttpd was implemented in this fashion. I can't find anything in RFC 2616 indicating that this is the required behavior. The comments on lines 442-447 indicate that this may be some sort of workaround for caching problems in Firefox.

Fixing the problem

Armed with the knowledge that lighttpd needs to know the content type of a file to generate an ETag, I set out to inform lighttpd thusly. I found the configuration value mimetype.assign in lighttpd's documentation. It was absent in my lighttpd.conf so I set it as follows.

mimetype.assign = (
  ".pkl" => "application/pickle"
)

However, this didn't work. Lighttpd refused to start. On Ubuntu the following line is present in lighttpd.conf.

include_shell "/usr/share/lighttpd/create-mime.assign.pl"

This means that lighttpd runs the script and reads configuration from its standard output. Unsurprisingly, this script generates the configuration value mimetype.assign. I didn't bother understanding all of the script, but it reads from /etc/mime.types if the file exists. I edited that file and added the following line to the end.

application/pickle                              pkl pickle

This defines files ending in .pkl or .pickle to have a MIME type of application/pickle. Next, lighttpd must be restarted so its configuration is updated. Then I performed my curl request again for the pickle file.

ericu@ericu-desktop:~$ curl -D - http://corei7/~ericu/example.pkl
HTTP/1.1 200 OK
Content-Type: application/pickle
Accept-Ranges: bytes
ETag: "3635400004"
Last-Modified: Sat, 20 Sep 2014 16:41:03 GMT
Content-Length: 20
Date: Sat, 20 Sep 2014 17:07:31 GMT
Server: lighttpd/1.4.28

S'Hello ETags!'
p0

Success! Lighttpd now generates an ETag for my files ending in .pkl.
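The mapping expressed by the line added to /etc/mime.types can also be checked without touching the web server. The sketch below uses Python's mimetypes module as a stand-in parser for the mime.types format; it is not the Perl script that lighttpd runs, and the variable names are chosen for illustration.

```python
import io
import mimetypes

# The same line that was appended to /etc/mime.types.
MIME_TYPES_LINE = "application/pickle                              pkl pickle\n"

registry = mimetypes.MimeTypes()
# readfp parses mime.types-formatted text: a type followed by its extensions.
registry.readfp(io.StringIO(MIME_TYPES_LINE))

pkl_type, _ = registry.guess_type("example.pkl")
pickle_type, _ = registry.guess_type("example.pickle")
```

Both pkl_type and pickle_type come back as application/pickle, which is the Content-Type lighttpd reports once it has picked up the new entry.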


© Eric Urban 2014