readr read_log() in R: Read Web Server Log Files
The readr read_log() function reads a web server log written in the Common Log Format into a tibble. It tokenizes each line so that bracketed timestamps and quoted request strings stay in one piece, giving you one column per log field instead of one messy string.
read_log("access.log") # parse a Common Log Format file
read_log("https://site.com/access.log") # read a log straight from a URL
read_log(I(one_line)) # parse literal log text
read_log("access.log", skip = 2) # skip comment lines at the top
read_log("access.log", n_max = 1000) # read only the first 1000 requests
read_log("access.log", col_names = my_names) # supply your own column names
Need explanation? Read on for examples and pitfalls.
What read_log() does
read_log() reads a Common Log Format file into a tibble. Web servers such as Apache and nginx record every request as one line of text. That line packs together the client IP, the user, a bracketed timestamp, the quoted request, the status code, and the response size. read_log() splits each line into those fields and returns a tidy data frame.
The clever part is the tokenizer. A naive whitespace split would shatter [10/Oct/2000:13:55:36 -0700] and "GET /index.html HTTP/1.0" into pieces. read_log() instead treats anything inside [ ] or " " as a single token, strips the delimiters, and keeps the field whole.
Syntax and key arguments
read_log() takes a file plus a handful of optional controls. Only the file argument is required; the rest tune naming, type guessing, and how much of the file to read.
Unlike read_csv(), a log file has no header row, so col_names defaults to FALSE and every column comes back as X1, X2, and so on. Pass a character vector to col_names to label the fields yourself. The skip, n_max, and col_types arguments behave exactly as they do in the delimited readers.
read_log() handles both the Common Log Format (7 fields) and the Combined Log Format (9 fields, adding the referer and user agent). The tokenizer simply returns as many columns as the line contains, so a combined log yields X1 through X9.
read_log() examples
Start with a small log you can see in full. Wrapping the text in I() tells read_log() the value is data, not a file path. This three-line sample is standard Common Log Format.
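A minimal sketch of that first example. The three log lines here are made up for illustration; any valid Common Log Format text behaves the same way.

```r
library(readr)

# Three made-up lines in Common Log Format:
# host ident user [timestamp] "request" status size
sample_log <- paste(
  '172.16.0.9 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
  '10.0.0.4 - - [10/Oct/2000:13:57:01 -0700] "GET /about.html HTTP/1.0" 200 1543',
  '192.168.1.20 - - [10/Oct/2000:13:58:12 -0700] "POST /login HTTP/1.0" 404 287',
  sep = "\n"
)

# I() marks the string as literal data rather than a file path
logs <- read_log(I(sample_log))
logs  # a 3-row tibble with columns X1 through X7
```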
The brackets and quotes are gone, the timestamp survived as one column, and the status and size parsed as numbers.
Name the columns so the tibble is readable. Pass a character vector to col_names with one label per field. This is the single most useful argument, because X1 through X7 tell you nothing.
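For example, with the seven standard Common Log Format field names (the file path is a placeholder):

```r
library(readr)

# One label per field, in the order they appear on each line
cols <- c("ip", "ident", "user", "time", "request", "status", "size")
logs <- read_log("access.log", col_names = cols)
```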
Read only the start of a huge log. Production access logs run to millions of lines. Use n_max to sample the first few while you work out the column layout.
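A quick sketch of that preview step, again with a placeholder path:

```r
library(readr)

# Stop after the first 1000 data lines; the rest of the file is never read
preview <- read_log("access.log", n_max = 1000)
```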
Summarise traffic once the log is a tibble. With named columns the log behaves like any other data frame, so a status-code count is one dplyr call.
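A sketch of that summary, assuming the named-columns read shown earlier and a placeholder file path:

```r
library(readr)
library(dplyr)

logs <- read_log("access.log",
                 col_names = c("ip", "ident", "user", "time",
                               "request", "status", "size"))

# Requests per status code, most frequent first
logs |> count(status, sort = TRUE)
```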
read_log() vs other readers
read_log() is the only readr function that understands the Common Log Format. The delimited readers would mangle the bracketed date and quoted request. Reach for a sibling only when the file is not a web server log.
| Function | Reads | Output |
|---|---|---|
| read_log() | Common or Combined Log Format | tibble, one column per log field |
| read_lines() | any text file | character vector, one element per line |
| read_delim() | delimited text | tibble, columns split by a delimiter |
| read_table() | whitespace-separated text | tibble, columns split by runs of spaces |
The decision rule is simple. If the file is an Apache or nginx access log, use read_log(). If it is an application log with one free-form message per line, use read_lines() and parse with regular expressions afterwards.
read_log() needs no format specification because its tokenizer recognises the [ ] and " " groups. That is why it copes with both the 7-field common format and the 9-field combined format without any extra arguments.
Common pitfalls
The timestamp comes back as text. read_log() never parses the date; the time column is a plain string like 10/Oct/2000:13:55:36 -0700. Convert it with parse_datetime() and the Apache date format.
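A sketch of that conversion, assuming the named-columns read from earlier and a placeholder file path:

```r
library(readr)

logs <- read_log("access.log",
                 col_names = c("ip", "ident", "user", "time",
                               "request", "status", "size"))

# "%z" consumes the trailing timezone offset, e.g. "-0700",
# so the result is a proper POSIXct datetime column
logs$time <- parse_datetime(logs$time, format = "%d/%b/%Y:%H:%M:%S %z")
```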
Mixed formats produce ragged columns. If a file mixes common and combined lines, or a request string itself contains an unescaped quote, the column count varies row to row. readr fills the gaps with NA and records the trouble. Call problems() on the result to inspect every flagged row.
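The inspection step looks like this; the file name is a placeholder for a log with ragged lines:

```r
library(readr)

logs <- read_log("mixed.log")  # a file mixing common and combined lines

# One row per parsing issue: row, col, expected value, actual value
problems(logs)
```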
Any whitespace-separated text will tokenize. read_log() does not validate that the input is really a log. Hand it an unrelated text file and it returns nonsense columns with no error. Confirm the file is Common Log Format before trusting the output.
Try it yourself
Try it: Read the four-line Common Log Format string below with read_log(), name the seven columns, then save to ex_n the number of requests that returned status 200.
Click to reveal solution
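One possible solution. The exercise string itself is not reproduced here, so the four sample lines below are made up; the pattern is the same for any Common Log Format text.

```r
library(readr)

# Stand-in for the exercise's four-line log string
ex_text <- paste(
  '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 512',
  '10.0.0.2 - - [10/Oct/2000:13:56:11 -0700] "GET /a.html HTTP/1.0" 200 1024',
  '10.0.0.3 - - [10/Oct/2000:13:57:42 -0700] "GET /b.html HTTP/1.0" 404 128',
  '10.0.0.1 - - [10/Oct/2000:13:58:05 -0700] "GET /c.html HTTP/1.0" 200 2048',
  sep = "\n"
)

ex_log <- read_log(I(ex_text),
                   col_names = c("ip", "ident", "user", "time",
                                 "request", "status", "size"))

# Count the requests that returned status 200
ex_n <- sum(ex_log$status == 200)
```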
Explanation: Naming the columns turns the sixth field into status, a numeric column. sum(ex_log$status == 200) counts the TRUE values, which is the number of successful requests.
Related readr functions
read_log() handles the structured web server log; reach for a sibling when the text has a different shape.
read_lines(): read a file as a character vector, one string per line.
read_delim(): read text split by any single-character delimiter.
read_table(): read whitespace-separated files with ragged columns.
read_fwf(): read files where columns sit at fixed character positions.
parse_datetime(): convert the log timestamp string into a real datetime.
For the full argument reference, see the readr read_log documentation on tidyverse.org.
FAQ
What is the Common Log Format?
The Common Log Format is a standardized text layout that web servers use to record requests. Each line holds the client IP, an identity field, the user, a bracketed timestamp, the quoted request line, the HTTP status code, and the response size in bytes. Apache and nginx both write it by default. read_log() knows this layout, so it can split each line into the correct columns.
How do I read an Apache access log in R?
Call read_log() with the path to the access log: read_log("access.log"). The function tokenizes every line and returns a tibble with one column per field. Because access logs have no header, pass a character vector to col_names to label the columns, then convert the timestamp column with parse_datetime().
Does read_log() work with the Combined Log Format?
Yes. The Combined Log Format adds two quoted fields, the referer and the user agent, to the standard seven. read_log() uses the same tokenizer for both, so a combined log simply returns nine columns instead of seven. You do not need a different function or any extra argument.
Why are my log columns named X1, X2, X3?
Log files have no header row, so read_log() cannot guess field names and defaults to X1, X2, and onward. Supply your own names by passing a character vector to col_names, for example read_log(file, col_names = c("ip", "ident", "user", "time", "request", "status", "size")).
How do I parse the date in a web server log?
read_log() leaves the timestamp as text. Convert it with parse_datetime() and the Apache date format string: parse_datetime(logs$time, format = "%d/%b/%Y:%H:%M:%S %z"). The %z token reads the timezone offset, so the result is a proper POSIXct column you can filter and aggregate by time.