haven read_dta() in R: Import Stata .dta Files
The haven read_dta() function imports a Stata .dta file into an R tibble. It reads every Stata version directly, needs no Stata installation, and keeps the variable and value labels stored inside the file.
read_dta("data.dta") # read a Stata .dta file
read_stata("data.dta") # identical alias of read_dta()
read_dta("data.dta", col_select = c(id, age)) # read only some columns
read_dta("data.dta", n_max = 1000) # read the first 1000 rows
read_dta("data.dta", skip = 5) # skip the first 5 data rows
read_dta("old.dta", encoding = "latin1") # force encoding for old files
write_dta(df, "data.dta") # write a tibble back to Stata
Need explanation? Read on for examples and pitfalls.
What read_dta() does
read_dta() turns a Stata .dta file into a tibble. You pass it a file path and it returns a tidy data frame with one column per Stata variable. The function reads the binary Stata format directly through the bundled ReadStat C library, so no Stata license or software is needed on the machine. It handles every Stata file version, from Stata 8 through the current release, with no version argument from you.
Unlike a CSV export, a Stata file carries a full codebook. Variable labels, value labels such as 1 = "Male", and date display formats all live inside the single .dta file. read_dta() reads that metadata in the same call and attaches it to the tibble, so the analysis context travels with the data.
Syntax and key arguments
Most calls need only the file path; the remaining arguments control which rows and columns you read and how text is decoded. Stata files from research projects can be both wide and long, so these arguments let you pull a slice instead of the whole table.
The arguments you reach for most are col_select (keep only the variables you need), n_max (preview a large file fast), and encoding. The encoding argument matters for files written by Stata 13 or earlier, which do not declare a text encoding and may hold Windows-1252 or Latin-1 text rather than UTF-8. read_dta() reads the encoding from modern files on its own; only an old file with garbled accented characters, or one that fails with an encoding conversion error, needs an explicit encoding such as encoding = "latin1".
The pandas equivalent of read_dta("data.dta") is pandas.read_stata("data.dta"). Both return a tabular object, but haven gives you a tibble and keeps Stata value labels as a haven_labelled class, whereas pandas converts them to plain categories.
read_dta() examples
haven ships an example iris.dta file, so every example below runs without a Stata file of your own. Build the path with system.file() and pass it to read_dta().
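A minimal sketch of that first read, assuming haven is installed. The object name iris_dta is only an illustrative choice.

```r
library(haven)

# Locate the example file that ships with haven
path <- system.file("examples", "iris.dta", package = "haven")

# Read the Stata file into a tibble
iris_dta <- read_dta(path)

dim(iris_dta)    # number of rows and columns
names(iris_dta)  # variable names exactly as stored in the file
```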
The result is a tibble. Stata variable names cannot contain dots, so the base R name Sepal.Length cannot survive unchanged. Call names() on the result to see the exact spelling stored in the bundled file.
Read only the columns you need with col_select. It uses tidyselect syntax, so you name columns without quotes. This trims a wide Stata export down to the variables your analysis touches.
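To keep the column names certain, this sketch first writes the built-in mtcars data to a temporary .dta file and then reads back two variables. The temporary file and the name cars_small are illustrative assumptions, not part of the bundled example.

```r
library(haven)

# Write a small Stata file with known column names (mtcars ships with R)
tmp <- tempfile(fileext = ".dta")
write_dta(mtcars, tmp)

# Keep two variables, named without quotes (tidyselect syntax)
cars_small <- read_dta(tmp, col_select = c(mpg, cyl))

names(cars_small)
```

Only the selected variables come back, so the rest of a wide file never takes up memory in the R session.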
Preview a large file with n_max. Reading a capped number of rows returns a sample in milliseconds, even when the full file holds hundreds of thousands of observations.
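A short sketch of a capped read on the bundled file. The cap of 10 rows and the name preview are arbitrary choices for illustration.

```r
library(haven)

path <- system.file("examples", "iris.dta", package = "haven")

# Read only the first 10 observations
preview <- read_dta(path, n_max = 10)

nrow(preview)  # number of rows actually read
```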
Confirm the object once the file is in memory. read_dta() always returns a tibble, which prints cleanly and works with every tidyverse function.
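One way to check the class, sketched on the bundled file. inherits() is used here so the check needs nothing beyond base R and haven.

```r
library(haven)

path <- system.file("examples", "iris.dta", package = "haven")
iris_dta <- read_dta(path)

class(iris_dta)               # class vector of the returned object
inherits(iris_dta, "tbl_df")  # TRUE when the object is a tibble
is.data.frame(iris_dta)       # a tibble is also a data frame
```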
haven also exports read_stata() as an identical alias of read_dta(), so you can use whichever name reads more clearly in your script. Both call the same underlying reader and return the same tibble.
read_dta() vs read_stata(), read_sav() and other readers
read_dta() is one of several haven readers, each tied to a statistical file format. They share the same return type and most arguments, so picking the right one is mostly a matter of matching the file extension.
| Function | Reads | Source software |
|---|---|---|
| read_dta() | .dta | Stata |
| read_stata() | .dta | Stata, identical alias |
| read_sav() | .sav, .zsav | SPSS |
| read_sas() | .sas7bdat | SAS |
| read_csv() | .csv | Plain text export |
Use read_dta() or read_stata() for Stata data files, which end in .dta. Use read_sav() for SPSS files and read_sas() for SAS files. If the data reached you as a plain .csv export from Stata, use readr's read_csv() instead, since that file no longer holds the Stata metadata.
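The extension-matching rule can be sketched as a small dispatcher. read_by_extension() is a hypothetical helper written for this illustration, not a haven function, and the csv branch assumes the readr package is installed.

```r
library(haven)

# Hypothetical helper: pick a reader from the file extension
read_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    dta      = read_dta(path),
    sav      = ,                       # falls through to the zsav branch
    zsav     = read_sav(path),
    sas7bdat = read_sas(path),
    csv      = readr::read_csv(path),  # plain text, no Stata metadata
    stop("Unsupported file extension: ", ext)
  )
}

iris_dta <- read_by_extension(
  system.file("examples", "iris.dta", package = "haven")
)
```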
Labelled columns hold codes rather than text; call as_factor() to decode.
Common pitfalls
Pointing read_dta() at a path that does not exist. The function cannot guess where the file is. A typo in the path or the wrong working directory triggers an immediate error.
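A defensive sketch that checks the path before reading. The path data/survey.dta is a hypothetical placeholder, so on most machines the else branch runs.

```r
library(haven)

# Hypothetical path used only for illustration
path <- "data/survey.dta"

if (file.exists(path)) {
  survey <- read_dta(path)
} else {
  # Show the resolved path and the working directory to help spot the typo
  message(
    "File not found: ", normalizePath(path, mustWork = FALSE),
    " (working directory: ", getwd(), ")"
  )
}
```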
Garbled text from an old Stata file. Files written by Stata 13 or earlier do not record their text encoding. With no encoding argument, read_dta() assumes Windows-1252 for these files, so text saved by Stata on Mac or Linux as Latin-1 can arrive as mojibake or fail to convert. Re-read the file with encoding = "latin1" to fix the characters.
Coded columns arriving as labelled values. When a .dta file stores value labels, read_dta() returns those columns with class haven_labelled. They look like numbers but carry hidden labels, which breaks some functions. Convert them with as_factor() for readable factors or zap_labels() to keep the raw values.
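The sketch below builds a tiny labelled column by hand, round-trips it through a temporary .dta file, and applies both conversions. The data, the sex column, and the object names are invented for illustration.

```r
library(haven)

# Illustrative data: a coded column with value labels
df <- tibble::tibble(
  id  = 1:4,
  sex = labelled(c(1, 2, 2, 1), labels = c(Male = 1, Female = 2))
)

# Round-trip through a temporary Stata file
tmp <- tempfile(fileext = ".dta")
write_dta(df, tmp)
survey <- read_dta(tmp)

class(survey$sex)         # includes "haven_labelled"
as_factor(survey)$sex     # factor with levels Male and Female
zap_labels(survey)$sex    # raw values with the labels removed
```

as_factor() suits tables and plots where the label text matters, while zap_labels() suits numeric work on the underlying codes.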
Dates arriving as raw numbers. read_dta() converts columns with the Stata %td date format into R Date values, but a date column saved without that format reads as a large raw number. Check any suspicious column and convert it with as.Date(x, origin = "1960-01-01").
Try it yourself
Try it: Read the bundled iris.dta file, then compute the mean of the petal width column. Save the result to ex_mean.
Click to reveal solution
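One possible solution, written as a sketch. It looks up the petal width column by pattern instead of assuming its exact spelling in the bundled file; petal_col is an illustrative helper name.

```r
library(haven)

path <- system.file("examples", "iris.dta", package = "haven")
ex_data <- read_dta(path)

# Find the petal width column whatever its exact spelling
petal_col <- grep("petal.?width", names(ex_data),
                  ignore.case = TRUE, value = TRUE)

ex_mean <- mean(ex_data[[petal_col]])
ex_mean
```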
Explanation: read_dta() returns the Stata file as a tibble, so the petal width column in ex_data is an ordinary numeric column that mean() summarises directly.
Related haven functions
read_dta() works alongside a small set of haven readers, writers, and label helpers. Reach for the one that matches your file or your cleanup task.
- read_stata(): identical alias of read_dta() for reading a Stata .dta file.
- write_dta(): write a tibble back out as a Stata .dta file, with an optional version argument.
- read_sav(): read an SPSS .sav or .zsav file into a tibble.
- read_sas(): read a SAS .sas7bdat data file into a tibble.
- as_factor(): turn Stata labelled columns into readable factors.
- zap_labels(): strip value labels and keep the raw underlying values.
For the full argument reference, see the haven read_dta() documentation on tidyverse.org.
FAQ
How do I import a Stata file into R?
Install the haven package with install.packages("haven"), load it with library(haven), then call read_dta("path/to/file.dta"). The function reads the binary Stata format directly and returns the data as a tibble. You do not need Stata itself, a license, or any StataCorp software on the machine. The path can be absolute or relative to your current working directory.
What package reads .dta files in R?
The haven package is the standard choice. It reads Stata .dta files with read_dta() or its alias read_stata(), and it is part of the tidyverse. The older foreign::read.dta() still works for files up to Stata 12, but haven is faster, actively maintained, reads every modern Stata version, and returns a tidy tibble with value labels preserved.
What is the difference between read_dta() and read.dta()?
read_dta() comes from haven and reads every Stata version, returning a tibble with labels kept as a haven_labelled class. read.dta() comes from the older foreign package and only reads files up to Stata 12, returning a base data frame. For any recent Stata file, use haven's read_dta(). The dotted name is the legacy reader.
Why are my Stata columns showing as labelled in R?
The .dta file stored value labels, so read_dta() returns those columns with class haven_labelled. They display as numbers with attached labels. Call as_factor() on the data to convert labelled columns into factors, or zap_labels() to drop the labels and keep the raw values. This is common when importing coded survey or panel data.
Can read_dta() read all Stata versions?
Yes. read_dta() reads every Stata file format from Stata 8 through the current release, and it detects the version automatically, so you never pass a version argument when reading. The version argument exists only on the writing side, in write_dta(), where it controls which Stata release can open the file you create.
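A small round-trip sketch of that asymmetry, using the built-in mtcars data. The choice of version = 13 is arbitrary and only illustrates writing for an older Stata release.

```r
library(haven)

tmp <- tempfile(fileext = ".dta")

# version applies only when writing; this file targets Stata 13
write_dta(mtcars, tmp, version = 13)

# No version argument on the reading side
cars <- read_dta(tmp)
dim(cars)
```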