This section describes how to use the `data.table` package. The package provides a much faster implementation of the `data.frame` object, which it also extends.

This is a rather superficial overview of `data.table`, in which we briefly describe how to construct a panel data set with lagged variables. For a thorough overview of the capabilities of data.table, please refer to the excellent documentation of the package, for example the vignettes at http://cran.r-project.org/web/packages/data.table/index.html

A `data.table` is a collection of vectors of identical length. Each vector forms a column, so the `data.table` can be thought of as a table with a number of rows and a number of columns.

Let's create a data.table:

`require(data.table)`

`## Loading required package: data.table`

```
v1 = seq(0, 1, l = 10)
v2 = sample(c("a", "b", "c"), 10, replace = TRUE)
dd = data.table(i = v1, l = v2)
dd
```

```
## i l
## 1: 0.0000 c
## 2: 0.1111 b
## 3: 0.2222 c
## 4: 0.3333 c
## 5: 0.4444 b
## 6: 0.5556 c
## 7: 0.6667 b
## 8: 0.7778 c
## 9: 0.8889 b
## 10: 1.0000 b
```

Columns can then be accessed by name using the dollar sign, as in a standard data.frame, through the `j` argument of `[`, or by handling the table as a list (a data.table **is** a list):

```
# use dollar sign
dd$i
```

`## [1] 0.0000 0.1111 0.2222 0.3333 0.4444 0.5556 0.6667 0.7778 0.8889 1.0000`

```
# same as using the j-argument, the list index or the list name?
all(all.equal(dd$i,dd[,i]),
all.equal(dd$i,dd[[1]]),
all.equal(dd$i,dd[["i"]]))
```

`## [1] TRUE`

`[`

In data.table, the first argument of `[` subsets the rows, the second evaluates an expression on that subset, and the third applies that expression by group. TBD.
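The three arguments of `[` can be illustrated with a minimal sketch. The table `dt` and its columns `g` and `x` are hypothetical names chosen for this example only.

```r
library(data.table)

# hypothetical toy table: a grouping column g and a value column x
dt <- data.table(g = c("a", "a", "b", "b", "b"), x = 1:5)

# first argument (i): keep only the rows where x > 1
sub <- dt[x > 1]

# second argument (j): evaluate an expression on the columns
total <- dt[, sum(x)]

# all three together: subset, then compute sum(x) separately for each g
res <- dt[x > 1, .(total = sum(x)), by = g]
res
```

Here `res` has one row per value of `g` that survives the subset, with the column `total` computed within each group.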

`keys`

Here we show how the keys option can be used to compute lagged variables in a `data.table`. This requires that you understand the way data.table works.

For this we are going to use a simulated dynamic panel. This panel will represent draws from an AR(1) process with a random effect

\[ y_{it} = \rho \, y_{i,t-1} + f_{i} + u_{it} \]

Let's first generate the data in a very crude and slow way.

```
p = list(n = 20, t = 10, rho = 0.8, f_sd = 0.2, y0_sd = 1, u_sd = 1)
# create 1 entry per individual and draw a random panel length l
dd = data.table(i = 1:p$n, l = rpois(p$n, p$t), y0 = rnorm(p$n, sd = p$y0_sd),
                f = exp(rnorm(p$n, sd = p$y0_sd)))
# for each individual we create the time series
dd = dd[, {
  y = rep(0, l)
  y[1] = y0
  u = rnorm(l, sd = p$u_sd)
  # seq_len(l)[-1] equals 2:l when l >= 2 and is empty otherwise,
  # whereas 2:l would count down when l < 2
  for (t in seq_len(l)[-1]) {
    y[t] = p$rho * y[t - 1] + f + u[t]
  }
  list(y = y, t = 1:l)
}, i]
```

We plot the series for a few individuals:

Now that we have our data set we can create the first difference to remove the fixed effect. To do so we are going to use the `keys` of data.table.

We first define the keys for the table. The keys should uniquely identify a row in the data. In our case (i,t) is enough.

`setkey(dd, i, t)`
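As a small illustration of what setting a key does, the sketch below uses a hypothetical four-row table `panel` (not the simulated `dd`). It sets the key, checks that the key columns identify each row uniquely, and looks up a single row by key value.

```r
library(data.table)

# hypothetical toy panel, deliberately unsorted
panel <- data.table(i = c(2L, 1L, 2L, 1L),
                    t = c(2L, 1L, 1L, 2L),
                    y = c(0.4, 0.1, 0.3, 0.2))

# sort by (i, t) by reference and mark these columns as the key
setkey(panel, i, t)
key(panel)

# the key should identify each row uniquely: no duplicated (i, t) pairs
no_dups <- anyDuplicated(panel, by = key(panel)) == 0

# keyed lookup: the row for individual 1 in period 2
y_12 <- panel[.(1L, 2L), y]
```

After `setkey` the rows are physically ordered by `i` and then `t`, which is what makes the keyed lookups used below fast.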

Next we can use the `J` function from the package to create the lag of `y`:

```
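# join dd to itself on (i, t - 1), then extract the matched y column;
# taking $y from the joined table works whether or not a j expression returns a vector
dd$y.l1 = dd[J(i, t - 1)]$y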
dd
```

```
## i y t y.l1
## 1: 1 0.3396 1 NA
## 2: 1 -1.1576 2 0.3396
## 3: 1 -0.7036 3 -1.1576
## 4: 1 0.2250 4 -0.7036
## 5: 1 0.3032 5 0.2250
## ---
## 215: 20 9.5737 7 9.6171
## 216: 20 8.1554 8 9.5737
## 217: 20 7.4401 9 8.1554
## 218: 20 6.1551 10 7.4401
## 219: 20 6.3545 11 6.1551
```

The `J` function creates a data.table with 2 columns, built from the columns `i` and `t` of the wrapping data.table. It acts as a running index that goes through the table, and joining on it looks up, for each row, the row with the matching key; where no such row exists the result is `NA`. We use `-1` to express that we want to shift this index by one period. Any transformation can be performed at this point: one can go 4 periods back or forward, for example.
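To illustrate that the shift is arbitrary, here is a minimal sketch on a hypothetical toy panel (`panel`, `lag2` and `lead1` are names made up for this example) that computes a second lag and a first lead with the same join.

```r
library(data.table)

# hypothetical toy panel: 2 individuals observed in periods 1 to 4
panel <- data.table(i = rep(1:2, each = 4),
                    t = rep(1:4, times = 2),
                    y = c(1, 2, 3, 4, 10, 20, 30, 40))
setkey(panel, i, t)

# second lag: for each row look up the same individual two periods earlier
lag2 <- panel[J(i, t - 2L)]$y

# first lead: look up the same individual one period later
lead1 <- panel[J(i, t + 1L)]$y
```

Rows whose shifted key does not exist in the table, such as the first two periods of each individual for `lag2`, come back as `NA`.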

`:=`

There is an alternative approach to the above. It differs mainly in how we add the new column containing the lagged values. When the data.table contains a lot of data, it is preferable to manipulate it **by reference**, i.e. using the function `:=`. The following is a slight modification of a related stackoverflow.com answer. It relies on the concept of a *self-join*, i.e. we join the data.table to itself based on the value of a key:

```
setcolorder(dd,c("i","t","y","y.l1")) # change column order by reference
dd[list(i,t-1)] # evaluate at value of [lagged] key: for each i, the t index is shifted one back
```

```
## i t y y.l1
## 1: 1 0 NA NA
## 2: 1 1 0.3396 NA
## 3: 1 2 -1.1576 0.3396
## 4: 1 3 -0.7036 -1.1576
## 5: 1 4 0.2250 -0.7036
## ---
## 215: 20 6 9.6171 7.2957
## 216: 20 7 9.5737 9.6171
## 217: 20 8 8.1554 9.5737
## 218: 20 9 7.4401 8.1554
## 219: 20 10 6.1551 7.4401
```

`dd[,y.l2 := dd[list(i,t-1)][["y"]]] # just add the "y" column of that to dd by reference`

```
## i t y y.l1 y.l2
## 1: 1 1 0.3396 NA NA
## 2: 1 2 -1.1576 0.3396 0.3396
## 3: 1 3 -0.7036 -1.1576 -1.1576
## 4: 1 4 0.2250 -0.7036 -0.7036
## 5: 1 5 0.3032 0.2250 0.2250
## ---
## 215: 20 7 9.5737 9.6171 9.6171
## 216: 20 8 8.1554 9.5737 9.5737
## 217: 20 9 7.4401 8.1554 8.1554
## 218: 20 10 6.1551 7.4401 7.4401
## 219: 20 11 6.3545 6.1551 6.1551
```

`dd[,all.equal(y.l1,y.l2)]`

`## [1] TRUE`

You can see that this already produces the desired result: the lagged value of `y`.
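With the lagged column available, the first difference mentioned earlier is a single subtraction. The following is a minimal sketch on a small hypothetical table: `panel`, `dy` and `y.l1.shift` are illustrative names, not part of the text above. It also compares the self-join lag with data.table's `shift()` function, a different technique that gives the same values only when the rows are sorted and each individual's periods are consecutive.

```r
library(data.table)

# hypothetical toy panel: 2 individuals, 3 consecutive periods each
panel <- data.table(i = rep(1:2, each = 3),
                    t = rep(1:3, times = 2),
                    y = c(1, 3, 6, 10, 15, 21))
setkey(panel, i, t)

# lag by self-join, added by reference
panel[, y.l1 := panel[list(i, t - 1L)][["y"]]]

# first difference; NA in each individual's first period
panel[, dy := y - y.l1]

# alternative lag: shift y by one row within each individual
# (assumes sorted rows and no gaps in t)
panel[, y.l1.shift := shift(y, n = 1L, type = "lag"), by = i]
panel
```

If some individuals have missing periods, the two lags differ: the self-join returns `NA` for a missing previous period, while `shift()` simply takes the previous row.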