Sunday, August 11, 2013

Comparison of Go data types for string processing.

String processing is a very common operation in an application. This post is going to talk about some data types used in string processing in Go language. Two main concerns are readibility(or maintaining) and efficiency.

String type

String is no-doubt the most direct type for string processing. Go provides many ways to represent a string literal in the source code.

A string in Go is a serial of immutable bytes. Many builtin string related  functions suppose the text a string representing is encoded with UTF-8 encoding.

Each string object contains a pointer to the start of the byte buffer and the length to the string. Sub-strings from a long one may share the same large buffer with the original one.

Since string is immutable, processing on a string means creating new string objects. Two kinds of new strings may be created:
  1. Sub-strings. e.g. sub := org[2:5]. In this case, a new string header is allocated, the pointer is set to some offset to the original pointer, and the lengh is computed and filled.
  2. A totally new string. E.g. catenation of strings. The underlying buffer is newly allocated, and bytes are filled as needed.
The following code fragment contains both string allocations:

    newStr := "prefix " + oldStr[1:10] + " suffix"

A string is comparable, or it can be a key in a Go map object.

Byte slice type

In Go, a string can be easily converted to a byte slice using similar grammar to type convertion:

    sl := []byte(str)

But one has to remember that, this convertion allocates a new byte slice object, and copies the bytes to the new slice. So changing the content of the slice will not affect the original string. (Otherwise that string is no longer immutable). For the same reason, converting back from a byte slice to a string also needs a new buffer to be allocated. In other words, this convertion is not cheap.

Like other type of slices in Go, a byte slice contains a header of a pointer to the buffer, the length of the slice, and the capability of the slice. This means a byte-slice consumes more meta memory than a string.

But a slice is mutable. The content, or the bytes, in a slice can be changed. So when you are appending sub strings to a byte slice, the underlying buffer may keep unchanged, unless the length exceeds the capability. A more commonly used pattern is to use a bytes.Buffer type, (or for my personally used, I prefer a more lightweight object "github.com/daviddengcn/go-villa".ByteSlice sometimes, which utilizing the Go type system and allocates fewer extra bytes). In some special cases, in-place operation can be performed on a byte slice.

A byte slice cannot be used as a key type in a map.

Byte array type

For some special situations, the length of a string is fixed, a byte array can be used. It needs no extra meta bytes, and is mutable. You can easily create a byte slice representing the whole or part of a byte array.

Content of a byte array is mutable. E.g. in-place processing can be performed like the byte slice. But the assignment of a byte array is like that of a struct, i.e. the whole array is copied to the destination.

A byte array type can be used as a key type in a map.

Rune slice type

A string can be converted to a rune slice in a grammar similar to type convertion:

    rsl := []rune(str)

supposing the bytes are UTF-8 encoding. Besides the allocation of the rune slice, the contents are decoded, other than copied, as the rune slice. This is more expensive than convertions between a string and a byte slice.

A rune slice cannot be used as a key type in a map.

Rune array type

Like the byte array type, rune array is used to represent strings with a fixed/limited number of runes. No extra meta bytes, and the contents are  mutable.

A rune array can be used as a key type in a map.

Summary

Here is a table of the summary to the differences of the above types and convertions between them.
byte-array
byte-slice
string
rune-slice
rune-array
type
[4]byte []byte string []rune [4]rune
size
fixed variable variable variable fixed
access
read/write read/write read only read/write read/write
comparable
yes no yes no yes
meta bytes
0 pointer+len+cap pointer+len pointer+len+cap 0
to byte-array
- copy(ar[:], sl) copy(ar[:], st) copy(ar[:], string(rsl)) copy(ar[:], string(rar[:]))
to byte-slice
ar[:] - []byte(st) []byte(string(rsl)) []byte(string(rar[:]))
to string
string(ar) string(sl) - string(rsl) string(rar[:])
to rune-slice
[]rune(string(ar)) []rune(string(sl)) []rune(st) - rar[:]
to rune-array
copy(rar[:], []rune(string(ar))) copy(rar[:], []rune(string(sl))) copy(rar[:], []rune(st)) copy(rar[:], rsl) -
(variables for byte array, byte slice, string, rune slice and rune array are ar, sl, st, rsl, rar, respectively)

Best practice

Some principles or best practice advices are given, out of my own experience:
  1. When you are processing a very small string, e.g. for showing some text to the console, using string directly is the best choice, since it is very direct and easily to read.
  2. Reading from a long string can use strings.Reader
  3. When you have to do a lot of modification on a string, e.g. modifying some content inside a long text, you can consider the whole processing totally in variables of the byte slice type. Text can be read as a byte slice, and a byte slice can be written out through an io.Writer instance. Go provides many builtin functions upon byte slices. Don't convert the bytes to strings, unless they have to be the key of a map.
  4. When possible, using a byte-array is very efficient. E.g. you need a large map with the key a string with a fixed/limited length.
  5. rune slices are used only when the performance is not in consideration.
  6. Using rune array is also efficienty, when possible, but convertion from and to UTF-8 encoding bytes may be expensive.