patterns/idioms/ffi/accepting-strings.md

4.5 KiB

Accepting Strings

Description

When accepting strings via FFI through pointers, there are two principles that should be followed:

  1. Keep foreign strings "borrowed", rather than copying them directly.
  2. Minimize the amount of complexity and unsafe code involved in converting from a C-style string to native Rust strings.

Motivation

The strings used in C have different behaviours to those used in Rust, namely:

  • C strings are null-terminated while Rust strings store their length
  • C strings can contain any arbitrary non-zero byte while Rust strings must be UTF-8
  • C strings are accessed and manipulated using unsafe pointer operations while interactions with Rust strings go through safe methods

The Rust standard library comes with C equivalents of Rust's String and &str called CString and &CStr, that allow us to avoid a lot of the complexity and unsafe code involved in converting between C strings and Rust strings.

The &CStr type also allows us to work with borrowed data, meaning passing strings between Rust and C is a zero-cost operation.

Code Example

pub mod unsafe_module {

    // other module content

    /// Log a message at the specified level.
    ///
    /// # Safety
    ///
    /// It is the caller's guarantee to ensure `msg`:
    ///
    /// - is not a null pointer
    /// - points to valid, initialized data
    /// - points to memory ending in a null byte
    /// - won't be mutated for the duration of this function call
    #[no_mangle]
    pub unsafe extern "C" fn mylib_log(
        msg: *const libc::c_char,
        level: libc::c_int
    ) {
        let level: crate::LogLevel = match level { /* ... */ };

        // SAFETY: The caller has already guaranteed this is okay (see the
        // `# Safety` section of the doc-comment).
        let msg_str: &str = match std::ffi::CStr::from_ptr(msg).to_str() {
            Ok(s) => s,
            Err(e) => {
                crate::log_error("FFI string conversion failed");
                return;
            }
        };

        crate::log(msg_str, level);
    }
}

Advantages

The example is is written to ensure that:

  1. The unsafe block is as small as possible.
  2. The pointer with an "untracked" lifetime becomes a "tracked" shared reference

Consider an alternative, where the string is actually copied:

pub mod unsafe_module {

    // other module content

    pub extern "C" fn mylib_log(msg: *const libc::c_char, level: libc::c_int) {
        // DO NOT USE THIS CODE.
        // IT IS UGLY, VERBOSE, AND CONTAINS A SUBTLE BUG.

        let level: crate::LogLevel = match level { /* ... */ };

        let msg_len = unsafe { /* SAFETY: strlen is what it is, I guess? */
            libc::strlen(msg)
        };

        let mut msg_data = Vec::with_capacity(msg_len + 1);

        let msg_cstr: std::ffi::CString = unsafe {
            // SAFETY: copying from a foreign pointer expected to live
            // for the entire stack frame into owned memory
            std::ptr::copy_nonoverlapping(msg, msg_data.as_mut(), msg_len);

            msg_data.set_len(msg_len + 1);

            std::ffi::CString::from_vec_with_nul(msg_data).unwrap()
        }

        let msg_str: String = unsafe {
            match msg_cstr.into_string() {
                Ok(s) => s,
                Err(e) => {
                    crate::log_error("FFI string conversion failed");
                    return;
                }
            }
        };

        crate::log(&msg_str, level);
    }
}

This code in inferior to the original in two respects:

  1. There is much more unsafe code, and more importantly, more invariants it must uphold.
  2. Due to the extensive arithmetic required, there is a bug in this version that cases Rust undefined behaviour.

The bug here is a simple mistake in pointer arithmetic: the string was copied, all msg_len bytes of it. However, the NUL terminator at the end was not.

The Vector then had its size set to the length of the zero padded string -- rather than resized to it, which could have added a zero at the end. As a result, the last byte in the Vector is uninitialized memory. When the CString is created at the bottom of the block, its read of the Vector will cause undefined behaviour!

Like many such issues, this would be difficult issue to track down. Sometimes it would panic because the string was not UTF-8, sometimes it would put a weird character at the end of the string, sometimes it would just completely crash.

Disadvantages

None?