scrcpy/doc/develop.md
Romain Vimont b9d244b4c9 Document UHID
Rework the documentation to present the keyboard and mouse input modes.

PR #4473 <https://github.com/Genymobile/scrcpy/pull/4473>
2024-03-01 00:52:28 +01:00

scrcpy for developers

Overview

This application is composed of two parts:

  • the server (scrcpy-server), to be executed on the device,
  • the client (the scrcpy binary), executed on the host computer.

The client is responsible for pushing the server to the device and starting its execution.

The client and the server establish communication using separate sockets for video, audio and controls. Any of them may be disabled (but not all), so there are 1, 2 or 3 socket(s).

The server initially sends the device name on the first socket (it is used for the scrcpy window title), then each socket is used for its own purpose. All reads and writes are performed from a dedicated thread for each socket, both on the client and on the server.

If video is enabled, then the server sends a raw video stream (H.264 by default) of the device screen, with some additional headers for each packet. The client decodes the video frames, and displays them as soon as possible, without buffering (unless --display-buffer=delay is specified) to minimize latency. The client is not aware of the device rotation (which is handled by the server); it just knows the dimensions of the video frames it receives.

Similarly, if audio is enabled, then the server sends a raw audio stream (OPUS by default) of the device audio output (or the microphone if --audio-source=mic is specified), with some additional headers for each packet. The client decodes the stream and attempts to keep latency minimal by maintaining an average buffering level. The blog post of the scrcpy v2.0 release gives more details about the audio feature.

If control is enabled, then the client captures relevant keyboard and mouse events and transmits them to the server, which injects them into the device. This is the only socket which is used in both directions: input events are sent from the client to the device, and when the device clipboard changes, the new content is sent from the device to the client to support seamless copy-paste.

Note that the client-server roles are expressed at the application level:

  • the server serves video and audio streams, and handles requests from the client,
  • the client controls the device through the server.

However, by default (when --force-adb-forward is not set), the roles are reversed at the network level:

  • the client opens a server socket and listens on a port before starting the server,
  • the server connects to the client.

This role inversion guarantees that the connection will not fail due to race conditions without polling.

Server

Privileges

Capturing the screen requires some privileges, which are granted to shell.

The server is a Java application (with a public static void main(String... args) method), compiled against the Android framework, and executed as shell on the Android device.

To run such a Java application, the classes must be dexed (typically, to classes.dex). If my.package.MainClass is the main class, compiled to classes.dex, pushed to the device in /data/local/tmp, then it can be run with:

adb shell CLASSPATH=/data/local/tmp/classes.dex app_process / my.package.MainClass

The path /data/local/tmp is a good candidate to push the server, since it's readable and writable by shell, but not world-writable, so a malicious application may not replace the server just before the client executes it.

Instead of a raw dex file, app_process accepts a jar containing classes.dex (e.g. an APK). For simplicity, and to benefit from the gradle build system, the server is built to an (unsigned) APK (renamed to scrcpy-server.jar).

Hidden methods

Although compiled against the Android framework, hidden methods and classes are not directly accessible (and they may differ from one Android version to another).

They can be called using reflection though. The communication with hidden components is provided by wrapper classes and aidl.

Execution

The server is started by the client basically by executing the following commands:

adb push scrcpy-server /data/local/tmp/scrcpy-server.jar
adb forward tcp:27183 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/scrcpy-server.jar app_process / com.genymobile.scrcpy.Server 2.1

The first argument (2.1 in the example) is the client scrcpy version. The server fails if the client and the server do not have the exact same version. The protocol between the client and the server may change from version to version (see protocol below), and there is no backward or forward compatibility (there is no point in using different client and server versions). This check makes it possible to detect misconfiguration (running an older or newer server by mistake).

It is followed by any number of arguments, in the form of key=value pairs. Their order is irrelevant. The possible keys and associated value types can be found in the server and client code.

For example, if we execute scrcpy -m1920 --no-audio, then the server execution will look like this:

# scid is a random number to identify different clients running on the same device
adb shell CLASSPATH=/data/local/tmp/scrcpy-server.jar app_process / com.genymobile.scrcpy.Server 2.1 scid=12345678 log_level=info audio=false max_size=1920
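The way the command line above is assembled can be sketched in a few lines of Python. This is only an illustration of the "version first, then unordered key=value pairs" layout described in this section; the helper name build_server_command and its parameters are hypothetical and do not exist in scrcpy.

```python
def build_server_command(version, options,
                         jar_path="/data/local/tmp/scrcpy-server.jar"):
    """Return the adb argument list that starts the server (illustrative helper)."""
    args = [
        "adb", "shell",
        f"CLASSPATH={jar_path}",
        "app_process", "/", "com.genymobile.scrcpy.Server",
        version,  # first argument: must match the server version exactly
    ]
    # The remaining arguments are key=value pairs; their order is irrelevant.
    args += [f"{key}={value}" for key, value in options.items()]
    return args


cmd = build_server_command("2.1", {
    "scid": "12345678",
    "log_level": "info",
    "audio": "false",
    "max_size": "1920",
})
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run() on a machine where adb is available and a device is connected.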

Components

When the server is started, its main() method is executed (on the "main" thread). It parses the arguments, establishes the connection with the client and starts the other "components":

  • the video streamer: it captures the device screen and sends encoded video packets on the video socket (from the video thread).
  • the audio streamer: it uses several threads to capture raw packets, submit them to encoding and retrieve encoded packets, which it sends on the audio socket.
  • the controller: it receives control messages (typically input events) on the control socket from one thread, and sends device messages (e.g. to transmit the device clipboard content to the client) on the same control socket from another thread. Thus, the control socket is used in both directions (contrary to the video and audio sockets).

Screen video encoding

The encoding is managed by ScreenEncoder.

The video is encoded using the MediaCodec API. The codec encodes the content of a Surface associated with the display, and writes the encoded packets to the client (on the video socket).

On device rotation (or folding), the encoding session is reset and restarted.

New frames are produced only when changes occur on the surface. This avoids sending unnecessary frames, but by default there might be drawbacks:

  • it does not send any frame on start if the device screen does not change,
  • after fast motion changes, the last frame may have poor quality.

Both problems are solved by the flag KEY_REPEAT_PREVIOUS_FRAME_AFTER.

Audio encoding

Similarly, the audio is captured using an AudioRecord, and encoded using the MediaCodec asynchronous API.

More details are available on the blog post introducing the audio feature.

Input events injection

Control messages are received from the client by the Controller (run in a separate thread). There are several types of input events:

  • keycode (cf KeyEvent),
  • text (special characters may not be handled by keycodes directly),
  • mouse motion/click,
  • mouse scroll,
  • other commands (e.g. to switch the screen on or to copy the clipboard).

Some of them need to inject input events to the system. To do so, they use the hidden method InputManager.injectInputEvent() (exposed by the InputManager wrapper).

Client

The client relies on SDL, which provides a cross-platform API for UI, input events, threading, etc.

The video and audio streams are decoded by FFmpeg.

Initialization

The client parses the command line arguments, then runs one of two code paths:

  • scrcpy in "normal" mode,
  • scrcpy in OTG mode.

In the remainder of this document, we assume that the "normal" mode is used (read the code for the OTG mode).

On startup, the client:

  • opens the video, audio and control sockets;
  • pushes and starts the server on the device;
  • initializes its components (demuxers, decoders, recorder…).

Video and audio streams

Depending on the arguments passed to scrcpy, several components may be used. Here is an overview of the video and audio components:

                                                 V4L2 sink
                                               /
                                       decoder
                                     /         \
        VIDEO -------------> demuxer             display
                                     \
                                       recorder
                                     /
        AUDIO -------------> demuxer
                                     \
                                       decoder --- audio player

The demuxer is responsible for extracting video and audio packets (reading some header, splitting the video stream into packets at correct boundaries, etc.).

The demuxed packets may be sent to a decoder (one per stream, to produce frames) and to a recorder (receiving both video and audio streams to record a single file). The packets are encoded on the device (by MediaCodec), but when recording, they are muxed (asynchronously) into a container (MKV or MP4) on the client side.
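To make the demuxing step more concrete, here is a minimal Python sketch of a loop that splits a byte stream into packets. It relies on the 12-byte frame header documented in the Protocol section (8 bytes of PTS and flags, then a 4-byte packet size) and assumes big-endian byte order, which the text does not state explicitly. The function name iter_packets is hypothetical; the real demuxer is written in C.

```python
import io
import struct


def iter_packets(stream):
    """Yield (pts_and_flags, payload) tuples read from a binary stream."""
    while True:
        header = stream.read(12)
        if not header:
            return  # clean end of stream
        if len(header) != 12:
            raise EOFError("truncated frame header")
        # Assumption: fields are big-endian (u64 PTS+flags, u32 size).
        pts_and_flags, size = struct.unpack(">QI", header)
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated packet")
        yield pts_and_flags, payload


# Two fake packets, built only to exercise the loop.
data = struct.pack(">QI", 1, 3) + b"abc" + struct.pack(">QI", 2, 2) + b"de"
packets = list(iter_packets(io.BytesIO(data)))
print(packets)
```

With a real connection, a buffered file object (for example one obtained from socket.makefile("rb")) can play the role of the stream, since its read(n) blocks until n bytes are available or the stream ends.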

Video frames are sent to the screen/display to be rendered in the scrcpy window. They may also be sent to a V4L2 sink.

Audio "frames" (an array of decoded samples) are sent to the audio player.

Controller

The controller is responsible for sending control messages to the device. It runs in a separate thread, to avoid I/O on the main thread.

On an SDL event, received on the main thread, the input manager creates appropriate control messages. It is responsible for converting SDL events to Android events. It then pushes the control messages to a queue held by the controller. On its own thread, the controller takes messages from the queue, serializes them and sends them to the server.
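The queue-plus-thread design can be sketched as follows. This is a simplified Python model, not the actual C implementation: the class and method names are hypothetical, messages are assumed to be already serialized to bytes, and the send callable stands in for a write on the control socket.

```python
import queue
import threading


class Controller:
    """Sends queued messages from a dedicated thread (illustrative model)."""

    def __init__(self, send):
        self._send = send            # e.g. the sendall method of a socket
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def push(self, message):
        # Called from the main thread: never blocks on I/O.
        self._queue.put(message)

    def stop(self):
        self._queue.put(None)        # sentinel: ask the thread to exit
        self._thread.join()

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:
                break
            self._send(message)      # I/O happens only on this thread


sent = []
controller = Controller(sent.append)
controller.start()
controller.push(b"first")
controller.push(b"second")
controller.stop()
print(sent)
```

The point of the design is that the thread handling UI events only enqueues messages, so a slow or blocked socket cannot freeze the window.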

Protocol

The protocol between the client and the server must be considered internal: it may (and will) change at any time for any reason. Everything may change (the number of sockets, the order in which the sockets must be opened, the data format on the wire…) from version to version. A client must always be run with a matching server version.

This section documents the current protocol in scrcpy v2.1.

Connection

Firstly, the client sets up an adb tunnel:

# By default, a reverse redirection: the computer listens, the device connects
adb reverse localabstract:scrcpy_<SCID> tcp:27183

# As a fallback (or if --force-adb-forward is set), a forward redirection:
# the device listens, the computer connects
adb forward tcp:27183 localabstract:scrcpy_<SCID>

(<SCID> is a 31-bit random number, so that it does not fail when several scrcpy instances start "at the same time" for the same device.)
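As an illustration, a 31-bit identifier and the matching socket name could be produced as below. Both function names are hypothetical, and formatting the identifier as 8 hexadecimal digits is an assumption: the text only specifies the scrcpy_<SCID> pattern.

```python
import random


def new_scid():
    """Return a random 31-bit identifier (fits in a signed 32-bit integer)."""
    return random.getrandbits(31)


def socket_name(scid):
    # Assumption: the identifier is rendered as 8 hexadecimal digits.
    return f"scrcpy_{scid:08x}"


scid = new_scid()
print(socket_name(scid))
```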

Then, up to 3 sockets are opened, in that order:

  • a video socket
  • an audio socket
  • a control socket

Each one may be disabled (respectively by --no-video, --no-audio and --no-control, directly or indirectly). For example, if --no-audio is set, then the video socket is opened first, then the control socket.
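The ordering rule can be expressed as a small function. This is a sketch with a hypothetical name (sockets_to_open); it only encodes what is stated in this document: a fixed video, audio, control order, with disabled sockets skipped and at least one socket required.

```python
def sockets_to_open(video=True, audio=True, control=True):
    """Return the names of the sockets to open, in connection order."""
    order = [("video", video), ("audio", audio), ("control", control)]
    enabled = [name for name, is_enabled in order if is_enabled]
    if not enabled:
        # Any socket may be disabled, but not all of them.
        raise ValueError("video, audio and control cannot all be disabled")
    return enabled


print(sockets_to_open(audio=False))  # ['video', 'control']
```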

On the first socket opened (whichever it is), if the tunnel is forward, then a dummy byte is sent from the device to the client. This makes it possible to detect a connection error (the client connection does not fail as long as there is an adb forward redirection, even if nothing is listening on the device side).
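The check enabled by the dummy byte can be sketched like this: after connecting through the forward tunnel, a client tries to read exactly one byte and treats a closed connection or a timeout as a failure. The function name and the timeout value are hypothetical; a local socket pair is used here in place of a real adb tunnel.

```python
import socket


def received_dummy_byte(sock, timeout=1.0):
    """Return True if one byte could be read from the socket."""
    sock.settimeout(timeout)
    try:
        return len(sock.recv(1)) == 1
    except (socket.timeout, ConnectionError):
        return False


# Simulate a listening device: the peer sends one byte.
client, device = socket.socketpair()
device.sendall(b"\x00")
alive = received_dummy_byte(client)

# Simulate nothing listening: the peer closes without sending anything.
client2, device2 = socket.socketpair()
device2.close()
dead = received_dummy_byte(client2)

print(alive, dead)  # True False
```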

Still on this first socket, the device sends some metadata to the client (currently only the device name, used as the window title, but there might be other fields in the future).

You can read the client and server code for more details.

Then each socket is used for its intended purpose.

Video and audio

On the video and audio sockets, the device first sends some codec metadata:

  • On the video socket, 12 bytes:
    • the codec id (u32) (H264, H265 or AV1)
    • the initial video width (u32)
    • the initial video height (u32)
  • On the audio socket, 4 bytes:
    • the codec id (u32) (OPUS, AAC or RAW)
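Reading this metadata amounts to unpacking fixed-size integer fields. The sketch below assumes big-endian byte order, which is not stated in the text; the function names are hypothetical and the codec id used in the example is an arbitrary placeholder, not a real scrcpy value.

```python
import struct


def parse_video_codec_meta(data):
    """Unpack the 12-byte video metadata into (codec_id, width, height)."""
    if len(data) != 12:
        raise ValueError("video codec metadata must be 12 bytes")
    # Assumption: three big-endian u32 fields.
    return struct.unpack(">III", data)


def parse_audio_codec_meta(data):
    """Unpack the 4-byte audio metadata into a codec id."""
    if len(data) != 4:
        raise ValueError("audio codec metadata must be 4 bytes")
    (codec_id,) = struct.unpack(">I", data)
    return codec_id


# Placeholder codec id, for the round trip only.
video_meta = parse_video_codec_meta(struct.pack(">III", 0x01020304, 1920, 1080))
audio_meta = parse_audio_codec_meta(struct.pack(">I", 0x0A0B0C0D))
print(video_meta, audio_meta)
```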

Then each packet produced by MediaCodec is sent, prefixed by a 12-byte frame header:

  • config packet flag (u1)
  • key frame flag (u1)
  • PTS (u62)
  • packet size (u32)

Here is a schema describing the frame header:

    [. . . . . . . .|. . . .]. . . . . . . . . . . . . . . ...
     <-------------> <-----> <-----------------------------...
           PTS        packet        raw packet
                       size
     <--------------------->
           frame header

The most significant bits of the PTS are used for packet flags:

     byte 7   byte 6   byte 5   byte 4   byte 3   byte 2   byte 1   byte 0
    CK...... ........ ........ ........ ........ ........ ........ ........
    ^^<------------------------------------------------------------------->
    ||                                PTS
    | `- key frame
     `-- config packet
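The layout above can be decoded with a couple of bit masks. The following Python sketch assumes the header is transmitted in big-endian order (consistent with byte 7 being the most significant byte in the diagram, but not stated explicitly); the names parse_frame_header, FLAG_CONFIG, FLAG_KEY_FRAME and PTS_MASK are hypothetical.

```python
import struct

FLAG_CONFIG = 1 << 63          # most significant bit: config packet
FLAG_KEY_FRAME = 1 << 62       # next bit: key frame
PTS_MASK = FLAG_KEY_FRAME - 1  # remaining 62 bits: PTS


def parse_frame_header(header):
    """Decode a 12-byte frame header into its four fields."""
    if len(header) != 12:
        raise ValueError("frame header must be 12 bytes")
    # Assumption: big-endian u64 (flags + PTS) followed by u32 (packet size).
    pts_and_flags, size = struct.unpack(">QI", header)
    return {
        "config": bool(pts_and_flags & FLAG_CONFIG),
        "key_frame": bool(pts_and_flags & FLAG_KEY_FRAME),
        "pts": pts_and_flags & PTS_MASK,
        "size": size,
    }


# A key frame with PTS 1000 and a 5-byte payload, built for illustration.
info = parse_frame_header(struct.pack(">QI", FLAG_KEY_FRAME | 1000, 5))
print(info)
```

After reading the header, a client would read exactly "size" bytes to obtain the raw packet.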

Controls

Control messages are sent via a custom binary protocol.

The only documentation for this protocol is the set of unit tests on both sides.

Standalone server

Although the server is designed to work for the scrcpy client, it can be used with any client which uses the same protocol.

For simplicity, some server-specific options have been added to produce raw streams easily:

  • send_device_meta=false: disable the device metadata (in practice, the device name) sent on the first socket
  • send_frame_meta=false: disable the 12-byte header for each packet
  • send_dummy_byte=false: disable the dummy byte sent on forward connections
  • send_codec_meta=false: disable the codec information (and initial device size for video)
  • raw_stream=true: disable all the above

Concretely, here is how to expose a raw H.264 stream on a TCP socket:

adb push scrcpy-server-v2.1 /data/local/tmp/scrcpy-server-manual.jar
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/scrcpy-server-manual.jar \
    app_process / com.genymobile.scrcpy.Server 2.1 \
    tunnel_forward=true audio=false control=false cleanup=false \
    raw_stream=true max_size=1920

As soon as a client connects over TCP on port 1234, the device will start streaming the video. For example, VLC can play the video (although you will experience a very high latency):

vlc -Idummy --demux=h264 --network-caching=0 tcp://localhost:1234
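Any program able to open a TCP connection can consume the raw stream. As a minimal sketch, the Python function below copies a bounded number of bytes from a connected socket to a file object; the function name save_raw_stream and the output file name are hypothetical, and the example at the bottom uses a local socket pair instead of a real device.

```python
import io
import socket


def save_raw_stream(sock, out, max_bytes):
    """Copy up to max_bytes from a connected socket to a binary file object."""
    written = 0
    while written < max_bytes:
        chunk = sock.recv(min(65536, max_bytes - written))
        if not chunk:
            break  # the peer closed the connection
        out.write(chunk)
        written += len(chunk)
    return written


# Local demonstration with fake data (no device involved).
receiver, sender = socket.socketpair()
sender.sendall(b"fake-h264-bytes")
sender.close()
buffer = io.BytesIO()
count = save_raw_stream(receiver, buffer, 1024)
print(count)

# With the server started as above, the same function could be used as:
# with socket.create_connection(("localhost", 1234)) as sock, \
#         open("capture.h264", "wb") as f:
#     save_raw_stream(sock, f, 10 * 1024 * 1024)
```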

Hack

For more details, go read the code!

If you find a bug, or have an awesome idea to implement, please discuss and contribute ;-)

Debug the server

The server is pushed to the device by the client on startup.

To debug it, enable the server debugger during configuration:

meson setup x -Dserver_debugger=true
# or, if x is already configured
meson configure x -Dserver_debugger=true

If your device runs Android 8 or below, set the server_debugger_method to old in addition:

meson setup x -Dserver_debugger=true -Dserver_debugger_method=old
# or, if x is already configured
meson configure x -Dserver_debugger=true -Dserver_debugger_method=old

Then recompile.

When you start scrcpy, it will start a debugger on port 5005 on the device. Redirect that port to the computer:

adb forward tcp:5005 tcp:5005

In Android Studio, Run > Debug > Edit configurations... On the left, click on +, Remote, and fill the form:

  • Host: localhost
  • Port: 5005

Then click on Debug.