******************
CI-tron Components
******************

The executor service
====================

The main service in a CI-tron gateway is called the executor. It coordinates the other services to enable
time-sharing of test machines, AKA DUTs.

This service can be interacted with using `executorctl`_, our client, and/or our :ref:`REST API <executor_rest>`.

The executor coordinates the different states for the DUTs. Here is a flow of all the states a DUT can be in:

.. _dut_state_machine:

.. mermaid::
    :align: center
    :alt: DUT state machine
    :caption: DUT state machine

    graph TD
      subgraph "DUT state machine"
        START --> |is retired?| RETIRED
        START --> |is marked ready for service?| QUICK_CHECK
        QUICK_CHECK --> |Success| IDLE
        QUICK_CHECK --> |Failed| TRAINING
        TRAINING --> |Failed| TRAINING
        TRAINING --> |Success| IDLE
        RETIRED --> |Activate| QUICK_CHECK
        IDLE --> |Retire| RETIRED
        IDLE --> |Job received| QUEUED
        QUEUED --> RUNNING
        RUNNING --> IDLE
      end

Let's see what every state of a DUT means:

* ``IDLE``: The device is available (but powered down to save energy), waiting for a job.
* ``TRAINING``: The device is being tested for boot reliability (20 rounds by default).
* ``RETIRED``: The device is undergoing maintenance, and cannot accept jobs.
* ``QUICK_CHECK``: The device is verifying that its current configuration matches what is described in the database.
* ``QUEUED``: The device has been chosen to execute a job, but the executor isn't ready just yet (expected to last <1s).
* ``RUNNING``: The device is running a job.

.. _executor_config:

Executor configuration
----------------------

The executor service is configured through environment variables. In a CI-tron gateway, the default configuration
is stored at ``/etc/base_config.env``, while user-provided overrides are usually located in ``/config/config.env``.

Here are the options relevant to most deployments, usually set in ``/config/config.env``:

* ``FARM_NAME``: Name of the test farm.
  Derived from the hostname when it ends with `[...]-gateway` (the farm will be named with the prefix), otherwise mandatory.
  (**sometimes mandatory**, default: None)
* ``EXECUTOR_REGISTRATION_JOB``: Local path to the registration job (default: `$package_dir/job_templates/register.yml.j2`)
* ``EXECUTOR_BOOTLOOP_JOB``: Local path to the bootloop job (default: `$package_dir/job_templates/bootloop.yml.j2`)
* ``SERGENT_HARTMAN_BOOT_COUNT``: How many rounds of testing should be used to qualify a test machine.
  When set to 0, a DUT is considered ready for service as soon as registration occurs.
  A negative value disables registration/training/quick_check altogether (default: `100`)
* ``SERGENT_HARTMAN_QUALIFYING_BOOT_COUNT``: How many successful rounds of testing should be used to qualify a test machine (default: `100`)
* ``SERGENT_HARTMAN_REGISTRATION_RETRIAL_DELAY``: How many seconds to wait after an unsuccessful registration attempt before trying another one (default: `120`)
* ``SERGENT_HARTMAN_QUICK_CHECK``: Whether the DUTs should be booted/checked after a reboot, or when activated after being retired (default: `enabled`)
* ``IMAGESTORE_PATH``: The base path where the executor should store the container images it downloads when asked by `storage:imagestore:`.
  This folder should contain a `config.yml` file that documents how clients are supposed to get access to the image store.
  See :ref:`imagestore_config` for details about its format.
  (default: `$TMP/imagestores/`)
* ``IMAGESTORE_PULL_CMD``: The command to execute to pull the container in the image store
  (default: `podman --root ${imgstore} pull --tls-verify=${tls_verify} --platform=${platform} ${image_name}`)
* ``IMAGESTORE_IMAGE_EXISTS_CMD``: The command to execute to check if an image exists in the store
  (default: `podman --root ${imgstore} image exists ${image_name}`)

The following config options are partially auto-generated, and set via ``/config/minio/minio.env``:

* ``MINIO_URL``: URL to the local minio service, accessible both locally and by test machines (default: `http://ci-gateway:9000`)
* ``MINIO_ROOT_USER``: Admin username for the local minio service (default: `minioadmin`)
* ``MINIO_ROOT_PASSWORD``: Admin password for the local minio service (default: `minio-root-password`)

And here are the lower-level options:

* ``BOOTS_DISABLE_SERVERS``: Set to a non-empty value to disable netbooting services (DHCP and TFTP). (default: None)
* ``BOOTS_DHCP_IPv4_SOCKET_NAME``: Name of the socket to use for the DHCP server, as set by systemd's socket activation unit
  using `FileDescriptorName=` (default: `dhcp_ipv4`)
* ``BOOTS_TFTP_IPv4_SOCKET_NAME``: Name of the socket to use for the TFTP server, as set by systemd's socket activation unit
  using `FileDescriptorName=` (default: `tftp_ipv4`)
* ``BOOTS_DB_USER_FILE``: Path to the file overriding the `default boots_db.yml.j2 file `_ (default: `/config/boots_db.yml.j2`)
* ``CONSOLE_PATTERN_DEFAULT_MACHINE_UNFIT_FOR_SERVICE_REGEX``: Automatically tag a DUT as unfit for service if it
  generates a line matched by this regular expression (default: None)
* ``EXECUTOR_HOST``: Binding address for the HTTP service (default: `0.0.0.0`)
* ``EXECUTOR_PORT``: Binding port for the HTTP service (default: `80`)
* ``EXECUTOR_HTTP_IPv4_SOCKET_NAME``: Name of the socket to use for the HTTP server, as set by systemd's socket activation unit
  using `FileDescriptorName=` (default: `http_ipv4`).
  Overrides ``EXECUTOR_HOST``/``EXECUTOR_PORT``.
* ``EXECUTOR_URL``: HTTP URL of the executor service, reachable locally and from the test machines (default: `http://ci-gateway`)
* ``EXECUTOR_ARTIFACT_CACHE_ROOT``: Folder to use as a cache for the kernel/initrd artifacts used by the jobs
  (**recommended**, default: None)
* ``EXECUTOR_VPDU_ENDPOINT``: Automatically add a virtual PDU for local testing (format: `host:port`, default: None)
* ``GITLAB_CONF_FILE``: Path to the gitlab runner configuration file, which will be overwritten as new test machines
  are added to the farm (default: `/etc/gitlab-runner/config.toml`)
* ``GITLAB_CONF_TEMPLATE_FILE``: Template to use for the creation of the gitlab runner configuration file
  (default: `$package_dir/templates/gitlab_runner_config.toml.j2`)
* ``GITLAB_ALLOW_INSECURE``: Allow ``MARS_DB_FILE`` to reference a GitLab instance using an ``http://`` URL rather
  than ``https://`` (default: `false`)
* ``IMAGESTORE_PATH``: The base path where the executor should store the container images it downloads when asked by `storage:imagestore:`.
  This folder should contain a `config.yml` file that documents how clients are supposed to get access to the image store.
  See :ref:`imagestore_config` for details about its format.
  (default: `$TMP/imagestores/`)
* ``IMAGESTORE_PULL_CMD``: The command to execute to pull the container in the image store
  (default: `podman --root ${imgstore} pull --tls-verify=${tls_verify} --platform=${platform} ${image_name}`)
* ``IMAGESTORE_IMAGE_EXISTS_CMD``: The command to execute to check if an image exists in the store
  (default: `podman --root ${imgstore} image exists ${image_name}`)
* ``MARS_DB_FILE``: Path to the database (default: `/config/mars_db.yaml`)
* ``MINIO_ADMIN_ALIAS``: Alias set up by the executor to refer to the minio instance specified by
  ``MINIO_URL``, ``MINIO_ROOT_USER``, and ``MINIO_ROOT_PASSWORD`` (default: `local`)
* ``PRIVATE_INTERFACE``: Network interface connected to the DUTs' network (default: `private`)
* ``SALAD_URL``: URL to the salad service (default: `http://ci-gateway:8005`)

.. _imagestore_config:

Imagestore config.yml's format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

    mount:  # List of mount points the DUTs can use to mount the image store
      - type: nfs  # The name of the filesystem
        src: "ci-gateway:/imagestores"  # The source of the filesystem
        opts:  # The list of mounting options that should be set
          - vers=4.2
          - ro
          - addr=10.42.0.1
      - ...  # Add more mounting methods here

.. _registry_config:

Registry config.yml's format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This configuration file allows job descriptions to redirect accesses to container registries to a local proxy,
thus saving bandwidth and reducing execution time. Job descriptions do so by using the following function:
``{{ registry.to_local_proxy("$IMAGE_NAME") }}``.

.. code-block:: yaml

    images:
      - match: ^quay.io  # Regular expression matching the part of the image name to replace
        replace: ci-gateway:8100  # String to replace the match with
      - ...  # Add more replacement rules here

.. note::

    This file is meant to be auto-generated by the registryd service, based on the content of the registry
    description file found in ``/config/registries/``.
    Check out ``/config/registries/8100_quay.yml.j2.example`` for more details.

.. _pdu_module:

PDU module
----------

.. literalinclude:: ../../executor/server/src/valve_gfx_ci/executor/server/pdu/README.md
    :language: Markdown

.. _MarsDB:

MarsDB
------

MarsDB is the database for all the runtime data of the CI instance:

- List of PDUs connected
- List of test machines
- List of GitLab instances on which to expose the test machines

Its location is set using the ``MARS_DB_FILE`` environment variable, and is live-editable.
This means you can edit the file directly and changes will be reflected instantly in the executor.

Machines can be added to MarsDB by POSTing or PUTing to the ``/api/v1/dut/`` REST endpoint.
Fields in the REST API match the ones found in the database, but some fields cannot be set
at the creation of the machine for safety reasons: we want to enforce a separation
between fields that are meant to be auto-generated and the ones that are meant to be
manually-configured (denoted by the ``(MANUAL)`` tag in the DB file description below).
The most prominent manual fields are ``pdu`` and ``pdu_port``, which means a newly-added
machine won't be usable until it is associated to its PDU port by manually editing the DB file.

An easier solution to enroll a new machine is to use the discovery process, by POSTing
the ``pdu`` and ``pdu_port_id`` fields to the ``/api/v1/dut/discover`` endpoint.
This will initiate the discovery sequence where the executor will turn this port ON,
wait for the machine to register itself, then automatically associate the machine
with the PDU port specified in the discovery process. Using the discovery process allows
a machine to go through the ``TRAINING`` process without further manual intervention.
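As an illustration of the discovery request described above, here is a minimal Python sketch that builds
(without sending) the ``POST`` request. It is not part of CI-tron: the ``build_discovery_request`` helper and
the ``EXECUTOR_API`` constant are hypothetical, the address is assumed to be the one used by the ``curl``
examples of the REST API section, and the payload uses the ``pdu``/``port_id``/``timeout`` field names found
in those ``curl`` examples.

.. code-block:: python

    import json
    import urllib.request

    # Assumption: the executor listens at the address used in the curl examples
    EXECUTOR_API = "http://localhost:8000/api/v1"


    def build_discovery_request(pdu, port_id, timeout=None):
        """Build, but do not send, the POST request that starts a discovery."""
        # Field names mirror the curl example of the /dut/discover endpoint
        payload = {"pdu": pdu, "port_id": port_id}
        if timeout is not None:
            payload["timeout"] = timeout

        return urllib.request.Request(
            f"{EXECUTOR_API}/dut/discover",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    # To actually start the discovery on a live gateway, one could then call:
    #   urllib.request.urlopen(build_discovery_request("VPDU", 10, timeout=60))

Keeping the request construction separate from the network call makes it possible to check the payload
before pointing it at a real gateway.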
Here is an annotated sample file, where ``AUTO`` means you should not modify this
value (or any of its children), ``MANUAL`` means that you are expected to set the value
by editing the DB file manually, and ``REST`` means that the value can be set through the REST interface.
All the other values should be machine-generated, for example using the ``machine-registration`` container:

.. code-block:: yaml

    pdus:  # List of all the power delivery units (MANUAL)
      APC:  # Name of the PDU
        driver: apc_masterswitch  # The [driver of your PDU](pdu/README.md)
        config:  # The configuration of the driver (driver-dependent)
          hostname: 10.0.0.2
      VPDU:  # A virtual PDU, spawning virtual machines
        driver: vpdu
        config:
          hostname: localhost:9191
        reserved_port_ids: []  # List of reserved ports in the PDU where no virtual DUT can be added (REST)
    duts:  # List of all the test machines
      de:ad:be:ef:ca:fe:  # MAC address of the machine
        base_name: gfx9  # Most significant characteristic of the machine. Basis of the auto-generated name
        ip_address: 192.168.0.42  # IP address of the machine
        tags:  # List of tags representing the machine
          - amdgpu:architecture:GCN5.1
          - amdgpu:family:RV
          - amdgpu:codename:RENOIR
          - amdgpu:gfxversion:gfx9
          - amdgpu:APU
          - amdgpu:pciid:0x1002:0x1636
        manual_tags:  # List of tags that cannot be automatically generated (MANUAL)
          - freesync_display
        local_tty_device: ttyUSB0  # Test machine's serial port to talk to the gateway
        gitlab:  # List of GitLab instances to expose this runner on
          freedesktop:  # Parameters for the `freedesktop` GitLab instance
            token:  # Token given by the registration process (AUTO)
            exposed: true  # Should this machine be exposed on `freedesktop`?
            # (MANUAL)
            runner_id: 4242  # GitLab's runner ID associated to this machine
            acl:  # Access control list for this DUT on that GitLab instance (see explanations in the sub-section below)
              deny:
                users:  # Username of the user creating the job
                  - bad_user
              allow:
                projects:  # Full path (relative to the instance) of the project that created the job
                  - gfx-ci/ci-tron
                projects_in_groups:  # Matches if the project is in one of these groups
                  - gfx-ci
        pdu: APC  # Name of the PDU to contact to turn ON/OFF this machine (MANUAL/REST)
        pdu_port_id: 1  # ID of the port where the machine is connected (MANUAL/REST)
        pdu_off_delay: 30  # How long should the PDU port be off when rebooting the machine? (REST)
        ready_for_service: true  # The machine has been tested and can now be used by users (AUTO/REST)
        is_retired: false  # The user specified that the machine is no longer in use
        first_seen: 2021-12-22 16:57:08.146275  # When was the machine first seen in CI (AUTO)
        comment: null  # Field used to add a quick note about a DUT for admins (MANUAL/REST)
    gitlab:  # Configuration of anything related to exposing the machines on GitLab (MANUAL)
      freedesktop:  # Name of the gitlab instance
        url: https://gitlab.freedesktop.org/  # URL of the instance
        registration_token:  # API token with the `create_runner` scope. For instance runners, you also need `admin_mode`. For project or group tokens, `role` must be `Maintainer` or `Owner`.
        runner_type: (instance|group|project)_type  # Where you want to register your runner
        group_id:  # ID of the group you want to register the runners in. Only needed for group_type runners
        project_id:  # ID of the project you want to register the runners in. Only needed for project_type runners
        access_token:  # A `read_api` or a `manage_runner` token, used to verify consistency between the local and gitlab state. For project or group tokens, `role` must be `Maintainer` or `Owner`.
        expose_runners: true  # Expose the test machines on this instance?
        # Handy for quickly disabling all machines
        maximum_timeout: 21600  # Maximum timeout allowed for any job running on our test machines
        acl:  # Access control list for the DUT & gateway runners on this GitLab instance, when no ACL rule on the DUT/gateway runner matches (see explanations in the sub-section below)
          ...
        gateway_runner:  # Expose a runner that will run locally, and not on test machines
          token:  # Token given by the registration process (AUTO)
          exposed: true  # Should the gateway runner be exposed?
          runner_id: 4243  # GitLab's runner ID associated to this machine
          acl:  # Access control list for the gateway runner (see explanations in the sub-section below)
            allow:  # At least one `allow` item must be defined to allow the gateway runner to be exposed
              users:
                - eric

Access Control Lists (ACL)
^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``acl:`` items in the example above can be a bit complex, so let's look at them in detail:

- They are split in 2 levels: DUT/gateway ACL, and instance ACL. The former is more specific and is evaluated first,
  and the latter is only evaluated if no rule matched.
- The ``deny`` list is evaluated before ``allow``, so if a job would match both, it is rejected.
- If no rule matches, the default decision depends on whether any rule was set at either level:

  - If there was no ACL rule set anywhere, you don't want any restriction, so the default decision is to allow.
  - If an ACL rule was set, you care about access control, so the default decision is to deny.

.. warning::

    Gateway runners can't be exposed unless they define an ACL with at least one ``allow`` -- typically, the farm admin(s).

.. danger::

    Keep in mind that gateway runners are ``--privileged``, so don't give access to them to people you don't trust!
    See the `podman documentation `_ for more information.

Frequently asked questions
^^^^^^^^^^^^^^^^^^^^^^^^^^

* How do I move runners from one GitLab project to another?

  There is currently no easy way of doing so.
  The best solution is to call the following command line for every runner in MarsDB:

  .. code-block:: bash

      $ curl -X DELETE "https://gitlab.example.com/api/v4/runners" --form "token="

  The executor will periodically check the validity of the tokens, and upon seeing they got deleted, it will
  re-create them in the new project.

.. _executor_rest:

REST API
--------

The executor includes a REST API with various endpoints available.

Endpoint ``/duts``
^^^^^^^^^^^^^^^^^^

Method: GET

Lists the available machines and their information (IP address, tags, ...)

.. code-block:: bash

    curl localhost:8000/api/v1/duts

Endpoint ``/dut/``
^^^^^^^^^^^^^^^^^^

Method: POST, PUT

Adds a new machine to ``MARS_DB_FILE``. If a discovery process is on-going,
its data is used to set the PDU and port_id.
This endpoint is used by the ``machine_registration.py`` script.

Endpoint ``/dut/<machine_id>``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Method: GET

Lists all the information of a selected machine. machine_id is the MAC address.

.. code-block:: bash

    curl localhost:8000/api/v1/dut/<machine_id>
    curl localhost:8000/api/v1/dut/52:54:00:11:22:0a

Method: DELETE

Removes the machine from the database, along with all its associated GitLab runner tokens.

.. code-block:: bash

    curl -X DELETE localhost:8000/api/v1/dut/<machine_id>

.. _patch_dut:

Method: PATCH

Updates one or more of the DUT's editable fields:

* ``comment`` (str): Specify a comment about the DUT meant for the farm admins
* ``firmware_boot_time`` (float): Number of seconds needed by the DUT's firmware to request the boot parameters after powering up
* ``is_retired`` (bool): Tag the DUT as retired/active (see our :ref:`DUT state machine <dut_state_machine>`)
* ``manual_tags`` (list[str]): Overwrite the manual tags
* ``pdu_off_delay`` (float): Number of seconds needed to ensure the machine is fully off
* ``ready_for_service`` (bool): Tag the DUT as ready for service (see our :ref:`DUT state machine <dut_state_machine>`)

.. code-block:: bash

    curl -X PATCH localhost:8000/api/v1/dut/52:54:00:11:22:0a \
      -H 'Content-Type: application/json' \
      -d '{"pdu_off_delay": 10, "comment": "this is an example comment"}'

Endpoint ``/duts/<machine_id>/boot.ipxe``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Method: GET

**TODO:** To be documented.

Endpoint ``/dut/<machine_id>/quick_check``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Method: GET

Returns ``true`` if a quick check of the machine has been queued, ``false`` otherwise.

.. code-block:: bash

    curl localhost:8000/api/v1/dut/<machine_id>/quick_check

Method: POST

Queues a quick check on the machine. No parameters are needed.

.. code-block:: bash

    curl -X POST localhost:8000/api/v1/dut/<machine_id>/quick_check

Endpoint ``/dut/discover``
^^^^^^^^^^^^^^^^^^^^^^^^^^

Method: GET

Shows whether a discovery process is on-going, along with its data:
pdu, port_id and start date.

.. code-block:: bash

    curl localhost:8000/api/v1/dut/discover

Method: POST

Launches a discovery process: the executor boots the machine behind the
given PDU/port_id, and stores this data in ``discover_data`` to be used
by the ``machine_registration.py`` script.

.. code-block:: bash

    curl -X POST localhost:8000/api/v1/dut/discover \
      -H 'Content-Type: application/json' \
      -d '{"pdu": "VPDU", "port_id": 10}'

If no machines show up, the discovery process will automatically time out
after 150 seconds by default. This value can be changed using the
``timeout`` parameter:

.. code-block:: bash

    curl -X POST localhost:8000/api/v1/dut/discover \
      -H 'Content-Type: application/json' \
      -d '{"pdu": "VPDU", "port_id": 10, "timeout": 60}'

Method: DELETE

Erases all the discovery data: ``discover_data`` will be emptied.

.. code-block:: bash

    curl -X DELETE localhost:8000/api/v1/dut/discover

Endpoint ``/dut/<machine_id>/cancel_job``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Method: POST

Cancels the jobs running on a machine. machine_id is the MAC address.

.. code-block:: bash

    curl -X POST localhost:8000/api/v1/dut/<machine_id>/cancel_job
    curl -X POST localhost:8000/api/v1/dut/52:54:00:11:22:0a/cancel_job

Endpoint ``/pdus``
^^^^^^^^^^^^^^^^^^

Method: GET

Lists the available PDUs and their port_ids, with some
information such as label or state.

.. code-block:: bash

    curl localhost:8000/api/v1/pdus

Endpoint ``/pdu/<pdu_name>``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Method: GET

Lists all the information of a selected PDU.

.. code-block:: bash

    curl localhost:8000/api/v1/pdu/<pdu_name>
    curl localhost:8000/api/v1/pdu/VPDU

Endpoint ``/pdu/<pdu_name>/port/<port_id>``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Method: GET

Lists the information of a port_id: label, min_off_time and state.

.. code-block:: bash

    curl localhost:8000/api/v1/pdu/<pdu_name>/port/<port_id>
    curl localhost:8000/api/v1/pdu/VPDU/port/10

Method: PATCH

Turns a port OFF or ON.

.. code-block:: bash

    curl -X PATCH localhost:8000/api/v1/pdu/VPDU/port/10 \
      -H 'Content-Type: application/json' \
      -d '{"state": "on"}'

Reserves or un-reserves a port. Use ``true`` to reserve, ``false`` to un-reserve.

.. code-block:: bash

    curl -X PATCH localhost:8000/api/v1/pdu/VPDU/port/10 \
      -H 'Content-Type: application/json' \
      -d '{"reserved": true}'

Endpoint ``/full-state``
^^^^^^^^^^^^^^^^^^^^^^^^

Method: GET

Provides all the information from the endpoints ``/pdus``, ``/duts``, and ``/dut/discover`` in a single call.

Endpoint ``/jobs``
^^^^^^^^^^^^^^^^^^

Method: POST

Used to submit jobs. To be documented.

.. _executorctl:

Executor client - executorctl
=============================

The executor client ``executorctl`` can be used to list the DUTs exposed through this CI farm, and run jobs on them.

The executor client can be found in git under `executor/client `_ and installed with ``pip``.

.. code-block:: bash

    $ executorctl run -t $machine_tag $/path/to/job/file

Examples of jobs that can be run under vivian can be found in `job_templates`_.

.. _job_templates: https://gitlab.freedesktop.org/gfx-ci/ci-tron/-/tree/main/executor/server/src/valve_gfx_ci/executor/server/job_templates

.. _SALAD:

SALAD
=====

.. literalinclude:: ../../salad/README.md
    :language: Markdown

.. _GFXInfo:

GFX Info
========

.. literalinclude:: ../../gfxinfo/README.md
    :language: Markdown

.. _Machine registration:

Machine Registration Container
==============================

The machine registration container is responsible for the following functions:

* Registering new test machines:

  * Creating tags for the test machine without manual intervention, with the help of :ref:`GFXInfo`;
  * Finding out which TTY device is connected to the gateway's :ref:`SALAD` service;

* Verifying that the state of the test machine matches the state found in :ref:`MarsDB`, including verifying that the
  serial console is talking to the gateway;
* Verifying that the test machine has the wanted list of tags, and hiding the hardware that is not needed for testing
  (useful for multi-GPU setups).

Find out more about it in the :download:`source code <../../machine_registration/machine_registration.py>`.