OpenTelemetry
=============

Generated apps can emit OpenTelemetry traces, metrics, and (optionally)
logs.  Telemetry is fully **opt-in**: when the project config does not
set ``telemetry``, the generated tree contains zero references to OTel
and the runtime cost is exactly zero.

Enabling
--------

Add a ``telemetry`` block to your project config:

.. code-block:: jsonnet

   local telemetry = import 'be/telemetry/telemetry.libsonnet';

   {
     databases: [...],
     apps: [...],
     telemetry: telemetry.otel('my-service-name', {
       sampler: 'parentbased_traceidratio',
       sampler_ratio: 0.1,
       resource_attributes: { team: 'platform' },
     }),
   }

``service_name`` is required; everything else has a sensible default.
The full schema lives in :class:`be.config.schema.TelemetryConfig`.

After regenerating, install the pinned OTel package set via the
``opentelemetry`` extra:

.. code-block:: shell

   pip install kiln-generator[opentelemetry]

Generated apps already depend on ``kiln-generator`` (they import from
``ingot``), so the extra is the single source of truth for OTel
versions -- nothing extra to vendor or copy.

Then call ``init_telemetry`` from your app entry point, before mounting
the generated router:

.. code-block:: python

   from fastapi import FastAPI
   from _generated.routes import router
   from _generated.telemetry import init_telemetry

   app = FastAPI()
   init_telemetry(app)
   app.include_router(router)

Deployment environment
----------------------

The ``deployment.environment.name`` resource attribute (dev / staging /
prod) is intentionally **not** a code-gen argument: the same artifact
should ship across environments.  It's read from the env var named by
``environment_env`` at startup (default: ``ENVIRONMENT``):

.. code-block:: shell

   ENVIRONMENT=prod ./run-app

Override the variable name if your deployment already exports a
different one:

.. code-block:: jsonnet

   telemetry.otel('my-service-name', {
     environment_env: 'DEPLOY_ENV',
   })

When the variable is unset (or empty), the attribute is omitted.

What you get
------------

.. list-table::
   :header-rows: 1
   :widths: 25 30 45

   * - Signal
     - Source
     - Span / metric name
   * - HTTP server span
     - ``FastAPIInstrumentor``
     - one per request
   * - Internal handler span
     - ``@traced_handler``
     - ``{resource}.{op}`` (CRUD or action)
   * - DB client span
     - ``SQLAlchemyInstrumentor``
     - per query
   * - Outbound HTTP (``httpx``)
     - ``HTTPXClientInstrumentor``
     - opt-in via ``instrument_httpx``
   * - Outbound HTTP (``requests``)
     - ``RequestsInstrumentor``
     - opt-in via ``instrument_requests``
   * - Metrics
     - OTLP ``MeterProvider``
     - wired; user code emits

Internal handler spans carry low-cardinality attributes for filtering:

* ``be.resource`` — e.g. ``"article"``
* ``be.op`` — e.g. ``"get"`` for CRUD, ``"publish"`` for actions

Both CRUD ops and user-defined actions go through the same
``@traced_handler`` decorator and the same ``be.op`` attribute — the
*value* discriminates (be's CRUD names are a fixed small set;
anything else is a user-defined action).

Sampler defaults
----------------

The default sampler is ``parentbased_always_on``: friendly for
development, expensive in production.  Production deployments
typically switch to:

.. code-block:: jsonnet

   sampler: 'parentbased_traceidratio',
   sampler_ratio: 0.05,

Sampling at 5% with parent-based propagation gives you full traces for
sampled requests while keeping ingest volume manageable.

Exporter
--------

By default the generated ``init_telemetry`` does not pin a transport --
it instantiates the OTLP HTTP exporter with no arguments, and the OTel
SDK reads the standard environment variables itself at construct time:

.. code-block:: shell

   OTEL_EXPORTER_OTLP_ENDPOINT=https://collector.example.com
   OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer abc123

This keeps the same artifact deployable across environments.  Override
with ``exporter: 'otlp_grpc' | 'console' | 'none'`` if you want to pin
a specific transport at code-generation time.

There are no kiln-side knobs for the env-var *names* — point your
deployment at the standard OTel ones.

``otlp_grpc`` is additive: install it alongside the base extra,

.. code-block:: shell

   pip install 'kiln-generator[opentelemetry,opentelemetry-grpc]'

The gRPC exporter lives in its own extra because it pulls in protobuf
and grpc-io -- roughly an order of magnitude heavier than the HTTP
transport.  Generated code imports the gRPC exporter lazily, so apps
that stay on OTLP/HTTP never load the gRPC stack even when both
extras are present.

Per-resource and per-op opt-out
-------------------------------

The project-level ``span_per_handler`` / ``span_per_action`` toggles
control tracing globally.  Hot-path or low-value resources can opt out
without disabling telemetry overall:

.. code-block:: jsonnet

   {
     model: 'health.models.Probe',
     trace: false,  // skip spans for every op on this resource
     operations: [
       { name: 'get' },
     ],
   }

The same field works per-operation:

.. code-block:: jsonnet

   { name: 'list', trace: false }  // skip the spans for this op only

The HTTP server span from ``FastAPIInstrumentor`` is unaffected by these
overrides -- they only suppress be's internal handler/action span
and its ``be.resource`` / ``be.op`` attributes.

PII and the auth router
-----------------------

``capture_request_body`` and ``capture_response_body`` default to
**off** because request and response payloads commonly contain PII.
Even when they are turned on, the generated auth router
(``auth/router.py``) explicitly scrubs:

* ``http.request.body``, ``http.response.body``
* ``http.request.header.authorization``
* ``http.request.header.cookie``, ``http.response.header.set-cookie``

with a ``[scrubbed]`` placeholder via
``scrub_current_span_attributes(...)``.  A placeholder rather than
attribute removal so a "missing ``http.request.body``" alert doesn't
mask a real outage.

Logging
-------

be does not generate logging calls in CRUD handlers -- *you* emit
logs, and the two telemetry knobs below decide what happens to them.
Both are off by default.

Library assumption
^^^^^^^^^^^^^^^^^^

be assumes the **stdlib** ``logging`` **module**.  Loguru and
structlog both interoperate with stdlib (loguru via
``InterceptHandler``, structlog via ``LoggerFactory(stdlib=True)``);
set them up that way and the rest of this section applies unchanged.

``instrument_logging``: trace correlation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: jsonnet

   instrument_logging: true,

Wires ``opentelemetry.instrumentation.logging.LoggingInstrumentor``,
which patches ``logging.LogRecord`` so every record carries
``otelTraceID``, ``otelSpanID``, ``otelTraceSampled``, and
``otelServiceName``.  The default log format string is also updated to
include them.  This does **not** export logs anywhere -- it only adds
trace IDs to whatever sink you're already using (stdout, file, syslog,
etc.).  Use it when your logs go to a different backend than your
traces and you want to jump between them.

``logs``: OTLP log export
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: jsonnet

   logs: true,

Builds an ``opentelemetry.sdk._logs.LoggerProvider``, installs it
globally, and attaches an ``opentelemetry.sdk._logs.LoggingHandler`` to
the **stdlib root logger** at level ``NOTSET`` so every record routes
through OTLP alongside your traces and metrics.  The handler runs
*in addition* to your existing handlers -- ``print``-style stdout logs
keep working; OTLP just becomes another sink.

Two consequences worth knowing:

1. The root logger needs a level set somewhere
   (``logging.basicConfig``, uvicorn's logging config, etc.) for any
   record to actually reach the handler.  ``NOTSET`` defers to the
   loggers' levels, it doesn't override them upward.
2. The OTel logs SDK API is the youngest of the three signals and is
   most likely to churn between OTel releases.  Pin versions tightly
   (the ``[opentelemetry]`` extra already does this) and re-test on
   upgrade.

Combining
^^^^^^^^^

Most teams that turn on either knob want both:

.. code-block:: jsonnet

   logs: true,
   instrument_logging: true,

Records flow:

.. code-block:: text

   your code
     -> logging.getLogger().info(...)
          -> LoggingInstrumentor adds trace IDs to the record
          -> root logger handlers (stdout, etc.) fire
          -> OTLP LoggingHandler fires, ships to the collector

Emitting your own signals
-------------------------

``init_telemetry`` installs the global tracer, meter, and logger
providers — anything in your code can ask the OTel API for them and
start emitting.  No kiln-side wiring required.

Custom traces
^^^^^^^^^^^^^

Get a tracer once at module level; start spans where you need them.
Spans nest automatically under whatever's active, so a span started
inside a handler ends up under that handler's ``@traced_handler``
span:

.. code-block:: python

   from opentelemetry import trace

   tracer = trace.get_tracer(__name__)

   async def publish_article(article, db, body):
       with tracer.start_as_current_span("render_markdown") as span:
           span.set_attribute("article.length", len(body.content))
           rendered = render(body.content)
       # rest of the action…

Span attributes are **per-span**, so high-cardinality values like
``article.id`` are fine here.  Don't put them on metric attributes
(see below).

Custom metrics
^^^^^^^^^^^^^^

Get a meter once, register an instrument once at module level, record
on it from anywhere.  Counters, histograms, up-down counters, and
observable gauges are all available:

.. code-block:: python

   from opentelemetry import metrics

   meter = metrics.get_meter(__name__)

   published_total = meter.create_counter(
       "blog.articles.published",
       unit="1",
       description="Articles successfully published.",
   )
   publish_latency = meter.create_histogram(
       "blog.articles.publish_duration",
       unit="ms",
       description="Time spent in the publish action.",
   )

   async def publish_article(article, db, body):
       started = time.monotonic()
       # …work…
       published_total.add(1, {"author_type": article.author.type})
       publish_latency.record(
           (time.monotonic() - started) * 1000,
           {"author_type": article.author.type},
       )

**Cardinality matters here.**  Metric attributes are *dimensions* —
each unique combination is a separate time series at the backend.
Use small enumerations (``author_type ∈ {staff, guest}``), never
per-row identifiers (``article.id``).  Put the high-cardinality stuff
on a span attribute instead.

Custom logs
^^^^^^^^^^^

Use stdlib ``logging``.  With ``instrument_logging=True``, every
record gets ``otelTraceID`` / ``otelSpanID`` injected; with
``logs=True``, every record also ships over OTLP via the handler
be attaches to the root logger.  You don't need to import anything
OTel-specific:

.. code-block:: python

   import logging

   logger = logging.getLogger(__name__)

   async def publish_article(article, db, body):
       logger.info(
           "publishing article",
           extra={"article_id": str(article.id), "kind": body.kind},
       )

If you use loguru or structlog, route them through stdlib (loguru's
``InterceptHandler``, structlog's ``LoggerFactory(stdlib=True)``) and
the two toggles still work unchanged.

Naming and gotchas
^^^^^^^^^^^^^^^^^^

* **Tracer / meter / logger names** are conventionally ``__name__``.
  They populate the *instrumentation scope* facet at the backend —
  keep them stable so dashboards stay readable.
* **Resource attributes vs span/metric attributes.** Resource
  attributes (set once at ``init_telemetry``, e.g. ``service.name``,
  ``team``) describe the *service*; span and metric attributes
  describe an *event*.  Don't repeat resource attributes per span.
* **Imports.**  Use ``from opentelemetry import metrics`` (not
  ``import opentelemetry.metrics``) — the SDK assumes the former
  import shape for some of its internal lazy-loading.

Pinned versions
---------------

The ``kiln-generator[opentelemetry]`` extra pins the OTel packages to
a coherent release pair:

.. code-block:: text

   opentelemetry-api==1.29.0
   opentelemetry-sdk==1.29.0
   opentelemetry-exporter-otlp-proto-http==1.29.0
   opentelemetry-instrumentation-fastapi==0.50b0
   opentelemetry-instrumentation-sqlalchemy==0.50b0
   opentelemetry-instrumentation-requests==0.50b0

The optional ``kiln-generator[opentelemetry-grpc]`` extra adds:

.. code-block:: text

   opentelemetry-exporter-otlp-proto-grpc==1.29.0

The instrumentation packages ride a separate ``0.x.b`` version line
that stabilises later than the core SDK; bump core (``1.x``) and
instrumentation (``0.x.b``) in lockstep when upgrading.