Open-source data observability for modern data teams

Use cases

Monitor your data warehouse in minutes:

  • Data anomalies monitoring as dbt tests
  • Data lineage made simple, reliable, and automated
  • dbt operational monitoring
  • Slack alerts

Support us with a ⭐ on GitHub.

Quick start

Quick start: Data monitoring as dbt tests in minutes.

Quick start: Data lineage.

Our full documentation is available here.

Join our Slack to learn more about Elementary.

(Not a dbt user? You can still use Elementary data monitoring; reach out to us on Slack and we will help.)

Data anomalies monitoring as dbt tests

Elementary delivers data monitoring and anomaly detection as dbt tests.

Elementary dbt tests are data monitors that collect metrics and metadata over time. On each execution, the tests analyze the new data, compare it to historical metrics, and alert on anomalies and outliers.

Elementary data monitors as tests are configured and executed like native tests in your project!
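
For illustration, here is a minimal sketch of how such a test might be configured (the model and column names are placeholders; the elementary.table_anomalies / elementary.column_anomalies test names and the timestamp_column config key match the examples that appear later on this page):

    models:
      - name: orders                         # placeholder model name
        config:
          elementary:
            timestamp_column: "updated_at"   # column used to slice the data over time
        tests:
          - elementary.table_anomalies:      # table-level monitors
              table_anomalies:
                - row_count
                - freshness
        columns:
          - name: customer_id                # placeholder column name
            tests:
              - elementary.column_anomalies: # column-level monitors
                  column_anomalies:
                    - missing_count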

Demo & sandbox

Data anomalies monitoring as dbt tests demo video.
Try out our live lineage sandbox here.

Slack configuration

Community & Support

For additional information and help, you can use one of these channels:

  • Slack (Live chat with the team, support, discussions, etc.)
  • GitHub issues (Bug reports, feature requests)
  • Roadmap (Vote for features and add your inputs)
  • Twitter (Updates on new releases and announcements)

Integrations

  • Snowflake
  • BigQuery
  • Redshift - Data monitoring

Ask us for integrations on Slack or as a GitHub issue.

License

Elementary is licensed under Apache License 2.0. See the LICENSE file for licensing information.

Comments
  • [Question] How can we prepare `table_monitors_config`?

    [Question] How can we prepare `table_monitors_config`?

    Overview

    I tried to monitor dbt tests with jaffle_shop by following the documentation, but I was not able to upload the artifacts because the destination table is missing. If I understand correctly, we have to create the table table_monitors_config ahead of time, but the documentation doesn't describe table_monitors_config. How can we prepare the table?

    Environments

    • Python 3.8
    • dbt 1.0.3
    • elementary 0.3.2.

    Error message

    09:22:03  Running 2 on-run-end hooks
    09:22:33  1 of 2 START hook: jaffle_shop.on-run-end.0..................................... [RUN]
    09:22:33  1 of 2 OK hook: jaffle_shop.on-run-end.0........................................ [OK in 0.00s]
    09:22:33  2 of 2 START hook: elementary.on-run-end.0...................................... [RUN]
    09:22:33  2 of 2 OK hook: elementary.on-run-end.0......................................... [OK in 0.00s]
    09:22:33
    09:22:33
    09:22:33  Finished running 8 view models, 9 incremental models, 2 table models, 3 seeds, 17 tests, 3 hooks in 44.13s.
    09:22:33
    09:22:33  Completed with 2 errors and 0 warnings:
    09:22:33
    09:22:33  Runtime Error in model filtered_information_schema_columns (models/edr/metadata_store/filtered_information_schema_columns.sql)
    09:22:33    404 Not found: Table sandbox-project:jaffle_shop_elementary.table_monitors_config was not found in location asia-northeast1
    09:22:33
    09:22:33    (job ID: d0f16aa6-b33d-493f-b0d1-8b934a090682)
    09:22:33
    09:22:33  Runtime Error in model filtered_information_schema_tables (models/edr/metadata_store/filtered_information_schema_tables.sql)
    09:22:33    404 Not found: Table sandbox-project:jaffle_shop_elementary.table_monitors_config was not found in location asia-northeast1
    09:22:33
    09:22:33    (job ID: 03fc0b75-262f-49f8-8957-96b55268e9ac)
    
  • SQL compilation error when using column-level anomalies

    SQL compilation error when using column-level anomalies

    Hey all,

    I added the elementary package to the dbt repository and used dbt run to create all the required tables. But when I tried to add column-level anomalies, dbt run gave me the following error:

    19:10:32    001003 (42000): SQL compilation error:
    19:10:32    syntax error line 7 at position 19 unexpected '""'.
    19:10:32    syntax error line 10 at position 15 unexpected ''day''.
    19:10:32    syntax error line 10 at position 26 unexpected '('.
    19:10:32    syntax error line 10 at position 40 unexpected 'as'.
    19:10:32    syntax error line 12 at position 1 unexpected ')'.
    

    The configuration I added to the yml file is:

      - name: table_name
        config:
          elementary:
            timestamp_column: "_inserted_at"
        tests:
          - elementary.table_anomalies:
              table_anomalies:
                - row_count
                - freshness
        columns:
          - name: "id"
            description: " "
            quote: true
            tests:
              - not_null
              - unique
              - elementary.column_anomalies:
                  column_anomalies:
                    - missing_count
                    - min_length
    

    However, table-level anomalies worked as expected. I tried to look up compiled SQL files from target/compiled and target/run, but couldn't find any models relevant to this problem. Any ideas?

  • Running process got stuck

    Running process got stuck

    Hi team, I tried to run elementary, but it just got stuck after logging into Snowflake. Nothing has changed on my screen for at least an hour.

    Do you have any ideas about what could go wrong?

  • [BigQuery] Syntax error: Illegal escape sequence

    [BigQuery] Syntax error: Illegal escape sequence

    Hi, I'm testing your dbt package for our data warehouse, which is hosted on Google BigQuery. The generated SQL scripts are producing errors. One big error is coming from the on-run-end hook:

    Database Error
      Syntax error: Illegal escape sequence: \E at [6:566]
    

    I've checked the generated SQL code and it seems that the backslashes in the model paths are producing an error, e.g. this part: ...'models\business_vault\EXCHANGERATESAPI\msrglB\currency_datedexchangerates_xrio_brs.sql','business_vault\EXCHANGERATESAPI\msrglB\currency_datedexchangerates_xrio_brs.sql','2022-04-05 11:04:47')

    I think there is some escaping missing.

  • Add support for Slack workflows format in alerts

    Add support for Slack workflows format in alerts

    The current format does not support workflows, as it only supports key-value pairs.

    {
      "description": <ALERT_DESCRIPTION>,
      "table": <table_name>,
      "detected_at" <detected_at>
    }
    

    Possible solution: a config property where you can set the type of Slack integration you are using (either 'workflow' or 'webhook'). If the config is set to 'workflow', the request body would be formatted appropriately.
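
    A hypothetical sketch of what such a config could look like (the property names below, slack_integration_type and slack_webhook_url, are illustrative only and do not exist today):

      # Hypothetical sketch - these keys are illustrative, not existing settings
      slack_integration_type: workflow    # 'workflow' or 'webhook' (current behaviour)
      slack_webhook_url: https://hooks.slack.com/...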

  • Surround schema information table with back quotes

    Surround schema information table with back quotes

    Overview

    We have to surround the schema information table with backticks (without them, BigQuery fails on identifiers that contain '-', as in the error below).

    Error logs

    05:20:19  target not specified in profile 'elementary', using 'default'
    Pulling query history from BigQuery (!)  in 0.9s (4.32/s)
    Traceback (most recent call last):
      File "/Users/yu/anaconda2/envs/python3.8/bin/edr", line 8, in <module>
        sys.exit(cli())
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/cli/../lineage/cli.py", line 253, in generate
        queries = query_history_extractor.extract_queries(start_date, end_date)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/lineage/query_history.py", line 85, in extract_queries
        self._query_history_table(start_date, end_date)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/lineage/bigquery_query_history.py", line 98, in _query_history_table
        rows = list(job.result())
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py", line 1447, in result
        do_get_result()
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
        return retry_target(
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 189, in retry_target
        return target()
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/job/query.py", line 1437, in do_get_result
        super(QueryJob, self).result(retry=retry, timeout=timeout)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/job/base.py", line 727, in result
        return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
      File "/Users/yu/anaconda2/envs/python3.8/lib/python3.8/site-packages/google/api_core/future/polling.py", line 135, in result
        raise self._exception
    google.api_core.exceptions.BadRequest: 400 Syntax error: Expected ")" but got "-" at [7:33]
    
  • Artifacts uploader fails on Bigquery if reaches query size limit

    Artifacts uploader fails on Bigquery if reaches query size limit

    The query on the artifacts uploader on-run-end hook failed with the following error:

    Database error while running on-run-end
    Encountered an error:
    Database Error
      The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters.
     
    

    This probably happens on very large dbt projects.

  • Column size issue in 'all_columns_anomalies' test

    Column size issue in 'all_columns_anomalies' test

    Issue description: In the 'all_columns_anomalies' test, the results table is created from the first batch of results (the first column), so the sizes of the result columns are derived from the content of that batch. If a following batch contains larger values, inserting that batch fails.

    Solution: Create the table with an empty template (like we do with the incremental tables).

    The fix will be deployed as part of the Redshift integration branch: https://github.com/elementary-data/dbt-data-reliability/pull/20

  • Export lineage relationships in a file

    Export lineage relationships in a file

    Some data discovery platforms, like Amundsen, can visualize data lineage relationships at a glance while you are exploring data, but you need to bring your own lineage metadata. I'm really interested in an option to export the lineage relations to a file (CSV or JSON) so I can pass it to the Amundsen extractor. Amundsen's basic lineage extractor uses a single CSV file with the structure: source_table, target_table. If the file were JSON, additional info could be included, so we could filter per operation type: CREATE_VIEW or CREATE_TABLE.

  • Move schema and/or database definition from profiles.yml to CLI

    Move schema and/or database definition from profiles.yml to CLI

    Today the database and schema for the lineage graph are defined in a yml file. These are used for both the connection and the filtering of the graph/queries. The change is to get these as input in the CLI.

    Pros: would enable easier workflow creation for different datasets using the same configuration file.

    Cons: a change from how these are used in dbt, which is familiar to users.

  • Integrate with Snowflake's new ACCESS_HISTORY

    Integrate with Snowflake's new ACCESS_HISTORY

    Snowflake is about to release a new feature that exposes the set of tables and columns accessed by each query (we will probably still need to parse the query to learn the relations between the columns for column-level lineage).

    Main benefits -

    • Make the tool faster (no need to extract table lineage using our Python parser)
    • We might be able to simplify the setup (we will consider removing some Python dependencies, as they won't be needed anymore)

    Downsides -

    • Requires permissions to account_usage
    • For column-level lineage, parsing the relations between columns is still needed

    Open questions / how does this work with the following -

    • Views
    • Copy into commands
    • Subqueries

    Open questions for columns -

    • Operators / Select * / Cases
    • Subqueries
    • Relations

    See more details about this upcoming release here -

    • https://community.snowflake.com/s/article/Pending-Behavior-Change-Log (2021_10)
  • Add filtering options for the data monitoring tests

    Add filtering options for the data monitoring tests

    Currently, we use the timestamp column as a filter, or no filtering at all (run on the entire table). For some use cases, this is not enough.

    Use cases:

    • Snapshot tables - timestamp is not relevant, you want to filter on rows where 'valid_to' is null
    • Big tables with no timestamp column - order by + limit?

    dbt supports where conditions using this macro: https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/include/global_project/macros/materializations/tests/where_subquery.sql (Documented here: https://docs.getdbt.com/reference/resource-configs/where)
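
    For reference, this is how dbt's documented where config is applied to a test today (standard dbt syntax from the docs linked above; whether Elementary's monitors play well with the wrapped subquery is part of what needs to be answered here):

      tests:
        - elementary.table_anomalies:
            table_anomalies:
              - row_count
            config:
              where: "valid_to is null"   # e.g. keep only the current rows of a snapshot table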

    We need to understand whether there are use cases where you need both the timestamp and additional filtering. Should there be different behaviour for such tests?

  • Pass timestamp_column as a test param

    Pass timestamp_column as a test param

    Task Overview

    • Currently timestamp_column is the only configuration that needs to be set globally in the model config section (usually it is configured in properties.yml, under elementary in the config tag).
    • Passing the timestamp_column as a test param will enable running multiple tests with different timestamp columns, for example running one test with an updated_at column, which represents the update time of the row, and another with event_time, which represents the time the event was sent (see the sketch after this list).
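
    A sketch of the desired configuration once this param exists (timestamp_column as a test argument is the proposed addition, not current syntax):

      tests:
        - elementary.table_anomalies:
            table_anomalies:
              - freshness
            timestamp_column: updated_at    # proposed param: update time of the row
        - elementary.table_anomalies:
            table_anomalies:
              - row_count
            timestamp_column: event_time    # proposed param: time the event was sent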

    Design

    • There are three main files where the test macros are implemented - test_table_anomalies.sql, test_column_anomalies.sql and test_all_columns_anomalies.sql (please note that currently there is some code duplication in these files, and in the future we will probably fix it).

    • All of these test macros should receive a new parameter called 'timestamp_column', defined at the end of the parameter list with a default value of none.

    • In each test there are currently two lines of code responsible for extracting the timestamp_column from the global model config:

        {%- set table_config = elementary.get_table_config_from_graph(model) %}
        {%- set timestamp_column = elementary.insensitive_get_dict_value(table_config, 'timestamp_column') %}

    • The macro 'get_table_config_from_graph' returns the timestamp_column and its normalized data type (called 'timestamp_column_data_type').

    • The following code in the macro 'get_table_config_from_graph', which is responsible for finding the timestamp column data type, should be extracted to a macro called find_normalized_data_type_for_column:

        {% set columns_from_relation = adapter.get_columns_in_relation(model_relation) %}
        {% if columns_from_relation and columns_from_relation is iterable %}
          {% for column_obj in columns_from_relation %}
            {% if column_obj.column | lower == timestamp_column | lower %}
              {% set timestamp_column_data_type = elementary.normalize_data_type(column_obj.dtype) %}

    • Then in the test itself, if the received timestamp_column param is not none, use this extracted macro to find the column's normalized data type, and pass the timestamp_column and timestamp_column_data_type to the relevant functions (get_is_column_timestamp, column_monitoring_query, table_monitoring_query).

    • If the timestamp_column is none, use the global timestamp column as it is implemented today.

  • Support custom anomaly threshold

    Support custom anomaly threshold

    Task Overview

    • Currently all anomalies are calculated statistically with a fixed global threshold (it's configurable with a var called anomaly_score_threshold; its default is 3). In some use cases it's better to define a more or less sensitive threshold based on the underlying dataset. Currently all the monitors are implemented as dbt tests, so the ideal solution would be an additional test parameter that receives a custom anomaly threshold for a specific test (see the sketch below). If this parameter is not provided to the test, the default should remain as configured with the global var 'anomaly_score_threshold'.
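
    A sketch of how this could look as a test argument (anomaly_threshold is the proposed new param; the var anomaly_score_threshold already exists as the global default):

      tests:
        - elementary.column_anomalies:
            column_anomalies:
              - null_count
            anomaly_threshold: 2    # proposed param; when omitted, the global var
                                    # anomaly_score_threshold (default 3) applies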

    Design

    • There are three main files where the test macros are implemented - test_table_anomalies.sql, test_column_anomalies.sql and test_all_columns_anomalies.sql (please note that currently there is some code duplication in these files and in the future we will probably fix it).
    • All of these test macros should receive a new parameter called 'anomaly_threshold', defined at the end of the parameter list with a default value of none.
    • Each test should then pass this value to the macro 'get_anomaly_query'.
    • If the received value of this param is none, the code in the 'get_anomaly_query' macro should use the macro elementary.get_config_var to get the global var 'anomaly_score_threshold' and use its value instead (this is the behavior today).
    • The anomaly query should then use this param (or the global value of the var anomaly_score_threshold) to determine whether there is an anomaly (look for 'where abs(anomaly_score)').
  • Databricks integration

    Databricks integration

    This is a new type of integration that was requested in the Slack community.

    • From a quick look it seems that dbt already supports Databricks, and most of the features appear to be supported
    • The monitoring is implemented as dbt tests, so we will need to run the package and its tests in a Databricks environment to verify they work as expected on this platform
  • Multiple profiles for alerting

    Multiple profiles for alerting

    Our current configuration for alerting with the CLI is to configure a profile named 'elementary' and point it at the schema where 'data_monitoring_metrics' is located. This means there can only be one profile for alerting.

    The need here is to enable monitoring of two different schemas that are managed separately on the same db.

    Hemant - could you confirm that this describes the need well?
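
    One possible direction, assuming standard dbt profiles.yml structure: a single 'elementary' profile with one output per monitored schema (sketch only; how the CLI would choose between the outputs is exactly what this issue needs to define):

      # Sketch only - standard dbt profiles.yml structure with two outputs
      elementary:
        target: team_a
        outputs:
          team_a:
            type: snowflake
            account: <account>
            user: <user>
            password: <password>
            warehouse: <warehouse>
            database: analytics
            schema: team_a_elementary    # schema holding this team's data_monitoring_metrics
            threads: 1
          team_b:
            type: snowflake
            account: <account>
            user: <user>
            password: <password>
            warehouse: <warehouse>
            database: analytics
            schema: team_b_elementary    # schema holding the other team's data_monitoring_metrics
            threads: 1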

  • Tests configuration changes

    Tests configuration changes

    The key for choosing specific tests today has the same name as the test. We got feedback that this is confusing and inconsistent between tests.

    # Current format:
    
            - elementary.all_columns_anomalies:
                all_columns_anomalies:
                  - null_count
    
    # Suggested format:
    
            - elementary.all_columns_anomalies:
                monitors:
                  - null_count
    
    

    Also - we should accept 'all' as another option to activate all the monitors, as an explicit 'all' is more intuitive than activating all monitors by default (see the sketch below).
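
    With the suggested 'monitors' key, the 'all' option might look like this (a sketch of the proposal, not current behaviour):

            - elementary.all_columns_anomalies:
                monitors: all    # proposed shorthand to activate every monitor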
