Preamble:
Interactive Brokers has an immediate need for this, but other customers, in my view those at mature or advanced levels of Manta usage, might benefit from it as well. Interactive Brokers currently uses Manta to produce data-lineage over two technologies: Oracle and OpenManta Extensions. OpenManta Extensions are used to model technologies such as Java applications, Google ProtoBuf RPCs, complex flat-file structures, and Oracle data structures. As you can imagine, a user declaring Oracle objects in OpenManta Extensions might declare an Oracle table and columns that simply do not exist, or might mistype a table or column name so that it never matches anything found by the out-of-the-box Oracle scanner.
This raises an issue of data-lineage untrustworthiness: the customer can no longer be sure whether to trust the metadata that constitutes the basis of the data-lineage. A user producing OpenManta Extensions files or, to generalize, producing custom data-lineage outside of the out-of-the-box scanners can more quickly and easily introduce entropy or incorrect metadata, which in turn produces incorrect data lineage.
The idea:
To tackle the sentiment of data-lineage untrustworthiness in the presence of custom sources of lineage such as OpenManta Extensions, direct links, or even custom facets in OpenLineage, I propose to disclose the source of a given lineage node via the Manta Query API. So, from the customer's perspective, anyone who makes use of the Manta API would be able to:
1) determine the source connection name when searching for a given node
2) inquire about the source connection name by providing a given node identifier
This proposal therefore introduces changes to the following REST services of the Manta Query API:
Service: GET /manta-dataflow-server/public/v1/rev/{rev}/nodes/{id}
Change: Introduce a new optional boolean URL query parameter "include-connection" that defaults to False; when set to True, the response includes the connection name that created the node identified by the {id} parameter
Example:
Request: GET /manta-dataflow-server/public/v1/rev/0.00008/nodes/123?include-connection=true
Result:
{
"id": 0,
"name": "Party",
"type": "Table",
"resource": "Oracle",
"Source": "Connection1",
"parent": 0
}
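As a rough sketch of how a client might call this service with the proposed flag, the Python helper below only builds the request URL. The host name and the helper name node_url are assumptions for illustration; the path and the "include-connection" parameter are taken from the proposal above.

```python
from urllib.parse import quote, urlencode

# Base path of the Manta Query API as written in this proposal.
BASE_PATH = "/manta-dataflow-server/public/v1"


def node_url(host: str, rev: str, node_id: int, include_connection: bool = False) -> str:
    """Build the GET URL for a single node.

    The proposed "include-connection" flag is only appended when it is True,
    so existing calls keep producing exactly the same URL as today.
    """
    url = f"{host}{BASE_PATH}/rev/{quote(rev)}/nodes/{node_id}"
    if include_connection:
        url += "?" + urlencode({"include-connection": "true"})
    return url


# "manta.example.com" is a placeholder host, not a real server.
print(node_url("https://manta.example.com", "0.00008", 123, include_connection=True))
```

Leaving the parameter out entirely when it is False keeps the sketch compatible with servers that do not know the new flag.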
Service: GET /manta-dataflow-server/public/v1/rev/{rev}/nodes/{id}/path?selectedLayer={selectedLayer}
Change: New optional boolean URL query parameter "include-connection" that defaults to False; when set to True, the response includes the connection name that created the node identified by the {id} parameter
Example:
Request: GET /manta-dataflow-server/public/v1/rev/0.00008/nodes/456/path?selectedLayer=Physical&include-connection=true
Result:
[
{
"id": 0,
"name": "Party",
"type": "Table",
"resource": "Oracle",
"Source": "Connection1",
"parent": 0
}
]
Service: GET /manta-dataflow-server/public/v1/rev/{rev}/nodes/search-by-name?name={name}&type={type}
Change: New optional boolean URL query parameter "include-connection" that defaults to False; when set to True, the response includes, for each node found, the connection name that created it
Example:
Request: GET /manta-dataflow-server/public/v1/rev/0.00008/nodes/search-by-name?name=table1&type=Table&include-connection=true
Result:
[
{
"id": 0,
"name": "Party",
"type": "Table",
"resource": "Oracle",
"Source": "Connection1",
"parent": 0
}
]
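One way a client could use the proposed "Source" field on search results is to single out nodes that were created by custom connections. This is a minimal sketch, assuming the search returns a list of objects shaped like the example above; the connection name "OpenMantaExt1", the second node, and the function name are made up for illustration.

```python
def nodes_from_connections(nodes, connections):
    """Return the nodes whose "Source" field names one of the given connections.

    Nodes without a "Source" field (for example, responses fetched without
    include-connection=true) are never selected.
    """
    return [node for node in nodes if node.get("Source") in connections]


# The first entry mirrors the example result above; the second is hypothetical.
results = [
    {"id": 0, "name": "Party", "type": "Table", "resource": "Oracle",
     "Source": "Connection1", "parent": 0},
    {"id": 1, "name": "Partty", "type": "Table", "resource": "Oracle",
     "Source": "OpenMantaExt1", "parent": 0},
]

# Nodes declared through the (hypothetical) custom connection deserve a second look.
suspect = nodes_from_connections(results, {"OpenMantaExt1"})
print([node["name"] for node in suspect])
```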
Service: POST /manta-dataflow-server/public/v1/rev/{rev}/nodes/search-by-path
Change:
New optional boolean URL query parameter "include-connection" that defaults to False; when set to True, the response includes the connection name that created the node identified by the given path
New JSON field "Source", present when "include-connection" is set to True, holding the connection name that created the nodes/path
Example:
POST /manta-dataflow-server/public/v1/rev/0.00008/nodes/search-by-path?include-connection=true
Request:
[
{"name": "column1", "type": "Column"},
{"name": "table1", "type": "Table"},
{"name": "schema1", "type": "Schema"},
{"name": "database1", "type": "Database", "resource": "Oracle"}
]
Response:
{
"id": 0,
"name": "Party",
"type": "Table",
"resource": "Oracle",
"Source": "Connection1",
"parent": 0
}
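The search-by-path call above could be prepared from Python roughly as follows. This is only a sketch: the host name and the function name are assumptions, and the request object is built but not sent, so the snippet runs without a Manta server.

```python
import json
from urllib.request import Request


def search_by_path_request(host, rev, path, include_connection=False):
    """Build (but do not send) the POST request for nodes/search-by-path.

    `path` is the list of {"name", "type", ...} objects shown in the example
    request body, ordered as in the proposal.
    """
    url = f"{host}/manta-dataflow-server/public/v1/rev/{rev}/nodes/search-by-path"
    if include_connection:
        url += "?include-connection=true"
    body = json.dumps(path).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


path = [
    {"name": "column1", "type": "Column"},
    {"name": "table1", "type": "Table"},
    {"name": "schema1", "type": "Schema"},
    {"name": "database1", "type": "Database", "resource": "Oracle"},
]

# "manta.example.com" is a placeholder host, not a real server.
req = search_by_path_request("https://manta.example.com", "0.00008", path,
                             include_connection=True)
print(req.get_method(), req.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it; authentication is left out because the proposal does not describe it.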
This proposal also introduces a new REST Service for the Manta Query API:
Service: GET /manta-dataflow-server/public/v1/rev/{rev}/node-source-connection/{id}
Parameters:
rev - The revision number of the search
id - The node identifier from which to retrieve the configured connection name that created the node
Output:
Returns the configured connection name that created the node identified by the id parameter
Example:
request: GET /manta-dataflow-server/public/v1/rev/0.00008/node-source-connection/123
response: DB2_SANDOX
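A client wrapper for the proposed service might look like the sketch below. It assumes the response body is the bare connection name as plain text, as in the example above; the function names and the host are hypothetical, and the HTTP call is injectable so the example can run against a stub instead of a real server.

```python
from urllib.request import urlopen


def _http_get_text(url):
    """Default transport: perform a GET and return the body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def node_source_connection(host, rev, node_id, http_get=_http_get_text):
    """Return the connection name that created the node, per the proposed service.

    `http_get` can be swapped for a stub in tests; surrounding whitespace in
    the response body is stripped.
    """
    url = f"{host}/manta-dataflow-server/public/v1/rev/{rev}/node-source-connection/{node_id}"
    return http_get(url).strip()


# Stubbed transport returning the example response from this proposal.
name = node_source_connection("https://manta.example.com", "0.00008", 123,
                              http_get=lambda url: "DB2_SANDOX\n")
print(name)
```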
In theory, it should be safe to add the new Source attribute to the response without adding a new parameter, because the new value would simply be ignored by clients that do not expect it, but there's always someone...
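To illustrate why the opt-in parameter is the cautious choice, here is a sketch of two hypothetical client-side parsers: a strict one that rejects unknown fields and a lenient one that drops them. Both parsers and the set of expected keys are assumptions based on the example responses, not a description of any existing Manta client.

```python
# Fields present in the example responses before this proposal.
EXPECTED_KEYS = {"id", "name", "type", "resource", "parent"}


def parse_node_strict(payload):
    """Reject any payload that carries a field the client does not know."""
    unknown = set(payload) - EXPECTED_KEYS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return payload


def parse_node_lenient(payload):
    """Keep the known fields and silently ignore everything else."""
    return {key: value for key, value in payload.items() if key in EXPECTED_KEYS}


node = {"id": 0, "name": "Party", "type": "Table", "resource": "Oracle",
        "Source": "Connection1", "parent": 0}

print(parse_node_lenient(node))  # the lenient client is unaffected by "Source"
try:
    parse_node_strict(node)
except ValueError as err:
    print("strict client fails:", err)
```

A client written like the strict parser would break if "Source" appeared unconditionally, which is the case the "include-connection" flag guards against.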