Did you miss our LinkedIn Live session on Unity Catalog?
Blueprint Solution Architect and Databricks Champion Shannon Lowder answered questions live during the webinar and addressed many more that came through afterward. The following is a compilation of all the Q&As, including those we ran out of time for.
What is Unity Catalog, and what’s the biggest reason I should migrate to UC?
Unity Catalog, a data governance solution integrated into Databricks, offers relief from the complexity of data management. It efficiently organizes your data assets, controls access, and provides insights into data lineage, all centered around the user-friendly Catalog Explorer in your workspace.
We walked through the different types of objects you can find in a catalog, then looked at the other information available through Unity Catalog and Catalog Explorer, such as sample data, lineage, and insights.
When you move from Hive Metastore to UC, you want to stop thinking of your data objects as files and start thinking of them as logical objects like tables. Additionally, rather than managing file ACLs or roles, you can control access from the catalog itself. All of this can be done through the user interface or through SQL, Python, or API calls!
The best practice for access control in Unity Catalog is to determine the minimum number of groups and controls needed to express your security. This approach, rather than trying to have explicit control by object and user, ensures a secure environment without overwhelming you with unnecessary work.
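As a rough illustration of group-based access control, the sketch below generates Unity Catalog GRANT statements from a small mapping of schemas to groups. The catalog, schema, and group names are hypothetical, and the helper `grant_statements` is not part of any Databricks library; it only builds SQL text that you would review before running it.

```python
def grant_statements(catalog, schema_privileges):
    """Build GRANT statements from a {schema: {group: [privileges]}} mapping.

    Privileges are granted to groups at the schema level, so tables in the
    schema inherit them instead of being granted one by one.
    """
    statements = []
    # Every group that appears anywhere needs USE CATALOG on the catalog.
    groups = sorted({g for by_group in schema_privileges.values() for g in by_group})
    for group in groups:
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`;")
    # One schema-level grant per (schema, group) pair.
    for schema, by_group in schema_privileges.items():
        for group, privileges in by_group.items():
            privs = ", ".join(["USE SCHEMA", *privileges])
            statements.append(
                f"GRANT {privs} ON SCHEMA {catalog}.{schema} TO `{group}`;"
            )
    return statements


# Hypothetical example: two groups, one schema.
for stmt in grant_statements(
    "sales",
    {"curated": {"analysts": ["SELECT"], "data_engineers": ["SELECT", "MODIFY"]}},
):
    print(stmt)
```

Granting at the schema level to a handful of groups keeps the number of statements small, which is the point of the recommendation above.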
How can I see lineage as of a certain date?
While the UI is a nice way to explore lineage data, there are times when you want to sort and filter this data. The good news is that the system catalog is a feature in Unity Catalog that exposes both information schema and lineage information.
By querying the system.access.table_lineage table, you can filter results based on their event date and time. Now, instead of seeing all the lineage information for the past 90 days, you can look at events for a subset of that time period!
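A minimal sketch of such a query, built as a string in Python so it can be passed to `spark.sql` in a notebook. The column names used here (`source_table_full_name`, `target_table_full_name`, `entity_type`, `event_time`, `event_date`) are assumptions about the lineage table's schema and should be checked against your workspace; the table name `main.sales.orders` and the helper `lineage_query` are hypothetical.

```python
import re
from datetime import date


def lineage_query(table_full_name, start, end):
    """Return SQL listing lineage events for one table within a date range."""
    # Only accept a plain three-part name so it is safe to embed in SQL.
    if not re.fullmatch(r"\w+\.\w+\.\w+", table_full_name):
        raise ValueError("expected a catalog.schema.table name")
    if start > end:
        raise ValueError("start date must not be after end date")
    return (
        "SELECT source_table_full_name, target_table_full_name,\n"
        "       entity_type, event_time\n"
        "FROM system.access.table_lineage\n"
        f"WHERE (source_table_full_name = '{table_full_name}'\n"
        f"       OR target_table_full_name = '{table_full_name}')\n"
        f"  AND event_date BETWEEN DATE '{start.isoformat()}'"
        f" AND DATE '{end.isoformat()}'\n"
        "ORDER BY event_time"
    )


# Lineage events for the first week of May 2024 (hypothetical table name).
print(lineage_query("main.sales.orders", date(2024, 5, 1), date(2024, 5, 7)))
```

In a Databricks notebook the returned string could be run with `spark.sql(...)`, assuming you have been granted access to the system schema.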
How can I find assets in Unity Catalog?
We worked through the search bar at the top of Databricks’ web interface.
We talked about the different assets you can find using search, and how the search looks through Unity Catalog’s catalog, schema, and column names for your search term. We also discussed tagging and how it works with the search function.
How do you migrate from Hive Metastore to Unity Catalog?
We have created a three-step process to migrate from Hive Metastore to Unity Catalog.
Assess. First, we want to gather an inventory of all your data objects, ACLs, users, groups, clusters, notebooks, queries, etc., that you have today. We need this to see where the complexity lies in your environment. While Databricks Labs has released a product, UCX, that can perform several of the assessments, ours covers more object types and provides an impact study of the code in your workspaces and your data.
When migrating to Unity Catalog, subtle changes to the code are required. Some are as simple as moving from two-part names to three-part names, and some are as complex as rewriting a process to use data references rather than direct file access.
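To make the simplest of these changes concrete, here is a small sketch that turns a two-part `schema.table` name into a three-part `catalog.schema.table` name. The function name and the default catalog `main` are illustrative assumptions, and the sketch assumes identifiers do not contain dots inside backticks.

```python
def to_three_part_name(name, catalog):
    """Qualify a schema.table name with a catalog; leave three-part names alone."""
    parts = name.split(".")
    if len(parts) == 3:
        return name  # already catalog.schema.table
    if len(parts) == 2:
        return f"{catalog}.{name}"
    raise ValueError(f"expected schema.table or catalog.schema.table, got {name!r}")


print(to_three_part_name("sales.orders", "main"))       # main.sales.orders
print(to_three_part_name("main.sales.orders", "main"))  # main.sales.orders
```

Real code bases also embed table names inside SQL strings and file paths, which is where the more complex rewrites mentioned above come in.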
Next, we configure the desired state. During this process, we determine how you want to organize your catalogs. This will be different for each customer. Some are organized by line of business, some by subject area, and some by environment. You could also mix-and-match approaches. The goal is to simplify finding data in your environment.
Once we have this configuration, we move on to execution. During this step, we run several scripts that move you from the Hive Metastore to the Unity Catalog. Objects can be moved, copied, or redefined to meet your desired configuration.
The execution step isn’t an all-or-nothing process. We’ve designed these scripts so you can incrementally move to UC!
Moving external tables to UC
During the session, a question came in about how to move external tables from Hive to Unity Catalog. We talked through the SYNC command. This command allows a Hive Metastore table and a Unity Catalog table to reference the same cloud storage. That way, old code can continue to work without changing anything.
This also allows you to move to UC incrementally. Some users could continue to access the Hive metastore table, and others can move forward with the UC version.
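A hedged sketch of how SYNC statements might be generated for a list of external tables. It assumes the Unity Catalog schema keeps the same name as the Hive schema, and the catalog, schema, and table names are hypothetical. `DRY RUN` is included by default so the statement only reports what would happen.

```python
def sync_table_statement(target_catalog, schema, table, dry_run=True):
    """Build a SYNC TABLE statement from hive_metastore into a UC catalog."""
    stmt = (
        f"SYNC TABLE {target_catalog}.{schema}.{table} "
        f"FROM hive_metastore.{schema}.{table}"
    )
    # DRY RUN reports the outcome without creating or changing the UC table.
    return stmt + (" DRY RUN" if dry_run else "")


# Hypothetical external tables in the Hive schema "sales".
for table in ["orders", "customers"]:
    print(sync_table_statement("main", "sales", table))
```

Running the dry-run version first, then repeating with `dry_run=False` one table or schema at a time, fits the incremental approach described above.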
How do you address disruptions during migration?
Managing data can be challenging, especially when it arrives via streaming and there is no way to shut down the stream. With batched data updates, it is easier to find the SLA and schedule around it; with streaming, it is more difficult. In such cases, parallel processing is typically set up. For example, with Kafka, multiple consumer groups can be spun up to handle messages without interfering with each other. This lets you concentrate on moving historical data from the Hive Metastore into Unity Catalog. It’s crucial to get this process right, especially when dealing with a high volume of messages per hour or very wide messages.
How do you handle errors and rollbacks during migration?
All our scripts are designed to be non-destructive. Thus, rolling back is as simple as continuing to use the original Hive Metastore objects or the original code.
Is there a plan to allow catalogs to be shared across regions?
You can use Delta Sharing to share a catalog with another region. The secondary region can read the data but not update it. If you need to update the data from both regions, you could use an external location: updates from either region then target the same external location, or the external table defined on top of it in both metastores.
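For the read-only case, a provider-side setup could look roughly like the sketch below, which builds the Delta Sharing SQL for a list of tables. The share, table, and recipient names are hypothetical, and the sketch assumes the recipient for the secondary region's metastore has already been created.

```python
def share_statements(share, tables, recipient):
    """Build SQL that shares the given tables with an existing recipient."""
    stmts = [f"CREATE SHARE IF NOT EXISTS {share};"]
    # Each table is added to the share by its three-part name.
    stmts += [f"ALTER SHARE {share} ADD TABLE {table};" for table in tables]
    # Recipients get read-only access; they cannot update shared data.
    stmts.append(f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient};")
    return stmts


for stmt in share_statements(
    "sales_share", ["main.sales.orders", "main.sales.customers"], "emea_region"
):
    print(stmt)
```

On the receiving side, the secondary region would then mount the share as a catalog in its own metastore and query it read-only.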
Unity Catalog Migration
Our migration solution, powered by the Lakehouse Optimizer, is carefully built and rigorously tested to speed up your UC migration while minimizing errors.