Using BigQuery to access data stored in AWS S3

[Image: s3-biquery]

BigQuery is a Google product that enables data users to perform interactive analytics on massively large datasets within Google Cloud Platform (GCP). It is similar to Redshift from AWS and dashDB from IBM, but it is serverless. To use it, you need a GCP account and, of course, the data has to reside on storage within Google infrastructure, such as Google Drive or Google Cloud Storage. GCP provides a native service, the Storage Transfer Service, that lets you bring data over from AWS S3.

This part of the article provides a step-by-step guide and highlights some issues you may encounter (I did) while bringing data from AWS S3 (Simple Storage Service) to Google Cloud Storage (GCS).

Assumption:

You have access to Google Cloud Platform and AWS. 

Scenario:

Need to transfer data from S3 to GCS 

Steps:

  • Unload data from Redshift to S3 (refer to http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html for syntax if needed)

Note: by default, UNLOAD generates pipe (|) delimited files, so you may want to change the delimiter depending on how your data is stored in Redshift. A sample UNLOAD is shown below.
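For reference, an UNLOAD along the lines below does the job. The ReadLog table and the bucket/prefix are placeholders borrowed from later in this article, so adjust them (and the DELIMITER) to match your own data:

UNLOAD ('select LogData from ReadLog')
TO 's3://bucketname/InputData/dataplatformexperts/readlog_'
CREDENTIALS 'aws_access_key_id=xxxxxxxxxxx;aws_secret_access_key=xxxxxxxxxx'
DELIMITER AS ','      -- override the default pipe (|) delimiter if needed
ALLOWOVERWRITE;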

  • Log in to GCP and go to "Storage"

[Screenshot: gcs1]

  • Use the Transfer option

[Screenshot: gcs2]

  • As mentioned, BigQuery can access data stored anywhere within Google infrastructure, so we first need to transfer the data from S3 to GCS. The Transfer option lets us do exactly that.

[Screenshot: gcs3]

  • Specify Source, Target and Schedule for data transfer

[Screenshot: gcs4]

In the above screen, the following are the key configuration items needed to transfer data from S3 to GCS:

  1. Select the source as Amazon S3 bucket.
  2. Specify the bucket name. Do not append a forward slash (/) after the bucket name, and do not specify a full path; this field takes the bucket name ONLY.
  3. Provide an Access Key and Secret Key that have permission to read files from the source bucket.
  4. If the files live in folders inside the S3 bucket, use the filter and specify those folders as prefixes. For example, in my case the files were under the subfolder InputData/dataplatformexperts of the specified bucket, so I entered that value under "Transfer files with these prefixes" using the "Add Item" button shown above.

Once you have specified valid values, click Continue to provide details for the destination (GCS).

  • Add the destination (GCS) details here:

[Screenshot: gcs5]

In this screen, use the Browse button to explore GCS bucket(s). It will open a screen where you can either use an existing bucket or create a new one. In my case I wanted to create a new one, so here is what had to be done:

[Screenshot: gcs6]

Select the bucket from the list:

[Screenshot: gcs-select]

That will bring up the following page.

[Screenshot: gcs8]

You may type the GCS bucket name directly here, but ensure the path does NOT end with a forward slash (/). The other options (checkboxes) depend on your use case; check them as needed.

  • Now you can proceed with creating/executing the job:

[Screenshot: gcs10]

If all the information specified for the source and destination is valid, you will see the following screen, which is in effect a confirmation page:

[Screenshot: gcs confirmation]

The GCP Transfer service will now begin transferring data from S3 to GCS and report status on the Transfer Service page. I was amazed by the performance: it took less than 8 minutes to transfer 478 GB of data.

 

A possible error that is not captured in the documentation:

Permissions issue. You must be a bucket owner of the destination bucket. Ask a bucket owner or project owner to grant you this permission.

There are multiple reasons for this error. Some of them are genuinely permission related, but the error also occurs if the path ends with a forward slash, as shown below:

[Screenshot: gcs9]

In that case, the following message is what you can expect (unless the Google Cloud Platform team makes the error message more specific, or starts allowing a trailing forward slash at the end of the bucket path):

[Screenshot: gcs11-ErrorPage]
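Once the transfer succeeds and the files are in GCS, BigQuery can load them directly. As a rough sketch only (the dataset mydataset, the table readlog, and the GCS path below are assumptions based on the example above; the bq load command or the web UI are the more traditional alternatives to the LOAD DATA statement):

LOAD DATA INTO mydataset.readlog
FROM FILES (
  format = 'CSV',
  field_delimiter = '|',   -- matches the pipe-delimited UNLOAD output
  uris = ['gs://your-gcs-bucket/InputData/dataplatformexperts/*']
);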

 

Reference: 

https://cloud.google.com/bigquery/streaming-data-into-bigquery

https://cloud.google.com/storage/transfer/

 

 

[Amazon](500310) Invalid operation: CREDENTIALS argument is not supported when loading from file system

Sometimes we miss the reason for an error because of a small oversight. That's exactly what happened to me when I was loading data from S3 to Redshift. I searched Google for an answer and had no luck. Finally, I realized my mistake, so I thought I would share it in case it helps others who come across a similar situation:

My Scenario: Loading data from S3 to Redshift

Sample Syntax:

copy ReadLog(LogData)
FROM 'S3://bucketname/' CREDENTIALS 'aws_access_key_id=xxxxxxxxxxx;aws_secret_access_key=xxxxxxxxxx'

Error Message:

An error occurred when executing the SQL command:
copy ReadLog(LogData)
FROM 'S3://bucketname/' CREDENTIALS 'aws_ac…

[Amazon](500310) Invalid operation: CREDENTIALS argument is not supported when loading from file system;

Execution time: 0.08s

1 statement failed.

Root Cause:

The S3 scheme was typed in UPPER CASE ('S3://' instead of 's3://'). The error message pointing at the CREDENTIALS argument was therefore misleading; the credentials were not the problem.

Resolution:

Changed S3 to lower case, which gives the syntax shown below:

copy ReadLog(LogData)
FROM 's3://bucketname/' CREDENTIALS 'aws_access_key_id=xxxxxxxxxxx;aws_secret_access_key=xxxxxxxxxx'

Final Outcome:

Data loaded successfully with the following message returned:

Warnings:
Load into table 'readlog' completed, 1 record(s) loaded successfully.

0 rows affected
copy executed successfully

Execution time: 0.36s

 

Writing Stored Procedure in Netezza

The intent of this article is to help those who want to write a stored procedure with multiple DML operations in Netezza but do not have a background in stored procedure development on that platform. I did not find any article that helps a developer with an OLTP development background write a stored procedure in Netezza; there are many documents available on the subject, but none of them were helpful to me, so I thought I would put this together.

Scenario: Requirement was to transfer data between tables with following conditions:

  1. Delete corresponding records from Final table matching Stage table records
  2. Insert data into Final Table from Stage table
  3. Delete the data from the Stage table to prepare it for the next load

Solution: The following procedure was written as part of transferring data from Microsoft SQL Server to a Netezza database using Change Data Capture.

CREATE OR REPLACE PROCEDURE CHANGEDATACAPTURE_AUDIT_DI()
RETURNS BOOLEAN
LANGUAGE NZPLSQL AS
BEGIN_PROC

BEGIN

-- Remove rows from the final table that are about to be replaced by the staged rows
execute immediate 'delete from ADMIN."AUDIT" t1
where exists (select t2.AuditID from ADMIN."STGAUDIT" t2 where t1.AuditID = t2.AuditID)';

-- Load the staged rows into the final table
execute immediate 'insert into ADMIN."AUDIT" select * from ADMIN."STGAUDIT"';

-- Empty the stage table so it is ready for the next load
execute immediate 'delete from ADMIN."STGAUDIT"';

END;

END_PROC;

In the above example, AUDIT is the final table and STGAUDIT is the stage table. To run the script and create the stored procedure, follow the steps below:

  1. Select the entire script
  2. Execute it in a single batch as shown below (tool used for this example: Aginity Workbench).

 

Now, to execute the procedure, run the following command:

EXECUTE CHANGEDATACAPTURE_AUDIT_DI();
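As a quick sanity check after the procedure runs (these are purely illustrative queries against the two tables used above), the stage table should be empty and the final table should contain the merged rows:

SELECT COUNT(*) FROM ADMIN."STGAUDIT"; -- expected to return 0 after the procedure completes
SELECT COUNT(*) FROM ADMIN."AUDIT";    -- should now include the rows merged from the stage table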

 

Unable to enable CDC

Problem: Unable to enable CDC due to one of the following errors, even though the database owner is sa.

Error Description #1

Msg 3930, Level 16, State 1, Procedure sp_cdc_enable_db_internal, Line 178

The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.

Msg 22830, Level 16, State 1, Procedure sp_cdc_enable_db_internal, Line 186

Could not update the metadata that indicates database DPE_CustTicketDetail is enabled for Change Data Capture. The failure occurred when executing the command 'create user cdc'. The error returned was 916: 'The server principal "sa" is not able to access the database "DPEAudit" under the current security context.'. Use the action and error to determine the cause of the failure and resubmit the request.

Msg 266, Level 16, State 2, Procedure sp_cdc_enable_db_internal, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 0, current count = 1.

Msg 266, Level 16, State 2, Procedure sp_cdc_enable_db, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 0, current count = 1.

Msg 3998, Level 16, State 1, Line 1

Uncommittable transaction is detected at the end of the batch. The transaction is rolled back.

Error Description #2

Msg 3930, Level 16, State 1, Procedure sp_cdc_enable_db_internal, Line 178

The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.

Msg 22830, Level 16, State 1, Procedure sp_cdc_enable_db_internal, Line 186

Could not update the metadata that indicates database DPE_CustTicketDetail is enabled for Change Data Capture. The failure occurred when executing the command 'create user cdc'. The error returned was 208: 'Invalid object name 'DPEAudit.dbo.t_util_DatabaseChangeLog'.'. Use the action and error to determine the cause of the failure and resubmit the request.

Msg 266, Level 16, State 2, Procedure sp_cdc_enable_db_internal, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 0, current count = 1.

Msg 266, Level 16, State 2, Procedure sp_cdc_enable_db, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 0, current count = 1.

Msg 3998, Level 16, State 1, Line 1

Uncommittable transaction is detected at the end of the batch. The transaction is rolled back.

Error Description #3
 

Msg 22832, Level 16, State 1, Procedure sp_cdc_enable_table_internal, Line 607

Could not update the metadata that indicates table [dbo].[Users] is enabled for Change Data Capture. The failure occurred when executing the command 'sp_cdc_create_change_table'. The error returned was 916: 'The server principal "S-1-9-3-727003095-1134527967-2886334085-1972679761." is not able to access the database "DPEAudit" under the current security context.'. Use the action and error to determine the cause of the failure and resubmit the request.

Msg 266, Level 16, State 2, Procedure sp_cdc_enable_table_internal, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 0, current count = 1.

Msg 3930, Level 16, State 1, Procedure sp_cdc_enable_table, Line 61

The current transaction cannot be committed and cannot support operations that write to the log file. Roll back the transaction.

Msg 266, Level 16, State 2, Procedure sp_cdc_enable_table, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 0, current count = 1.

Msg 3998, Level 16, State 1, Line 1

Uncommittable transaction is detected at the end of the batch. The transaction is rolled back.

Root Cause: A trigger is enabled on the objects involved (in this scenario, a database-level DDL change-logging trigger).

Troubleshooting Steps:

Step 1: Query sys.triggers to verify the trigger and get its name (a sample query is shown below).
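Run something along these lines in the database you are trying to enable CDC on; database-level DDL triggers appear in sys.triggers with parent_class_desc = 'DATABASE':

SELECT name, is_disabled
FROM sys.triggers
WHERE parent_class_desc = 'DATABASE';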

Step 2: Disable the trigger:

DISABLE TRIGGER TrgDBChangeLog_TrackingDB ON DATABASE

Step 3: Now run the command to enable CDC (depending on the error description*):

exec sys.sp_cdc_enable_db

or

exec sys.sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'Users', @role_name = NULL
-- parameters shown here use the [dbo].[Users] table from Error Description #3; adjust the schema, table, and gating role for your environment

*Error Descriptions #1 and #2 apply when enabling CDC on a database. You may experience Error Description #3 while enabling CDC on tables.

However, if your error condition is similar to the following:

Msg 22830, Level 16, State 1, Procedure sp_cdc_enable_db_internal, Line 186

Could not update the metadata that indicates database DPE_CustTicketDetail  is enabled for Change Data Capture. The failure occurred when executing the command 'SetCDCTracked(Value = 1)'. The error returned was 15404: 'Could not obtain information about Windows NT group/user 'Domain\user', error code 0x5.'. Use the action and error to determine the cause of the failure and resubmit the request.

Msg 266, Level 16, State 2, Procedure sp_cdc_enable_db_internal, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 1, current count = 2.

Msg 266, Level 16, State 2, Procedure sp_cdc_enable_db, Line 0

Transaction count after EXECUTE indicates a mismatching number of BEGIN and COMMIT statements. Previous count = 1, current count = 2.

The above error can be resolved by changing the database owner to a valid login (such as sa), for example:

USE <Database Name>

GO

EXEC sp_changedbowner 'sa'

For details, refer to the article published at http://support.microsoft.com/en-us/kb/913423.

AWS – Introduction – Part 1

AWS (Amazon Web Services) has been a leader in providing infrastructure services since 2006. You can read about AWS at http://aws.amazon.com/about-aws/. The intent of this video is to introduce you to AWS and provide a high-level overview of some of the key services you may need if you decide to use it. If you are new to AWS and want to learn without spending money, visit http://aws.amazon.com/free/ for details.

 

AWS – Redshift

Amazon Redshift is one of the leading data warehouse (DWH) solutions in the cloud: a fast, fully managed data warehouse built on a shared-nothing, massively parallel processing (MPP), columnar-store database that supports columnar compression. Like many other AWS services, Redshift is fully managed and requires minimal admin effort to host.

Tracer Token

Please note that I originally authored this article for www.sqlservercentral.com, and it was featured in "Best of SQLServerCentral, Volume 6".

SQL Server 2005 has a lot of new and improved components, and the tracer token in replication is one of them. It allows us to validate connections and helps measure latencies between the publisher, distributor, and subscriber(s). This functionality enables an accurate calculation of latency, which ultimately helps identify which subscribers take longer than expected to receive a change from the publisher.

You may be wondering whether it puts extra load on your system or slows down replication. The answer is no, because it only writes a very small amount of data to the transaction log of the publication database.

I used this feature recently at a company that had transactional replication failures almost every night. SQL Server was reporting the following errors:

Query timeout expired
The agent failed with a 'Retry' status. Try to run the agent at a later time.

I did not want to set an arbitrary value for QueryTimeout without knowing what it should be, so I set up the tracer token feature and ran it for two days. Afterwards I knew what value to use and configured the system accordingly. Today is the fourth week using that setting, and it has not failed yet.

This is a wonderful new feature SQL Server 2005 introduced, and here is how you can implement it and start using it today.

Method #1. (Replication Monitor)

Launch Replication Monitor, expand the server, and then select the publication.

Click the Tracer Tokens tab in the right pane and then click Insert Tracer, as shown below.

After SQL Server sends the token through the replication process, it will display Latencies as shown below:

The latency is recorded when you click Insert Tracer. Every time you click Insert Tracer, a new date/time is added to the Time Inserted dropdown box, so you can view the latency for any point in time at which you inserted a tracer token.

The Publisher to Distributor column displays the time elapsed between a transaction being committed at the publisher and the corresponding command being entered in the distribution database.

The Distributor to Subscriber column displays the time elapsed before the transaction is committed at the subscriber.

Limitation of using Replication Monitor

There is a limitation to inserting tracer tokens through Replication Monitor: the tracer token has to be inserted manually, and you are limited to viewing one value at a time.

To overcome this limitation, there is another method to accomplish the same thing.

Method #2. (Through SQL Server Agent)

You can automate your tracer token insert using this method. You need to follow the steps below:

  • Create a SQL Server Agent Job on Publisher Server
  • Add a job step with the following T-SQL command (make sure the job step runs in the context of the publication database):

EXEC sp_posttracertoken @publication = 'xyz' -- substitute xyz with the name of your publication; a variation that also captures the posted token's id is sketched after this list

  • Schedule this Job to run at whatever interval you want to insert a token into your replication process.
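If you also want to capture the id of the token you just posted (handy when calling sp_helptracertokenhistory, covered below), a small variation like the following works; 'xyz' is again a placeholder for your publication name:

DECLARE @tokenid int;
EXEC sp_posttracertoken @publication = N'xyz', @tracer_token_id = @tokenid OUTPUT;
SELECT @tokenid AS tracer_token_id; -- keep this id if you want to query the history for this specific token later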

How to analyze these tracer values?

This data is stored in the distribution database. There are a few system-defined stored procedures available to view the recorded data:

  • sp_helptracertokens returns a row for every tracer token inserted at the publisher for the specified publication (a sample call follows this list)
  • sp_helptracertokenhistory returns latency details for a specific tracer token id
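For example (both procedures are run in the publication database at the publisher; 'xyz' is a placeholder publication name and 12345 is a placeholder tracer id taken from the first result set):

EXEC sp_helptracertokens @publication = N'xyz';
-- pick a tracer_id from the result set above, then:
EXEC sp_helptracertokenhistory @publication = N'xyz', @tracer_id = 12345;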

But I have simplified getting the data by writing the query below:

use distribution
GO
SELECT publisher_commit,
       distributor_commit,
       datediff(ss,publisher_commit,distributor_commit) 'Latency bw Pub and Dis',
       subscriber_commit,
       datediff(ss,distributor_commit,subscriber_commit) 'Latency bw Dis and Sub'
FROM MSTracer_tokens
  JOIN MSTracer_history ON tracer_id = parent_tracer_id
 

The result of this query will have the following data:

  • Date and time when the transaction was committed at the publisher
  • Date and time when the command for that transaction was inserted at the distributor
  • Time (in seconds) elapsed between the publisher and distributor commits
  • Date and time when the transaction was committed at the subscriber
  • Time (in seconds) elapsed between the distributor and subscriber commits

Conclusion

The tracer token is an excellent feature for measuring latencies and validating connections. It can be used to validate or tune agent profile settings and/or to identify timeout-related problems that can cause replication to fail.

Cassandra

Non-relational (NoSQL/Big Data) database systems are designed to address the 3 Vs of data (variety, velocity, volume) with high availability and scalability without compromising performance. In this short presentation, you will learn a few important facts about non-relational database systems and the theory behind them, with the main focus on Cassandra.