Thursday, February 9, 2012

Advice requested : Choosing the best Key for a table

I would appreciate good advice about the following Schema for a Table called
Messages. The table will hold archived messages from Microsoft Outlook. The
SQL 2005 schema that I have posted, is simply what the data looks like in
it's *raw* form. I intend to modify the schema in the most efficient way
possible. This table was created using MS SQL server 2005 and uses two if
the new (MAX) data types.
You can help me the most by focusing on the best Key for the Messages table.
The "natural key" for the table (in raw form), would be MessageID, so lets
discuss it's problems. The first problem is that varbinary(MAX) (new in
SQL2005), does not allow Primary Key constraints, and so it should be!
Obviously we will need to use a smaller DataType than varbinary(MAX), once
you pick something reasonable in size you are allowed to create PrimaryKey
constraint. MessageID's do not have a documented or published limit, they
can be *any* length. From my observations MessageID will typically be 70-340
bytes. So I suppose we could set varbinary(500). This would allow us to
handle most situations. But this is still way too long for a Primary Key in
my opinion. The other columns that are set as xxx(MAX) have similar
problems, ignore these columns for now until we find a decent Key for the
table.
In my opinion this Table does not have a "Natural Key" we can use it's
simply to long, and this table will have "hundreds of millions of rows", so
I am worried about JOIN performance. Once a message is archived (then
deleted), from it's original location and placed in SQL server, the
MessageID has no value and is never displayed to the user. There are more
complications because this table is replicated using (Merge replication).
The database is a distributed database, so using auto identity columns
complicate replication issues, not too mention irritating --Celko--. So I
really want to find a suitable surrogate key.
So what to do?
1. Use RSA to reduce MessageID to 20 bytes, and deal with the collisions?
2. Use a rowGuid (Outch!)
At this point I like option #1 the best because it works nicely with
"replication" as the original MessageID is a Globally Unique Identifier, so
no two MessageID's (PR_ENTRY_ID long term) will ever exist. But I think
there is are issues with generating these shorter keys using RSA, collisions
will happen. Additionally it is quite simple to encode the MessageID using
RSA. The MessageID would now be 20 bytes or [varbinary](20).
Does anyone have any good suggestions?
Thanks for your time.
Russell Mangel
Las Vegas, NV
-- Note: I commented out the Constraint, as varbinary(MAX) is not allowed.
-- This schema is just what the data looks like in the real world.
CREATE TABLE
[dbo].[Messages]
(
[MessageID] [varbinary](MAX) NOT NULL,
[ParentID] [varbinary](MAX) NOT NULL, -- FK for Folders Table
[SenderID] [int] NOT NULL, -- FK for Sender of message
[MessageType] [int] NOT NULL, -- FK MessageType
[Length] [bigint] NOT NULL,
[To] [nvarchar](MAX) NOT NULL,
[CC] [nvarchar](MAX) NOT NULL,
[Subject] [nvarchar](MAX) NOT NULL,
[Body] [ntext] NOT NULL,
[HasAttachment] [bit] NOT NULL,
-- CONSTRAINT [PK_Messages] PRIMARY KEY CLUSTERED
-- (
-- [MessageID]
-- )
)ON [PRIMARY]
-- End SchemaHi, Russel
Well, why do you want the MessageId column to be varbinary datatype? Any
Reason?
Sure, I don't know your business requirements so my guess is (if you do not
want to use an IDENTITYT)
BTW an IDENTITY in most cases is a good candidate for surrogate keys
CREATE TABLE Messages
(
MessageId INT NOT NULL PRIMARY KEY
)
Then you will need to generate a next MessageId by your self like
DECLARE @.MaxId INT
BEGIN TRAN
SELECT @.MaxId =COALESCE(MAX(MessageId),0) FROM Messages WITH
(UPDLOCK,HOLDLOCK)
INSERT INTO Messages SELECT @.MaxId ,....... FROM Table
COMMIT TRAN
"Russell Mangel" <russell@.tymer.net> wrote in message
news:%23JD8adcaGHA.3304@.TK2MSFTNGP04.phx.gbl...
>I would appreciate good advice about the following Schema for a Table
>called Messages. The table will hold archived messages from Microsoft
>Outlook. The SQL 2005 schema that I have posted, is simply what the data
>looks like in it's *raw* form. I intend to modify the schema in the most
>efficient way possible. This table was created using MS SQL server 2005 and
>uses two if the new (MAX) data types.
> You can help me the most by focusing on the best Key for the Messages
> table.
> The "natural key" for the table (in raw form), would be MessageID, so lets
> discuss it's problems. The first problem is that varbinary(MAX) (new in
> SQL2005), does not allow Primary Key constraints, and so it should be!
> Obviously we will need to use a smaller DataType than varbinary(MAX), once
> you pick something reasonable in size you are allowed to create PrimaryKey
> constraint. MessageID's do not have a documented or published limit, they
> can be *any* length. From my observations MessageID will typically be
> 70-340 bytes. So I suppose we could set varbinary(500). This would allow
> us to handle most situations. But this is still way too long for a Primary
> Key in my opinion. The other columns that are set as xxx(MAX) have similar
> problems, ignore these columns for now until we find a decent Key for the
> table.
> In my opinion this Table does not have a "Natural Key" we can use it's
> simply to long, and this table will have "hundreds of millions of rows",
> so I am worried about JOIN performance. Once a message is archived (then
> deleted), from it's original location and placed in SQL server, the
> MessageID has no value and is never displayed to the user. There are more
> complications because this table is replicated using (Merge replication).
> The database is a distributed database, so using auto identity columns
> complicate replication issues, not too mention irritating --Celko--. So I
> really want to find a suitable surrogate key.
> So what to do?
> 1. Use RSA to reduce MessageID to 20 bytes, and deal with the collisions?
> 2. Use a rowGuid (Outch!)
> At this point I like option #1 the best because it works nicely with
> "replication" as the original MessageID is a Globally Unique Identifier,
> so no two MessageID's (PR_ENTRY_ID long term) will ever exist. But I think
> there is are issues with generating these shorter keys using RSA,
> collisions will happen. Additionally it is quite simple to encode the
> MessageID using RSA. The MessageID would now be 20 bytes or
> [varbinary](20).
> Does anyone have any good suggestions?
> Thanks for your time.
> Russell Mangel
> Las Vegas, NV
>
> -- Note: I commented out the Constraint, as varbinary(MAX) is not
> allowed.
> -- This schema is just what the data looks like in the real world.
> CREATE TABLE
> [dbo].[Messages]
> (
> [MessageID] [varbinary](MAX) NOT NULL,
> [ParentID] [varbinary](MAX) NOT NULL, -- FK for Folders Table
> [SenderID] [int] NOT NULL, -- FK for Sender of message
> [MessageType] [int] NOT NULL, -- FK MessageType
> [Length] [bigint] NOT NULL,
> [To] [nvarchar](MAX) NOT NULL,
> [CC] [nvarchar](MAX) NOT NULL,
> [Subject] [nvarchar](MAX) NOT NULL,
> [Body] [ntext] NOT NULL,
> [HasAttachment] [bit] NOT NULL,
> -- CONSTRAINT [PK_Messages] PRIMARY KEY CLUSTERED
> -- (
> -- [MessageID]
> -- )
> )ON [PRIMARY]
> -- End Schema
>|||Can MessageID ever change for a given message? If not, then use it as the
natural key (as silly as this name sounds in this case). Put a unique
constraint on it, but create a surrogate key for joins.
Although everybody else will say that the GUID is a poor choice for a
primary key, I'd say use it but NEVER as a clustered key. Perhaps you'd have
more use for a clustered index on a datetime column (although I can't see on
e
in your proposed schema). Maybe you could also use an IDENTITY column - it
makes a great
candidate for a clustered index.
Of course a better option still would be to do more research and find a
better natural key (maybe even composite) candidate, then use a single colum
n
surrogate key for joins and references.
ML
http://milambda.blogspot.com/|||There is 1 other Column in the Messages table "Created" it's a DateTime,
that might help us find a better key.
Not sure what to call this Key as it has two columns, must be
Composite/Natural?
Is it possible for a person (SenderID), to send a message at the exact same
time?
Hmmmmmmm. Got to think about this for a bit, and maybe I can find another
column that I can use for Composite key.
-- Here is the new Schema,
CREATE TABLE
[dbo].[Messages]
(
[MessageID] [varbinary](max) NOT NULL,
[ParentID] [varbinary](max) NOT NULL,
[SenderID] [int] NOT NULL,
[MessageType] [int] NOT NULL,
[Length] [bigint] NOT NULL,
[To] [nvarchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[CC] [nvarchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Subject] [nvarchar](max) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Body] [ntext] COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[HasAttachment] [bit] NOT NULL,
[Created] [datetime] NOT NULL, -- New Column
CONSTRAINT [PK_Messages] PRIMARY KEY CLUSTERED
(
[SenderID] ASC,
[Created] ASC
)
)ON [PRIMARY]|||Russell Mangel wrote:
> I would appreciate good advice about the following Schema for a Table call
ed
> Messages. The table will hold archived messages from Microsoft Outlook. Th
e
> SQL 2005 schema that I have posted, is simply what the data looks like in
> it's *raw* form. I intend to modify the schema in the most efficient way
> possible. This table was created using MS SQL server 2005 and uses two if
> the new (MAX) data types.
> You can help me the most by focusing on the best Key for the Messages tabl
e.
> The "natural key" for the table (in raw form), would be MessageID, so lets
> discuss it's problems. The first problem is that varbinary(MAX) (new in
> SQL2005), does not allow Primary Key constraints, and so it should be!
> Obviously we will need to use a smaller DataType than varbinary(MAX), once
> you pick something reasonable in size you are allowed to create PrimaryKey
> constraint. MessageID's do not have a documented or published limit, they
> can be *any* length. From my observations MessageID will typically be 70-3
40
> bytes. So I suppose we could set varbinary(500). This would allow us to
> handle most situations. But this is still way too long for a Primary Key i
n
> my opinion. The other columns that are set as xxx(MAX) have similar
> problems, ignore these columns for now until we find a decent Key for the
> table.
> In my opinion this Table does not have a "Natural Key" we can use it's
> simply to long, and this table will have "hundreds of millions of rows", s
o
> I am worried about JOIN performance. Once a message is archived (then
> deleted), from it's original location and placed in SQL server, the
> MessageID has no value and is never displayed to the user. There are more
> complications because this table is replicated using (Merge replication).
> The database is a distributed database, so using auto identity columns
> complicate replication issues, not too mention irritating --Celko--. So I
> really want to find a suitable surrogate key.
> So what to do?
> 1. Use RSA to reduce MessageID to 20 bytes, and deal with the collisions?
> 2. Use a rowGuid (Outch!)
> At this point I like option #1 the best because it works nicely with
> "replication" as the original MessageID is a Globally Unique Identifier, s
o
> no two MessageID's (PR_ENTRY_ID long term) will ever exist. But I think
> there is are issues with generating these shorter keys using RSA, collisio
ns
> will happen. Additionally it is quite simple to encode the MessageID using
> RSA. The MessageID would now be 20 bytes or [varbinary](20).
> Does anyone have any good suggestions?
> Thanks for your time.
> Russell Mangel
> Las Vegas, NV
>
> -- Note: I commented out the Constraint, as varbinary(MAX) is not allowed
.
> -- This schema is just what the data looks like in the real world.
> CREATE TABLE
> [dbo].[Messages]
> (
> [MessageID] [varbinary](MAX) NOT NULL,
> [ParentID] [varbinary](MAX) NOT NULL, -- FK for Folders Table
> [SenderID] [int] NOT NULL, -- FK for Sender of message
> [MessageType] [int] NOT NULL, -- FK MessageType
> [Length] [bigint] NOT NULL,
> [To] [nvarchar](MAX) NOT NULL,
> [CC] [nvarchar](MAX) NOT NULL,
> [Subject] [nvarchar](MAX) NOT NULL,
> [Body] [ntext] NOT NULL,
> [HasAttachment] [bit] NOT NULL,
> -- CONSTRAINT [PK_Messages] PRIMARY KEY CLUSTERED
> -- (
> -- [MessageID]
> -- )
> )ON [PRIMARY]
> -- End Schema
You can add a constraint on MessageID like this:
CREATE FUNCTION dbo.ufn_sha_hash
(@.msg VARBINARY(8000))
RETURNS VARBINARY(20)
AS
BEGIN
RETURN
ISNULL(HASHBYTES('SHA1',@.msg),0x7C7CDB42
7449446E8EA812C5981C5D6A) ;
END
GO
CREATE TABLE dbo.Messages
(
MessageID varbinary(8000) NOT NULL,
HashID varbinary(20) NOT NULL PRIMARY KEY,
CHECK (HashID = dbo.ufn_sha_hash(MessageID)),
);
GO
Whether you should also use a surrogate key is a different question. It
would probably make sense to do so. Hope this helps.
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||> CREATE FUNCTION dbo.ufn_sha_hash
> (@.msg VARBINARY(8000))
> RETURNS VARBINARY(20)
> AS
> BEGIN
> RETURN
> ISNULL(HASHBYTES('SHA1',@.msg),0x7C7CDB42
7449446E8EA812C5981C5D6A) ;
> END
> GO
> CREATE TABLE dbo.Messages
> (
> MessageID varbinary(8000) NOT NULL,
> HashID varbinary(20) NOT NULL PRIMARY KEY,
> CHECK (HashID = dbo.ufn_sha_hash(MessageID)),
> );
> GO
> Whether you should also use a surrogate key is a different question. It
> would probably make sense to do so. Hope this helps.
> --
> David Portas, SQL Server MVP
I didn't know that SQL could do SHA1 hashes, thanks. This table is a
replicated table so once you add surrogate keys, you have to deal with that.
Personally I would rather have a binary(20) column which replaces MessageID
column completely than deal with the surrogate key problems with Replicated
(Merge) databases.
The Messages table also has a FK (FolderID) for Folders table, so this
column could also be removed and replaced by a binary(20) column. Folders
Table PK (FolderID)would be replaced with SHA1 binary(20) as well.
Which brings a couple questions:
1. If we are going to store an SHA1 hash, wouldn't it be best to use:
binary(20) instead of varbinary(20)?
2. Will I have any collisions (PK violations) with SHA1 hash (more on this
in next sentence)?
The MessageID is a variable length Binary byte array, and is a Globally
UniqueIdentifier (it's just an older form of GUID before GUID was born) this
MessageID is from MAPI (Exchange Server). So if you had 50 million
varbinary(MAX) MessageIDs, and you ran SHA1 hash against all of them would
you find duplicated SHA1 hash values?
Thanks
Russell Mangel|||Russell Mangel wrote:
> I didn't know that SQL could do SHA1 hashes, thanks. This table is a
> replicated table so once you add surrogate keys, you have to deal with tha
t.
> Personally I would rather have a binary(20) column which replaces Message
ID
> column completely than deal with the surrogate key problems with Replicate
d
> (Merge) databases.
> The Messages table also has a FK (FolderID) for Folders table, so this
> column could also be removed and replaced by a binary(20) column. Folders
> Table PK (FolderID)would be replaced with SHA1 binary(20) as well.
> Which brings a couple questions:
> 1. If we are going to store an SHA1 hash, wouldn't it be best to use:
> binary(20) instead of varbinary(20)?
> 2. Will I have any collisions (PK violations) with SHA1 hash (more on this
> in next sentence)?
> The MessageID is a variable length Binary byte array, and is a Globally
> UniqueIdentifier (it's just an older form of GUID before GUID was born) th
is
> MessageID is from MAPI (Exchange Server). So if you had 50 million
> varbinary(MAX) MessageIDs, and you ran SHA1 hash against all of them would
> you find duplicated SHA1 hash values?
> Thanks
> Russell Mangel
1. True. Although VARBINARY is functionaly equivalent because the
return value is always 20 bytes.
2. Not unless the MessageID exceeds 8000 bytes, which is the maximum
supported length for HashBytes. Although there is a theoretical risk of
a hash collision, you need 2^80 hashed messages on average before you
hit a duplicate. You are hashing GUIDs though. I don't know the GUIDs
used by Exchange but if they have similar qualities to the ones
generated by Windows then theoretically you are much, much more likely
to hit a duplicate GUID than a duplicate hash.
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--

No comments:

Post a Comment