Warehouse Schema
От | Worky Workerson |
---|---|
Тема | Warehouse Schema |
Дата | |
Msg-id | ce4072df0605241150j6781291bjbd64b74655607e44@mail.gmail.com обсуждение исходный текст |
Ответы |
Re: Warehouse Schema
|
Список | pgsql-sql |
I'm developing a schema for a large data warehouse (10 billion records) and had a couple of questions about how to optimizeit. My biggest question is on the design of the fact table and how much normalization I should actually do of thedata. <br /><br />The data is going to be keyed by IP address, stored as a non-unique IP4 type. Each record is alsotagged with various attributes, such as a category and a type. Assuming that a category and type are VARCHAR, wouldit make sense to normalize these out of the fact table into their respective tables and key them by an INTEGER? I.e.<br/><br />CREATE TABLE big_fact_table_A (<br /> identifier IP4,<br /> data1 BYTEA,<br /> data2 BYTEA,<br/> ...<br /> dataN BYTEA,<br /> category VARCHAR(16),<br /> type VARCHAR(16)<br /> );<br/><br />... vs ...<br /><br />CREATE TABLE big_fact_table_B (<br /> identifier IP4,<br /> data1 BYTEA,<br /> data2 BYTEA,<br /> ...<br /> dataN BYTEA,<br /> category INTEGER REFERENCES categories (category_id),<br /> type INTEGER REFERENCES types (type_id)<br />);<br /><br />I figure that the normalized facttable should be quicker, as the integer is much smaller than the varchar. On query, however, the table will need tobe joined against the two other tables (categories, types), but I still figure that this is a win because the other tablesare fairly small and should stay resident in memory. Is this reasoning valid? <br /><br />The downside to this (frommy perspective) is that the data comes in the form of big_fact_table_A and could be easily COPYed straight into thetable. with big_fact_table_B it looks like I will have to do the "unJOIN" in a script. Also, I have separate installationsof the warehouse (with different data sources) and it will be difficult to share data between them unless theircategories/types tables are keyed with exactly the same integer IDs which, as I don't directly control the other installations,is not guaranteed. <br /><br />Any suggestions to the above "problems"?<br />
В списке pgsql-sql по дате отправления: