Thursday, October 18, 2007

The 4D SQL Database Engine in Depth

These are notes from the "The SQL Database Engine in Depth" presentation given by Laurent Ribardière (President of 4D Group) & Larent Esnault (VP of Research and Development), 4D SAS at 4D Summit 2007.

Laurent Ribardière and Larent Esnault giving presentation on 4D v11 SQL at 4D Summit 2007

The process started 10 years ago with "4D Universal". We still don't have the new language, but it will be coming "soon".

New limits...

Data file 257 -> limited by OS (currently)
Tables 255 -> 32,767 (limited by integer parameters used by plug-ins)
Fields/Table 511 -> 32,767
Records/Table 16M -> 1B
Index keys/table 16M -> 128B
Alpha Length 80 -> 255
Text Length 32kb -> 2G

There are two types of text fields - ones stored in the record, and ones that are stored in BLOBs outside the record. Ones in the record have more indexing options, etc. Ones stored out of the record help with performance (most times).

The 4D database engine now has two access points - the 4D language and SQL.

Indexes are now in a separate file from data.

Index types in v11:
  • BTree
  • Cluster (new)
  • Composite (new)
  • FullText (new)
Cluster indexes are great when the number of unique values is few. They're far more efficient for these circumstances than BTree indexes. For example color: blue, red, yellow. 4D's cluster indexes can be either based on bit tables or longints - 4D picks whichever is more effient for your circumstance. When bit tables are used, the bit tables are compressed making the indexes even smaller and more efficient. He showed an example where using a BTree the index was nearly 3GB, where using a cluster the index was 38MB. Because the index is larger it takes longer to work with it. A query which took 50 seconds with a BTree index took 1/2 second using the cluster index.

Composite indexed is an index that uses more than one column. Cluster indexes can be BTrees or Cluster indexes. A composite index is about the data source for the index. In the past the data source was a single field. With composite indexes you can use multiple fields as the data source. Using a composite index is more efficient than doing two queries with an AND since it does have to look at two indexes and examine the overlap in the results of the two queries. The biggest benefit is when each of the query arguments returns large number of records.

FullText indexes index parts of a text field - each word is indexed separately. If you use wildcards it does not use the FullText index. FullText is for finding whole words, not parts of words. However, starts with queries (wildcard at the end) are fine and do use the FullText index.

DISTINCT VALUES can be used on text fields with FullText indexes and it will return all the different words used in the text field for a selection of records.

QUERY BY FORMULA([Table];[Table]TextField%$keyValue)

Notice the use of % which used in this context now means find a keyword within a FullText indexed text field.

Both SQL and 4D queries are analyzed and then fed into a "query optimizer" which determines the most efficient way to execute the query. That feeds a "query plan" and only after that is the query executed.

In SQL the relation between tables is defined by the query itself. In 4D the relation between tables is defined in the structure.

In v11.1 you'll be able to view the query plan for a query. If you see problems, report them to 4D and it will go into how they refine the product.

To create a composite index, select two fields in the structure and choose "create composite index".

Related queries (queries using values from the related one record) are much more efficient in v11. It operates like SQL would expect, not sequentially as was the case previously.

It's important to keep in mind that the query path may vary for more or less the same query depending on the data and the query parameters.

QUERY BY FORMULA is now fully optimized. In fact it's recommended, whereas before you were told to only use it when you absolutely had to. Now it will know what it can use indexes on and what it can't and it will keep the slow bits as minimal as possible.

It's important to note that SQL queries never use the relations between tables that you draw in the structure. You can also override the drawn relations using QUERY BY FORMULA.

Queries now find the shortest path between two tables (if there are multiple paths between the tables). This is different than how it was before. For example, let's say you have three tables - employees, companies, and cities. There's a difference between employees working in LA, and employees living in LA. Because [Employees] is directly related to [Cities] by default you'll get employees living in LA. To get employees working in LA you'll need to include the following in the query:

... & [Company]CityID=[City]ID & ...

That's similar to how things are done in SQL, only you're doing it with a 4D query.

BLOBs, Images and Text fields are now stored outside of the record by default. This makes it much quicker to load the record. For text you can choose to store it in or out of the record. In the record is better when the text is small. If you're doing sorting, you'll want to keep it with the record.

You can have both traditional and FullText indexes on the same field.

Null values are now supported. Null is not a value, it's a state - the value is unknown. Null is not zero nor is it an empty or blank value. Two new commands now support null - Is Field Value Null, and SET FIELD VALUE NULL.

The field property Reject NULL value input will throw an error if the field is null.

The field property Mandatory is there for compatibility reasons and is not the same as Reject NULL value input. Mandatory is used during data entry only.

Nested transactions allow you to roll back children transactions that were validated by cancelling the parent transaction.

SET QUERY AND LOCK in 4D is the same as SELECT FOR UPDATE in SQL - both will lock the records for the duration of the transaction they're within. If it can't lock all of the records the query fails, nothing is returned by the query, and the problem records are put into an exception set.

Implications of deleting tables and fields...
  • Count Tables has been renamed Get Last Table Number.
  • Count Fields has been renamed Get last Field Number
  • New commands: Is Table Number Valid, and is Field Number Valid
  • When there's a "hole" in the numbering the tables above are not renumbered, instead the hole remains. However, the next table that's created will use the table number.
  • There was some discussion later in the day about whether the same is true for fields. Apparently it's supposed to be, but it was a bug that was fixed days before the product was released and there was some discussion as to whether the bug was fixed or the documentation changed.
There is now versioning for the structure file and data file. There is now a strong link between the data file and the index files. If you replace index file with an older one it will consider it invalid and rebuild it.

When you add tables dynamically in a production database it can create problems when you put up a new version that doesn't have the new tables. Internal references to tables no longer use table numbers. If the data file has tables the structure has it will not be deleted, nor will the new production tables be overwritten by new tables in the structure file - the data will be retained. Future versions (v11.1?) will let you restore these tables. Compacting the data will delete the data however.

It's also important to know that when you delete a table the data related to the table will be deleted.

Unicode is a universal character set for all languages. 4D supports UTF-16, not UTF-32.

The database engine always supports Unicode. When you convert a database it's in compatibilty mode - get out of this mode as quickly as possible since there is a penalty for the engine converting to Unicode on the fly.

The only drawback with Unicode is the data file can grow - up to 2 times larger in some cases.

The database engine is 64 bit.

There have been improvements to virtual memory and dynamic memory management. The memory allocation scheme now avoids sequential searches to find empty space - which was a problem in the past. The new caching scheme minimizes disk I/O. So frequently accessed objects will stay in the cache longer.

The query plan is different than the query path. The query path is how the query will be done. The query path details the results as well with how many records were found and the time taken for the query.

You'll notice with query paths that the query plans get modified based on the results as it goes through the plan. So if it gets to a point where there is say only one record it will want to act differently than if there were many records.

You'll also see in the query plans that unique field are used first many times because you're more likely just return one record. There are a lot of rules like these that go into how query plans are set up.

Q&A

Query by Formula will be beefed up to support more operators.

4D Open will will most likely not work with the new 4D Server v11, but they're seeing what they can do. But don't assume it will work. However, there will be direct 4D to 4D access which could replace the functionality.

Labels: , ,

Digg It!  Add to del.icio.us  Add to StumbleUpon  Add to Reddit  Add to Technorati  Add to Furl  Add to Netscape

0 Comments:

Post a Comment

Links to this post:

Create a Link

<< Home