diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a5d0c926de..8a68ff054e 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -1,4 +1,4 @@
-$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.4 2003/10/31 22:48:08 tgl Exp $
+$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.5 2003/11/14 04:32:11 wieck Exp $
 
 Notes about shared buffer access rules
 --------------------------------------
@@ -95,3 +95,155 @@ concurrent VACUUM.
 The current implementation only supports a single waiter for pin-count-1
 on any particular shared buffer.  This is enough for VACUUM's use, since
 we don't allow multiple VACUUMs concurrently on a single relation anyway.
+
+
+Buffer replacement strategy interface:
+
+The two files freelist.c and buf_table.c contain the buffer cache
+replacement strategy.  The interface to the strategy is:
+
+    BufferDesc *
+    StrategyBufferLookup(BufferTag *tagPtr, bool recheck)
+
+        This is always the first call made by the buffer manager
+        to check if a disk page is in memory.  If so, the function
+        returns the buffer descriptor and no further action is
+        required.
+
+        If the page is not in memory, StrategyBufferLookup()
+        returns NULL.
+
+        The flag recheck tells the strategy that this is a second
+        lookup after flushing a dirty block.  If the buffer manager
+        has to evict another buffer, it will release the bufmgr lock
+        while doing the write I/O.  During this time, another backend
+        could possibly fault in the same page this backend is after,
+        so we have to check again after the I/O is done whether the
+        page is in memory now.
+
+    BufferDesc *
+    StrategyGetBuffer(void)
+
+        The buffer manager calls this function to get an unpinned
+        cache buffer whose content can be evicted.  The returned
+        buffer might be empty, clean or dirty.
+
+        The returned buffer is only a candidate for replacement.
+        It is possible that while the buffer is being written, another
+        backend finds and modifies it, so that it is dirty again.
+        The buffer manager will then call StrategyGetBuffer()
+        again to ask for another candidate.
+
+    void
+    StrategyReplaceBuffer(BufferDesc *buf, Relation rnode,
+                          BlockNumber blockNum)
+
+        Called by the buffer manager at the time it is about to
+        change the association of a buffer with a disk page.
+
+        Before this call, StrategyBufferLookup() still has to find
+        the buffer even if it was returned by StrategyGetBuffer()
+        as a candidate for replacement.
+
+        After this call, this buffer must be returned for a
+        lookup of the new page identified by rnode and blockNum.
+
+    void
+    StrategyInvalidateBuffer(BufferDesc *buf)
+
+        Called from various places to inform the strategy that the
+        content of this buffer has been thrown away.  This happens
+        for example when a relation is dropped.
+
+        The buffer must be clean and unpinned on call.
+
+        If the buffer is associated with a disk page,
+        StrategyBufferLookup() must not return it for this page
+        after the call.
+
+    void
+    StrategyHintVacuum(bool vacuum_active)
+
+        Because VACUUM reads all relations of the entire database
+        through the buffer manager, it can greatly disturb the
+        buffer replacement strategy.  This function is used by VACUUM
+        to inform the strategy that all subsequent buffer lookups are
+        caused by VACUUM scanning relations.
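+
+The following fragment is a minimal sketch, not the actual bufmgr.c
+code, of how the buffer manager might drive this interface when it
+needs a disk page.  The routine fetch_page() and the helpers
+flush_buffer() and read_block() are hypothetical stand-ins for the
+real read/write paths, and the dirty test is reduced to a check of
+the BM_DIRTY flag:
+
+    static BufferDesc *
+    fetch_page(Relation reln, BlockNumber blockNum)
+    {
+        BufferTag   tag;            /* identifies relation and block */
+        BufferDesc *buf;
+        bool        recheck = false;
+
+        /* fill in tag from reln and blockNum (details omitted) */
+
+        for (;;)
+        {
+            /* Is the page already cached? */
+            buf = StrategyBufferLookup(&tag, recheck);
+            if (buf != NULL)
+                return buf;             /* hit, nothing more to do */
+
+            /* Miss: ask the strategy for an eviction candidate. */
+            buf = StrategyGetBuffer();
+
+            /*
+             * A dirty candidate must be written out first.  The bufmgr
+             * lock is released during the write, so another backend may
+             * fault in the wanted page meanwhile; therefore loop back
+             * and repeat the lookup with recheck = true.
+             */
+            if (buf->flags & BM_DIRTY)
+            {
+                flush_buffer(buf);      /* hypothetical write-out */
+                recheck = true;
+                continue;
+            }
+
+            /*
+             * Re-point the buffer at the new page; from here on the
+             * strategy must return this buffer for lookups of
+             * (reln, blockNum).  Then read the page in.
+             */
+            StrategyReplaceBuffer(buf, reln, blockNum);
+            read_block(reln, blockNum, buf);    /* hypothetical read-in */
+            return buf;
+        }
+    }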
+
+
+Buffer replacement strategy:
+
+The buffer replacement strategy actually used in freelist.c is a
+version of the Adaptive Replacement Cache (ARC) specially tailored
+for PostgreSQL.
+
+The algorithm works as follows:
+
+    C is the size of the cache in number of pages (conf: shared_buffers).
+    ARC uses 2*C Cache Directory Blocks (CDB).  A cache directory block
+    is always associated with one unique file page and "can" point to
+    one shared buffer.
+
+    All file pages known to the directory are managed in 4 LRU lists
+    named B1, T1, T2 and B2.  The T1 and T2 lists are the "real" cache
+    entries, linking a file page to a memory buffer where the page is
+    currently cached.  Consequently T1len+T2len <= C.  B1 and B2 are
+    ghost cache directories that extend T1 and T2 so that the strategy
+    remembers pages longer.  The strategy tries to keep B1len+T1len and
+    B2len+T2len both at C.  T1len and T2len vary over the runtime
+    depending on the lookup pattern and its resulting cache hits.  The
+    desired size of the T1 list is called T1target.
+
+    Assuming we have a full cache, one of 5 cases happens on a lookup:
+
+    MISS    On a cache miss, depending on T1target and the actual T1len,
+            the LRU buffer of T1 or T2 is evicted.  Its CDB is removed
+            from the T list and added as MRU of the corresponding B list.
+            The now free buffer is loaded with the requested page
+            and added as MRU of T1.
+
+    T1 hit  The T1 CDB is moved to the MRU position of the T2 list.
+
+    T2 hit  The T2 CDB is moved to the MRU position of the T2 list.
+
+    B1 hit  This means that a page that was evicted from the T1
+            list is now requested again, indicating that T1target is
+            too small (otherwise it would still be in T1 and thus in
+            memory).  The strategy raises T1target, evicts a buffer
+            depending on T1target and T1len, and places the CDB at the
+            MRU position of T2.
+
+    B2 hit  This is the opposite of a B1 hit: the T2 list is probably
+            too small.  So the strategy lowers T1target, evicts a
+            buffer, and places the CDB at the MRU position of T2.
+
+    Thus, every page that is found on lookup in any of the four lists
+    ends up as the MRU of the T2 list.  The T2 list therefore is the
+    "frequency" cache, holding frequently requested pages.
+
+    Every page that is seen for the first time ends up as the MRU of
+    the T1 list.  The T1 list is the "recency" cache, holding recent
+    newcomers.
+
+    The tailoring done for PostgreSQL has to do with the way the
+    query executor works.  A typical UPDATE or DELETE first scans the
+    relation, searching for the tuples, and then calls heap_update() or
+    heap_delete().  This causes at least 2 lookups for the block in the
+    same statement; with multiple matches in one block there are even
+    more.  As a result, every block touched in an UPDATE or DELETE
+    would jump directly into the T2 cache, which is wrong.  To prevent
+    this, the strategy remembers which transaction added a buffer to
+    the T1 list and will not promote it from there into the T2 cache
+    during the same transaction.
+
+    Another special case is the behavior of the strategy during VACUUM.
+    Lookups during VACUUM do not represent application needs, so it
+    would be wrong to change the cache balance T1target because of them
+    or to cause massive cache evictions.  Therefore, a page read in to
+    satisfy VACUUM (as opposed to one that actually causes a hit on any
+    list) is placed at the LRU position of the T1 list, for immediate
+    reuse.  Since VACUUM usually requests many pages very quickly, the
+    natural side effect of this is that it will get back the very
+    buffers it filled and possibly modified on the next call and will
+    therefore do its work in a few shared memory buffers, while using
+    whatever it finds in the cache already.
+
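+The case analysis above can be condensed into the following sketch.
+It is not the actual freelist.c code: strategy_access(), cdb_alloc(),
+cdb_move() and replace_one_buffer() are hypothetical helpers, list
+maintenance and CDB recycling are omitted, and the adjustment of
+T1target is simplified to a fixed step of one:
+
+    #include <stdbool.h>
+
+    typedef enum { STRAT_T1, STRAT_T2, STRAT_B1, STRAT_B2, STRAT_NONE } StratList;
+
+    typedef struct CDB
+    {
+        StratList   list;       /* list this CDB currently lives in */
+        int         t1_xid;     /* transaction that inserted it into T1 */
+    } CDB;
+
+    static int  T1target;      /* desired number of buffers in T1 */
+
+    /* hypothetical helpers */
+    extern CDB *cdb_alloc(int xid);             /* new CDB for a missed page */
+    extern void cdb_move(CDB *cdb, StratList list, bool to_mru);
+    extern void replace_one_buffer(void);       /* evict LRU of T1 or T2,
+                                                 * move its CDB to B1/B2 */
+
+    static void
+    strategy_access(CDB *cdb, int cur_xid, bool vacuum_active)
+    {
+        switch (cdb ? cdb->list : STRAT_NONE)
+        {
+            case STRAT_T1:
+                /*
+                 * Recency hit.  PostgreSQL tweak: do not promote into T2
+                 * when the same transaction put the page into T1, so an
+                 * UPDATE/DELETE touching the block twice does not inflate
+                 * the frequency cache.
+                 */
+                if (cdb->t1_xid != cur_xid)
+                    cdb_move(cdb, STRAT_T2, true);
+                break;
+
+            case STRAT_T2:
+                /* Frequency hit: refresh the MRU position. */
+                cdb_move(cdb, STRAT_T2, true);
+                break;
+
+            case STRAT_B1:
+                /* Ghost hit: T1 was too small, grow its share. */
+                T1target++;
+                replace_one_buffer();
+                /* (read the page into the freed buffer -- omitted) */
+                cdb_move(cdb, STRAT_T2, true);
+                break;
+
+            case STRAT_B2:
+                /* Ghost hit: T2 was too small, shrink T1's share. */
+                if (T1target > 0)
+                    T1target--;
+                replace_one_buffer();
+                cdb_move(cdb, STRAT_T2, true);
+                break;
+
+            case STRAT_NONE:
+                /*
+                 * Complete miss: evict a buffer, allocate a CDB and put
+                 * the page into T1.  Pages read in only for VACUUM go to
+                 * the LRU end so their buffers are reused immediately.
+                 */
+                replace_one_buffer();
+                cdb = cdb_alloc(cur_xid);
+                cdb_move(cdb, STRAT_T1, !vacuum_active);
+                break;
+        }
+    }
+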