By Yui Wah LEE
RVM is designed as a run-time library to be linked with an external application. It allows the application to benefit from transactional support. In particular, it allows operations on virtual memory to be atomic and persistent, provided that the application declares its intentions through the RVM application programming interface.
Figure fig_overview shows the three important components of RVM: virtual memory, data segment and log.
Virtual memory is where the application manipulates its data. RVM allows the application to declare regions of virtual memory to be associated with regions in external data segments. The data segments are the persistent backing store of the virtual memory regions: the latest committed image of a virtual memory region has a copy in the data segment. Since the data segment is on stable storage (disk), it is persistent across program crashes. Should an application crash, it can always recover those memory regions from the data segment. The association of virtual memory regions with segment regions is called mapping.
For performance reasons, RVM does not write the committed image of virtual memory regions directly back to the data segment. Doing so would generate excessive seek operations on the magnetic head of the disk, since the access pattern can be random. Disk head seeks are expensive operations and should be minimized. In the case of RVM, this is achieved by writing the new values of memory ranges as records in an append-only, sequential log. This is called a flush operation. The log is also on stable storage (disk or battery-backed ramdisk). Once the new values are written to the log (flushed), they are persistent across program crashes. From time to time, records in the log are processed and applied to the data segment. After that, the log space used by the records can be reclaimed for further use. This is called log truncation. Log truncation is necessary; without it, the log would grow unboundedly.
To facilitate our discussion, let us first briefly review the RVM application programming interface. Details of the interfaces can be found in the chapter on RVM Library Functions.
rvm_options_t options;
char *version;
rvm_region_t region;
rvm_tid_t *tid;
char *dest, *src;
unsigned long length;

rvm_initialize(version, options);
rvm_map(region, options);
rvm_begin_transaction(tid, restore);
rvm_begin_transaction(tid, no_restore);
rvm_set_range(tid, dest, length);
rvm_modify_bytes(tid, dest, src, length);
rvm_end_transaction(tid, flush);
rvm_end_transaction(tid, no_flush);
rvm_abort_transaction(tid);
rvm_flush();
rvm_truncate();
rvm_unmap(region);
rvm_terminate();
Regions are mapped by calling rvm_map(). There are two possible modes for rvm_map(): copy and no_copy. The former is the default and will cause the corresponding region in the segment device to be read into virtual memory. The latter is used when the application is sure the image in the segment device is useless (e.g., the application is going to replace it immediately).
The application declares the beginning of a transaction by calling rvm_begin_transaction(). The transaction can be committed by calling rvm_end_transaction(), or aborted by calling rvm_abort_transaction(). In addition, the application must inform RVM of the memory locations that will be affected by the transaction. This can be done by calling rvm_set_range() or rvm_modify_bytes(). RVM guarantees that if the transaction commits, all the modifications made on the declared ranges will take effect atomically; conversely, if the transaction aborts, none of the modifications will take effect.
Normally, when a transaction is aborted, the old values in the affected memory ranges are restored. So a default option for rvm_begin_transaction() is restore. However, in some cases the application can be sure that it will never abort the transaction. For example, if there are no concurrent transactions, the application can be sure it will never abort the transaction explicitly. For these cases, the application can start the transaction in no_restore mode.
If the application wants the committed transaction to be recorded in the log device (stable storage) immediately, it should commit the transaction in flush mode. In this mode, the rvm_end_transaction() call returns only after the transaction records are written to disk. However, there are cases when the application can accept a slight chance of loss of permanence. In those cases, it can end the transaction in no_flush mode. There is a big payoff in performance, as the function rvm_end_transaction() will return immediately, without having to wait for the slow disk I/O to complete. RVM will put the committed transaction in a queue and write it out to the log device later. This is called lazy commit, and it provides bounded persistence. The committed transactions may be lost if the program crashes before the queued transactions get a chance to be written out to the log. Note that although the transaction is lost, it is lost atomically. Upon the next restart of the application, RVM will restore an old consistent view of the virtual memory.
Committing transactions lazily also gives more opportunity for inter-transaction optimization.
To force all previously committed transaction records to be written out to the log device, the application can call rvm_flush(). Also, whenever the application commits a new transaction in flush mode, there is a side effect that all previous transactions in the queue will also be flushed.
The log device is only temporary stable storage for committed transactions: its primary purpose is to allow records to be written to disk sequentially. Eventually, these records need to be applied to the data segment. The process of applying transaction log records to the data segment is called log truncation. RVM will carry out log truncation automatically when the log is too full. Users can specify the option truncate to set the threshold. Alternatively, the application can initiate a truncation by calling rvm_truncate(). Also, note that RVM will do a log truncation upon start-up and upon rvm_map() operations. This ensures the data segment has the latest updates before it is mapped into virtual memory.
The log device can be either a regular file or a raw device partition.
Metadata of the log are collectively recorded in the log status block, which is located at the beginning of the log. It is offset by either FILE_STATUS_OFFSET or RAW_STATUS_OFFSET, depending on the type of the log device. The two constants are currently 0 and 8192 bytes (16 sectors), respectively.
Log records are appended sequentially in the log area after the log status block. The area starts at log_start and is of size log_size. (Fig fig_log_struct)
As new records are appended, the log pointer log_tail is advanced. log_head points to the first record in the log that is not yet subject to truncation (the current epoch). If truncation is not in progress, both of the pointers prev_log_head and prev_log_tail have a zero value. (Fig fig_log_trunc b)
Currently, the log is truncated in a way called epoch truncation. In one atomic operation, the function new_epoch() brackets all the current log records with prev_log_head and prev_log_tail; those records are said to be in the truncation epoch and are subject to truncation. At the same time, log_head is advanced to log_tail, effectively resetting the current epoch to be empty. New log records may be appended by concurrent threads while truncation is in progress, and log_tail will advance to a new position. (Fig fig_log_trunc a)
It is important to use the log pointers to delimit the log into a truncation epoch and a current epoch, as we do not want the log truncation process to interfere with the log flushing process.
If the truncation goes well and all the records are successfully applied to the data segment, the log pointers prev_log_head and prev_log_tail are reset to zero. The area that was once occupied by the truncated records is then reclaimed.
However, it is possible that the program crashes while truncation is in progress. RVM will detect this upon the next restart, by finding the two log pointers prev_log_head and prev_log_tail non-zero. It will then redo the truncation. Truncations can always be redone, as redoing one just writes the same values to the same locations again.
There are five different types of log records. Each of them has a corresponding in-core structure defined. They are listed in the table record-type.
The generic record (rec_hdr_t) does not actually exist in the log. It contains the first four fields common to the five 'real' record types. When RVM scans the log and finds a record, it may not know the type of the record. It can always cast the record as a generic record, and then interpret the first four fields to determine the appropriate action. For example, it can determine the type of the record by looking at the first field. (The common four fields are struct_id, rec_length, timestamp and rec_num.)
In the five 'real' record types, type-specific fields follow the common four fields. For example, below we show the declarations of rec_hdr_t and trans_hdr_t.
typedef struct {
    struct_id_t struct_id;       /* type of entry */
    rvm_length_t rec_length;     /* record length */
    struct timeval timestamp;    /* timestamp of record entry */
    long rec_num;                /* record number of entry */
} rec_hdr_t;

typedef struct {
    struct_id_t struct_id;       /* self-identifier */
    rvm_length_t rec_length;     /* log record length, displacement to end mark */
    struct timeval timestamp;    /* timestamp of log entry */
    rvm_length_t rec_num;        /* record number of log entry */
    rvm_length_t num_ranges;     /* number of ranges in record */
    struct timeval uname;        /* uname of transaction */
    struct timeval commit_stamp; /* timestamp of commit */
    rvm_length_t n_coalesced;    /* count of coalesced transactions */
    rvm_length_t flags;          /* mode and optimization flags */
} trans_hdr_t;
trans_hdr_t, nv_range_t and rec_end_t represent the transaction header, the new value range header and the record end marker. They make up the majority of the records in the log. Each transaction is written as a sequence of records: it begins with a transaction header, is followed by one or more sub-records (the new value range records), and ends with a record end marker (Fig fig_log_rec). This is the simple case, in which the transaction is not split by an intervening wrap marker. If the transaction is split by a wrap marker, it will have two sequences of records, each of which begins with a transaction header and ends with an end marker. Details of log wrapping are discussed in section log-wrap.
New value records have variable lengths; each of them begins with a new value range header followed by the actual data. The actual data is the image of the virtual memory range that the transaction has committed. The header is of type nv_range_t and is 56 bytes long; the actual data length varies. Note that RVM may pad some bytes after the actual data so that the next sub-record is word-aligned.
To allow RVM to scan the log in both the forward and backward directions, different forward and backward links are provided. Generally speaking, rec_length is a forward link. For a transaction header, it is the forward link to the end marker; for a new value record, it is the forward link to the next new value record (or the record end marker). However, for the record end marker, this value is the backward link to the transaction header. (RVM is somewhat inconsistent in naming here.) On the other hand, sub_rec_len is a backward link to the previous sub-record. In the new value range header, there is a third length field, length: it is the length of the actual data. length may not be equal to rec_length because of the padding. See Fig fig_rec_link for the details of those links.
The fourth possible record type that can be found in the log is the wrap marker. Note that wrapping may not happen exactly at the end of the log; it may happen a little earlier. In the latter case, the bytes following the wrap marker are all written with the value 0xff.
The fifth possible record type is the special log record. Currently, the only special log record is the segment mapping descriptor (of type log_seg_t). RVM allows more than one data segment per log. To support this, every new value range is tagged with a seg_code which indicates the segment to which that range should be applied. seg_code is only a shorthand for the segment. Details of the segment (num_bytes and name) are recorded in the segment mapping descriptor. Segment mapping descriptors are inserted into the log whenever a new data segment is mapped for the first time. At the time of log truncation, RVM uses those segment mapping descriptors to reconstruct the information necessary to open the segment device.
When log_tail approaches the end of the log, RVM will choose an appropriate point to do a wrap. This is done by writing out a wrap marker. After that, RVM will start appending new records from log_start. There is no risk of overwriting useful records, because the old records should have been truncated long before. Even if they have not been truncated yet, RVM will call an internal function, wait_for_space(), to force a truncation.
Log wrapping can happen between transactions. In this case, RVM simply leaves a wrap marker (of type log_wrap_t) and starts a brand new transaction at the beginning of the log, log_start (Fig fig_wrap a).
It can also happen within a transaction. There are two possibilities: between different new value records, or splitting a new value record. Since a range can be very long, splitting new value records avoids leaving too much unused space by inserting the wrap marker too early.
In the first case, after the current new value records are written, a record-end marker is inserted into the log, followed by a log-wrap marker. Another transaction header is then inserted at log_start before the remaining new value records are written. Effectively, the same transaction has two transaction headers: the first is written before log wrapping, the second after. The flags field of the transaction headers helps indicate this situation. The first transaction header will have only FIRST_ENTRY_FLAG set, while the second transaction header will have only LAST_ENTRY_FLAG set. (If the whole sequence of records for the transaction is not interrupted by a log wrapping, both FIRST_ENTRY_FLAG and LAST_ENTRY_FLAG of the transaction header will be set.) (Fig fig_wrap b)
In the latter case, where a new value range is split, each of the two portions of the data has its own new value range header. The first portion of the split range is written, followed by a record-end marker and a wrap marker. The second transaction header is then written at log_start, followed by the second portion of the split range. The remaining new value records then follow. There is a flag is_split in the new value record header which indicates that this has happened. (Fig fig_wrap c) The flags FIRST_ENTRY_FLAG and LAST_ENTRY_FLAG will also be set as in the previous case.
Before we describe the important data structures used by RVM, we first discuss some of the building blocks of these data structures. In the next section we discuss the generic list and generic tree data structures. In the section after that we discuss the synchronization primitives used by RVM that enable multi-threading.
Many of the RVM data structures are organized as lists or trees. To support this, RVM internally defines a generic list type as well as a generic tree type.
The generic list entry is declared as list_entry_t, while the generic tree node is declared as tree_node_t. Supporting routines for them are move_list_entry(), tree_insert(), etc.
Each of these generic list/tree entries contains enough topological information for it to be manipulated as a list node or tree node. Generic list and tree entries are typically not allocated by themselves; rather, they are the first member of other specific data structures that need to be organized. For example, the region descriptor region_t is declared with list_entry_t as the first member of the structure. That way, the region descriptor can be manipulated as a list element. We can do the following:
seg_t *seg;
region_t *region;
move_list_entry(NULL, &seg->map_list, &region->links);

to move a region descriptor region to the mapped region list (map_list) in the segment descriptor seg.
RVM is designed to be thread-safe: the application is responsible for protecting access to its own data structures by concurrent threads, but can otherwise use RVM functions freely. RVM synchronizes its internal operations. This is done by using the following three primitives:
Internal data which may be accessed concurrently are protected by mutexes of type RVM_MUTEX. Routines accessing those data are required to do a mutex_lock() before, and a mutex_unlock() after. Alternatively, they can use the macro CRITICAL() to bracket their access to critical data.
RVM_MUTEX lck;
mutex_init(&lck);       /* initialization */
...
mutex_lock(&lck);       /* begin lck crit. sec. */
/* body */;             /* code in lck crit. sec. */
mutex_unlock(&lck);     /* end lck crit. sec. */

/* alternatively, use the macro */
CRITICAL(lck,           /* begin lck crit. sec. */
    /* body */;         /* code in lck crit. sec. */
);                      /* end lck crit. sec. */
Sometimes data can be read by many threads simultaneously, provided none of them attempts to update the data. This kind of concurrency is supported by the read/write lock of type rw_lock_t. Readers access the data with rw_lock(...,r); they are allowed to proceed immediately if there is no writer, otherwise they are blocked until the writer releases the lock. Writers access the data with rw_lock(...,w); they are allowed to proceed immediately if there are no readers or writers, otherwise they are blocked until all of them release the lock. Both readers and writers have to do a rw_unlock() when they are done accessing the protected data. There is also a macro RW_CRITICAL() for bracketing accesses to critical data protected by a read/write lock.
rw_lock_t rwl;
init_rw_lock(&rwl);     /* initialization */
...
/* reader */
rw_lock(&rwl, r);       /* begin rwl crit. sec.: read immediately  */
/* body */;             /* if there is no writer, otherwise block  */
/* body */;             /* until it releases the lock              */
rw_unlock(&rwl, r);     /* end rwl crit. sec. */
...
/* writer */
rw_lock(&rwl, w);       /* begin rwl crit. sec.: write immediately */
/* body */;             /* if no other writer/readers, otherwise   */
/* body */;             /* block until they release the lock       */
rw_unlock(&rwl, w);     /* end rwl crit. sec. */
...
RW_CRITICAL(rwl, mode,  /* begin rwl crit. sec., mode = r|w */
    /* body */;
);                      /* end rwl crit. sec. */
Two concurrent threads can synchronize their actions by using the condition variable of type RVM_CONDITION. Thread A can sleep by calling condition_wait(), waiting to be woken up by thread B when it calls condition_signal().
This technique is used for asynchronous log truncation. A separate thread executes the code of log_daemon() if there is multi-threading support (either Mach's cthreads, or Unix's LWP or pthreads). Normally this thread blocks waiting while the other thread(s) run. From time to time, when transactions are committed, the other thread(s) check whether the log is too full; if it is, they wake up log_daemon(), which then truncates the log asynchronously.
RVM_CONDITION code;

/* thread of log_daemon() */
CRITICAL(lock, {
    while (state == rvm_idle)
        condition_wait(&code, &lock);
});
/* do asynchronous log truncation, or terminate the log daemon here */

/* thread of log_tid() */
CRITICAL(lock, {
    if (need_truncation) {
        if (state == rvm_idle) {
            state = truncating;
            condition_signal(&code);
        }
    }
});
The following structures are all declared in the file rvm_private.h. The important data structures are the following.
Segment descriptor (seg_t)
There can be more than one segment per process. Each segment has a segment descriptor. These segment descriptors are linked together as a list, and the list header is the global variable seg_root.
The segment descriptor contains the header map_list for an important list: the list of mapped regions, which will be discussed in more detail in section sect2-region.
There can be more than one segment per log. In fact, each individual new value record can be applied to a different segment. This is achieved by tagging each new value record with a short name of its segment, seg_code. The short name of each segment is also stored in its own descriptor.
Log descriptor (log_t)
Like the segment descriptors, log descriptors are also connected as a list (the header is the global variable log_root). However, RVM currently supports only one log per process, which is pointed to by the global pointer default_log.
log_status is the in-core copy of the log status block; it is explained in more detail in the next section. Similarly, trans_hdr, rec_end and log_wrap are the areas for preparing the corresponding records on the log (c.f. section sect2-log-record). log_buf is the log recovery buffer descriptor, and it is explained in section log-buf.
The log descriptor contains headers for three important lists: tid_list, flush_list and special_list. The first links up transactions that have begun but not yet ended. The second links up transactions that have been committed but not yet flushed.
Log status block (log_status)
The log status has an in-core copy as well as an on-disk copy. The in-core copy is declared as type log_status_t and is stored as a sub-structure of the log descriptor; the on-disk copy is written in the log status block. Ideally, we would want the in-core copy to be written to disk every time the log status changes. However, doing so would require many disk head seeks between the log status block and the log tail, which would defeat the purpose of the log, so RVM updates the on-disk log status block only every UPDATE_STATUS (currently 100) status changes. This is kept track of by the counter update_cnt. Because of this, RVM does not completely trust the values in the log status block. For example, when doing log truncation, RVM trusts the value prev_log_head but not the value log_tail, so it does a locate_tail() to make sure it finds the last record written in the log.
The log status area records the important log pointers: log_head, log_tail, prev_log_head, prev_log_tail, and the log limits: log_start and log_size. Besides these, it also records a large number of consistency-check fields, transaction statistics and log statistics.
Each of the log record types has a corresponding in-core data structure. They are used to prepare the records before they are written out to the log. There is only one copy of trans_hdr_t, rec_end_t and log_wrap_t per log, and they are part of the log descriptor. For new value records, areas are required for the headers and the actual data, and these are per-range. Headers are of type nv_range_t and are stored in the range descriptors (described in the following section). Actual data are stored in a separately allocated buffer pointed to by the field data in the range descriptor. Log special records are associated with the log; they are dynamically allocated and organized as a list (header: log_special_list) in the log descriptor.
Log recovery buffer (log_buf->buf) and its descriptor (log_buf)
The log recovery buffer (also called the log buffer) is used in log truncation. It is used to keep an image of log records for processing. The descriptor of the buffer is called the log recovery buffer descriptor (type log_buf_t). It is a sub-structure of the log descriptor. The actual buffer is pointed to by the field buf, and the length of the buffer is stored in the field buf_len.
Two routines can be used to fill and refill the log recovery buffer: init_buffer(log, offset, direction, ...) and refill_buffer(log, direction, ...). Note the direction argument of the two routines. They will try to read as much data as possible from the log device, subject to the length of the log buffer and other constraints.
Upon return of these two routines, log_buf->ptr is the index into the log buffer that corresponds to the current log position, and log_buf->offset corresponds to the log offset of byte 0 in the log buffer. log_buf->offset is sector aligned. (Figure fig_log_buf)
The following is a typical code sequence in log truncation:
...
r_length = init_buffer(log, off, REVERSE, ...);
log_buf->ptr -= sizeof(rec_end_t);
rec_end = (rec_end_t *)&log_buf->buf[log_buf->ptr];
if (rec_end->struct_id == rec_end_id) {
    ...
In the code, off is the current log position. The routine init_buffer() is instructed to initialize the buffer by scanning the log in the reverse direction. When it returns, r_length bytes have been read into the log recovery buffer (pointed to by log_buf->buf), and log_buf->ptr is the index of the byte in the log buffer that corresponds to off. Since the code here expects an end record in the log, it decrements log_buf->ptr by the known size of rec_end_t so as to point to the end record. After the decrement, the address of log_buf->buf[log_buf->ptr] should be the beginning of the in-core image of the end record. The code casts it to (rec_end_t *) and stores it in rec_end. After that it can manipulate the record as it likes. In the example, it examines whether the struct_id of the end record is really rec_end_id.
Region descriptor (region_t)
Each segment has a list of the regions that are mapped. So in each segment descriptor (type seg_t) there is a member map_list which is the head of this list. Each node in the list is a region descriptor of type region_t.
It records the offset and end_offset of the region in the data segment, as well as the vmaddr and length of the virtual memory region that it is mapped to.
The member n_uncommit records the number of uncommitted ranges in this region; it is incremented every time the application does an rvm_set_range(), and decremented when a transaction is committed or aborted. When the application tries to do an rvm_unmap(), this value must be zero, otherwise RVM_EUNCOMMIT will be returned.
region_t may be confused with rvm_region_t and mem_region_t. We explain the former in the following and the latter in the next section. rvm_region_t is part of the API: it is the structure used by the application to indicate the mapping parameters data_dev, offset, length, vmaddr, no_copy, etc. RVM does not use this structure internally once it has got all the parameters from the application and has transferred them to region_t.
Memory region descriptor (mem_region_t)
In addition to the per-segment region list described above, there is also a per-process memory region tree. It is used to record the regions in the virtual memory space that are mapped, and it is arranged in order of VM address.
When the application does an rvm_map(), this tree is consulted to make sure the requested virtual memory region is not already mapped. If it is already mapped, RVM_EVM_OVERLAP will be returned. Also, when the application does an rvm_set_range(), this tree is consulted to make sure the requested range is totally contained within a single virtual memory region. If it is not, RVM_ENOT_MAPPED will be returned.
The root of the tree is region_tree
. Each node of the
tree is of type mem_region_t
.
Transaction descriptor (int_tid_t)
Whenever the application does an rvm_begin_transaction(), RVM allocates a transaction descriptor of type int_tid_t. These descriptors are linked up into lists. Before the transaction commits, the descriptor is in the list headed by log->tid_list. If the transaction is committed, it is moved to another list headed by log->flush_list. If the transaction record is flushed, it is removed from the flush list.
Typically, the application informs RVM of one or more memory ranges which will be affected by the transaction. This is done by issuing one or more rvm_set_range() or rvm_modify_bytes() calls after rvm_begin_transaction(). Internally, these memory ranges are organized as a tree rooted at range_tree (of type tree_root_t). The tree root is a field of the transaction descriptor.
int_tid_t may be confused with rvm_tid_t. The latter is the transaction identifier used by applications, while the former is used internally by RVM. Although rvm_tid_t contains a pointer to an int_tid_t, applications are not expected to use or touch that pointer.
Range descriptor (range_t)
When the application specifies a modification range by rvm_set_range() or rvm_modify_bytes(), the VM address (range->nv.vmaddr) and the length (data_len) of the range are stored internally in the range descriptor (range_t). When the transaction commits, the content of the virtual memory pointed to by range->nv.vmaddr is copied to the buffer pointed to by range->data. The same buffer has another purpose: if the transaction is in restore mode, the buffer is also used for saving the old image of the VM. In case the transaction is aborted, the old image can be restored.
Range descriptors are connected together as a tree, with the tree root in the corresponding transaction descriptor. If the optimization flag RVM_COALESCE_RANGES is specified, the tree is sorted according to VM address, and ranges are coalesced before they are inserted into the tree. If RVM_COALESCE_RANGES is not specified, the tree is simply sorted by range number.
There is also an area of type nv_range_t called nv which is used as the in-memory image of the new value range record on the log.
Device region descriptor (dev_region_t)
This is not to be confused with region_t and mem_region_t. dev_region_t is used only in log truncation. During log truncation, all sub-records of all transactions in the truncation epoch are processed, and a tree called mod_tree is built. Each node of the tree is of type dev_region_t. The tree is sorted according to segment offset. The purpose of the tree is to make the disk write operations on the segment device sequential.
Device descriptor (device_t)
The device descriptor (of type device_t) contains the details of the physical device: name, handle, size. It also contains the pointers to the gather-write I/O vectors.
rvm_initialize()
Global lists and trees are initialized (seg_root, log_root, region_tree). A log truncation (log_recover()) is performed.
rvm_map()
There is always a log truncation at the beginning of mapping. This is necessary to ensure the mapping gets the latest committed image from the data segment. rvm_map() also checks whether the application has specified a new segment device; if so, it creates a special record of type log_seg_t and queues it on log->special_list. This record will be flushed to the log before the next transaction.
The application may or may not specify the mapped-to address for the region. If it is not specified, rvm_map() does a page_alloc() to find a VM address for the requested region. In either case, the mapped-to address is stored in region->vmaddr as well as mem_region->vmaddr.
Before carrying out the mapping, rvm_map() also needs to check that the requested mapping does not overlap existing mappings in either the VM address space or the segment offset space. The former is detected with the aid of region_tree, while the latter is detected with the aid of seg->map_list. If there is overlap, RVM_EVM_OVERLAP will be returned in the former case and RVM_EOVERLAP in the latter case.
Finally, the content of the data segment at the specified offset is
copied into the VM space (unless no_copy
mode is specified).
rvm_begin_transaction()
A transaction descriptor (of type int_tid_t
) will be allocated
and inserted in log->tid_list
. The address of that descriptor
is also noted in rvm_tid->tid
.
rvm_set_range() and rvm_modify_bytes()
The range declared must be totally included inside a mapped region, otherwise RVM_ENOT_MAPPED will be returned. This is done by consulting region_tree.
A new range descriptor (of type range_t) will be allocated and inserted into tid->range_tree. Depending on the optimization flags of the transaction, the range tree may be sorted by VM address or by range number.
If the transaction is in restore mode, the current image of the VM in the range is first saved to a buffer pointed to by range->data. The VM address of the declared range is stored in range->nv.vmaddr.
rvm_end_transaction()
When the application calls rvm_end_transaction(), for each of the ranges in tid->range_tree, the VM area pointed to by range->nv.vmaddr is copied to range->data. This ensures the atomicity of the transaction: when rvm_end_transaction() returns, all declared ranges have been copied to a separate location. If the program crashes before rvm_end_transaction() returns, the data segment will contain the old values of the declared ranges. Applications are free to write new data in the ranges after rvm_end_transaction() returns.
The transaction descriptor tid
is dequeued from
log->tid_list
and is queued to log->flush_list
.
If the RVM_COALESCE_TRANS flag is set for the transaction, inter-transaction coalescing will be carried out.
If the transaction is committed in flush
mode, flush_log()
is called.
flush_log()
Every transaction descriptor in log->flush_list is written out to the log; this is done by the routine log_tid(). First, the transaction header of type trans_hdr_t is built and added to the gather-write vector dev->iov. Then, for every range in tid->range_tree, the new value range header of type nv_range_t is built and added to the gather-write vector, followed by the actual new value data. After all the new value records are written, a record end marker (of type rec_end_t) is built and added to the gather-write vector.
If log wrapping is needed, intervening record end marker, log wrap marker and transaction header records will be inserted, as described in section log-wrap.
rvm_truncate()
Log truncation is carried out by log_recover(). For each segment, a modification tree mod_tree is built. Each node is of type dev_region_t, and the tree is arranged in order of segment offset. For each new value record of each transaction on the log, a node of type dev_region_t is built and inserted into the modification tree. If there are overlaps in segment offset, the overlapped portion is not inserted into the modification tree again. Overlapping segment offsets are not an error here; for example, the same memory range can be updated by different transactions. Also note that since the log is truncated in last-in-first-out order, i.e. the new value records that were flushed last are inserted into the tree first, we always have the most up-to-date range inserted into the tree. When all the log records bracketed by the log pointers prev_log_head and prev_log_tail have been scanned and inserted into the modification tree, the whole tree is written to the data device (apply_mods()).
The program rvmutl is a utility program for examining and maintaining RVM logs. Among the commands of rvmutl are:

init_log: initialize a log
open_log: open an existing log
status: display information on the log status block
head: locate and display the first record of the current epoch
tail: locate and display the last record of the current epoch
prev (n): locate and display the previous (n) record(s)
next (n): locate and display the next (n) record(s)

In the RVM source code, you can often see the global variable rvm_utlsw. RVM is designed with the idea that there are two possible types of users of RVM's routines: ordinary applications, or the program rvmutl. This variable allows differential treatment of exception conditions. For example, in the routine scan_forward(), if the record type found is not one the routine is expecting (which possibly means a corrupted log), the program will behave differently depending on whether it is an ordinary application or rvmutl.