Shared Hash Tables In Parallel Model Checking

7m ago
667.53 KB
13 Pages

Electronic Notes in Theoretical Computer Science 198 (2008) 79– Hash Tables in ParallelModel Checking 1Jiřı́ Barnat Petr RočkaiFaculty of Informatics, Masaryk University,Brno, Czech Republicbarnat,[email protected] light of recent shift towards shared-memory systems in parallel explicit model checking, we explorerelative advantages and disadvantages of shared versus private hash tables. Since usage of shared statestorage allows for techniques unavailable in distributed memory, these are evaluated, both theoretically andpractically, in a prototype implementation. Experimental data is presented to assess practical utility ofthose techniques, compared to static partitioning of state space, more traditional in distributed memoryalgorithms.Keywords: Hash tables, locking schemes, parallel1IntroductionMuch of the extensive research on the parallelisation of model checking algorithmsfollowed the distributed-memory programming model [5,4,12] and the algorithmswere parallelised for networks of workstations, largely due to easy access to networks of workstations. Recent shift in architecture design toward multi-cores hasintensified research pertaining to shared-memory paradigm as well.A mostly straightforward transformation of distributed-memory algorithm intoa shared-memory one, using several tailored techniques, is explored in [2]. In thispaper, we intend to build on these results, further augmenting the selected distributed algorithms with extensions specific to shared-memory systems, especiallythose based on using a single shared storage for the explored graph.For the experimental implementation, we have used DiVinE [3], specifically themulti-threaded, shared-memory version – as created for [2] – using the original DVEstate-space generator. The code has been modified for the purposes of this paper.1This work has been partially supported by the Grant Agency of Czech Republic grant No. 201/06/1338and the Academy of Sciences grant No. 1ET408050503.1571-0661 2008 Elsevier B.V. Open access under CC BY-NC-ND license.doi:10.1016/j.entcs.2007.10.021

801.1J. Barnat, P. Ročkai / Electronic Notes in Theoretical Computer Science 198 (2008) 79–91Shared-Memory PlatformSince in the paper, we will work with several assumptions about the targeted hardware architecture, we will describe it briefly first.We work with a model based on threads that share all memory, although theyhave separate stacks in their shared address space and a special thread-local storageto store thread-private data. Our working environment is POSIX, with its implementation of threads as lightweight processes. Switching contexts among differentthreads is cheaper than switching contexts among full-featured processes with separate address spaces, so using more threads than there are CPUs in the systemincurs only a minor penalty.Critical Sections, Locking and Lock Contention. In a shared-memorysetting, access to memory, that may be used for writing by more than a single thread,has to be controlled through use of mutual exclusion, otherwise, race conditions willoccur. This is generally achieved through use of a “mutual exclusion device”, socalled mutex. A thread wishing to enter a critical section has to lock the associatedmutex, which may block the calling thread if the mutex is locked already by someother thread. An effect called resource or lock contention is associated with thisbehaviour. This occurs, when two or more threads happen to need to enter thesame critical section (and therefore lock the same mutex), at the same time. Ifcritical sections are long or they are entered very often, contention starts to causeobservable performance degradation, as more and more time is spent waiting formutexes.Processor Cache: Locality and Coherence. There are currently two mainarchitectures in use for Level 2 cache. One is that each processing unit has itscompletely private Level 2 cache (for the Symmetric Multiprocessing case) or thereis a shared Level 2 cache for a package of 2 cores (designs with a Level 2 cacheshared among 4 cores are not commercially available as of this writing). In biggershared-memory computer systems, it is usual to encounter split cache, since theyoften contain on the order of 8-64 cores attached to a single memory block. Inrecent hardware, the basic building units are dual-core CPUs with shared cache,but among the different units, the caches are still separate. This idiosyncrasy ofthese architectures has important effects on performance and these will be discussedlater in more detail.Shared Memory Bus. Since the memory in SMP machines is attached to asingle shared memory bus, the RAM access from different processors needs to beserialized. This caps total memory throughput of the system and at some point,the available memory bandwidth becomes the bottleneck of computation. Thisis an important factor for memory-intensive workloads, to which model-checkingdefinitely belongs.1.2AlgorithmsThe algorithms used are not the main concern of this paper, but we neverthelesssummarise OWCTY, as it was used in the implementation. Also, since we are using

J. Barnat, P. Ročkai / Electronic Notes in Theoretical Computer Science 198 (2008) 79–9181the algorithm in somewhat non-standard setting, we have slightly modified someof its non-vital aspects – more details on those modifications will be described inSection 3.3. Short description of the original algorithm follow.The algorithm [8] is an extended enumerative version of the One Way CatchThem Young Algorithm [11]. The idea of the algorithm is to repeatedly removevertices from the graph that cannot lie on an accepting cycle. The two removal rulesare as follows. First, a vertex is removed from the graph if it has no successors inthe graph (the vertex cannot lie on a cycle), second, a vertex is removed if it cannotreach an accepting vertex (a potential cycle the vertex lies on is non-accepting).The algorithm performs removal steps as far as there are vertices to be removed.In the end, either there are some vertices remaining in the graph meaning thatthe original graph contained an accepting cycle, or all vertices have been removedmeaning that the original graph had no accepting cycles.The time complexity of the algorithm is O(h · m) where h h(G). Here thefactor m comes from the computation of elimination rules while the factor h relatesto the number of global iterations the removal rules must be applied. Also note, thatan alternative algorithm is obtained if the rules are replaced with their backwardsearch counterparts.2Hash Tables in Model CheckingOne of the traditional approaches, when exploring the state-space of an implicitlyspecified model, is that the algorithm starts from the initial state and using atransition function, generates successors of every explored state. Visited states arestored in a hash-table, to facilitate quick insertion of newly visited states and quicklookup of states that already have been visited.The usual approach in distributed algorithms is to partition the state spacestatically, using a partition function [7,9] (which is usually in turn based on ahash function over the state representation). This partition function unambiguouslyassigns each state to one of the computation nodes. Same approach can be leveragedin shared-memory computation, where each thread of control assumes ownership ofa private hash table, and potentially also a private memory area for storing actualstate representations.The described configuration is often the only feasible option, when dealing withdistributed memory system, since cross-node memory access has to be either manually simulated using message passing, or even if available, is prohibitively expensive.However, the situation in shared-memory systems is somewhat different, sinceall processors (and therefore threads of control) share a single continuous blockof local memory, with uniform accessibility from all the CPUs and/or cores. Thisgives us two new options, compared to situation in distributed environment, namely,if several hash tables are used, threads can look into tables they don’t own, andsecond, probably more interesting option is to have a single shared hash table, usedby all the threads.

822.1J. Barnat, P. Ročkai / Electronic Notes in Theoretical Computer Science 198 (2008) 79–91ImplementationThe threaded version of DiVinE implements an internal collision resolution hash table with quadratic probing. The table is dynamically-sized with exponential growth(i.e., the size of the table doubles every time more space is needed). Originally, thethreshold triggering table growth has been set as half-full, which gives minimumoverhead of 2 key-sized cells per valid item, where in our case the key is a singlepointer. Growing the table starts with allocating a new, double-sized table, iterating over all entries in the old table and rehashing them into the new, bigger one.This is a linear-time operation, amortised over insertions into the table. However,this property may have more far-fetched consequences in a setting where the tableis shared among multiple threads.A somewhat different approach for triggering the growth of the table has beenimplemented as part of the work on shared hash tables. The conditions are nowtwofold, first is that table is 75% full, the second is that there have been too manycollisions upon insert, where too many is defined as 32 sqrt(size)/16. This parameter may be subject to further adujstment, although we haven’t observed significantimpact. This latter trigger produces more tighly packed tables, which may sometimes save time, especially since during the growth, all other processing is halted.Another possibility to reduce the number of grows is to increase the growth factor(this is an user-overridable setting and subject to empirical tuning).2.2Region LockingThere is a need for locking when multiple threads perform concurrent reads andupdates of the table. Since the table is accessed very frequently, it is completelyunfeasible to lock the whole table for each access, as this would lead to very highlock contention and, consequently, reduced performance. Therefore, a region lockingscheme is devised, to only lock the region within which the update or lookup takesplace. Special precautions are necessary for growing the table, since no updates atall are allowed during this window. The regions are fixed-size, so the number ofregions grows linearly with the table size. There are two other options on how toorganise locking, one being of fixed number of locks, which means the locking unitincreases linearly with the hash table size, the second being a square-root basedgrowth of both region size and number of locks.Theoretical benefits of the first approach are that lock granularity and thereforecontention should remain very low throughout program execution. Fixed numberof locks makes competition for any given lock higher, although in theory, it shouldremain constant, as long as number of competing threads is constant. The squareroot approach is a compromise between those two. All the methods are evaluatedin the experimental section.2.3Lockless Shared TableIf implemented with no locking at all, an insertion may silently fail, i.e. it maybe overwritten by a subsequent insert to a colliding position due to a race condi-

J. Barnat, P. Ročkai / Electronic Notes in Theoretical Computer Science 198 (2008) 79–9183tion. However, this is not a fatal problem for reachability analysis, as observedin [17]. We have implemented a lockless hash table, but we have encountered severescalability problems with large, statically sized tables (as opposed to dynamicallygrowing tables). Since growing a lockless table is not implemented, this makes ithard to compare against the locking implementations, which can resize tables andtherefore don’t suffer from the large table problem. However, even lock-based tables, when statically sized, are highly detrimental to any scalability the system maybe exhibiting. As of this writing, we haven’t found the cause of the scalability issueswith pre-sized tables, therefore more investigation is due.3State Space PartitioningTo distribute the workload of graph exploration (in case of safety checking) or cycledetection (in case of liveness checking), the state space is divided into parts, onefor each of the worker threads (in the case of distributed computation, one for eachcluster node).3.1Static PartitioningThe original shared-memory implementation used a partitioning scheme comingdirectly from the distributed world. Each state is uniquely assigned to a thread,based solely on the state representation. This means, that every time a state isgenerated, it is assigned to the same thread. Consequently, each thread can maintainits private hash table, where it stores all states it owns. This has an important sideeffect of the thread being able to operate on the table without resorting to lockingor critical sections. Same goes for the auxiliary state data (like predecessor countin OWCTY elimination) – no locking is necessary.Another benefit is highly efficient use of processor cache, by making the ratio of hash table size to processor cache size much more favourable, than in thecase of shared hash table. This consequently reduces memory load and improvesthroughput.3.2Dynamic PartitioningThe above static partitioning scheme suffers from high communication overhead,since as threads are added, number of cross-transitions (transitions that requireinter-thread communication, because one of the states belongs to different threadthan the other) grows rapidly.A scheme using a different partitioning approach may be devised, when we aredealing with a single, shared table. Since the shared table allows any thread tolookup or update any state, it is no longer necessary to maintain the rule requiringeach state to be unambiguously assigned to one of the threads. Instead, the threadthat is examining a transition can decide on-the-fly w