Michael Abrash’s Graphics Programming Black Book, Special Edition

Michael Abrash


Another way in which the prefetch queue cycle-eater complicates the use of the Zen timer involves the practice of timing the performance of a few instructions over and over. However, as we just learned, the actual performance of any instruction depends on the code mix preceding any given use of that instruction, which in turn affects the state of the prefetch queue when the instruction starts executing.

Alas, the execution time of an instruction preceded by dozens of identical instructions reflects just one of many possible prefetch states (and not a very likely state at that), and some of the other prefetch states may well produce distinctly different results. For example, consider the code in Listings 4. Here, because the prefetch queue is always empty, execution time should work out to about 4 cycles per byte, or 8 cycles per SHR, as shown in Figure 4.

In fact, the Zen timer reports that Listing 4. Going by Listing 4. Since MUL instructions take so long to execute that the prefetch queue is always full when they finish, each SHR should be ready and waiting in the prefetch queue when the preceding MUL ends.

And, by God, when we run Listing 4. The key point is this: Are we talking about two different forms of SHR here? Of course not—the difference is purely a reflection of the differing states in which the preceding code left the prefetch queue. By contrast, each SHR in Listing 4. Clearly, either instruction fetch time or Execution Unit execution time—or even a mix of the two, if an instruction is partially prefetched—can determine code performance.

Some people operate under a rule of thumb by which they assume that the execution time of each instruction is 4 cycles times the number of bytes in the instruction. That rule is unreliable. For one thing, the rule should be 4 cycles times the number of memory accesses, not instruction bytes, since all accesses take 4 cycles on the 8088-based PC.

Now consider Listing 4. Both instructions are 2 bytes long, and in both cases it is the 8-cycle instruction fetch time, not the 3- or 4-cycle Execution Unit execution time, that limits performance. The fact of the matter is that a given instruction takes at least as long to execute as the time given for it in the Intel manuals, but may take as much as 4 cycles per byte longer, depending on the state of the prefetch queue when the preceding instruction ends.
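The two bounds just described can be captured in a toy cycle-accounting model. This is a sketch only: it assumes the 8088's 4-cycles-per-byte fetch cost and ignores partially prefetched instructions.

```python
# Per-instruction timing bounds on the 8088, per the discussion above.
CYCLES_PER_BYTE = 4  # each bus access, and so each fetched byte, costs 4 cycles

def timing_bounds(eu_cycles, size_bytes):
    """Return (best, worst) cycle counts for one instruction.
    best: the instruction is already waiting in the prefetch queue.
    worst: the queue is empty, so instruction fetching limits performance."""
    best = eu_cycles
    worst = max(eu_cycles, size_bytes * CYCLES_PER_BYTE)
    return best, worst

# SHR reg,1: 2 Execution Unit cycles but 2 bytes long, so it is
# fetch-bound at 8 cycles when the prefetch queue is empty.
print(timing_bounds(2, 2))  # (2, 8)
```

The same model shows why a long-running instruction such as MUL, whose Execution Unit time dwarfs its fetch time, is never fetch-bound.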

The only true execution time for an instruction is a time measured in a certain context, and that time is meaningful only in that context. What we really want is to know how long useful working code takes to run, not how long a single instruction takes, and the Zen timer gives us the tool we need to gather that information.

Get used to the idea that execution times are only meaningful in context, learn the rules of thumb in this book, and use the Zen timer to measure your code. For example, practically speaking, each SHR in Listing 4.

You could think of the extra instruction fetch time for SHR in Listing 4. Alternatively, you could think of each SHR in Listing 4. Whichever perspective you prefer is fine. The important point is that the time during which the execution of one instruction and the fetching of the next instruction overlap should only be counted toward the overall execution time of one of the instructions.

For all intents and purposes, one of the two instructions runs at no performance cost whatsoever while the overlap exists.

Reducing the impact of the prefetch queue cycle-eater is one of the overriding principles of high-performance assembly code. How can you do this? One effective technique is to minimize access to memory operands, since such accesses compete with instruction fetching for precious memory accesses.

You can also greatly reduce instruction fetch time simply by your choice of instructions: Keep your instructions short. Less time is required to fetch instructions that are 1 or 2 bytes long than instructions that are 5 or 6 bytes long. Reduced instruction fetching lowers minimum execution time (minimum execution time is 4 cycles times the number of instruction bytes) and often leads to faster overall execution.
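A rough bus-access budget makes both rules of thumb concrete. The byte and access counts below are illustrative assumptions, not figures taken from the Intel manuals:

```python
# Bus-access budget on the 8088: every instruction byte fetched and every
# operand byte read or written is one 4-cycle bus access.
CYCLES_PER_ACCESS = 4

def bus_cycles(instruction_bytes, operand_byte_accesses):
    return (instruction_bytes + operand_byte_accesses) * CYCLES_PER_ACCESS

reg_op = bus_cycles(2, 0)  # a 2-byte register-to-register instruction: fetch only
mem_op = bus_cycles(4, 4)  # a 4-byte instruction that reads and writes a word
                           # in memory: 2 byte accesses in, 2 byte accesses out
print(reg_op, mem_op)  # 8 32
```

Keeping operands in registers and instructions short attacks both terms of the sum at once.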

While short instructions minimize overall prefetch time, ironically they actually often suffer more from the prefetch queue bottleneck than do long instructions. Short instructions generally have such fast execution times that they drain the prefetch queue despite their small size. For example, consider the SHR of Listing 4. However, using the registers is a rule of thumb, not a commandment.

In some circumstances, it may actually be faster to access memory. The look-up table technique is one such case. All in all, writing good assembler code is as much an art as a science. As a result, you should follow the rules of thumb described here—and then time your code to see how fast it really is. You should experiment freely, but always remember that actual, measured performance is the bottom line. Dynamic RAM refresh and wait states—our next topics—together form the lowest level at which the hardware of the PC affects code performance.

Below this level, the PC is of interest only to hardware engineers. What is important is that you understand this: Under certain circumstances, devices on the PC bus can stop the CPU for 1 or more cycles, making your code run more slowly than it seemingly should. DRAM refresh is one such circumstance: it invisibly and inexorably steals a certain fraction of all available memory access time from your programs as they access memory for code and data.

A bit of background: a dynamic RAM (DRAM) chip stores each bit as a small charge that gradually leaks away, so every cell must periodically be accessed (refreshed) to restore that charge. So long as this is done often enough, a DRAM chip will retain its contents indefinitely. Each DRAM chip in the PC must be completely refreshed about once every four milliseconds in order to ensure the integrity of the data it stores. On the original 8088-based IBM PC, timer 1 of the 8253 timer chip is programmed at power-up to generate a signal once every 72 cycles, or about once every 15 microseconds. That signal goes to channel 0 of the 8237 DMA controller, which requests the bus from the 8088 upon receiving the signal.

DMA stands for direct memory access, the ability of a device other than the CPU to control the bus and access memory directly, without any help from the CPU. The addresses accessed by the refresh DMA accesses are arranged so that taken together they properly refresh all the memory in the PC.

By accessing one of those addresses every refresh period, the DMA controller works its way through all of memory. Only the first 640K of memory is refreshed in the PC; video adapters and other adapters above 640K containing memory that requires refreshing must provide their own DRAM refresh in pre-AT systems. The important point is this: the refresh accesses take the bus away from the CPU for 4 cycles out of every 72. That means that as much as 5.56 percent of all available memory access time can be lost to refresh. Consequently, DRAM refresh can slow code performance anywhere from 0 percent to 5.56 percent. To see the effect in action, first consider the series of MUL instructions in Listing 4.

Since a 16-bit MUL on the 8088 executes in between 118 and 133 cycles and is only 2 bytes long, there should be plenty of time for the prefetch queue to fill after each instruction, even after DRAM refresh has taken its slice of memory access time. Consequently, the prefetch queue should be able to keep the Execution Unit well-supplied with instruction bytes at all times. Since SHR executes in 2 cycles but is 2 bytes long, the prefetch queue should be empty while Listing 4. runs.

As a result, the time per instruction of Listing 4. In fact, each SHR in Listing 4. In fact, the result indicates that DRAM refresh is stealing not 4, but 5.

How can this be? When the code in Listing 4. Now we see that things can get even worse than we thought: DRAM refresh can steal as much as 8.33 percent of the available memory access time.
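The refresh arithmetic behind these figures is simple. In this sketch the 4-cycle access per 72-cycle refresh period follows the text; the assumption that a refresh DMA transfer can hold off a bus-hungry CPU for about 6 cycles is mine, chosen to reproduce the worst-case figure:

```python
# DRAM refresh overhead on the PC: one refresh bus access every 72 cycles.
REFRESH_PERIOD = 72  # CPU cycles between refresh DMA requests (timer 1)

typical = 4 / REFRESH_PERIOD  # the refresh access itself takes 4 cycles of bus time
worst = 6 / REFRESH_PERIOD    # assumed: refresh stalls a waiting CPU ~6 cycles

print(f"{typical:.2%} {worst:.2%}")  # 5.56% 8.33%
```

Code that never touches the bus during a refresh loses nothing; code fetching non-stop pays the worst-case rate.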

While either case can happen, the latter case—significant performance reduction, ranging as high as 8.33 percent—is far more common. This is especially true for high-performance assembly code, which uses fast instructions that tend to cause non-stop instruction fetching.

When we discovered the 8-bit bus cycle-eater, we learned to use byte-sized memory operands whenever possible, and to keep word-sized variables in registers. What can we do to work around the DRAM refresh cycle-eater? If refresh were any less frequent, the reliability of the PC would be compromised, so tinkering with either timer 1 or DMA channel 0 to reduce DRAM refresh overhead is out.

Nor is there any way to structure code to minimize the impact of DRAM refresh. Sure, some instructions are affected less by DRAM refresh than others, but how many multiplies and divides in a row can you really use? In the old days when code size was measured in bytes, not K bytes, and processors were less powerful—and complex—programmers did in fact use similar tricks to eke every last bit of performance from their code. When programming the PC, however, the prefetch queue cycle-eater would make such careful code synchronization a difficult task indeed, and any modest performance improvement that did result could never justify the increase in programming complexity and the limits on creative programming that such an approach would entail.

Besides, all that effort goes to waste on faster processors and other computers with different execution speeds and refresh characteristics. Useful code accesses memory frequently and at irregular intervals, and over the long haul DRAM refresh always exacts its price. The display adapter cycle-eater is another possible culprit, and, on later processors, cache misses and pipeline execution hazards produce this sort of effect as well. Thanks to DRAM refresh, variations of up to 8.33 percent can occur in measured execution times.

Wait states are well and truly the lowest level of code performance. Everything we have discussed and will discuss—even DMA accesses—can be affected by wait states. Wait states exist because the CPU must be able to coexist with any adapter, no matter how slow (within reason).

To resolve this conflict, display adapters can tell the CPU to wait during bus accesses by inserting one or more wait states, as shown in Figure 4. The CPU simply sits and idles as long as wait states are inserted, then completes the access as soon as the display adapter indicates its readiness by no longer inserting wait states. Mind you, this is all transparent to executing code.

An instruction that encounters wait states runs exactly as if there were no wait states, only slower. Wait states are nothing more or less than wasted time as far as the CPU and your program are concerned. By understanding the circumstances in which wait states can occur, you can avoid them when possible. Unlike DRAM refresh, wait states do not occur on any regularly scheduled basis, and are of no particular duration.

Both the presence of wait states and the number of wait states inserted on any given bus access are entirely controlled by the device being accessed. When it comes to wait states, the CPU is passive, merely accepting whatever wait states the accessed device chooses to insert during the course of the access.

All of this makes perfect sense given that the whole point of the wait state mechanism is to allow a device to stretch out any access to itself for however much time it needs to perform the access. However, in the PC, wait states most often occur when an instruction accesses a memory operand, so in fact the Execution Unit usually is stopped by wait states. Instruction fetches rarely wait in an 8088-based PC because system memory is zero-wait-state.

AT-class memory systems routinely insert 1 or more wait states, however. As it turns out, wait states pose a serious problem in just one area in the PC. While any adapter can insert wait states, in the PC only display adapters do so to the extent that performance is seriously affected.
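The effect of a wait state is nothing more than a stretched bus access; a minimal sketch, assuming the 8088's 4-cycle base access:

```python
# A wait state adds one idle cycle to the CPU's normal 4-cycle bus access.
BASE_ACCESS_CYCLES = 4

def access_cycles(wait_states):
    return BASE_ACCESS_CYCLES + wait_states

def slowdown(wait_states):
    """Fractional extra time per access versus zero-wait-state memory."""
    return access_cycles(wait_states) / BASE_ACCESS_CYCLES - 1.0

# One wait state, as in AT-class memory, makes each access 25% slower.
print(access_cycles(1), slowdown(1))  # 5 0.25
```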

Display adapters must serve two masters, and that creates a fundamental performance problem. Master 1 is the circuitry that drives the display screen. This circuitry must constantly read display memory in order to obtain the information used to draw the characters or dots displayed on the screen. Since the screen must be redrawn between 50 and 70 times per second, and since each redraw of the screen can require as many as 36,000 reads of display memory (more in Super VGA modes), master 1 is a demanding master indeed.

No matter how demanding master 1 gets, however, its needs must always be met—otherwise the quality of the picture on the screen would suffer. Master 2 is the CPU, which reads from and writes to display memory in order to manipulate the bytes that the video circuitry reads to form the picture on the screen. Master 2 is less important than master 1, since the CPU affects display quality only indirectly.

In other words, if the video circuitry has to wait for display memory accesses, the picture will develop holes, snow, and the like, but if the CPU has to wait for display memory accesses, the program will just run a bit slower—no big deal. It matters a great deal which master is more important, for while both the CPU and the video circuitry must gain access to display memory, only one of the two masters can read or write display memory at any one time.

Potential conflicts are resolved by flat-out guaranteeing the video circuitry however many accesses to display memory it needs, with the CPU waiting for whatever display memory accesses are left over. It turns out that the CPU has to do a lot of waiting, for three reasons. First, the video circuitry can take as much as about 90 percent of the available display memory access time, as shown in Figure 4.

Finally, the time it takes a display adapter to complete a memory access is related to the speed of the clock which generates pixels on the screen rather than to the memory access speed of the CPU. Consequently, the time taken for display memory to complete a read or write access is often longer than the time taken for system memory to complete an access, even if the CPU lucks into hitting a free display memory access slot just as it becomes available, again as shown in Figure 4.
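A toy model shows what that 90 percent figure costs a CPU access. The even spacing of free access slots here is a simplifying assumption:

```python
# If the video circuitry owns all but a small fraction of the display-memory
# access slots, a CPU access arriving at a random moment must wait for the
# next free slot.

def average_wait(free_fraction, slot_time=1.0):
    """Average wait (in slot-times) for a uniformly arriving CPU access,
    assuming free slots are evenly spaced."""
    period = slot_time / free_fraction  # spacing between free slots
    return period / 2                   # a uniform arrival waits half a period

# With only 10% of slots free, the CPU waits about 5 slot-times on average.
print(average_wait(0.10))  # 5.0
```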

The important point is that display memory is not very fast compared to normal system memory. How slow is it? The PCjr was at best only half as fast as the PC, even though the PCjr had an 8088 running at 4.77 MHz, because all the memory in the PCjr was display memory. Running your code from display memory is sort of like running on a hypothetical processor—an 8088 with a 4-bit bus. Given that your code and data reside in normal system memory below the 640K mark, how great an impact does the display adapter cycle-eater have on performance?

On the other hand, Super VGAs have more bytes of display memory to be accessed in high-resolution mode. In text mode, the display adapter cycle-eater is rarely a major factor.

After all, the whole point of drawing graphics is to convey visual information, and if that information can be presented faster than the eye can see, that is by definition fast enough. In Chapter 3, I recounted the story of a debate among letter-writers to a magazine about exactly how quickly characters could be written to display memory without causing snow.

Of course, now we know that their cardinal sin was to ignore the prefetch queue; even if there were no wait states, their calculations would have been overly optimistic. There are display memory wait states as well, however, so the calculations were not just optimistic but wildly optimistic. Text mode situations such as the above notwithstanding, where the display adapter cycle-eater really kicks in is in graphics mode, and most especially in the high-resolution graphics modes of the EGA and VGA.

The problem here is not that there are necessarily more wait states per access in high-resolution graphics modes (that varies from adapter to adapter and mode to mode).

Rather, the problem is simply that there are many more bytes of display memory per screen in these modes than in lower-resolution graphics modes and in text modes, so many more display memory accesses—each incurring its share of display memory wait states—are required in order to draw an image of a given size. When accessing the many thousands of bytes used in the high-resolution graphics modes, the cumulative effects of display memory wait states can seriously impact code performance, even as measured in human time.

That sounds pretty serious, but we did make an unfounded assumption about memory access speed. The code in Listing 4. accesses display memory at every possible opportunity, and on average each of those accesses loses additional cycles to display memory wait states. In other words, the display adapter cycle-eater can more than double the execution time of 8088 code!

A line-drawing subroutine, which executes perhaps a dozen instructions for each display memory access, generally loses less performance to the display adapter cycle-eater than does a block-copy or scrolling subroutine that uses REP MOVS instructions. Scaled and three-dimensional graphics, which spend a great deal of time performing calculations (often using very slow floating-point arithmetic), tend to suffer less.

In addition, code that accesses display memory infrequently tends to suffer only about half of the maximum display memory wait states, because on average such code will access display memory halfway between one available display memory access slot and the next. As a result, code that accesses display memory less intensively than the code in Listing 4. loses correspondingly less performance. Nonetheless, the display adapter cycle-eater always takes its toll on graphics code.

Interestingly, that toll becomes much higher on ATs and 386 machines because, while those computers can execute many more instructions per microsecond than can the 8088-based PC, it takes just as long to access display memory on those computers as on the 8088-based PC.

What can we do about the display adapter cycle-eater? Well, we can minimize display memory accesses whenever possible. The key here is that only half as many display memory accesses are required to write a byte to display memory as are required to read a byte from display memory, mask part of it off and alter the rest, and write the byte back to display memory.

Half as many display memory accesses means half as many display memory wait states. Moreover, 486s and Pentiums, as well as recent Super VGAs, employ write-caching schemes that make display memory writes considerably faster than display memory reads. Along the same line, the display adapter cycle-eater makes the popular exclusive-OR animation technique, which requires paired reads and writes of display memory, less than ideal for the PC.
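The trade-off can be sketched in C over an ordinary memory buffer standing in for display memory. The function names here are mine, purely for illustration:

```c
#include <string.h>

/* draw_xor: exclusive-OR animation -- each byte costs a READ and a WRITE
   of display memory, so two wait-state-prone accesses per byte. */
void draw_xor(unsigned char *screen, const unsigned char *image, int n)
{
    int i;
    for (i = 0; i < n; i++)
        screen[i] ^= image[i];     /* read screen[i], modify, write it back */
}

/* draw_copy: plain write -- one display memory access per byte, and a
   natural fit for REP MOVS-style block moves. */
void draw_copy(unsigned char *screen, const unsigned char *image, int n)
{
    memcpy(screen, image, n);
}
```

The appeal of exclusive-OR animation is that drawing the same image twice restores the background; the price is that every byte drawn incurs twice the display memory accesses of a straight write.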

Exclusive-OR animation should be avoided in favor of simply writing images to display memory whenever possible. Another principle for display adapter programming on the 8088 is to perform multiple accesses to display memory very rapidly, in order to make use of as many of the scarce accesses to display memory as possible.

This is especially important when many large images need to be drawn quickly, since only by using virtually every available display memory access can many bytes be written to display memory in a short period of time.

Repeated string instructions are ideal for making maximum use of display memory accesses; of course, repeated string instructions can only be used on whole bytes, so this is another point in favor of modifying display memory a byte at a time.

On faster processors, however, display memory is so slow that it often pays to do several instructions' worth of work between display memory accesses, to take advantage of cycles that would otherwise be wasted on the wait states. For the time being, all you really need to know about the display adapter cycle-eater is that on the 8088 you can lose more than 8 cycles of execution time on each access to display memory.

For intensive access to display memory, the loss really can be as high as 8 cycles (and up to 50, 100, or even more on 486s and Pentiums paired with slow VGAs), while for average graphics code the loss is closer to 4 cycles; in either case, the impact on performance is significant. There is only one way to discover just how significant the impact of the display adapter cycle-eater is for any particular graphics code, and that is, of course, to measure the performance of that code. There you have it. Some of those cycle-eaters can be minimized by keeping instructions short, using the registers, using byte-sized memory operands, and accessing display memory as little as possible.

Those three little words should strike terror into the heart of anyone who owns more than a sleeping bag and a toothbrush. Our last move was the usual zoo—and then some. Because the distance from the old house to the new was only five miles, we used cars to move everything smaller than a washing machine. We have a sizable household—cats, dogs, kids, computers, you name it—so the moving process took a number of car trips. A large number—33, to be exact.

I personally spent about 15 hours just driving back and forth between the two houses. The move took days to complete. As it happens, the second question answers the first. It costs quite a bit to drive a car that many miles, to say nothing of the value of 15 hours of my time. But, at the time, it seemed as though my approach would be easier and cheaper. In Chapter 1, I briefly discussed using restartable blocks. This, you might remember, is the process of handling in chunks data sets too large to fit in memory so that they can be processed just about as fast as if they did fit in memory.

The restartable block approach is very fast but is relatively difficult to program. At the opposite end of the spectrum lies byte-by-byte processing, whereby DOS or, in less extreme cases, a group of library functions is allowed to do all the hard work, so that you only have to deal with one byte at a time. Byte-by-byte processing is easy to program but can be extremely slow, due to the vast overhead that results from invoking DOS each time a byte must be processed. I moved via the byte-by-byte approach, and the overhead of driving back and forth made for miserable performance.

Renting a truck the restartable block approach would have required more effort and forethought, but would have paid off handsomely. The easy, familiar approach often has nothing in its favor except that it requires less thinking; not a great virtue when writing high-performance code—or when moving.
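In C terms, the two ends of that spectrum might be sketched as follows. These function names are mine; note too that a buffered getc is far cheaper than a real DOS call per byte, so the gap shown here understates the original problem:

```c
#include <stdio.h>

/* Byte-by-byte: one library call per byte processed -- easy to write,
   but all overhead. */
long count_slow(FILE *f)
{
    long n = 0;
    int c;
    rewind(f);
    while ((c = getc(f)) != EOF)
        n++;
    return n;
}

/* Restartable blocks: one call per 32K chunk, then pure in-memory work. */
long count_blocks(FILE *f)
{
    static char buf[32 * 1024];
    long n = 0;
    size_t got;
    rewind(f);
    while ((got = fread(buf, 1, sizeof buf, f)) > 0)
        n += (long)got;
    return n;
}
```

Both functions produce the same answer; the block version simply pays the per-call overhead thousands of times less often.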

The first point to address in designing our program involves the appropriate text-search approach to use. Literally dozens of workable ways exist to search a file. We can immediately discard all approaches that involve reading any byte of the file more than once, because disk access time is orders of magnitude slower than any data handling performed by our own code.

Based on our experience in Chapter 1, we can also discard all approaches that get bytes either one at a time or in small sets from DOS. A good rough cut is a buffer that will be between 16K and 64K, depending on the exact search approach, 64K being the maximum size because near pointers make for superior performance. So we know we want to work with a large buffer, filling it as infrequently as possible. Now we have to figure out how to search through a file by loading it into that large buffer in chunks.

Where do we begin? Well, it might be instructive to consider how we would search if our search involved only one buffer, already resident in memory.

The closest match to what we need is strstr, which searches one string for the first occurrence of a second string. Where we want to search a fixed-length buffer for the first occurrence of a string, strstr searches a string for the first occurrence of another string. We could put a zero byte at the end of our buffer to allow strstr to work, but why bother?

The strstr function must spend time either checking for the end of the string being searched or determining the length of that string—wasted effort given that we already know exactly how long our search buffer is. Even if a given strstr implementation is well-written, its performance will suffer, at least for our application, from unnecessary overhead. By the way, we could, of course, use our own code, working with pointers in a loop, to perform the comparison in place of memcmp.

If necessary, you could always write your own assembly language implementation of memcmp. Invoking memcmp for each potential match location works, but entails considerable overhead. Each comparison requires that parameters be pushed and that a call to and return from memcmp be performed, along with a pass through the comparison loop. We can eliminate most calls to memcmp by performing a simple test on each potential match location that will reject most such locations right off the bat.

We could make this check by using a pointer in a loop to scan the buffer for the next match for the first character, stopping to check for a match with the rest of the string only when the first character matches, as shown in Figure 5. Our engine also relies heavily on repeated string instructions, assuming that the memchr and memcmp library functions are properly coded. The only trick lies in handling potentially matching sequences in the file that start in one buffer and end in the next—that is, sequences that span buffers.
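A C sketch of that first-character-filter scan might look like the following. This is my own illustrative code, not the book's listing; it assumes only the standard memchr and memcmp:

```c
#include <stddef.h>
#include <string.h>

/* Search a fixed-length buffer (no terminating zero needed, unlike strstr).
   memchr rejects most locations cheaply; memcmp runs only on first-character
   hits. Returns a pointer to the match, or NULL. */
const char *buf_search(const char *buf, size_t buf_len,
                       const char *pat, size_t pat_len)
{
    const char *p = buf;
    const char *end = buf + buf_len;

    if (pat_len == 0 || pat_len > buf_len)
        return NULL;
    while ((p = memchr(p, pat[0], (size_t)(end - p))) != NULL) {
        if ((size_t)(end - p) < pat_len)
            return NULL;                      /* too close to the buffer end */
        if (memcmp(p, pat, pat_len) == 0)
            return p;                         /* full match confirmed */
        p++;                                  /* false alarm; keep scanning */
    }
    return NULL;
}
```

Because memchr and memcmp are typically built on repeated string instructions, nearly all of the scanning happens at block-move speed, and the per-location call overhead of memcmp is paid only when the first character actually matches.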

Organize your program so that you can do your processing within each block as fast as you could if there were only one block—which is to say at top speed—and make your blocks as large as possible in order to minimize the overhead associated with going from one block to the next. To boost the overall performance of Listing 5.
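One way to handle buffer-spanning matches—again a sketch of my own, not the book's listing—is to carry the last pattern-length-minus-one bytes of each block to the front of the next block before reading more data:

```c
#include <stdio.h>
#include <string.h>

#define BLOCK 16        /* tiny block size so the spanning case triggers here;
                           a real program would use 16K-64K blocks */

/* Naive fixed-length search, standing in for the memchr/memcmp engine. */
static long find_in(const char *buf, long len, const char *pat, long plen)
{
    long i;
    for (i = 0; i + plen <= len; i++)
        if (memcmp(buf + i, pat, plen) == 0)
            return i;
    return -1;
}

/* Search a stream in restartable blocks. Matches spanning two blocks are
   caught by carrying the last plen-1 bytes forward. Assumes plen <= 64. */
long search_stream(FILE *f, const char *pat, long plen)
{
    char buf[BLOCK + 64];
    long carried = 0, offset = 0;   /* offset = stream position of buf[0] */
    size_t got;

    while ((got = fread(buf + carried, 1, BLOCK, f)) > 0) {
        long len = carried + (long)got;
        long hit = find_in(buf, len, pat, plen);
        if (hit >= 0)
            return offset + hit;            /* position within the stream */
        carried = (plen - 1 < len) ? plen - 1 : len;
        offset += len - carried;
        memmove(buf, buf + len - carried, (size_t)carried);
    }
    return -1;
}
```

Within each block the search runs at full in-memory speed; the only extra work per block is one short memmove of the carried tail.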

Take a moment to examine some interesting performance aspects of the C implementation, and all should become much clearer. Our next discovery is that, even though we read the file in large chunks, most of the execution time of Listing 5. goes to DOS disk access rather than to searching.

When I replaced the read function call in Listing 5. with code that simply pretended to read the file, execution time dropped sharply; the search itself accounts for comparatively little of the total. All in all, the time required for DOS disk access calls takes up at least 80 percent of execution time, and search time is less than 20 percent of overall execution time. In fact, search time is probably a good deal less than 20 percent of the total, given that the overhead of loading the program, running through the C startup code, opening the file, executing printf, and exiting the program and returning to the DOS shell is also included in my timings.

If, for example, your application will typically search buffers in which the first character of the search string occurs frequently (as might be the case when searching a text buffer for a string starting with the space character), an assembly implementation might be several times faster.

In contrast, Listing 5. is dominated by disk access time, so a faster search engine would buy little there. It might also be worth converting the search engine to assembly for searches performed entirely in memory; with the overhead of file access eliminated, improvements in search-engine performance would translate directly into significantly faster overall performance.

One such application, which would have much the same structure as Listing 5. , is searching data that is already resident in memory. Restartable blocks do minimize the overhead of DOS file-access calls in Listing 5. , but they cannot make the disk itself any faster.

The first lesson is less obvious than it seems. When I set out to write this chapter, I fully intended to write an assembly language version of Listing 5. , but the measurements above made it clear that the program spends its time in DOS disk access, not in the search code, so an assembly version would have gained little.

When you try to speed up code, take a moment to identify the hot spots in your program so that you know where optimization is needed and whether it will make a significant difference before you invest your time. As for restartable blocks: Here we tackled a considerably more complex application of restartable blocks than we did in Chapter 1—which turned out not to be so difficult after all. Focus on making the inner loop—the code that handles each block—as efficient as possible, then structure the rest of your code to support the inner loop.

I was fortunate enough to be seated next to Jeff at the dinner table, and, not surprisingly, our often animated conversation revolved around computers, computer writing, and more computers (not necessarily in that order).

Although I was making a living at computer work and enjoying it at the time, I nonetheless harbored vague ambitions of being a science-fiction writer when I grew up. At any rate, I had accumulated a small collection of rejection slips, and fancied myself something of an old hand in the field. You should see what they pay for science fiction—even to the guys who win awards! Had I known I was seated next to a real, live science-fiction writer—an award-nominated writer, by God!

I was at a dinner put on by a computer magazine, seated next to an editor who had just finished a book about Turbo Pascal, and, gosh, it was obvious that the appropriate topic was computers.

To produce the best code, you must decide precisely what you need to accomplish, then put together the sequence of instructions that accomplishes that end most efficiently, regardless of what the instructions are usually used for. The point to all this: Yes, SHL shifts a pattern left—but a look-up table can do the same thing, and can often do it faster.
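As a tiny illustration of the point (my own example, not the book's), here are two interchangeable ways to turn a pixel's x coordinate into a planar-mode bit mask—one with a shift, one with a look-up table. On the 8088 the table version can win, because it replaces a variable shift (which costs 4 cycles per bit shifted) with a single memory access:

```c
/* VGA-style bit masks for pixels within a byte; x = 0 selects the
   leftmost (highest) bit, as in the EGA/VGA planar modes. */
static const unsigned char bit_mask[8] = {
    0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01
};

unsigned char mask_by_shift(int x) { return (unsigned char)(0x80 >> (x & 7)); }
unsigned char mask_by_table(int x) { return bit_mask[x & 7]; }
```

The two functions always agree; which is faster depends on the processor, which is exactly the point—SHL is one way to get the answer, not the only way.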

The instruction set is your raw material for writing high-performance code. Give it a shot! From that bit of sublime idiocy we can learn much about divining the full value of an instruction.

But what else do they do? For example, suppose you have an array base address in BX and an index into the array in SI. You could add the two registers together and then address memory through the sum, or you could let a base+index addressing mode such as [BX+SI] do the addition for you as part of the memory access. The two approaches are functionally interchangeable but not equivalent from a performance standpoint, and which is better depends on the particular context. On a 286 or 386, however, the balance shifts.

All memory addressing calculations are free on the Pentium, however. You may, however, be a tad more interested to hear that you can also use addressing modes to perform arithmetic that has nothing to do with memory addressing, and with a couple of advantages over arithmetic instructions, at that.

LEA accepts a standard memory addressing operand, but does nothing more than store the calculated memory offset in the specified register, which may be any general-purpose register. The operation of LEA is illustrated in Figure 6. What does that give us? Suppose we want the sum of two registers placed in a third register, without disturbing either source register. The obvious solution is a MOV followed by an ADD; an elegant alternative solution is a single LEA.

On both a 486 and a Pentium, LEA can also be slowed down by addressing interlocks. The 386 brings two very interesting capabilities to LEA: any 32-bit general-purpose register can serve as base or index, and the index can be scaled. The obvious advantage is that any two 32-bit registers, or any 32-bit register and any constant, or any two 32-bit registers and any constant, can be added together, with the result stored in any register.

It can multiply by 2, 4, or 8 any register used as an index. Besides, multiplying by 2, 4, or 8 amounts to a left shift of 1, 2, or 3 bits, so we can now add up to two 32-bit registers and a constant, and shift (or multiply) one of the registers to some extent—all with a single instruction. Are you impressed yet with all that LEA can do on the 386? Believe it or not, one more feature still awaits us. LEA can actually perform a fast multiply of a 32-bit register by some values other than powers of two.

You see, the same 32-bit register can be both base and index on the 386, and can be scaled as the index while being used unchanged as the base. That means that you can, for example, multiply EBX by 5 with a single LEA EBX,[EBX+EBX*4]. Without LEA and scaling, multiplication of EBX by 5 would require either a relatively slow MUL (along with a set-up instruction or two) or three separate shift-and-add instructions.
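The arithmetic behind those LEA multiplies can be checked in plain C (a sketch of mine, not the book's code): scaling the index shifts it left, and adding in the unscaled base gives x + (x << k):

```c
/* LEA with the same register as base and scaled index computes
   x + x*scale for scale = 2, 4, or 8 -- that is, x*3, x*5, or x*9.
   Expressed here as the equivalent C shift-and-add arithmetic. */
unsigned times3(unsigned x) { return x + (x << 1); }  /* lea eax,[ebx+ebx*2] */
unsigned times5(unsigned x) { return x + (x << 2); }  /* lea eax,[ebx+ebx*4] */
unsigned times9(unsigned x) { return x + (x << 3); }  /* lea eax,[ebx+ebx*8] */
```

Each of these collapses to one LEA on a 386 or later, with no flags disturbed and the result free to land in any register.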

Multiplying a 32-bit value by a non-power-of-two multiplier in just 2 cycles is a pretty neat trick, even though it works only on a 386 or 486. The full list of values that LEA can multiply a register by on a 386 or 486 is: 2, 3, 4, 5, 8, and 9.

The scene is Buffalo, New York, in the dead of winter, with the snow piled several feet deep.

Four college students, living in typical student housing, are frozen to the bone. One fabulously cold day, inspiration strikes: they will build a sauna. Someone rushes out and buys a gas heater, and at considerable risk to life and limb hooks it up to an abandoned but still live gas pipe that once fed a stove on the third floor.

Someone else gets sheets of plastic and lines the walls of the bathroom to keep the moisture in, and yet another student gets a bucket full of rocks. The remaining chap brings up some old wooden chairs and sets them up to make benches along the sides of the bathroom. They crank up the gas heater, put the bucket of rocks in front of it, close the door, take off their clothes, and sit down to steam themselves. Surely warmer times await. The temperature climbs to 55 degrees, then 60, then 63, then 65, and finally creeps up to 68 degrees.

For a Buffalo winter, 68 degrees is warm. It is not, however, particularly warm for a sauna. Eventually someone acknowledges the obvious and allows that it might have been a stupid idea after all, and everyone agrees, and they shut off the heater and leave, each no doubt offering silent thanks that they had gotten out of this without any incidents requiring major surgery.

And so we see that the best idea in the world can fail for lack of either proper design or adequate horsepower. The primary cause of the Great Buffalo Sauna Fiasco was a lack of horsepower; the gas heater was flat-out undersized. This is analogous to trying to write programs that incorporate features like bitmapped text and searching of multisegment buffers without using high-performance assembly language. Any PC language can perform just about any function you can think of—eventually.

That heater would eventually have heated the room to sauna temperatures, too—along about the first of June or so. The Great Buffalo Sauna Fiasco also suffered from fundamental design flaws. A more powerful heater would indeed have made the room hotter—and might well have burned the house down in the process. Likewise, proper algorithm selection and good design are fundamental to performance.

The extra horsepower a superb assembly language implementation gives a program is worth bothering with only in the context of a good design. Assembly language optimization is a small but crucial corner of the PC programming world.
