How does random access memory (RAM) work? Why is it constant-time random access?

Kac*_*acy 20 hardware arrays ram assembly random-access

Or, in other words, why does accessing an arbitrary element of an array take constant time (instead of O(n) or some other amount of time)?

I've searched my heart out for an answer and haven't found a very good one, so I'm hoping one of you can share your low-level knowledge with me.

Just so you know how low-level an answer I'm hoping for, I'll tell you why I think it takes constant time.

When I say array[4] = 12 in a program, I'm really just storing the bit representation of a memory address into a register. That physical register in the hardware will turn on the corresponding electrical signals based on the bit representation I fed it. Those electrical signals will then somehow, magically (hopefully someone can explain the magic), access the right memory address in physical/main memory.

I know that was rough, but it's just to give you an idea of the kind of answer I'm looking for.

(Editor's note: from the OP's later comments, he understands that address calculations take some time, and is simply asking what happens after that.)

Mat*_*lia 19

Because software likes O(1) "working" memory, the hardware is designed to behave that way

The basic point is that the address space of a program is thought of as abstractly having O(1) access performance, i.e. whatever memory location you want to read, it should take some constant time (which, in any case, is unrelated to the distance between it and the last memory access). So, since arrays are nothing more than contiguous chunks of address space, they should inherit this property (accessing an element of an array is just adding the index to the start address of the array, and then dereferencing the resulting pointer).
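For instance, a minimal C sketch of that (the array name and size are made up for illustration): indexing really is just "base address plus index", followed by a dereference.

#include <stdio.h>

int main(void) {
    int array[8] = {0};

    /* array[4] is defined as *(array + 4): take the base address,
       advance by 4 * sizeof(int) bytes, and dereference the result. */
    array[4] = 12;
    *(array + 4) = 12;   /* exactly the same access */

    printf("%d at %p\n", array[4], (void *)&array[4]);
    return 0;
}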

This property comes from the fact that, in general, the address space of a program has some correspondence with the physical RAM of the PC, which, as the name (Random Access Memory) partly implies, should itself have the property that, whatever the location in RAM you want to access, you get to it in constant time (as opposed, for example, to a tape drive, where the seek time depends on the actual length of tape you have to move to get there).

Now, for "regular" RAM this property is (at least AFAIK) true: when the processor/motherboard/memory controller asks a RAM chip to fetch some data, it does so in constant time; the details aren't really relevant to software development, and the internals of memory chips have changed many times in the past and will change again in the future. If you're interested in the details of current RAM, you can look up information about DRAM.

The general concept is that RAM chips don't contain a tape that must be moved, or a disk arm that must be positioned; when you ask them for a byte at some location, the work (mostly changing the settings of some hardware multiplexers, which connect the output to the cells storing the byte's state) is the same for whatever location you might ask for; thus, you get O(1) performance.
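As a rough software model of that idea (purely illustrative, not how any real chip is wired): the address is split into a row half and a column half, and selecting a cell is the same couple of operations no matter which address you pass in.

#include <stdint.h>

#define ROW_BITS 4
#define COL_BITS 4

/* A toy 16x16 grid of one-byte "cells" standing in for the memory array. */
static uint8_t cells[1 << ROW_BITS][1 << COL_BITS];

/* Selecting a cell is just splitting the address into row/column fields
   and indexing the grid; the work does not grow with the address value. */
uint8_t read_cell(uint16_t addr) {
    uint16_t row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1);
    uint16_t col = addr & ((1 << COL_BITS) - 1);
    return cells[row][col];
}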

There is some overhead behind this (the logical address has to be mapped to a physical address by the MMU, the various motherboard parts have to talk to each other to tell the RAM to fetch the data and bring it back to the processor, ...), but the hardware is designed to do all of that in more or less constant time.

So:

Arrays map over the address space, which maps over RAM, which has O(1) random access; since all the mappings are (more or less) O(1), arrays keep the O(1) random access performance of RAM.


What does matter to software developers, instead, is that, although we see a flat address space that normally maps over RAM, on modern machines it is false that accessing any element has the same cost. In fact, accessing elements that are in the same zone can be way cheaper than jumping around the address space, because the processor has several on-board caches (= smaller but faster on-chip memories) that keep recently used data and memory that is in the same neighborhood; thus, if you have good data locality, continuous operations in memory won't keep hitting RAM (which has much longer latency than the caches), and in the end your code will run way faster.
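A quick illustrative sketch of that locality effect (array name and size invented for the example): both functions below touch exactly the same elements, but the first walks memory sequentially while the second strides across rows, and on typical hardware the second tends to be noticeably slower for large sizes.

#include <stddef.h>

#define N 2048
static double grid[N][N];

/* Row-major traversal: consecutive accesses are adjacent in memory,
   so most of them hit data already brought into the cache. */
double sum_row_major(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += grid[i][j];
    return s;
}

/* Column-major traversal: each access jumps N * sizeof(double) bytes,
   defeating the cache even though the element count is identical. */
double sum_col_major(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += grid[i][j];
    return s;
}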

Also, under memory pressure, operating systems that provide virtual memory can decide to move rarely used pages of your address space to disk, and fetch them on demand when they are accessed (in response to a page fault); such an operation is very costly and, again, strongly deviates from the idea that accessing any virtual memory address is the same.
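To get a feel for demand paging, here is a small POSIX sketch (the sizes and the 4 KiB page assumption are mine): mmap hands back address space immediately, but each page is only backed by real memory on first touch, and that first touch costs a page fault.

#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL * 1024 * 1024;        /* 64 MiB of address space */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* The mapping itself is cheap; no RAM has been allocated yet.
       The first write to each page triggers a page fault, so this pass
       is much slower than a second pass over already-resident memory. */
    for (size_t off = 0; off < len; off += 4096)
        p[off] = 1;

    munmap(p, len);
    return 0;
}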

  • @KacyRaye There are multiple abstractions at work here: your program, the programming language, the computer hardware model, the chips, the electronics, quantum effects, and deeper still. I encourage you to learn as much as you want about all of them, but you really only need to understand the top couple of levels to be a good programmer. (3 upvotes)
  • @KacyRaye A memory chip is a grid of cells. Each cell holds one bit. The address fed to the chip is split into two halves, used for row address select (RAS) and column address select (CAS); together, the row and column uniquely select one cell to access. (2 upvotes)
  • RAM hasn't been O(1) for over a decade, ever since "fast burst" modes appeared. But you don't notice, because that feature is designed to work together with cache lines. (2 upvotes)

Mar*_*som 8

The calculation to get from the start of the array to any given element takes only two operations: a multiplication (by sizeof(element)) and an addition. Both operations are constant time. With today's processors it can be done in essentially no time at all, as the processor is optimized for this kind of access.
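Spelled out in C (function and parameter names are made up for illustration), the calculation is just one multiply and one add, and neither depends on how large the index is:

#include <stddef.h>
#include <stdint.h>

/* Address of element[index]: base plus index times the element size.
   The cost of this computation is the same for index 0 or index 1000000. */
uintptr_t element_address(uintptr_t base, size_t index, size_t element_size) {
    return base + index * element_size;
}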

  • @KacyRaye Then ask *that* question instead of why arrays are O(1), if the latter is obvious to you. +1 for this answer; I wanted to write it myself until I saw some of the comments and the mention of "magic electrical signals" in the question. (4 upvotes)

old*_*mer 5

When I say array[4] = 12 in a program, I'm really just storing the bit representation of a memory address into a register. That physical register in the hardware will turn on the corresponding electrical signals based on the bit representation I fed it. Those electrical signals will then somehow, magically (hopefully someone can explain the magic), access the right memory address in physical/main memory.

I am not quite sure what you are asking, but I don't see any answers related to what is really going on in the magic of the hardware. Hopefully I understood enough to go through this long-winded explanation (which is still very high level).

array[4] = 12;

So from the comments it sounds like it is understood that you have to get the base address of the array, and then multiply by the size of an array element (or shift, if that optimization is possible) to get the address (from your program's perspective) of the memory location. Right off the bat we have a problem. Are these items already in registers or do we have to go get them? The base address of the array may or may not be in a register, depending on the code that surrounds this line of code, in particular the code that precedes it. That address might be on the stack or in some other location depending on where you declared it and how, and that may or may not matter as to how long this takes. An optimizing compiler may (often) go so far as to pre-compute the address of array[4] and place it somewhere so it can go into a register and the multiply never happens at runtime, so it is absolutely not true that the computation of array[4] for a random access is a fixed amount of time compared to other random accesses. Depending on the processor, some immediate patterns take one instruction and others take more, and whether this address is read from .text or from the stack, etc., is also a factor. To not chicken-and-egg that problem to death, assume we have the address of array[4] computed.
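To make that concrete, a small sketch of my own (global array, invented function names): with a constant index a compiler can fold the whole address calculation away, while a runtime index forces the shift/multiply and add to happen on every access.

#include <stddef.h>

int array[16];

/* Constant index: the compiler can compute &array[4] at compile/link time,
   so this typically becomes a single store to a fixed address. */
void store_constant(void) {
    array[4] = 12;
}

/* Runtime index: the address must be formed as base + i * sizeof(int)
   (usually a shift plus an add, or a scaled addressing mode) before the store. */
void store_variable(size_t i) {
    array[i] = 12;
}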

This is a write operation, from the programmer's perspective. Start with a simple processor: no cache, no write buffer, no MMU, etc. Eventually the simple processor will put the address on the edge of the processor core, along with a write strobe and the data. Each processor's bus is different from other processor families, but it is roughly the same: the address and data can come out in the same cycle or in separate cycles. The command type (read, write) can happen at the same time or a different one, but the command comes out. The edge of the processor core is connected to a memory controller that decodes that address. The result is a destination: is this a peripheral, and if so which one and on what bus; is this memory, and if so on what memory bus; and so on. Assume RAM, and assume this simple processor has SRAM, not DRAM. SRAM is more expensive and faster in an apples-to-apples comparison. The SRAM has an address, write/read strobes, and other controls. Eventually you have the transaction type (read/write), the address, and the data. The SRAM, however its geometry is laid out, will route and store the individual bits in their individual pairs/groups of transistors.
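A toy version of that address decode (the ranges and names are invented; every real memory map is different): the controller looks at a few high bits of the address and routes the transaction to the matching device, the same small amount of work for any address.

#include <stdint.h>

typedef enum { DEST_SRAM, DEST_PERIPHERAL, DEST_UNMAPPED } dest_t;

/* Made-up memory map: RAM at 0x00000000-0x0FFFFFFF, peripherals at
   0x40000000-0x4FFFFFFF, everything else unmapped. */
dest_t decode_address(uint32_t addr) {
    if (addr < 0x10000000u)
        return DEST_SRAM;
    if (addr >= 0x40000000u && addr < 0x50000000u)
        return DEST_PERIPHERAL;
    return DEST_UNMAPPED;
}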

A write cycle can be fire and forget. All the information needed to complete the transaction (this is a write, this is the address, this is the data) is known right then and there. The memory controller can, if it chooses, tell the processor that the write transaction is complete, even if the data is nowhere near the memory. That address/data pair will take its time getting to the memory, and the processor can keep operating. In some systems, though, the design is such that the processor's write transaction waits until a signal comes back indicating that the write has made it all the way to the RAM. In a fire-and-forget type setup, that address/data will be queued up somewhere and work its way to the RAM. The queue can't be infinitely deep, otherwise it would be the RAM itself, so it is finite, and it is possible and likely that many writes in a row can fill that queue faster than the other end can write to RAM. At that point the current and/or next write has to wait for the queue to indicate there is room for one more. So in situations like this, how fast your write happens, whether your simple processor is I/O bound or not, has to do with prior transactions, which may or may not be write instructions that preceded the instruction in question.
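A minimal model of that bounded write queue (my own sketch, depth chosen arbitrarily): posting a write is instant while the FIFO has room, but once it fills, the "fast" write stalls until the slow back end drains an entry.

#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 8

typedef struct { uint32_t addr; uint32_t data; } write_txn;

static write_txn queue[QUEUE_DEPTH];
static unsigned head, tail, count;

/* Processor side: returns true if the write was accepted immediately
   (fire and forget); false means the buffer is full and the CPU must stall. */
bool post_write(uint32_t addr, uint32_t data) {
    if (count == QUEUE_DEPTH)
        return false;                        /* back-pressure: retry later */
    queue[tail] = (write_txn){ addr, data };
    tail = (tail + 1) % QUEUE_DEPTH;
    count++;
    return true;
}

/* Memory side: drains one entry per (slow) RAM cycle. */
bool drain_one(write_txn *out) {
    if (count == 0)
        return false;
    *out = queue[head];
    head = (head + 1) % QUEUE_DEPTH;
    count--;
    return true;
}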

Now add some complexity: ECC, or whatever name you want to call it (EDAC is another one). The way an ECC memory works is that the writes are all a fixed size. Even if your implementation is four 8-bit-wide memory parts giving you 32 bits of data per write, you have to have a fixed width that the ECC covers, and you have to write the data bits plus the ECC bits all at the same time (the ECC has to be computed over the full width). So if this was, say, an 8-bit write into a 32-bit ECC-protected memory, then that write cycle requires a read cycle: read the 32 bits (checking the ECC on that read), modify the new 8 bits in that 32-bit pattern, compute the new ECC pattern, then write the 32 bits plus the ECC bits. Naturally the read portion of the write cycle can end up with an ECC error, which just makes life even more fun. Single-bit errors can usually be corrected (what good is an ECC/EDAC if it can't?), multi-bit errors cannot. How the hardware is designed to handle these faults affects what happens next: the read fault may just trickle back to the processor, faulting the write transaction, or it may come back as an interrupt, etc. But here is another place where one random access is not the same as another: depending on the memory being accessed and the size of the access, a read-modify-write definitely takes longer than a simple write.
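Here is the shape of that read-modify-write in C (a sketch only: the check code below is simple parity standing in for a real SECDED/Hamming code, and the storage is a single word):

#include <stdint.h>

/* Stand-in check code; real controllers use a SECDED code, but parity is
   enough to show why the whole word must be read and rewritten. */
static uint8_t compute_check(uint32_t w) {
    uint8_t p = 0;
    while (w) { p ^= (uint8_t)(w & 1u); w >>= 1; }
    return p;
}

static uint32_t stored_word;     /* the 32-bit ECC-protected location */
static uint8_t  stored_check;    /* its check bits */

/* An 8-bit write cannot just store a byte: read the whole word, verify it,
   splice the byte in, recompute the check bits, write everything back. */
int write_byte(unsigned byte_lane, uint8_t value) {
    uint32_t word = stored_word;                 /* read cycle */
    if (compute_check(word) != stored_check)
        return -1;                               /* read-side ECC fault */

    word &= ~(0xFFu << (byte_lane * 8));         /* modify */
    word |= (uint32_t)value << (byte_lane * 8);

    stored_word  = word;                         /* write cycle */
    stored_check = compute_check(word);
    return 0;
}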

DRAM can also fall into this fixed-width category, even without ECC. Actually, all memory falls into this category at some point. The memory array is optimized on the silicon for a certain height and width in units of bits. You cannot violate that; the memory can only be read and written in units of that width at that level. The silicon libraries will include many geometries of RAM, the designers will choose those geometries for their parts, and the parts will have fixed limits. Often you can use multiple parts to get some integer multiple of that width, and sometimes the design will allow you to write to only one of those parts if only some of the bits are changing, while other designs will force all parts to light up. Notice how, with each new DDR family of modules that you plug into your home computer or laptop, the first wave has many parts on both sides of the board. Then, as that technology gets older and more boring, it may change to fewer parts on both sides of the board, eventually becoming fewer parts on one side of the board before that technology is obsolete and we are already into the next.

This fixed-width category also carries alignment penalties with it. Unfortunately most folks learn on x86 machines, which don't restrict you to aligned accesses like many other platforms do. There is a definite performance penalty on x86 and others for unaligned accesses, if they are allowed at all. It is usually when folks move to a MIPS, or more often an ARM on some battery-powered device, that they first learn as programmers about aligned accesses. And sadly they find them to be painful rather than a blessing (due to the simplicity, both in programming and in the hardware benefits that come from it). In a nutshell, if your memory is, say, 32 bits wide and can only be accessed, read or write, 32 bits at a time, that means it is limited to aligned accesses only. A memory bus on a 32-bit-wide memory usually does not have the lower address bits a[1:0], because there is no use for them; those lower bits, from a programmer's perspective, are zeros. If, though, our write was 32 bits against one of these 32-bit memories and the address was 0x1002, then somebody along the line has to read the memory at address 0x1000, take two of our bytes and modify that 32-bit value, then write it back; then take the 32 bits at address 0x1004, modify two bytes, and write it back. Four bus cycles for a single write. If we were writing 32 bits to address 0x1008, though, it would be a simple 32-bit write, no reads.
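The same four-cycle sequence, sketched in C against an array standing in for a 32-bit-wide memory (addresses and sizes invented; the byte splicing assumes a little-endian layout):

#include <stdint.h>
#include <string.h>

/* Backing store that, like the hardware, can only be read or written
   one aligned 32-bit word at a time. */
static uint32_t mem[2048];

static uint32_t read_word(uint32_t aligned_addr)              { return mem[aligned_addr / 4]; }
static void     write_word(uint32_t aligned_addr, uint32_t v) { mem[aligned_addr / 4] = v; }

/* A 32-bit store to an unaligned address (e.g. 0x1002) becomes two
   read-modify-write sequences: four bus cycles instead of one. */
void store32(uint32_t addr, uint32_t value) {
    uint32_t offset = addr & 3u;
    if (offset == 0) {                           /* aligned: one simple write */
        write_word(addr, value);
        return;
    }
    uint32_t lo_addr = addr & ~3u;
    uint8_t bytes[8];
    uint32_t lo = read_word(lo_addr);            /* read  #1 */
    uint32_t hi = read_word(lo_addr + 4);        /* read  #2 */
    memcpy(bytes, &lo, 4);
    memcpy(bytes + 4, &hi, 4);
    memcpy(bytes + offset, &value, 4);           /* splice in the new bytes */
    memcpy(&lo, bytes, 4);
    memcpy(&hi, bytes + 4, 4);
    write_word(lo_addr, lo);                     /* write #1 */
    write_word(lo_addr + 4, hi);                 /* write #2 */
}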

SRAM vs DRAM. DRAM is painfully slow but super cheap: half to a quarter the number of transistors per bit (4 for SRAM, for example, vs 1 for DRAM). SRAM remembers its bit as long as the power is on. DRAM has to be refreshed like a rechargeable battery; even if the power stays on, a single bit will only be remembered for a very short period of time. So some hardware along the way (the DDR controller, etc.) has to regularly perform bus cycles telling that RAM to refresh a certain chunk of the memory. Those cycles steal time from your processor wanting to access that memory. DRAM is very slow: it may say 2133MHz (2.133GHz) on the box, but it is really more like 133MHz RAM, i.e. 0.133GHz. The first cheat is DDR. Normally things in the digital world happen once per clock cycle: the clock goes to an asserted state, then to a deasserted state (ones and zeros); one cycle is one clock. DDR means that it can do something on both the high half-cycle and the low half-cycle, so that 2133MHz memory really uses a 1066MHz clock. Then pipeline-like parallelisms happen: you can shove in commands, in bursts, at that high rate, but eventually that RAM actually has to be accessed. Overall, DRAM is non-deterministic and very slow. SRAM, on the other hand, requires no refreshes; it remembers as long as the power is on, can be several times faster (133MHz * N), and so on. It can be deterministic.

The next hurdle: cache. Cache is good and bad. Cache is generally made from SRAM. Hopefully you have an understanding of what a cache is. If the processor or someone upstream has marked the transaction as non-cacheable, then it goes through uncached to the memory bus on the other side. If it is cacheable, then a portion of the address is looked up in a table, resulting in a hit or a miss. This being a write, depending on the cache and/or transaction settings, if it is a miss it may pass through to the other side. If there is a hit, then the data will be written into the cache memory; depending on the cache type, it may also pass through to the other side, or that data may sit in the cache waiting for some other chunk of data to evict it, at which point it gets written to the other side. Caches definitely make reads, and sometimes writes, non-deterministic. Sequential accesses have the most benefit, as your eviction rate is lower: the first access to a cache line is slow relative to the others, then the rest are fast. Which is where we get this term "random access" anyway: random accesses go against the schemes that are designed to make sequential accesses faster.
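A toy direct-mapped cache lookup (the sizes and field split are invented) showing where the hit/miss decision comes from: the address is split into tag, index, and offset, and a hit means the tag stored at that index matches.

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u                 /* cache line size */
#define NUM_LINES  512u                /* 32 KiB, direct-mapped */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line;

static cache_line cache[NUM_LINES];

/* Split the address and check the line at that index. On a miss the line
   would have to be fetched from the much slower memory behind the cache,
   possibly evicting whatever lived at this index before. */
bool cache_lookup(uint32_t addr, cache_line **line_out) {
    uint32_t index = (addr / LINE_BYTES) % NUM_LINES;
    uint32_t tag   = addr / (LINE_BYTES * NUM_LINES);
    cache_line *line = &cache[index];
    *line_out = line;
    return line->valid && line->tag == tag;
}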

Sometimes the far side of your cache has a write buffer, a relatively small queue/pipe/buffer/FIFO that holds some number of write transactions. Another fire-and-forget deal, with those benefits.

Multiple layers of caches: L1, L2, L3... L1 is usually the fastest, either by its technology or its proximity, and usually the smallest, and it goes up from there in speed and size; some of that has to do with the cost of the memory. We are doing a write, but when you do a cache-enabled read, understand that if L1 has a miss it goes to L2, which if it has a miss goes to L3, which if it has a miss goes to main memory; then L3, L2, and L1 will all store a copy. So a miss on all three is of course the most painful and is slower than if you had no cache at all, but subsequent sequential reads will hit the cached items, which are now in L1 and super fast. For the cache to be useful, sequential reads over the cache line should take less time overall than reading that much memory directly from the slow DRAM. A system doesn't have to have three layers of caches; it can vary. Likewise, some systems separate instruction fetches from data reads and have separate caches that don't evict each other, while in others the caches are not separate and instruction fetches can evict data from data reads.
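One way to see why the hierarchy pays off is to compute an average memory access time; the cycle counts and miss rates below are invented round numbers, purely for illustration.

#include <stdio.h>

/* Average memory access time for a 3-level hierarchy:
   AMAT = t_L1 + m_L1 * (t_L2 + m_L2 * (t_L3 + m_L3 * t_DRAM)).
   All numbers here are made up for the example. */
int main(void) {
    double t_l1 = 4, t_l2 = 12, t_l3 = 40, t_dram = 200;   /* cycles */
    double m_l1 = 0.10, m_l2 = 0.30, m_l3 = 0.50;          /* miss rates */

    double amat = t_l1 + m_l1 * (t_l2 + m_l2 * (t_l3 + m_l3 * t_dram));
    printf("average access: %.1f cycles (vs %.0f cycles going to DRAM every time)\n",
           amat, t_dram);
    return 0;
}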

Caches help with alignment issues, but of course there is an even more severe penalty for an unaligned access that crosses cache lines. Caches tend to operate using chunks of memory called cache lines. These are often some integer multiple of the width of the memory on the other side: for a 32-bit memory the cache line might be 128 or 256 bits, for example. So if and when the cache line is in the cache, the read-modify-write caused by an unaligned write is against faster memory, still more painful than an aligned write but not as painful. If it were an unaligned read and the address were such that part of the data is on one side of a cache line boundary and the rest on the other, then two cache lines have to be read. A 16-bit read, for example, can cost you many bytes read from the slowest memory, obviously several times slower than if you had no caches at all. Depending on how the caches and the memory system in general are designed, if you do a write across a cache line boundary it may be similarly painful, or perhaps not quite as much: it might write the one fraction into the cache and send the other fraction out on the far side as a smaller-sized write.

The next layer of complexity is the MMU, which allows the processor and programmer the illusion of flat memory spaces, and/or control over what is cached or not, and/or memory protection, and/or the illusion that all programs are running in the same address space (so your toolchain can always compile/link for address 0x8000, for example). The MMU takes a portion of the virtual address on the processor-core side and looks it up in a table, or series of tables. Those lookups are often in system address space, so each one of those lookups may be one or more of everything stated above, as each is a memory cycle on the system memory. Those lookups can result in ECC faults even though you are only trying to do a write. Eventually, after one or two or three or more reads, the MMU has determined what the address on the other side of the MMU is, and its properties (cacheable or not, etc.), and that is passed on to the next thing (L1, etc.), and all of the above applies. Some MMUs have a bit of a cache in them of some number of prior translations; remember, because programs are sequential, the tricks used to boost the illusion of memory performance are based on sequential accesses, not random accesses. So some number of lookups might be stored in the MMU so it doesn't have to go out to main memory right away...
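A sketch of a two-level table walk (a toy model with invented field sizes; real walkers differ per architecture): each level is itself a read from system memory, which is why a miss in the MMU's little translation cache makes an otherwise identical access noticeably slower.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                     /* 4 KiB pages              */
#define LEVEL_BITS 10u                     /* 10 + 10 + 12 = 32-bit VA */
#define ENTRIES    (1u << LEVEL_BITS)

/* Second-level table: maps the middle bits of the VA to a physical page. */
typedef struct { uint32_t page_base; bool valid; } l2_entry;
typedef struct { l2_entry entry[ENTRIES]; } l2_table;

/* Top-level table: maps the high bits of the VA to a second-level table. */
typedef struct { l2_table *table; bool valid; } l1_entry;
static l1_entry root[ENTRIES];

/* High bits pick a root entry (memory read #1), middle bits pick an entry
   in the second-level table (memory read #2), low bits are the page offset. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t idx1 = (vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & (ENTRIES - 1);
    uint32_t idx2 = (vaddr >> PAGE_SHIFT) & (ENTRIES - 1);

    if (!root[idx1].valid) return false;             /* page fault */
    l2_entry *e = &root[idx1].table->entry[idx2];
    if (!e->valid) return false;

    *paddr = e->page_base | (vaddr & ((1u << PAGE_SHIFT) - 1));
    return true;
}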

So in a modern computer with MMUs, caches, and DRAM, sequential reads in particular, but also writes, are likely to be faster than random accesses. The difference can be dramatic. The first transaction in a sequential read or write is, at that moment, a random access, as it has not been seen ever or for a while; once the sequence continues, though, the optimizations fall into place and the next accesses are noticeably faster. The size and alignment of your transaction play an important role in performance as well. While there are so many non-deterministic things going on, as a programmer with this knowledge you can modify your programs to run much faster, or, if unlucky or on purpose, modify them to run much slower. Sequential access is, in general, going to be faster on one of these systems; random access is going to be very non-deterministic. array[4]=12; followed by array[37]=12; those two high-level operations could take dramatically different amounts of time, both in the computation of the write address and in the actual writes themselves. But, for example, discarded_variable=array[3]; array[3]=11; array[4]=12; can quite often execute significantly faster than array[3]=11; array[4]=12;
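If you want to see the effect yourself, a rough microbenchmark along these lines (my own sketch; the exact ratio depends entirely on the machine) will usually show the scattered pattern running several times slower than the sequential one:

#include <stdio.h>
#include <stddef.h>
#include <time.h>

#define N (1 << 24)                 /* 16M ints, ~64 MiB: bigger than any cache */

static int data[N];

static double seconds(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)              /* sequential: cache/prefetch friendly */
        data[i] = 12;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    size_t idx = 1;
    for (size_t i = 0; i < N; i++) {            /* scattered: mostly cache misses */
        idx = (idx * 1103515245u + 12345u) % N;
        data[idx] = 12;
    }
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("sequential: %.3fs  scattered: %.3fs\n",
           seconds(t0, t1), seconds(t1, t2));
    return 0;
}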