How to Parse the Bitcoin BlockChain

      有的没的 2014-6-26 17:03:00

 

How to Parse the Bitcoin BlockChain


(Note as of January 8, 2014. A lot of people have been requesting a pre-built binary for the parser that has a command line interface. I just created one (for Windows 64bit machines only) and uploaded it. Here is the article documenting the tool with the download link. You can build it directly from source code yourself if you want to sync to the project.)

This article is going to describe in some detail how to parse the physical blockchain directly from your machine. The target audience for this article is developers, students, and researchers interested in writing their own software to directly parse the bitcoin blockchain data and interpret all of the transactions contained within. The scope of this article is to discuss how to read the blockchain data, how to interpret all inputs, outputs and transactions as well as to cover some nasty undocumented 'gotcha's' that you may encounter.

I must give thanks for the article '285 bytes that changed the world' by James that helped get me started figuring all of this out.

My motivation for writing this article is because it took me many weeks looking in multiple locations and also a lot of debugging to figure all of this out and I want to save others all of that time and hassle.

Corresponding with this article is the open source bitcoin parser that I have written and previously published on this site. If you don't want to write your own parser from scratch, you are welcome to take a look at mine which is documented here.

The scope of this document, however, is not to discuss how blocks on the blockchain are mined or how transactions are validated. Since this article describes how the blockchain is physically stored on your machine this means that it refers to blocks which have already been mined and transactions which have all already been validated. Therefore you don't need to know the details about mining and transaction validation simply to interpret the raw transaction data itself.

To get a copy of the blockchain on your machine it is necessary to run the official bitcoin blockchain wallet application called 'Bitcoin-QT'. When this application starts up, it will begin to download the entire bitcoin blockchain to your hard drive which, as of January 4, 2014, is comprised of 106 roughly 128 megabyte files totaling over 14 gigabytes of data. It could take days for your machine to download the entire blockchain.

As of January 4, 2014 this 14 gigabytes of data contains:

278,619 : Blocks 30,418,338  Transactions 66,951,352 Transaction Inputs 74,801,402 Transaction Outputs 

The raw blockchain data files are stored in the following locations on your hard drive:

Linux: ~/.bitcoin/blocks MacOS: ~/Library/Application Support/Bitcoin/blocks Windows: %APPDATA%Bitcoin\blocks   WinXP: C:\Documents and Settings\YourUserName\Application data\Bitcoin\blocks   Win7/Win8/Vista: C:\Users\YourUserName\AppData\Roaming\Bitcoin\blocks 

They will appear as a series of 128mb files blk00000.dat through blk00???.dat.

The following documentation is a more detailed description of the labeled items in the image above.

Block Header : Section B1 : Magic ID : Uint32 (4 bytes) 

The first 4 bytes of a block in the blockchain file is a 32 bit header referred to as a 'magic id'. This is simply an identifier to let us know that we are at the beginning of a block. This magic ID will always be the number 0xD9B4BEF9. If you encounter a value other than that, you probably have a bug in your parser.

It is important to note that it is possible that instead of finding the magic ID, you instead find a bunch of zero bytes. In my own copy of the blockchain I have encountered a case when parsing through a .dat file, where the header is missing. Instead there is a large block of zero bytes and then the blockchain picks up again later. I don't know why/how this occurs.

It is also important to note that when you run the Bitcoin-QT client for the first time, it will download the entire blockchain to your hard drive, with no missing gaps between blocks or orphan blocks. However, as the bitcoin-QT is continuously running on your machine, it will introduce gaps into the .dat file as well as occasionally write out orphan blocks. Your parser will need to take this into account.

Block Header : Section B2 : Block Length : Uint32 (4 bytes) 

The next 4 bytes of the blockheader contain the length of this block. Currently the maximum size of a single bitcoin block is 1mb; which is only big enough to support approximately 7 transactions per second, which quite frankly is rather disappointing. At a future time the block size may be increased but, if that were to occur, the already massive block chain would blow up out of control. This is a major issue for the bitcoin protocol. It cannot in any way support a massive number of transactions which could accommodate global commerce.

For today, however, you will never see a single block take up more than one megabyte of memory.

Another important note needs to be made here. The block-length is not equal to the amount of data in the block! Meaning, after you read the last entry in a block, that offset often can be, and will be, less than the total length of the block reported. Your parser must take this into account and seek the file pointer to wherever the next location in the file is indicated by the block length value.

Block Header : Section B3 : Version Number : Uint32 (4 bytes) 

The version number in the blockchain is always set to 1 in all cases. In the future this version number might get bumped and, if that were to occur, the layout of the data could, and likely would, change accordingly.

Block Header : Section B4 : Previous Block Hash : 32 bytes 

This entry contains a 32 byte hash of the 'previous' block. Think of this like a linked list. Each block in the blockchain points to the 'previous' block. It is important to note that the 'previous' block may not always be, well, the previous block! The blockchain can and sometimes will contain what are called 'orphan' blocks.

Let me reiterate, you may encounter two blocks which both point to the same 'previous' block! How do you know which one to use? I explain that in more detail later in this article.

An orphan block naturally occurs when two miners solve the same block at roughly the same time. They are both technically 'valid' however, only one of the two blocks is accepted into the main chain. As you are scanning the blocks in the bitcoin blockchain, you will need to know how to link the blocks which are in the main chain by using this hash.

It is important to note that the hash of a block is NOT included in the block itself! It is a computed value. I will discuss in some detail later in this document how this hash is computed and how you go about patching up the linked list.

A note about hashes. If you look at this hash in a memory dump, it will display in reverse order of what you see in the blockchain.info and blockexporer websites. For example, here's a link to block #0 (the genesis block) on the Blockchain.info website. Note that it displays the hash as

000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f 

However, if you compute the hash and look at it in memory, it will show up as:

6fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d6190000000000 

This is because the blockchain.info and blockexplorer websites display all of these values in 'big endian' format. This may be confusing at first when you are trying to debug your code.

Block Header : Section B5 : MerkleRoot : 32 bytes 

The next part of the block is a 32 byte hash called the 'merkle root'. The merkle-root hash is not needed to parse the bitcoin transactions. It is an optimization feature used by the Bitcoin-QT application and some other applications to rapidly access transaction data without needing to have full access to the entire blockchain transaction history in memory. You can read a discussion on the topic here.

Block Header : Section B6 : TimeStamp : uint32_t : 4 bytes 

This is the time stamp value indicating when this block was created. This is the only time value available in the bitcoin blockchain system and this also presents a huge problem. The timestamp is going to correspond to when the block was generated which is roughly once every ten minutes. At first you might think this is ok, since a ten minute resolution might be sufficient. However, it is not. Transactions are processed by miners in order of priority. The transactions contained in a block might, in fact, be much older than the block time itself! Transactions which do not include a fee could be many hours older than than the block they were finally included in.

I do not know why Satoshi did not decide to include a transaction time stamp. It's only 4 bytes, and it is a valuable piece of information that people really need access to. For now, all you can do is use the timestamp of the block and hope that's close enough for government work. And by government work, I mean close enough for tax and accounting purposes. Since the value of bitcoin is very volatile and can go up and down by massive amounts over a short period of time, clearly this coarse and inaccurate timing value is woefully inadequate.

This really needs to be fixed soon in my opinion.

This time value is in ANSI standard C 'time_t' format. You can learn more about it here.

Block Header : Section B7 : Target Difficulty : uint32 : 4 bytes 

This value represents the target difficulty for this block. This information can be ignored for the purposes of processing transactions. If you want to learn more about it, follow this link.

Block Header : Section B8 : nonce : uint32 : 4 bytes 

The 'nonce' value is a random number value used as part of the mining process. It is not needed to be able to parse and interpret transaction data and may be ignored for our purposes here. If you want to know more about it's use and meaning, follow this link.

Block Header : Section B9 : TransactionCount : Variable Length Integer, 1, 3, 5, or 9 bytes 

I'm not quite sure why Satoshi decided to use variable length integers. Maybe to save memory? So he could code 4 byte integers into 1 bytes? It's hard to say, because it offers very little value that I can see. Nevertheless, this is how he did it, and your parser will need to take it into account.

You can learn about how these variable length integers are coded by following this link.

You can see an example implementation in this source code. Search for the method:

"readVariableLengthInteger" 

We are now at the portion of the block which contains the transactions. For each transaction you will find the following data:

Transaction Section : T1 : Transaction Version Number : uint32 : 4 bytes 

The transaction version number is currently always expected to be one or two. However, at some point during the lifetime of the block chain several blocks were accepted with some completely garbage transaction version numbers. To be safe, your parser should expect a version number of one or two but not fail if it sees something else.

Transaction Section : T2 : The Number of Inputs : Variable Length Integer 

This is a variable length integer which designates the number of inputs contained in this transaction.

Input Section : I1 : Transaction Hash : 32 bytes 

This is the key magic piece needed to understand how inputs are handled. These 32 bytes represent a hash of some previous transaction in the block chain. It is very important to note, this transaction hash is stored nowhere in the blockchain itself!! It is a computed value. Later on in this article I will discuss in detail how this transaction hash is computed and how you can hook the inputs up correctly.

Input Section : I2 : Transaction Index : uint32 : 4 bytes 

Just as the previous value referred to a particular transaction this value says which output of that transaction comprises this input. If the transaction index is 0xFFFFFFFF then it means that this input refers to no previous output. You might wonder, who could such a thing occur. Well, it's because newly mined blocks get a block-reward of some value. Early on in the days of bitcoin this block-reward was equal to 50 bitcoins. Today is is worth 25 bitcoins, and down the road it will be worth half of that.

This is how new bitcoins are 'printed' or issued into the system. So, it is completely normal and expected that in a block the first input will refer to no previous output. The term often used for this is a 'coinbase' input.

Here is a link showing the single transaction of the genesis block in the blockchain. You will see that the input contains 50 bitcoins and refers to no previous output.

If the transaction index is non-zero, then it refers to a particular output in some previous transaction. It is important to note that inputs always completely spend all of a previous output.

This might seem confusing at first, but let me explain. Let's say you have a prior output corresponding to your address which has 10 bitcoins in it. Now, you decide that you want to send someone 2 bitcoins. You must spend all of the 10 bitcoins from the previous output. So, to send 2 bitcoins to someone you will create a transaction which has one input (the 10 bitcoins you previously had) and then two outputs. One of those outputs will contain 2 bitcoins for your friend and the other output will send the change of 8 bitcoins back to your address. Every set of transaction inputs fully spends some previous outputs in time.

If the total amount of bitcoins in the inputs is greater than the total number of bitcoins spent in the outputs, then whatever remains is known as a 'miner fee' and is awarded to the miner who created this block and included this transaction into the blockchain.

Input Section : I3 : Script Length : Variable Length Integer 

This is a variable length integer which contains the length of the input script.

Input Section : I4 : Script Data : (length bytes) 

This is the raw data which comprises the 'input script'. It is important to note, to parse all of the transactions in the block chain you do not need to know anything about the input script! You can simply ignore the contents of this data. The input-script combined with the output script, which we will cover in a moment, are used by the bitcoin system to validate transactions. However, we do not need to validate transactions. Since we are parsing the completed blockchain, we know that all of these transactions have already been validated.

If you want to learn more about what this script data contains and how it is used by the bitcoin protocol anyway, you can read more about it here.

Input Section : I5 : Sequence Number : uint32 :  4 bytes 

The final piece of data for an input is the 'sequence number' which is currently always assigned to the value of 0xFFFFFFFF. You can ignore this field.

Transaction Section : T3 : Output Count : Variable Length Integer 

Once all of the inputs have been processed, the next thing you will encounter is the variable length integer which describes the number of outputs in this transaction.

Output Section : O1 : Value : uint64 : 8 bytes 

This is the most fun part of the whole thing. We finally get to the money! This 64 bit unsigned integer represents the output value measured in 'Satoshis'. One Satoshi is equal to one hundred millionths of a bitcoin. So, for example, the first output in the genesis block will show up as 5,000,000,000 which is 50 bitcoins.

Output Section : O2 : Script Length : Variable Length Integer 

This is a variable length integer which specifies the length of the output script.

Output Section : O3 : Output Script : length bytes 

This is where the magic sauce happens. Even though we don't have to execute the script to validate it, we do need to inspect the script to figure out where the output goes.

The public key in the output script will be stored in one of two forms. Either as the full 65 byte public key or as the 20 byte hash of the public key. Early on most public keys were stored as 65 bytes but, later, the vast majority of all transactions store the public key in the 20 byte form both to save memory and for added security.

You can learn more about bitcoin public key addresses by looking at these links.

ECDSA key (Elliptic Curve Digital Signature)

https://en.bitcoin.it/wiki/ECDSA  http://en.wikipedia.org/wiki/Elliptic_Curve_DSA 

You can decipher the public key for almost all of the outputs in one of the following few forms without necessarily having to write a full script interpreter. These usage patterns appear to be able to decode essentially every single valid public key address in the entire blockchain. There is a chance that scripts could change in form or be more complicated in the future, but for now these seem sufficient.

Format 1 : 67 byte long output script containing a full ECDSA 65 byte public key address.

If you see an output script with a length of 67 bytes, the first byte is equal to 65 (indicating the length of the public key which follows), the next 65 bytes after the first byte are the public key, and the 67th byte (array index 66) is equal to 0xAC (172) which is the 'CHECKSIG' opcode.

Script[0] =65 : Length of the public key to follow Script[1-65]=The public key data Script[66]=OP_CHECKSIG (0xAC) 
Format 2 : 66 byte long output script. Contains a 65 byte public key address.

This script I believe is technically invalid. The 'length' field is missing, which is required for a valid script to execute. Nevertheless, this usage pattern does appear in the blockchain and is properly interpreted by a number of applications, so we need to accept it as well. This is in the same as format 1 except the first byte, the length field, is missing. In this use case the 66 byte long script begins with 65 bytes of the public key followed by the CHECKSIG opcode of 0xAC.

Script[0-64]=The public key address Script[65]=OP_CHECKSIG (0xAC) 
Fomat 3 : Script is 25 bytes long or more, contains a 20 byte public key hash address. This is the most common format of the vast majority of all output scripts.

The early blocks in the block chain will be in format #1 and format #2 but, after a while, most of the scripts will be in this form. They will be 25 bytes or more long and begin with the following pattern.

Script[0] = OP_DUP (0x76) Script[1] =OP_HASH160 (0xA9) Script[2] =20 (The length of the public key hash address which follows) Script[3-24] = The 20 byte public key address. 
Format 4 : Script is 5 bytes long and contains no public key.

This script is in error. It is invalid and represents an unspendable address. It is documented here, however, because it does show up in the blockchain a number of times.

Script[0] = OP_DUP (0x76) Script[1] = OP_HASH160 (0xA9) Script[2] =0 (A length of zero, not valid!) Script[3]=OP_EQUALVERIFY (0x88) Script[4]=OP_CHECKSIG (0xAC) 
Format 5 : If the script doesn't readily match any of the previous patterns, then you can search for it by scanning the output script for the following pattern.

Look for any place in the script where this pattern shows up:

Script[0] = OP_DUP (0x76) Script[1] = OP_HASH160 (0xA9) Script[2] =20 (A length value equal to the size of a public key hash) Script[3-22]=The 20 byte public key address Script[23]=OP_EQUALVERIFY (0x88) Script[24]=OP_CHECKSIG (0xAC) 

If you see this, then you have identified the public key for this output script. In reality the vast majority of all output scripts will be in the form of format #3.

Finally, a note about how to interpret a public key.

As far as your code is concerned, you should only deal with the 20 byte public key hash. The question immediately arises about what to do if you encountered the 65 byte form, how do you convert it to the 20 byte public key version?

That process is documented in the following source code located here; BitcoinAddress.cpp, the header file BitcoinAddress.h is located here.

Look for the method 'bitcoinPublicKeyToAddress'. You will see that this method accepts a 65 byte ECDSA key and converts it into the 25 byte public key hash form. You might immediately wonder, wait..I thought the public key was 20 bytes? Well, it is in fact 20 bytes. However, as a matter of convention, bitcoin software adds a prefix header byte at the beginning and a 4 byte checksum at the end. These are derived/computed values and do not need to be stored.

Let's say you have a 20 byte public key but you need the full 25 byte form, then you can use the routine: 'bitcoinRIPEMD160ToAddress' to see how the header and checksum are added.

Of course, ultimately, probably what you really care about is seeing the public key address in the ASCII form that is everywhere on the Internet.

For example, here is a public key address: 115syJ2uJ4y5c2a6Ec6iWeeijZctBp955Q

You can use this public key address to send me delicious and tasty bitcoin tips for writing this long and painful article.

So, how would you convert this public key address to the binary form? In the source code provided above, you would call the routine 'bitcoinAsciiToAddress' which will convert from the ASCII Base58 format to the 25 byte form, and if you only need the 20 byte version you skip the first header byte and just use the 20 bytes following.

To convert the 25 byte address to the ASCII form we are used to seeing, you can use the routine 'bitcoinAddressToAscii'.

The source code above is in 'snippet' form. What this means is that you can use it only with the header and implementation CPP provided. It has no other dependencies and is wholly self contained.

This source code contains fairly extensive documentation on the topic of converting between 65 byte public key to the 20/25 byte RIPEMD160 public hash address form as well as into and out of Base58 ASCII.

And, once again..that tip address is: 115syJ2uJ4y5c2a6Ec6iWeeijZctBp955Q

Transaction Section : Transaction Lock Time : uint32 : 4 bytes

The final data item at the end of each transaction is called the 'transaction lock time'. Currently this value is always set to zero, and we don't need to worry about it for now.

Important notes once you have finished reading the last transaction! Please read!

Once we have consumed the final transaction, this brings us to the end of the logical block. However, and this is important to note, we will not necessarily bring us to the end of the physical block! The 'block length' specified at the beginning of this block may actually go beyond the end of the last transaction which was consumed. That is why it is important that you read the entire block into memory rather than just reading each transaction and expecting the file pointer to be in the correct location for the next block.

Once you advance past 'block-length' worth of data, you should expect to either (A) reach the end of file, in which case you should try to open the next .dat file in sequence or (B) find a new header block magic id.

However, there are some rare cases where after skipping ahead in the file to where the start of the next block header should be, instead you will encounter a block of zero bytes. I do not know why/how this can occur, but I assure you that it can occur and your parser will need to take this into account.

If instead of finding the magic-ID header word you encounter a bunch of zero bytes, you will need to skip scan forward in the file until the magic-ID header appears and continue on the block chain from there.

First a Note about Hashes

The rest of this article is going to discuss how you go about putting all of the blocks and transactions together so that you can interpret the flow of data. However, before I can do this, I need to provide some background explanation on the topic of hashes.

Here is the Wikipedia article on the topic of Hash functions. Now, you can read all of that of course, but I'm going to just give a quick thumbnail sketch version here.

A hash is an operation on a set of variable length input data which produces an output signature that is typically much smaller in size. I suppose you could think of a social security number as a hash. Every social security number represents one person, but of course, that number doesn't tell you anything about the person himself.

It is important to note that hash functions are almost always irreversible. While a hash function provides a unique id/signature for a set of input data, you cannot reconstruct the input data from it!

There are a wide variety of hash functions available. Hash functions typically take the input data and manipulate it in such a way that the output hash is extremely unique, it does this using a variety of binary operations which scramble the bits of the input data into the output hash signature.

The goal of a good hash function is that given two different sets of input data they rarely, if ever, produce the same hash value. If two sets of input data produce the same hash value that is what is referred to as a 'collision' and it is not desirable.

The input of a hash function is any variable length of binary data. The output is a fixed size hash value. The smaller the hash output, the more likely there is to be a collision. The bigger the hash value, the less likely a collision can occur.

So, let's take an example of the most simple hash imaginable. Let's say your hash function is simply the one byte exclusive-or value of the input data.

Two things. First, this isn't a very good hash function and, second, you would get a lot of collisions. In fact, even with a perfect hash function you would get one collision on average every 256 times.

Now, let's say we use a 16 bit hash. Now you can expect collisions to occur (two different sets of input data producing the same output hash) once every 65,536 times. Now consider 32 bit, now 64 bit, etc. etc.

Well, bitcoin uses a 256 bit hash called SHA256. Here is a link to a Wikipedia article about it. Here is a link to an implementation of it that I have provided. The header file SHA256.h and the implementation SHA256.cpp. This implementation of SHA256 was created by Zilong Tan in 2011 and he released it under MIT license. He derived it from code originally written by Allan Saddi. All I did was repurpose the code for use in this bitcoin parser project.

You don't need to understand how SHA256 works to use it any more than you need to know how the arc-tangent function works on your calculator to use it. There is some concern as to whether or not SHA256 has been compromised by the NSA. SHA, which stands for 'secure hash algorithm', was in fact developed by the NSA. There are a lot of paranoid conspiracy theories out there on this topic. I can't really comment on them as it's not been my field of expertise. I take some solace from the fact that Edward Snowden is on the record as stating that 'strong encryption works', so let's just hope that he's right.

So, the theory goes that if SHA256 is a 'perfect' hash function then the chance of collision is so low that you don't even need to consider it as a possibility.

This is the magic of large numbers and cryptography in general. The human mind can grasp some numbers, like 1, or 10, or 100, or a thousand. The human mind can kind of grasp a million or even a billion in an abstract way. However, the human mind cannot grasp truly large numbers. Numbers on the order represented by these hashes are so large we cannot even begin to understand them.

So, apparently, the chance of a hash collision with the hash functions used by bitcoin are so remote that for all practical purposes we can ignore it. I tried to think of an analogy that people could kind of relate to and this is the best I came up with. The chance of guessing the private key for someone's bitcoin public address would be like winning the lottery every single day, for six days in a row, and then on the seventh day getting hit by lighting. That means you would have to win a major lottery, every single day, one after another, and, only then, once that happened, get struck by lighting. I hope this gives you some sense of how remote an event it would have to be for a hash collision to occur; it's considered for practical purposes such an unlikely event that we can think of it as virtually impossible.

And now a discussion about HashMap or HashSet container classes

So, now that we have discussed the concept of how to compute a hash for a given set of input data, you need a way to do a 'look up'. A HashMap or HashSet can be thought of as a 'sparse' array or associative container.

Let's say you wanted to do an array look up of 10,000 items. No problem. You create an array with 10,000 elements and you index them.

However..what if your array lookup is a 256 bit hash? Yeah, not gonna happen. You would need an array so large it's absurd to even discuss it.

But, it turns out, that you can access it kind of like an array using what is called a 'HashMap'

The way it works is like this. You actually do create an array, one that is reasonable in size. Say we make an array that contains 8,192 elements. Now, when you want to 'find' an object based on a 256 bit hash (32 bytes), what you do is instead just compute a 13 bit hash of that 256 bit hash. (13 bits is from 0-8191).

You can now use this 13 bit hash an an array index and if there is nothing there, great, you add the object. If there is already something there (a collision occurs) then you add this element onto a linked list which begins at the array index. Collisions are expected to occur here, but they should happen with such a low frequency that it is acceptable. You need to strike a balance on the size of your hash-map array with the number of collisions you can live with. If your array is too small, then collision might happen frequently making the hash-map performance unacceptably slow. General purpose hash containers typically 'grow' the base hash array as collisions occur over time.

With a good hash which minimizes collisions, you can do extremely fast array like access even with massive keys.

To do a lookup you generate the 13 bit hash, index the array, and then walk the linked list there to return any match found.

Now, a final note on this topic, the STL (standard template library) comes with a built in hash_map container class. This sounds great, right? Why not just use it?

Well, you can, and I'm not saying your shouldn't. However, it is going to cause problems. Most general purpose container classes like the STL have to handle deletions as well as insertions. They do an enormous amount of heap memory allocation, often allocating memory for every single entry in the hash map. When you are dealing with data sets like bitcoin which currently contains on the order of 30 million transactions, that is going to create a problem. If you try to insert 30 million transaction hashes into an STL hash_map it's going to be slow and consume outrageous amounts of memory.

My recommendation is to implement a simple hash template that does fixed memory allocation and knows that you are only adding to the hash, never removing. In this way you can allocate the memory you need to store all transactions up front just one time, with no additional overhead. You can also associate a compressed index/transaction number id based on the location of this fixed sized array.

It's your call either way, but that is how I implemented the hash table lookups in my bitcion parser code. In the blockchain.cpp file search for occurrences of the word 'SimpleHash' to see the implementation.

How to build the list of blocks in the main blockchain

Ok, now that you have a background in how hashes and hash maps work, we can discuss how to build the blockchain. Earlier in this article we explained how to read in a block from disk.

Once you have found the magic id, and read the block length, you must then read the block-prefix.

This is the 80 byte block prefix structure:

struct BlockPrefix {  uint32_t mVersion;     // The block version number.  uint8_t  mPreviousBlock[32];   // The 32 byte (256 bit) hash of the previous block in the blockchain  uint8_t  mMerkleRoot[32];   // The 32 bye merkle root hash  uint32_t mTimeStamp;     // The block time stamp  uint32_t mBits;      // The block bits field.  uint32_t mNonce;      // The block random number 'nonce' field. }; 

Here is an example of a struct that would hold a 256 bit (32 byte) SHA256 hash:

struct Hash256 {     uint8_t hash[32]; }; 

Once you have read this in you must compute the block-hash.

You do this by first computing the SHA256 hash of this 80 bytes of data. This results in an output 32 byte hash. You must then compute the SHA256 hash of this 32 bytes. This double hash is the 'block-hash'. Whenever any block refers to some 'previous' block hash it will be this computed value that it is referring to.

Here is what this would look like in code:

BlockPrefix prefix; fread(&prefix,sizeof(BlockPrefix),1,fph); Hash256 blockHash; computeSHA256(&prefix,sizeof(BlockPrefix),&blockHash); computeSHA256(&blockHash,sizeof(blockHash),&blockHash); 

Remember that if you print out the contents of the 32 byte 'blockhash' you must print it in reverse order to make it match up with the display in blockchain.info or blockexplorer

Once you have computed this hash you need to add it to a hash map. The hash map entry should be something like the following structure:

struct BlockHeader {    FILE *mFph;  // the file pointer where this block is located    uint32_t mFileOffset;  // the offset in the file where this block is located.    Hash256 mPreviousBlockHash; // the previous block hash of this block. }; 

You can then add this header to the hash map like this:

mBlockHeaders[ blockHash ] = header; 

Once you have done something like this you will have created a hash map of every block found in the block chain.

The next step is to actually build the main chain.

You do this by starting with the last block in the block chain. Starting with the last block you will then do a hash lookup for the 'previous block'. You do this until there is no 'previous' block, which will be the first block, block #0, the genesis block in the blockchain. When you do this, you will likely end up skipping some of the block headers you previously scanned in. Those blocks you skipped are considered 'orphan' blocks and you can ignore them.

Now you should have an ordered list of the bitcoin block chain from the first to the last. This hash map should not take up an enormous amount of memory, perhaps a few megabytes. With it, you can now seek to a specific file location and read the raw block data into memory and parse it in more detail from there.

Processing Transactions

Ok, now that you have built a list of every block in the block chain you are ready to process the transactions. Just as you had to build a hash map for each of the blocks, you are going to have to build a hash map for each of the transactions, since each input refers to the transaction hash of the previous output.

The transaction hash is computed as follows:

Take the block of memory that encompases the entire transaction, from transaction version number to the transaction lock time. You take this chunk of memory and compute the double SHA256 hash just like you did for the block prefix hash. You can then add this hash into a hash-map so that when processing inputs which refer to previous outputs you can look them up. You can store in that hash-map the file seek location so that you can re-read it from disk if necessary.

For each block, you simply iterate all inputs and outputs to produce the details of the flow of money for each transaction.

Since there are 30 million transactions to date, this is going to be a pretty damned big hash map. The blockchain also has this thing called a 'merkle-tree' which I believe is designed to help deal with this problem. I have not myself used the merkle-tree since I was happy just reading the whole damned thing into memory.

In the future, some day, when there are billions of bitcoin transactions it would be impossible to hold all of this in memory and it would have to be replaced with a database.

In fact, it's perfectly reasonable approach to just store all of this as data in an SQL database as you parse it. I'm sure a lot of people do that sort of thing.

Processing Addresses

If you want to know the 'balance' of any public key address, you do this by collating all inputs and outputs of all transactions associated with that address. Once again having a hash map may be useful. You can build a hash-map indexed by the 20 byte public key hash address. The entry for each address would contain the list of transactions which have either inputs or outputs (or both) associated with that address.

To compute the balance you simply iterate through the inputs and outputs and compute the sum,.

标签集:TAGS:
回复Comments() 点击Count()
喜欢就顶一下

回复Comments

{commentauthor}
{commentauthor}
{commenttime}
{commentnum}
{commentcontent}
作者:
{commentrecontent}