Monday, November 8, 2010

MP4 File Format Part 1

These are notes on the MP4 file format, based on the 2005 edition of ISO/IEC 14496-12, Information technology - Coding of audio-visual objects - Part 12: ISO base media file format.

They are not meant to explain every atom in detail. For detailed information, please read the ISO/IEC 14496-12 document.

General Format

In general, an MP4 file has the following structure:

  • File type box (ftyp), which denotes the MP4 media type
  • Media data box (mdat), which contains the actual AV frames
      • Within an mdat, there are chunks and samples
  • Movie box (moov), which is the container for all metadata
      • Each moov has an mvhd (movie header box)
      • It can contain N trak boxes. Each trak box contains media-specific metadata. Usually there are 2 tracks (video and audio)
      • More importantly, each trak holds the sample information such as stsd, stts, stsz, stsc, stco, etc.
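To make the box layout concrete, here is a minimal sketch of a box walker. It assumes the standard box header (a 32-bit big-endian size followed by a 4-character type, with size 1 meaning a 64-bit size follows and size 0 meaning the box runs to the end of its container). The names iter_boxes and make_box, and the tiny in-memory file, are made up for illustration and are not from the tool used for the screenshots.

```python
import struct

def iter_boxes(data, start=0, end=None):
    """Yield (box_type, payload_start, payload_end) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A 64-bit "largesize" follows the type field.
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:
            # The box extends to the end of the enclosing container.
            size = end - pos
        if size < header or pos + size > end:
            break  # malformed or truncated box: stop rather than guess
        yield box_type.decode("ascii", "replace"), pos + header, pos + size
        pos += size

def make_box(box_type, payload=b""):
    """Build a box with a 32-bit size header (hypothetical helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

# Synthetic file: ftyp, mdat, and a moov holding an mvhd and one empty trak.
sample = (make_box(b"ftyp", b"isom" + bytes(4))
          + make_box(b"mdat", bytes(10))
          + make_box(b"moov", make_box(b"mvhd", bytes(100)) + make_box(b"trak")))

top_level = [box_type for box_type, _, _ in iter_boxes(sample)]
print(top_level)
```

Container boxes such as moov and trak can be walked with the same function by passing the payload start and end of the parent box.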

Mdat Atom

MPEG4 sample

H.264 sample

Mdat is the media data atom, which contains the video and audio frames. As the screenshot shows, the data is split between 2 tracks (video and audio). Each track has multiple chunks, and each chunk has multiple samples. Usually, you can treat each sample as one AV frame.

The number of samples in each chunk is defined in the stsc atom (sample to chunk box), and the chunk offset is defined in the stco atom (chunk offset box).

For MPEG4 (see MPEG4 sample), the red box denotes the start code of the MPEG4 elementary stream. ISO/IEC 14496-14 states that MPEG4 media data is stored as access units, with a range of contiguous bytes for each access unit (a single access unit is the definition of a 'sample' for an MPEG-4 media stream). See 3.1.1 of that document.

For H.264 (see H.264 sample), the red box denotes the 4-byte length field that gives the size of the NAL unit that follows. The blue box is the start of the frame, in this case an H.264 non-IDR frame. ISO/IEC 14496-15 states that each NAL unit in an H.264 sample must be preceded by a length field. See 5.2.3 of that document.
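As a rough illustration of the length-prefixed layout, the sketch below splits one H.264 sample into its NAL units. It assumes a 4-byte length field, as in the screenshot (the actual width is declared in the track's AVC configuration). The function names and the demo bytes are made up for illustration.

```python
# A few common NAL unit types (low 5 bits of the first NAL byte).
NAL_TYPES = {1: "non-IDR slice", 5: "IDR slice", 7: "SPS", 8: "PPS"}

def split_nal_units(sample, length_size=4):
    """Split one length-prefixed H.264 sample into a list of NAL units."""
    nals = []
    pos = 0
    while pos + length_size <= len(sample):
        nal_len = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        nal = sample[pos:pos + nal_len]
        if len(nal) < nal_len:
            raise ValueError("truncated NAL unit")
        nals.append(nal)
        pos += nal_len
    return nals

def nal_type(nal):
    """Return nal_unit_type, the low 5 bits of the NAL header byte."""
    return nal[0] & 0x1F

# Synthetic sample: a 3-byte NAL starting with 0x41 and a 2-byte NAL starting with 0x65.
demo = ((3).to_bytes(4, "big") + b"\x41\x9a\x00"
        + (2).to_bytes(4, "big") + b"\x65\x88")
types = [nal_type(n) for n in split_nal_units(demo)]
print([NAL_TYPES.get(t, "other") for t in types])
```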

STSC - Sample To Chunk Box

The stsc tells you the number of samples in each chunk. To read it, take the first chunk and samples per chunk fields together: each entry applies from its first chunk up to, but not including, the first chunk of the next entry. In the screenshot, first chunk has the values 1, 3, 5, 6, ... and samples per chunk has 4, 5, 4, 5, ... This means the following:

chunks 1 - 2 have 4 samples each
chunks 3 - 4 have 5 samples each
chunk 5 has 4 samples

and so on...
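The run-length reading above can be sketched in a few lines. This is only an illustration: the entries are the four values quoted from the screenshot, the total chunk count of 7 is an assumption (in a real file it comes from the stco entry count), and the sample description index that each stsc entry also carries is left out.

```python
def samples_per_chunk(stsc_entries, chunk_count):
    """Expand (first_chunk, samples_per_chunk) runs into one count per chunk."""
    counts = []
    for i, (first_chunk, per_chunk) in enumerate(stsc_entries):
        if i + 1 < len(stsc_entries):
            next_first = stsc_entries[i + 1][0]
        else:
            # The last entry applies through the final chunk of the track.
            next_first = chunk_count + 1
        counts.extend([per_chunk] * (next_first - first_chunk))
    return counts

entries = [(1, 4), (3, 5), (5, 4), (6, 5)]  # values quoted in the post
counts = samples_per_chunk(entries, chunk_count=7)  # chunk_count is assumed
print(counts)  # [4, 4, 5, 5, 4, 5, 5]
```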

STCO - Chunk Offset Box

This box tells you the location of each chunk. The offset is measured from the start of the file. In the screenshot, it has the values 1516, 4880, ...

As this is a video track, the first video chunk starts at byte offset 1516 of the file.
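For reference, a minimal sketch of reading the stco payload: one version byte and three flag bytes, a 32-bit entry count, then one 32-bit big-endian offset per chunk. The function name parse_stco is made up, and the payload is built by hand from the two offsets quoted above. Files that need 64-bit offsets use the co64 box instead, which this sketch does not handle.

```python
import struct

def parse_stco(payload):
    """Parse an stco payload (the bytes after the 8-byte box header)."""
    _version_and_flags, entry_count = struct.unpack_from(">II", payload, 0)
    return list(struct.unpack_from(f">{entry_count}I", payload, 8))

# Hand-built payload holding the two offsets quoted in the post.
payload = struct.pack(">II", 0, 2) + struct.pack(">2I", 1516, 4880)
offsets = parse_stco(payload)
print(offsets)  # [1516, 4880]
```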

STSZ - Sample Size Box

This box tells you the size of each sample in the track. It also tells you the number of samples in the track.

If you look at the entry size, it states 2229, 529, ...

That means the first sample is 2229 bytes and the second sample is 529 bytes.
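Putting stco, stsc and stsz together gives the file position of every sample, since samples within a chunk are stored back to back. The sketch below is only an illustration: the chunk offsets and the first two sample sizes are the values quoted in this post, while the remaining sizes and the per-chunk counts are made-up placeholders.

```python
def sample_offsets(chunk_offsets, chunk_sample_counts, sample_sizes):
    """Return a (file_offset, size) pair for every sample, in track order."""
    located = []
    sample_index = 0
    for chunk_offset, count in zip(chunk_offsets, chunk_sample_counts):
        pos = chunk_offset
        for size in sample_sizes[sample_index:sample_index + count]:
            located.append((pos, size))
            pos += size  # samples inside a chunk are contiguous
        sample_index += count
    return located

chunk_offsets = [1516, 4880]       # from stco (values quoted in the post)
chunk_sample_counts = [4, 4]       # from stsc, expanded to one count per chunk
# Only 2229 and 529 come from the post; the other sizes are placeholders.
sample_sizes = [2229, 529, 300, 250, 400, 410, 420, 430]

located = sample_offsets(chunk_offsets, chunk_sample_counts, sample_sizes)
print(located[:2])  # [(1516, 2229), (3745, 529)]
```

With these positions, reading a sample is just a seek to the offset followed by a read of the given size.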

STSD - Sample Description Box

This box tells you the codec type, its initialization data, and any other information required to decode the track.

As you can see in the screenshot, it contains the AVC configuration box, which holds the information (SPS, PPS, etc.) required to decode this video track.
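As a rough sketch of what the AVC configuration box carries, the code below parses the basic fields of an AVCDecoderConfigurationRecord (the payload of the avcC box): profile and level bytes, the NAL length field size, and the SPS and PPS lists. The function name parse_avcc and the demo record are made up for illustration, and any extension fields that may follow the PPS list are ignored.

```python
import struct

def parse_avcc(record):
    """Parse the basic fields of an AVCDecoderConfigurationRecord."""
    version, profile, compat, level, b4, b5 = struct.unpack_from(">6B", record, 0)
    pos = 6
    sps_list = []
    for _ in range(b5 & 0x1F):          # low 5 bits: number of SPS entries
        (n,) = struct.unpack_from(">H", record, pos)
        sps_list.append(record[pos + 2:pos + 2 + n])
        pos += 2 + n
    pps_count = record[pos]
    pos += 1
    pps_list = []
    for _ in range(pps_count):
        (n,) = struct.unpack_from(">H", record, pos)
        pps_list.append(record[pos + 2:pos + 2 + n])
        pos += 2 + n
    return {
        "version": version,
        "profile": profile,
        "level": level,
        "length_size": (b4 & 0x03) + 1,  # lengthSizeMinusOne + 1
        "sps": sps_list,
        "pps": pps_list,
    }

# Synthetic record: profile 66, level 30, 4-byte NAL lengths, one SPS, one PPS.
demo_record = (bytes([1, 66, 0, 30, 0xFF, 0xE1])
               + struct.pack(">H", 4) + b"\x67\x42\x00\x1e"
               + bytes([1])
               + struct.pack(">H", 2) + b"\x68\xce")
cfg = parse_avcc(demo_record)
print(cfg["length_size"], len(cfg["sps"]), len(cfg["pps"]))
```

The length_size value is what tells a reader how many bytes precede each NAL unit in the samples of this track.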

Reference: ISO IEC 14496-12, ISO IEC 14496-14, ISO IEC 14496-15


  1. Hi
    where can i get the application which prints the format of .mp4 file

    If its written by you is it possible to share it?


  2. You can get it at

  3. Hey Really Superb Spec. Thanks for sharing this. I hope you do not mind if I follow your blog, adding your rss to my client. :D
    Peter, see also mp4 convert to mov mac

  4. MP4 is the most common use video format for nowadays.
    and I always need to convert VOB to MP4 Mac for share.
    Nice Post!!

  5. I am capturing an H.264 elementary stream over RTP and storing it in MP4 format. I know all the data can be stored in a single chunk and in a single sample. Is this correct?
    On what basis should I choose the number of chunks and samples per chunk?

  6. Yes. You can store a single sample per chunk.

    As for the number of chunks and samples per chunk, my own preference, for simplicity, is to work per GOV. RTP arrives as I-P-P-P-P-P and needs to be decoded into NAL units before being stored in MP4 format, so it is more convenient to decode all the frames of a GOV, then treat each frame as a sample and the whole GOV as a chunk.

  7. I deleted a few comments, which I will try to contact you about by email.

