This is not meant to be a detailed explanation of every atom. For details, please read the ISO/IEC 14496-12 document.
In general, the MP4 file format has the following structure:
- File type box (ftyp), which identifies the MP4 media type
- Media data box (mdat), which contains the actual AV frames
  - Within an mdat, the data is organized into chunks and samples
- Movie box (moov), which is the container for all metadata
  - Each moov has an mvhd (movie header box)
  - It can contain N trak boxes. Each trak box contains media-specific metadata. Usually there are 2 tracks (video and audio)
  - More importantly, each trak contains sample information such as stsd, stts, stsz, stsc, stco, etc.
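The layout above can be explored by walking the box headers. The following is a minimal sketch, not a full parser: it assumes the standard box header (a 4-byte big-endian size followed by a 4-byte type, with size 1 meaning a 64-bit size follows and size 0 meaning the box runs to the end). The function name `iter_boxes` and the tiny synthetic file are made up for illustration.

```python
import io
import struct

def iter_boxes(stream, end):
    """Yield (type, payload_offset, payload_size) for each box before `end`."""
    while stream.tell() + 8 <= end:
        start = stream.tell()
        size, box_type = struct.unpack(">I4s", stream.read(8))
        header = 8
        if size == 1:      # a 64-bit "largesize" follows the type field
            size = struct.unpack(">Q", stream.read(8))[0]
            header = 16
        elif size == 0:    # box extends to the end of the enclosing range
            size = end - start
        if size < header:
            break          # malformed box; stop rather than loop forever
        yield box_type.decode("latin-1"), start + header, size - header
        stream.seek(start + size)

# Synthetic input (an assumption for the demo): an ftyp box followed by an empty mdat box.
ftyp = struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 0)
mdat = struct.pack(">I4s", 8, b"mdat")
data = ftyp + mdat

boxes = list(iter_boxes(io.BytesIO(data), len(data)))
print(boxes)  # [('ftyp', 8, 8), ('mdat', 24, 0)]
```

To descend into a container such as moov or trak, the same function could be called again after seeking to the payload offset, with `end` set to payload offset plus payload size.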
The mdat is the media data atom, which contains the video and audio frames. As you can see from the screenshot, it is separated into 2 tracks (video and audio). Each track has multiple chunks, and each chunk has multiple samples. Usually, you can treat each sample as an AV frame.
The number of samples in each chunk is defined in the stsc atom (sample to chunk box), and the offset of each chunk is defined in the stco atom (chunk offset box).
For MPEG4 (see MPEG4 sample), the red box denotes the start code of the MPEG4 elementary stream. ISO 14496-14 states that MPEG4 media data is stored as access units, with a range of contiguous bytes for each access unit (a single access unit is the definition of a ‘sample’ for an MPEG-4 media stream). See 3.1.1 of that document.
For H.264 (see H.264 sample), the red box denotes the frame size (4 bytes). The blue box is the start of the frame, which in this case is an H.264 non-IDR frame. ISO 14496-15 states that each NAL unit in an H.264 sample must be preceded by a length field. See 5.2.3 of that document.
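As a rough illustration of the H.264 layout, the sketch below splits one sample into NAL units using the length prefix and reads the NAL unit type from the low 5 bits of the first byte (1 is a non-IDR slice, 5 is an IDR slice). The 4-byte length size matches the screenshot description; other files may use a different size. The helper names and the sample bytes are invented for the example.

```python
def split_nal_units(sample, length_size=4):
    """Split a length-prefixed H.264 sample into its NAL units."""
    nals, pos = [], 0
    while pos + length_size <= len(sample):
        nal_len = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        nal = sample[pos:pos + nal_len]
        if len(nal) < nal_len:
            raise ValueError("truncated NAL unit")
        nals.append(nal)
        pos += nal_len
    return nals

def nal_type(nal):
    """Return nal_unit_type, the low 5 bits of the first NAL byte."""
    return nal[0] & 0x1F

# Made-up sample: a 3-byte non-IDR slice NAL followed by a 2-byte IDR slice NAL.
sample = (
    (3).to_bytes(4, "big") + bytes([0x41, 0x9A, 0x00])
    + (2).to_bytes(4, "big") + bytes([0x65, 0x88])
)

nals = split_nal_units(sample)
types = [nal_type(n) for n in nals]
print(types)  # [1, 5]
```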
STSC - Sample To Chunk Box
The stsc tells you the number of samples in each chunk. To read it, you need to read the first chunk and samples per chunk fields together. In the screenshot, first chunk has the values 1, 3, 5, 6, ... and samples per chunk has the values 4, 5, 4, 5, ... This means the following:
chunks 1 - 2 have 4 samples each
chunks 3 - 4 have 5 samples each
chunk 5 has 4 samples
and so on...
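The run-length reading of stsc described above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, the sample description index that each real stsc entry also carries is left out, and the total chunk count of 7 is an assumed value (in a real file it would come from the number of entries in stco).

```python
def samples_per_chunk(stsc_entries, chunk_count):
    """Expand (first_chunk, samples_per_chunk) runs into one count per chunk.

    Chunk numbers are 1-based and entries are in ascending order.
    """
    counts = []
    for i, (first_chunk, n_samples) in enumerate(stsc_entries):
        # A run ends just before the next entry's first_chunk,
        # or at the last chunk of the track for the final entry.
        if i + 1 < len(stsc_entries):
            next_first = stsc_entries[i + 1][0]
        else:
            next_first = chunk_count + 1
        counts.extend([n_samples] * (next_first - first_chunk))
    return counts

# Values from the screenshot; chunk_count=7 is assumed for the demo.
entries = [(1, 4), (3, 5), (5, 4), (6, 5)]
counts = samples_per_chunk(entries, 7)
print(counts)  # [4, 4, 5, 5, 4, 5, 5]
```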
STCO - Chunk Offset Box
This box tells you the location of each chunk. The offset is measured from the start of the file. In the screenshot, it has the values 1516, 4880, ...
As this is a video track, that means the first video chunk starts at byte offset 1516 in the file.
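For reference, here is a small sketch of decoding an stco payload (the bytes after the 8-byte box header). It assumes the usual layout: 4 bytes of version and flags, a 4-byte entry count, then one 32-bit big-endian offset per chunk. Files that need 64-bit offsets use the co64 box instead, which this sketch does not handle. The function name and the two-entry payload are made up, reusing the offsets from the screenshot.

```python
import struct

def parse_stco(payload):
    """Return the list of chunk offsets stored in an stco payload."""
    _version_flags, entry_count = struct.unpack_from(">II", payload, 0)
    return list(struct.unpack_from(f">{entry_count}I", payload, 8))

# Synthetic payload: version/flags = 0, two entries (1516 and 4880).
payload = struct.pack(">II2I", 0, 2, 1516, 4880)
chunk_offsets = parse_stco(payload)
print(chunk_offsets)  # [1516, 4880]
```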
STSZ - Sample Size Box
This box tells you the size of each sample in the track. It also tells you the total number of samples in the track.
If you look at the entry sizes, they are 2229, 529, ...
That means the first sample is 2229 bytes and the second sample is 529 bytes.
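Putting stsc, stco and stsz together gives the file offset of every sample: samples within a chunk are stored back to back, so each sample starts where the previous one ends. The sketch below assumes that reading. The function name is hypothetical, and only the first two sizes (2229 and 529) and the chunk offset 1516 come from the screenshot; the last two sizes are placeholders.

```python
def sample_offsets(chunk_offsets, chunk_sample_counts, sample_sizes):
    """Compute the absolute file offset of each sample in a track."""
    offsets = []
    sizes = iter(sample_sizes)
    for chunk_offset, n_samples in zip(chunk_offsets, chunk_sample_counts):
        pos = chunk_offset
        for _ in range(n_samples):
            offsets.append(pos)
            pos += next(sizes)   # the next sample starts right after this one
    return offsets

# First video chunk only: offset 1516, 4 samples.
# Sizes 200 and 150 are placeholder values, not taken from the screenshot.
offsets = sample_offsets([1516], [4], [2229, 529, 200, 150])
print(offsets)  # [1516, 3745, 4274, 4474]
```

So the second sample of the first chunk would be found at 1516 + 2229 = 3745.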
STSD - Sample Description Box
This box tells you the codec type, along with the initialization data and any other information required to decode the track.
As you can see in the screenshot, it contains the AVC configuration box. This holds the information (SPS, PPS, etc.) required to decode this video track.
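As a sketch of what the AVC configuration box carries, the code below reads the leading fields of the configuration record: version, profile, compatibility, level, the NAL length size, then the SPS and PPS lists, each entry stored with a 2-byte length. It assumes the common layout without the optional high-profile extension fields. The function name is hypothetical and the record bytes are synthetic; the 2-byte "SPS" and "PPS" payloads are placeholders, not decodable parameter sets.

```python
import struct

def parse_avcc(record):
    """Parse the decoder configuration record carried in the avcC box."""
    _version, profile, _compat, level, len_byte, sps_byte = struct.unpack_from(">6B", record, 0)

    def read_sets(count, pos):
        # Each parameter set is a 2-byte big-endian length plus the NAL bytes.
        sets = []
        for _ in range(count):
            (size,) = struct.unpack_from(">H", record, pos)
            sets.append(record[pos + 2:pos + 2 + size])
            pos += 2 + size
        return sets, pos

    sps, pos = read_sets(sps_byte & 0x1F, 6)     # low 5 bits: number of SPS
    pps, pos = read_sets(record[pos], pos + 1)   # next byte: number of PPS
    return {
        "profile": profile,
        "level": level,
        "nal_length_size": (len_byte & 0x03) + 1,  # low 2 bits hold size minus one
        "sps": sps,
        "pps": pps,
    }

# Synthetic record: Baseline profile (66), level 3.0 (30), 4-byte NAL lengths,
# one placeholder SPS and one placeholder PPS.
record = (
    bytes([1, 0x42, 0x00, 0x1E, 0xFF, 0xE1])
    + b"\x00\x02\x67\x42"
    + b"\x01"
    + b"\x00\x02\x68\xce"
)
config = parse_avcc(record)
print(config["nal_length_size"], config["profile"], config["level"])  # 4 66 30
```

The `nal_length_size` value is what determines how many bytes the length field in front of each NAL unit occupies in the mdat samples.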
Reference: ISO/IEC 14496-12, ISO/IEC 14496-14, ISO/IEC 14496-15