No step marker observed and hence the step time is unknown #578

Open
pritamdodeja opened this issue Mar 18, 2023 · 12 comments

Comments

@pritamdodeja

Consider Stack Overflow for getting support using TensorBoard—they have
a larger community with better searchability:

https://stackoverflow.com/questions/tagged/tensorboard

Do not use this template for setup, installation, or configuration
issues. Instead, use the “installation problem” issue template:

https://github.com/tensorflow/tensorboard/issues/new?template=installation_problem.md

To report a problem with TensorBoard itself, please fill out the
remainder of this template.

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same
environment from which you normally run TensorFlow/TensorBoard, and
paste the output here:

https://raw.githubusercontent.com/tensorflow/tensorboard/master/tensorboard/tools/diagnose_tensorboard.py

Diagnostics

Diagnostics output
--- check: autoidentify                                                         
INFO: diagnose_tensorboard.py version 516a2f9433ba4f9c3a4fdb0f89735870eda054a1  
                                                                                
--- check: general                                                              
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
INFO: os.name: posix                                                            
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='71d6fe811d18', release='6.0.5-200.fc36.x86_64', version='#1 SMP PREEMPT_DYNAMIC Wed Oct 26 15:55:21 UTC 2022', machine='x86_64')
INFO: sys.getwindowsversion(): N/A                                              
                                                                                
--- check: package_management                                                   
INFO: has conda-meta: False                                                     
INFO: $VIRTUAL_ENV: None                                                        
                                                                                
--- check: installed_packages                                                   
INFO: installed: tensorboard==2.11.0                                            
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
INFO: installed: tensorflow-estimator==2.11.0                                   
INFO: installed: tensorboard-data-server==0.6.1                                 
                                                                                
--- check: tensorboard_python_version                                           
INFO: tensorboard.version.VERSION: '2.11.0'                                     
                                                                                
--- check: tensorflow_python_version                                            
INFO: tensorflow.__version__: '2.11.0'                                          
INFO: tensorflow.__git_version__: 'v2.11.0-rc2-17-gd5b57ca93e5'                 
                                                                                
--- check: tensorboard_data_server_version                                      
INFO: data server binary: '/usr/local/lib/python3.8/dist-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.6.1'                            
                                                                                
--- check: tensorboard_binary_path                                              
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'                        
                                                                                
--- check: addrinfos                                                            
socket.has_ipv6 = True                                                          
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>                                 
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>                                
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>                          
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>                                 
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>                                 
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>                                     
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]
                                                                                
--- check: readable_fqdn                                                        
INFO: socket.getfqdn(): '71d6fe811d18'                                          
                                                                                
--- check: stat_tensorboardinfo                                                 
INFO: directory: /tmp/.tensorboard-info                                         
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=805882112, st_dev=51, st_nlink=2, st_uid=0, st_gid=0, st_size=6, st_atime=1677293427, st_mtime=1677293598, st_ctime=1677293598)
INFO: mode: 0o40777                                                             
                                                                                
--- check: source_trees_without_genfiles                                        
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.8/dist-packages']; bad_roots (0): []
                                                                                
--- check: full_pip_freeze                                                      
INFO: pip freeze --all:                                                         
absl-py==1.3.0                                                                  
anyio==3.6.2                                                                    
argon2-cffi==21.3.0                                                             
argon2-cffi-bindings==21.2.0                                                    
asttokens==2.1.0                                                                
astunparse==1.6.3                                                               
attrs==22.1.0                                                                   
backcall==0.2.0                                                                 
beautifulsoup4==4.11.1                                                          
bleach==5.0.1                                                                   
cachetools==5.2.0                                                               
certifi==2022.9.24                                                              
cffi==1.15.1                                                                    
charset-normalizer==2.1.1                                                       
contourpy==1.0.6                                                                
cycler==0.11.0                                                                  
debugpy==1.6.3                                                                  
decorator==5.1.1                                                                
defusedxml==0.7.1                                                               
entrypoints==0.4                                                                
executing==1.2.0                                                                
fastjsonschema==2.16.2                                                          
flatbuffers==22.10.26                                                           
fonttools==4.38.0                                                               
gast==0.4.0                                                                     
google-auth==2.14.1                                                             
google-auth-oauthlib==0.4.6                                                     
google-pasta==0.2.0                                                             
grpcio==1.50.0                                                                  
gviz-api==1.10.0                                                                
h5py==3.7.0                                                                     
idna==3.4                                                                       
importlib-metadata==5.0.0                                                       
importlib-resources==5.10.0                                                     
ipykernel==5.1.1                                                                
ipython==8.6.0                                                                  
ipython-genutils==0.2.0                                                         
ipywidgets==8.0.2                                                               
jedi==0.17.2                                                                    
Jinja2==3.1.2                                                                   
jsonschema==4.17.0                                                              
jupyter==1.0.0                                                                  
jupyter-client==7.4.7                                                           
jupyter-console==6.4.4                                                          
jupyter-core==5.0.0                                                             
jupyter-http-over-ws==0.0.8                                                     
jupyter-server==1.23.2                                                          
jupyterlab-pygments==0.2.2                                                      
jupyterlab-widgets==3.0.3                                                       
keras==2.11.0                                                                   
kiwisolver==1.4.4                                                               
libclang==14.0.6                                                                
Markdown==3.4.1                                                                 
MarkupSafe==2.1.1                                                               
matplotlib==3.6.2                                                               
matplotlib-inline==0.1.6                                                        
mistune==2.0.4                                                                  
nbclassic==0.4.8                                                                
nbclient==0.7.0                                                                 
nbconvert==7.2.5                                                                
nbformat==4.4.0                                                                 
nest-asyncio==1.5.6                                                             
notebook==6.5.2                                                                 
notebook-shim==0.2.2                                                            
numpy==1.23.4                                                                   
oauthlib==3.2.2                                                                 
opt-einsum==3.3.0                                                               
packaging==21.3                                                                 
pandocfilters==1.5.0                                                            
parso==0.7.1                                                                    
pexpect==4.8.0                                                                  
pickleshare==0.7.5                                                              
Pillow==9.3.0                                                                   
pip==20.2.4                                                                     
pkgutil-resolve-name==1.3.10                                                    
platformdirs==2.5.4                                                             
prometheus-client==0.15.0                                                       
prompt-toolkit==3.0.32                                                          
protobuf==3.19.6                                                                
psutil==5.9.4                                                                   
ptyprocess==0.7.0                                                               
pure-eval==0.2.2                                                                
pyasn1==0.4.8                                                                   
pyasn1-modules==0.2.8                                                           
pycparser==2.21                                                                 
Pygments==2.13.0                                                                
pyparsing==3.0.9                                                                
pyrsistent==0.19.2                                                              
python-dateutil==2.8.2                                                          
pyzmq==24.0.1                                                                   
qtconsole==5.4.0                                                                
QtPy==2.3.0                                                                     
requests==2.28.1                                                                
requests-oauthlib==1.3.1                                                        
rsa==4.9                                                                        
Send2Trash==1.8.0                                                               
setuptools==65.5.1                                                              
six==1.16.0                                                                     
sniffio==1.3.0                                                                  
soupsieve==2.3.2.post1                                                          
stack-data==0.6.1                                                               
tensorboard==2.11.0                                                             
tensorboard-data-server==0.6.1                                                  
tensorboard-plugin-profile==2.11.1                                              
tensorboard-plugin-wit==1.8.1                                                   
tensorflow-cpu==2.11.0                                                          
tensorflow-estimator==2.11.0                                                    
tensorflow-io-gcs-filesystem==0.27.0                                            
termcolor==2.1.0                                                                
terminado==0.17.0                                                               
tinycss2==1.2.1                                                                 
tornado==6.2                                                                    
traitlets==5.5.0                                                                
typing-extensions==4.4.0                                                        
urllib3==1.26.12                                                                
wcwidth==0.2.5                                                                  
webencodings==0.5.1                                                             
websocket-client==1.4.2                                                         
Werkzeug==2.2.2                                                                 
wheel==0.34.2                                                                   
widgetsnbextension==4.0.3                                                       
wrapt==1.14.1                                                                   
zipp==3.10.0                                                                    
                                                                                

Next steps

No action items identified. Please copy ALL of the above output,
including the lines containing only backticks, into your GitHub issue
or comment. Be sure to redact any sensitive information.
For browser-related issues, please additionally specify:

  • Browser type and version (e.g., Chrome 64.0.3282.140):
  • Screenshot, if it’s a visual issue:

image

Issue description

Running a very standard example of the TensorBoard callback (code below) produces the "No step marker observed" issue.

import tensorflow as tf
import datetime
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def create_model():
  return tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28), name='layers_flatten'),
    tf.keras.layers.Dense(512, activation='relu', name='layers_dense'),
    tf.keras.layers.Dropout(0.2, name='layers_dropout'),
    tf.keras.layers.Dense(10, activation='softmax', name='layers_dense_2')
  ])

model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch=(1,50))

model.fit(x=x_train, 
          y=y_train, 
          epochs=5, 
          validation_data=(x_test, y_test), 
          callbacks=[tensorboard_callback])



Please describe the bug as clearly as possible. How can we reproduce the
problem without additional resources (including external data files and
proprietary Python modules)?

Step markers are either not getting logged by Keras or are not being read by TensorBoard. I would expect this information to be logged so that I can use the profiler for optimizing tf.data usage. The environment is a standard TensorFlow Docker container; the only additional package installed is tensorboard_plugin_profile.

@foxik has suggested this is a protobuf version issue and that upgrading to 3.20.3 fixed a similar issue for him. It didn't fix it for me; I am attaching the logs from both versions, before and after the upgrade. I originally opened the issue at tensorflow/tensorboard#6210, and @bmd3k asked me to recreate it here with all the information consolidated.

logs.oldprotobuf 2.zip

logs.protobuf.3.20.3.zip
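
Since the discussion turns on which protobuf, TensorFlow, and profiler plugin versions are installed together, a small check like the following may help when comparing environments. It is only a sketch: the package list is taken from the names mentioned in this issue, and `installed_versions` is a hypothetical helper, not part of TensorBoard.

```python
from importlib import metadata

# Distribution names mentioned in this thread; adjust for your environment.
PACKAGES = [
    "tensorflow",
    "tensorflow-cpu",
    "tensorboard",
    "tensorboard-plugin-profile",
    "protobuf",
]


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this before and after a protobuf change gives a compact record to paste alongside the attached logs.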

@foxik

foxik commented Mar 18, 2023

Hi,

I retried my experiment and I actually did a slightly different thing: I installed tensorflow==2.12.0rc0 (which brought tensorboard==2.12.0), then tensorboard-plugin-profile==2.11.1, and finally downgraded to protobuf==3.20.3. This allows me to open profile runs created by both TF 2.11 and TF 2.12.0rc0.

@JustASquid

I'm running into the same issue. The workaround suggested by @foxik didn't work for me either.
Are there any suggestions for other workarounds? Profiling is currently not possible for our model, and trying to figure out which custom layer is causing the issue is not feasible.

@pritamdodeja
Author

@JustASquid do you have the flexibility to run on a slightly older version of tf*? I was able to get this to work by doing that. I can share my config with you later today in case that's a viable option.

@JustASquid

@pritamdodeja I did a run in version 2.10, I no longer get this warning, but the training step markers are wrong:

image

This run was with profile_batch="10,20" with a 30-batch epoch.

Could this be related to the same issue?

@pritamdodeja
Author

pritamdodeja commented Apr 1, 2023

> @pritamdodeja I did a run in version 2.10, I no longer get this warning, but the training step markers are wrong:
>
> image
>
> This run was with profile_batch="10,20" with a 30-batch epoch.
>
> Could this be related to the same issue?

@JustASquid the original symptom I faced was that the profiler wasn't available, with the step-marker message shown in the screenshot above. Do you see that the profiler is available to you in TensorBoard? Try running the reproducible example I posted as a snippet above and see what results you get.

@JustASquid

@pritamdodeja to clarify, the warning doesn't show up anymore when downgrading from Tensorflow 2.11 to Tensorflow 2.10 for the training run.

But the issue now is that the step numbers are all wrong, as you can see from the x-axis, which shows incorrect step numbers, and from the very strange "spiking". Could this be related to #266, perhaps?

@pritamdodeja
Author

@JustASquid It looks like the same issue to me. I don't know enough about protocol buffers yet to debug it effectively, though. If/when that changes, I will post back here with an update.

@pritamdodeja
Author

@JustASquid I just tested this issue on the following configuration and it's still broken. Things are actually worse now, as you cannot go back to an older tf version because of the cudnn dependency :( The profiler no longer shows up. If I get the time, I'm going to do a deep dive on the TensorBoard profiler and protocol buffers. I'm using the latest protobuf but setting

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

$ pip freeze | grep tensor
tensorboard==2.14.0
tensorboard-data-server==0.7.1
tensorboard-plugin-profile==2.13.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.13.0
tensorflow-data-validation==1.13.0
tensorflow-estimator==2.13.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.33.0
tensorflow-metadata==1.13.1
tensorflow-model-analysis==0.44.0
tensorflow-serving-api==2.12.2
tensorflow-transform==1.13.0
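
For runs launched from a Python script rather than a shell, the same environment variable can be set in-process, as long as that happens before protobuf is first imported. The sketch below assumes that ordering; `protobuf_backend` is a hypothetical helper, and it relies on protobuf's internal `api_implementation` module, which is not a stable public API.

```python
import os

# Must run before google.protobuf (or tensorflow/tensorboard) is imported,
# because the backend is chosen once at import time.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"


def protobuf_backend():
    """Return the active protobuf backend name, or None if protobuf is missing."""
    try:
        # Internal module: handy for a sanity check, but may change between releases.
        from google.protobuf.internal import api_implementation
    except ImportError:
        return None
    return api_implementation.Type()


if __name__ == "__main__":
    print("protobuf backend:", protobuf_backend())
```

If the printed backend is not `python`, protobuf was most likely imported earlier in the process than the assignment above.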

@pritamdodeja
Author

I was able to understand why this is happening. The profiler is writing its data to a different place in the log directory hierarchy. Once that is corrected, and the profile duration is long enough, the step marker issue goes away for me. I will provide details in the next day or so.

@pritamdodeja
Author

@JustASquid @foxik Here is my understanding of the possible cause of this:

Let's say you usually run tensorboard --logdir model_run to start tensorboard

tensorboard expects plugins/profile to exist in model_run/<run number>/<train|validation>

Starting with tensorflow 2.12 (possibly earlier), plugins/profile instead appears at model_run/<run number>

This causes tensorboard to miss the profile data, so the profiler is not activated in the UI. Once you manually rectify this by copying the data using

cp -Rpv ../plugins .

in model_run/<run number>/<train|validation>

and refresh tensorboard, it should start seeing the profiler.
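
The manual copy described above could be scripted roughly as follows. This is a minimal sketch that assumes the layout described in this comment (a `plugins` directory directly under the run directory, with `train` and/or `validation` subdirectories next to it); `copy_profile_into_subdirs` is a hypothetical helper name, not part of TensorFlow or TensorBoard.

```python
import shutil
from pathlib import Path


def copy_profile_into_subdirs(run_dir, subdirs=("train", "validation")):
    """Copy <run_dir>/plugins into each existing <run_dir>/<subdir>/plugins.

    Returns the list of destination directories that were written.
    """
    run_dir = Path(run_dir)
    source = run_dir / "plugins"
    if not source.is_dir():
        return []  # nothing to copy for this run

    copied = []
    for name in subdirs:
        target_parent = run_dir / name
        if not target_parent.is_dir():
            continue  # only touch subdirectories that already exist
        target = target_parent / "plugins"
        # dirs_exist_ok (Python 3.8+) merges into an existing plugins directory.
        shutil.copytree(source, target, dirs_exist_ok=True)
        copied.append(target)
    return copied
```

For example, `copy_profile_into_subdirs("model_run/<run number>")` with the real run directory substituted in, followed by a TensorBoard refresh. The original `plugins` directory is left in place, mirroring the `cp` command.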

If I had to guess what introduced the change/error, I would say it's somewhere in the vicinity of

tensorflow/tensorflow/core/profiler/convert/xplane_to_tools_data.cc

More specifically, in the tensorflow repo, I suspect the following might be helpful to figure out what exactly broke this

git diff 7a500e 4d4873 tensorflow/tensorflow/core/profiler/convert/xplane_to_tools_data.h

My use case is in the context of a tfx pipeline, but I believe this applies to other setups where profiling is happening. Your log_dir and hierarchy may differ, but the relative placement problem should be the same.
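
Because the hierarchy can differ between setups, it may help to first list where profile data actually landed under a given log directory. The following sketch walks the tree and reports every `plugins/profile` directory relative to the root; `find_profile_dirs` is a hypothetical helper, and the only assumption is the `plugins/profile` directory name used in this thread.

```python
import os
from pathlib import Path


def find_profile_dirs(logdir):
    """Return every plugins/profile directory under logdir, relative to logdir."""
    logdir = Path(logdir)
    found = []
    for root, dirs, _files in os.walk(logdir):
        root_path = Path(root)
        # A profile directory counts only when its parent is named "plugins".
        if root_path.name == "plugins" and "profile" in dirs:
            found.append((root_path / "profile").relative_to(logdir))
    return sorted(found)


if __name__ == "__main__":
    for path in find_profile_dirs("model_run"):
        print(path)
```

Comparing the printed paths against the `<run number>/<train|validation>` location described above shows whether a given run is affected.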

@Gaura

Gaura commented Nov 3, 2023

Hello,

Thanks for raising and discussing the issue. I am facing the same issue. Could you tell me if this is resolved?

Thanks.

@stellarpower

In my case I am able to obtain stats for example code similar to what @pritamdodeja has provided above, but not when I change to my own loss function that I am trying to debug (and this runs okay). I get the impression the core profiler is not outputting those markers, as they don't appear to be in the protobuf file, so have opened here.


5 participants